1. 14 Mar, 2022 6 commits
  2. 02 Mar, 2022 1 commit
    • btrfs: do not WARN_ON() if we have PageError set · a50e1fcb
      Josef Bacik committed
      Whenever we do any extent buffer operations we call
      assert_eb_page_uptodate() to complain loudly if we're operating on a
      non-uptodate page.  Our overnight tests caught this warning earlier this
      week:
      
        WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50
        CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G        W         5.17.0-rc3+ #564
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Workqueue: btrfs-cache btrfs_work_helper
        RIP: 0010:assert_eb_page_uptodate+0x3f/0x50
        RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246
        RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000
        RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0
        RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000
        R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1
        R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000
        FS:  0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0
        Call Trace:
      
         extent_buffer_test_bit+0x3f/0x70
         free_space_test_bit+0xa6/0xc0
         load_free_space_tree+0x1f6/0x470
         caching_thread+0x454/0x630
         ? rcu_read_lock_sched_held+0x12/0x60
         ? rcu_read_lock_sched_held+0x12/0x60
         ? rcu_read_lock_sched_held+0x12/0x60
         ? lock_release+0x1f0/0x2d0
         btrfs_work_helper+0xf2/0x3e0
         ? lock_release+0x1f0/0x2d0
         ? finish_task_switch.isra.0+0xf9/0x3a0
         process_one_work+0x26d/0x580
         ? process_one_work+0x580/0x580
         worker_thread+0x55/0x3b0
         ? process_one_work+0x580/0x580
         kthread+0xf0/0x120
         ? kthread_complete_and_exit+0x20/0x20
         ret_from_fork+0x1f/0x30
      
      This was partially fixed by c2e39305 ("btrfs: clear extent buffer
      uptodate when we fail to write it"), however all that fix did was keep
      us from finding extent buffers after a failed writeout.  It didn't keep
      us from continuing to use a buffer that we already had found.
      
      In this case we're searching the commit root to cache the block group,
      so we can start committing the transaction and switch the commit root
      and then start writing.  After the switch we can look up an extent
      buffer that hasn't been written yet and start processing that block
      group.  Then we fail to write that block out and clear Uptodate on the
      page, and then we start spewing these errors.
      
      Normally we're protected by the tree lock to a certain degree here.  If
      we read a block we have that block read locked, and we block the writer
      from locking the block before we submit it for the write.  However this
      isn't necessarily foolproof because the read could happen before we do
      the submit_bio and after we locked and unlocked the extent buffer.
      
      Also in this particular case we have path->skip_locking set, so that
      won't save us here.  We'll simply get a block that was valid when we
      read it, but became invalid while we were using it.
      
      What we really want is to catch the case where we've "read" a block but
      it's not marked Uptodate.  On read we ClearPageError(), so if we're
      !Uptodate and !Error we know we didn't do the right thing for reading
      the page.
      
      Fix this by checking !Uptodate && !Error.  This way we will not complain
      if our buffer gets invalidated while we're using it, and we'll maintain
      the spirit of the check, which is to make sure we have a fully in-cache
      block while we're messing with it.
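
The condition described above can be sketched as a small userspace model. This is only an illustration of the logic, not the kernel code: `struct page_state` and `should_warn()` are hypothetical names standing in for the real page flags and for the check inside assert_eb_page_uptodate().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two page flags the message talks about. */
struct page_state {
	bool uptodate;	/* models PageUptodate() */
	bool error;	/* models PageError() */
};

/*
 * Returns true when the sanity check should complain: the page is not
 * uptodate and no IO error explains why, so it was never read properly.
 * A page that lost Uptodate because of a failed write has Error set and
 * therefore stays quiet.
 */
static bool should_warn(const struct page_state *p)
{
	return !p->uptodate && !p->error;
}
```

Under this model a page invalidated by a failed write (`uptodate == false`, `error == true`) no longer triggers the warning, while a page that was never read at all still does.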
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 22 Jan, 2022 1 commit
  4. 07 Jan, 2022 5 commits
    • btrfs: fix argument list that the kdoc format and script verified · be8d1a2a
      Yang Li committed
      The warnings were found by running scripts/kernel-doc, which is
      invoked when building with 'make W=1'.
      
      fs/btrfs/extent_io.c:3210: warning: Function parameter or member
      'bio_ctrl' not described in 'btrfs_bio_add_page'
      fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'bio'
      description in 'btrfs_bio_add_page'
      fs/btrfs/extent_io.c:3210: warning: Excess function parameter
      'prev_bio_flags' description in 'btrfs_bio_add_page'
      fs/btrfs/space-info.c:1602: warning: Excess function parameter 'root'
      description in 'btrfs_reserve_metadata_bytes'
      fs/btrfs/space-info.c:1602: warning: Function parameter or member
      'fs_info' not described in 'btrfs_reserve_metadata_bytes'
      
      Note: this is fixing only the warnings regarding parameter list, the
      first line is not strictly conforming to the kdoc format as the btrfs
      codebase does not stick to that and keeps the first line more free form
      (because it's only for internal use).
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove reada infrastructure · f26c9238
      Qu Wenruo committed
      Currently there is only one user for btrfs metadata readahead, and
      that's scrub.
      
      But even for that single user, it does not provide the functionality
      needed, as scrub needs reada for the commit root, which the current
      readahead can't provide (although it would be pretty easy to add such a
      feature).

      Besides this, there are further problems with metadata readahead:
      
      - Duplicated feature with btrfs_path::reada
      
      - Partly duplicated feature of btrfs_fs_info::buffer_radix
        Btrfs already caches its metadata in buffer_radix, while readahead
        tries to read the tree block no matter if it's already cached.
      
      - Poor layer separation
        Metadata readahead works more or less at the device level.
        This is not the correct layer: metadata lives in the btrfs logical
        address space, so readahead should not deal with devices at all.

        This gives bugs extra chances to sneak in and adds unnecessary
        complexity.
      
      - Dead code
        In the very beginning of scrub.c we have #undef DEBUG, rendering all
        the debug related code useless and untestable.

      Thus here I propose to remove the metadata readahead mechanism
      completely.
      
      [BENCHMARK]
      There is a full benchmark for the scrub performance difference using the
      old btrfs_reada_add() and btrfs_path::reada.
      
      For the worst case (no dirty metadata, slow HDD), there could be a 5%
      performance drop for scrub.
      For other cases (even SATA SSD), there is no distinguishable performance
      difference.
      
      The number is reported scrub speed, in MiB/s.
      The resolution is limited by the reported duration, which only has a
      resolution of 1 second.
      
              Old (MiB/s)   New (MiB/s)   Diff
      SSD     455.3         466.332       +2.42%
      HDD     103.927       98.012        -5.69%
      
      Comprehensive test methodology is in the cover letter of the patch.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: drop redundant check for REQ_OP_ZONE_APPEND and btrfs_is_zoned · 73672710
      Johannes Thumshirn committed
      REQ_OP_ZONE_APPEND can only work on zoned devices, so it is redundant to
      check if the filesystem is zoned when REQ_OP_ZONE_APPEND is set as the
      bio's bio_op.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: sink zone check into btrfs_repair_one_zone · 554aed7d
      Johannes Thumshirn committed
      Sink zone check into btrfs_repair_one_zone() so we don't need to do it
      in all callers.
      
      Also as btrfs_repair_one_zone() doesn't return a sensible error, make it
      a boolean function and return false in case it got called on a non-zoned
      filesystem and true on a zoned filesystem.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: encapsulate inode locking for zoned relocation · 869f4cdc
      Johannes Thumshirn committed
      Encapsulate the inode lock needed for serializing the data relocation
      writes on a zoned filesystem into a helper.
      
      This streamlines the code reading flow and hides special casing for
      zoned filesystems.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 03 Jan, 2022 1 commit
    • btrfs: remove unnecessary @nr_written parameters · 83f1b680
      Qu Wenruo committed
      We use @nr_written to record how many pages have been started by
      btrfs_run_delalloc_range().
      
      Currently there are only two cases that would populate @nr_written:
      
      - Inline extent creation
      - Compressed write
      
      But both cases will also set @page_started to one.
      
      In fact, in writepage_delalloc() we have the following code, showing
      that @nr_written is really only utilized for the above two cases:
      
      	/* did the fill delalloc function already unlock and start
      	 * the IO?
      	 */
      	if (page_started) {
      		/*
      		 * we've unlocked the page, so we can't update
      		 * the mapping's writeback index, just update
      		 * nr_to_write.
      		 */
      		wbc->nr_to_write -= nr_written;
      		return 1;
      	}
      
      But for such cases, writepage_delalloc() will return 1, and exit
      __extent_writepage() without going through __extent_writepage_io().
      
      This means that inside __extent_writepage_io() we always get
      @nr_written as 0.

      So this patch removes the unnecessary parameter from the following
      functions:
      
      - writepage_delalloc()
      
        As @nr_written passed in is always the initial value 0.
      
        Although inside that function, we still need a local @nr_written
        to update wbc->nr_to_write.
      
      - __extent_writepage_io()
      
        As explained above, @nr_written passed in can only be 0.
      
        This also means we can remove one update_nr_written() call.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 16 Dec, 2021 1 commit
    • btrfs: check WRITE_ERR when trying to read an extent buffer · 651740a5
      Josef Bacik committed
      Filipe reported a hang when we have errors on btrfs.  This turned out to
      be a side-effect of my fix c2e39305 ("btrfs: clear extent buffer
      uptodate when we fail to write it") which made it so we clear
      EXTENT_BUFFER_UPTODATE on an eb when we fail to write it out.
      
      Below is Filipe's analysis, which he got from using drgn to debug
      the hang:
      
      """
      btree readahead code calls read_extent_buffer_pages(), sets ->io_pages to
      a value while writeback of all pages has not yet completed:
         --> writeback for the first 3 pages finishes, we clear
             EXTENT_BUFFER_UPTODATE from eb on the first page when we get an
             error.
         --> at this point eb->io_pages is 1 and we cleared Uptodate bit from the
             first 3 pages
         --> read_extent_buffer_pages() does not see EXTENT_BUFFER_UPTODATE() so
             it continues, it's able to lock the pages since we obviously don't
             hold the pages locked during writeback
         --> read_extent_buffer_pages() then computes 'num_reads' as 3, and sets
             eb->io_pages to 3, since only the first page does not have Uptodate
             bit set at this point
         --> writeback for the remaining page completes, we ended decrementing
             eb->io_pages by 1, resulting in eb->io_pages == 2, and therefore
             never calling end_extent_buffer_writeback(), so
             EXTENT_BUFFER_WRITEBACK remains in the eb's flags
         --> of course, when the read bio completes, it doesn't and shouldn't
             call end_extent_buffer_writeback()
         --> we should clear EXTENT_BUFFER_UPTODATE only after all pages of
             the eb finished writeback?  or maybe make the read pages code
             wait for writeback of all pages of the eb to complete before
             checking which pages need to be read, touch ->io_pages, submit
             read bio, etc
      
      writeback bit never cleared means we can hang when aborting a
      transaction, at:
      
          btrfs_cleanup_one_transaction()
             btrfs_destroy_marked_extents()
               wait_on_extent_buffer_writeback()
      """
      
      This is a problem because our writes are not synchronized with reads in
      any way.  We clear the UPTODATE flag and then we can easily come in and
      try to read the EB while we're still waiting on other bios to
      complete.

      We have two options here.  We could lock all the pages, and then check
      to see if eb->io_pages != 0 to know if we've already got an outstanding
      write on the eb.
      
      Or we can simply check to see if we have WRITE_ERR set on this extent
      buffer.  We set this bit _before_ we clear UPTODATE, so if the read gets
      triggered because we aren't UPTODATE because of a write error we're
      guaranteed to have WRITE_ERR set, and in this case we can simply return
      -EIO.  This will fix the reported hang.
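
The second option can be sketched in a single-threaded userspace model. Everything here is hypothetical naming (`struct eb_model`, `model_write_failed()`, `model_read()`); it only illustrates why setting WRITE_ERR before clearing UPTODATE lets the read path bail out, and it does not model the real bit operations or their memory ordering.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the two extent buffer state bits involved. */
struct eb_model {
	bool uptodate;	/* models EXTENT_BUFFER_UPTODATE */
	bool write_err;	/* models EXTENT_BUFFER_WRITE_ERR */
};

/* Write failure path: WRITE_ERR is set first, then UPTODATE is cleared. */
static void model_write_failed(struct eb_model *eb)
{
	eb->write_err = true;
	eb->uptodate = false;
}

/*
 * Read path: returns 0 if the buffer is already cached, -EIO if it lost
 * UPTODATE because of a failed write, and 1 if a read has to be submitted.
 */
static int model_read(const struct eb_model *eb)
{
	if (eb->uptodate)
		return 0;
	if (eb->write_err)
		return -EIO;
	return 1;
}
```

In this model a reader that observes a missing UPTODATE bit after a failed write always also observes WRITE_ERR, so it never touches the io_pages accounting of the in-flight write.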
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: c2e39305 ("btrfs: clear extent buffer uptodate when we fail to write it")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 08 Dec, 2021 2 commits
    • btrfs: call mapping_set_error() on btree inode with a write error · 68b85589
      Josef Bacik committed
      generic/484 fails sometimes with compression on because the write ends
      up small enough that it goes into the btree.  This means that we never
      call mapping_set_error() on the inode itself, because the page gets
      marked as fine when we inline it into the metadata.  When the metadata
      writeback happens we see it and abort the transaction properly and mark
      the fs as readonly, however we don't do the mapping_set_error() on
      anything.  In syncfs() we will simply return 0 if the sb is marked
      read-only, so we can't check for this in our syncfs callback.  The only
      way the error gets returned is if we called mapping_set_error() on
      something.  Fix this by calling mapping_set_error() on the btree inode
      mapping.  This allows us to properly return an error on syncfs and pass
      generic/484 with compression on.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: clear extent buffer uptodate when we fail to write it · c2e39305
      Josef Bacik committed
      I got dmesg errors on generic/281 on our overnight fstests.  Looking at
      the history this happens occasionally, with errors like this:
      
        WARNING: CPU: 0 PID: 673217 at fs/btrfs/extent_io.c:6848 assert_eb_page_uptodate+0x3f/0x50
        CPU: 0 PID: 673217 Comm: kworker/u4:13 Tainted: G        W         5.16.0-rc2+ #469
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Workqueue: btrfs-cache btrfs_work_helper
        RIP: 0010:assert_eb_page_uptodate+0x3f/0x50
        RSP: 0018:ffffae598230bc60 EFLAGS: 00010246
        RAX: 0017ffffc0002112 RBX: ffffebaec4100900 RCX: 0000000000001000
        RDX: ffffebaec45733c7 RSI: ffffebaec4100900 RDI: ffff9fd98919f340
        RBP: 0000000000000d56 R08: ffff9fd98e300000 R09: 0000000000000000
        R10: 0001207370a91c50 R11: 0000000000000000 R12: 00000000000007b0
        R13: ffff9fd98919f340 R14: 0000000001500000 R15: 0000000001cb0000
        FS:  0000000000000000(0000) GS:ffff9fd9fbc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f549fcf8940 CR3: 0000000114908004 CR4: 0000000000370ef0
        Call Trace:
      
         extent_buffer_test_bit+0x3f/0x70
         free_space_test_bit+0xa6/0xc0
         load_free_space_tree+0x1d6/0x430
         caching_thread+0x454/0x630
         ? rcu_read_lock_sched_held+0x12/0x60
         ? rcu_read_lock_sched_held+0x12/0x60
         ? rcu_read_lock_sched_held+0x12/0x60
         ? lock_release+0x1f0/0x2d0
         btrfs_work_helper+0xf2/0x3e0
         ? lock_release+0x1f0/0x2d0
         ? finish_task_switch.isra.0+0xf9/0x3a0
         process_one_work+0x270/0x5a0
         worker_thread+0x55/0x3c0
         ? process_one_work+0x5a0/0x5a0
         kthread+0x174/0x1a0
         ? set_kthread_struct+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      This happens because we're trying to read from an extent buffer page that
      is !PageUptodate.  This happens because we will clear the page uptodate
      when we have an IO error, but we don't clear the extent buffer uptodate.
      If we do a read later and find this extent buffer we'll think it's valid
      and not return an error, and then trip over this warning.
      
      Fix this by also clearing uptodate on the extent buffer when this
      happens, so that we get an error when we do a btrfs_search_slot() and
      find this block later.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 27 Oct, 2021 16 commits
    • btrfs: remove btrfs_bio::logical member · f4f39fc5
      Qu Wenruo committed
      The member btrfs_bio::logical is only initialized by two call sites:
      
      - btrfs_repair_one_sector()
        No corresponding site to utilize it.
      
      - btrfs_submit_direct()
        The corresponding site to utilize it is btrfs_check_read_dio_bio().
      
      However for btrfs_check_read_dio_bio(), we can grab the file_offset from
      btrfs_dio_private::file_offset directly.
      
      Thus it turns out we don't really need that btrfs_bio::logical member at
      all.
      
      For btrfs_bio, the logical bytenr can be fetched from its
      bio->bi_iter.bi_sector directly.
      
      So let's just remove the member to save 8 bytes for structure btrfs_bio.
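
As a rough illustration of the conversion mentioned above: block layer sectors are 512 bytes, so the byte address follows from a left shift by 9. The helper name `sector_to_logical()` is hypothetical and only mirrors the arithmetic, not any btrfs function.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9	/* block layer sectors are 512 bytes */

/* Convert a bi_sector value into a byte address (the logical bytenr). */
static uint64_t sector_to_logical(uint64_t bi_sector)
{
	return bi_sector << SECTOR_SHIFT;
}
```

For example, sector 8 corresponds to byte offset 4096.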
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add a BTRFS_FS_ERROR helper · 84961539
      Josef Bacik committed
      We have a few flags that are inconsistently used to describe the fs in
      different states of failure.  As of 5963ffca ("btrfs: always abort
      the transaction if we abort a trans handle") we will always set
      BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED
      and ERROR to see if things have gone wrong.  Add a helper to check
      BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to
      use the helper.
      
      The TRANS_ABORTED bit check was added in af722733 ("Btrfs: clean up
      resources during umount after trans is aborted") but is not actually
      specific.
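
A helper of this kind can be sketched as a single bit test on a state word. The names below (`struct fs_info_model`, `MODEL_FS_STATE_ERROR`, `model_fs_error()`, `model_abort()`) are hypothetical and the bit number is an assumption; the real helper operates on the fs_info state with the kernel's bit operations.

```c
#include <stdbool.h>

/* Hypothetical bit number for the "filesystem hit an error" state. */
enum { MODEL_FS_STATE_ERROR = 0 };

/* Hypothetical stand-in for the filesystem-wide state word. */
struct fs_info_model {
	unsigned long fs_state;
};

/* One place to ask "has this filesystem seen a fatal error?". */
static bool model_fs_error(const struct fs_info_model *fs_info)
{
	return (fs_info->fs_state >> MODEL_FS_STATE_ERROR) & 1UL;
}

/* Aborting always sets the error bit, so callers only check one flag. */
static void model_abort(struct fs_info_model *fs_info)
{
	fs_info->fs_state |= 1UL << MODEL_FS_STATE_ERROR;
}
```

The point of the design is that callers no longer need to test two different flags to find out whether things have gone wrong.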
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: avoid potential deadlock with compression and delalloc · 2749f7ef
      Qu Wenruo committed
      [BUG]
      With experimental subpage compression enabled, a simple fsstress can
      lead to self deadlock on page 720896:
      
              mkfs.btrfs -f -s 4k $dev > /dev/null
              mount $dev -o compress $mnt
              $fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156
      
      [CAUSE]
      If we have a file layout like the one below:
      
      	0	32K	64K	96K	128K
      	|//|		|///////////////|
      	   4K
      
      Then we run delalloc range for the inode, it will:
      
      - Call find_lock_delalloc_range() with @delalloc_start = 0
        Then we got a delalloc range [0, 4K).
      
        This range will be COWed.
      
      - Call find_lock_delalloc_range() again with @delalloc_start = 4K
        Since find_lock_delalloc_range() never cares whether the range
        is still inside page range [0, 64K), it will return range [64K, 128K).
      
        This range meets the condition for subpage compression, will go
        through async COW path.
      
        And async COW path will return @page_started.
      
        But that @page_started is now for range [64K, 128K), not for range
        [0, 64K).
      
      - writepage_delalloc() returned 1 for page [0, 64K)
        Thus page [0, 64K) will not be unlocked, nor will its dirty status
        be cleared.
      
      Next time when we try to lock page [0, 64K) we will deadlock, as there
      is no one to release page [0, 64K).
      
      This problem will never happen for regular page size as one page only
      contains one sector.  After the first find_lock_delalloc_range() call,
      the @delalloc_end will go beyond @page_end no matter if we found a
      delalloc range or not.
      
      Thus this bug only happens for subpage, as now we need multiple runs to
      exhaust the delalloc range of a page.
      
      [FIX]
      Fix the problem by ensuring the delalloc range we run at least starts
      inside @locked_page, so that we will never get an incorrect
      @page_started.

      And to prevent such problems from happening again:
      
      - Make find_lock_delalloc_range() return false if the found range is
        beyond @end value passed in.
      
        Since @end will be utilized now, add an ASSERT() to ensure we pass
        correct @end into find_lock_delalloc_range().
      
        This also means, for selftests we need to populate @end before calling
        find_lock_delalloc_range().
      
      - New ASSERT() in find_lock_delalloc_range()
        Now we will make sure the @start/@end passed in at least covers part
        of the page.
      
      - New ASSERT() in run_delalloc_range()
        To make sure the range at least starts inside @locked_page.
      
      - Use @delalloc_start as proper cursor, while @delalloc_end is always
        reset to @page_end.
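
The core condition of the fix can be sketched as a range check. `range_starts_in_page()` is a hypothetical helper used only for illustration; the byte offsets follow the 64K page example from the [CAUSE] section.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * A found delalloc range may only be run on behalf of a page if the
 * range starts inside that page: [page_start, page_start + page_size).
 */
static bool range_starts_in_page(uint64_t found_start, uint64_t page_start,
				 uint64_t page_size)
{
	return found_start >= page_start &&
	       found_start < page_start + page_size;
}
```

With a 64K page at offset 0, the range [0, 4K) passes the check, while the range [64K, 128K) returned by the second lookup does not, so its @page_started result is never attributed to page [0, 64K).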
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rework page locking in __extent_writepage() · e55a0de1
      Qu Wenruo committed
      Pages passed to __extent_writepage() are always locked, but they may be
      locked by different functions.
      
      There are two types of locked page for __extent_writepage():
      
      - Page locked by plain lock_page()
        It should not have any subpage::writers count.
        Can be unlocked by unlock_page().
        This is the most common locked page for __extent_writepage() called
        inside extent_write_cache_pages() or extent_write_full_page().
        Rarer cases include the @locked_page from extent_write_locked_range().
      
      - Page locked by lock_delalloc_pages()
        There is only one caller, all pages except @locked_page for
        extent_write_locked_range().
        In this case, we have to call subpage helper to handle the case.
      
      So here we introduce a helper, btrfs_page_unlock_writer(), to allow
      __extent_writepage() to unlock different locked pages.
      
      And since for all other callers of __extent_writepage() their pages are
      ensured to be locked by lock_page(), also add an extra check for
      epd::extent_locked to unlock such pages directly.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: make extent_write_locked_range() compatible · 66448b9d
      Qu Wenruo committed
      There are two sites in extent_write_locked_range() that are not
      subpage compatible yet:
      
      - How @nr_pages are calculated
        For subpage we can have the following range with 64K page size:
      
        0   32K  64K   96K 128K
        |   |////|/////|   |
      
        In that case 96K - 32K == 64K, so it looks like one page is enough,
        but the range spans two pages, not one.
      
        Fix it by doing proper round_up() and round_down() to calculate
        @nr_pages.
      
        Also add some extra ASSERT()s to ensure the range passed in is already
        aligned.
      
      - How the page end is calculated
        Currently we just use cur + PAGE_SIZE - 1 to calculate the page end.

        This can't handle the above range layout, and will trigger the ASSERT()
        in btrfs_writepage_endio_finish_ordered(), as the range is no longer
        covered by the page range.
      
        Fix it by taking page end into consideration.
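
The page count arithmetic from the first item can be worked through in a small sketch. The helpers below are hypothetical userspace stand-ins for the kernel's round_up()/round_down(), and `naive_pages()` is only an illustrative length-based count, not the code that was replaced.

```c
#include <stdint.h>

/* Userspace stand-ins for round_down()/round_up() on byte offsets. */
static uint64_t round_down_u64(uint64_t x, uint64_t align)
{
	return x - (x % align);
}

static uint64_t round_up_u64(uint64_t x, uint64_t align)
{
	return round_down_u64(x + align - 1, align);
}

/* Pages touched by [start, end): align both ends to page boundaries. */
static uint64_t pages_spanned(uint64_t start, uint64_t end, uint64_t page_size)
{
	return (round_up_u64(end, page_size) -
		round_down_u64(start, page_size)) / page_size;
}

/* Length-based count that ignores where the range starts in a page. */
static uint64_t naive_pages(uint64_t start, uint64_t end, uint64_t page_size)
{
	return (end - start + page_size - 1) / page_size;
}
```

For the range [32K, 96K) with 64K pages, the length-based count gives 1 while the aligned count gives 2, matching the diagram above.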
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: cleanup for extent_write_locked_range() · 2bd0fc93
      Qu Wenruo committed
      There are several cleanups for extent_write_locked_range(), most of them
      are pure cleanups, but with some preparation for future subpage support.
      
      - Add a proper comment for which call sites are suitable
        Unlike regular synchronized extent write back, if async COW or zoned
        COW happens, we have all pages in the range still locked.
      
        Thus for those (only) two call sites, we need this function to submit
        page content into bios and submit them.
      
      - Remove @mode parameter
        Both existing call sites pass WB_SYNC_ALL, so there is no need for
        the @mode parameter.
      
      - Better error handling
        Currently if we hit an error during the page iteration loop, we
        overwrite @ret, so only the last error is recorded.

        Here we add the @found_error and @first_error variables to record
        whether we hit any error, and the first error we hit.
        So the first error won't get lost.
      
      - Don't reuse @start as the cursor
        We reuse the parameter @start as the cursor to iterate the range, not
        a big problem, but since we're here, introduce a proper @cur as the
        cursor.
      
      - Remove impossible branch
        Since all pages are still locked after the ordered extent is inserted,
        there is no way that a page can get its dirty bit cleared.
        Remove the branch where page is not dirty and replace it with an
        ASSERT().
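
The "first error wins" pattern from the error handling item can be sketched as follows. The function `first_error_of()` and its array input are hypothetical; they stand in for the per-page results seen by the iteration loop.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Walk all per-page results and keep going after a failure, but
 * remember only the first negative value instead of overwriting it.
 */
static int first_error_of(const int *results, size_t n)
{
	bool found_error = false;
	int first_error = 0;

	for (size_t i = 0; i < n; i++) {
		int ret = results[i];

		if (ret < 0 && !found_error) {
			found_error = true;
			first_error = ret;
		}
	}
	return found_error ? first_error : 0;
}
```

With results {0, -5, 0, -28} the function returns -5, whereas a loop that simply overwrites @ret would end up with the last iteration's value.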
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: make add_ra_bio_pages() compatible · 6a404910
      Qu Wenruo committed
      [BUG]
      If we remove the subpage limitation in add_ra_bio_pages(), then read a
      compressed extent which has part of its range in next page, like the
      following inode layout:
      
      	0	32K	64K	96K	128K
      	|<--------------|-------------->|
      
      Btrfs will trigger ASSERT() in endio function:
      
        assertion failed: atomic_read(&subpage->readers) >= nbits
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3431!
        Internal error: Oops - BUG: 0 [#1] SMP
        Workqueue: btrfs-endio btrfs_work_helper [btrfs]
        Call trace:
         assertfail.constprop.0+0x28/0x2c [btrfs]
         btrfs_subpage_end_reader+0x148/0x14c [btrfs]
         end_page_read+0x8c/0x100 [btrfs]
         end_bio_extent_readpage+0x320/0x6b0 [btrfs]
         bio_endio+0x15c/0x1dc
         end_workqueue_fn+0x44/0x64 [btrfs]
         btrfs_work_helper+0x74/0x250 [btrfs]
         process_one_work+0x1d4/0x47c
         worker_thread+0x180/0x400
         kthread+0x11c/0x120
         ret_from_fork+0x10/0x30
        ---[ end trace c8b7b552d3bb408c ]---
      
      [CAUSE]
      When we read the page range [0, 64K), we find it's a compressed extent,
      and we will try to add extra pages in add_ra_bio_pages() to avoid
      reading the same compressed extent.
      
      But when we add such page into the read bio, it doesn't follow the
      behavior of btrfs_do_readpage() to properly set subpage::readers.
      
      This means, for page [64K, 128K), its subpage::readers is still 0.
      
      And when endio is executed on both pages, since page [64K, 128K) has 0
      subpage::readers, it triggers the above ASSERT().
      
      [FIX]
      Function add_ra_bio_pages() is far from subpage compatible, it always
      assumes PAGE_SIZE == sectorsize, thus when it skips to the next range it
      always just skips PAGE_SIZE.
      
      Make it subpage compatible by:
      
      - Skip to next page properly when needed
        If we find there is already a page cache, we need to skip to next page.
        For that case, we shouldn't just skip PAGE_SIZE bytes, but use
        @pg_index to calculate the next bytenr and continue.
      
      - Only add the page range covered by current extent map
        We need to calculate which range is covered by current extent map and
        only add that part into the read bio.
      
      - Update subpage::readers before submitting the bio
      
      - Use a proper cursor rather than the confusing @last_offset
      
      - Calculate the missed threshold based on sector size
        It's no longer using missed pages, as for 64K page size, we have at
        most 3 pages to skip. (If aligned only 2 pages)
      
      - Add ASSERT() to make sure our bytenr is always aligned
      
      - Add comment for the function
        Add a special note for subpage case, as the function won't really
        work well for subpage cases.
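
The first item, skipping to the next page via @pg_index, comes down to simple arithmetic. `next_page_start()` is a hypothetical helper that only shows the calculation; a page shift of 16 corresponds to the 64K page size used in the examples.

```c
#include <stdint.h>

/*
 * Given a sector-aligned cursor that may sit in the middle of a page,
 * return the start of the following page instead of cur + PAGE_SIZE.
 */
static uint64_t next_page_start(uint64_t cur, unsigned int page_shift)
{
	uint64_t pg_index = cur >> page_shift;

	return (pg_index + 1) << page_shift;
}
```

For a cursor at 36K inside the first 64K page, this returns 64K, whereas adding PAGE_SIZE would land at 100K, in the middle of the second page.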
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unnecessary parameter delalloc_start for writepage_delalloc() · cf3075fb
      Qu Wenruo committed
      In function __extent_writepage() we always pass page start to
      @delalloc_start for writepage_delalloc().
      
      Thus we don't really need @delalloc_start parameter as we can extract it
      from @page.
      
      Remove the @delalloc_start parameter and make __extent_writepage()
      declare @page_start and @page_end as const.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cf3075fb
    • Q
      btrfs: rename struct btrfs_io_bio to btrfs_bio · c3a3b19b
      Qu Wenruo committed
      Previously we had "struct btrfs_bio", which records IO context for
      mirrored IO and RAID56, and "struct btrfs_io_bio", which records extra
      btrfs specific info for logical bytenr bio.
      
      With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
      "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.
      
      The struct btrfs_bio changes meaning by this commit. There was a
      suggested name like btrfs_logical_bio but it's a bit long and we'd
      prefer to use a shorter name.
      
      This could be a concern for backports to older kernels where the
      different meaning could possibly cause confusion or bugs. Comparing the
      new and old structures, there's no overlap among the struct members so a
      build would break in case of incorrect backport.
      
      We haven't had many backports to bio code anyway so this is more of a
      theoretical cause of bugs and a matter of precaution but we'll need to
      keep the semantic change in mind.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c3a3b19b
    • Q
      btrfs: remove btrfs_bio_alloc() helper · cd8e0cca
      Qu Wenruo committed
      The helper btrfs_bio_alloc() is almost the same as btrfs_io_bio_alloc(),
      except it's allocating using BIO_MAX_VECS as @nr_iovecs, and initializes
      bio->bi_iter.bi_sector.
      
      However the naming itself is not using "btrfs_io_bio" to indicate its
      parameter is "struct btrfs_io_bio" and can be easily confused with
      "struct btrfs_bio".
      
      Considering that assigning bio->bi_iter.bi_sector is such simple work
      and there are already tons of call sites doing that manually, there is
      no need to do that in a helper.
      
      Remove btrfs_bio_alloc() helper, and enhance btrfs_io_bio_alloc()
      function to provide a fail-safe value for its @nr_iovecs.
      
      And then replace all btrfs_bio_alloc() callers with
      btrfs_io_bio_alloc().
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cd8e0cca
    • Q
      btrfs: rename btrfs_bio to btrfs_io_context · 4c664611
      Qu Wenruo committed
      The structure btrfs_bio is used by two different sites:
      
      - bio->bi_private for mirror based profiles
        For those profiles (SINGLE/DUP/RAID1*/RAID10), this structure records
        how many mirrors are still pending, and saves the original endio
        function of the bio.
      
      - RAID56 code
        In that case, RAID56 only utilizes the stripes info, and no longer
        uses that to trace the pending mirrors.
      
      So btrfs_bio is not always bound to a bio, and contains more info for IO
      context, thus renaming it will make the naming less confusing.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4c664611
    • J
      btrfs: zoned: only allow one process to add pages to a relocation inode · 35156d85
      Johannes Thumshirn committed
      Don't allow more than one process to add pages to a relocation inode on
      a zoned filesystem, otherwise we cannot guarantee the sequential write
      rule once we're filling preallocated extents on a zoned filesystem.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      35156d85
    • Q
      btrfs: unexport repair_io_failure() · 38d5e541
      Qu Wenruo committed
      Function repair_io_failure() is no longer used outside extent_io.c since
      commit 8b9b6f25 ("btrfs: scrub: cleanup the remaining nodatasum
      fixup code"), which removes the last external caller.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      38d5e541
    • A
      btrfs: convert latest_bdev type to btrfs_device and rename · d24fa5c1
      Anand Jain committed
      In preparation to fix a bug in btrfs_show_devname().
      
      Convert fs_devices::latest_bdev type from struct block_device to struct
      btrfs_device and rename the member to fs_devices::latest_dev, so that
      btrfs_show_devname() can use fs_devices::latest_dev::name.
      Tested-by: Su Yue <l@damenly.su>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d24fa5c1
    • N
      btrfs: zoned: finish fully written block group · be1a1d7a
      Naohiro Aota committed
      If we have written to the zone capacity, the device automatically
      deactivates the zone. Sync up block group side (the active BG list and
      zone_is_active flag) with it.
      
      We need to do it both on data BGs and metadata BGs. On data side, we add a
      hook to btrfs_finish_ordered_io(). On metadata side, we use
      end_extent_buffer_writeback().
      
      To reduce excess lookup of a block group, we mark the last extent buffer in
      a block group with EXTENT_BUFFER_ZONE_FINISH flag. This cannot be done for
      data (ordered_extent), because the address may change due to
      REQ_OP_ZONE_APPEND.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      be1a1d7a
    • Q
      btrfs: subpage: pack all subpage bitmaps into a larger bitmap · 72a69cd0
      Qu Wenruo committed
      Currently we use a u16 bitmap to make 4K sectorsize work for 64K page
      size.
      
      But this u16 bitmap is not large enough to contain larger page sizes
      like 128K, nor is it space efficient for 16K page size.
      
      To handle both cases, here we pack all subpage bitmaps into a larger
      bitmap, now btrfs_subpage::bitmaps[] will be the ultimate bitmap for
      subpage usage.
      
      Each sub-bitmap will have its start bit number recorded in
      btrfs_subpage_info::*_start, and its bitmap length will be recorded in
      btrfs_subpage_info::bitmap_nr_bits.
      
      All subpage bitmap operations will be converted from using direct u16
      operations to bitmap operations, with above *_start calculated.
      
      For 64K page size with 4K sectorsize, this should not cause much
      difference.
      
      While for 16K page size, we will only need 1 unsigned long (u32) to
      store all the bitmaps, which saves quite some space.
      
      Furthermore, this allows us to support larger page sizes like 128K and
      256K.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      72a69cd0
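The packing scheme described in the commit above can be sketched in userspace C as follows. This is only an illustration under stated assumptions: the `sketch_` names, the set of five bitmap kinds, and the plain `unsigned long` array are hypothetical stand-ins and are not taken from the kernel source. The idea shown is that each sub-bitmap has one bit per sector, and its start bit is its index multiplied by the number of sectors per page.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Illustrative bitmap kinds; the real set is defined by the kernel. */
enum sketch_kind {
	SKETCH_UPTODATE,
	SKETCH_ERROR,
	SKETCH_DIRTY,
	SKETCH_WRITEBACK,
	SKETCH_ORDERED,
	SKETCH_NR_KINDS
};

#define SKETCH_BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

struct sketch_subpage_info {
	unsigned int bitmap_nr_bits;         /* sectors per page */
	unsigned int start[SKETCH_NR_KINDS]; /* first bit of each sub-bitmap */
	unsigned int total_nr_bits;          /* size of the packed bitmap */
};

/* Lay the sub-bitmaps out back to back inside one large bitmap. */
static inline void sketch_init_info(struct sketch_subpage_info *info,
				    unsigned int page_size,
				    unsigned int sectorsize)
{
	unsigned int cur = 0;

	info->bitmap_nr_bits = page_size / sectorsize;
	for (int i = 0; i < SKETCH_NR_KINDS; i++) {
		info->start[i] = cur;
		cur += info->bitmap_nr_bits;
	}
	info->total_nr_bits = cur;
}

/* Number of unsigned longs needed to hold the packed bitmap. */
static inline unsigned int sketch_longs_needed(const struct sketch_subpage_info *info)
{
	return (info->total_nr_bits + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;
}

static inline void sketch_set_bit(unsigned long *bitmaps,
				  const struct sketch_subpage_info *info,
				  enum sketch_kind kind, unsigned int sector)
{
	unsigned int bit = info->start[kind] + sector;

	assert(sector < info->bitmap_nr_bits);
	bitmaps[bit / SKETCH_BITS_PER_LONG] |= 1UL << (bit % SKETCH_BITS_PER_LONG);
}

static inline bool sketch_test_bit(const unsigned long *bitmaps,
				   const struct sketch_subpage_info *info,
				   enum sketch_kind kind, unsigned int sector)
{
	unsigned int bit = info->start[kind] + sector;

	assert(sector < info->bitmap_nr_bits);
	return (bitmaps[bit / SKETCH_BITS_PER_LONG] >> (bit % SKETCH_BITS_PER_LONG)) & 1UL;
}
```

Under these assumptions, a 16K page with 4K sectors needs 4 bits per sub-bitmap and 20 bits in total, which fits in a single unsigned long, while a 64K page needs 16 bits per sub-bitmap, the same as the former u16 layout.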
  9. 26 Oct, 2021 2 commits
  10. 23 Aug, 2021 5 commits
    • N
      btrfs: zoned: fix ordered extent boundary calculation · 939c7feb
      Naohiro Aota committed
      btrfs_lookup_ordered_extent() is supposed to query the offset in a file
      instead of the logical address. Pass the file offset from
      submit_extent_page() to calc_bio_boundaries().
      
      Also, calc_bio_boundaries() relies on the bio's operation flag, so move
      the call site after setting it.
      
      Fixes: 390ed29b ("btrfs: refactor submit_extent_page() to make bio and its flag tracing easier")
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      939c7feb
    • B
      btrfs: initial fsverity support · 14605409
      Boris Burkov committed
      Add support for fsverity in btrfs. To support the generic interface in
      fs/verity, we add two new item types in the fs tree for inodes with
      verity enabled. One stores the per-file verity descriptor and btrfs
      verity item and the other stores the Merkle tree data itself.
      
      Verity checking is done in end_page_read just before a page is marked
      uptodate. This naturally handles a variety of edge cases like holes,
      preallocated extents, and inline extents. Some care needs to be taken to
      not try to verify pages past the end of the file, which are accessed by
      the generic buffered file reading code under some circumstances like
      reading to the end of the last page and trying to read again. Direct IO
      on a verity file falls back to buffered reads.
      
      Verity relies on PageChecked for the Merkle tree data itself to avoid
      re-walking up shared paths in the tree. For this reason, we need to
      cache the Merkle tree data. Since the file is immutable after verity is
      turned on, we can cache it at an index past EOF.
      
      Use the new inode ro_flags to store verity on the inode item, so that we
      can enable verity on a file, then rollback to an older kernel and still
      mount the file system and read the file. Since we can't safely write the
      file anymore without ruining the invariants of the Merkle tree, we mark
      a ro_compat flag on the file system when a file has verity enabled.
      Acked-by: Eric Biggers <ebiggers@google.com>
      Co-developed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      14605409
    • Q
      btrfs: remove the dead comment in writepage_delalloc() · 7361b4ae
      Qu Wenruo committed
      When btrfs_run_delalloc_range() fails, we will error out.
      
      But there is a strange comment mentioning that
      btrfs_run_delalloc_range() could have returned value >0 to indicate the
      IO has already started.
      
      Commit 40f76580 ("Btrfs: split up __extent_writepage to lower stack
      usage") introduced the comment, but unfortunately at that time, we were
      already using @page_started to indicate that case, and still return 0.
      
      Furthermore, even if that comment was right (which it is not), we would
      return -EIO if the IO had already started.
      
      By all means the comment is incorrect, just remove the comment along
      with the dead check.
      
      Just to be extra safe, add an ASSERT() in btrfs_run_delalloc_range() to
      make sure we either return 0 or error, no positive return value.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7361b4ae
    • C
      btrfs: fix argument type of btrfs_bio_clone_partial() · 21dda654
      Chaitanya Kulkarni committed
      The offset and size can never be negative, so use unsigned int instead
      of int type for them.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      21dda654
    • Q
      btrfs: unify regular and subpage error paths in __extent_writepage() · 963e4db8
      Qu Wenruo committed
      [BUG]
      When running btrfs/160 in a loop for subpage with experimental
      compression support, it has a high chance to crash (~20%):
      
       BTRFS critical (device dm-7): panic in __btrfs_add_ordered_extent:238: inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/ordered-data.c:238!
       Internal error: Oops - BUG: 0 [#1] SMP
       pc : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
       lr : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
       Call trace:
        __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
        btrfs_add_ordered_extent+0x2c/0x50 [btrfs]
        run_delalloc_nocow+0x81c/0x8fc [btrfs]
        btrfs_run_delalloc_range+0xa4/0x390 [btrfs]
        writepage_delalloc+0xc0/0x1ac [btrfs]
        __extent_writepage+0xf4/0x370 [btrfs]
        extent_write_cache_pages+0x288/0x4f4 [btrfs]
        extent_writepages+0x58/0xe0 [btrfs]
        btrfs_writepages+0x1c/0x30 [btrfs]
        do_writepages+0x60/0x110
        __filemap_fdatawrite_range+0x108/0x170
        filemap_fdatawrite_range+0x20/0x30
        btrfs_fdatawrite_range+0x34/0x4dc [btrfs]
        __btrfs_write_out_cache+0x34c/0x480 [btrfs]
        btrfs_write_out_cache+0x144/0x220 [btrfs]
        btrfs_start_dirty_block_groups+0x3ac/0x6b0 [btrfs]
        btrfs_commit_transaction+0xd0/0xbb4 [btrfs]
        btrfs_sync_fs+0x64/0x1cc [btrfs]
        sync_fs_one_sb+0x3c/0x50
        iterate_supers+0xcc/0x1d4
        ksys_sync+0x6c/0xd0
        __arm64_sys_sync+0x1c/0x30
        invoke_syscall+0x50/0x120
        el0_svc_common.constprop.0+0x4c/0xd4
        do_el0_svc+0x30/0x9c
        el0_svc+0x2c/0x54
        el0_sync_handler+0x1a8/0x1b0
        el0_sync+0x198/0x1c0
       ---[ end trace 336f67369ae6e0af ]---
      
      [CAUSE]
      For subpage case, we can have multiple sectors inside a page, this makes
      it possible for __extent_writepage() to have part of its page submitted
      before returning.
      
      In btrfs/160, we are using dm-dust to emulate write error, this means
      for certain pages, we could have everything running fine, but at the end
      of __extent_writepage(), one of the submitted bios fails due to dm-dust.
      
      Then the page is marked Error, and we change @ret from 0 to -EIO.
      
      This makes the caller extent_write_cache_pages() error out, without
      submitting the remaining pages.
      
      Furthermore, since we're erroring out for free space cache, it doesn't
      really care about the error and will update the inode and retry the
      writeback.
      
      Then we re-run the delalloc range, and will try to insert the same
      delalloc range while previous delalloc range is still hanging there,
      triggering the above error.
      
      [FIX]
      The proper fix is to handle errors from __extent_writepage() properly,
      by ending the remaining ordered extent.
      
      But that fix needs the following changes:
      
      - Know at exactly which sector the error happened
        Currently __extent_writepage_io() works for the full page, can't
        return at which sector we hit the error.
      
      - Grab the ordered extent covering the failed sector
      
      As a hotfix for subpage case, here we unify the error paths in
      __extent_writepage().
      
      In fact, the "if (PageError(page))" branch never gets executed if @ret is
      still 0 for non-subpage cases.
      
      As for non-subpage case, we never submit current page in
      __extent_writepage(), but only add current page into bio.
      The bio can only get submitted in next page.
      
      Thus we never get PageError() set due to IO failure, thus when we hit
      the branch, @ret is never 0.
      
      By simply removing that @ret assignment, we let subpage case ignore the
      IO failure, thus only error out for fatal errors just like regular
      sectorsize.
      
      So that an IO error won't be treated as a fatal error and won't trigger
      the hanging OE problem.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      963e4db8