1. 03 Jan 2022, 6 commits
    • btrfs: set BTRFS_FS_STATE_NO_CSUMS if we fail to load the csum root · 056c8311
      Josef Bacik authored
      We have a few places where we skip doing csums if we mounted with one of
      the rescue options that ignores bad csum roots.  In the future when
      there are multiple csum roots it'll be costly to check and see if there
      are any missing csum roots, so simply add a flag to indicate the fs
      should skip loading csums in case of errors.
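
      As a minimal sketch of the idea (the flag name is from the subject
      line, the surrounding lookup code is assumed), a csum lookup can then
      bail out early:

        /* Sketch: skip csum lookups when the fs was mounted with a
         * rescue option and the csum root failed to load. */
        if (test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state))
                return 0;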
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: change root to fs_info for btrfs_reserve_metadata_bytes · 9270501c
      Josef Bacik authored
      We used to need the root for btrfs_reserve_metadata_bytes to check the
      orphan cleanup state, but we no longer need that, we simply need the
      fs_info.  Change btrfs_reserve_metadata_bytes() to use the fs_info, and
      change both btrfs_block_rsv_refill() and btrfs_block_rsv_add() to do the
      same as they simply call btrfs_reserve_metadata_bytes() and then
      manipulate the block_rsv that is being used.
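
      A sketch of the prototype change (the parameter list is abbreviated
      and assumed, not the exact upstream signature):

        /* Before: took a root, only to reach root->fs_info. */
        int btrfs_reserve_metadata_bytes(struct btrfs_root *root,
                                         struct btrfs_block_rsv *block_rsv,
                                         u64 orig_bytes,
                                         enum btrfs_reserve_flush_enum flush);

        /* After: the fs_info is all that is needed. */
        int btrfs_reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
                                         struct btrfs_block_rsv *block_rsv,
                                         u64 orig_bytes,
                                         enum btrfs_reserve_flush_enum flush);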
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: get rid of root->orphan_cleanup_state · 54230013
      Josef Bacik authored
      Now that we don't care about the stage of the orphan_cleanup_state,
      simply replace it with a bit on ->state to make sure we don't call the
      orphan cleanup every time we wander into this root.
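
      A sketch of the once-only pattern such a state bit enables (the bit
      name is assumed for illustration):

        /* Sketch: run orphan cleanup at most once per root. */
        if (test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP, &root->state))
                return 0;       /* already done or in progress */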
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make BTRFS_RESERVE_FLUSH_EVICT use the global rsv stealing code · ee6adbfd
      Josef Bacik authored
      I forgot to convert this over when I introduced the global reserve
      stealing code to the space flushing code.  Evict was simply trying to
      make its reservation and then if it failed it would steal from the
      global rsv, which is racy because it's outside of the normal ticketing
      code.
      
      Fix this by setting ticket->steal if we are BTRFS_RESERVE_FLUSH_EVICT,
      and then make the priority flushing path do the steal for us.
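
      A sketch of the idea (the ticket member is as described above, the
      flush enum values are the existing ones):

        /* Sketch: let the normal ticket infrastructure do the steal. */
        ticket.steal = (flush == BTRFS_RESERVE_FLUSH_EVICT ||
                        flush == BTRFS_RESERVE_FLUSH_ALL_STEAL);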
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_file_extent_inline_item_len take a slot · 437bd07e
      Josef Bacik authored
      Instead of getting the btrfs_item for this, simply pass in the slot of
      the item and then use the btrfs_item_size_nr() helper inside of
      btrfs_file_extent_inline_item_len().
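
      A sketch of the resulting helper signature (the exact prototype may
      differ):

        u32 btrfs_file_extent_inline_item_len(const struct extent_buffer *eb,
                                              int slot);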
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range · f0bfa76a
      Filipe Manana authored
      When doing a direct IO write against a file range that either has
      preallocated extents in that range or has regular extents and the file
      has the NOCOW attribute set, the write fails with -ENOSPC when all of
      the following conditions are met:
      
      1) There are no data blocks groups with enough free space matching
         the size of the write;
      
      2) There's not enough unallocated space for allocating a new data block
         group;
      
      3) The extents in the target file range are not shared, neither through
         snapshots nor through reflinks.
      
      This is wrong because a NOCOW write can be done in such case, and in fact
      it's possible to do it using a buffered IO write, since when failing to
      allocate data space, the buffered IO path checks if a NOCOW write is
      possible.
      
      The failure in direct IO write path comes from the fact that early on,
      at btrfs_dio_iomap_begin(), we try to allocate data space for the write
      and if that fails we return the error and stop - we never check if we
      can do NOCOW. But later, at btrfs_get_blocks_direct_write(), we check
      if we can do a NOCOW write into the range, or a subset of the range, and
      then release the previously reserved data space.
      
      Fix this by doing the data reservation only if needed, when we must COW,
      at btrfs_get_blocks_direct_write() instead of doing it at
      btrfs_dio_iomap_begin(). This also simplifies the logic a bit and removes
      the inefficiency of doing unnecessary data reservations.
      
      The following example test script reproduces the problem:
      
        $ cat dio-nocow-enospc.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        # Use a small fixed size (1G) filesystem so that it's quick to fill
        # it up.
        # Make sure the mixed block groups feature is not enabled because we
        # later want to not have more space available for allocating data
        # extents but still have enough metadata space free for the file writes.
        mkfs.btrfs -f -b $((1024 * 1024 * 1024)) -O ^mixed-bg $DEV
        mount $DEV $MNT
      
        # Create our test file with the NOCOW attribute set.
        touch $MNT/foobar
        chattr +C $MNT/foobar
      
        # Now fill in all unallocated space with data for our test file.
        # This will allocate a data block group that will be full and leave
        # no (or a very small amount of) unallocated space in the device, so
        # that it will not be possible to allocate a new block group later.
        echo
        echo "Creating test file with initial data..."
        xfs_io -c "pwrite -S 0xab -b 1M 0 900M" $MNT/foobar
      
        # Now try a direct IO write against file range [0, 10M[.
        # This should succeed since this is a NOCOW file and an extent for the
        # range was previously allocated.
        echo
        echo "Trying direct IO write over allocated space..."
        xfs_io -d -c "pwrite -S 0xcd -b 10M 0 10M" $MNT/foobar
      
        umount $MNT
      
      When running the test:
      
        $ ./dio-nocow-enospc.sh
        (...)
      
        Creating test file with initial data...
        wrote 943718400/943718400 bytes at offset 0
        900 MiB, 900 ops; 0:00:01.43 (625.526 MiB/sec and 625.5265 ops/sec)
      
        Trying direct IO write over allocated space...
        pwrite: No space left on device
      
      A test case for fstests will follow, testing both this direct IO write
      scenario as well as the buffered IO write scenario to make it less likely
      to get future regressions on the buffered IO case.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 29 Oct 2021, 1 commit
  3. 27 Oct 2021, 24 commits
    • Revert "btrfs: compression: drop kmap/kunmap from generic helpers" · 3a60f653
      David Sterba authored
      This reverts commit 4c2bf276.
      
      The kmaps in compression code are still needed and cause crashes on
      32bit machines (ARM, x86). Reproducible e.g. by running fstest btrfs/004
      with LZO or ZSTD compression enabled.
      
      Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove btrfs_bio::logical member · f4f39fc5
      Qu Wenruo authored
      The member btrfs_bio::logical is only initialized by two call sites:
      
      - btrfs_repair_one_sector()
        No corresponding site to utilize it.
      
      - btrfs_submit_direct()
        The corresponding site to utilize it is btrfs_check_read_dio_bio().
      
      However for btrfs_check_read_dio_bio(), we can grab the file_offset from
      btrfs_dio_private::file_offset directly.
      
      Thus it turns out we don't really need that btrfs_bio::logical member at
      all.
      
      For btrfs_bio, the logical bytenr can be fetched from its
      bio->bi_iter.bi_sector directly.
      
      So let's just remove the member to save 8 bytes for structure btrfs_bio.
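
      A sketch of how the logical bytenr can be derived without the extra
      member (SECTOR_SHIFT is the standard block layer constant):

        /* bi_sector is in 512-byte units, so the byte address of the
         * bio start is the logical bytenr. */
        u64 logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;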
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename btrfs_dio_private::logical_offset to file_offset · 47926ab5
      Qu Wenruo authored
      The name "logical_offset" can be confused with the logical bytenr of
      the dio range.

      In fact it's a file offset, and the name "file_offset" is already
      widely used in all other places.
      
      Just do the rename to avoid confusion.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: pull up qgroup checks from delayed-ref core to init time · 681145d4
      Nikolay Borisov authored
      Instead of checking whether qgroup processing for a delayed ref has to
      happen in the delayed-ref core, simply pull the check up to the init
      time of the respective delayed ref structures. This eliminates the
      final use of real_root in the delayed-ref core, paving the way to
      making this member optional.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_ref · f42c5da6
      Nikolay Borisov authored
      In order to make 'real_root' used only in ref-verify it's required to
      have the necessary context to perform the same checks that this member
      is used for. So add 'mod_root' which will contain the root on behalf of
      which a delayed ref was created and a 'skip_group' parameter which
      will contain callsite-specific override of skip_qgroup.
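
      A sketch of what the extended initializer might look like (parameter
      names follow the description above, the exact prototype is assumed):

        void btrfs_init_data_ref(struct btrfs_ref *generic_ref,
                                 u64 ref_root, u64 ino, u64 offset,
                                 u64 mod_root, bool skip_qgroup);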
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add a BTRFS_FS_ERROR helper · 84961539
      Josef Bacik authored
      We have a few flags that are inconsistently used to describe the fs in
      different states of failure.  As of 5963ffca ("btrfs: always abort
      the transaction if we abort a trans handle") we will always set
      BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED
      and ERROR to see if things have gone wrong.  Add a helper to check
      BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to
      use the helper.
      
      The TRANS_ABORTED bit check was added in af722733 ("Btrfs: clean up
      resources during umount after trans is aborted") but is not actually
      specific.
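
      A minimal sketch of such a helper, assuming it is a macro over the
      fs_state bit:

        #define BTRFS_FS_ERROR(fs_info) \
                (unlikely(test_bit(BTRFS_FS_STATE_ERROR, \
                                   &(fs_info)->fs_state)))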
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: change error handling for btrfs_delete_*_in_log · 9a35fc95
      Josef Bacik authored
      Currently we will abort the transaction if we get a random error (like
      -EIO) while trying to remove the directory entries from the root log
      during rename.
      
      However since these are simply log tree related errors, we can mark the
      trans as needing a full commit.  Then if the error was truly
      catastrophic we'll hit it during the normal commit and abort as
      appropriate.
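
      A sketch of the error handling change (the helper is the existing
      log-tree one, the surrounding code is assumed):

        /* Sketch: a log tree error is not fatal here, so request a
         * full transaction commit instead of aborting. */
        if (ret < 0)
                btrfs_set_log_full_commit(trans);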
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: only allow compression if the range is fully page aligned · 0cf9b244
      Qu Wenruo authored
      For compressed write, we use a mechanism called async COW, which unlike
      regular run_delalloc_cow() or cow_file_range() will also unlock the
      first page.
      
      This mechanism allows us to continue handling next ranges, without
      waiting for the time consuming compression.
      
      But this has a problem for subpage case, as we could have the following
      delalloc range for a page:
      
      0		32K		64K
      |	|///////|	|///////|
      		\- A		\- B
      
      In the above case, if we pass both ranges to cow_file_range_async(),
      both range A and range B will try to unlock the full page [0, 64K).
      
      Whichever one finishes later will then try to do other page operations,
      like end_page_writeback(), on an unlocked page, triggering a VM layer
      BUG_ON().
      
      To make subpage compression work at least partially, here we add another
      restriction for it, only allow compression if the delalloc range is
      fully page aligned.
      
      By that, async extent is always ensured to unlock the first page
      exclusively, just like it used to be for regular sectorsize.
      
      In theory, we only need to make sure the delalloc range fully covers its
      first page, but the tail page will be locked anyway, blocking later
      writeback until the compression finishes.
      
      Thus here we choose to make sure the range is fully page aligned before
      doing the compression.
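
      A sketch of that restriction (IS_ALIGNED is the generic kernel
      helper, the surrounding check is assumed):

        /* Sketch: only compress when the delalloc range covers whole
         * pages, so the async path owns its first page exclusively. */
        if (!IS_ALIGNED(start, PAGE_SIZE) || !IS_ALIGNED(end + 1, PAGE_SIZE))
                return false;   /* fall back to regular COW */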
      
      In the future, we could optimize the situation by properly increasing
      subpage::writers number for the locked page, but that also means we need
      to change how we run delalloc range of page.
      (Instead of running each delalloc range we hit, we need to find and lock
      all delalloc ranges covering the page, then run each of them).
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: avoid potential deadlock with compression and delalloc · 2749f7ef
      Qu Wenruo authored
      [BUG]
      With experimental subpage compression enabled, a simple fsstress can
      lead to self deadlock on page 720896:
      
              mkfs.btrfs -f -s 4k $dev > /dev/null
              mount $dev -o compress $mnt
              $fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156
      
      [CAUSE]
      If we have a file layout that looks like below:
      
      	0	32K	64K	96K	128K
      	|//|		|///////////////|
      	   4K
      
      Then we run delalloc range for the inode, it will:
      
      - Call find_lock_delalloc_range() with @delalloc_start = 0
        Then we got a delalloc range [0, 4K).
      
        This range will be COWed.
      
      - Call find_lock_delalloc_range() again with @delalloc_start = 4K
        Since find_lock_delalloc_range() never cares whether the range
        is still inside page range [0, 64K), it will return range [64K, 128K).
      
        This range meets the condition for subpage compression, will go
        through async COW path.
      
        And async COW path will return @page_started.
      
        But that @page_started is now for range [64K, 128K), not for range
        [0, 64K).
      
      - writepage_delalloc() returned 1 for page [0, 64K)
        Thus page [0, 64K) will not be unlocked, nor will its dirty status
        be cleared.
      
      Next time when we try to lock page [0, 64K) we will deadlock, as there
      is no one to release page [0, 64K).
      
      This problem never happens for the regular page size case, as one page
      only contains one sector.  After the first find_lock_delalloc_range()
      call, @delalloc_end will go beyond @page_end whether we found a
      delalloc range or not.
      
      Thus this bug only happens for subpage, as now we need multiple runs to
      exhaust the delalloc range of a page.
      
      [FIX]
      Fix the problem by ensuring the delalloc range we ran at least started
      inside @locked_page.
      
      So that we will never get incorrect @page_started.
      
      And to prevent such problem from happening again:
      
      - Make find_lock_delalloc_range() return false if the found range is
        beyond @end value passed in.
      
        Since @end will be utilized now, add an ASSERT() to ensure we pass
        correct @end into find_lock_delalloc_range().
      
        This also means, for selftests, we need to populate @end before
        calling find_lock_delalloc_range().
      
      - New ASSERT() in find_lock_delalloc_range()
        Now we will make sure the @start/@end passed in at least covers part
        of the page.
      
      - New ASSERT() in run_delalloc_range()
        To make sure the range at least starts inside @locked_page (see the
        sketch after this list).
      
      - Use @delalloc_start as proper cursor, while @delalloc_end is always
        reset to @page_end.
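
      A sketch of the run_delalloc_range() assertion mentioned above (names
      as described in this message):

        /* The range we run must start inside @locked_page, otherwise
         * @page_started would refer to the wrong page. */
        ASSERT(page_offset(locked_page) <= start &&
               start < page_offset(locked_page) + PAGE_SIZE);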
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: factor uncompressed async extent submission code into a new helper · 2b83a0ee
      Qu Wenruo authored
      Introduce a new helper, submit_uncompressed_range(), for async cow cases
      where we fall back to COW.
      
      There are some new updates introduced to the helper:
      
      - Proper locked_page detection
        It's possible that the async_extent range doesn't cover the locked
        page.  In that case we shouldn't unlock the locked page.
      
        In the new helper, we will ensure that we only unlock the locked page
        when:
      
        * The locked page covers part of the async_extent range
        * The locked page is not unlocked by cow_file_range() nor
          extent_write_locked_range()
      
        This also means extra comments are added focusing on the page locking.
      
      - Add an extra comment on a rarely used parameter
        We use @unlock_page = 0 for cow_file_range(), with only two call
        sites doing the same thing, including the new helper.

        It's definitely worth some comments.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: make compress_file_range() compatible · 4c162778
      Qu Wenruo authored
      In function compress_file_range(), when the compression is finished, the
      function just rounds up @total_in to PAGE_SIZE.  This is fine for
      regular sectorsize == PAGE_SIZE case, but not for subpage.
      
      Just change the ALIGN(, PAGE_SIZE) to round_up(, sectorsize) so that
      both regular sectorsize and subpage sectorsize will be happy.
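
      A sketch of the one-line change:

        /* Was: total_in = ALIGN(total_in, PAGE_SIZE); */
        total_in = round_up(total_in, fs_info->sectorsize);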
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: cleanup for extent_write_locked_range() · 2bd0fc93
      Qu Wenruo authored
      There are several cleanups for extent_write_locked_range(), most of them
      are pure cleanups, but with some preparation for future subpage support.
      
      - Add a proper comment for which call sites are suitable
        Unlike regular synchronized extent write back, if async COW or zoned
        COW happens, we have all pages in the range still locked.
      
        Thus for those (only) two call sites, we need this function to submit
        page content into bios and submit them.
      
      - Remove the @mode parameter
        Both existing call sites pass WB_SYNC_ALL, so there is no need for
        a @mode parameter.
      
      - Better error handling
        Currently, if we hit an error during the page iteration loop, we
        overwrite @ret, so that only the last error gets recorded.

        Add @found_error and @first_error variables to record whether we hit
        any error, and the first error we hit, so the first error won't get
        lost (see the sketch after this list).
      
      - Don't reuse @start as the cursor
        We reuse the parameter @start as the cursor to iterate the range, not
        a big problem, but since we're here, introduce a proper @cur as the
        cursor.
      
      - Remove an impossible branch
        Since all pages are still locked after the ordered extent is
        inserted, there is no way pages can get their dirty bits cleared.
        Remove the branch where the page is not dirty and replace it with
        an ASSERT().
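
      A sketch of the first-error pattern referenced in the error handling
      item above:

        if (ret < 0 && !found_error) {
                /* Record only the first error hit in the loop. */
                found_error = true;
                first_error = ret;
        }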
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: refactor submit_compressed_extents() · b4ccace8
      Qu Wenruo authored
      We have a big chunk of code inside a while() loop, with tons of strange
      jumps for error handling.  It's definitely not up to today's coding
      standards.  Move the code into a new function, submit_one_async_extent().
      
      Since we're here, also do the following changes:
      
      - Comment style change
        To follow the current scheme
      
      - Don't fall back to non-compressed write when hitting ENOSPC
        If we hit ENOSPC for a compressed write, how could we reserve more
        space for a non-compressed write?
        Thus we go to the error path directly.
        This removes the retry: label.
      
      - Add more comments for the super long parameter list
        Explain what each parameter is for, so we don't need to check the
        prototype.
      
      - Move the error handling to submit_one_async_extent()
        Thus no strange code like:
      
        out_free:
      	...
      	goto again;
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused function btrfs_bio_fits_in_stripe() · 6aabd858
      Qu Wenruo authored
      As the last caller in compression.c has been removed, we don't need that
      function anymore.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: add bitmap for PageChecked flag · e4f94347
      Qu Wenruo authored
      Although in btrfs we have very limited usage of the PageChecked flag,
      it's still a page flag that is not yet subpage compatible.

      Fix it by introducing btrfs_subpage::checked_offset to do the
      conversion.
      
      For most call sites, especially for free-space cache, COW fixup and
      btrfs_invalidatepage(), they all work in full page mode anyway.
      
      For other call sites, they work as subpage compatible mode.
      
      Some call sites need extra modification:
      
      - btrfs_drop_pages()
        Needs extra parameter to get the real range we need to clear checked
        flag.
      
        Also since btrfs_drop_pages() will accept pages beyond the dirtied
        range, update btrfs_subpage_clamp_range() to handle such case
        by setting @len to 0 if the page is beyond target range.
      
      - btrfs_invalidatepage()
        We need to call subpage helper before calling __btrfs_releasepage(),
        or it will trigger ASSERT() as page->private will be cleared.
      
      - btrfs_verify_data_csum()
        In theory we don't need the io_bio->csum check anymore, but it
        won't hurt.  Just change the comment.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't pass compressed pages to btrfs_writepage_endio_finish_ordered() · 58469174
      Qu Wenruo authored
      Since async_extent holds the compressed page, it would trigger the new
      ASSERT() in btrfs_mark_ordered_io_finished() which checks that the range
      is inside the page.
      
      Now btrfs_writepage_endio_finish_ordered() can accept @page == NULL,
      just pass NULL to btrfs_writepage_endio_finish_ordered().
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use async_chunk::async_cow to replace the confusing pending pointer · 9e895a8f
      Qu Wenruo authored
      For structure async_chunk, we use a very strange member layout to grab
      the structure async_cow which owns this async_chunk.
      
      At initialization, it goes like this:
      
      		async_chunk[i].pending = &ctx->num_chunks;
      
      Then at async_cow_free() we do a super weird freeing:
      
      	/*
      	 * Since the pointer to 'pending' is at the beginning of the array of
      	 * async_chunk's, freeing it ensures the whole array has been freed.
      	 */
      	if (atomic_dec_and_test(async_chunk->pending))
      		kvfree(async_chunk->pending);
      
      This is absolutely an abuse of kvfree().
      
      Replace async_chunk::pending with async_chunk::async_cow, so that we can
      grab the async_cow structure directly, without this strange dancing.
      
      And with this change, there is no requirement for any specific member
      location.
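
      A sketch of the resulting freeing logic (member names as described
      above):

        /* Sketch: grab the owner directly, no layout tricks needed. */
        struct async_cow *async_cow = async_chunk->async_cow;

        if (atomic_dec_and_test(&async_cow->num_chunks))
                kvfree(async_cow);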
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: loop only once over data sizes array when inserting an item batch · b7ef5f3a
      Filipe Manana authored
      When inserting a batch of items into a btree, we end up looping over the
      data sizes array 3 times:
      
      1) Once in the caller of btrfs_insert_empty_items(), when it populates the
         array with the data sizes for each item;
      
      2) Once at btrfs_insert_empty_items() to sum the elements of the data
         sizes array and compute the total data size;
      
      3) And then once again at setup_items_for_insert(), where we do exactly
         the same as what we do at btrfs_insert_empty_items(), to compute the
         total data size.
      
      That is not bad for small arrays, but when the arrays have hundreds of
      elements, the time spent on looping is not negligible. For example when
      doing batch inserts of delayed items for dir index items or when logging
      a directory, it's common to have 200 to 260 dir index items in a single
      batch when using a leaf size of 16K and using file names between 8 and 12
      characters. For a 64K leaf size, multiply that by 4. Taking into account
      that during directory logging or when flushing delayed dir index items we
      can have many of those large batches, the time spent on the looping adds
      up quickly.
      
      It's also more important to avoid it at setup_items_for_insert(), since
      we are holding a write lock on a leaf and, in some cases, on upper nodes
      of the btree, which causes us to block other tasks that want to access
      the leaf and nodes for longer than necessary.
      
      So change the code so that setup_items_for_insert() and
      btrfs_insert_empty_items() no longer compute the total data size, and
      instead rely on the caller to supply it. This makes us loop over the
      array only once, where we can both populate the data size array and
      compute the total data size, taking advantage of spatial and temporal
      locality. To make this more manageable, use a structure to contain
      all the relevant details for a batch of items (keys array, data sizes
      array, total data size, number of items), and use it as an argument
      for btrfs_insert_empty_items() and setup_items_for_insert().
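
      A sketch of such a batch descriptor (field names assumed for
      illustration):

        struct btrfs_item_batch {
                const struct btrfs_key *keys;   /* one key per item */
                const u32 *data_sizes;          /* one size per item */
                u32 total_data_size;            /* sum of data_sizes[] */
                int nr;                         /* number of items */
        };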
      
      This patch is part of a small patchset that is comprised of the following
      patches:
      
        btrfs: loop only once over data sizes array when inserting an item batch
        btrfs: unexport setup_items_for_insert()
        btrfs: use single bulk copy operations when logging directories
      
      This is patch 1/3 and performance results, and the specific tests, are
      included in the changelog of patch 3/3.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename struct btrfs_io_bio to btrfs_bio · c3a3b19b
      Qu Wenruo authored
      Previously we had "struct btrfs_bio", which records IO context for
      mirrored IO and RAID56, and "struct btrfs_io_bio", which records extra
      btrfs specific info for logical bytenr bios.
      
      With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
      "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.
      
      The struct btrfs_bio changes meaning by this commit. There was a
      suggested name like btrfs_logical_bio but it's a bit long and we'd
      prefer to use a shorter name.
      
      This could be a concern for backports to older kernels where the
      different meaning could possibly cause confusion or bugs. Comparing the
      new and old structures, there's no overlap among the struct members so a
      build would break in case of incorrect backport.
      
      We haven't had many backports to bio code anyway so this is more of a
      theoretical cause of bugs and a matter of precaution but we'll need to
      keep the semantic change in mind.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: keep track of the last logged keys when logging a directory · dc287224
      Filipe Manana authored
      After the first time we log a directory in the current transaction, for
      each directory item in a changed leaf of the subvolume tree, we have to
      check if we previously logged the item, in order to overwrite it in case
      its data changed or skip it in case its data hasn't changed.
      
      Checking if we have logged each item before not only wastes time, but
      it also adds lock contention on the log tree. So, in order to minimize
      the number of times we do such checks, keep track of the offset of the
      last key we logged for a directory and, on the next time we log the
      directory, skip the checks for any new keys that have an offset greater
      than the offset we have previously saved. This is especially effective
      for index keys, because the offset for these keys comes from a
      monotonically increasing counter.
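
      A sketch of the skip test (the inode member name is assumed):

        /* Index keys beyond the last logged offset are certainly new,
         * so the log tree lookup for them can be skipped. */
        bool maybe_logged = key.offset <= inode->last_dir_index_offset;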
      
      This patch is part of a patchset comprised of the following 5 patches:
      
        btrfs: remove root argument from btrfs_log_inode() and its callees
        btrfs: remove redundant log root assignment from log_dir_items()
        btrfs: factor out the copying loop of dir items from log_dir_items()
        btrfs: insert items in batches when logging a directory when possible
        btrfs: keep track of the last logged keys when logging a directory
      
      This is patch 5/5.
      
      The following test was used on a non-debug kernel to measure the impact
      it has on a directory fsync:
      
        $ cat test-dir-fsync.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_NEW_FILES=100000
        NUM_FILE_DELETES=1000
      
        mkfs.btrfs -f $DEV
        mount -o ssd $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        # fsync the directory, this will log the new dir items and the inodes
        # they point to, because these are new inodes.
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/testdir
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
      
        # sync to force transaction commit and wipeout the log.
        sync
      
        del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
        for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
            rm -f $MNT/testdir/file_$i
        done
      
        # fsync the directory, this will only log dir items, there are no
        # dentries pointing to new inodes.
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/testdir
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
      
        umount $MNT
      
      Test results with NUM_NEW_FILES set to 100 000 and 1 000 000:
      
      **** before patchset, 100 000 files, 1000 deletes ****
      
      dir fsync took 848 ms after adding 100000 files
      dir fsync took 175 ms after deleting 1000 files
      
      **** after patchset, 100 000 files, 1000 deletes ****
      
      dir fsync took 758 ms after adding 100000 files  (-11.2%)
      dir fsync took 63 ms after deleting 1000 files   (-94.1%)
      
      **** before patchset, 1 000 000 files, 1000 deletes ****
      
      dir fsync took 9945 ms after adding 1000000 files
      dir fsync took 473 ms after deleting 1000 files
      
      **** after patchset, 1 000 000 files, 1000 deletes ****
      
      dir fsync took 8677 ms after adding 1000000 files (-13.6%)
      dir fsync took 146 ms after deleting 1000 files   (-69.1%)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: check for relocation inodes on zoned btrfs in should_nocow · 2adada88
      Johannes Thumshirn authored
      Prepare for allowing preallocation for relocation inodes.
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce btrfs_is_data_reloc_root · 37f00a6d
      Johannes Thumshirn authored
      There are several places in our codebase where we check if a root is the
      root of the data reloc tree and subsequent patches will introduce more.
      
      Factor out the check into a small helper function instead of open coding
      it multiple times.
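
      A sketch of what such a helper looks like:

        static inline bool btrfs_is_data_reloc_root(const struct btrfs_root *root)
        {
                return root->root_key.objectid ==
                       BTRFS_DATA_RELOC_TREE_OBJECTID;
        }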
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert latest_bdev type to btrfs_device and rename · d24fa5c1
      Anand Jain authored
      In preparation to fix a bug in btrfs_show_devname().
      
      Convert the fs_devices::latest_bdev type from struct block_device to
      struct btrfs_device, and rename the member to fs_devices::latest_dev,
      so that btrfs_show_devname() can use fs_devices::latest_dev::name.
      Tested-by: Su Yue <l@damenly.su>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: finish fully written block group · be1a1d7a
      Naohiro Aota authored
      If we have written to the zone capacity, the device automatically
      deactivates the zone. Sync up block group side (the active BG list and
      zone_is_active flag) with it.
      
      We need to do it both on data BGs and metadata BGs. On data side, we add a
      hook to btrfs_finish_ordered_io(). On metadata side, we use
      end_extent_buffer_writeback().
      
      To reduce excess lookup of a block group, we mark the last extent buffer in
      a block group with EXTENT_BUFFER_ZONE_FINISH flag. This cannot be done for
      data (ordered_extent), because the address may change due to
      REQ_OP_ZONE_APPEND.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 18 Oct 2021, 2 commits
  5. 25 Aug 2021, 1 commit
  6. 23 Aug 2021, 6 commits