1. 21 6月, 2021 40 次提交
    • Q
      btrfs: make btrfs_dirty_pages() to be subpage compatible · f02a85d2
      Qu Wenruo 提交于
      Since the extent io tree operations in btrfs_dirty_pages() are already
      subpage compatible, we only need to make the page status update to use
      subpage helpers.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f02a85d2
    • Q
      btrfs: only require sector size alignment for end_bio_extent_writepage() · 321a02db
      Qu Wenruo 提交于
      Just like read page, for subpage support we only require sector size
      alignment.
      
      So change the error message condition to only require sector alignment.
      
      This should not affect existing code, as for regular sectorsize ==
      PAGE_SIZE case, we are still requiring page alignment.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      321a02db
    • Q
      btrfs: provide btrfs_page_clamp_*() helpers · 60e2d255
      Qu Wenruo 提交于
      In the coming subpage RW supports, there are a lot of page status update
      calls which need to be converted to subpage compatible version, which
      needs @start and @len.
      
      Some call sites already have such @start/@len and are already in
      page range, like various endio functions.
      
      But there are also call sites which need to clamp the range for subpage
      case, like btrfs_dirty_pagse() and __process_contig_pages().
      
      Here we introduce new helpers, btrfs_page_clamp_*(), to do and only do the
      clamp for subpage version.
      
      Although in theory all existing btrfs_page_*() calls can be converted to
      use btrfs_page_clamp_*() directly, but that would make us to do
      unnecessary clamp operations.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      60e2d255
    • Q
      btrfs: refactor page status update into process_one_page() · ed8f13bf
      Qu Wenruo 提交于
      In __process_pages_contig() we update page status according to page_ops.
      
      That update process is a bunch of 'if' branches, which lie inside
      two loops, this makes it pretty hard to expand for later subpage
      operations.
      
      So this patch will extract these operations into its own function,
      process_one_pages().
      
      Also since we're refactoring __process_pages_contig(), also move the new
      helper and __process_pages_contig() before the first caller of them, to
      remove the forward declaration.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ed8f13bf
    • Q
      btrfs: pass bytenr directly to __process_pages_contig() · 98af9ab1
      Qu Wenruo 提交于
      As a preparation for incoming subpage support, we need bytenr passed to
      __process_pages_contig() directly, not the current page index.
      
      So change the parameter and all callers to pass bytenr in.
      
      With the modification, here we need to replace the old @index_ret with
      @processed_end for __process_pages_contig(), but this brings a small
      problem.
      
      Normally we follow the inclusive return value, meaning @processed_end
      should be the last byte we processed.
      
      If parameter @start is 0, and we failed to lock any page, then we would
      return @processed_end as -1, causing more problems for
      __unlock_for_delalloc().
      
      So here for @processed_end, we use two different return value patterns.
      If we have locked any page, @processed_end will be the last byte of
      locked page.
      Or it will be @start otherwise.
      
      This change will impact lock_delalloc_pages(), so it needs to check
      @processed_end to only unlock the range if we have locked any.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      98af9ab1
    • Q
      btrfs: fix hang when run_delalloc_range() failed · 968f2566
      Qu Wenruo 提交于
      [BUG]
      When running subpage preparation patches on x86, btrfs/125 will hang
      forever with one ordered extent never finished.
      
      [CAUSE]
      The test case btrfs/125 itself will always fail as the fix is never merged.
      
      When the test fails at balance, btrfs needs to cleanup the ordered
      extent in btrfs_cleanup_ordered_extents() for data reloc inode.
      
      The problem is in the sequence how we cleanup the page Order bit.
      
      Currently it works like:
      
        btrfs_cleanup_ordered_extents()
        |- find_get_page();
        |- btrfs_page_clear_ordered(page);
        |  Now the page doesn't have Ordered bit anymore.
        |  !!! This also includes the first (locked) page !!!
        |
        |- offset += PAGE_SIZE
        |  This is to skip the first page
        |- __endio_write_update_ordered()
           |- btrfs_mark_ordered_io_finished(NULL)
              Except the first page, all ordered extents are finished.
      
      Then the locked page is cleaned up in __extent_writepage():
      
        __extent_writepage()
        |- If (PageError(page))
        |- end_extent_writepage()
           |- btrfs_mark_ordered_io_finished(page)
              |- if (btrfs_test_page_ordered(page))
              |-  !!! The page gets skipped !!!
                  The ordered extent is not decreased as the page doesn't
                  have ordered bit anymore.
      
      This leaves the ordered extent with bytes_left == sectorsize, thus never
      finish.
      
      [FIX]
      The fix is to ensure we never clear page Ordered bit without running the
      ordered extent accounting.
      
      Here we choose to skip the locked page in
      btrfs_cleanup_ordered_extents() so that later end_extent_writepage() can
      properly finish the ordered extent.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      968f2566
    • Q
      btrfs: rename PagePrivate2 to PageOrdered inside btrfs · f57ad937
      Qu Wenruo 提交于
      Inside btrfs we use Private2 page status to indicate we have an ordered
      extent with pending IO for the sector.
      
      But the page status name, Private2, tells us nothing about the bit
      itself, so this patch will rename it to Ordered.
      And with extra comment about the bit added, so reader who is still
      uncertain about the page Ordered status, will find the comment pretty
      easily.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f57ad937
    • Q
      btrfs: refactor btrfs_invalidatepage() for subpage support · 3b835840
      Qu Wenruo 提交于
      This patch will refactor btrfs_invalidatepage() for the incoming subpage
      support.
      
      The involved modifications are:
      
      - Use while() loop instead of "goto again;"
      - Use single variable to determine whether to delete extent states
        Each branch will also have comments why we can or cannot delete the
        extent states
      - Do qgroup free and extent states deletion per-loop
        Current code can only work for PAGE_SIZE == sectorsize case.
      
      This refactor also makes it clear what we do for different sectors:
      
      - Sectors without ordered extent
        We're completely safe to remove all extent states for the sector(s)
      
      - Sectors with ordered extent, but no Private2 bit
        This means the endio has already been executed, we can't remove all
        extent states for the sector(s).
      
      - Sectors with ordere extent, still has Private2 bit
        This means we need to decrease the ordered extent accounting.
        And then it comes to two different variants:
      
        * We have finished and removed the ordered extent
          Then it's the same as "sectors without ordered extent"
        * We didn't finished the ordered extent
          We can remove some extent states, but not all.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3b835840
    • Q
      btrfs: introduce btrfs_lookup_first_ordered_range() · c095f333
      Qu Wenruo 提交于
      Although we already have btrfs_lookup_first_ordered_extent() and
      btrfs_lookup_ordered_extent(), they all have their own limitations:
      
      - btrfs_lookup_ordered_extent() can't do extra range check
      
        It's only designed to lookup any ordered extent before certain bytenr.
      
      - btrfs_lookup_first_ordered_extent() may not return the first ordered
        extent in the range
      
        It doesn't ensure the first ordered extent is returned.
        The existing callers are only interested in exhausting all ordered
        extents in a range, the order is not important.
      
      For incoming btrfs_invalidatepage() refactoring, we need a way to
      properly iterate all ordered extents in their bytenr order of a range.
      
      So this patch will introduce a new function,
      btrfs_lookup_first_ordered_range(), to do ordered extent with bytenr
      order awareness and extra range check.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c095f333
    • Q
      btrfs: update comments in btrfs_invalidatepage() · 266a2586
      Qu Wenruo 提交于
      The existing comments in btrfs_invalidatepage() don't really get to the
      point, especially for what Private2 is really representing and how the
      race avoidance is done.
      
      The truth is, there are only three entrances to do ordered extent
      accounting:
      
      - btrfs_writepage_endio_finish_ordered()
      - __endio_write_update_ordered()
        Those two entrance are just endio functions for dio and buffered
        write.
      
      - btrfs_invalidatepage()
      
      But there is a pitfall, in endio functions there is no check on whether
      the ordered extent is already accounted.
      They just blindly clear the Private2 bit and do the accounting.
      
      So it's all btrfs_invalidatepage()'s responsibility to make sure we
      won't do double account for the same sector.
      
      That's why in btrfs_invalidatepage() we have to wait for page writeback,
      this will ensure all submitted bios have finished, thus their endio
      functions have finished the accounting on the ordered extent.
      
      Then we also check page Private2 to ensure that, we only run ordered
      extent accounting on pages who has no bio submitted.
      
      This patch will rework related comments to make it more clear on the
      race and how we use wait_on_page_writeback() and Private2 to prevent
      double accounting on ordered extent.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      266a2586
    • Q
      btrfs: refactor how we finish ordered extent io for endio functions · e65f152e
      Qu Wenruo 提交于
      Btrfs has two endio functions to mark certain io range finished for
      ordered extents:
      
      - __endio_write_update_ordered()
        This is for direct IO
      
      - btrfs_writepage_endio_finish_ordered()
        This for buffered IO.
      
      However they go different routines to handle ordered extent io:
      
      - Whether to iterate through all ordered extents
        __endio_write_update_ordered() will but
        btrfs_writepage_endio_finish_ordered() will not.
      
        In fact, iterating through all ordered extents will benefit later
        subpage support, while for current PAGE_SIZE == sectorsize requirement
        this behavior makes no difference.
      
      - Whether to update page Private2 flag
        __endio_write_update_ordered() will not update page Private2 flag as
        for iomap direct IO, the page can not be even mapped.
        While btrfs_writepage_endio_finish_ordered() will clear Private2 to
        prevent double accounting against btrfs_invalidatepage().
      
      Those differences are pretty subtle, and the ordered extent iterations
      code in callers makes code much harder to read.
      
      So this patch will introduce a new function,
      btrfs_mark_ordered_io_finished(), to do the heavy lifting:
      
      - Iterate through all ordered extents in the range
      - Do the ordered extent accounting
      - Queue the work for finished ordered extent
      
      This function has two new feature:
      
      - Proper underflow detection and recovery
        The old underflow detection will only detect the problem, then
        continue.
        No proper info like root/inode/ordered extent info, nor noisy enough
        to be caught by fstests.
      
        Furthermore when underflow happens, the ordered extent will never
        finish.
      
        New error detection will reset the bytes_left to 0, do proper
        kernel warning, and output extra info including root, ino, ordered
        extent range, the underflow value.
      
      - Prevent double accounting based on Private2 flag
        Now if we find a range without Private2 flag, we will skip to next
        range.
        As that means someone else has already finished the accounting of
        ordered extent.
      
        This makes no difference for current code, but will be a critical part
        for incoming subpage support, as we can call
        btrfs_mark_ordered_io_finished() for multiple sectors if they are
        beyond inode size.
        Thus such double accounting prevention is a key feature for subpage.
      
      Now both endio functions only need to call that new function.
      
      And since the only caller of btrfs_dec_test_first_ordered_pending() is
      removed, also remove btrfs_dec_test_first_ordered_pending() completely.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e65f152e
    • Q
      btrfs: make Private2 lifespan more consistent · 87b4d86b
      Qu Wenruo 提交于
      Currently we use page Private2 bit to indicate that we have ordered
      extent for the page range.
      
      But the lifespan of it is not consistent, during regular writeback path,
      there are two locations to clear the same PagePrivate2:
      
          T ----- Page marked Dirty
          |
          + ----- Page marked Private2, through btrfs_run_dealloc_range()
          |
          + ----- Page cleared Private2, through btrfs_writepage_cow_fixup()
          |       in __extent_writepage_io()
          |       ^^^ Private2 cleared for the first time
          |
          + ----- Page marked Writeback, through btrfs_set_range_writeback()
          |       in __extent_writepage_io().
          |
          + ----- Page cleared Private2, through
          |       btrfs_writepage_endio_finish_ordered()
          |       ^^^ Private2 cleared for the second time.
          |
          + ----- Page cleared Writeback, through
                  btrfs_writepage_endio_finish_ordered()
      
      Currently PagePrivate2 is mostly to prevent ordered extent accounting
      being executed for both endio and invalidatepage.
      Thus only the one who cleared page Private2 is responsible for ordered
      extent accounting.
      
      But the fact is, in btrfs_writepage_endio_finish_ordered(), page
      Private2 is cleared and ordered extent accounting is executed
      unconditionally.
      
      The race prevention only happens through btrfs_invalidatepage(), where
      we wait for the page writeback first, before checking the Private2 bit.
      
      This means, Private2 is also protected by Writeback bit, and there is no
      need for btrfs_writepage_cow_fixup() to clear Priavte2.
      
      This patch will change btrfs_writepage_cow_fixup() to just check
      PagePrivate2, not to clear it.
      The clearing will happen in either btrfs_invalidatepage() or
      btrfs_writepage_endio_finish_ordered().
      
      This makes the Private2 bit easier to understand, just meaning the page
      has unfinished ordered extent attached to it.
      
      And this patch is a hard requirement for the incoming refactoring for
      how we finished ordered IO for endio context, as the coming patch will
      check Private2 to determine if we need to do the ordered extent
      accounting.  Thus this patch is definitely needed or we will hang due to
      unfinished ordered extent.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      87b4d86b
    • Q
      btrfs: pass btrfs_inode to btrfs_writepage_endio_finish_ordered() · 38a39ac7
      Qu Wenruo 提交于
      There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
      end_compressed_bio_write().
      
      It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
      which is only supposed to accept inode pages.
      
      Thankfully the important info here is the inode, so let's pass
      btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
      make @page parameter optional.
      
      By this, end_compressed_bio_write() can happily pass page=NULL while
      still getting everything done properly.
      
      Also, to cooperate with such modification, replace @page parameter for
      trace_btrfs_writepage_end_io_hook() with btrfs_inode.
      Although this removes page_index info, the existing start/len should be
      enough for most usage.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      38a39ac7
    • Q
      btrfs: make subpage metadata write path call its own endio functions · fa04c165
      Qu Wenruo 提交于
      For subpage metadata, we're reusing two functions for subpage metadata
      write:
      
      - end_bio_extent_buffer_writepage()
      - write_one_eb()
      
      But the truth is, for subpage we just call
      end_bio_subpage_eb_writepage() without using any bit in
      end_bio_extent_buffer_writepage().
      
      For write_one_eb(), it's pretty similar, but with a small part of code
      reused.
      
      There is really no need to pollute the existing code path if we're not
      really using most of them.
      
      So this patch will do the following change to separate the subpage
      metadata write path from regular write path by:
      
      - Use end_bio_subpage_eb_writepage() directly as endio in
        write_one_subpage_eb()
      - Directly call write_one_subpage_eb() in submit_eb_subpage()
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fa04c165
    • Q
      btrfs: refactor submit_extent_page() to make bio and its flag tracing easier · 390ed29b
      Qu Wenruo 提交于
      There is a lot of code inside extent_io.c needs both "struct bio
      **bio_ret" and "unsigned long prev_bio_flags", along with some
      parameters like "unsigned long bio_flags".
      
      Such strange parameters are here for bio assembly.
      
      For example, we have such inode page layout:
      
        0       4K      8K      12K
        |<-- Extent A-->|<- EB->|
      
      Then what we do is:
      
      - Page [0, 4K)
        *bio_ret = NULL
        So we allocate a new bio to bio_ret,
        Add page [0, 4K) to *bio_ret.
      
      - Page [4K, 8K)
        *bio_ret != NULL
        We found this page is continuous to *bio_ret,
        and if we're not at stripe boundary, we
        add page [4K, 8K) to *bio_ret.
      
      - Page [8K, 12K)
        *bio_ret != NULL
        But we found this page is not continuous, so
        we submit *bio_ret, then allocate a new bio,
        and add page [8K, 12K) to the new bio.
      
      This means we need to record both the bio and its bio_flag, but we
      record them manually using those strange parameter list, other than
      encapsulating them into their own structure.
      
      So this patch will introduce a new structure, btrfs_bio_ctrl, to record
      both the bio, and its bio_flags.
      
      Also, in above case, for all pages added to the bio, we need to check if
      the new page crosses stripe boundary.  This check itself can be time
      consuming, and we don't really need to do that for each page.
      
      This patch also integrates the stripe boundary check into btrfs_bio_ctrl.
      When a new bio is allocated, the stripe and ordered extent boundary is
      also calculated, so no matter how large the bio will be, we only
      calculate the boundaries once, to save some CPU time.
      
      The following functions/structures are affected:
      
      - struct extent_page_data
        Replace its bio pointer with structure btrfs_bio_ctrl (embedded
        structure, not pointer)
      
      - end_write_bio()
      - flush_write_bio()
        Just change how bio is fetched
      
      - btrfs_bio_add_page()
        Use pre-calculated boundaries instead of re-calculating them.
        And use @bio_ctrl to replace @bio and @prev_bio_flags.
      
      - calc_bio_boundaries()
        New function
      
      - submit_extent_page() callers
      - btrfs_do_readpage() callers
      - contiguous_readpages() callers
        To Use @bio_ctrl to replace @bio and @prev_bio_flags, and how to grab
        bio.
      
      - btrfs_bio_fits_in_ordered_extent()
        Removed, as now the ordered extent size limit is done at bio
        allocation time, no need to check for each page range.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      390ed29b
    • Q
      btrfs: allow btrfs_bio_fits_in_stripe() to accept bio without any page · 1a0b5c4d
      Qu Wenruo 提交于
      Function btrfs_bio_fits_in_stripe() now requires a bio with at least one
      page added.  Or btrfs_get_chunk_map() will fail with -ENOENT.
      
      But in fact this requirement is not needed at all, as we can just pass
      sectorsize for btrfs_get_chunk_map().
      
      This tiny behavior change is important for later subpage refactoring on
      submit_extent_page().
      
      As for 64K page size, we can have a page range with pgoff=0 and size=64K.
      If the logical bytenr is just 16K before the stripe boundary, we have to
      split the page range into two bios.
      
      This means, we must check page range against stripe boundary, even adding
      the range to an empty bio.
      
      This tiny refactoring is for the incoming changes, but on its own,
      regular sectorsize == PAGE_SIZE is not affected anyway.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1a0b5c4d
    • Q
      btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe() · 43c0d1a5
      Qu Wenruo 提交于
      The parameter @len is not really used in btrfs_bio_fits_in_stripe(),
      just remove it.
      
      It got removed in 42034313 ("btrfs: let callers of
      btrfs_get_io_geometry pass the em"), before that btrfs_get_chunk_map
      utilized it.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      43c0d1a5
    • Q
      btrfs: make free space cache size consistent across different PAGE_SIZE · 0044ae11
      Qu Wenruo 提交于
      Currently free space cache inode size is determined by two factors:
      
      - block group size
      - PAGE_SIZE
      
      This means, for the same sized block groups, with different PAGE_SIZE,
      it will result in different inode sizes.
      
      This will not be a good thing for subpage support, so change the
      requirement for PAGE_SIZE to sectorsize.
      
      Now for the same 4K sectorsize btrfs, it should result the same inode
      size no matter what the PAGE_SIZE is.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0044ae11
    • Q
      btrfs: scrub: fix subpage repair error caused by hard coded PAGE_SIZE · 8df507cb
      Qu Wenruo 提交于
      [BUG]
      For the following file layout, scrub will not be able to repair all
      these two repairable error, but in fact make one corruption even
      unrepairable:
      
      	  inode offset 0      4k     8K
      Mirror 1               |XXXXXX|      |
      Mirror 2               |      |XXXXXX|
      
      [CAUSE]
      The root cause is the hard coded PAGE_SIZE, which makes scrub repair to
      go crazy for subpage.
      
      For above case, when reading the first sector, we use PAGE_SIZE other
      than sectorsize to read, which makes us to read the full range [0, 64K).
      In fact, after 8K there may be no data at all, we can just get some
      garbage.
      
      Then when doing the repair, we also writeback a full page from mirror 2,
      this means, we will also writeback the corrupted data in mirror 2 back
      to mirror 1, leaving the range [4K, 8K) unrepairable.
      
      [FIX]
      This patch will modify the following PAGE_SIZE use with sectorsize:
      
      - scrub_print_warning_inode()
        Remove the min() and replace PAGE_SIZE with sectorsize.
        The min() makes no sense, as csum is done for the full sector with
        padding.
      
        This fixes a bug that subpage report extra length like:
         checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test,
         physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file)
      
        Where the error is only 1 sector.
      
      - scrub_handle_errored_block()
        Comments with PAGE|page involved, all changed to sector.
      
      - scrub_setup_recheck_block()
      - scrub_repair_page_from_good_copy()
      - scrub_add_page_to_wr_bio()
      - scrub_wr_submit()
      - scrub_add_page_to_rd_bio()
      - scrub_block_complete()
        Replace PAGE_SIZE with sectorsize.
        This solves several problems where we read/write extra range for
        subpage case.
      
      RAID56 code is excluded intentionally, as RAID56 has extra PAGE_SIZE
      usage, and is not really safe enough.
      Thus we will reject RAID56 for subpage in later commit.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8df507cb
    • N
      btrfs: use list_last_entry in add_falloc_range · ec87b42f
      Nikolay Borisov 提交于
      Instead of calling list_entry with head->prev simply call
      list_last_entry which makes it obvious which member of the list is
      being referred. This allows to remove the extra 'prev' pointer.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ec87b42f
    • A
      btrfs: fix comment about max_out in btrfs_compress_pages · 4183abf6
      Anand Jain 提交于
      Commit e5d74902 ("btrfs: derive maximum output size in the
      compression implementation") removed @max_out argument in
      btrfs_compress_pages() but its comment remained, remove it.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4183abf6
    • A
      btrfs: optimize variables size in btrfs_submit_compressed_write · 65b5355f
      Anand Jain 提交于
      Patch "btrfs: reduce compressed_bio member's types" reduced some
      member's size. Function arguments @len, @compressed_len and @nr_pages
      can be declared as unsigned int.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      65b5355f
    • A
      btrfs: optimize variables size in btrfs_submit_compressed_read · 356b4a2d
      Anand Jain 提交于
      Patch "btrfs: reduce compressed_bio member's types" reduced some
      member's size. Declare the variables @compressed_len, @nr_pages and
      @pg_index size as an unsigned int in the function
      btrfs_submit_compressed_read.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      356b4a2d
    • A
      btrfs: reduce the variable size to fit nr_pages · 1d08ce58
      Anand Jain 提交于
      Patch "btrfs: reduce compressed_bio member's types" reduced the
      @nr_pages size to unsigned int, its cascading effects are updated here.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1d08ce58
    • F
      btrfs: avoid unnecessary logging of xattrs during fast fsyncs · b590b839
      Filipe Manana 提交于
      When logging an inode we always log all its xattrs, so that we are able
      to figure out which ones should be deleted during log replay. However this
      is unnecessary when we are doing a fast fsync and no xattrs were added,
      changed or deleted since the last time we logged the inode in the current
      transaction.
      
      So skip the logging of xattrs when the inode was previously logged in the
      current transaction and no xattrs were added, changed or deleted. If any
      changes to xattrs happened, than the inode has BTRFS_INODE_COPY_EVERYTHING
      set in its runtime flags and the xattrs get logged. This saves time on
      scanning for xattrs, allocating memory, COWing log tree extent buffers and
      adding more lock contention on the extent buffers when there are multiple
      tasks logging in parallel.
      
      The use of xattrs is common when using ACLs, some applications, or when
      using security modules like SELinux where every inode gets a security
      xattr added to it.
      
      The following test script, using fio, was used on a box with 12 cores, 64G
      of RAM, a NVMe device and the default non-debug kernel config from Debian.
      It uses 8 concurrent jobs each writing in blocks of 64K to its own 4G file,
      each file with a single xattr of 50 bytes (about the same size for an ACL
      or SELinux xattr), doing random buffered writes with an fsync after each
      write.
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/test
         MOUNT_OPTIONS="-o ssd"
         MKFS_OPTIONS="-d single -m single"
      
         NUM_JOBS=8
         FILE_SIZE=4G
      
         cat <<EOF > /tmp/fio-job.ini
         [writers]
         rw=randwrite
         fsync=1
         fallocate=none
         group_reporting=1
         direct=0
         bs=64K
         ioengine=sync
         size=$FILE_SIZE
         directory=$MNT
         numjobs=$NUM_JOBS
         EOF
      
         echo "performance" | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
         mount $MOUNT_OPTIONS $DEV $MNT
      
         echo "Creating files before fio runs, each with 1 xattr of 50 bytes"
         for ((i = 0; i < $NUM_JOBS; i++)); do
             path="$MNT/writers.$i.0"
             truncate -s $FILE_SIZE $path
             setfattr -n user.xa1 -v $(printf '%0.sX' $(seq 50)) $path
         done
      
         fio /tmp/fio-job.ini
         umount $MNT
      
      fio output before this change:
      
      WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=272145-272145msec
      
      fio output after this change:
      
      WRITE: bw=142MiB/s (149MB/s), 142MiB/s-142MiB/s (149MB/s-149MB/s), io=32.0GiB (34.4GB), run=230408-230408msec
      
      +16.8% throughput, -16.6% runtime
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b590b839
    • D
      btrfs: add device delete cancel · 67ae34b6
      David Sterba 提交于
      Accept device name "cancel" as a request to cancel running device
      deletion operation. The string is literal, in case there's a real device
      named "cancel", pass it as full absolute path or as "./cancel"
      
      This works for v1 and v2 ioctls when the device is specified by name.
      Moving chunks from the device uses relocation, use the conditional
      exclusive operation start and cancellation helpers
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      67ae34b6
    • D
      btrfs: add cancellation to resize · bb059a37
      David Sterba 提交于
      Accept literal string "cancel" as resize operation and interpret that
      as a request to cancel the running operation. If it's running, wait
      until it finishes current work and return ECANCELED.
      
      Shrinking resize uses relocation to move the chunks away, use the
      conditional exclusive operation start and cancellation helpers.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bb059a37
    • D
      btrfs: add wrapper for conditional start of exclusive operation · 17aaa434
      David Sterba 提交于
      To support optional cancellation of some operations, add helper that will
      wrap all the combinations. In normal mode it's same as
      btrfs_exclop_start, in cancellation mode it checks if it's already
      running and request cancellation and waits until completion.
      
      The error codes can be returned to to user space and semantics is not
      changed, adding ECANCELED. This should be evaluated as an error and that
      the operation has not completed and the operation should be restarted
      or the filesystem status reviewed.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      17aaa434
    • D
      btrfs: introduce try-lock semantics for exclusive op start · 578bda9e
      David Sterba 提交于
      Add try-lock for exclusive operation start to allow callers to do more
      checks. The same operation must already be running. The try-lock and
      unlock must pair and are a substitute for btrfs_exclop_start, thus it
      must also pair with btrfs_exclop_finish to release the exclop context.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      578bda9e
    • D
      btrfs: add cancellable chunk relocation support · 907d2710
      David Sterba 提交于
      Add support code that will allow canceling relocation on the chunk
      granularity. This is different and independent of balance, that also
      uses relocation but is a higher level operation and manages it's own
      state and pause/cancellation requests.
      
      Relocation is used for resize (shrink) and device deletion so this will
      be a common point to implement cancellation for both. The context is
      entirely in btrfs_relocate_block_group and btrfs_recover_relocation,
      enclosing one chunk relocation. The status bit is set and unset between
      the chunks. As relocation can take long, the effects may not be
      immediate and the request and actual action can slightly race.
      
      The fs_info::reloc_cancel_req is only supposed to be increased and does
      not pair with decrement like fs_info::balance_cancel_req.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      907d2710
    • D
      btrfs: protect exclusive_operation by super_lock · 0d7ed32c
      David Sterba 提交于
      The exclusive operation is now atomically checked and set using bit
      operations. Switch it to protection by spinlock. The super block lock is
      not frequently used and adding a new lock seems like an overkill so it
      should be safe to reuse it.
      
      The reason to use spinlock is to enhance the locking context so more
      checks can be done, eg. allowing the same exclusive operation enter
      the exclop section and cancel the running one. This will be used for
      resize and device delete.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0d7ed32c
    • D
      btrfs: clean up header members offsets in write helpers · 24880be5
      David Sterba 提交于
      Move header offsetof() to the expression that calculates the address so
      it's part of get_eb_offset_in_page where the 2nd parameter is the member
      offset.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      24880be5
    • D
      btrfs: simplify eb checksum verification in btrfs_validate_metadata_buffer · dfd29eed
      David Sterba 提交于
      The verification copies the calculated checksum bytes to a temporary
      buffer but this is not necessary. We can map the eb header on the first
      page and use the checksum bytes directly.
      
      This saves at least one function call and boundary checks so it could
      lead to a minor performance improvement.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dfd29eed
    • D
      btrfs: remove extra sb::s_id from message in btrfs_validate_metadata_buffer · ff14aa79
      David Sterba 提交于
      The s_id is already printed by message helpers.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ff14aa79
    • D
      btrfs: reduce compressed_bio members' types · 282ab3ff
      David Sterba 提交于
      Several members of compressed_bio are of type that's unnecessarily big
      for the values that they'd hold:
      
      - the size of the uncompressed and compressed data is 128K now, we can
        keep is as int
      - same for number of pages
      - the compress type fits to a byte
      - the errors is 0/1
      
      The size of the unpatched structure is 80 bytes with several holes.
      Reordering nr_pages next to the pages the hole after pending_bios is
      filled and the resulting size is 56 bytes. This keeps the csums array
      aligned to 8 bytes, which is nice. Further size optimizations may be
      possible but right now it looks good to me:
      
      struct compressed_bio {
              refcount_t                 pending_bios;         /*     0     4 */
              unsigned int               nr_pages;             /*     4     4 */
              struct page * *            compressed_pages;     /*     8     8 */
              struct inode *             inode;                /*    16     8 */
              u64                        start;                /*    24     8 */
              unsigned int               len;                  /*    32     4 */
              unsigned int               compressed_len;       /*    36     4 */
              u8                         compress_type;        /*    40     1 */
              u8                         errors;               /*    41     1 */
      
              /* XXX 2 bytes hole, try to pack */
      
              int                        mirror_num;           /*    44     4 */
              struct bio *               orig_bio;             /*    48     8 */
              u8                         sums[];               /*    56     0 */
      
              /* size: 56, cachelines: 1, members: 12 */
              /* sum members: 54, holes: 1, sum holes: 2 */
              /* last cacheline: 56 bytes */
      };
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      282ab3ff
    • D
    • D
      btrfs: scrub: factor out common scrub_stripe constraints · 7735cd75
      David Sterba 提交于
      There are common values set for the stripe constraints, some of them
      are already factored out. Do that for increment and mirror_num as well.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7735cd75
    • D
      btrfs: clear log tree recovering status if starting transaction fails · 1aeb6b56
      David Sterba 提交于
      When a log recovery is in progress, lots of operations have to take that
      into account, so we keep this status per tree during the operation. Long
      time ago error handling revamp patch 79787eaa ("btrfs: replace many
      BUG_ONs with proper error handling") removed clearing of the status in
      an error branch. Add it back as was intended in e02119d5 ("Btrfs:
      Add a write ahead tree log to optimize synchronous operations").
      
      There are probably no visible effects, log replay is done only during
      mount and if it fails all structures are cleared so the stale status
      won't be kept.
      
      Fixes: 79787eaa ("btrfs: replace many BUG_ONs with proper error handling")
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1aeb6b56
    • D
      btrfs: clear defrag status of a root if starting transaction fails · 6819703f
      David Sterba 提交于
      The defrag loop processes leaves in batches and starting transaction for
      each. The whole defragmentation on a given root is protected by a bit
      but in case the transaction fails, the bit is not cleared
      
      In case the transaction fails the bit would prevent starting
      defragmentation again, so make sure it's cleared.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6819703f
    • D
      btrfs: sysfs: fix format string for some discard stats · 8c5ec995
      David Sterba 提交于
      The type of discard_bitmap_bytes and discard_extent_bytes is u64 so the
      format should be %llu, though the actual values would hardly ever
      overflow to negative values.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8c5ec995