1. 16 May, 2022 40 commits
    • btrfs: turn delayed_nodes_tree into an XArray · 253bf575
      Gabriel Niebler authored
      … in the btrfs_root struct and adjust all usages of this object to use
      the XArray API, because it is notionally easier to use and understand,
      as it provides array semantics, and also takes care of locking for us,
      further simplifying the code.
      
      Also use the opportunity to do some light refactoring.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use ilog2() to replace if () branches for btrfs_bg_flags_to_raid_index() · 719fae89
      Qu Wenruo authored
      In function btrfs_bg_flags_to_raid_index(), we use quite a few if ()
      branches to convert the BTRFS_BLOCK_GROUP_* bits to an index number.

      But there is really no need for so many branches at all.
      Since all BTRFS_BLOCK_GROUP_* flags are just one single bit set inside
      BTRFS_BLOCK_GROUP_PROFILES_MASK, we can easily use ilog2() to calculate
      their values.
      
      This calculation has an anchor point, the lowest PROFILE bit, which is
      RAID0.
      
      Even though it's a fixed on-disk format and should never change, here I
      added extra compile time checks to make it super safe:

      1. Make sure RAID0 is always the lowest bit in PROFILE_MASK
         This is done by finding the first (least significant) bit set of
         RAID0 and PROFILE_MASK & ~RAID0.

      2. Make sure the RAID0 bit is set beyond the highest bit of TYPE_MASK
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
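The ilog2() conversion described in the commit above can be sketched in userspace C. This is only an illustration under stated assumptions: the `BG_*` macros, their bit positions, `ilog2_u64()` and the index numbering (0 for "single", 1 for RAID0) are hypothetical stand-ins chosen for this sketch, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Profile bits; the positions are assumptions made for this sketch. */
#define BG_RAID0  (UINT64_C(1) << 3)
#define BG_RAID1  (UINT64_C(1) << 4)
#define BG_DUP    (UINT64_C(1) << 5)
#define BG_RAID10 (UINT64_C(1) << 6)
#define BG_RAID5  (UINT64_C(1) << 7)
#define BG_RAID6  (UINT64_C(1) << 8)
#define BG_PROFILE_MASK \
	(BG_RAID0 | BG_RAID1 | BG_DUP | BG_RAID10 | BG_RAID5 | BG_RAID6)

/* Compile-time check: no profile bit may sit below the RAID0 anchor. */
_Static_assert((BG_PROFILE_MASK & (BG_RAID0 - 1)) == 0,
	       "RAID0 must be the lowest profile bit");

/* Userspace stand-in for ilog2(): position of the highest set bit. */
static unsigned int ilog2_u64(uint64_t v)
{
	unsigned int n = 0;

	assert(v != 0);
	while ((v >>= 1) != 0)
		n++;
	return n;
}

/* Index 0 means "no profile bit set" (single); RAID0 maps to index 1. */
static unsigned int bg_flags_to_raid_index(uint64_t flags)
{
	uint64_t profile = flags & BG_PROFILE_MASK;

	if (profile == 0)
		return 0;
	return ilog2_u64(profile) - ilog2_u64(BG_RAID0) + 1;
}
```

Because every profile is a single bit, the distance of that bit from the RAID0 anchor is enough to derive the index, which is why no chain of branches is needed.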
    • btrfs: move definition of btrfs_raid_types to volumes.h · f04fbcc6
      Qu Wenruo authored
      It's only used internally as another way to represent btrfs profiles
      and is not exposed through any on-disk format; in fact, btrfs_raid_types
      diverges from the on-disk format values.

      Furthermore, since it's an internal structure, its definition can change
      in the future.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use a normal workqueue for rmw_workers · 385de0ef
      Christoph Hellwig authored
      rmw_workers doesn't need ordered execution or thread disabling threshold
      (as the thresh parameter is less than DFT_THRESHOLD).

      Just switch to the normal workqueues that use a lot less resources,
      especially in the work_struct vs btrfs_work structures.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use normal workqueues for scrub · be539518
      Christoph Hellwig authored
      All three scrub workqueues don't need ordered execution or thread
      disabling threshold (as the thresh parameter is less than DFT_THRESHOLD).
      Just switch to the normal workqueues that use a lot less resources,
      especially in the work_struct vs btrfs_work structures.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify WQ_HIGHPRI handling in struct btrfs_workqueue · a31b4a43
      Christoph Hellwig authored
      Just let the one caller that wants optional WQ_HIGHPRI handling allocate
      a separate btrfs_workqueue for that.  This allows renaming struct
      __btrfs_workqueue to btrfs_workqueue, removing a pointer indirection and
      separate allocation for all btrfs_workqueue users, and generally
      simplifying the code.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: enable subpage support for RAID56 · a7b8e39c
      Qu Wenruo authored
      Now that the btrfs RAID56 infrastructure has migrated to the sector_ptr
      interface, it should be safe to enable subpage support for RAID56.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: make alloc_rbio_essential_pages() subpage compatible · 3907ce29
      Qu Wenruo authored
      The non-compatible part is only the bitmap iteration part, now the
      bitmap size is extended to rbio::stripe_nsectors, not the old
      rbio::stripe_npages.
      
      Since we're here, also slightly improve the function by:
      
      - Rename @i to @stripe
      - Rename @bit to @sectornr
      - Move @page and @index into the inner loop
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: make steal_rbio() subpage compatible · d4e28d9b
      Qu Wenruo authored
      Function steal_rbio() will take all the uptodate pages from the source
      rbio to destination rbio.
      
      With the new stripe_sectors[] array, we also need to do the extra check:
      
      - Check sector::flags to make sure the full page is uptodate
        Now we don't use PageUptodate flag for subpage cases to indicate
        if the page is uptodate.
      
        Instead we need to check all the sectors belonging to the page to be
        sure whether the full page is uptodate.
      
        So here we introduce a new helper, full_page_sectors_uptodate() to do
        the check.
      
      - Update rbio::stripe_sectors[] to use the new page pointer
        We only need to change the page pointer, no need to change anything
        else.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
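The per-sector check that the commit above describes might look roughly like the following. It is a minimal userspace sketch: `struct sector_state` and the explicit `sectors_per_page` parameter are hypothetical, and the helper here only mirrors the idea of full_page_sectors_uptodate(), not its real signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-sector state; only the flag matters for this sketch. */
struct sector_state {
	bool uptodate;
};

/* True only if every sector backed by page @page_nr is uptodate. */
static bool full_page_sectors_uptodate(const struct sector_state *sectors,
				       size_t page_nr, size_t sectors_per_page)
{
	size_t first = page_nr * sectors_per_page;

	assert(sectors_per_page > 0);
	for (size_t i = first; i < first + sectors_per_page; i++) {
		if (!sectors[i].uptodate)
			return false;
	}
	return true;
}
```

A single stale sector is enough to disqualify the whole page, which is the reason a page-level flag can no longer be trusted in the subpage case.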
    • btrfs: raid56: make set_bio_pages_uptodate() subpage compatible · 5fdb7afc
      Qu Wenruo authored
      Unlike previous code, we cannot directly set PageUptodate for stripe
      pages now.  Instead we have to iterate through all the sectors and set
      the SECTOR_UPTODATE flag there.

      Introduce a new helper, find_stripe_sector(), to do the work.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: remove btrfs_raid_bio::bio_pages array · ac26df8b
      Qu Wenruo authored
      The functionality is completely replaced by the new bio_sectors member,
      so it's time to remove the old member.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: make raid56_add_scrub_pages() subpage compatible · 6346f6bf
      Qu Wenruo authored
      This requires one extra parameter @pgoff for the function.

      In the current code base, scrub is still one page per sector, thus the
      new parameter will always be 0.

      It needs the extra subpage scrub optimization code to fully take
      advantage.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: open code rbio_stripe_page_index() · f77183dc
      Qu Wenruo authored
      There is only one caller for that helper now, and we're definitely fine
      to open-code it.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: make finish_rmw() subpage compatible · 1145059a
      Qu Wenruo authored
      With this function converted to subpage compatible sector interfaces,
      the following helper functions can be removed:

      - rbio_stripe_page()
      - rbio_pstripe_page()
      - rbio_qstripe_page()
      - page_in_rbio()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: make __raid_recover_endio_io() subpage compatible · 07e4d380
      Qu Wenruo authored
      This involves:
      
      - Use sector_ptr interface to grab the pointers
      
      - Add sector->pgoff to pointers[]
      
      - Rebuild data using sectorsize instead of PAGE_SIZE
      
      - Use memcpy() to replace copy_page()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: make finish_parity_scrub() subpage compatible · 46900662
      Qu Wenruo authored
      The core is to convert direct page usage into sector_ptr usage, and
      use memcpy() to replace copy_page().

      For pointers usage, we need to convert it to kmap_local_page() +
      sector->pgoff.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: make rbio_add_io_page() subpage compatible · 3e77605d
      Qu Wenruo authored
      Make rbio_add_io_page() subpage compatible, which involves:
      
      - Rename rbio_add_io_page() to rbio_add_io_sector()
        We still rely on PAGE_SIZE == sectorsize though, so add a new
        ASSERT() inside rbio_add_io_sector() to make sure all pgoff is 0.
      
      - Introduce rbio_stripe_sector() helper
        The equivalent of rbio_stripe_page().
      
        This new helper has extra ASSERT()s to validate the stripe and sector
        number.
      
      - Introduce sector_in_rbio() helper
        The equivalent of page_in_rbio().
      
      - Rename @pagenr variables to @sectornr
      
      - Use rbio::stripe_nsectors when iterating the bitmap
      
      Please note that, this only changes the interface, the bios are still
      using full page for IO.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: introduce btrfs_raid_bio::bio_sectors · 00425dd9
      Qu Wenruo authored
      This new member is going to fully replace bio_pages in the future, but
      for now let them co-exist until the full switch is done.

      Currently cache_rbio_pages() and index_rbio_pages() will also populate
      the new array.

      And cache_rbio_pages() needs to record which sectors are uptodate, so we
      also need to introduce the sector_ptr::uptodate bit.

      To avoid extra memory usage, we let the new @uptodate bit share bits
      with @pgoff.  Now pgoff only has at most 31 bits, which is already more
      than enough, as even for 256K page size, we only need 18 bits.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
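The idea of letting @uptodate share bits with @pgoff can be illustrated with C bitfields. The sketch below assumes a 31/1 bit split, following the "at most 31 bits" wording of the message; `struct sector_bits` and `make_sector()` are hypothetical names, and the page pointer of the real structure is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical layout: 31 bits of page offset plus a 1-bit uptodate flag. */
struct sector_bits {
	unsigned int pgoff : 31;
	unsigned int uptodate : 1;
};

static struct sector_bits make_sector(unsigned int pgoff, bool uptodate)
{
	struct sector_bits s;

	assert(pgoff < (1u << 31));	/* must fit the 31-bit field */
	s.pgoff = pgoff;
	s.uptodate = uptodate ? 1u : 0u;
	return s;
}
```

An offset inside a 256K page is at most 262143, which needs 18 bits, so the flag can borrow from the same word without limiting any realistic page size.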
    • btrfs: raid56: introduce btrfs_raid_bio::stripe_sectors · eb357060
      Qu Wenruo authored
      The new member is an array of sector_ptr pointers, they will represent
      all sectors inside a full stripe (including P/Q).
      
      They co-operate with btrfs_raid_bio::stripe_pages:
      
      stripe_pages:   | Page 0, range [0, 64K)   | Page 1 ...
      stripe_sectors: |  |  | ...             |  |
                      |  |                    \- sector 15, page 0, pgoff=60K
                      |  \- sector 1, page 0, pgoff=4K
                      \---- sector 0, page 0, pgoff=0
      
      With such structure, we can represent subpage sectors without using
      extra pages.
      
      Here we introduce a new helper, index_stripe_sectors(), to update
      stripe_sectors[] to point to correct page and pgoff.
      
      So every time rbio::stripe_pages[] pointer gets updated, the new helper
      should be called.
      
      The following functions have to call the new helper:
      
      - steal_rbio()
      - alloc_rbio_pages()
      - alloc_rbio_parity_pages()
      - alloc_rbio_essential_pages()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
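The mapping drawn in the diagram of the commit above, from a sector number to a page and an offset inside it, can be sketched as follows. This is a simplified userspace version: `struct sector_ref`, the `void *` page stand-in and the explicit size parameters are assumptions for illustration, not the signature of the kernel helper.

```c
#include <assert.h>
#include <stddef.h>

struct sector_ref {
	void *page;		/* stand-in for struct page * */
	unsigned int pgoff;	/* byte offset of the sector inside that page */
};

/* Re-point every sector at the page that backs it, plus its offset. */
static void index_stripe_sectors(void *const *stripe_pages,
				 struct sector_ref *sectors,
				 unsigned int nr_sectors,
				 unsigned int sectorsize,
				 unsigned int page_size)
{
	assert(sectorsize > 0 && page_size >= sectorsize);
	for (unsigned int i = 0; i < nr_sectors; i++) {
		unsigned int offset = i * sectorsize;

		sectors[i].page = stripe_pages[offset / page_size];
		sectors[i].pgoff = offset % page_size;
	}
}
```

With 64K pages and 4K sectors, sector 15 lands on page 0 at offset 60K and sector 16 starts page 1, matching the diagram. Re-running the helper after any page pointer changes keeps the two arrays consistent.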
    • btrfs: raid56: introduce new cached members for btrfs_raid_bio · 94efbe19
      Qu Wenruo authored
      The new members are all related to number of sectors, but the existing
      number of pages members are kept as is:
      
      - nr_sectors
        Total sectors of the full stripe including P/Q.
      
      - stripe_nsectors
        The sectors of a single stripe.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: make btrfs_raid_bio more compact · 29b06838
      Qu Wenruo authored
      There are a lot of members using a much larger type in btrfs_raid_bio
      than necessary, like nr_pages, which represents the total number of
      pages of a full stripe.

      Instead of int (which is at least 32 bits), u16 is already enough
      (max stripe length will be 256MiB, already beyond current RAID56 device
      number limit).
      
      So this patch will reduce the width of the following members:
      
      - stripe_len to u32
      - nr_pages to u16
      - nr_data to u8
      - real_stripes to u8
      - scrubp to u8
      - faila/b to s8
        As -1 is used to indicate no corruption
      
      This will slightly reduce the size of btrfs_raid_bio from 272 bytes to
      256 bytes, a saving of 16 bytes.

      But please note that, when using btrfs_raid_bio, we allocate extra space
      for it to cover various pointer arrays, so the reduced memory is not
      really a big saving overall.
      
      As we're here modifying the comments already, update existing comments
      to current code standard.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: open code rbio_nr_pages() · 843de58b
      Qu Wenruo authored
      The function rbio_nr_pages() is only called once inside alloc_rbio(),
      there is no reason to make it a dedicated helper.

      Furthermore, the return type doesn't match: the function returns
      "unsigned long", which may not be necessary, while the only caller only
      uses "int".

      Since we're cleaning up here, also fix the type to "const unsigned
      int" for all involved local variables.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reduce width for stripe_len from u64 to u32 · cc353a8b
      Qu Wenruo authored
      Currently btrfs uses fixed stripe length (64K), thus u32 is wide enough
      for the usage.
      
      Furthermore, even if in the future we choose to enlarge the stripe
      length, I don't believe we would want a stripe as large as 4G or
      larger.
      
      So this patch will reduce the width for all in-memory structures and
      parameters, this involves:
      
      - RAID56 related function argument lists
        This allows us to do direct division related to stripe_len,
        although we will use bit shifts to replace the division anyway.
      
      - btrfs_io_geometry structure
        This involves one change to simplify the calculation of both @stripe_nr
        and @stripe_offset, using div64_u64_rem().
        And add extra sanity check to make sure @stripe_offset is always small
        enough for u32.
      
        This saves 8 bytes for the structure.
      
      - map_lookup structure
        This converts @stripe_len to u32, which saves 8 bytes (4 bytes saved
        directly, plus a removed 4-byte hole).
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
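The combined computation of @stripe_nr and @stripe_offset, together with the sanity check that the remainder fits a u32, could be sketched like this. It is a userspace illustration only: plain `/` and `%` stand in for div64_u64_rem(), and `struct stripe_pos` and `split_offset()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct stripe_pos {
	uint64_t stripe_nr;	/* index of the stripe containing the offset */
	uint32_t stripe_offset;	/* byte offset inside that stripe */
};

static struct stripe_pos split_offset(uint64_t offset, uint32_t stripe_len)
{
	struct stripe_pos pos;
	uint64_t rem;

	assert(stripe_len != 0);
	pos.stripe_nr = offset / stripe_len;
	rem = offset % stripe_len;
	/* The remainder is below stripe_len, so it must fit in 32 bits. */
	assert(rem <= UINT32_MAX);
	pos.stripe_offset = (uint32_t)rem;
	return pos;
}
```

Since the remainder is always smaller than the 32-bit divisor, narrowing it loses nothing; the assertion simply documents that invariant.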
    • btrfs: do not return errors from submit_bio_hook_t instances · ad357938
      Christoph Hellwig authored
      Both btrfs_repair_one_sector and submit_bio_one as the direct caller of
      one of the instances ignore errors as they expect the methods themselves
      to call ->bi_end_io on error.  Remove the unused and dangerous return
      value.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not return errors from btrfs_submit_compressed_read · cb4411dd
      Christoph Hellwig authored
      btrfs_submit_compressed_read already calls ->bi_end_io on error and
      the caller must ignore the return value, so remove it.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not return errors from btrfs_submit_metadata_bio · 94d9e11b
      Christoph Hellwig authored
      btrfs_submit_metadata_bio already calls ->bi_end_io on error and the
      caller must ignore the return value, so remove it.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused bio_flags argument to btrfs_submit_metadata_bio · abf48d58
      Christoph Hellwig authored
      This argument is unused since commit 953651eb ("btrfs: factor out
      helper adding a page to bio") and commit 1b36294a ("btrfs: call
      submit_bio_hook directly for metadata pages") reworked the way metadata
      bio submission is handled.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move btrfs_readpage to extent_io.c · 7aab8b32
      Christoph Hellwig authored
      Keep btrfs_readpage next to btrfs_do_readpage and the other address
      space operations.  This allows keeping submit_one_bio and
      struct btrfs_bio_ctrl file local in extent_io.c.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: repair super block num_devices automatically · d201238c
      Qu Wenruo authored
      [BUG]
      There is a report that a btrfs filesystem has a bad super block
      num_devices value.

      This makes btrfs reject the fs completely.
      
        BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
        BTRFS error (device sdd3): failed to read chunk tree: -22
        BTRFS error (device sdd3): open_ctree failed
      
      [CAUSE]
      During btrfs device removal, chunk tree and super block num devs are
      updated in two different transactions:
      
        btrfs_rm_device()
        |- btrfs_rm_dev_item(device)
        |  |- trans = btrfs_start_transaction()
        |  |  Now we got transaction X
        |  |
        |  |- btrfs_del_item()
        |  |  Now device item is removed from chunk tree
        |  |
        |  |- btrfs_commit_transaction()
        |     Transaction X got committed, super num devs untouched,
        |     but device item removed from chunk tree.
        |     (AKA, super num devs is already incorrect)
        |
        |- cur_devices->num_devices--;
        |- cur_devices->total_devices--;
        |- btrfs_set_super_num_devices()
           All those operations are not in transaction X, thus it will
           only be written back to disk in next transaction.
      
      So if a power loss happens after transaction X in btrfs_rm_dev_item()
      is committed but before transaction X+1 (which can be minutes away),
      we get the super num mismatch.
      
      This has been fixed by commit bbac5869 ("btrfs: remove device item
      and update super block in the same transaction").
      
      [FIX]
      Make the super_num_devices check less strict, converting it from a hard
      error to a warning, and reset the value to a correct one for the current
      or next transaction commit.
      
      As the number of device items is the critical information where the
      super block num_devices is only a cached value (and also useful for
      cross checking), it's safe to automatically update it. Other device
      related problems like missing device are handled after that and may
      require other means to resolve, like degraded mount. With this fix,
      potentially affected filesystems won't fail mount and require the manual
      repair by btrfs check.
      Reported-by: Luca Béla Palkovics <luca.bela.palkovics@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
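The relaxed check described in the [FIX] section above, warn and reset instead of failing, can be sketched in userspace C. The function name, its parameters and the message text are hypothetical; the real code works on the in-memory super block and device list rather than on a bare counter.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Treat the number of device items found as authoritative and the super
 * block value as a cache. Returns true if the cached value was repaired.
 */
static bool repair_super_num_devices(uint64_t *super_num_devices,
				     uint64_t dev_items_found)
{
	assert(super_num_devices != NULL);
	if (*super_num_devices == dev_items_found)
		return false;

	/* Warn instead of failing the mount, then fix the cached value. */
	fprintf(stderr,
		"warning: super_num_devices %" PRIu64
		" mismatch with %" PRIu64 " device items found, resetting\n",
		*super_num_devices, dev_items_found);
	*super_num_devices = dev_items_found;
	return true;
}
```

In this sketch the corrected value only lives in memory; as the message says, it reaches disk with the current or next transaction commit.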
    • btrfs: do not pass compressed_bio to submit_compressed_bio() · 46fbd18e
      Goldwyn Rodrigues authored
      Parameter struct compressed_bio is not used by the function
      submit_compressed_bio(). Remove it.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid double search for block group during NOCOW writes · 2306e83e
      Filipe Manana authored
      When doing a NOCOW write, either through direct IO or buffered IO, we do
      two lookups for the block group that contains the target extent: once
      when we call btrfs_inc_nocow_writers() and then later again when we call
      btrfs_dec_nocow_writers() after creating the ordered extent.
      
      The lookups require taking a lock and navigating the red black tree used
      to track all block groups, which can take a non-negligible amount of time
      for a large filesystem with thousands of block groups, as well as lock
      contention and cache line bouncing.
      
      Improve on this by having a single block group search: making
      btrfs_inc_nocow_writers() return the block group to its caller and then
      have the caller pass that block group to btrfs_dec_nocow_writers().
      
      This is part of a patchset comprised of the following patches:
      
        btrfs: remove search start argument from first_logical_byte()
        btrfs: use rbtree with leftmost node cached for tracking lowest block group
        btrfs: use a read/write lock for protecting the block groups tree
        btrfs: return block group directly at btrfs_next_block_group()
        btrfs: avoid double search for block group during NOCOW writes
      
      The following test was used to test these changes from a performance
      perspective:
      
         $ cat test.sh
         #!/bin/bash
      
         modprobe null_blk nr_devices=0
      
         NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
         mkdir $NULL_DEV_PATH
         if [ $? -ne 0 ]; then
             echo "Failed to create nullb0 directory."
             exit 1
         fi
         echo 2 > $NULL_DEV_PATH/submit_queues
         echo 16384 > $NULL_DEV_PATH/size # 16G
         echo 1 > $NULL_DEV_PATH/memory_backed
         echo 1 > $NULL_DEV_PATH/power
      
         DEV=/dev/nullb0
         MNT=/mnt/nullb0
         LOOP_MNT="$MNT/loop"
         MOUNT_OPTIONS="-o ssd -o nodatacow"
         MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
         cat <<EOF > /tmp/fio-job.ini
         [io_uring_writes]
         rw=randwrite
         fsync=0
         fallocate=posix
         group_reporting=1
         direct=1
         ioengine=io_uring
         iodepth=64
         bs=64k
         filesize=1g
         runtime=300
         time_based
         directory=$LOOP_MNT
         numjobs=8
         thread
         EOF
      
         echo performance | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         echo
         echo "Using config:"
         echo
         cat /tmp/fio-job.ini
         echo
      
         umount $MNT &> /dev/null
         mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
         mount $MOUNT_OPTIONS $DEV $MNT
      
         mkdir $LOOP_MNT
      
         truncate -s 4T $MNT/loopfile
         mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
         mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT
      
         # Trigger the allocation of about 3500 data block groups, without
         # actually consuming space on underlying filesystem, just to make
         # the tree of block group large.
         fallocate -l 3500G $LOOP_MNT/filler
      
         fio /tmp/fio-job.ini
      
         umount $LOOP_MNT
         umount $MNT
      
         echo 0 > $NULL_DEV_PATH/power
         rmdir $NULL_DEV_PATH
      
      The test was run on a non-debug kernel (Debian's default kernel config);
      the results were the following.
      
      Before patchset:
      
        WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec
      
      After patchset:
      
        WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec
      
        +3.3% write throughput and +3.3% IO done in the same time period.
      
      The test has somewhat limited coverage scope, as with only NOCOW writes
      we get less contention on the red black tree of block groups, since we
      don't have the extra contention caused by COW writes, namely when
      allocating data extents, pinning and unpinning data extents, but on the
      other hand there's access to the tree in the NOCOW path, when
      incrementing a block group's number of NOCOW writers.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
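The interface change described in the commit above, where the increment helper returns the block group so the decrement needs no second search, can be sketched as follows. Everything here is a simplified assumption: `struct bg_sketch`, the linear scan standing in for the locked red black tree search, and the `lookup_count` counter exist only to make the saved lookup visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct bg_sketch {
	uint64_t start;
	uint64_t length;
	bool ro;
	unsigned int nocow_writers;
};

static unsigned int lookup_count;	/* counts searches, for illustration */

/* Stand-in for the locked red black tree search (linear scan here). */
static struct bg_sketch *lookup_block_group(struct bg_sketch *bgs, size_t n,
					    uint64_t bytenr)
{
	lookup_count++;
	for (size_t i = 0; i < n; i++) {
		if (bytenr >= bgs[i].start &&
		    bytenr - bgs[i].start < bgs[i].length)
			return &bgs[i];
	}
	return NULL;
}

/* One search: return the block group so the caller can hand it back. */
static struct bg_sketch *inc_nocow_writers(struct bg_sketch *bgs, size_t n,
					   uint64_t bytenr)
{
	struct bg_sketch *bg = lookup_block_group(bgs, n, bytenr);

	if (bg == NULL || bg->ro)
		return NULL;
	bg->nocow_writers++;
	return bg;
}

/* No search needed: the caller already holds the block group. */
static void dec_nocow_writers(struct bg_sketch *bg)
{
	assert(bg != NULL && bg->nocow_writers > 0);
	bg->nocow_writers--;
}
```

A caller keeps the returned pointer across the write and passes it straight to the decrement, so the whole increment and decrement pair costs one search instead of two.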
    • btrfs: return block group directly at btrfs_next_block_group() · 8b01f931
      Filipe Manana authored
      At btrfs_next_block_group(), we have this long line with two statements:

        cache = btrfs_lookup_first_block_group(...); return cache;

      This makes it a bit harder to read due to two statements on the same
      line, so change that to directly return the result of the call to
      btrfs_lookup_first_block_group().
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use a read/write lock for protecting the block groups tree · 16b0c258
      Filipe Manana authored
      Currently we use a spin lock to protect the red black tree that we use to
      track block groups. Most accesses to that tree are actually read only and
      for large filesystems, with thousands of block groups, it actually has
      a bad impact on performance, as concurrent read only searches on the tree
      are serialized.
      
      Read only searches on the tree are very frequent and done when:
      
      1) Pinning and unpinning extents, as we need to lookup the respective
         block group from the tree;
      
      2) Freeing the last reference of a tree block, regardless if we pin the
         underlying extent or add it back to free space cache/tree;
      
      3) During NOCOW writes, both buffered IO and direct IO, we need to check
         if the block group that contains an extent is read only or not and to
         increment the number of NOCOW writers in the block group. For those
         operations we need to search for the block group in the tree.
         Similarly, after creating the ordered extent for the NOCOW write, we
         need to decrement the number of NOCOW writers from the same block
         group, which requires searching for it in the tree;
      
      4) Decreasing the number of extent reservations in a block group;
      
      5) When allocating extents and freeing reserved extents;
      
      6) Adding and removing free space to the free space tree;
      
      7) When releasing delalloc bytes during ordered extent completion;
      
      8) When relocating a block group;
      
      9) During fitrim, to iterate over the block groups;
      
      10) etc;
      
      Write accesses to the tree, to add or remove block groups, are much less
      frequent as they happen only when allocating a new block group or when
      deleting a block group.
      
      We also use the same spin lock to protect the list of currently caching
      block groups. Additions to this list are made when we need to cache a
      block group, because we don't have a free space cache for it (or we have
      but it's invalid), and removals from this list are done when caching of
      the block group's free space finishes. These cases are also not very
      common, but when they happen, they happen only once when the filesystem
      is mounted.
      
      So switch the lock that protects the tree of block groups from a spinning
      lock to a read/write lock.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use rbtree with leftmost node cached for tracking lowest block group · 08dddb29
      Filipe Manana authored
      We keep track of the start offset of the block group with the lowest start
      offset at fs_info->first_logical_byte. This requires explicitly updating
      that field every time we add, delete or lookup a block group to/from the
      red black tree at fs_info->block_group_cache_tree.
      
      Since the block group with the lowest start address happens to always be
      the one that is the leftmost node of the tree, we can use a red black tree
      that caches the leftmost node. Then when we need the start address of
      that block group, we can just quickly get the leftmost node in the tree
      and extract the start offset of that node's block group. This avoids the
      need to explicitly keep track of that address in the dedicated member
      fs_info->first_logical_byte, and it also allows the next patch in the
      series to switch the lock that protects the red black tree from a spin
      lock to a read/write lock - without this change it would be tricky
      because block group searches also update fs_info->first_logical_byte.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      08dddb29
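      The idea of caching the leftmost node can be sketched in plain C as
      follows. This is a minimal illustration, not the kernel code: it uses an
      unbalanced binary search tree instead of a red black tree, and the names
      bg_node, bg_tree, bg_tree_insert() and bg_tree_first_start() are
      hypothetical. Returning 0 for an empty tree is a choice made for this
      sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a block group; only the start offset matters. */
struct bg_node {
    uint64_t start;
    struct bg_node *left;
    struct bg_node *right;
};

/* Tree root plus a cached pointer to the node with the lowest start offset. */
struct bg_tree {
    struct bg_node *root;
    struct bg_node *leftmost;
};

/* Insert a node; the cache changes only if the descent never went right. */
static void bg_tree_insert(struct bg_tree *tree, struct bg_node *node)
{
    struct bg_node **link;
    int is_leftmost = 1;

    assert(tree != NULL && node != NULL);
    link = &tree->root;
    node->left = NULL;
    node->right = NULL;
    while (*link) {
        if (node->start < (*link)->start) {
            link = &(*link)->left;
        } else {
            link = &(*link)->right;
            is_leftmost = 0;
        }
    }
    *link = node;
    if (is_leftmost)
        tree->leftmost = node;
}

/* O(1) lookup that replaces a separately maintained "lowest start" field. */
static uint64_t bg_tree_first_start(const struct bg_tree *tree)
{
    return tree->leftmost ? tree->leftmost->start : 0;
}
```

      Because the lookup only reads the cached pointer, searches no longer
      modify shared state, which is what makes a later move to a read/write
      lock straightforward.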
    • F
      btrfs: remove search start argument from first_logical_byte() · 0eb997bf
      Filipe Manana authored
      The search start argument passed to first_logical_byte() is always 0, as
      we always want to get the logical start address of the block group with
      the lowest logical start address. So remove it: not only is it
      unnecessary, its removal also simplifies the following patches that
      change the lock protecting the red black tree of block groups from a spin
      lock to a read/write lock.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0eb997bf
    • Q
      btrfs: return correct error number for __extent_writepage_io() · 44e5801f
      Qu Wenruo authored
      [BUG]
      If we hit an error from submit_extent_page() inside
      __extent_writepage_io(), we could still return 0 to the caller, and
      even trigger the warning in btrfs_page_assert_not_dirty().
      
      [CAUSE]
      In __extent_writepage_io(), if we hit an error from
      submit_extent_page(), we will just clean up the range and continue.
      
      This is completely fine for the regular PAGE_SIZE == sectorsize case, as
      a page then holds only one sector, so after the error we are guaranteed
      to exit and @ret will be preserved.
      
      But in the subpage case, we may have other dirty subpage ranges in the
      page, and in the next loop iteration we may succeed in submitting the
      next range.

      In that case, @ret will be overwritten, and we return 0 to the caller
      even though we have hit an error.
      
      [FIX]
      Introduce @has_error and @saved_ret to record the first error we hit, so
      we will never forget what error we hit.
      
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      44e5801f
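      The first-error pattern described in [FIX] can be sketched as below.
      This is a simplified illustration, not the actual
      __extent_writepage_io() code: write_page_ranges(), submit_fn and
      sample_submit() are hypothetical names, and the error values are
      arbitrary negative numbers chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical submit callback: 0 on success, negative value on error. */
typedef int (*submit_fn)(size_t range_index);

/*
 * Submit every dirty range of a page and return the first error seen,
 * even when later ranges are submitted successfully.
 */
static int write_page_ranges(size_t nr_ranges, submit_fn submit)
{
    int has_error = 0;
    int saved_ret = 0;
    size_t i;

    assert(submit != NULL);
    for (i = 0; i < nr_ranges; i++) {
        int ret = submit(i);

        if (ret < 0 && !has_error) {
            has_error = 1;
            saved_ret = ret;
        }
        /* Keep looping: the remaining ranges still have to be handled. */
    }
    return has_error ? saved_ret : 0;
}

/* Example callback: range 1 fails with -5, range 3 fails with -12. */
static int sample_submit(size_t range_index)
{
    if (range_index == 1)
        return -5;
    if (range_index == 3)
        return -12;
    return 0;
}
```

      With four ranges, write_page_ranges(4, sample_submit) reports -5, the
      first failure, even though range 2 succeeds afterwards and range 3 fails
      with a different value.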
    • Q
      btrfs: fix the error handling for submit_extent_page() for btrfs_do_readpage() · 10f7f6f8
      Qu Wenruo authored
      [BUG]
      Test case generic/475 has a very high chance (almost 100%) of hitting a
      fs hang, where a data page is never unlocked and hangs all later
      operations.
      
      [CAUSE]
      In btrfs_do_readpage(), if we hit an error from submit_extent_page() we
      will try to do the cleanup for our current io range, and exit.
      
      This works fine for PAGE_SIZE == sectorsize cases, but not for subpage.
      
      For subpage, btrfs_do_readpage() will lock the full page first, which can
      contain several different sectors and extents:
      
       btrfs_do_readpage()
       |- begin_page_read()
       |  |- btrfs_subpage_start_reader();
       |     Now the page will have PAGE_SIZE / sectorsize readers pending,
       |     and the page is locked.
       |
       |- end_page_read() for different branches
       |  This function will reduce subpage readers, and when readers
       |  reach 0, it will unlock the page.
      
      But when submit_extent_page() fails, we only clean up the current
      io range, while the remaining io ranges are never cleaned up, and the
      page remains locked forever.
      
      [FIX]
      Update the error handling of submit_extent_page() to clean up all the
      remaining subpage ranges before exiting the loop.

      Please note that submit_extent_page() can now only fail due to the
      sanity check in alloc_new_bio().

      Thus regular IO errors cannot trigger this error path.
      
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      10f7f6f8
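      The reader accounting and the fixed error path can be modelled roughly
      as follows. This is a toy model under stated assumptions, not the btrfs
      implementation: page_model, begin_read(), end_read() and read_page() are
      hypothetical names, and a successful submission is completed
      immediately instead of through an asynchronous endio.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a locked page split into sectors. */
struct page_model {
    unsigned int readers;   /* sectors that still have a reader pending */
    bool locked;
};

/* Lock the page and account one pending reader per sector. */
static void begin_read(struct page_model *page, unsigned int nr_sectors)
{
    page->readers = nr_sectors;
    page->locked = true;
}

/* Finish nr sectors; the page is unlocked once no reader is left. */
static void end_read(struct page_model *page, unsigned int nr)
{
    assert(nr <= page->readers);
    page->readers -= nr;
    if (page->readers == 0)
        page->locked = false;
}

/*
 * Read a page sector by sector. fail_at is the sector whose submission
 * fails (pass nr_sectors for "no failure"). On failure, release the failing
 * sector and every sector after it, then stop.
 */
static int read_page(struct page_model *page, unsigned int nr_sectors,
                     unsigned int fail_at)
{
    unsigned int cur;

    assert(nr_sectors > 0);
    begin_read(page, nr_sectors);
    for (cur = 0; cur < nr_sectors; cur++) {
        if (cur == fail_at) {
            /* Clean up the whole remaining range, not just this sector. */
            end_read(page, nr_sectors - cur);
            return -1;
        }
        end_read(page, 1);  /* stands in for the endio path on success */
    }
    return 0;
}
```

      If the error branch released only one sector, the reader count would
      stay above zero and the modelled page would remain locked, which is the
      hang described in [BUG].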
    • Q
      btrfs: avoid double clean up when submit_one_bio() failed · c9583ada
      Qu Wenruo authored
      [BUG]
      When running generic/475 with 64K page size and 4K sector size, there is
      a very high chance (almost 100%) of a hang, mostly with a data page
      locked and no one left to unlock it.
      
      [CAUSE]
      With commit 1784b7d5 ("btrfs: handle csum lookup errors properly on
      reads"), if we fail to look up the checksum due to a metadata IO error,
      btrfs_submit_data_bio() returns an error.
      
      This will cause the page to be unlocked twice in btrfs_do_readpage():
      
       btrfs_do_readpage()
       |- submit_extent_page()
       |  |- submit_one_bio()
       |     |- btrfs_submit_data_bio()
       |        |- if (ret) {
       |        |-     bio->bi_status = ret;
       |        |-     bio_endio(bio); }
       |               In the endio function, we will call end_page_read()
       |               and unlock_extent() to clean up the subpage range.
       |
       |- if (ret) {
       |-        unlock_extent(); end_page_read() }
                 Here we unlock the extent and clean up the subpage range
                 again.
      
      unlock_extent() is mostly safe against a double unlock.

      But end_page_read() is not, especially in the subpage case, where we
      call btrfs_subpage_end_reader() to reduce the reader count and use that
      count to determine whether we need to unlock the full page.

      If accounted twice, the count can underflow and leave the page locked
      without anyone to unlock it.
      
      [FIX]
      Commit 1784b7d5 ("btrfs: handle csum lookup errors properly on
      reads") itself is completely fine; it is our existing code that does not
      properly handle the error from the bio submission hook.

      This patch makes submit_one_bio() return void, so that callers can never
      do their own cleanup when the bio submission hook fails.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c9583ada
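      The ownership rule behind this fix, where cleanup runs only in the
      completion handler and the submit function returns nothing, can be
      sketched as below. This is a toy model, not the btrfs code: read_ctx,
      bio_end() and submit_one() are hypothetical names, and a successful
      submission is completed synchronously to keep the example
      self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical read context: one pending reader per submitted range. */
struct read_ctx {
    int readers;    /* pending readers; the page would unlock at 0 */
    int cleanups;   /* number of cleanup runs, kept for illustration */
};

/* Completion handler: the only place that drops a reader. */
static void bio_end(struct read_ctx *ctx, int status)
{
    (void)status;   /* a real handler would record the error */
    assert(ctx->readers > 0);
    ctx->readers--;
    ctx->cleanups++;
}

/*
 * Submit one bio. hook_ret simulates the result of the submission hook.
 * On hook failure the bio is completed with the error right away; on
 * success the completion is simulated here. Either way cleanup runs exactly
 * once, and the void return type gives callers nothing to clean up after.
 */
static void submit_one(struct read_ctx *ctx, int hook_ret)
{
    assert(ctx != NULL);
    if (hook_ret) {
        bio_end(ctx, hook_ret);
        return;
    }
    bio_end(ctx, 0);
}
```

      Had submit_one() returned hook_ret, a caller reacting to the error with
      its own cleanup would drop the same reader a second time, which is the
      double accounting described in [CAUSE].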
    • S
      btrfs: use non-bh spin_lock in zstd timer callback · dd7382a2
      schspa authored
      This is an optimization of the fix in fee13fe9 ("btrfs: correct zstd
      workspace manager lock to use spin_lock_bh()").

      The critical section protected by wsm.lock is only entered from process
      context and from softirq context.

      Because a critical section running in softirq context cannot be
      preempted by another softirq, there is no need to call
      spin_lock_bh(&wsm.lock) there to disable softirqs;
      spin_lock(&wsm.lock) is enough in this situation.
      Signed-off-by: Schspa Shi <schspa@gmail.com>
      [ minor comment update ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      dd7382a2
    • F
      btrfs: use BTRFS_DIR_START_INDEX at btrfs_create_new_inode() · 49024388
      Filipe Manana authored
      We are still using the magic value of 2 at btrfs_create_new_inode(), but
      there's now a constant for that, named BTRFS_DIR_START_INDEX, which was
      introduced in commit 528ee697 ("btrfs: put initial index value of a
      directory in a constant"). So change that to use the constant.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      49024388