1. 06 Dec, 2022 40 commits
    •
      btrfs: split the bio submission path into a separate file · 103c1972
      Authored by Christoph Hellwig
      The code used by btrfs_submit_bio only interacts with the rest of
      volumes.c through __btrfs_map_block (which itself is a more generic
      version of two exported helpers) and does not really have anything
      to do with volumes.c.  Create a new bio.c file and a bio.h header
      going along with it for the btrfs_bio-based storage layer, which
      will grow even more going forward.
      
      Also update the file with my copyright notice given that a large
      part of the moved code was written or rewritten by me.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: move struct btrfs_tree_parent_check out of disk-io.h · 27137fac
      Authored by Christoph Hellwig
      Move struct btrfs_tree_parent_check out of disk-io.h so that volumes.h
      and various .c files don't have to include disk-io.h just for it.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ use tree-checker.h for the structure ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: raid56: do data csum verification during RMW cycle · 7a315072
      Authored by Qu Wenruo
      [BUG]
      For the following small script, btrfs will be unable to recover the
      content of file1:
      
        mkfs.btrfs -f -m raid1 -d raid5 -b 1G $dev1 $dev2 $dev3
      
        mount $dev1 $mnt
        xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/file1
        md5sum $mnt/file1
        umount $mnt
      
        # Corrupt the above 64K data stripe.
        xfs_io -f -c "pwrite -S 0x00 323026944 64K" -c sync $dev3
        mount $dev1 $mnt
      
        # Write a new 64K, which should be in the other data stripe
        # And this is a sub-stripe write, which will cause RMW
        xfs_io -f -c "pwrite 0 64k" -c sync $mnt/file2
        md5sum $mnt/file1
        umount $mnt
      
      The second md5sum above would fail.
      
      [CAUSE]
      There is a long-standing problem for raid56 (not limited to btrfs
      raid56): if we already have some corrupted on-disk data and then
      trigger a sub-stripe write (which needs an RMW cycle), it can cause
      further damage to the P/Q stripes.
      
        Disk 1: data 1 |0x000000000000| <- Corrupted
        Disk 2: data 2 |0x000000000000|
        Disk 3: parity |0xffffffffffff|
      
      In the above case, data 1 is already corrupted; the original data
      should be 64KiB of 0xff.
      
      At this stage, if we read data 1 and it has a data checksum, we can
      still recover it via the regular RAID56 recovery path.
      
      But if we now decide to write some data into data 2, we need to go
      through RMW.
      Let's say we want to write 64KiB of '0x00' into data 2. We then read
      the on-disk data of data 1 and calculate the new parity, resulting in
      the following layout:
      
        Disk 1: data 1 |0x000000000000| <- Corrupted
        Disk 2: data 2 |0x000000000000| <- New '0x00' writes
        Disk 3: parity |0x000000000000| <- New Parity.
      
      But the new parity is calculated using the *corrupted* data 1, so we
      can no longer recover the correct data of data 1.  Thus the corruption is
      forever there.
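
      The two layouts above can be replayed with plain XOR parity. This is
      only a toy model under stated assumptions: 8-byte stripes instead of
      64KiB, and a hypothetical xor() helper that is not part of btrfs.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized stripes (RAID5 parity)."""
    return bytes(x ^ y for x, y in zip(a, b))


STRIPE_LEN = 8  # toy stripe length; the commit message uses 64KiB

good_data1 = b"\xff" * STRIPE_LEN        # what data 1 should contain
data2 = b"\x00" * STRIPE_LEN
parity = xor(good_data1, data2)          # 0xff..., as in the first layout

# Silent corruption of data 1 on disk.
disk_data1 = b"\x00" * STRIPE_LEN

# Before any RMW, data 1 can still be rebuilt from data 2 and parity.
rebuilt_before = xor(data2, parity)

# Sub-stripe write to data 2: RMW reads the corrupted data 1 and
# recomputes the parity from it.
new_data2 = b"\x00" * STRIPE_LEN
new_parity = xor(disk_data1, new_data2)  # 0x00..., as in the second layout

# After RMW, rebuilding data 1 only reproduces the corrupted content.
rebuilt_after = xor(new_data2, new_parity)
```

      In this model rebuilt_before equals the original data, while
      rebuilt_after equals the corrupted on-disk content.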
      
      [FIX]
      To solve the above problem, this patch will do a full stripe data
      checksum verification at RMW time.
      
      This involves the following changes:
      
      - Always read the full stripe (including data/P/Q) when doing RMW
        Before, we only read the missing data sectors, but since we may do a
        data csum verification and recovery, we need to read everything out.
      
        Please note that, if we have a cached rbio, we don't need to read
        anything, and can treat it the same as a full stripe write,
        as only a stripe whose csums all match can be cached.
      
      - Verify the data csum during read.
        Only the rbio stripe sectors are verified, and only if the rbio
        already has csum_buf/csum_bitmap filled.
      
        And sectors which cannot pass csum verification will have their bit
        set in error_bitmap.
      
      - Always call recover_sectors() after we read out all the sectors
        Since error_bitmap will be updated during read, recover_sectors()
        can easily find out all the bad sectors and try to recover (if still
        under tolerance).
      
        And since recover_sectors() is already migrated to use error_bitmap,
        it can skip vertical stripes which don't have any error.
      
      - Verify the repaired sectors against their csum in recover_vertical()
      
      - Rename rmw_read_and_wait() to rmw_read_wait_recover()
        Since we will always recover the sectors, the old name is no longer
        accurate.
      
        Furthermore since recovery is already done in rmw_read_wait_recover(),
        we no longer need to call recover_sectors() inside rmw_rbio().
      
      Obviously this will have a performance impact, as we are doing more
      work during RMW cycle:
      
      - Fetch the data checksums
      - Do checksum verification for all data stripes
      - Do checksum verification again after repair
      
      But for full stripe write or cached rbio we won't have the overhead at
      all, thus for a fully optimized RAID56 workload (always full stripe
      write), there should be no extra overhead.
      
      To me, the extra overhead looks reasonable, as data consistency is way
      more important than performance.
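
      The verify-then-recover idea described above can be sketched as
      follows. This is not the kernel code: rmw_write(), its arguments, and
      the use of zlib.crc32 as a stand-in checksum are all assumptions made
      for illustration, and only single-parity (RAID5-like) recovery is
      modelled.

```python
import zlib


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized stripes."""
    return bytes(x ^ y for x, y in zip(a, b))


def csum(stripe: bytes) -> int:
    # zlib.crc32 stands in for the filesystem's data checksum.
    return zlib.crc32(stripe)


def rmw_write(stripes, parity, csums, target, new_data):
    """Write new_data into stripes[target] after verifying the other stripes.

    stripes: data stripes as read from disk
    csums:   expected checksum per data stripe, or None if there is none
    Returns (data stripes, parity) to write back.
    """
    stripes = list(stripes)
    bad = [i for i, s in enumerate(stripes)
           if i != target and csums[i] is not None and csum(s) != csums[i]]
    if len(bad) > 1:
        raise IOError("more corrupted stripes than one parity can repair")
    if bad:
        i = bad[0]
        # Rebuild from the old parity and every other on-disk data stripe.
        rebuilt = parity
        for j, s in enumerate(stripes):
            if j != i:
                rebuilt = xor(rebuilt, s)
        # Verify the repaired stripe against its checksum again.
        if csum(rebuilt) != csums[i]:
            raise IOError("repaired stripe still fails its checksum")
        stripes[i] = rebuilt
    stripes[target] = new_data
    new_parity = bytes(len(new_data))
    for s in stripes:
        new_parity = xor(new_parity, s)
    return stripes, new_parity
```

      With the corrupted layout from the [CAUSE] section as input, this
      sketch repairs data 1 before computing the new parity, so the parity
      written back matches the good data rather than the corrupted copy.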
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: raid56: prepare data checksums for later RMW verification · c5a41562
      Authored by Qu Wenruo
      This is for later data checksum verification at RMW time.
      
      This patch will try to allocate the needed memory for a locked rbio if
      the rbio is for data exclusively (we don't want to handle mixed bg yet).
      The memory will be released when the rbio is finished.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: introduce a bitmap based csum range search function · 97e38239
      Authored by Qu Wenruo
      Although we have an existing function, btrfs_lookup_csums_range(), to
      find all data checksums for a range, it's based on a btrfs_ordered_sum
      list.
      
      For the upcoming RAID56 data checksum verification at RMW time, we don't
      want to waste time by allocating temporary memory.
      
      So this patch will introduce a new helper, btrfs_lookup_csums_bitmap().
      It will use a bitmap-based result, which will be a perfect fit for later
      RAID56 usage.
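
      A bitmap-based result can be pictured like this. The sketch below is
      not the kernel helper: lookup_csums_bitmap(), the dict used as a
      stand-in for the csum tree, the exclusive end offset, and the default
      sector and checksum sizes are all assumptions for illustration.

```python
def lookup_csums_bitmap(csum_items, start, end, csum_buf,
                        sectorsize=4096, csum_size=4):
    """Copy checksums for [start, end) into csum_buf and return a bitmap.

    csum_items: dict mapping a sector-aligned byte offset to its checksum
                bytes (a stand-in for the on-disk csum tree)
    csum_buf:   caller-provided bytearray, one csum_size slot per sector
    Bit i of the returned integer is set when sector i has a checksum.
    """
    bitmap = 0
    nr_sectors = (end - start) // sectorsize
    for i in range(nr_sectors):
        item = csum_items.get(start + i * sectorsize)
        if item is None:
            continue  # no checksum for this sector: leave its bit clear
        csum_buf[i * csum_size:(i + 1) * csum_size] = item
        bitmap |= 1 << i
    return bitmap
```

      The caller owns both the buffer and the bitmap, so no temporary list
      has to be allocated per lookup, which is the property the commit
      message asks for.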
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: refactor checksum calculations in btrfs_lookup_csums_range() · cb649e81
      Authored by Qu Wenruo
      The refactoring involves the following parts:
      
      - Introduce bytes_to_csum_size() and csum_size_to_bytes() helpers
        We have quite a few open-coded calculations, some of which are even
        split into two assignments just to fit the 80 character limit.
      
      - Remove the @csum_size parameter from max_ordered_sum_bytes()
        Csum size can be fetched from @fs_info.
        And we will use the csum_size_to_bytes() helper anyway.
      
      - Add a comment explaining how we handle the first search result
      
      - Use newly introduced helpers to cleanup btrfs_lookup_csums_range()
      
      - Move variable declarations to the minimal scope
      
      - Never mix number of sectors with bytes
        There are several locations doing things like:
      
          size = min_t(size_t, csum_end - start,
                       max_ordered_sum_bytes(fs_info));
          ...
          size >>= fs_info->sectorsize_bits;
      
        Or
      
          offset = (start - key.offset) >> fs_info->sectorsize_bits;
          offset *= csum_size;
      
        Make sure these variables can only represent BYTES inside the
        function, by using the above bytes_to_csum_size() helpers.
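
      The two conversions named above amount to the following arithmetic.
      This is a sketch, not the kernel helpers: the constants assume 4KiB
      sectors and a 4-byte checksum, whereas the real helpers read both
      values from @fs_info.

```python
SECTORSIZE_BITS = 12  # assumption: 4KiB sectors
CSUM_SIZE = 4         # assumption: 4-byte checksum per sector


def bytes_to_csum_size(nbytes: int) -> int:
    """Bytes of checksum needed to cover nbytes of sector-aligned data."""
    assert nbytes % (1 << SECTORSIZE_BITS) == 0
    return (nbytes >> SECTORSIZE_BITS) * CSUM_SIZE


def csum_size_to_bytes(csum_bytes: int) -> int:
    """Bytes of data covered by csum_bytes worth of checksums."""
    assert csum_bytes % CSUM_SIZE == 0
    return (csum_bytes // CSUM_SIZE) << SECTORSIZE_BITS
```

      Keeping both directions behind named helpers means a variable holds
      either data bytes or checksum bytes, never a sector count in between.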
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: allocate btrfs_io_context without GFP_NOFAIL · 9f0eac07
      Authored by Li zeming
      The __GFP_NOFAIL flag could loop indefinitely when allocating memory in
      alloc_btrfs_io_context. The callers starting from __btrfs_map_block
      already handle errors so it's safe to drop the flag.
      Signed-off-by: Li zeming <zeming@nfschina.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: use btrfs_dev_name() helper to handle missing devices better · cb3e217b
      Authored by Qu Wenruo
      [BUG]
      If dev-replace failed to re-construct its data/metadata, the kernel
      message would be incorrect for the missing device:
      
       BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
       BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault)
      
      Note the "dev (efault)" in the second line, while the first line
      properly reports "<missing disk>".
      
      [CAUSE]
      Although dev-replace is using btrfs_dev_name(), the heavy lifting work
      is still done by scrub (scrub is reused by both dev-replace and regular
      scrub).
      
      Unfortunately the scrub code never uses the btrfs_dev_name() helper, as
      it's only declared locally inside dev-replace.c.
      
      [FIX]
      Fix the output by:
      
      - Move the btrfs_dev_name() helper to volumes.h
      
      - Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls
        Only zoned code is not touched, as I'm not familiar with degraded
        zoned code.
      
      - Constify return value and parameter
      
      Now the output looks pretty sane:
      
       BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
       BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk>
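
      The behaviour the shared helper provides can be modelled in a few
      lines. The Device class and dev_name() below are hypothetical
      stand-ins, not the kernel's struct btrfs_device or btrfs_dev_name().

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    """Hypothetical stand-in for a device record."""
    name: str
    missing: bool = False


def dev_name(device: Optional[Device]) -> str:
    """Return a printable name, even for an absent or missing device."""
    if device is None or device.missing:
        return "<missing disk>"
    return device.name
```

      Routing every message through one helper like this is what keeps the
      scrub and dev-replace output consistent.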
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: use cached state when looking for delalloc ranges with lseek · 3c32c721
      Authored by Filipe Manana
      During lseek (SEEK_HOLE/DATA), whenever we find a hole or prealloc extent,
      we will look for delalloc in that range, and one of the things we do for
      that is to find out ranges in the inode's io_tree marked with
      EXTENT_DELALLOC, using calls to count_range_bits().
      
      Typically there's a single, or few, searches in the io_tree for delalloc
      per lseek call. However it's common for applications to keep calling
      lseek with SEEK_HOLE and SEEK_DATA to find where extents and holes are in
      a file, read the extents and skip holes in order to avoid unnecessary IO
      and save disk space by preserving holes.
      
      One popular user is the cp utility from coreutils. Starting with coreutils
      9.0, cp uses SEEK_HOLE and SEEK_DATA to iterate over the extents of a
      file. Before 9.0, it used fiemap to figure out where holes and extents are
      in the source file. Another popular user is the tar utility when used with
      the --sparse / -S option to detect and preserve holes.
      
      Given that the pattern is to keep calling lseek with a start offset that
      matches the returned offset from the previous lseek call, we can benefit
      from caching the last extent state visited in count_range_bits() and use
      it for the next count_range_bits() from the next lseek call. For example,
      the following strace excerpt from running tar:
      
         $ strace tar cJSvf foo.tar.xz qemu_disk_file.raw
         (...)
         lseek(5, 125019574272, SEEK_HOLE)       = 125024989184
         lseek(5, 125024989184, SEEK_DATA)       = 125024993280
         lseek(5, 125024993280, SEEK_HOLE)       = 125025239040
         lseek(5, 125025239040, SEEK_DATA)       = 125025255424
         lseek(5, 125025255424, SEEK_HOLE)       = 125025353728
         lseek(5, 125025353728, SEEK_DATA)       = 125025357824
         lseek(5, 125025357824, SEEK_HOLE)       = 125026766848
         lseek(5, 125026766848, SEEK_DATA)       = 125026770944
         lseek(5, 125026770944, SEEK_HOLE)       = 125027053568
         (...)
      
      Shows that pattern, which is the same as with cp from coreutils 9.0+.
      
      So start using a cached state for the delalloc searches in lseek, and
      store it in struct file's private data so that it can be reused across
      lseek calls.
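
      The userspace access pattern described above, where each lseek starts
      at the offset returned by the previous one, looks roughly like the
      sketch below. data_extents() is a hypothetical helper written for this
      illustration; it assumes a platform where os.SEEK_DATA and
      os.SEEK_HOLE are available (for example Linux).

```python
import errno
import os


def data_extents(path):
    """Return [(start, end), ...] byte ranges that lseek reports as data."""
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                data = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:
                    break  # only a hole remains up to end of file
                raise
            # The next call starts where the previous one ended.
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            extents.append((data, hole))
            offset = hole
    finally:
        os.close(fd)
    return extents
```

      Every iteration issues one SEEK_DATA and one SEEK_HOLE call, which is
      why a very sparse file turns into a long sequence of closely spaced
      lseek calls like the strace excerpt above.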
      
      This change is part of a patchset that is comprised of the following
      patches:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      
      The following test was run before and after applying the whole patchset:
      
         $ cat test-cp.sh
         #!/bin/bash
      
         DEV=/dev/sdh
         MNT=/mnt/sdh
      
         # coreutils 8.32, cp uses fiemap to detect holes and extents
         #CP_PROG=/usr/bin/cp
         # coreutils 9.1, cp uses SEEK_HOLE/DATA to detect holes and extents
         CP_PROG=/home/fdmanana/git/hub/coreutils/src/cp
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         FILE_SIZE=$((1024 * 1024 * 1024))
         echo "Creating file with a size of $((FILE_SIZE / 1024 / 1024))M"
         # Create a very sparse file, where each extent has a length of 4K and
         # is preceded by a 4K hole and followed by another 4K hole.
         start=$(date +%s%N)
         echo -n > $MNT/foobar
         for ((off = 0; off < $FILE_SIZE; off += 8192)); do
                 xfs_io -c "pwrite -S 0xab $off 4K" $MNT/foobar > /dev/null
                 echo -ne "\r$off / $FILE_SIZE ..."
         done
         end=$(date +%s%N)
         echo -e "\nFile created ($(( (end - start) / 1000000 )) milliseconds)"
      
         start=$(date +%s%N)
         $CP_PROG $MNT/foobar /dev/null
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "cp took $dur milliseconds with data/metadata cached and delalloc"
      
         # Flush all delalloc.
         sync
      
         start=$(date +%s%N)
         $CP_PROG $MNT/foobar /dev/null
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "cp took $dur milliseconds with data/metadata cached and no delalloc"
      
         # Unmount and mount again to test the case without any metadata
         # loaded in memory.
         umount $MNT
         mount $DEV $MNT
      
         start=$(date +%s%N)
         $CP_PROG $MNT/foobar /dev/null
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "cp took $dur milliseconds without data/metadata cached and no delalloc"
      
         umount $MNT
      
      The results, running on a box with a non-debug kernel (Debian's default
      kernel config), were the following:
      
      128M file, before patchset:
      
         cp took 16574 milliseconds with data/metadata cached and delalloc
         cp took 122 milliseconds with data/metadata cached and no delalloc
         cp took 20144 milliseconds without data/metadata cached and no delalloc
      
      128M file, after patchset:
      
         cp took 6277 milliseconds with data/metadata cached and delalloc
         cp took 109 milliseconds with data/metadata cached and no delalloc
         cp took 210 milliseconds without data/metadata cached and no delalloc
      
      512M file, before patchset:
      
         cp took 14369 milliseconds with data/metadata cached and delalloc
         cp took 429 milliseconds with data/metadata cached and no delalloc
         cp took 88034 milliseconds without data/metadata cached and no delalloc
      
      512M file, after patchset:
      
         cp took 12106 milliseconds with data/metadata cached and delalloc
         cp took 427 milliseconds with data/metadata cached and no delalloc
         cp took 824 milliseconds without data/metadata cached and no delalloc
      
      1G file, before patchset:
      
         cp took 10074 milliseconds with data/metadata cached and delalloc
         cp took 886 milliseconds with data/metadata cached and no delalloc
         cp took 181261 milliseconds without data/metadata cached and no delalloc
      
      1G file, after patchset:
      
         cp took 3320 milliseconds with data/metadata cached and delalloc
         cp took 880 milliseconds with data/metadata cached and no delalloc
         cp took 1801 milliseconds without data/metadata cached and no delalloc
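
      The before/after ratios can be computed directly from the numbers
      listed above. The dictionary layout and the speedup() helper are just
      a convenience for this sketch; the millisecond values are copied from
      the results.

```python
# (before_ms, after_ms) pairs copied from the results above.
RESULTS = {
    ("128M", "cached, delalloc"): (16574, 6277),
    ("128M", "cached, no delalloc"): (122, 109),
    ("128M", "uncached, no delalloc"): (20144, 210),
    ("512M", "cached, delalloc"): (14369, 12106),
    ("512M", "cached, no delalloc"): (429, 427),
    ("512M", "uncached, no delalloc"): (88034, 824),
    ("1G", "cached, delalloc"): (10074, 3320),
    ("1G", "cached, no delalloc"): (886, 880),
    ("1G", "uncached, no delalloc"): (181261, 1801),
}


def speedup(before_ms: int, after_ms: int) -> float:
    """How many times faster the run is after the patchset."""
    return before_ms / after_ms


SPEEDUPS = {key: speedup(*pair) for key, pair in RESULTS.items()}

if __name__ == "__main__":
    for (size, case), ratio in SPEEDUPS.items():
        print(f"{size:>4}  {case:<22} {ratio:7.1f}x")
```

      For the 1G file without cached metadata, this works out to a run that
      is about 100 times faster after the patchset.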
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: use cached state when looking for delalloc ranges with fiemap · b3e744fe
      Authored by Filipe Manana
      During fiemap, whenever we find a hole or prealloc extent, we will look
      for delalloc in that range, and one of the things we do for that is to
      find out ranges in the inode's io_tree marked with EXTENT_DELALLOC, using
      calls to count_range_bits().
      
      Since we process file extents from left to right, if we have a file with
      several holes or prealloc extents, we benefit from keeping a cached extent
      state record for calls to count_range_bits(). Most of the time the last
      extent state record we visited in one call to count_range_bits() matches
      the first extent state record we will use in the next call to
      count_range_bits(), so there's a benefit here. So use an extent state
      record to cache results from count_range_bits() calls during fiemap.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: update stale comment for count_range_bits() · 1ee51a06
      Authored by Filipe Manana
      The comment for count_range_bits() mentions that the search is fast if we
      are asking for a range with the EXTENT_DIRTY bit set. However that is no
      longer true since we don't use that bit and the optimization for that was
      removed in:
      
        commit 71528e9e ("btrfs: get rid of extent_io_tree::dirty_bytes")
      
      So remove that part of the comment mentioning the no longer existing
      optimized case, and, while at it, add proper documentation describing the
      purpose, arguments and return value of the function.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: allow passing a cached state record to count_range_bits() · 8c6e53a7
      Authored by Filipe Manana
      An inode's io_tree can be quite large and there are cases where due to
      delalloc it can have thousands of extent state records, which makes the
      red black tree have a depth of 10 or more, making the operation of
      count_range_bits() slow if we repeatedly call it for a range that starts
      where, or after, the previous one we called it for. Such use cases are
      when searching for delalloc in a file range that corresponds to a hole or
      a prealloc extent, which is done during lseek SEEK_HOLE/DATA and fiemap.
      
      So introduce a cached state parameter to count_range_bits() which we use
      to store the last extent state record we visited, and then allow the
      caller to pass it again on its next call to count_range_bits(). The next
      patches in the series will make fiemap and lseek use the new parameter.
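
      The effect of the cached state can be modelled with a sorted list of
      ranges. This is a simplified sketch, not the kernel implementation:
      the StateTree class, the use of bisect in place of a red black tree
      search, and the index used as the cached record are all assumptions
      for illustration.

```python
import bisect


class StateTree:
    """Sorted, non-overlapping (start, end) inclusive ranges with a bit set."""

    def __init__(self, records):
        self.records = records
        self.starts = [r[0] for r in records]
        self.searches = 0  # number of full searches performed

    def _search(self, start):
        """Full search: index of the first record ending at or after start."""
        self.searches += 1
        i = bisect.bisect_right(self.starts, start) - 1
        if i >= 0 and self.records[i][1] >= start:
            return i
        return i + 1

    def count_range_bits(self, start, end, cached=None):
        """Count bytes of [start, end] covered by records.

        Returns (total, index of the last record visited), so the caller
        can pass that index back as `cached` on its next call.
        """
        if (cached is not None and cached < len(self.records)
                and self.records[cached][0] <= start):
            # Resume from the cached record and walk forward.
            i = cached
            while i < len(self.records) and self.records[i][1] < start:
                i += 1
        else:
            i = self._search(start)
        total, last = 0, cached
        while i < len(self.records) and self.records[i][0] <= end:
            s, e = self.records[i]
            total += min(e, end) - max(s, start) + 1
            last = i
            i += 1
        return total, last
```

      When successive calls move left to right through the file, as fiemap
      and lseek do, the cached index is usually at or just before the first
      record needed, so the full search is skipped.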
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: remove no longer used btrfs_next_extent_map() · cfd7a17d
      Authored by Filipe Manana
      There are no more users of btrfs_next_extent_map(), the previous patch
      in the series ("btrfs: search for delalloc more efficiently during
      lseek/fiemap") removed the last usage of the function, so delete it.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: search for delalloc more efficiently during lseek/fiemap · 8ddc8274
      Authored by Filipe Manana
      During lseek (SEEK_HOLE/DATA) and fiemap, when processing a file range
      that corresponds to a hole or a prealloc extent, we have to check if
      there's any delalloc in the range. We do it by searching for delalloc
      ranges in the inode's io_tree (for unflushed delalloc) and in the inode's
      extent map tree (for delalloc that is flushing).
      
      We avoid searching the extent map tree if the number of outstanding
      extents is 0, as in that case we can't have extent maps for our search
      range in the tree that correspond to delalloc that is flushing. However
      if we have any unflushed delalloc, due to buffered writes or mmap writes,
      then the outstanding extents counter is not 0 and we'll search the extent
      map tree. The tree may be large because it can have lots of extent maps
      that were loaded by reads or created by previous writes, therefore taking
      a significant time to search the tree, especially if we have a file with a
      lot of holes and/or prealloc extents.
      
      We can improve on this by instead of searching the extent map tree,
      searching the ordered extents tree of the inode, since when delalloc is
      flushing we create an ordered extent along with the new extent map, while
      holding the respective file range locked in the inode's io_tree. The
      ordered extents tree is typically much smaller, since ordered extents have
      a short life and get removed from the tree once they are completed, while
      extent maps can stay for a very long time in the extent map tree, either
      created by previous writes or loaded by read operations.
      
      So use the ordered extents tree instead of the extent maps tree.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: skip unnecessary delalloc searches during lseek/fiemap · af979fd6
      Authored by Filipe Manana
      During lseek (SEEK_HOLE/DATA) and fiemap, when processing a file range
      that corresponds to a hole or a prealloc extent, if we find that there is
      no delalloc marked in the inode's io_tree but there is delalloc due to
      an extent map in the extent map tree, then on the next iteration that calls
      find_delalloc_subrange() we can skip searching the io tree again, since
      on the first call we had no delalloc in the io tree for the whole range.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: add an early exit when searching for delalloc range for lseek/fiemap · 40daf3e0
      Authored by Filipe Manana
      During fiemap and lseek (SEEK_HOLE/DATA), when looking for delalloc in a
      range corresponding to a hole or a prealloc extent, if we find the whole
      range marked as delalloc in the inode's io_tree, then we can terminate
      immediately and avoid searching the extent map tree. If not, and if the
      found delalloc starts at the same offset as our search start but ends
      before our search range's end, then we can adjust the search range for
      the search in the extent map tree. So implement those changes.
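
      The two cases above can be sketched in plain C as follows. This is only
      an illustration under stated assumptions, not the kernel code: the
      struct delalloc_hit type, the search_action enum and the function name
      are hypothetical, and ranges are treated as inclusive.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result of a delalloc lookup in the io_tree (inclusive range). */
struct delalloc_hit {
	bool found;
	uint64_t start;
	uint64_t end;
};

enum search_action {
	SEARCH_DONE,        /* whole range is delalloc, skip the extent map tree */
	SEARCH_EXTENT_MAPS, /* the extent map tree still has to be searched */
};

/*
 * Decide what to do after looking for delalloc in [*start, end].
 * When SEARCH_EXTENT_MAPS is returned, *start may have been advanced past
 * a delalloc prefix so that the follow-up search covers a smaller range.
 */
static enum search_action
after_delalloc_lookup(const struct delalloc_hit *hit, uint64_t *start,
		      uint64_t end)
{
	if (hit->found && hit->start == *start) {
		/* Early exit: delalloc covers the whole search range. */
		if (hit->end >= end)
			return SEARCH_DONE;
		/* Delalloc covers only a prefix: trim the search range. */
		*start = hit->end + 1;
	}
	return SEARCH_EXTENT_MAPS;
}
```

      In this sketch a delalloc hit that does not begin at the search start
      leaves the range untouched, which matches the conditions described in
      the text.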
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      40daf3e0
    • F
      btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree · 2c8f5e8c
      Filipe Manana committed
      We don't need to set the EXTENT_UPTODATE bit in an inode's io_tree to mark a
      range as uptodate, we rely on the pages themselves being uptodate - page
      reading is not triggered for already uptodate pages. Recently we removed
      most uses of EXTENT_UPTODATE for buffered IO with commit 52b029f4
      ("btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path"),
      but there were a few leftovers, namely when reading from holes and
      successfully finishing read repair.
      
      These leftovers are unnecessarily making an inode's tree larger and deeper,
      slowing down searches on it. So remove all the leftovers.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2c8f5e8c
    • Q
      btrfs: move tree block parentness check into validate_extent_buffer() · 947a6299
      Qu Wenruo committed
      [BACKGROUND]
      Although both btrfs metadata and data have their read time verification
      done at endio time (btrfs_validate_metadata_buffer() and
      btrfs_verify_data_csum()), metadata has extra verification, mostly
      parentness check including first key/transid/owner_root/level, done at
      read_tree_block() and btrfs_read_extent_buffer().
      
      On the other hand, all the data verification is done at endio context.
      
      [ENHANCEMENT]
      This patch will make a new union in btrfs_bio, taking the space of the
      old data checksums, thus it will not increase the memory usage.
      
      With that extra btrfs_tree_parent_check inside btrfs_bio, we can just
      pass the check parameter into read_extent_buffer_pages(), and before
      submitting the bio, we can copy the check structure into btrfs_bio.
      
      And finally at endio time, we can grab btrfs_bio::parent_check and pass
      it to validate_extent_buffer(), to move the remaining checks into it.
      
      This brings the following benefits:
      
      - Much simpler btrfs_read_extent_buffer()
        Now it only needs to iterate through all mirrors.
      
      - Simpler read-time transid check
        Previously we called verify_parent_transid() after reading out the
        extent buffer.
        Now the transid check is done inside the endio function, no other
        code can modify the content.
        Thus no need to use the extent lock anymore.
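
      A rough userspace sketch of the idea follows. All type and function
      names here are hypothetical stand-ins, and the 64-byte inline checksum
      buffer is an assumption. It shows the check sharing space with the data
      checksum state through a union, being copied in before submission, and
      being consulted at endio time.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a btrfs key; the layout is an assumption. */
struct sketch_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Expected properties of the tree block about to be read. */
struct sketch_parent_check {
	uint64_t owner_root;
	uint64_t transid;
	struct sketch_key first_key;
	bool has_first_key;
	uint8_t level;
};

/* Hypothetical per-bio data checksum state (64-byte inline buffer assumed). */
struct sketch_data_csum {
	uint8_t *csum;
	uint8_t csum_inline[64];
};

/* A bio is either a data read or a metadata read, never both. */
union sketch_bio_payload {
	struct sketch_data_csum data;
	struct sketch_parent_check parent_check;
};

/* The check fits in the space the data checksum state already occupied. */
_Static_assert(sizeof(struct sketch_parent_check) <=
	       sizeof(struct sketch_data_csum),
	       "parent check must not grow the bio");

struct sketch_bio {
	union sketch_bio_payload u;
};

/* Before submission: copy the caller's check into the bio. */
static void sketch_attach_check(struct sketch_bio *bio,
				const struct sketch_parent_check *check)
{
	bio->u.parent_check = *check;
}

/* At endio: validate what was read against the copy carried by the bio. */
static int sketch_endio_validate(const struct sketch_bio *bio,
				 uint8_t found_level, uint64_t found_transid)
{
	const struct sketch_parent_check *check = &bio->u.parent_check;

	if (found_level != check->level)
		return -1;
	/* In this sketch a transid of 0 means "skip the transid check". */
	if (check->transid && found_transid != check->transid)
		return -1;
	return 0;
}
```

      Because the bio carries its own copy, the caller's structure does not
      need to outlive the submission in this sketch.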
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      947a6299
    • Q
      btrfs: concentrate all tree block parentness check parameters into one structure · 789d6a3a
      Qu Wenruo committed
      There are several different tree block parentness check parameters used
      across several helpers:
      
      - level
        Mandatory
      
      - transid
        In most cases it's mandatory, but there are several backref cases
        which skip this check.
      
      - owner_root
      - first_key
        Utilized by most top-down tree search routines. Otherwise can be
        skipped.
      
      Those four members are not always mandatory checks, and some of them
      share the same u64 type, which means that if some arguments get swapped
      the compiler will not catch it.
      
      Furthermore, if we're going to further expand the parentness check, we
      need to modify quite a few helpers just to add one more parameter.
      
      This patch will concentrate all these members into a structure called
      btrfs_tree_parent_check, and pass that structure to the following
      helpers:
      
      - btrfs_read_extent_buffer()
      - read_tree_block()
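
      As an illustration of why a single structure helps, here is a minimal
      userspace sketch. The demo_* names are hypothetical, and treating a
      zero transid or owner_root as "skip this check" is an assumption of the
      sketch. Named fields cannot be silently swapped the way positional u64
      arguments can.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a btrfs key; the layout is an assumption. */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* All parentness expectations in one place. */
struct demo_parent_check {
	uint64_t owner_root;     /* 0: skip the owner check (sketch assumption) */
	uint64_t transid;        /* 0: skip the transid check (sketch assumption) */
	struct demo_key first_key;
	bool has_first_key;      /* first_key is only compared when this is set */
	uint8_t level;           /* always checked */
};

/* What was actually found in the tree block that was read. */
struct demo_block {
	uint64_t owner;
	uint64_t generation;
	struct demo_key first_key;
	uint8_t level;
};

/* Return 0 when the block matches every requested expectation, else -1. */
static int demo_check_parentness(const struct demo_block *b,
				 const struct demo_parent_check *c)
{
	if (b->level != c->level)
		return -1;
	if (c->transid && b->generation != c->transid)
		return -1;
	if (c->owner_root && b->owner != c->owner_root)
		return -1;
	if (c->has_first_key &&
	    (b->first_key.objectid != c->first_key.objectid ||
	     b->first_key.type != c->first_key.type ||
	     b->first_key.offset != c->first_key.offset))
		return -1;
	return 0;
}
```

      A caller would fill the structure with designated initializers, for
      example { .owner_root = 5, .transid = 100, .level = 0 }, and adding a
      new check later only means adding a field.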
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      789d6a3a
    • A
      btrfs: move device->name RCU allocation and assign to btrfs_alloc_device() · bb21e302
      Anand Jain committed
      There is a repeating code section in the parent function after calling
      btrfs_alloc_device(), as below:
      
            name = rcu_string_strdup(path, GFP_...);
            if (!name) {
                    btrfs_free_device(device);
                    return ERR_PTR(-ENOMEM);
            }
            rcu_assign_pointer(device->name, name);
      
      Except in add_missing_dev() for obvious reasons.
      
      This patch consolidates that repeating code into btrfs_alloc_device()
      itself so that the parent functions don't have to duplicate code.
      This consolidation also helps to review issues regarding RCU lock
      violation with device->name.
      
      Parent functions device_list_add() and add_missing_dev() use GFP_NOFS for
      the allocation, whereas the rest of the parent functions use GFP_KERNEL.
      So bring in the NOFS allocation context using memalloc_nofs_save() in
      device_list_add(); add_missing_dev() is already doing it.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bb21e302
    • D
      btrfs: constify input buffer parameter in compression code · 3e09b5b2
      David Sterba committed
      The input buffers passed down to compression must never be changed.
      Switch the type to u8, as it's a raw byte buffer, and use const.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3e09b5b2
    • Q
      btrfs: raid56: remove the old error tracking system · ad3daf1c
      Qu Wenruo committed
      Since all the recovery paths have been migrated to the new error bitmap
      based system, we can remove the old stripe number based system.
      
      This cleanup involves one behavior change:
      
      - Rebuild rbio can no longer be merged
        Previously a rebuild rbio (caused by a retry after a data csum
        mismatch) could be merged if the error happened in the same stripe.
      
        But with the new error bitmap based solution, it's much harder to
        compare error bitmaps.
      
        So here we just don't merge rebuild rbio at all.
        This may introduce some performance impact at extreme corner cases,
        but we're willing to take it.
      
      Other than that, this patch will clean up the following members:
      
      - rbio::faila
      - rbio::failb
        They will be replaced by per-vertical stripe check, which is more
        accurate.
      
      - rbio::error
        It will be replaced by the per-vertical stripe error bitmap check.
      
      - Allow get_rbio_vertical_errors() to accept NULL pointers for
        @faila and @failb
        Some call sites only want to check if we have errors beyond the
        tolerance.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ad3daf1c
    • Q
      btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap · 75b47033
      Qu Wenruo committed
      Since we have rbio::error_bitmap to indicate exactly where the errors
      are (including read error and csum mismatch error), we can make recovery
      path more accurate.
      
      For example:
      
                   0        32K       64K
           Data 1  |XXXXXXXX|         |
           Data 2  |        |XXXXXXXXX|
           Parity  |        |         |
      
      1) Get csum mismatch when reading data 1 [0, 32K)
      
      2) Mark corresponding range error
         The old code will mark the whole data 1 stripe as error.
         While the new code will only mark data 1 [0, 32K) as error.
      
      3) Recovery path
         The old code will recover data 1 [0, 64K), all using Data 2 and
         parity.
      
         This means, Data 1 [32K, 64K) will be corrupted data, as data 2
         [32K, 64K) is already corrupted.
      
         While the new code will only recover data 1 [0, 32K), as only
         that range has error so far.
      
      This new behavior can avoid populating rbio cache with incorrect data.
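
      The example above can be modelled with a small bitmap sketch. The
      names, the 4K sector size, the 64K stripe length and the stripe
      numbering (0 = data 1, 1 = data 2, 2 = parity) are assumptions made for
      illustration, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SECTOR_SIZE     4096u
#define SKETCH_STRIPE_LEN      65536u
#define SKETCH_STRIPE_NSECTORS (SKETCH_STRIPE_LEN / SKETCH_SECTOR_SIZE)

/* One bit per sector; stripes are laid out back to back in the bitmap. */
static unsigned int sketch_range_bit(unsigned int stripe, unsigned int sector)
{
	return stripe * SKETCH_STRIPE_NSECTORS + sector;
}

/*
 * Flag only the sectors covered by [offset, offset + len) of one stripe.
 * len must be greater than zero and the range must stay inside the stripe.
 */
static void sketch_mark_range(uint64_t *bitmap, unsigned int stripe,
			      uint32_t offset, uint32_t len)
{
	unsigned int first = offset / SKETCH_SECTOR_SIZE;
	unsigned int last = (offset + len - 1) / SKETCH_SECTOR_SIZE;

	for (unsigned int s = first; s <= last; s++)
		*bitmap |= UINT64_C(1) << sketch_range_bit(stripe, s);
}

/* Only sectors flagged here would be rebuilt from the other stripes. */
static bool sketch_sector_has_error(uint64_t bitmap, unsigned int stripe,
				    unsigned int sector)
{
	return (bitmap >> sketch_range_bit(stripe, sector)) & 1;
}
```

      After marking data 1 [0, 32K), sectors 0 to 7 of stripe 0 are flagged
      and sector 8 onwards is not, so a recovery pass driven by this bitmap
      would leave data 1 [32K, 64K) alone.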
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      75b47033
    • Q
      btrfs: raid56: introduce btrfs_raid_bio::error_bitmap · 2942a50d
      Qu Wenruo committed
      Currently btrfs raid56 uses btrfs_raid_bio::faila and failb to indicate
      which stripe(s) had IO errors.
      
      But that has some problems:
      
      - If one sector fails its csum check, the whole stripe containing the
        corruption will be marked as an error.
        This can reduce the chance of recovery, like this:
      
                0  4K 8K
        Data 1  |XX|  |
        Data 2  |  |XX|
        Parity  |  |  |
      
        In the above case, 0~4K in data 1 should be recovered using data 2 and
        parity, while 4K~8K in data 2 should be recovered using data 1 and
        parity.
      
        Currently if we trigger a read on 0~4K of data 1, we will also recover
        4K~8K of data 1 using corrupted data 2 and parity, causing a wrong
        result in the rbio cache.
      
      - Harder to expand for future M-N scheme
        As we're limited to just faila/b, two corruptions.
      
      - Harder to expand to handle extra csum errors
        This can be problematic if we start to do csum verification.
      
      This patch will introduce an extra @error_bitmap, where one bit
      represents an error that happened for that sector.
      
      The choice to introduce a new error bitmap rather than reusing
      sector_ptr is to avoid an extra search between rbio::stripe_sectors[]
      and rbio::bio_sectors[].
      
      Since we can submit bios using sectors from both arrays, doing a proper
      search on both arrays would be more complex.
      
      Although the new bitmap will take extra memory, later we can remove
      things like @error and faila/b to save some memory.
      
      Currently the new error bitmap and the faila/b mechanism coexist; the
      error bitmap is only updated at endio time and at recovery entry.
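
      To make the difference between stripe-level and sector-level tracking
      concrete, here is a small sketch with hypothetical names that assumes
      the bitmap fits in 32 bits. It counts errors per vertical stripe, that
      is, per sector offset across all stripes, and compares that with the
      number of stripes that contain any error at all.

```c
#include <stdint.h>

/* Bit for (stripe, sector) when every stripe has nsectors sectors. */
static unsigned int sketch_sector_bit(unsigned int stripe, unsigned int sector,
				      unsigned int nsectors)
{
	return stripe * nsectors + sector;
}

/* Number of stripes with an error at the same sector offset. */
static unsigned int sketch_vertical_errors(uint32_t bitmap,
					   unsigned int nr_stripes,
					   unsigned int nsectors,
					   unsigned int sector)
{
	unsigned int count = 0;

	for (unsigned int stripe = 0; stripe < nr_stripes; stripe++) {
		unsigned int bit = sketch_sector_bit(stripe, sector, nsectors);

		if (bitmap & (UINT32_C(1) << bit))
			count++;
	}
	return count;
}

/* Number of stripes with at least one bad sector (stripe-level view). */
static unsigned int sketch_failed_stripes(uint32_t bitmap,
					  unsigned int nr_stripes,
					  unsigned int nsectors)
{
	unsigned int count = 0;

	for (unsigned int stripe = 0; stripe < nr_stripes; stripe++) {
		for (unsigned int sector = 0; sector < nsectors; sector++) {
			unsigned int bit = sketch_sector_bit(stripe, sector,
							     nsectors);

			if (bitmap & (UINT32_C(1) << bit)) {
				count++;
				break;
			}
		}
	}
	return count;
}
```

      For the diagram above (three stripes of two 4K sectors, data 1 bad at
      0~4K and data 2 bad at 4K~8K), each vertical stripe has exactly one
      error, which a single parity can repair, while the stripe-level view
      reports two failed stripes.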
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2942a50d
    • D
      btrfs: pass btrfs_inode to btrfs_add_delayed_iput · e55cf7ca
      David Sterba committed
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e55cf7ca
    • D
      btrfs: use btrfs_inode inside btrfs_verify_data_csum · 5fc24314
      David Sterba committed
      The function is mostly using internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5fc24314
    • D
      btrfs: use btrfs_inode inside compress_file_range · 99a01bd6
      David Sterba committed
      The function is mostly using internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      99a01bd6
    • D
      btrfs: switch async_chunk::inode to btrfs_inode · 99a81a44
      David Sterba committed
      The async_chunk::inode structure is for internal interfaces so we should
      use the btrfs_inode.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      99a81a44
    • D
      btrfs: pass btrfs_inode to btrfs_inherit_iflags · 7a0443f0
      David Sterba committed
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7a0443f0
    • D
      btrfs: pass btrfs_inode to inode_tree_add · 4c45a4f4
      David Sterba committed
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4c45a4f4
    • D
      btrfs: pass btrfs_inode to fixup_tree_root_location · 3c1b1c4c
      David Sterba committed
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3c1b1c4c
    • D
      btrfs: pass btrfs_inode to btrfs_inode_by_name · d1de429b
      David Sterba committed
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d1de429b
    • D
      btrfs: pass btrfs_inode to btrfs_unlink_subvol · 5b7544cb
      David Sterba committed
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5b7544cb
    • D
      btrfs: pass btrfs_inode to btrfs_clear_delalloc_extent · bd54766e
      David Sterba committed
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bd54766e
    • D
      btrfs: pass btrfs_inode to btrfs_split_delalloc_extent · 62798a49
      David Sterba committed
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      62798a49
    • D
      btrfs: pass btrfs_inode to btrfs_set_delalloc_extent · 4c5d166f
      David Sterba committed
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4c5d166f
    • D
      btrfs: pass btrfs_inode to btrfs_merge_delalloc_extent · 2454151c
      David Sterba committed
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2454151c
    • D
      btrfs: switch extent_io_tree::private_data to btrfs_inode and rename · 0988fc7b
      David Sterba committed
      The extent_io_tree::private_data was meant to be preparatory work for
      the metadata inode rework, but that never materialized. Now it's used
      only for an inode, so it's better to change it to the appropriate type
      and rename it.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0988fc7b
    • D
      btrfs: drop private_data parameter from extent_io_tree_init · 35da5a7e
      David Sterba committed
      All callers except one pass NULL, so the parameter can be dropped and
      the inode::io_tree initialization can be open coded.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      35da5a7e
    • D
      btrfs: pass btrfs_inode to btrfs_delete_subvolume · 3c4f91e2
      David Sterba committed
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3c4f91e2