1. 26 September 2022, 21 commits
• btrfs: move extent state init and alloc functions to their own file · 83cf709a
  Josef Bacik committed
      Start cleaning up extent_io.c by moving the extent state code out of it.
      This patch starts with the extent state allocation code and the
      extent_io_tree init code.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: temporarily export alloc_extent_state helpers · c45379a2
  Josef Bacik committed
      We're going to move this code in stages, but while we're doing that we
      need to export these helpers so we can more easily move the code into
      the new file.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: separate out the eb and extent state leak helpers · a40246e8
  Josef Bacik committed
      Currently we have the add/del functions generic so that we can use them
      for both extent buffers and extent states.  We want to separate this
      code however, so separate these helpers into per-object helpers in
      anticipation of the split.
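
A minimal sketch of the per-object shape, assuming the extent buffer leak list and its lock live in fs_info (as allocated_ebs and eb_leak_lock); the extent state side gets an analogous pair:

    static void btrfs_leak_debug_add_eb(struct extent_buffer *eb)
    {
            struct btrfs_fs_info *fs_info = eb->fs_info;
            unsigned long flags;

            /* Per-filesystem leak list, protected by an irq-safe spinlock. */
            spin_lock_irqsave(&fs_info->eb_leak_lock, flags);
            list_add(&eb->leak_list, &fs_info->allocated_ebs);
            spin_unlock_irqrestore(&fs_info->eb_leak_lock, flags);
    }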
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: separate out the extent state and extent buffer init code · a62a3bd9
  Josef Bacik committed
      In order to help separate the extent buffer from the extent io tree code
      we need to break up the init functions.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: use find_first_extent_bit in btrfs_clean_io_failure · cdca85b0
  Josef Bacik committed
Currently we're using find_first_extent_bit_state to check if our state
contains the given failrec range. However, this is more of an internal
extent_io_tree helper, and it is technically unsafe to use because we're
accessing the state outside of the extent_io_tree lock.

Instead, use the normal helper find_first_extent_bit, which returns the
range of the extent state found by find_first_extent_bit_state, and use
that range to do our sanity checking.
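
A hedged sketch of the new check; the failrec field names and the bits used here are illustrative:

    u64 state_start, state_end;

    /*
     * find_first_extent_bit() copies the range of the state it finds
     * to the output parameters while holding the tree lock, so the
     * values remain safe to use after it returns.
     */
    if (find_first_extent_bit(failure_tree, failrec->bytenr,
                              &state_start, &state_end,
                              EXTENT_LOCKED | EXTENT_DIRTY, NULL))
            goto out;

    /* Sanity check: the found range must cover the whole failure record. */
    if (state_start > failrec->bytenr ||
        state_end < failrec->bytenr + failrec->len - 1)
            goto out;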
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: convert the io_failure_tree to a plain rb_tree · 87c11705
  Josef Bacik committed
      We still have this oddity of stashing the io_failure_record in the
      extent state for the io_failure_tree, which is leftover from when we
      used to stuff private pointers in extent_io_trees.
      
However, this doesn't make a lot of sense for the io failure records; we
can simply use a normal rb_tree for this.  This will allow us to further
      simplify the extent_io_tree code by removing the io_failure_rec pointer
      from the extent state.
      
      Convert the io_failure_tree to an rb tree + spinlock in the inode, and
      then use our rb tree simple helpers to insert and find failed records.
      This greatly cleans up this code and makes it easier to separate out the
      extent_io_tree code.
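
A rough sketch of the conversion, using the rb_simple_* helpers from fs/btrfs/misc.h; field placement and names are approximations of what the description implies:

    /* io_failure_record now starts with an rb_simple_node-compatible layout. */
    struct io_failure_record {
            struct rb_node rb_node;
            u64 bytenr;
            u64 len;
            /* ... mirror bookkeeping ... */
    };

    /* In struct btrfs_inode: */
    spinlock_t io_failure_lock;
    struct rb_root io_failure_tree;

    /* Insertion then becomes: */
    spin_lock(&inode->io_failure_lock);
    rb_simple_insert(&inode->io_failure_tree, failrec->bytenr,
                     &failrec->rb_node);
    spin_unlock(&inode->io_failure_lock);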
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: unexport internal failrec functions · a2061748
  Josef Bacik committed
      These are internally used functions and are not used outside of
      extent_io.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: rename clean_io_failure and remove extraneous args · 0d0a762c
  Josef Bacik committed
This is exported, so rename it to btrfs_clean_io_failure.  Additionally,
we are passing in the io trees and related members from the inode, so
instead of doing all that, simply pass in the inode itself and get all
the components we need directly inside of btrfs_clean_io_failure.
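
A sketch of the resulting signature; the argument list is an approximation, with the io trees and fs_info derived from the inode internally:

    int btrfs_clean_io_failure(struct btrfs_inode *inode, u64 start,
                               struct page *page, unsigned int pg_offset);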
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: make fiemap more efficient and accurate reporting extent sharedness · ac3c0d36
  Filipe Manana committed
      The current fiemap implementation does not scale very well with the number
      of extents a file has. This is both because the main algorithm to find out
      the extents has a high algorithmic complexity and because for each extent
      we have to check if it's shared. This second part, checking if an extent
      is shared, is significantly improved by the two previous patches in this
      patchset, while the first part is improved by this specific patch. Every
      now and then we get reports from users mentioning fiemap is too slow or
      even unusable for files with a very large number of extents, such as the
      two recent reports referred to by the Link tags at the bottom of this
      change log.
      
      To understand why the part of finding which extents a file has is very
      inefficient, consider the example of doing a full ranged fiemap against
      a file that has over 100K extents (normal for example for a file with
      more than 10G of data and using compression, which limits the extent size
      to 128K). When we enter fiemap at extent_fiemap(), the following happens:
      
      1) Before entering the main loop, we call get_extent_skip_holes() to get
         the first extent map. This leads us to btrfs_get_extent_fiemap(), which
         in turn calls btrfs_get_extent(), to find the first extent map that
         covers the file range [0, LLONG_MAX).
      
         btrfs_get_extent() will first search the inode's extent map tree, to
         see if we have an extent map there that covers the range. If it does
         not find one, then it will search the inode's subvolume b+tree for a
         fitting file extent item. After finding the file extent item, it will
         allocate an extent map, fill it in with information extracted from the
         file extent item, and add it to the inode's extent map tree (which
         requires a search for insertion in the tree).
      
      2) Then we enter the main loop at extent_fiemap(), emit the details of
         the extent, and call again get_extent_skip_holes(), with a start
         offset matching the end of the extent map we previously processed.
      
We end up at btrfs_get_extent() again, which will search the extent map
tree and then search the subvolume b+tree for a file extent item if we
could not find an extent map in the extent map tree. We allocate an
extent map,
         fill it in with the details in the file extent item, and then insert
         it into the extent map tree (yet another search in this tree).
      
      3) The second step is repeated over and over, until we have processed the
         whole file range. Each iteration ends at btrfs_get_extent(), which
         does a red black tree search on the extent map tree, then searches the
         subvolume b+tree, allocates an extent map and then does another search
         in the extent map tree in order to insert the extent map.
      
In the best scenario we have all the extent maps already in the extent
map tree, and so for each extent we do a single search on a red black
tree,
         so we have a complexity of O(n log n).
      
         In the worst scenario we don't have any extent map already loaded in
         the extent map tree, or have very few already there. In this case the
         complexity is much higher since we do:
      
         - A red black tree search on the extent map tree, which has O(log n)
           complexity, initially very fast since the tree is empty or very
           small, but as we end up allocating extent maps and adding them to
           the tree when we don't find them there, each subsequent search on
           the tree gets slower, since it's getting bigger and bigger after
           each iteration.
      
         - A search on the subvolume b+tree, also O(log n) complexity, but it
           has items for all inodes in the subvolume, not just items for our
           inode. Plus on a filesystem with concurrent operations on other
           inodes, we can block doing the search due to lock contention on
           b+tree nodes/leaves.
      
         - Allocate an extent map - this can block, and can also fail if we
           are under serious memory pressure.
      
         - Do another search on the extent maps red black tree, with the goal
           of inserting the extent map we just allocated. Again, after every
           iteration this tree is getting bigger by 1 element, so after many
           iterations the searches are slower and slower.
      
         - We will not need the allocated extent map anymore, so it's pointless
           to add it to the extent map tree. It's just wasting time and memory.
      
         In short we end up searching the extent map tree multiple times, on a
         tree that is growing bigger and bigger after each iteration. And
         besides that we visit the same leaf of the subvolume b+tree many times,
         since a leaf with the default size of 16K can easily have more than 200
         file extent items.
      
      This is very inefficient overall. This patch changes the algorithm to
      instead iterate over the subvolume b+tree, visiting each leaf only once,
      and only searching in the extent map tree for file ranges that have holes
      or prealloc extents, in order to figure out if we have delalloc there.
      It will never allocate an extent map and add it to the extent map tree.
      This is very similar to what was previously done for the lseek's hole and
      data seeking features.
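
For those hole/prealloc gaps, the delalloc probe can reuse the helper introduced by the seeking patches; a hedged sketch follows (the helper name comes from the lseek work mentioned above, its exact signature here is an assumption):

    u64 delalloc_start, delalloc_end;

    /* Only the extent map tree / io_tree is consulted for the gap. */
    if (btrfs_find_delalloc_in_range(inode, gap_start, gap_end,
                                     &delalloc_start, &delalloc_end)) {
            /* Emit a FIEMAP_EXTENT_DELALLOC entry for the sub-range. */
    }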
      
      Also, the current implementation relying on extent maps for figuring out
      which extents we have is not correct. This is because extent maps can be
      merged even if they represent different extents - we do this to minimize
      memory utilization and keep extent map trees smaller. For example if we
      have two extents that are contiguous on disk, once we load the two extent
      maps, they get merged into a single one - however if only one of the
      extents is shared, we end up reporting both as shared or both as not
      shared, which is incorrect.
      
      This reproducer triggers that bug:
      
          $ cat fiemap-bug.sh
          #!/bin/bash
      
          DEV=/dev/sdj
          MNT=/mnt/sdj
      
          mkfs.btrfs -f $DEV
          mount $DEV $MNT
      
          # Create a file with two 256K extents.
          # Since there is no other write activity, they will be contiguous,
          # and their extent maps merged, despite having two distinct extents.
          xfs_io -f -c "pwrite -S 0xab 0 256K" \
                    -c "fsync" \
                    -c "pwrite -S 0xcd 256K 256K" \
                    -c "fsync" \
                    $MNT/foo
      
          # Now clone only the second extent into another file.
          xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
      
          # Filefrag will report a single 512K extent, and say it's not shared.
          echo
          filefrag -v $MNT/foo
      
          umount $MNT
      
      Running the reproducer:
      
          $ ./fiemap-bug.sh
          wrote 262144/262144 bytes at offset 0
          256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
          wrote 262144/262144 bytes at offset 262144
          256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
          linked 262144/262144 bytes at offset 0
          256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
      
          Filesystem type is: 9123683e
          File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
           ext:     logical_offset:        physical_offset: length:   expected: flags:
             0:        0..     127:       3328..      3455:    128:             last,eof
          /mnt/sdj/foo: 1 extent found
      
We end up reporting that we have a single 512K extent that is not shared,
however we have two 256K extents, and the second one is shared. Changing
the reproducer to clone the first extent into file 'bar' instead makes us
report a single 512K extent that is shared, which is also incorrect since
we have two 256K extents and only the first one is shared.
      
      This patch is part of a larger patchset that is comprised of the following
      patches:
      
          btrfs: allow hole and data seeking to be interruptible
          btrfs: make hole and data seeking a lot more efficient
          btrfs: remove check for impossible block start for an extent map at fiemap
          btrfs: remove zero length check when entering fiemap
          btrfs: properly flush delalloc when entering fiemap
          btrfs: allow fiemap to be interruptible
          btrfs: rename btrfs_check_shared() to a more descriptive name
          btrfs: speedup checking for extent sharedness during fiemap
          btrfs: skip unnecessary extent buffer sharedness checks during fiemap
          btrfs: make fiemap more efficient and accurate reporting extent sharedness
      
      The patchset was tested on a machine running a non-debug kernel (Debian's
      default config) and compared the tests below on a branch without the
      patchset versus the same branch with the whole patchset applied.
      
      The following test for a large compressed file without holes:
      
          $ cat fiemap-perf-test.sh
          #!/bin/bash
      
          DEV=/dev/sdi
          MNT=/mnt/sdi
      
          mkfs.btrfs -f $DEV
          mount -o compress=lzo $DEV $MNT
      
          # 40G gives 327680 128K file extents (due to compression).
    xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
      
          umount $MNT
          mount -o compress=lzo $DEV $MNT
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata not cached)"
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata cached)"
      
          umount $MNT
      
      Before patchset:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 3597 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 2107 milliseconds (metadata cached)
      
      After patchset:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 1214 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 684 milliseconds (metadata cached)
      
      That's a speedup of about 3x for both cases (no metadata cached and all
      metadata cached).
      
      The test provided by Pavel (first Link tag at the bottom), which uses
      files with a large number of holes, was also used to measure the gains,
and it consists of a small C program and a shell script to invoke it.
      The C program is the following:
      
          $ cat pavels-test.c
          #include <stdio.h>
          #include <unistd.h>
          #include <stdlib.h>
          #include <fcntl.h>
      
          #include <sys/stat.h>
          #include <sys/time.h>
          #include <sys/ioctl.h>
      
          #include <linux/fs.h>
          #include <linux/fiemap.h>
      
    #define FILE_INTERVAL (1<<13) /* 8KiB */
      
          long long interval(struct timeval t1, struct timeval t2)
          {
              long long val = 0;
              val += (t2.tv_usec - t1.tv_usec);
              val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
              return val;
          }
      
          int main(int argc, char **argv)
          {
              struct fiemap fiemap = {};
              struct timeval t1, t2;
              char data = 'a';
              struct stat st;
              int fd, off, file_size = FILE_INTERVAL;
      
              if (argc != 3 && argc != 2) {
                      printf("usage: %s <path> [size]\n", argv[0]);
                      return 1;
              }
      
              if (argc == 3)
                      file_size = atoi(argv[2]);
              if (file_size < FILE_INTERVAL)
                      file_size = FILE_INTERVAL;
              file_size -= file_size % FILE_INTERVAL;
      
              fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
              if (fd < 0) {
                  perror("open");
                  return 1;
              }
      
              for (off = 0; off < file_size; off += FILE_INTERVAL) {
                  if (pwrite(fd, &data, 1, off) != 1) {
                      perror("pwrite");
                      close(fd);
                      return 1;
                  }
              }
      
              if (ftruncate(fd, file_size)) {
                  perror("ftruncate");
                  close(fd);
                  return 1;
              }
      
              if (fstat(fd, &st) < 0) {
                  perror("fstat");
                  close(fd);
                  return 1;
              }
      
              printf("size: %ld\n", st.st_size);
              printf("actual size: %ld\n", st.st_blocks * 512);
      
              fiemap.fm_length = FIEMAP_MAX_OFFSET;
              gettimeofday(&t1, NULL);
              if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
                  perror("fiemap");
                  close(fd);
                  return 1;
              }
              gettimeofday(&t2, NULL);
      
              printf("fiemap: fm_mapped_extents = %d\n",
                     fiemap.fm_mapped_extents);
              printf("time = %lld us\n", interval(t1, t2));
      
              close(fd);
              return 0;
          }
      
    $ gcc -o pavels-test pavels-test.c
      
      And the wrapper shell script:
      
          $ cat fiemap-pavels-test.sh
      
          #!/bin/bash
      
          DEV=/dev/sdi
          MNT=/mnt/sdi
      
          mkfs.btrfs -f -O no-holes $DEV
          mount $DEV $MNT
      
          echo
          echo "*********** 256M ***********"
          echo
      
          ./pavels-test $MNT/testfile $((1 << 28))
          echo
          ./pavels-test $MNT/testfile $((1 << 28))
      
          echo
          echo "*********** 512M ***********"
          echo
      
          ./pavels-test $MNT/testfile $((1 << 29))
          echo
          ./pavels-test $MNT/testfile $((1 << 29))
      
          echo
          echo "*********** 1G ***********"
          echo
      
          ./pavels-test $MNT/testfile $((1 << 30))
          echo
          ./pavels-test $MNT/testfile $((1 << 30))
      
          umount $MNT
      
      Running his reproducer before applying the patchset:
      
          *********** 256M ***********
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 4003133 us
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 4895330 us
      
          *********** 512M ***********
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 30123675 us
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 33450934 us
      
          *********** 1G ***********
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 131072
          time = 224924074 us
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 131072
          time = 217239242 us
      
      Running it after applying the patchset:
      
          *********** 256M ***********
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 29475 us
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 29307 us
      
          *********** 512M ***********
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 58996 us
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 59115 us
      
          *********** 1G ***********
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 116251
          time = 124141 us
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 131072
          time = 119387 us
      
      The speedup is massive, both on the first fiemap call and on the second
      one as well, as his test creates files with many holes and small extents
      (every extent follows a hole and precedes another hole).
      
      For the 256M file we go from 4 seconds down to 29 milliseconds in the
      first run, and then from 4.9 seconds down to 29 milliseconds again in the
      second run, a speedup of 138x and 169x, respectively.
      
      For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
      first run, and then from 33.5 seconds down to 59 milliseconds again in the
      second run, a speedup of 510x and 568x, respectively.
      
      For the 1G file, we go from 225 seconds down to 124 milliseconds in the
      first run, and then from 217 seconds down to 119 milliseconds in the
      second run, a speedup of 1815x and 1824x, respectively.
Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: skip unnecessary extent buffer sharedness checks during fiemap · b8f164e3
  Filipe Manana committed
      During fiemap, for each file extent we find, we must check if it's shared
      or not. The sharedness check starts by verifying if the extent is directly
      shared (its refcount in the extent tree is > 1), and if it is not directly
      shared, then we will check if every node in the subvolume b+tree leading
      from the root to the leaf that has the file extent item (in reverse order),
      is shared (through snapshots).
      
      However this second step is not needed if our extent was created in a
      transaction more recent than the last transaction where a snapshot of the
      inode's root happened, because it can't be shared indirectly (through
      shared subtrees) without a snapshot created in a more recent transaction.
      
      So grab the generation of the extent from the extent map and pass it to
      btrfs_is_data_extent_shared(), which will skip this second phase when the
      generation is more recent than the root's last snapshot value. Note that
      we skip this optimization if the extent map is the result of merging 2
      or more extent maps, because in this case its generation is the maximum
      of the generations of all merged extent maps.
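
A minimal sketch of how the generation is picked, assuming the extent map code's EXTENT_FLAG_MERGED flag marks merged maps:

    u64 extent_gen = 0;

    /*
     * A merged extent map carries the maximum generation of all the
     * maps merged into it, so it can't be trusted for this optimization.
     */
    if (!test_bit(EXTENT_FLAG_MERGED, &em->flags))
            extent_gen = em->generation;

    /* extent_gen is later passed to btrfs_is_data_extent_shared(). */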
      
The fact that we use extent maps and that they can be merged despite the
      underlying extents being distinct (different file extent items in the
      subvolume b+tree and different extent items in the extent b+tree), can
      result in some bugs when reporting shared extents. But this is a problem
      of the current implementation of fiemap relying on extent maps.
      One example where we get incorrect results is:
      
          $ cat fiemap-bug.sh
          #!/bin/bash
      
          DEV=/dev/sdj
          MNT=/mnt/sdj
      
          mkfs.btrfs -f $DEV
          mount $DEV $MNT
      
          # Create a file with two 256K extents.
          # Since there is no other write activity, they will be contiguous,
          # and their extent maps merged, despite having two distinct extents.
          xfs_io -f -c "pwrite -S 0xab 0 256K" \
                    -c "fsync" \
                    -c "pwrite -S 0xcd 256K 256K" \
                    -c "fsync" \
                    $MNT/foo
      
          # Now clone only the second extent into another file.
          xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
      
          # Filefrag will report a single 512K extent, and say it's not shared.
          echo
          filefrag -v $MNT/foo
      
          umount $MNT
      
      Running the reproducer:
      
          $ ./fiemap-bug.sh
          wrote 262144/262144 bytes at offset 0
          256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
          wrote 262144/262144 bytes at offset 262144
          256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
          linked 262144/262144 bytes at offset 0
          256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
      
          Filesystem type is: 9123683e
          File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
           ext:     logical_offset:        physical_offset: length:   expected: flags:
             0:        0..     127:       3328..      3455:    128:             last,eof
          /mnt/sdj/foo: 1 extent found
      
We end up reporting that we have a single 512K extent that is not shared,
however we have two 256K extents, and the second one is shared. Changing
the reproducer to clone the first extent into file 'bar' instead makes us
report a single 512K extent that is shared, which is also incorrect since
we have two 256K extents and only the first one is shared.
      
This is a problem that existed before this change, and remains after this
      change, as it can't be easily fixed. The next patch in the series reworks
      fiemap to primarily use file extent items instead of extent maps (except
      for checking for delalloc ranges), with the goal of improving its
      scalability and performance, but it also ends up fixing this particular
      bug caused by extent map merging.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: speedup checking for extent sharedness during fiemap · 12a824dc
  Filipe Manana committed
      One of the most expensive tasks performed during fiemap is to check if
      an extent is shared. This task has two major steps:
      
      1) Check if the data extent is shared. This implies checking the extent
         item in the extent tree, checking delayed references, etc. If we
         find the data extent is directly shared, we terminate immediately;
      
      2) If the data extent is not directly shared (its extent item has a
         refcount of 1), then it may be shared if we have snapshots that share
         subtrees of the inode's subvolume b+tree. So we check if the leaf
         containing the file extent item is shared, then its parent node, then
         the parent node of the parent node, etc, until we reach the root node
         or we find one of them is shared - in which case we stop immediately.
      
      During fiemap we process the extents of a file from left to right, from
      file offset 0 to EOF. This means that we iterate b+tree leaves from left
to right, which has the implication that we keep repeating that second step
      above several times for the same b+tree path of the inode's subvolume
      b+tree.
      
      For example, if we have two file extent items in leaf X, and the path to
      leaf X is A -> B -> C -> X, then when we try to determine if the data
      extent referenced by the first extent item is shared, we check if the data
      extent is shared - if it's not, then we check if leaf X is shared, if not,
      then we check if node C is shared, if not, then check if node B is shared,
if not, then check if node A is shared. When we move to the next file
      extent item, after determining the data extent is not shared, we repeat
      the checks for X, C, B and A - doing all the expensive searches in the
extent tree, delayed refs, etc. If we have thousands of file extents, then
      we keep repeating the sharedness checks for the same paths over and over.
      
      On a file that has no shared extents or only a small portion, it's easy
      to see that this scales terribly with the number of extents in the file
      and the sizes of the extent and subvolume b+trees.
      
      This change eliminates the repeated sharedness check on extent buffers
      by caching the results of the last path used. The results can be used as
      long as no snapshots were created since they were cached (for not shared
      extent buffers) or no roots were dropped since they were cached (for
      shared extent buffers). This greatly reduces the time spent by fiemap for
      files with thousands of extents and/or large extent and subvolume b+trees.
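
A sketch of the per-call cache this describes, with one slot per possible b+tree level; the structure layout is an educated guess from the description:

    struct btrfs_backref_shared_cache_entry {
            u64 bytenr;
            u64 gen;          /* generation when the result was cached */
            bool is_shared;
    };

    struct btrfs_backref_shared_cache {
            struct btrfs_backref_shared_cache_entry entries[BTRFS_MAX_LEVEL];
    };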
      
      Example performance test:
      
          $ cat fiemap-perf-test.sh
          #!/bin/bash
      
          DEV=/dev/sdi
          MNT=/mnt/sdi
      
          mkfs.btrfs -f $DEV
          mount -o compress=lzo $DEV $MNT
      
          # 40G gives 327680 128K file extents (due to compression).
          xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
      
          umount $MNT
          mount -o compress=lzo $DEV $MNT
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata not cached)"
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata cached)"
      
          umount $MNT
      
      Before this patch:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 3597 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 2107 milliseconds (metadata cached)
      
      After this patch:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 1646 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 698 milliseconds (metadata cached)
      
      That's about 2.2x faster when no metadata is cached, and about 3x faster
      when all metadata is cached. On a real filesystem with many other files,
      data, directories, etc, the b+trees will be 2 or 3 levels higher,
      therefore this optimization will have a higher impact.
      
Several reports of a slow fiemap show up often; the two Link tags below
refer to two recent reports of such slowness. This patch, together with
the next ones in the series, is meant to address that.
      
Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: rename btrfs_check_shared() to a more descriptive name · 8eedadda
  Filipe Manana committed
The function btrfs_check_shared() is supposed to be used to check if a
data extent is shared, but its name is too generic and may easily cause
confusion, suggesting that it may also be used for metadata extents.
      
      So rename it to btrfs_is_data_extent_shared(), which will also make it
      less confusing after the next change that adds a backref lookup cache for
      the b+tree nodes that lead to the leaf that contains the file extent item
      that points to the target data extent.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: allow fiemap to be interruptible · 09fbc1c8
  Filipe Manana committed
      Doing fiemap on a file with a very large number of extents can take a very
      long time, and we have reports of it being too slow (two recent examples
      in the Link tags below), so make it interruptible.
      
Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: remove zero length check when entering fiemap · 9a42bbae
  Filipe Manana committed
There's no point in checking for a zero length at extent_fiemap(): before
calling it, we called fiemap_prep() at btrfs_fiemap(), which already
checks for a zero length and returns the same -EINVAL error. So remove
the pointless check.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: remove check for impossible block start for an extent map at fiemap · f12eec9a
  Filipe Manana committed
      During fiemap we are testing if an extent map has a block start with a
      value of EXTENT_MAP_LAST_BYTE, but that is never set on an extent map,
      and never was according to git history. So remove that useless check.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: give struct btrfs_bio a real end_io handler · 917f32a2
  Christoph Hellwig committed
      Currently btrfs_bio end I/O handling is a bit of a mess.  The bi_end_io
      handler and bi_private pointer of the embedded struct bio are both used
      to handle the completion of the high-level btrfs_bio and for the I/O
      completion for the low-level device that the embedded bio ends up being
      sent to.
      
To support this, bi_end_io and bi_private are saved into the
      btrfs_io_context structure and then restored after the bio sent to the
      underlying device has completed the actual I/O.
      
      Untangle this by adding an end I/O handler and private data to struct
      btrfs_bio for the high-level btrfs_bio based completions, and leave the
      actual bio bi_end_io handler and bi_private pointer entirely to the
      low-level device I/O.
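
A hedged sketch of the resulting split, with the high-level completion fields added to btrfs_bio (field names follow the description; unrelated fields elided):

    typedef void (*btrfs_bio_end_io_t)(struct btrfs_bio *bbio);

    struct btrfs_bio {
            /* ... csum, mirror_num, file_offset, ... */

            /* End I/O information for the high-level btrfs_bio. */
            btrfs_bio_end_io_t end_io;
            void *private;

            /*
             * The embedded bio: its bi_end_io and bi_private now belong
             * exclusively to the low-level device I/O.  Must be last.
             */
            struct bio bio;
    };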
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: pass the operation to btrfs_bio_alloc · 6b42f5e3
  Christoph Hellwig committed
Pass the operation to btrfs_bio_alloc, matching what bio_alloc_bioset
does.
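
A sketch of the changed call; the signature is an approximation of this revision:

    struct bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf);

    /* Callers pass the REQ_OP_* up front instead of setting bi_opf later: */
    bio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ);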
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: move btrfs_bio allocation to volumes.c · d45cfb88
  Christoph Hellwig committed
      volumes.c is the place that implements the storage layer using the
      btrfs_bio structure, so move the bio_set and allocation helpers there
      as well.
      
      To make up for the new initialization boilerplate, merge the two
      init/exit helpers in extent_io.c into a single one.
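
A sketch of the bio_set setup that now lives in volumes.c, with struct btrfs_bio embedded as front padding of each bio (helper name and error handling are approximations):

    static struct bio_set btrfs_bioset;

    int __init btrfs_bioset_init(void)
    {
            /* Reserve room for struct btrfs_bio in front of each bio. */
            if (bioset_init(&btrfs_bioset, BIO_POOL_SIZE,
                            offsetof(struct btrfs_bio, bio),
                            BIOSET_NEED_BVECS))
                    return -ENOMEM;
            return 0;
    }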
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: don't create integrity bioset for btrfs_bioset · 1e408af3
  Christoph Hellwig committed
      btrfs never uses bio integrity data itself, so don't allocate
      the integrity pools for btrfs_bioset.
      
This patch is a revert of the commit b208c2f7 ("btrfs: Fix crash due
to not allocating integrity data for a set").  The integrity data pool
is not needed; the bio-integrity code now handles allocating the
integrity payload without that.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path · 52b029f4
  Ethan Lien committed
      After we copied data to page cache in buffered I/O, we
1. Insert an EXTENT_UPTODATE state into the inode's io_tree, by
         endio_readpage_release_extent(), set_extent_delalloc() or
         set_extent_defrag().
      2. Set page uptodate before we unlock the page.
      
      But the only place we check io_tree's EXTENT_UPTODATE state is in
      btrfs_do_readpage(). We know we enter btrfs_do_readpage() only when we
      have a non-uptodate page, so it is unnecessary to set EXTENT_UPTODATE.
      
      For example, when performing a buffered random read:
      
      	fio --rw=randread --ioengine=libaio --direct=0 --numjobs=4 \
      		--filesize=32G --size=4G --bs=4k --name=job \
      		--filename=/mnt/file --name=job
      
      Then check how many extent_state in io_tree:
      
      	cat /proc/slabinfo | grep btrfs_extent_state | awk '{print $2}'
      
      w/o this patch, we got 640567 btrfs_extent_state.
      w/  this patch, we got    204 btrfs_extent_state.
      
      Maintaining such a big tree brings overhead since every I/O needs to insert
      EXTENT_LOCKED, insert EXTENT_UPTODATE, then remove EXTENT_LOCKED. And in
      every insert or remove, we need to lock io_tree, do tree search, alloc or
      dealloc extent states. By removing unnecessary EXTENT_UPTODATE, we keep
      io_tree in a minimal size and reduce overhead when performing buffered I/O.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: use atomic_try_cmpxchg in free_extent_buffer · e5677f05
  Uros Bizjak committed
      Use `atomic_try_cmpxchg(ptr, &old, new)` instead of
      `atomic_cmpxchg(ptr, old, new) == old` in free_extent_buffer. This
      has two benefits:
      
      - The x86 cmpxchg instruction returns success in the ZF flag, so this
        change saves a compare after cmpxchg, as well as a related move
  instruction in front of cmpxchg.
      
      - atomic_try_cmpxchg implicitly assigns the *ptr value to &old when
        cmpxchg fails, enabling further code simplifications.
      
      This patch has no functional change.
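
A simplified stand-in for the decrement loop in free_extent_buffer showing the pattern (the real break conditions differ):

    int refs = atomic_read(&eb->refs);

    while (refs > 3) {
            /*
             * On failure, atomic_try_cmpxchg() updates 'refs' with the
             * current value of eb->refs, so the loop needs no explicit
             * re-read before retrying.
             */
            if (atomic_try_cmpxchg(&eb->refs, &refs, refs - 1))
                    return;
    }
    /* Slow path: the last references are dropped under the spinlock. */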
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2. 23 August 2022, 1 commit
• btrfs: don't merge pages into bio if their page offset is not contiguous · 4a445b7b
  Qu Wenruo committed
      [BUG]
Zygo reported that, on the latest development branch, he could hit an
ASSERT()/BUG_ON()-caused crash when doing RAID5 recovery (intentionally
corrupting one disk and letting btrfs recover the data during read/scrub).

And the following minimal reproducer can cause extent state leakage at
rmmod time:
      
        mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null
        mount $dev1 $mnt
        fsstress -w -d $mnt -n 25 -s 1660807876
        sync
        fssum -A -f -w /tmp/fssum.saved $mnt
        umount $mnt
      
        # Wipe the dev1 but keeps its super block
        xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1
        mount $dev1 $mnt
        fssum -r /tmp/fssum.saved $mnt > /dev/null
        umount $mnt
        rmmod btrfs
      
This will lead to the following extent state leakage:
      
        BTRFS: state leak: start 499712 end 503807 state 5 in tree 1 refs 1
        BTRFS: state leak: start 495616 end 499711 state 5 in tree 1 refs 1
        BTRFS: state leak: start 491520 end 495615 state 5 in tree 1 refs 1
        BTRFS: state leak: start 487424 end 491519 state 5 in tree 1 refs 1
        BTRFS: state leak: start 483328 end 487423 state 5 in tree 1 refs 1
        BTRFS: state leak: start 479232 end 483327 state 5 in tree 1 refs 1
        BTRFS: state leak: start 475136 end 479231 state 5 in tree 1 refs 1
        BTRFS: state leak: start 471040 end 475135 state 5 in tree 1 refs 1
      
      [CAUSE]
      Since commit 7aa51232 ("btrfs: pass a btrfs_bio to
      btrfs_repair_one_sector"), we always use btrfs_bio->file_offset to
      determine the file offset of a page.
      
But that usage assumes that one bio has all its pages with contiguous
page offsets.

Unfortunately that's not true: btrfs only requires the logical bytenr to
be contiguous when assembling its bios.
      
From the above script, we have one bio that looks like this:
      
        fssum-27671  submit_one_bio: bio logical=217739264 len=36864
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=466944 <<<
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=724992 <<<
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=729088
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=733184
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=737280
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=741376
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=745472
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=749568
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=753664
      
Note that the 1st and the 2nd page have non-contiguous page offsets.
      
This means, at repair time, we will have a completely wrong file offset
passed in:
      
         kworker/u32:2-19927  btrfs_repair_one_sector: r/i=5/261 page_off=729088 file_off=475136 bio_offset=8192
      
Since the file offset is incorrect, we later set the extent states
incorrectly, with no way to really release them.

Thus it later causes the leakage.
      
In fact, this can be even worse: since the file offset is incorrect, we
can hit cases where the incorrect file offset belongs to a HOLE, which
later causes btrfs_num_copies() to trigger an error and finally hit a
BUG_ON()/ASSERT().
      
      [FIX]
      Add an extra condition in btrfs_bio_add_page() for uncompressed IO.
      
Now we will have a stricter requirement for bio pages:
      
      - They should all have the same mapping
        (the mapping check is already implied by the call chain)
      
      - Their logical bytenr should be adjacent
        This is the same as the old condition.
      
      - Their page_offset() (file offset) should be adjacent
  This is the new check (sketched below).
  This would result in a slightly increased number of bios from btrfs
  (it needs holes inside the same stripe boundary to trigger).
      
        But this would greatly reduce the confusion, as it's pretty common
        to assume a btrfs bio would only contain continuous page cache.
      
Later we may need extra cleanups, as we no longer need to handle gaps
between page offsets in endio functions.
      
      Currently this should be the minimal patch to fix commit 7aa51232
      ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector").
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 7aa51232 ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
3. 17 August 2022, 1 commit
• btrfs: fix lockdep splat with reloc root extent buffers · b40130b2
  Josef Bacik committed
      We have been hitting the following lockdep splat with btrfs/187 recently
      
        WARNING: possible circular locking dependency detected
        5.19.0-rc8+ #775 Not tainted
        ------------------------------------------------------
        btrfs/752500 is trying to acquire lock:
        ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        but task is already holding lock:
        ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_init_new_buffer+0x7d/0x2c0
      	 btrfs_alloc_tree_block+0x120/0x3b0
      	 __btrfs_cow_block+0x136/0x600
      	 btrfs_cow_block+0x10b/0x230
      	 btrfs_search_slot+0x53b/0xb70
      	 btrfs_lookup_inode+0x2a/0xa0
      	 __btrfs_update_delayed_inode+0x5f/0x280
      	 btrfs_async_run_delayed_root+0x24c/0x290
      	 btrfs_work_helper+0xf2/0x3e0
      	 process_one_work+0x271/0x590
      	 worker_thread+0x52/0x3b0
      	 kthread+0xf0/0x120
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (btrfs-tree-01){++++}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_search_slot+0x3c3/0xb70
      	 do_relocation+0x10c/0x6b0
      	 relocate_tree_blocks+0x317/0x6d0
      	 relocate_block_group+0x1f1/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
      	 __lock_acquire+0x1122/0x1e10
      	 lock_acquire+0xc2/0x2d0
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_lock_root_node+0x31/0x50
      	 btrfs_search_slot+0x1cb/0xb70
      	 replace_path+0x541/0x9f0
      	 merge_reloc_root+0x1d6/0x610
      	 merge_reloc_roots+0xe2/0x260
      	 relocate_block_group+0x2c8/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        other info that might help us debug this:
      
        Chain exists of:
          btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-tree-01/1);
      				 lock(btrfs-tree-01);
      				 lock(btrfs-tree-01/1);
          lock(btrfs-treloc-02#2);
      
         *** DEADLOCK ***
      
        7 locks held by btrfs/752500:
         #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
         #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
         #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
         #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
         #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        stack backtrace:
        CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
      
         dump_stack_lvl+0x56/0x73
         check_noncircular+0xd6/0x100
         ? lock_is_held_type+0xe2/0x140
         __lock_acquire+0x1122/0x1e10
         lock_acquire+0xc2/0x2d0
         ? __btrfs_tree_lock+0x24/0x110
         down_write_nested+0x41/0x80
         ? __btrfs_tree_lock+0x24/0x110
         __btrfs_tree_lock+0x24/0x110
         btrfs_lock_root_node+0x31/0x50
         btrfs_search_slot+0x1cb/0xb70
         ? lock_release+0x137/0x2d0
         ? _raw_spin_unlock+0x29/0x50
         ? release_extent_buffer+0x128/0x180
         replace_path+0x541/0x9f0
         merge_reloc_root+0x1d6/0x610
         merge_reloc_roots+0xe2/0x260
         relocate_block_group+0x2c8/0x560
         btrfs_relocate_block_group+0x23e/0x400
         btrfs_relocate_chunk+0x4c/0x140
         btrfs_balance+0x755/0xe40
         btrfs_ioctl+0x1ea2/0x2c90
         ? lock_is_held_type+0xe2/0x140
         ? lock_is_held_type+0xe2/0x140
         ? __x64_sys_ioctl+0x88/0xc0
         __x64_sys_ioctl+0x88/0xc0
         do_syscall_64+0x38/0x90
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
This isn't necessarily new; it's just tricky to hit in practice.  There
      are two competing things going on here.  With relocation we create a
      snapshot of every fs tree with a reloc tree.  Any extent buffers that
      get initialized here are initialized with the reloc root lockdep key.
      However since it is a snapshot, any blocks that are currently in cache
      that originally belonged to the fs tree will have the normal tree
      lockdep key set.  This creates the lock dependency of
      
        reloc tree -> normal tree
      
      for the extent buffer locking during the first phase of the relocation
      as we walk down the reloc root to relocate blocks.
      
      However this is problematic because the final phase of the relocation is
      merging the reloc root into the original fs root.  This involves
      searching down to any keys that exist in the original fs root and then
      swapping the relocated block and the original fs root block.  We have to
      search down to the fs root first, and then go search the reloc root for
      the block we need to replace.  This creates the dependency of
      
        normal tree -> reloc tree
      
      which is why lockdep complains.
      
Additionally, even if we were to fix this particular mismatch with a
different nesting for the merge case, we're still slotting in a block
that has an owner of the reloc root objectid into a normal tree, so that
block will have its lockdep key set to the tree reloc root, creating a
lockdep splat later on when we wander into that block from the fs root.
      
Unfortunately the only solution here is to make sure we do not set the
lockdep key to the reloc tree lockdep key normally, and then reset any
blocks we wander into from the reloc root when we're doing the merge.
      
      This solves the problem of having mixed tree reloc keys intermixed with
      normal tree keys, and then allows us to make sure in the merge case we
      maintain the lock order of
      
        normal tree -> reloc tree
      
      We handle this by setting a bit on the reloc root when we do the search
      for the block we want to relocate, and any block we search into or COW
      at that point gets set to the reloc tree key.  This works correctly
      because we only ever COW down to the parent node, so we aren't resetting
      the key for the block we're linking into the fs root.
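
A rough sketch of the mechanism; the bit and helper are named after what the patch describes, so treat the exact identifiers as approximate:

    /* When starting the search for the block we want to relocate: */
    set_bit(BTRFS_ROOT_RESET_LOCKDEP_CLASS, &root->state);

    /* For each buffer we then search into or COW under that root: */
    if (test_bit(BTRFS_ROOT_RESET_LOCKDEP_CLASS, &root->state))
            btrfs_set_buffer_lockdep_class(root->root_key.objectid, eb,
                                           btrfs_header_level(eb));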
      
      With this patch we no longer have the lockdep splat in btrfs/187.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4. 26 July 2022, 2 commits
• btrfs: fix repair of compressed extents · 81bd9328
  Christoph Hellwig committed
      Currently the checksum of compressed extents is verified based on the
      compressed data and the lower btrfs_bio, but the actual repair process
      is driven by end_bio_extent_readpage on the upper btrfs_bio for the
      decompressed data.
      
This has a bunch of issues, including not being able to properly
communicate the failed mirror up in case the I/O submission got
preempted, a general loss of information about whether an error was an
I/O error or a checksum verification failure, and most importantly that
this design causes btrfs_clean_io_failure to eventually write back the
uncompressed good data onto the disk sectors that are supposed to
contain compressed data.
      
      Fix this by moving the repair to the lower btrfs_bio.  To do so, a fair
      amount of code has to be reshuffled:
      
       a) the lower btrfs_bio now needs a valid csum pointer.  The easiest way
   to achieve that is to pass NULL to btrfs_lookup_bio_sums and just use
          the btrfs_bio management of csums.  For a compressed_bio that is
          split into multiple btrfs_bios this means additional memory
          allocations, but the code becomes a lot more regular.
       b) checksum verification now runs directly on the lower btrfs_bio instead
          of the compressed_bio.  This actually nicely simplifies the end I/O
          processing.
 c) btrfs_repair_one_sector can't just look up the logical address for
    the file offset any more, as there are no corresponding relative
    offsets that apply to both the file offset and the logical address
    for compressed extents.  Instead require that the saved bvec_iter in
    the btrfs_bio is filled out for all read bios and use that, which
    again removes a fair amount of code.
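
A tiny sketch of point c): repair advances a copy of the bvec_iter saved in the failed btrfs_bio instead of translating a file offset to a logical address (field name as used by btrfs_bio for the saved iterator):

    /* Work on a copy; the saved iter must stay intact for other sectors. */
    struct bvec_iter iter = failed_bbio->iter;

    bio_advance_iter(&failed_bbio->bio, &iter, bio_offset);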
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
      81bd9328
    • C
      btrfs: pass a btrfs_bio to btrfs_repair_one_sector · 7aa51232
      Committed by Christoph Hellwig
      Pass the btrfs_bio instead of the plain bio to btrfs_repair_one_sector,
      and remove the start and failed_mirror arguments in favor of deriving
      them from the btrfs_bio.  For this to work, ensure that the file_offset
      field is also initialized for buffered I/O.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7aa51232
  5. 25 Jul 2022, 15 commits
    • C
      btrfs: repair all known bad mirrors · c144c63f
      Committed by Christoph Hellwig
      When there is more than a single level of redundancy there can also be
      multiple bad mirrors, and the current read repair code only repairs the
      last bad one.
      
      Restructure btrfs_repair_one_sector so that it records the originally
      failed mirror and the number of copies, and then repairs all known bad
      copies until we reach the originally failed copy in clean_io_failure.
      Note that this also means the read repair reads will always start from
      the next bad mirror and not mirror 0.
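
      A minimal sketch of the mirror cycling, where the helper name is an
      assumption:

        /* Advance to the next mirror, wrapping past num_copies; mirror
         * numbers are 1-based, so 0 is never returned. */
        static int next_mirror(int cur_mirror, int num_copies)
        {
                return cur_mirror == num_copies ? 1 : cur_mirror + 1;
        }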
      
      This fixes btrfs/265 in xfstests.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c144c63f
    • N
      btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size · f7b12a62
      Committed by Naohiro Aota
      On zoned filesystems, data write-out is limited by
      max_zone_append_size, and a large ordered extent is split according
      to the size of a bio.  OTOH, the number of extents to be written is
      calculated using BTRFS_MAX_EXTENT_SIZE, and that estimated number is
      used to reserve the metadata bytes to update and/or create the
      metadata items.

      The metadata reservation is done at e.g. btrfs_buffered_write() and
      then released according to the estimation changes.  Thus, if the
      number of extents increases massively, the reserved metadata can run
      out.

      Such an increase of the number of extents easily occurs on zoned
      filesystems if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size, and it
      causes the following warning in a small-RAM environment with metadata
      over-commit disabled (done in the following patch).
      
      [75721.498492] ------------[ cut here ]------------
      [75721.505624] BTRFS: block rsv 1 returned -28
      [75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G        W         5.18.0-rc2-BTRFS-ZNS+ #109
      [75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
      [75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
      [75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
      [75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
      [75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
      [75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
      [75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
      [75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
      [75721.701878] FS:  0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
      [75721.712601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
      [75721.730499] Call Trace:
      [75721.735166]  <TASK>
      [75721.739886]  btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
      [75721.747545]  ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
      [75721.756145]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.762852]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.769520]  ? push_leaf_left+0x420/0x620 [btrfs]
      [75721.776431]  ? memcpy+0x4e/0x60
      [75721.781931]  split_leaf+0x433/0x12d0 [btrfs]
      [75721.788392]  ? btrfs_get_token_32+0x580/0x580 [btrfs]
      [75721.795636]  ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
      [75721.803759]  ? leaf_space_used+0x15d/0x1a0 [btrfs]
      [75721.811156]  btrfs_search_slot+0x1bc3/0x2790 [btrfs]
      [75721.818300]  ? lock_downgrade+0x7c0/0x7c0
      [75721.824411]  ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
      [75721.832456]  ? split_leaf+0x12d0/0x12d0 [btrfs]
      [75721.839149]  ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
      [75721.846945]  ? free_extent_buffer+0x13/0x20 [btrfs]
      [75721.853960]  ? btrfs_release_path+0x4b/0x190 [btrfs]
      [75721.861429]  btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
      [75721.869313]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.876085]  ? lock_release+0x552/0xf80
      [75721.881957]  ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
      [75721.888886]  ? __kasan_check_write+0x14/0x20
      [75721.895152]  ? do_raw_read_unlock+0x44/0x80
      [75721.901323]  ? _raw_write_lock_irq+0x60/0x80
      [75721.907983]  ? btrfs_global_root+0xb9/0xe0 [btrfs]
      [75721.915166]  ? btrfs_csum_root+0x12b/0x180 [btrfs]
      [75721.921918]  ? btrfs_get_global_root+0x820/0x820 [btrfs]
      [75721.929166]  ? _raw_write_unlock+0x23/0x40
      [75721.935116]  ? unpin_extent_cache+0x1e3/0x390 [btrfs]
      [75721.942041]  btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
      [75721.949906]  ? try_to_wake_up+0x30/0x14a0
      [75721.955700]  ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
      [75721.962661]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.969111]  ? lock_acquire+0x41b/0x4c0
      [75721.974982]  finish_ordered_fn+0x15/0x20 [btrfs]
      [75721.981639]  btrfs_work_helper+0x1af/0xa80 [btrfs]
      [75721.988184]  ? _raw_spin_unlock_irq+0x28/0x50
      [75721.994643]  process_one_work+0x815/0x1460
      [75722.000444]  ? pwq_dec_nr_in_flight+0x250/0x250
      [75722.006643]  ? do_raw_spin_trylock+0xbb/0x190
      [75722.013086]  worker_thread+0x59a/0xeb0
      [75722.018511]  kthread+0x2ac/0x360
      [75722.023428]  ? process_one_work+0x1460/0x1460
      [75722.029431]  ? kthread_complete_and_exit+0x30/0x30
      [75722.036044]  ret_from_fork+0x22/0x30
      [75722.041255]  </TASK>
      [75722.045047] irq event stamp: 0
      [75722.049703] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
      [75722.067533] softirqs last  enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
      [75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [75722.085335] ---[ end trace 0000000000000000 ]---
      
      To fix the estimation, introduce fs_info->max_extent_size to replace
      BTRFS_MAX_EXTENT_SIZE, which allows setting a different size for
      regular vs zoned filesystems.

      Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default.  On
      zoned filesystems, it is set to fs_info->max_zone_append_size.
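
      A minimal sketch of the selection, using the field names mentioned
      above:

        /* Default to the generic cap, shrink it on zoned filesystems. */
        fs_info->max_extent_size = BTRFS_MAX_EXTENT_SIZE;
        if (btrfs_is_zoned(fs_info))
                fs_info->max_extent_size = fs_info->max_zone_append_size;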
      
      CC: stable@vger.kernel.org # 5.12+
      Fixes: d8e3fb10 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f7b12a62
    • C
      btrfs: remove extent writepage address space operation · f3e90c1c
      Committed by Christoph Hellwig
      Same as in commit 21b4ee70 ("xfs: drop ->writepage completely"): we
      can remove the callback as it's only used in one place (single page
      writeback from memory reclaim) and is not called for cgroup writeback
      at all.
      
      We only allow such writeback from kswapd, not from direct memory
      reclaim, and so it is rarely used. When it comes from kswapd, it is
      effectively random dirty page shoot-down, which is horrible for IO
      patterns. We can rely on background writeback to clean all dirty pages
      in an efficient way and not let it be interrupted by kswapd.
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f3e90c1c
    • D
      btrfs: unify tree search helper returning prev and next nodes · 9db33891
      Committed by David Sterba
      Simplify the helper to return only the next and prev pointers; we
      don't need all the node/parent/prev/next pointers of __etree_search,
      as there are now other specialized helpers.  Rename the parameters so
      they follow the naming.
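
      A minimal sketch of such a search, where the helper name and exact
      return convention are assumptions:

        /* Find the extent_state covering offset; on a miss, fill in the
         * neighbouring nodes and return NULL. */
        static struct rb_node *tree_search_prev_next(
                        struct extent_io_tree *tree, u64 offset,
                        struct rb_node **prev_ret, struct rb_node **next_ret)
        {
                struct rb_node *node = tree->state.rb_node;
                struct rb_node *prev = NULL, *next = NULL;

                while (node) {
                        struct extent_state *entry;

                        entry = rb_entry(node, struct extent_state, rb_node);
                        if (offset < entry->start) {
                                next = node;
                                node = node->rb_left;
                        } else if (offset > entry->end) {
                                prev = node;
                                node = node->rb_right;
                        } else {
                                return node;
                        }
                }
                *prev_ret = prev;
                *next_ret = next;
                return NULL;
        }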
      Signed-off-by: David Sterba <dsterba@suse.com>
      9db33891
    • D
      btrfs: make tree search for insert more generic and use it for tree_search · ec60c76f
      Committed by David Sterba
      With a slight extension of tree_search_for_insert (filling in the
      node and parent return parameters) we can avoid calling
      __etree_search from tree_search, which can then be removed in
      followup patches.
      Signed-off-by: David Sterba <dsterba@suse.com>
      ec60c76f
    • D
      btrfs: open code inexact rbtree search in tree_search · bebb22c1
      Committed by David Sterba
      The call chain from
      
      tree_search
        tree_search_for_insert
          __etree_search
      
      can be open coded, allowing further simplifications: here we need a
      tree search with a fallback to the next node in case the target is
      not found.  This corresponds to calling __etree_search with
      next_ret=valid and prev_ret=NULL.
      Signed-off-by: David Sterba <dsterba@suse.com>
      bebb22c1
    • D
      btrfs: remove node and parent parameters from insert_state · c367602a
      Committed by David Sterba
      There's no caller left that would pass valid pointers to
      insert_state, so we can drop them.
      Signed-off-by: David Sterba <dsterba@suse.com>
      c367602a
    • D
      btrfs: add fast path for extent_state insertion · fb8f07d2
      Committed by David Sterba
      In two cases the exact location where to insert the extent state is
      known at call time, so we don't need the search in insert_state and
      can take a fast path that links the new state directly at the known
      location (sketched below).
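
      A minimal sketch of the fast path, where the signature is an
      assumption:

        /* Link the new state at a slot found by an earlier search,
         * skipping the rbtree walk entirely. */
        static void insert_state_fast(struct extent_io_tree *tree,
                                      struct extent_state *state,
                                      struct rb_node **node,
                                      struct rb_node *parent)
        {
                rb_link_node(&state->rb_node, parent, node);
                rb_insert_color(&state->rb_node, &tree->state);
        }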
      Signed-off-by: David Sterba <dsterba@suse.com>
      fb8f07d2
    • D
      btrfs: pass bits by value not by pointer for extent_state helpers · 6d92b304
      Committed by David Sterba
      The bits are passed by pointer to all extent state helpers for no
      apparent reason; the value is only read and never updated, so remove
      the indirection and pass it by value.  Also unify the type to u32
      where needed.
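
      Illustratively, the change turns signatures like the first one below
      into the second (the helper name here is just an example):

        /* before: bits behind a pointer that is only ever read */
        static void set_state_bits(struct extent_io_tree *tree,
                                   struct extent_state *state, u32 *bits);

        /* after: a plain u32 passed by value */
        static void set_state_bits(struct extent_io_tree *tree,
                                   struct extent_state *state, u32 bits);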
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d92b304
    • D
      btrfs: lift start and end parameters to callers of insert_state · cee51268
      Committed by David Sterba
      Let the callers of insert_state set up the extent state, to allow
      further simplification of the parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
      cee51268
    • D
      btrfs: open code rbtree search in insert_state · c7e118cf
      Committed by David Sterba
      The rbtree search is a known pattern and can be open coded, allowing
      us to remove tree_insert and do further cleanups (see the sketch
      below).
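
      A minimal sketch of the open-coded insertion, assuming the tree is
      keyed by the extent state's end offset:

        struct rb_node **node = &tree->state.rb_node;
        struct rb_node *parent = NULL;

        while (*node) {
                struct extent_state *entry;

                entry = rb_entry(*node, struct extent_state, rb_node);
                parent = *node;
                if (state->end < entry->start)
                        node = &(*node)->rb_left;
                else if (state->end > entry->end)
                        node = &(*node)->rb_right;
                else
                        return -EEXIST;  /* overlapping extent state */
        }
        rb_link_node(&state->rb_node, parent, node);
        rb_insert_color(&state->rb_node, &tree->state);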
      Signed-off-by: David Sterba <dsterba@suse.com>
      c7e118cf
    • D
      btrfs: open code rbtree search in split_state · 12c9cdda
      Committed by David Sterba
      Preparatory work to remove tree_insert from extent_io.c; the rbtree
      search loop is a known and simple pattern, so it can be open coded.
      Signed-off-by: David Sterba <dsterba@suse.com>
      12c9cdda
    • C
      btrfs: pass the btrfs_bio_ctrl to submit_one_bio · 722c82ac
      Committed by Christoph Hellwig
      submit_one_bio always works on the bio and compression flags from a
      btrfs_bio_ctrl structure.  Pass that structure explicitly and clean
      up the calling conventions by handling a NULL bio in submit_one_bio
      and by using the btrfs_bio_ctrl to pass the mirror number as well.
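
      A minimal sketch of the calling convention, where the struct layout
      is an assumption:

        static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
        {
                struct bio *bio = bio_ctrl->bio;

                /* Accept a NULL bio so callers don't have to check. */
                if (!bio)
                        return;
                /* ... submit using bio_ctrl->compress_type and
                 * bio_ctrl->mirror_num ... */
        }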
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      722c82ac
    • C
      btrfs: merge end_write_bio and flush_write_bio · 9845e5dd
      Committed by Christoph Hellwig
      Merge end_write_bio and flush_write_bio into a single
      submit_write_bio helper that either submits the bio or ends it if a
      negative errno was passed in.  This consolidates a lot of duplicated
      checks in the callers (see the sketch below).
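
      A minimal sketch of the merged helper, where the names and the exact
      signature are assumptions:

        static void submit_write_bio(struct btrfs_bio_ctrl *bio_ctrl, int ret)
        {
                struct bio *bio = bio_ctrl->bio;

                if (!bio)
                        return;
                if (ret) {
                        /* End the bio with the error instead of
                         * submitting it. */
                        ASSERT(ret < 0);
                        bio->bi_status = errno_to_blk_status(ret);
                        bio_endio(bio);
                        bio_ctrl->bio = NULL;
                } else {
                        submit_one_bio(bio_ctrl);
                }
        }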
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9845e5dd
    • C
      btrfs: don't use bio->bi_private to pass the inode to submit_one_bio · 2d5ac130
      Committed by Christoph Hellwig
      submit_one_bio is only used for page cache I/O, so the inode can be
      trivially derived from the first page in the bio.
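
      A minimal sketch of the derivation, valid only for page cache pages:

        /* The first page's mapping points back at the owning inode. */
        struct inode *inode = bio_first_page_all(bio)->mapping->host;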
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2d5ac130