1. 26 September 2022, 14 commits
    • btrfs: unify the lock/unlock extent variants · 570eb97b
      Josef Bacik committed
      We have two variants of lock/unlock extent, one set that takes a cached
      state, another that does not.  This is slightly annoying, and generally
      speaking there are only a few places where we don't have a cached state.
      Simplify this by making lock_extent/unlock_extent the only variant and
      make it take a cached state, then convert all the callers appropriately.
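
      As a rough illustration (approximate signatures, not the exact hunk), the
      unified helpers end up looking like this:

          int lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
                          struct extent_state **cached);
          int unlock_extent(struct extent_io_tree *tree, u64 start, u64 end,
                            struct extent_state **cached);

          /* typical caller conversion: pass a cached state, or NULL if none */
          lock_extent(&inode->io_tree, start, end, &cached_state);
          /* ... */
          unlock_extent(&inode->io_tree, start, end, &cached_state);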
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      570eb97b
    • btrfs: drop extent_changeset from set_extent_bit · 291bbb1e
      Josef Bacik committed
      The only place that sets extent_changeset is set_record_extent_bits;
      everywhere else sets it to NULL.  Drop this argument from
      set_extent_bit.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      291bbb1e
    • btrfs: remove failed_start argument from set_extent_bit · 994bcd1e
      Josef Bacik committed
      This is only used for internal locking related helpers, everybody else
      just passes in NULL.  I've changed set_extent_bit to __set_extent_bit
      and made it static, removed failed_start from set_extent_bit and have it
      call __set_extent_bit with a NULL failed_start, and I've moved some code
      down below the now static __set_extent_bit.
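
      A sketch of the resulting structure (parameter lists abbreviated, not the
      exact signatures):

          /* internal variant keeps failed_start for the locking helpers */
          static int __set_extent_bit(struct extent_io_tree *tree, u64 start,
                                      u64 end, u32 bits, u64 *failed_start,
                                      struct extent_state **cached_state);

          /* public variant: everybody else effectively passes NULL */
          int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
                             u32 bits, struct extent_state **cached_state)
          {
                  return __set_extent_bit(tree, start, end, bits, NULL,
                                          cached_state);
          }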
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      994bcd1e
    • btrfs: remove the wake argument from clear_extent_bits · dbbf4992
      Josef Bacik committed
      This is only used in the case that we are clearing EXTENT_LOCKED, so
      infer this value from the bits passed in instead of taking it as an
      argument.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dbbf4992
    • btrfs: drop exclusive_bits from set_extent_bit · c07d1004
      Josef Bacik committed
      This is only ever set if we have EXTENT_LOCKED set, so simply push this
      into the function itself and remove the function argument.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c07d1004
    • btrfs: convert the io_failure_tree to a plain rb_tree · 87c11705
      Josef Bacik committed
      We still have this oddity of stashing the io_failure_record in the
      extent state for the io_failure_tree, which is leftover from when we
      used to stuff private pointers in extent_io_trees.
      
      However, this doesn't make a lot of sense for the io failure records; we
      can simply use a normal rb_tree for this.  This will allow us to further
      simplify the extent_io_tree code by removing the io_failure_rec pointer
      from the extent state.
      
      Convert the io_failure_tree to an rb tree + spinlock in the inode, and
      then use our rb tree simple helpers to insert and find failed records.
      This greatly cleans up this code and makes it easier to separate out the
      extent_io_tree code.
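
      A sketch of the new layout (field and helper names approximate):

          struct io_failure_record {
                  struct rb_node rb_node;
                  u64 bytenr;     /* key: start of the failed range */
                  /* ... mirror/retry bookkeeping ... */
          };

          /* lookup keyed by file offset, under a spinlock in btrfs_inode */
          spin_lock(&inode->io_failure_lock);
          node = rb_simple_search(&inode->io_failure_tree, start);
          failrec = node ? rb_entry(node, struct io_failure_record, rb_node)
                         : NULL;
          spin_unlock(&inode->io_failure_lock);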
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      87c11705
    • btrfs: rename clean_io_failure and remove extraneous args · 0d0a762c
      Josef Bacik committed
      This is exported, so rename it to btrfs_clean_io_failure.  Additionally
      we are passing in the io trees and such from the inode, so instead of
      doing all that simply pass in the inode itself and get all the
      components we need directly inside of btrfs_clean_io_failure.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0d0a762c
    • btrfs: make fiemap more efficient and accurate reporting extent sharedness · ac3c0d36
      Filipe Manana committed
      The current fiemap implementation does not scale very well with the number
      of extents a file has. This is both because the main algorithm to find out
      the extents has a high algorithmic complexity and because for each extent
      we have to check if it's shared. This second part, checking if an extent
      is shared, is significantly improved by the two previous patches in this
      patchset, while the first part is improved by this specific patch. Every
      now and then we get reports from users mentioning fiemap is too slow or
      even unusable for files with a very large number of extents, such as the
      two recent reports referred to by the Link tags at the bottom of this
      change log.
      
      To understand why the part of finding which extents a file has is very
      inefficient, consider the example of doing a full ranged fiemap against
      a file that has over 100K extents (normal for example for a file with
      more than 10G of data and using compression, which limits the extent size
      to 128K). When we enter fiemap at extent_fiemap(), the following happens:
      
      1) Before entering the main loop, we call get_extent_skip_holes() to get
         the first extent map. This leads us to btrfs_get_extent_fiemap(), which
         in turn calls btrfs_get_extent(), to find the first extent map that
         covers the file range [0, LLONG_MAX).
      
         btrfs_get_extent() will first search the inode's extent map tree, to
         see if we have an extent map there that covers the range. If it does
         not find one, then it will search the inode's subvolume b+tree for a
         fitting file extent item. After finding the file extent item, it will
         allocate an extent map, fill it in with information extracted from the
         file extent item, and add it to the inode's extent map tree (which
         requires a search for insertion in the tree).
      
      2) Then we enter the main loop at extent_fiemap(), emit the details of
         the extent, and call again get_extent_skip_holes(), with a start
         offset matching the end of the extent map we previously processed.
      
         We end up at btrfs_get_extent() again, will search the extent map tree
         and then search the subvolume b+tree for a file extent item if we could
         not find an extent map in the extent tree. We allocate an extent map,
         fill it in with the details in the file extent item, and then insert
         it into the extent map tree (yet another search in this tree).
      
      3) The second step is repeated over and over, until we have processed the
         whole file range. Each iteration ends at btrfs_get_extent(), which
         does a red black tree search on the extent map tree, then searches the
         subvolume b+tree, allocates an extent map and then does another search
         in the extent map tree in order to insert the extent map.
      
         In the best scenario we have all the extent maps already in the extent
         tree, and so for each extent we do a single search on a red black tree,
         so we have a complexity of O(n log n).
      
         In the worst scenario we don't have any extent map already loaded in
         the extent map tree, or have very few already there. In this case the
         complexity is much higher since we do:
      
         - A red black tree search on the extent map tree, which has O(log n)
           complexity, initially very fast since the tree is empty or very
           small, but as we end up allocating extent maps and adding them to
           the tree when we don't find them there, each subsequent search on
           the tree gets slower, since it's getting bigger and bigger after
           each iteration.
      
         - A search on the subvolume b+tree, also O(log n) complexity, but it
           has items for all inodes in the subvolume, not just items for our
           inode. Plus on a filesystem with concurrent operations on other
           inodes, we can block doing the search due to lock contention on
           b+tree nodes/leaves.
      
         - Allocate an extent map - this can block, and can also fail if we
           are under serious memory pressure.
      
         - Do another search on the extent maps red black tree, with the goal
           of inserting the extent map we just allocated. Again, after every
           iteration this tree is getting bigger by 1 element, so after many
           iterations the searches are slower and slower.
      
         - We will not need the allocated extent map anymore, so it's pointless
           to add it to the extent map tree. It's just wasting time and memory.
      
         In short we end up searching the extent map tree multiple times, on a
         tree that is growing bigger and bigger after each iteration. And
         besides that we visit the same leaf of the subvolume b+tree many times,
         since a leaf with the default size of 16K can easily have more than 200
         file extent items.
      
      This is very inefficient overall. This patch changes the algorithm to
      instead iterate over the subvolume b+tree, visiting each leaf only once,
      and only searching in the extent map tree for file ranges that have holes
      or prealloc extents, in order to figure out if we have delalloc there.
      It will never allocate an extent map and add it to the extent map tree.
      This is very similar to what was previously done for the lseek's hole and
      data seeking features.
      
      Also, the current implementation relying on extent maps for figuring out
      which extents we have is not correct. This is because extent maps can be
      merged even if they represent different extents - we do this to minimize
      memory utilization and keep extent map trees smaller. For example if we
      have two extents that are contiguous on disk, once we load the two extent
      maps, they get merged into a single one - however if only one of the
      extents is shared, we end up reporting both as shared or both as not
      shared, which is incorrect.
      
      This reproducer triggers that bug:
      
          $ cat fiemap-bug.sh
          #!/bin/bash
      
          DEV=/dev/sdj
          MNT=/mnt/sdj
      
          mkfs.btrfs -f $DEV
          mount $DEV $MNT
      
          # Create a file with two 256K extents.
          # Since there is no other write activity, they will be contiguous,
          # and their extent maps merged, despite having two distinct extents.
          xfs_io -f -c "pwrite -S 0xab 0 256K" \
                    -c "fsync" \
                    -c "pwrite -S 0xcd 256K 256K" \
                    -c "fsync" \
                    $MNT/foo
      
          # Now clone only the second extent into another file.
          xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
      
          # Filefrag will report a single 512K extent, and say it's not shared.
          echo
          filefrag -v $MNT/foo
      
          umount $MNT
      
      Running the reproducer:
      
          $ ./fiemap-bug.sh
          wrote 262144/262144 bytes at offset 0
          256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
          wrote 262144/262144 bytes at offset 262144
          256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
          linked 262144/262144 bytes at offset 0
          256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
      
          Filesystem type is: 9123683e
          File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
           ext:     logical_offset:        physical_offset: length:   expected: flags:
             0:        0..     127:       3328..      3455:    128:             last,eof
          /mnt/sdj/foo: 1 extent found
      
      We end up reporting that we have a single 512K extent that is not shared,
      however we have two 256K extents, and the second one is shared. Changing
      the reproducer to instead clone the first extent into file 'bar' makes us
      report a single 512K extent that is shared, which is also incorrect since
      we have two 256K extents and only the first one is shared.
      
      This patch is part of a larger patchset that is comprised of the following
      patches:
      
          btrfs: allow hole and data seeking to be interruptible
          btrfs: make hole and data seeking a lot more efficient
          btrfs: remove check for impossible block start for an extent map at fiemap
          btrfs: remove zero length check when entering fiemap
          btrfs: properly flush delalloc when entering fiemap
          btrfs: allow fiemap to be interruptible
          btrfs: rename btrfs_check_shared() to a more descriptive name
          btrfs: speedup checking for extent sharedness during fiemap
          btrfs: skip unnecessary extent buffer sharedness checks during fiemap
          btrfs: make fiemap more efficient and accurate reporting extent sharedness
      
      The patchset was tested on a machine running a non-debug kernel (Debian's
      default config) and compared the tests below on a branch without the
      patchset versus the same branch with the whole patchset applied.
      
      The following test for a large compressed file without holes:
      
          $ cat fiemap-perf-test.sh
          #!/bin/bash
      
          DEV=/dev/sdi
          MNT=/mnt/sdi
      
          mkfs.btrfs -f $DEV
          mount -o compress=lzo $DEV $MNT
      
          # 40G gives 327680 128K file extents (due to compression).
          xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar
      
          umount $MNT
          mount -o compress=lzo $DEV $MNT
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata not cached)"
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata cached)"
      
          umount $MNT
      
      Before patchset:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 3597 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 2107 milliseconds (metadata cached)
      
      After patchset:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 1214 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 684 milliseconds (metadata cached)
      
      That's a speedup of about 3x for both cases (no metadata cached and all
      metadata cached).
      
      The test provided by Pavel (first Link tag at the bottom), which uses
      files with a large number of holes, was also used to measure the gains,
      and it consists of a small C program and a shell script to invoke it.
      The C program is the following:
      
          $ cat pavels-test.c
          #include <stdio.h>
          #include <unistd.h>
          #include <stdlib.h>
          #include <fcntl.h>
      
          #include <sys/stat.h>
          #include <sys/time.h>
          #include <sys/ioctl.h>
      
          #include <linux/fs.h>
          #include <linux/fiemap.h>
      
          #define FILE_INTERVAL (1<<13) /* 8Kb */
      
          long long interval(struct timeval t1, struct timeval t2)
          {
              long long val = 0;
              val += (t2.tv_usec - t1.tv_usec);
              val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
              return val;
          }
      
          int main(int argc, char **argv)
          {
              struct fiemap fiemap = {};
              struct timeval t1, t2;
              char data = 'a';
              struct stat st;
              int fd, off, file_size = FILE_INTERVAL;
      
              if (argc != 3 && argc != 2) {
                      printf("usage: %s <path> [size]\n", argv[0]);
                      return 1;
              }
      
              if (argc == 3)
                      file_size = atoi(argv[2]);
              if (file_size < FILE_INTERVAL)
                      file_size = FILE_INTERVAL;
              file_size -= file_size % FILE_INTERVAL;
      
              fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
              if (fd < 0) {
                  perror("open");
                  return 1;
              }
      
              for (off = 0; off < file_size; off += FILE_INTERVAL) {
                  if (pwrite(fd, &data, 1, off) != 1) {
                      perror("pwrite");
                      close(fd);
                      return 1;
                  }
              }
      
              if (ftruncate(fd, file_size)) {
                  perror("ftruncate");
                  close(fd);
                  return 1;
              }
      
              if (fstat(fd, &st) < 0) {
                  perror("fstat");
                  close(fd);
                  return 1;
              }
      
              printf("size: %ld\n", st.st_size);
              printf("actual size: %ld\n", st.st_blocks * 512);
      
              fiemap.fm_length = FIEMAP_MAX_OFFSET;
              gettimeofday(&t1, NULL);
              if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
                  perror("fiemap");
                  close(fd);
                  return 1;
              }
              gettimeofday(&t2, NULL);
      
              printf("fiemap: fm_mapped_extents = %d\n",
                     fiemap.fm_mapped_extents);
              printf("time = %lld us\n", interval(t1, t2));
      
              close(fd);
              return 0;
          }
      
          $ gcc -o pavels-test pavels-test.c
      
      And the wrapper shell script:
      
          $ cat fiemap-pavels-test.sh
      
          #!/bin/bash
      
          DEV=/dev/sdi
          MNT=/mnt/sdi
      
          mkfs.btrfs -f -O no-holes $DEV
          mount $DEV $MNT
      
          echo
          echo "*********** 256M ***********"
          echo
      
          ./pavels-test $MNT/testfile $((1 << 28))
          echo
          ./pavels-test $MNT/testfile $((1 << 28))
      
          echo
          echo "*********** 512M ***********"
          echo
      
          ./pavels-test $MNT/testfile $((1 << 29))
          echo
          ./pavels-test $MNT/testfile $((1 << 29))
      
          echo
          echo "*********** 1G ***********"
          echo
      
          ./pavels-test $MNT/testfile $((1 << 30))
          echo
          ./pavels-test $MNT/testfile $((1 << 30))
      
          umount $MNT
      
      Running his reproducer before applying the patchset:
      
          *********** 256M ***********
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 4003133 us
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 4895330 us
      
          *********** 512M ***********
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 30123675 us
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 33450934 us
      
          *********** 1G ***********
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 131072
          time = 224924074 us
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 131072
          time = 217239242 us
      
      Running it after applying the patchset:
      
          *********** 256M ***********
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 29475 us
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 29307 us
      
          *********** 512M ***********
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 58996 us
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 59115 us
      
          *********** 1G ***********
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 116251
          time = 124141 us
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 131072
          time = 119387 us
      
      The speedup is massive, both on the first fiemap call and on the second
      one as well, as his test creates files with many holes and small extents
      (every extent follows a hole and precedes another hole).
      
      For the 256M file we go from 4 seconds down to 29 milliseconds in the
      first run, and then from 4.9 seconds down to 29 milliseconds again in the
      second run, a speedup of 138x and 169x, respectively.
      
      For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
      first run, and then from 33.5 seconds down to 59 milliseconds again in the
      second run, a speedup of 510x and 568x, respectively.
      
      For the 1G file, we go from 225 seconds down to 124 milliseconds in the
      first run, and then from 217 seconds down to 119 milliseconds in the
      second run, a speedup of 1815x and 1824x, respectively.
      Reported-by: NPavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
      Reported-by: NDominique MARTINET <dominique.martinet@atmark-techno.com>
      Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ac3c0d36
    • btrfs: properly flush delalloc when entering fiemap · 33a86cfa
      Filipe Manana committed
      If the flag FIEMAP_FLAG_SYNC is passed to fiemap, it means all delalloc
      should be flushed and writeback complete. We call the generic helper
      fiemap_prep() which does a filemap_write_and_wait() in case that flag is
      given, however that is not enough if we have compression, because a
      single filemap_fdatawrite_range() only starts compression (in an async
      thread) and therefore returns before the compression is done and
      writeback has started.
      
      So make btrfs_fiemap(), actually wait for all writeback to start and
      complete if FIEMAP_FLAG_SYNC is set. We start and wait for writeback
      on the whole possible file range, from 0 to LLONG_MAX, because that is
      what the generic code at fiemap_prep() does.
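
      A sketch of the idea (not the exact hunk):

          if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
                  /* start writeback (compression included) and wait for the
                   * ordered extents on the whole possible file range */
                  ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
                  if (ret)
                          return ret;
          }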
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      33a86cfa
    • btrfs: give struct btrfs_bio a real end_io handler · 917f32a2
      Christoph Hellwig committed
      Currently btrfs_bio end I/O handling is a bit of a mess.  The bi_end_io
      handler and bi_private pointer of the embedded struct bio are both used
      to handle the completion of the high-level btrfs_bio and for the I/O
      completion for the low-level device that the embedded bio ends up being
      sent to.
      
      To support this bi_end_io and bi_private are saved into the
      btrfs_io_context structure and then restored after the bio sent to the
      underlying device has completed the actual I/O.
      
      Untangle this by adding an end I/O handler and private data to struct
      btrfs_bio for the high-level btrfs_bio based completions, and leave the
      actual bio bi_end_io handler and bi_private pointer entirely to the
      low-level device I/O.
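
      The resulting split of responsibilities, roughly (field types and layout
      approximate):

          struct btrfs_bio {
                  /* ... csums, mirror_num, file_offset, ... */

                  /* high-level completion of the btrfs_bio itself */
                  btrfs_bio_end_io_t end_io;
                  void *private;

                  /* low-level device completion stays with the embedded bio,
                   * via bio.bi_end_io and bio.bi_private */
                  struct bio bio;
          };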
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      917f32a2
    • btrfs: pass the operation to btrfs_bio_alloc · 6b42f5e3
      Christoph Hellwig committed
      Pass the operation to btrfs_bio_alloc, matching what bio_alloc_bioset
      does.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6b42f5e3
    • btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path · 52b029f4
      Ethan Lien committed
      After we copied data to the page cache in buffered I/O, we
      1. Insert an EXTENT_UPTODATE state into the inode's io_tree, by
         endio_readpage_release_extent(), set_extent_delalloc() or
         set_extent_defrag().
      2. Set page uptodate before we unlock the page.
      
      But the only place we check io_tree's EXTENT_UPTODATE state is in
      btrfs_do_readpage(). We know we enter btrfs_do_readpage() only when we
      have a non-uptodate page, so it is unnecessary to set EXTENT_UPTODATE.
      
      For example, when performing a buffered random read:
      
      	fio --rw=randread --ioengine=libaio --direct=0 --numjobs=4 \
      		--filesize=32G --size=4G --bs=4k --name=job \
      		--filename=/mnt/file --name=job
      
      Then check how many extent_state in io_tree:
      
      	cat /proc/slabinfo | grep btrfs_extent_state | awk '{print $2}'
      
      w/o this patch, we got 640567 btrfs_extent_state.
      w/  this patch, we got    204 btrfs_extent_state.
      
      Maintaining such a big tree brings overhead since every I/O needs to insert
      EXTENT_LOCKED, insert EXTENT_UPTODATE, then remove EXTENT_LOCKED. And in
      every insert or remove, we need to lock io_tree, do tree search, alloc or
      dealloc extent states. By removing unnecessary EXTENT_UPTODATE, we keep
      io_tree in a minimal size and reduce overhead when performing buffered I/O.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NRobbie Ko <robbieko@synology.com>
      Signed-off-by: NEthan Lien <ethanlien@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      52b029f4
    • btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent() · d1f68ba0
      Omar Sandoval committed
      btrfs_insert_file_extent() is only ever used to insert holes, so rename
      it and remove the redundant parameters.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NOmar Sandoval <osandov@osandov.com>
      Signed-off-by: NSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d1f68ba0
    • btrfs: add lockdep annotations for the ordered extents wait event · 5f4403e1
      Ioannis Angelakopoulos committed
      This wait event is very similar to the pending ordered wait event in the
      sense that it occurs in a different context than the condition signaling
      for the event. The signaling occurs in btrfs_remove_ordered_extent()
      while the wait event is implemented in btrfs_start_ordered_extent() in
      fs/btrfs/ordered-data.c
      
      However, in this case a thread must not acquire the lockdep map for the
      ordered extents wait event when the ordered extent is related to a free
      space inode. That is because lockdep creates dependencies between locks
      acquired both in execution paths related to normal inodes and paths
      related to free space inodes, thus leading to false positives.
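
      A sketch of the annotation pattern (the lockdep map's name and exact
      placement here are assumptions, not the patch's code):

          /* tell lockdep we may wait for ordered extent completion here, but
           * skip the annotation for free space inodes to avoid the false
           * positives described above */
          if (!btrfs_is_free_space_inode(inode)) {
                  lock_map_acquire(&fs_info->btrfs_ordered_extent_map);
                  lock_map_release(&fs_info->btrfs_ordered_extent_map);
          }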
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NIoannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5f4403e1
  2. 05 September 2022, 1 commit
    • btrfs: zoned: fix API misuse of zone finish waiting · d5b81ced
      Naohiro Aota committed
      The commit 2ce543f4 ("btrfs: zoned: wait until zone is finished when
      allocation didn't progress") implemented a zone finish waiting mechanism
      in the write path of zoned mode. However, using
      wait_var_event()/wake_up_all() on fs_info->zone_finish_wait is wrong:
      wait_var_event() just hangs because no one ever wakes it up once it goes
      to sleep.
      
      Instead, we can simply use wait_on_bit_io() and clear_and_wake_up_bit()
      on fs_info->flags with a proper barrier installed.
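
      Roughly, the fixed wait/wake pairing (a sketch, not the exact hunks):

          /* allocator side: sleep until somebody finishes a zone */
          wait_on_bit_io(&fs_info->flags, BTRFS_FS_NEED_ZONE_FINISH,
                         TASK_UNINTERRUPTIBLE);

          /* zone finishing side: clear the bit (with the implied barrier)
           * and wake up all waiters */
          clear_and_wake_up_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags);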
      
      Fixes: 2ce543f4 ("btrfs: zoned: wait until zone is finished when allocation didn't progress")
      CC: stable@vger.kernel.org # 5.16+
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d5b81ced
  3. 23 August 2022, 1 commit
    • btrfs: don't allow large NOWAIT direct reads · 79d3d1d1
      Josef Bacik committed
      Dylan and Jens reported a problem where they had an io_uring test that
      was returning short reads, and bisected it to ee5b46a3 ("btrfs:
      increase direct io read size limit to 256 sectors").
      
      The root cause is their test was doing larger reads via io_uring with
      NOWAIT and async.  This was triggering a page fault during the direct
      read, however the first page was able to work just fine and thus we
      submitted a 4k read for a larger iocb.
      
      Btrfs allows for partial IO's in this case specifically because we don't
      allow page faults, and thus we'll attempt to do any io that we can,
      submit what we could, come back and fault in the rest of the range and
      try to do the remaining IO.
      
      However for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
      partial dio is done, which is incorrect.  In the sync case we can exit
      the iomap code, submit more io's, and return with the amount of IO we
      were able to complete successfully.
      
      We were always doing short reads in this case, but for NOWAIT we were
      getting saved by the fact that we were limiting direct reads to
      sectorsize, and if we were larger than that we would return EAGAIN.
      
      Fix the regression by simply returning EAGAIN in the NOWAIT case with
      larger reads, that way io_uring can retry and get the larger IO and have
      the fault logic handle everything properly.
      
      This still leaves the AIO short read case, but that existed before this
      change.  The way to properly fix this would be to handle partial iocb
      completions, but that's a lot of work, for now deal with the regression
      in the most straightforward way possible.
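
      Conceptually the fix amounts to a check like the following in the direct
      read path (placement and exact condition are approximations):

          if ((iocb->ki_flags & IOCB_NOWAIT) &&
              iov_iter_count(to) > fs_info->sectorsize)
                  return -EAGAIN; /* let io_uring retry without NOWAIT */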
      Reported-by: NDylan Yudaken <dylany@fb.com>
      Fixes: ee5b46a3 ("btrfs: increase direct io read size limit to 256 sectors")
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      79d3d1d1
  4. 03 August 2022, 1 commit
  5. 26 July 2022, 4 commits
  6. 25 July 2022, 19 commits
    • btrfs: simplify error handling in btrfs_lookup_dentry · fc8b235f
      Nikolay Borisov committed
      In btrfs_lookup_dentry, releasing the reference of the sub_root and
      running the orphan cleanup should only happen if the dentry found actually
      represents a subvolume. This can only be true in the 'else' branch:
      otherwise either fixup_tree_root_location returned an ENOENT error, in
      which case sub_root wouldn't have been changed, or we got a different
      errno, which means btrfs_get_fs_root couldn't have executed successfully,
      again meaning sub_root will equal root. So simplify all the branches
      by moving the code into the 'else'.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fc8b235f
    • btrfs: zoned: wait until zone is finished when allocation didn't progress · 2ce543f4
      Naohiro Aota committed
      When the allocated position doesn't progress, we cannot submit IOs to
      finish a block group, but there should be ongoing IOs that will finish a
      block group. So, in that case, we wait for a zone to be finished and retry
      the allocation after that.
      
      Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH for fs_info->flags to
      indicate that a zone needs to be finished before we can proceed. The flag
      is set when the allocator detects it cannot activate a new block group,
      and it is cleared once a zone is finished.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2ce543f4
    • btrfs: zoned: write out partially allocated region · 898793d9
      Naohiro Aota committed
      cow_file_range() works in an all-or-nothing way: if it fails to allocate
      an extent for a part of the given region, it gives up the whole region,
      including the successfully allocated parts. On top of cow_file_range(),
      run_delalloc_zoned() writes data for the region only when it successfully
      allocates the whole region.

      This all-or-nothing allocation and write-out is problematic when the
      available space in all the block groups gets tight under the active zone
      restriction. btrfs_reserve_extent() tries hard to utilize the space left
      in the active block groups, but finally gives up and fails with -ENOSPC.
      However, if we send IOs for the successfully allocated region, we can
      finish a zone and continue the rest of the allocation on a newly
      allocated block group.

      This patch implements the partial write-out for run_delalloc_zoned().
      With this patch applied, cow_file_range() returns -EAGAIN to tell the
      caller to do something to make further allocation progress, and reports
      the successfully allocated region via done_offset. Furthermore, the zoned
      extent allocator returns -EAGAIN to tell cow_file_range() to go back to
      the caller side.
      
      Actually, we still need to wait for an IO to complete to continue the
      allocation. The next patch implements that part.
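
      A simplified sketch of how a caller consumes done_offset (illustrative
      only; cow_file_range() arguments are abbreviated and submit_and_wait() is
      a hypothetical helper):

          u64 done_offset = end;
          int ret;

          ret = cow_file_range(inode, locked_page, start, end, &done_offset);
          while (ret == -EAGAIN) {
                  /* write out [start, done_offset] so a zone can be finished,
                   * wait for it, then retry the remaining range */
                  submit_and_wait(inode, start, done_offset);
                  start = done_offset + 1;
                  ret = cow_file_range(inode, locked_page, start, end,
                                       &done_offset);
          }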
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      898793d9
    • btrfs: convert count_max_extents() to use fs_info->max_extent_size · 7d7672bc
      Naohiro Aota committed
      If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the number
      of extents needed, btrfs releases too much of the metadata reservation on
      its way to write out the data.
      
      Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size,
      convert count_max_extents() to use it instead, and fix the calculation of
      the metadata reservation.
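
      The converted helper then looks roughly like this (a sketch, assuming the
      rounding-up division stays the same):

          static inline u32 count_max_extents(struct btrfs_fs_info *fs_info,
                                              u64 size)
          {
                  return div_u64(size + fs_info->max_extent_size - 1,
                                 fs_info->max_extent_size);
          }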
      
      CC: stable@vger.kernel.org # 5.12+
      Fixes: d8e3fb10 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7d7672bc
    • btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size · f7b12a62
      Naohiro Aota committed
      On a zoned filesystem, data write-out is limited by max_zone_append_size,
      and a large ordered extent is split according to the size of a bio. OTOH,
      the number of extents to be written is calculated using
      BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
      metadata bytes to update and/or create the metadata items.
      
      The metadata reservation is done at e.g. btrfs_buffered_write() and then
      released as the estimation changes. Thus, if the number of extents
      increases massively, the reserved metadata can run out.

      Such an increase in the number of extents easily occurs on a zoned
      filesystem if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size, and it causes
      the following warning on a small RAM environment with metadata
      over-commit disabled (in the following patch).
      
      [75721.498492] ------------[ cut here ]------------
      [75721.505624] BTRFS: block rsv 1 returned -28
      [75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G        W         5.18.0-rc2-BTRFS-ZNS+ #109
      [75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
      [75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
      [75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
      [75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
      [75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
      [75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
      [75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
      [75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
      [75721.701878] FS:  0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
      [75721.712601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
      [75721.730499] Call Trace:
      [75721.735166]  <TASK>
      [75721.739886]  btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
      [75721.747545]  ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
      [75721.756145]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.762852]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.769520]  ? push_leaf_left+0x420/0x620 [btrfs]
      [75721.776431]  ? memcpy+0x4e/0x60
      [75721.781931]  split_leaf+0x433/0x12d0 [btrfs]
      [75721.788392]  ? btrfs_get_token_32+0x580/0x580 [btrfs]
      [75721.795636]  ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
      [75721.803759]  ? leaf_space_used+0x15d/0x1a0 [btrfs]
      [75721.811156]  btrfs_search_slot+0x1bc3/0x2790 [btrfs]
      [75721.818300]  ? lock_downgrade+0x7c0/0x7c0
      [75721.824411]  ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
      [75721.832456]  ? split_leaf+0x12d0/0x12d0 [btrfs]
      [75721.839149]  ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
      [75721.846945]  ? free_extent_buffer+0x13/0x20 [btrfs]
      [75721.853960]  ? btrfs_release_path+0x4b/0x190 [btrfs]
      [75721.861429]  btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
      [75721.869313]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.876085]  ? lock_release+0x552/0xf80
      [75721.881957]  ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
      [75721.888886]  ? __kasan_check_write+0x14/0x20
      [75721.895152]  ? do_raw_read_unlock+0x44/0x80
      [75721.901323]  ? _raw_write_lock_irq+0x60/0x80
      [75721.907983]  ? btrfs_global_root+0xb9/0xe0 [btrfs]
      [75721.915166]  ? btrfs_csum_root+0x12b/0x180 [btrfs]
      [75721.921918]  ? btrfs_get_global_root+0x820/0x820 [btrfs]
      [75721.929166]  ? _raw_write_unlock+0x23/0x40
      [75721.935116]  ? unpin_extent_cache+0x1e3/0x390 [btrfs]
      [75721.942041]  btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
      [75721.949906]  ? try_to_wake_up+0x30/0x14a0
      [75721.955700]  ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
      [75721.962661]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.969111]  ? lock_acquire+0x41b/0x4c0
      [75721.974982]  finish_ordered_fn+0x15/0x20 [btrfs]
      [75721.981639]  btrfs_work_helper+0x1af/0xa80 [btrfs]
      [75721.988184]  ? _raw_spin_unlock_irq+0x28/0x50
      [75721.994643]  process_one_work+0x815/0x1460
      [75722.000444]  ? pwq_dec_nr_in_flight+0x250/0x250
      [75722.006643]  ? do_raw_spin_trylock+0xbb/0x190
      [75722.013086]  worker_thread+0x59a/0xeb0
      [75722.018511]  kthread+0x2ac/0x360
      [75722.023428]  ? process_one_work+0x1460/0x1460
      [75722.029431]  ? kthread_complete_and_exit+0x30/0x30
      [75722.036044]  ret_from_fork+0x22/0x30
      [75722.041255]  </TASK>
      [75722.045047] irq event stamp: 0
      [75722.049703] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
      [75722.067533] softirqs last  enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
      [75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [75722.085335] ---[ end trace 0000000000000000 ]---
      
      To fix the estimation, we need to introduce fs_info->max_extent_size to
      replace BTRFS_MAX_EXTENT_SIZE, which allows setting different sizes for
      regular vs zoned filesystems.

      Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On a
      zoned filesystem, it is set to fs_info->max_zone_append_size.
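
      A sketch of that initialization (placement approximate):

          /* default for regular filesystems */
          fs_info->max_extent_size = BTRFS_MAX_EXTENT_SIZE;

          /* on zoned filesystems an extent cannot exceed the zone append limit */
          if (btrfs_is_zoned(fs_info))
                  fs_info->max_extent_size = fs_info->max_zone_append_size;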
      
      CC: stable@vger.kernel.org # 5.12+
      Fixes: d8e3fb10 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f7b12a62
    • btrfs: replace kmap_atomic() with kmap_local_page() · 4cb2e5e8
      Fabio M. De Francesco committed
      kmap_atomic() is being deprecated in favor of kmap_local_page() where it
      is feasible. With kmap_local_page() mappings are per thread, CPU local,
      and not globally visible.
      
      The last use of kmap_atomic is in inode.c where the context is atomic [1]
      and can be safely replaced by kmap_local_page.
      
      Tested with xfstests on a QEMU + KVM 32-bits VM with 4GB RAM and booting a
      kernel with HIGHMEM64GB enabled.
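
      The conversion follows the usual pattern (illustrative, not the exact
      hunk):

          char *kaddr;

          kaddr = kmap_local_page(page);      /* was: kmap_atomic(page) */
          memcpy(kaddr + offset, src, len);
          kunmap_local(kaddr);                /* was: kunmap_atomic(kaddr) */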
      
      [1] https://lore.kernel.org/linux-btrfs/20220601132545.GM20633@twin.jikos.cz/
      Suggested-by: NIra Weiny <ira.weiny@intel.com>
      Reviewed-by: NIra Weiny <ira.weiny@intel.com>
      Signed-off-by: NFabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4cb2e5e8
    • btrfs: switch btrfs_block_rsv::failfast to bool · 710d5921
      David Sterba committed
      Use a simple bool type for the block reserve failfast status; a short was
      used there to save space compared to the original int, but there's no
      reason for that.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      710d5921
    • btrfs: do not return errors from btrfs_submit_dio_bio · 37899117
      Christoph Hellwig committed
      Always consume the bio and call the end_io handler on error instead of
      returning an error and letting the caller handle it.  This matches what
      the block layer submission and the other btrfs bio submission handlers do
      and avoids any confusion on who needs to handle errors.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      37899117
    • btrfs: handle allocation failure in btrfs_wq_submit_bio gracefully · ea1f0ced
      Christoph Hellwig committed
      btrfs_wq_submit_bio is used for writeback under memory pressure.
      Instead of failing the I/O when we can't allocate the async_submit_bio,
      just punt back to the synchronous submission path.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ea1f0ced
    • btrfs: simplify sync/async submission in btrfs_submit_data_write_bio · 82443fd5
      Christoph Hellwig committed
      btrfs_submit_data_write_bio special cases the reloc root because the
      checksums are preloaded, but only does so for the !sync case.  The sync
      case can't happen for data relocation, but just handling it more generally
      significantly simplifies the logic.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      82443fd5
    • btrfs: do not return errors from btrfs_map_bio · 1a722d8f
      Christoph Hellwig committed
      Always consume the bio and call the end_io handler on error instead of
      returning an error and letting the caller handle it.  This matches
      what the block layer submission does and avoids any confusion on who
      needs to handle errors.
      
      As this requires touching all the callers, rename the function to
      btrfs_submit_bio, which describes the functionality much better.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1a722d8f
    • btrfs: clean up chained assignments · c1867eb3
      David Sterba committed
      The chained assignments may be convenient to write, but they make
      readability a bit worse as it's too easy to overlook that several values
      are set on the same line, while this is rather an exception.  Making it
      consistent everywhere avoids surprises.
      
      The pattern where inode times are initialized reuses the first value and
      the order is mtime, ctime. In other blocks the assignments are expanded
      so the order of variables is similar to the neighboring code.
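
      A minimal illustration of the pattern (not a specific hunk from the
      patch):

          /* before: chained, easy to overlook that two values are set */
          inode->i_mtime = inode->i_ctime = current_time(inode);

          /* after: expanded, mtime first and ctime reusing its value */
          inode->i_mtime = current_time(inode);
          inode->i_ctime = inode->i_mtime;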
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c1867eb3
    • btrfs: replace unnecessary goto with direct return at cow_file_range() · aaafa1eb
      Naohiro Aota committed
      The 'goto out' statements in the exit block of cow_file_range() are not
      necessary and just jump back. Replace them with direct returns, while
      still keeping 'goto out' in the main code.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ keep goto in the main code, update changelog ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      aaafa1eb
    • btrfs: fix error handling of fallback uncompress write · 71aa147b
      Naohiro Aota committed
      When cow_file_range() fails in the middle of the allocation loop, it
      unlocks the pages but leaves the ordered extents intact. Thus, we need
      to call btrfs_cleanup_ordered_extents() to finish the created ordered
      extents.
      
      Also, we need to call end_extent_writepage() if locked_page is available
      because btrfs_cleanup_ordered_extents() never processes the region on
      the locked_page.
      
      Furthermore, we need to set the mapping as error if locked_page is
      unavailable before unlocking the pages, so that the errno is properly
      propagated to the user space.
      
      CC: stable@vger.kernel.org # 5.18+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      71aa147b
    • btrfs: extend btrfs_cleanup_ordered_extents for NULL locked_page · 99826e4c
      Naohiro Aota committed
      btrfs_cleanup_ordered_extents() assumes locked_page to be non-NULL, so it
      is not usable for submit_uncompressed_range() which can have NULL
      locked_page.
      
      Add support for the locked_page == NULL case. Also, rewrite the
      redundant "page_offset(locked_page)".
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      99826e4c
    • btrfs: ensure pages are unlocked on cow_file_range() failure · 9ce7466f
      Naohiro Aota committed
      There is a hung_task report on zoned btrfs like below.
      
      https://github.com/naota/linux/issues/59
      
        [726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
        [726.329839]       Not tainted 5.16.0-rc1+ #1
        [726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [726.331603] task:rocksdb:high0   state:D stack:    0 pid:11085 ppid: 11082 flags:0x00000000
        [726.331608] Call Trace:
        [726.331611]  <TASK>
        [726.331614]  __schedule+0x2e5/0x9d0
        [726.331622]  schedule+0x58/0xd0
        [726.331626]  io_schedule+0x3f/0x70
        [726.331629]  __folio_lock+0x125/0x200
        [726.331634]  ? find_get_entries+0x1bc/0x240
        [726.331638]  ? filemap_invalidate_unlock_two+0x40/0x40
        [726.331642]  truncate_inode_pages_range+0x5b2/0x770
        [726.331649]  truncate_inode_pages_final+0x44/0x50
        [726.331653]  btrfs_evict_inode+0x67/0x480
        [726.331658]  evict+0xd0/0x180
        [726.331661]  iput+0x13f/0x200
        [726.331664]  do_unlinkat+0x1c0/0x2b0
        [726.331668]  __x64_sys_unlink+0x23/0x30
        [726.331670]  do_syscall_64+0x3b/0xc0
        [726.331674]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [726.331677] RIP: 0033:0x7fb9490a171b
        [726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
        [726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
        [726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
        [726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
        [726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
        [726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
        [726.331693]  </TASK>
      
      While debugging the issue, we found that running fstests generic/551 on a
      5GB non-zoned null_blk device in emulated zoned mode also hit a similar
      hang.
      
      Also, we can reproduce the same symptom with an error injected
      cow_file_range() setup.
      
      The hang occurs when cow_file_range() fails in the middle of
      allocation. cow_file_range() called from do_allocation_zoned() can
      split the given region ([start, end]) for allocation depending on
      current block group usage. When btrfs can allocate bytes for one part
      of the split regions but fails for the other region (e.g. because of
      -ENOSPC), we return the error leaving the pages in the succeeded regions
      locked. Technically, this occurs only when @unlock == 0. Otherwise, we
      unlock the pages in an allocated region after creating an ordered
      extent.
      
      Considering the callers of cow_file_range(unlock=0) won't write out
      the pages, we can unlock the pages on error exit from
      cow_file_range(). So, we can ensure all the pages except @locked_page
      are unlocked in the error case.
      
      In summary, cow_file_range now behaves like this:
      
      - page_started == 1 (return value)
        - All the pages are unlocked. IO is started.
      - unlock == 1
        - All the pages except @locked_page are unlocked in any case
      - unlock == 0
        - On success, all the pages are locked for writing out them
        - On failure, all the pages except @locked_page are unlocked
      
      Fixes: 42c01100 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
      CC: stable@vger.kernel.org # 5.12+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9ce7466f
    • btrfs: remove extent writepage address space operation · f3e90c1c
      Christoph Hellwig committed
      Same as in commit 21b4ee70 ("xfs: drop ->writepage completely"): we
      can remove the callback as it's only used in one place - single page
      writeback from memory reclaim and is not called for cgroup writeback at
      all.
      
      We only allow such writeback from kswapd, not from direct memory
      reclaim, and so it is rarely used. When it comes from kswapd, it is
      effectively random dirty page shoot-down, which is horrible for IO
      patterns. We can rely on background writeback to clean all dirty pages
      in an efficient way and not let it be interrupted by kswapd.
      Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f3e90c1c
    • btrfs: increase direct io read size limit to 256 sectors · ee5b46a3
      Christoph Hellwig committed
      Btrfs currently limits direct I/O reads to a single sector, which goes
      back to commit c329861d ("Btrfs: don't allocate a separate csums
      array for direct reads") from Josef.  That commit changes the direct I/O
      code to ".. use the private part of the io_tree for our csums.", but ten
      years later that isn't how checksums for direct reads work, instead they
      use a csums allocation on a per-btrfs_dio_private basis (which have their
      own performance problem for small I/O, but that will be addressed later).
      
      There is no fundamental limit in btrfs itself to limit the I/O size
      except for the size of the checksum array that scales linearly with
      the number of sectors in an I/O.  Pick a somewhat arbitrary limit of
      256 sectors, which matches what buffered reads typically see as the
      upper limit, as the limit for direct I/O as well.
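
      In concrete terms this boils down to a larger clamp on the direct I/O
      length, along these lines (constant name and clamp site are assumptions
      of this sketch):

          #define BTRFS_MAX_BIO_SECTORS   (256)

          /* in the direct I/O iomap_begin path: never map more than 256
           * sectors (1 MiB with 4K sectors) per bio so the csum array stays
           * bounded */
          len = min_t(u64, len,
                      (u64)BTRFS_MAX_BIO_SECTORS << fs_info->sectorsize_bits);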
      
      This significantly improves direct read performance.  For example a fio
      run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
      throughput:
      
      Baseline:
      
      READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec
      
      With this patch:
      
      READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msc
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ee5b46a3
    • btrfs: remove the finish_func argument to btrfs_mark_ordered_io_finished · 711f447b
      Christoph Hellwig committed
      finish_func is always set to finish_ordered_fn, so remove it and also
      the now pointless and somewhat confusingly named
      __endio_write_update_ordered wrapper.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      711f447b