1. 29 Sep 2022, 8 commits
  2. 26 Sep 2022, 10 commits
    • btrfs: open code and remove btrfs_inode_sectorsize helper · ee8ba05c
      Committed by Josef Bacik
      This is defined in btrfs_inode.h, and dereferences btrfs_root and
      btrfs_fs_info, both of which aren't defined in btrfs_inode.h.
      Additionally, in many places we already have root or fs_info at hand,
      so this helper often makes the code harder to read. Delete the helper
      and simply open code it in the few places where we use it.
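
      For context, the removed helper was essentially a one-line indirection,
      and the open-coded form just dereferences fs_info directly (a sketch of
      the pattern, not the verbatim diff):

          /* Helper removed from btrfs_inode.h: */
          static inline u32 btrfs_inode_sectorsize(const struct btrfs_inode *inode)
          {
                  return inode->root->fs_info->sectorsize;
          }

          /* Open coded at a call site that already has fs_info at hand: */
          u32 sectorsize = fs_info->sectorsize;
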
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace delete argument with EXTENT_CLEAR_ALL_BITS · bd015294
      Committed by Josef Bacik
      Instead of taking up a whole argument to indicate we're clearing
      everything in a range, simply add another EXTENT bit to control this,
      and then update all the callers to drop this argument from the
      clear_extent_bit variants.
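
      The shape of the change at the call sites is roughly the following
      (a sketch with abbreviated argument lists):

          /* Before: a dedicated 'delete' argument. */
          clear_extent_bit(tree, start, end, bits, 1 /* delete */, ...);

          /* After: the behaviour is carried by the bits themselves. */
          clear_extent_bit(tree, start, end, bits | EXTENT_CLEAR_ALL_BITS, ...);
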
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: unify the lock/unlock extent variants · 570eb97b
      Committed by Josef Bacik
      We have two variants of lock/unlock extent, one set that takes a cached
      state, another that does not.  This is slightly annoying, and generally
      speaking there are only a few places where we don't have a cached state.
      Simplify this by making lock_extent/unlock_extent the only variant and
      make it take a cached state, then convert all the callers appropriately.
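
      After the change the API looks roughly like this (a sketch); callers
      that have no cached state simply pass NULL:

          int lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
                          struct extent_state **cached);
          int unlock_extent(struct extent_io_tree *tree, u64 start, u64 end,
                            struct extent_state **cached);

          /* A caller without a cached state: */
          lock_extent(&inode->io_tree, start, end, NULL);
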
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the wake argument from clear_extent_bits · dbbf4992
      Committed by Josef Bacik
      This is only used in the case that we are clearing EXTENT_LOCKED, so
      infer this value from the bits passed in instead of taking it as an
      argument.
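
      In other words, the clearing code now derives the value itself, along
      these lines (sketch):

          /* Only EXTENT_LOCKED can have waiters to wake up. */
          wake = (bits & EXTENT_LOCKED) ? 1 : 0;
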
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make fiemap more efficient and accurate reporting extent sharedness · ac3c0d36
      Committed by Filipe Manana
      The current fiemap implementation does not scale very well with the number
      of extents a file has. This is both because the main algorithm to find out
      the extents has a high algorithmic complexity and because for each extent
      we have to check if it's shared. This second part, checking if an extent
      is shared, is significantly improved by the two previous patches in this
      patchset, while the first part is improved by this specific patch. Every
      now and then we get reports from users mentioning fiemap is too slow or
      even unusable for files with a very large number of extents, such as the
      two recent reports referred to by the Link tags at the bottom of this
      change log.
      
      To understand why the part of finding which extents a file has is very
      inefficient, consider the example of doing a full ranged fiemap against
      a file that has over 100K extents (normal for example for a file with
      more than 10G of data and using compression, which limits the extent size
      to 128K). When we enter fiemap at extent_fiemap(), the following happens:
      
      1) Before entering the main loop, we call get_extent_skip_holes() to get
         the first extent map. This leads us to btrfs_get_extent_fiemap(), which
         in turn calls btrfs_get_extent(), to find the first extent map that
         covers the file range [0, LLONG_MAX).
      
         btrfs_get_extent() will first search the inode's extent map tree, to
         see if we have an extent map there that covers the range. If it does
         not find one, then it will search the inode's subvolume b+tree for a
         fitting file extent item. After finding the file extent item, it will
         allocate an extent map, fill it in with information extracted from the
         file extent item, and add it to the inode's extent map tree (which
         requires a search for insertion in the tree).
      
      2) Then we enter the main loop at extent_fiemap(), emit the details of
         the extent, and call again get_extent_skip_holes(), with a start
         offset matching the end of the extent map we previously processed.
      
         We end up at btrfs_get_extent() again, will search the extent map tree
         and then search the subvolume b+tree for a file extent item if we could
         not find an extent map in the extent map tree. We allocate an extent map,
         fill it in with the details in the file extent item, and then insert
         it into the extent map tree (yet another search in this tree).
      
      3) The second step is repeated over and over, until we have processed the
         whole file range. Each iteration ends at btrfs_get_extent(), which
         does a red black tree search on the extent map tree, then searches the
         subvolume b+tree, allocates an extent map and then does another search
         in the extent map tree in order to insert the extent map.
      
         In the best scenario we have all the extent maps already in the extent
         map tree, and so for each extent we do a single search on a red black tree,
         so we have a complexity of O(n log n).
      
         In the worst scenario we don't have any extent map already loaded in
         the extent map tree, or have very few already there. In this case the
         complexity is much higher since we do:
      
         - A red black tree search on the extent map tree, which has O(log n)
           complexity, initially very fast since the tree is empty or very
           small, but as we end up allocating extent maps and adding them to
           the tree when we don't find them there, each subsequent search on
           the tree gets slower, since it's getting bigger and bigger after
           each iteration.
      
         - A search on the subvolume b+tree, also O(log n) complexity, but it
           has items for all inodes in the subvolume, not just items for our
           inode. Plus on a filesystem with concurrent operations on other
           inodes, we can block doing the search due to lock contention on
           b+tree nodes/leaves.
      
         - Allocate an extent map - this can block, and can also fail if we
           are under serious memory pressure.
      
         - Do another search on the extent maps red black tree, with the goal
           of inserting the extent map we just allocated. Again, after every
           iteration this tree is getting bigger by 1 element, so after many
           iterations the searches are slower and slower.
      
         - We will not need the allocated extent map anymore, so it's pointless
           to add it to the extent map tree. It's just wasting time and memory.
      
         In short we end up searching the extent map tree multiple times, on a
         tree that is growing bigger and bigger after each iteration. And
         besides that we visit the same leaf of the subvolume b+tree many times,
         since a leaf with the default size of 16K can easily have more than 200
         file extent items.
      
      This is very inefficient overall. This patch changes the algorithm to
      instead iterate over the subvolume b+tree, visiting each leaf only once,
      and only searching in the extent map tree for file ranges that have holes
      or prealloc extents, in order to figure out if we have delalloc there.
      It will never allocate an extent map and add it to the extent map tree.
      This is very similar to what was previously done for the lseek's hole and
      data seeking features.
      
      Also, the current implementation relying on extent maps for figuring out
      which extents we have is not correct. This is because extent maps can be
      merged even if they represent different extents - we do this to minimize
      memory utilization and keep extent map trees smaller. For example if we
      have two extents that are contiguous on disk, once we load the two extent
      maps, they get merged into a single one - however if only one of the
      extents is shared, we end up reporting both as shared or both as not
      shared, which is incorrect.
      
      This reproducer triggers that bug:
      
          $ cat fiemap-bug.sh
          #!/bin/bash
      
          DEV=/dev/sdj
          MNT=/mnt/sdj
      
          mkfs.btrfs -f $DEV
          mount $DEV $MNT
      
          # Create a file with two 256K extents.
          # Since there is no other write activity, they will be contiguous,
          # and their extent maps merged, despite having two distinct extents.
          xfs_io -f -c "pwrite -S 0xab 0 256K" \
                    -c "fsync" \
                    -c "pwrite -S 0xcd 256K 256K" \
                    -c "fsync" \
                    $MNT/foo
      
          # Now clone only the second extent into another file.
          xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
      
          # Filefrag will report a single 512K extent, and say it's not shared.
          echo
          filefrag -v $MNT/foo
      
          umount $MNT
      
      Running the reproducer:
      
          $ ./fiemap-bug.sh
          wrote 262144/262144 bytes at offset 0
          256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
          wrote 262144/262144 bytes at offset 262144
          256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
          linked 262144/262144 bytes at offset 0
          256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
      
          Filesystem type is: 9123683e
          File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
           ext:     logical_offset:        physical_offset: length:   expected: flags:
             0:        0..     127:       3328..      3455:    128:             last,eof
          /mnt/sdj/foo: 1 extent found
      
      We end up reporting that we have a single 512K extent that is not shared,
      however we have two 256K extents, and the second one is shared. Changing
      the reproducer to clone instead the first extent into file 'bar' makes us
      report a single 512K extent that is shared, which is also incorrect since
      we have two 256K extents and only the first one is shared.
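
      For reference, what filefrag reports can also be inspected with a tiny
      fiemap caller that prints the per-extent FIEMAP_EXTENT_SHARED flag.
      This uses only the standard uapi headers and nothing btrfs specific
      (a minimal sketch; the 512 extent cap is an arbitrary choice for the
      demo):

          $ cat fiemap-shared.c
          #include <stdio.h>
          #include <stdlib.h>
          #include <fcntl.h>
          #include <unistd.h>
          #include <sys/ioctl.h>
          #include <linux/fs.h>
          #include <linux/fiemap.h>

          int main(int argc, char **argv)
          {
              unsigned int i, count = 512;
              struct fiemap *fm;
              int fd;

              if (argc != 2) {
                  fprintf(stderr, "usage: %s <file>\n", argv[0]);
                  return 1;
              }
              fd = open(argv[1], O_RDONLY);
              if (fd < 0) {
                  perror("open");
                  return 1;
              }
              /* Room for the header plus 'count' extent records. */
              fm = calloc(1, sizeof(*fm) + count * sizeof(struct fiemap_extent));
              if (!fm) {
                  perror("calloc");
                  return 1;
              }
              fm->fm_length = FIEMAP_MAX_OFFSET;
              fm->fm_flags = FIEMAP_FLAG_SYNC;   /* flush delalloc first */
              fm->fm_extent_count = count;

              if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                  perror("fiemap");
                  return 1;
              }
              for (i = 0; i < fm->fm_mapped_extents; i++)
                  printf("extent %u: logical %llu length %llu %s\n", i,
                         (unsigned long long)fm->fm_extents[i].fe_logical,
                         (unsigned long long)fm->fm_extents[i].fe_length,
                         (fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_SHARED) ?
                         "shared" : "not shared");
              free(fm);
              close(fd);
              return 0;
          }
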
      
      This patch is part of a larger patchset that is comprised of the following
      patches:
      
          btrfs: allow hole and data seeking to be interruptible
          btrfs: make hole and data seeking a lot more efficient
          btrfs: remove check for impossible block start for an extent map at fiemap
          btrfs: remove zero length check when entering fiemap
          btrfs: properly flush delalloc when entering fiemap
          btrfs: allow fiemap to be interruptible
          btrfs: rename btrfs_check_shared() to a more descriptive name
          btrfs: speedup checking for extent sharedness during fiemap
          btrfs: skip unnecessary extent buffer sharedness checks during fiemap
          btrfs: make fiemap more efficient and accurate reporting extent sharedness
      
      The patchset was tested on a machine running a non-debug kernel (Debian's
      default config), by comparing the tests below on a branch without the
      patchset versus the same branch with the whole patchset applied.
      
      The following test is for a large compressed file without holes:
      
          $ cat fiemap-perf-test.sh
          #!/bin/bash
      
          DEV=/dev/sdi
          MNT=/mnt/sdi
      
          mkfs.btrfs -f $DEV
          mount -o compress=lzo $DEV $MNT
      
          # 40G gives 327680 128K file extents (due to compression).
          xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
      
          umount $MNT
          mount -o compress=lzo $DEV $MNT
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata not cached)"
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata cached)"
      
          umount $MNT
      
      Before patchset:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 3597 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 2107 milliseconds (metadata cached)
      
      After patchset:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 1214 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 684 milliseconds (metadata cached)
      
      That's a speedup of about 3x for both cases (no metadata cached and all
      metadata cached).
      
      The test provided by Pavel (first Link tag at the bottom), which uses
      files with a large number of holes, was also used to measure the gains.
      It consists of a small C program and a shell script to invoke it.
      The C program is the following:
      
          $ cat pavels-test.c
          #include <stdio.h>
          #include <unistd.h>
          #include <stdlib.h>
          #include <fcntl.h>
      
          #include <sys/stat.h>
          #include <sys/time.h>
          #include <sys/ioctl.h>
      
          #include <linux/fs.h>
          #include <linux/fiemap.h>
      
          #define FILE_INTERVAL (1<<13) /* 8Kb */
      
          long long interval(struct timeval t1, struct timeval t2)
          {
              long long val = 0;
              val += (t2.tv_usec - t1.tv_usec);
              val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
              return val;
          }
      
          int main(int argc, char **argv)
          {
              struct fiemap fiemap = {};
              struct timeval t1, t2;
              char data = 'a';
              struct stat st;
              int fd, off, file_size = FILE_INTERVAL;
      
              if (argc != 3 && argc != 2) {
                      printf("usage: %s <path> [size]\n", argv[0]);
                      return 1;
              }
      
              if (argc == 3)
                      file_size = atoi(argv[2]);
              if (file_size < FILE_INTERVAL)
                      file_size = FILE_INTERVAL;
              file_size -= file_size % FILE_INTERVAL;
      
              fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
              if (fd < 0) {
                  perror("open");
                  return 1;
              }
      
              for (off = 0; off < file_size; off += FILE_INTERVAL) {
                  if (pwrite(fd, &data, 1, off) != 1) {
                      perror("pwrite");
                      close(fd);
                      return 1;
                  }
              }
      
              if (ftruncate(fd, file_size)) {
                  perror("ftruncate");
                  close(fd);
                  return 1;
              }
      
              if (fstat(fd, &st) < 0) {
                  perror("fstat");
                  close(fd);
                  return 1;
              }
      
              printf("size: %ld\n", st.st_size);
              printf("actual size: %ld\n", st.st_blocks * 512);
      
              fiemap.fm_length = FIEMAP_MAX_OFFSET;
              gettimeofday(&t1, NULL);
              if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
                  perror("fiemap");
                  close(fd);
                  return 1;
              }
              gettimeofday(&t2, NULL);
      
              printf("fiemap: fm_mapped_extents = %d\n",
                     fiemap.fm_mapped_extents);
              printf("time = %lld us\n", interval(t1, t2));
      
              close(fd);
              return 0;
          }
      
          $ gcc -o pavels-test pavels-test.c
      
      And the wrapper shell script:
      
          $ cat fiemap-pavels-test.sh
      
          #!/bin/bash
      
          DEV=/dev/sdi
          MNT=/mnt/sdi
      
          mkfs.btrfs -f -O no-holes $DEV
          mount $DEV $MNT
      
          echo
          echo "*********** 256M ***********"
          echo
      
          ./pavels-test $MNT/testfile $((1 << 28))
          echo
          ./pavels-test $MNT/testfile $((1 << 28))
      
          echo
          echo "*********** 512M ***********"
          echo
      
          ./pavels-test $MNT/testfile $((1 << 29))
          echo
          ./pavels-test $MNT/testfile $((1 << 29))
      
          echo
          echo "*********** 1G ***********"
          echo
      
          ./pavels-test $MNT/testfile $((1 << 30))
          echo
          ./pavels-test $MNT/testfile $((1 << 30))
      
          umount $MNT
      
      Running his reproducer before applying the patchset:
      
          *********** 256M ***********
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 4003133 us
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 4895330 us
      
          *********** 512M ***********
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 30123675 us
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 33450934 us
      
          *********** 1G ***********
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 131072
          time = 224924074 us
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 131072
          time = 217239242 us
      
      Running it after applying the patchset:
      
          *********** 256M ***********
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 29475 us
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 29307 us
      
          *********** 512M ***********
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 58996 us
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 59115 us
      
          *********** 1G ***********
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 116251
          time = 124141 us
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 131072
          time = 119387 us
      
      The speedup is massive, both on the first fiemap call and on the second
      one as well, as his test creates files with many holes and small extents
      (every extent follows a hole and precedes another hole).
      
      For the 256M file we go from 4 seconds down to 29 milliseconds in the
      first run, and then from 4.9 seconds down to 29 milliseconds again in the
      second run, a speedup of 138x and 169x, respectively.
      
      For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
      first run, and then from 33.5 seconds down to 59 milliseconds again in the
      second run, a speedup of 510x and 568x, respectively.
      
      For the 1G file, we go from 225 seconds down to 124 milliseconds in the
      first run, and then from 217 seconds down to 119 milliseconds in the
      second run, a speedup of 1815x and 1824x, respectively.
      Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
      Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
      Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make hole and data seeking a lot more efficient · b6e83356
      Committed by Filipe Manana
      The current implementation of hole and data seeking for llseek does not
      scale well with the number of extents and the distance between the start
      offset and the next hole or extent. This is due to a very high
      algorithmic complexity. We also often get reports of btrfs' hole and
      data seeking (llseek) being too slow, such as at 2017's LSFMM (see the
      Link tag at the bottom).
      
      In order to better understand it, let's consider the case where the start
      offset is 0, we are seeking for a hole and the file size is 16G. Between
      file offset 0 and the first hole in the file there are 100K extents - this
      is common for large files, especially if we have compression enabled, since
      the maximum extent size is limited to 128K. The steps taken by the main
      loop of the current algorithm are the following:
      
      1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
         calls btrfs_get_extent(). This will first look up an extent map in
         the inode's extent map tree (a red black tree). If the extent map is
         not loaded in memory, then it will do a lookup for the corresponding
         file extent item in the subvolume's b+tree, create an extent map based
         on the contents of the file extent item and then add the extent map to
         the extent map tree of the inode;
      
      2) The second iteration calls btrfs_get_extent_fiemap() again, this time
         with a start offset matching the end offset of the previous extent.
         Again, btrfs_get_extent() will first search the extent map tree, and
         if it doesn't find an extent map there, it will again search in the
         b+tree of the subvolume for a matching file extent item, build an
         extent map based on the file extent item, and add the extent map to
         the extent map tree of the inode;
      
      3) This repeats over and over until we find the first hole (when seeking
         for holes) or until we find the first extent (when seeking for data).
      
         If there are no extent maps loaded in memory, then on
         each iteration we do 1 extent map tree search, 1 b+tree search, plus
         1 more extent map tree traversal to insert an extent map - plus we
         allocate memory for the extent map.
      
         On each iteration we are growing the size of the extent map tree,
         making each future search slower, and also visiting the same b+tree
         leaves over and over again - taking into account with the default leaf
         size of 16K we can fit more than 200 file extent items in a leaf - so
         we can visit the same b+tree leaf 200+ times, on each visit walking
         down a path from the root to the leaf.
      
      So it's easy to see that what we have now doesn't scale well. Also, it
      loads an extent map for every file extent item into memory, which is not
      efficient - we should add extent maps only when doing IO (writing or
      reading file data).
      
      This change implements a new algorithm which scales much better, and
      works like this:
      
      1) We iterate over the subvolume's b+tree, visiting each leaf that has
         file extent items once and only once;
      
      2) For any file extent items found, that don't represent holes or prealloc
         extents, it will not search the extent map tree - there's no need at
         all for that - an extent map is just an in-memory representation of a
         file extent item;
      
      3) When a hole is found, or a prealloc extent, it will check if there's
         delalloc for its range. For this it will search for EXTENT_DELALLOC
         bits in the inode's io tree and check the extent map tree - this is
         for accounting for unflushed delalloc and for flushed delalloc (the
         period between running delalloc and ordered extent completion),
         respectively. This is similar to what the current implementation does
         when it finds a hole or prealloc extent, but without creating extent
         maps and adding them to the extent map tree in case they are not
         loaded in memory;
      
      4) It never allocates extent maps, or adds extent maps to the inode's
         extent map tree. This not only saves memory and time (from the tree
         insertions and allocations), but also eliminates the possibility of
         -ENOMEM due to allocating too many extent maps.
      
      Part of this new code will also be used later for fiemap (which also
      suffers similar scalability problems).
      
      The following test example can be used to quickly measure the efficiency
      before and after this patch:
      
          $ cat test-seek-hole.sh
          #!/bin/bash
      
          DEV=/dev/sdi
          MNT=/mnt/sdi
      
          mkfs.btrfs -f $DEV
      
          mount -o compress=lzo $DEV $MNT
      
          # 16G file -> 131073 compressed extents.
          xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
      
          # Leave a 1M hole at file offset 15G.
          xfs_io -c "fpunch 15G 1M" $MNT/foobar
      
          # Unmount and mount again, so that we can test when there's no
          # metadata cached in memory.
          umount $MNT
          mount -o compress=lzo $DEV $MNT
      
          # Test seeking for hole from offset 0 (hole is at offset 15G).
      
          start=$(date +%s%N)
          xfs_io -c "seek -h 0" $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "Took $dur milliseconds to seek first hole (metadata not cached)"
          echo
      
          start=$(date +%s%N)
          xfs_io -c "seek -h 0" $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "Took $dur milliseconds to seek first hole (metadata cached)"
          echo
      
          umount $MNT
      
      Before this change:
      
          $ ./test-seek-hole.sh
          (...)
          Whence	Result
          HOLE	16106127360
          Took 176 milliseconds to seek first hole (metadata not cached)
      
          Whence	Result
          HOLE	16106127360
          Took 17 milliseconds to seek first hole (metadata cached)
      
      After this change:
      
          $ ./test-seek-hole.sh
          (...)
          Whence	Result
          HOLE	16106127360
          Took 43 milliseconds to seek first hole (metadata not cached)
      
          Whence	Result
          HOLE	16106127360
          Took 13 milliseconds to seek first hole (metadata cached)
      
      That's about 4x faster when no metadata is cached and about 30% faster
      when all metadata is cached.
      
      In practice the differences may often be significantly higher, either due
      to a higher number of extents in a file or because the subvolume's b+tree
      is much bigger than in this example, where we only have one file.
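
      The xfs_io "seek -h 0" calls above boil down to a single lseek() with
      SEEK_HOLE. For reference, the same measurement as a small standalone
      program (plain Linux API, nothing btrfs specific; a minimal sketch):

          $ cat seek-hole.c
          #define _GNU_SOURCE       /* for SEEK_HOLE */
          #include <stdio.h>
          #include <fcntl.h>
          #include <unistd.h>
          #include <sys/time.h>

          int main(int argc, char **argv)
          {
              struct timeval t1, t2;
              long long us;
              off_t hole;
              int fd;

              if (argc != 2) {
                  fprintf(stderr, "usage: %s <file>\n", argv[0]);
                  return 1;
              }
              fd = open(argv[1], O_RDONLY);
              if (fd < 0) {
                  perror("open");
                  return 1;
              }
              gettimeofday(&t1, NULL);
              hole = lseek(fd, 0, SEEK_HOLE);   /* EOF counts as a hole */
              gettimeofday(&t2, NULL);
              if (hole == (off_t)-1) {
                  perror("lseek");
                  return 1;
              }
              us = (long long)(t2.tv_sec - t1.tv_sec) * 1000000 +
                   (t2.tv_usec - t1.tv_usec);
              printf("first hole at %lld, took %lld us\n",
                     (long long)hole, us);
              close(fd);
              return 0;
          }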
      
      Link: https://lwn.net/Articles/718805/
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow hole and data seeking to be interruptible · aed0ca18
      Committed by Filipe Manana
      Doing hole or data seeking on a file with a very large number of extents
      can take a long time, and we have reports of it being too slow (such as
      at LSFMM from 2017, see the Link below). So make it interruptible.
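
      The change amounts to the usual pattern of checking for a fatal signal
      inside the long-running search loop, roughly (sketch):

          /* Inside the loop that iterates over the extents: */
          if (fatal_signal_pending(current)) {
                  ret = -EINTR;
                  goto out;
          }
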
      
      Link: https://lwn.net/Articles/718805/
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: log conflicting inodes without holding log mutex of the initial inode · e09d94c9
      Committed by Filipe Manana
      When logging an inode, if we detect the inode has a reference that
      conflicts with some other inode that got renamed, we log that other inode
      while holding the log mutex of the current inode. We then find out if
      there are other inodes that conflict with the first conflicting inode,
      and log them while under the log mutex of the original inode. This is
      fine because the recursion can only happen once.
      
      For the upcoming work where we directly log delayed items without flushing
      them first to the subvolume tree, this recursion adds a lot of complexity
      and it's hard to keep lockdep happy about it.
      
      So collect a list of conflicting inodes and then log the inodes after
      unlocking the log mutex of the inode we started with.
      
      Also limit the maximum number of conflict inodes we log to 10, to avoid
      spending too much time logging (and maybe allocating too many list
      elements too), as typically we don't have more than 1 or 2 conflicting
      inodes - if we go over the limit, simply fall back to a transaction commit.
      
      A user could intentionally create a very long list of conflicting
      inodes through a long succession of renames like this:
      
        (...)
        rename E to F
        rename D to E
        rename C to D
        rename B to C
        rename A to B
        touch A (create a new file named A)
        fsync A
      
      If that happened for a sequence of hundreds or thousands of renames, it
      could massively slow down the logging and cause other secondary effects,
      like blocking other fsync operations and transaction commits for a very
      long time (assuming it wouldn't run into -ENOSPC or -ENOMEM first).
      However such cases are very uncommon in practice; nevertheless it's
      better to be prepared for them and avoid chaos. Such a long sequence of
      conflicting inodes could be created before this change too.
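
      Schematically, the logging flow becomes the following (heavily
      abbreviated sketch; treat the helper names as approximate):

          /* While holding the inode's log_mutex: record, don't recurse. */
          ret = add_conflicting_inode(trans, root, ino, parent, ctx);
          ...
          mutex_unlock(&inode->log_mutex);

          /*
           * With the log_mutex dropped, log what was recorded, falling
           * back to a transaction commit past the limit of 10 inodes.
           */
          ret = log_conflicting_inodes(trans, root, ctx);
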
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent() · d1f68ba0
      Committed by Omar Sandoval
      btrfs_insert_file_extent() is only ever used to insert holes, so rename
      it and remove the redundant parameters.
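
      The resulting prototype is roughly the following (a sketch; all the
      disk_bytenr/compression style parameters, which were always zero for
      holes, are gone):

          int btrfs_insert_hole_extent(struct btrfs_trans_handle *trans,
                                       struct btrfs_root *root, u64 objectid,
                                       u64 pos, u64 num_bytes);
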
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix alignment of VMA for memory mapped files on THP · b0c58223
      Committed by Alexander Zhu
      With CONFIG_READ_ONLY_THP_FOR_FS, the Linux kernel supports using THPs for
      read-only mmapped files, such as shared libraries. However, the kernel
      makes no attempt to actually align those mappings on 2MB boundaries,
      which makes it impossible to use those THPs most of the time. This issue
      applies to general file mapping THP as well as existing setups using
      CONFIG_READ_ONLY_THP_FOR_FS. This is easily fixed by using
      thp_get_unmapped_area for the get_unmapped_area function in btrfs, which
      is what ext2, ext4, fuse, and xfs all use.
      
      Initially btrfs had been left out in commit 8c07fc452ac0 ("btrfs: fix
      alignment of VMA for memory mapped files on THP") as btrfs does not support
      DAX. However, commit 1854bc6e ("mm/readahead: Align file mappings
      for non-DAX") removed the DAX requirement. We should now be able to call
      thp_get_unmapped_area() for btrfs.
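
      The fix itself is a one-line hookup in btrfs' file_operations (sketch
      of the relevant part of fs/btrfs/file.c):

          static const struct file_operations btrfs_file_operations = {
                  ...
                  .mmap              = btrfs_file_mmap,
                  .get_unmapped_area = thp_get_unmapped_area,
                  ...
          };
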
      
      The problem can be seen in /proc/PID/smaps where THPeligible is set to 0
      on mappings to eligible shared object files as shown below.
      
      Before this patch:
      
        7fc6a7e18000-7fc6a80cc000 r-xp 00000000 00:1e 199856
        /usr/lib64/libcrypto.so.1.1.1k
        Size:               2768 kB
        THPeligible:    0
        VmFlags: rd ex mr mw me
      
      With this patch the library is mapped at a 2MB aligned address:
      
        7fbdfe200000-7fbdfe4b4000 r-xp 00000000 00:1e 199856
        /usr/lib64/libcrypto.so.1.1.1k
        Size:               2768 kB
        THPeligible:    1
        VmFlags: rd ex mr mw me
      
      This fixes the alignment of VMAs for any mmap of a file that has the
      rd and ex permissions and size >= 2MB. The VMA alignment and
      THPeligible field for anonymous memory is handled separately and
      is thus not affected by this change.
      
      CC: stable@vger.kernel.org # 5.18+
      Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 23 Aug 2022, 1 commit
    • btrfs: update generation of hole file extent item when merging holes · e6e3dec6
      Committed by Filipe Manana
      When punching a hole into a file range that is adjacent with a hole and we
      are not using the no-holes feature, we expand the range of the adjacent
      file extent item that represents a hole, to save metadata space.
      
      However we don't update the generation of the hole file extent item, which
      means a full fsync will not log that file extent item if the fsync happens
      in a later transaction (since commit 7f30c072 ("btrfs: stop copying
      old file extents when doing a full fsync")).
      
      For example, if we do this:
      
          $ mkfs.btrfs -f -O ^no-holes /dev/sdb
          $ mount /dev/sdb /mnt
          $ xfs_io -f -c "pwrite -S 0xab 2M 2M" /mnt/foobar
          $ sync
      
      We end up with 2 file extent items in our file:
      
      1) One that represents the hole for the file range [0, 2M), with a
         generation of 7;
      
      2) Another one that represents an extent covering the range [2M, 4M).
      
      After that if we do the following:
      
          $ xfs_io -c "fpunch 2M 2M" /mnt/foobar
      
      We end up with a single file extent item in the file, which represents a
      hole for the range [0, 4M) and with a generation of 7 - because we end
      up dropping the data extent for the range [2M, 4M) and then update the
      file extent item that represented the hole at [0, 2M), increasing its
      length from 2M to 4M.
      
      Then doing a full fsync and power failing:
      
          $ xfs_io -c "fsync" /mnt/foobar
          <power failure>
      
      will result in the full fsync not logging the file extent item that
      represents the hole for the range [0, 4M), because its generation is 7,
      which is lower than the generation of the current transaction (8).
      As a consequence, after mounting again the filesystem (after log replay),
      the region [2M, 4M) does not have a hole, it still points to the
      previous data extent.
      
      So fix this by always updating the generation of existing file extent
      items representing holes when we merge/expand them. This solves the
      problem and it's the same approach as when we merge prealloc extents that
      got written (at btrfs_mark_extent_written()). Setting the generation to
      the current transaction's generation is also what we do when merging
      the new hole extent map with the previous one or the next one.
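
      Concretely, each merge site now refreshes the item's generation, along
      the lines of (sketch):

          btrfs_set_file_extent_generation(leaf, fi, trans->transid);
          btrfs_mark_buffer_dirty(leaf);
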
      
      A test case for fstests, covering both cases of hole file extent item
      merging (to the left and to the right), will be sent soon.
      
      Fixes: 7f30c072 ("btrfs: stop copying old file extents when doing a full fsync")
      CC: stable@vger.kernel.org # 5.18+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 25 Jul 2022, 4 commits
  5. 21 Jun 2022, 3 commits
    • btrfs: fix deadlock with fsync+fiemap+transaction commit · bf7ba8ee
      Committed by Josef Bacik
      We are occasionally hitting the following deadlock in production:
      
      Task 1		Task 2		Task 3		Task 4		Task 5
      		fsync(A)
      		 start trans
      						start commit
      				falloc(A)
      				 lock 5m-10m
      				 start trans
      				  wait for commit
      fiemap(A)
       lock 0-10m
        wait for 5m-10m
         (have 0-5m locked)
      
      		 have btrfs_need_log_full_commit
      		  !full_sync
      		  wait_ordered_extents
      								finish_ordered_io(A)
      								lock 0-5m
      								DEADLOCK
      
      We have an existing dependency of file extent lock -> transaction.
      However in fsync if we tried to do the fast logging, but then had to
      fall back to committing the transaction, we will be forced to call
      btrfs_wait_ordered_range() to make sure all of our extents are updated.
      
      This creates a dependency of transaction -> file extent lock, because
      btrfs_finish_ordered_io() will need to take the file extent lock in
      order to run the ordered extents.
      
      Fix this by stopping the transaction if we have to do the full commit
      and we attempted to do the fast logging.  Then attach to the transaction
      and commit it if we need to.
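
      A sketch of that shape in btrfs_sync_file() (not the exact diff):

          /*
           * Don't hold a transaction open while waiting for ordered
           * extents, or finish_ordered_io() can deadlock on the file
           * range lock we hold.
           */
          btrfs_end_transaction(trans);
          ret = btrfs_wait_ordered_range(inode, start, len);
          ...
          trans = btrfs_attach_transaction_barrier(root);
          ...
          ret = btrfs_commit_transaction(trans);
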
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not BUG_ON() on failure to migrate space when replacing extents · 650c9cab
      Committed by Filipe Manana
      At btrfs_replace_file_extents(), if we fail to migrate reserved metadata
      space from the transaction block reserve into the local block reserve,
      we trigger a BUG_ON(). This is because it should not be possible to have
      a failure here, as we reserved more space when we started the transaction
      than the space we want to migrate. However having a BUG_ON() is way too
      drastic, we can perfectly handle the failure and return the error to the
      caller. So just do that instead, and add a WARN_ON() to make it easier
      to notice the failure if it ever happens (which is particularly useful
      for fstests, and the warning will trigger a failure of a test case).
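
      That is, the error handling changes roughly from a BUG_ON() to
      (sketch):

          ret = btrfs_block_rsv_migrate(&fs_info->trans_block_rsv, rsv,
                                        min_size, false);
          if (WARN_ON(ret))
                  goto out_trans;   /* propagate the error, don't crash */
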
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add missing inode updates on each iteration when replacing extents · 983d8209
      Committed by Filipe Manana
      When replacing file extents, called during fallocate, hole punching,
      clone and deduplication, we may not be able to replace/drop all the
      target file extent items with a single transaction handle. We may get
      -ENOSPC while doing it, in which case we release the transaction handle,
      balance the dirty pages of the btree inode, flush delayed items and get
      a new transaction handle to operate on what's left of the target range.
      
      By dropping and replacing file extent items we have effectively modified
      the inode, so we should bump its iversion and update its mtime/ctime
      before we update the inode item. This is because if the transaction we
      used for partially modifying the inode gets committed by someone else
      after we release it and before we finish the rest of the range, and a
      power failure happens, then after mounting the filesystem our inode will
      have an outdated iversion and mtime/ctime, corresponding to the values
      it had before we changed it.
      
      So add the missing iversion and mtime/ctime updates.
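
      The added updates, done before each btrfs_update_inode() call in the
      loop, are essentially (sketch):

          inode_inc_iversion(&inode->vfs_inode);
          inode->vfs_inode.i_mtime = current_time(&inode->vfs_inode);
          inode->vfs_inode.i_ctime = inode->vfs_inode.i_mtime;
          ret = btrfs_update_inode(trans, root, inode);
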
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 11 Jun 2022, 2 commits
  7. 16 May 2022, 9 commits
    • btrfs: add a btrfs_dio_rw wrapper · 36e8c622
      Committed by Christoph Hellwig
      Add a wrapper around iomap_dio_rw that keeps the direct I/O internals
      isolated in inode.c.
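
      A sketch of the wrapper (treat the exact signature and flags as
      approximate):

          ssize_t btrfs_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
                               size_t done_before)
          {
                  return iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops,
                                      &btrfs_dio_ops, IOMAP_DIO_PARTIAL,
                                      done_before);
          }
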
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid blocking on space revervation when doing nowait dio writes · d4135134
      Committed by Filipe Manana
      When doing a NOWAIT direct IO write, if we can NOCOW then it means we can
      proceed with the non-blocking, NOWAIT path. However reserving the metadata
      space and qgroup meta space can often result in blocking - flushing
      delalloc, waiting for ordered extents to complete, triggering transaction
      commits, etc - going against the semantics of a NOWAIT write.
      
      So make the NOWAIT write path try to reserve all the metadata it needs
      without blocking - if we get -ENOSPC or -EDQUOT then return -EAGAIN to
      make the caller fall back to a blocking direct IO write.
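
      Schematically, the NOWAIT write path now does something like this
      (sketch; the parameter naming is approximate):

          /* Reserve metadata without any flushing. */
          ret = btrfs_delalloc_reserve_metadata(inode, len, len,
                                                true /* noflush */);
          if (ret == -ENOSPC || ret == -EDQUOT)
                  return -EAGAIN;   /* fall back to a blocking dio write */
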
      
      This is part of a patchset comprised of the following patches:
      
        btrfs: avoid blocking on page locks with nowait dio on compressed range
        btrfs: avoid blocking nowait dio when locking file range
        btrfs: avoid double nocow check when doing nowait dio writes
        btrfs: stop allocating a path when checking if cross reference exists
        btrfs: free path at can_nocow_extent() before checking for checksum items
        btrfs: release path earlier at can_nocow_extent()
        btrfs: avoid blocking when allocating context for nowait dio read/write
        btrfs: avoid blocking on space revervation when doing nowait dio writes
      
      The following test was run before and after applying this patchset:
      
        $ cat io-uring-nodatacow-test.sh
        #!/bin/bash
      
        DEV=/dev/sdc
        MNT=/mnt/sdc
      
        MOUNT_OPTIONS="-o ssd -o nodatacow"
        MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
        NUM_JOBS=4
        FILE_SIZE=8G
        RUN_TIME=300
      
        cat <<EOF > /tmp/fio-job.ini
        [io_uring_rw]
        rw=randrw
        fsync=0
        fallocate=posix
        group_reporting=1
        direct=1
        ioengine=io_uring
        iodepth=64
        bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
        filesize=$FILE_SIZE
        runtime=$RUN_TIME
        time_based
        filename=foobar
        directory=$MNT
        numjobs=$NUM_JOBS
        thread
        EOF
      
        echo performance | \
           tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        fio /tmp/fio-job.ini
      
        umount $MNT
      
      The test was run on a 12 core box with 64G of RAM, using a non-debug
      kernel config (Debian's default config) and a spinning disk.
      
      Result before the patchset:
      
       READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
      WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
      
      Result after the patchset:
      
       READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec
      WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec
      
      That's about +7.2% throughput for reads and +6.9% for writes.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid double nocow check when doing nowait dio writes · d7a8ab4e
      Committed by Filipe Manana
      When doing a NOWAIT direct IO write we are checking twice if we can NOCOW
      into the target file range using can_nocow_extent() - once at the very
      beginning of the write path, at btrfs_write_check() via
      check_nocow_nolock(), and later again at btrfs_get_blocks_direct_write().
      
      The can_nocow_extent() function does a lot of expensive things - searching
      for the file extent item in the inode's subvolume tree, searching for the
      extent item in the extent tree, checking delayed references, etc, so it
      isn't a very cheap call.
      
      We can remove the first check at btrfs_write_check(), and add there a
      quick check to verify if the inode has the NODATACOW or PREALLOC flags,
      and quickly bail out if it has neither of those flags, as that means
      we have to COW and therefore can't comply with the NOWAIT semantics.
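
      The cheap replacement check is essentially (sketch, close to the
      actual diff):

          if ((iocb->ki_flags & IOCB_NOWAIT) &&
              !(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
                                         BTRFS_INODE_PREALLOC)))
                  return -EAGAIN;
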
      
      After this we do only one call to can_nocow_extent(), while we are at
      btrfs_get_blocks_direct_write(), where we have already locked the file
      range and we did a try lock on the range before, at
      btrfs_dio_iomap_begin() (since the previous patch in the series).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add and use helper to assert an inode range is clean · 63c34cb4
      Committed by Filipe Manana
      We have four different scenarios where we don't expect to find ordered
      extents after locking a file range:
      
      1) During plain fallocate;
      2) During hole punching;
      3) During zero range;
      4) During reflinks (both cloning and deduplication).
      
      This is because in all these cases we follow the pattern:
      
      1) Lock the inode's VFS lock in exclusive mode;
      
      2) Lock the inode's i_mmap_lock in exclusive mode, to serialize with
         mmap writes;
      
      3) Flush delalloc in a file range and wait for all ordered extents
         to complete - both done through btrfs_wait_ordered_range();
      
      4) Lock the file range in the inode's io_tree.
      
      So add a helper that asserts that we don't have ordered extents for a
      given range. Make the four scenarios listed above use this helper after
      locking the respective file range.
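
      The helper is conceptually just a lookup plus an assertion (sketch):

          void btrfs_assert_inode_range_clean(struct btrfs_inode *inode,
                                              u64 start, u64 end)
          {
                  struct btrfs_ordered_extent *ordered;

                  if (!IS_ENABLED(CONFIG_BTRFS_ASSERT))
                          return;

                  ordered = btrfs_lookup_first_ordered_range(inode, start,
                                                             end + 1 - start);
                  ASSERT(ordered == NULL);
          }
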
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove ordered extent check and wait during hole punching and zero range · 55961c8a
      Committed by Filipe Manana
      For hole punching and zero range we have this loop that checks if we have
      ordered extents after locking the file range, and if so unlock the range,
      wait for ordered extents, and retry until we don't find more ordered
      extents.
      
      This logic was needed in the past because:
      
      1) Direct IO writes within the i_size boundary did not take the inode's
         VFS lock. This was because that lock used to be a mutex, then some
         years ago it was switched to a rw semaphore (commit 9902af79
         ("parallel lookups: actual switch to rwsem")), and then btrfs was
         changed to take the VFS inode's lock in shared mode for writes that
         don't cross the i_size boundary (commit e9adabb9 ("btrfs: use
         shared lock for direct writes within EOF"));
      
      2) We could race with memory mapped writes, because memory mapped writes
         don't acquire the inode's VFS lock. We don't have that race anymore,
         as we have a rw semaphore to synchronize memory mapped writes with
         fallocate (and reflinking too). That change happened with commit
         8d9b4a16 ("btrfs: exclude mmap from happening during all
         fallocate operations").
      
      So stop looking for ordered extents after locking the file range when
      doing hole punching and zero range operations.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: lock the inode first before flushing range when punching hole · bd6526d0
      Committed by Filipe Manana
      When doing hole punching we are flushing delalloc and waiting for ordered
      extents to complete before locking the inode (VFS lock and the btrfs
      specific i_mmap_lock). This is fine because even if a write happens after
      we call btrfs_wait_ordered_range() and before we lock the inode (call
      btrfs_inode_lock()), we will notice the write at
      btrfs_punch_hole_lock_range() and flush delalloc and wait for its ordered
      extent.
      
      We can however make this simpler by locking the inode first and then
      calling btrfs_wait_ordered_range(), which will allow us to remove the
      ordered extent lookup logic from btrfs_punch_hole_lock_range() in the
      next patch.
      It also makes the behaviour the same as plain fallocate, hole punching
      and reflinks.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove ordered extent check and wait during fallocate · ffa8fc60
      Committed by Filipe Manana
      For fallocate() we have this loop that checks if we have ordered extents
      after locking the file range, and if so unlock the range, wait for ordered
      extents, and retry until we don't find more ordered extents.
      
      This logic was needed in the past because:
      
      1) Direct IO writes within the i_size boundary did not take the inode's
         VFS lock. This was because that lock used to be a mutex, then some
         years ago it was switched to a rw semaphore (commit 9902af79
         ("parallel lookups: actual switch to rwsem")), and then btrfs was
         changed to take the VFS inode's lock in shared mode for writes that
         don't cross the i_size boundary (commit e9adabb9 ("btrfs: use
         shared lock for direct writes within EOF"));
      
      2) We could race with memory mapped writes, because memory mapped writes
         don't acquire the inode's VFS lock. We don't have that race anymore,
         as we have a rw semaphore to synchronize memory mapped writes with
         fallocate (and reflinking too). That change happened with commit
         8d9b4a16 ("btrfs: exclude mmap from happening during all
         fallocate operations").
      
      So stop looking for ordered extents after locking the file range when
      doing a plain fallocate.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove useless dio wait call when doing fallocate zero range · 831e1ee6
      Committed by Filipe Manana
      When starting a fallocate zero range operation, before getting the first
      extent map for the range, we make a call to inode_dio_wait().
      
      This logic was needed in the past because direct IO writes within the
      i_size boundary did not take the inode's VFS lock. This was because that
      lock used to be a mutex, then some years ago it was switched to a rw
      semaphore (by commit 9902af79 ("parallel lookups: actual switch to
      rwsem")), and then btrfs was changed to take the VFS inode's lock in
      shared mode for writes that don't cross the i_size boundary (done in
      commit e9adabb9 ("btrfs: use shared lock for direct writes within
      EOF")). The lockless direct IO writes could result in a race with the
      zero range operation, resulting in the latter getting a stale extent
      map for the range.
      
      So remove this no longer needed call to inode_dio_wait(), as fallocate
      takes the inode's VFS lock in exclusive mode and direct IO writes within
      i_size take that same lock in shared mode.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: only reserve the needed data space amount during fallocate · 47e1d1c7
      Committed by Filipe Manana
      During a plain fallocate, we always start by reserving an amount of data
      space that matches the length of the range passed to fallocate. When we
      already have extents allocated in that range, we may end up trying to
      reserve a lot more data space than we need, which can result in several
      undesired behaviours:
      
      1) We fail with -ENOSPC. For example the passed range has a length
         of 1G, but there's only one hole with a size of 1M in that range;
      
      2) We temporarily reserve excessive data space that could be used by
         other operations happening concurrently;
      
      3) By reserving much more data space than we need, we can end up
         doing expensive things like triggering delalloc for other inodes,
         waiting for ordered extents to complete, triggering transaction
         commits, allocating new block groups, etc.
      
      Example:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f -b 1g $DEV
        mount $DEV $MNT
      
        # Create a file with a size of 600M and two holes, one at [200M, 201M[
        # and another at [401M, 402M[
        xfs_io -f -c "pwrite -S 0xab 0 200M" \
                  -c "pwrite -S 0xcd 201M 200M" \
                  -c "pwrite -S 0xef 402M 198M" \
                  $MNT/foobar
      
        # Now call fallocate against the whole file range, see if it fails
        # with -ENOSPC or not - it shouldn't since we only need to allocate
        # 2M of data space.
        xfs_io -c "falloc 0 600M" $MNT/foobar
      
        umount $MNT
      
        $ ./test.sh
        (...)
        wrote 209715200/209715200 bytes at offset 0
        200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec)
        wrote 209715200/209715200 bytes at offset 210763776
        200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec)
        wrote 207618048/207618048 bytes at offset 421527552
        198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec)
        fallocate: No space left on device
        $
      
      So fix this by not allocating an amount of data space that matches the
      length of the range passed to fallocate. Instead allocate an amount of
      data space that corresponds to the sum of the sizes of each hole found
      in the range. This reservation now happens after we have locked the file
      range, which is safe since we know at this point there's no delalloc
      in the range because we've taken the inode's VFS lock in exclusive mode,
      we have taken the inode's i_mmap_lock in exclusive mode, we have flushed
      delalloc and waited for all ordered extents in the range to complete.
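
      Schematically, the reservation now happens per hole inside the main
      loop rather than up front (sketch; the argument list is approximate):

          /* For each hole found while walking the range [start, end): */
          ret = btrfs_check_data_free_space(BTRFS_I(inode), &data_reserved,
                                            cur_offset, last_byte - cur_offset);
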
      
      This type of failure actually seems to happen in practice with systemd,
      and we had at least one report about this in a very long thread which
      is referenced by the Link tag below.
      
      Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 10 May 2022, 3 commits