1. 14 March 2022, 7 commits
    • btrfs: fix lost error return value when reading a data page · bbf0ea7e
      By Filipe Manana
      At btrfs_do_readpage(), if we get an error when trying to lookup for an
      extent map, we end up marking the page with the error bit, clearing
      the uptodate bit on it, and doing everything else that should be done.
      However we return success (0) to the caller, when we should return the
      error encoded in the extent map pointer. So fix that by returning the
      error encoded in the pointer.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace BUILD_BUG_ON by static_assert · a55e65b8
      By David Sterba
      The static_assert introduced in 6bab69c6 ("build_bug.h: add wrapper
      for _Static_assert") has been supported by compilers for a long time
      (gcc 4.6, clang 3.0) and can be used in header files. We don't need to
      put BUILD_BUG_ON in random functions but can rather keep the assertion
      next to the definition.
      
      The exception here is the UAPI header btrfs_tree.h that could be
      potentially included by userspace code and the static assert is not
      defined (nor used in any other header).
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: disable snapshot creation/deletion for extent tree v2 · 813febdb
      By Josef Bacik
      When we stop tracking metadata blocks all of snapshotting will break, so
      disable it until I add the snapshot root and drop tree support.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: stop doing unnecessary log updates during a rename · 259c4b96
      By Filipe Manana
      During a rename, we call __btrfs_unlink_inode(), which will call
      btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(), in order
      to remove an inode reference and a directory entry from the log. These
      are necessary when __btrfs_unlink_inode() is called from the unlink path,
      but not necessary when it's called from a rename context, because:
      
      1) For the btrfs_del_inode_ref_in_log() call, it's pointless to delete the
         inode reference related to the old name, because later in the rename
         path we call btrfs_log_new_name(), which will drop all inode references
         from the log and copy all inode references from the subvolume tree to
         the log tree. So we are doing one unnecessary btree operation which
         adds additional latency and lock contention in case there are other
         tasks accessing the log tree;
      
      2) For the btrfs_del_dir_entries_in_log() call, we are now doing the
         equivalent at btrfs_log_new_name() since the previous patch in the
         series, that has the subject "btrfs: avoid logging all directory
         changes during renames". In fact, having __btrfs_unlink_inode() call
         this function not only adds additional latency and lock contention due
         to the extra btree operation, but also can make btrfs_log_new_name()
         unnecessarily log a range item to track the deletion of the old name,
          since it has no way to know that the directory entry related to the
         old name was previously logged and already deleted by
         __btrfs_unlink_inode() through its call to
         btrfs_del_dir_entries_in_log().
      
      So skip those calls at __btrfs_unlink_inode() when we are doing a rename.
       Skipping them also allows us to reduce the time we keep a log
       transaction pinned during renames, which is always beneficial as it
       means other tasks trying to sync the log tree are delayed less; in
       particular, we no longer hold the log transaction pinned while adding
       the new name (adding the inode ref, directory entry, etc).
      
      This change is part of a patchset comprised of the following patches:
      
        1/5 btrfs: add helper to delete a dir entry from a log tree
        2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
        3/5 btrfs: avoid logging all directory changes during renames
        4/5 btrfs: stop doing unnecessary log updates during a rename
        5/5 btrfs: avoid inode logging during rename and link when possible
      
      Just like the previous patch in the series, "btrfs: avoid logging all
      directory changes during renames", the following script mimics part of
      what a package installation/upgrade with zypper does, which is basically
      renaming a lot of files, in some directory under /usr, to a name with a
      suffix of "-RPMDELETE":
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=10000
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        # Do some change to testdir and fsync it.
        echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
        xfs_io -c "fsync" $MNT/testdir
      
        echo "Renaming $NUM_FILES files..."
        start=$(date +%s%N)
        for ((i = 1; i <= $NUM_FILES; i++)); do
            mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
        done
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "Renames took $dur milliseconds"
      
        umount $MNT
      
       Testing this change on a box using a non-debug kernel (Debian's
       default kernel config) gave the following results:
      
      NUM_FILES=10000, before patchset:                   27399 ms
      NUM_FILES=10000, after patches 1/5 to 3/5 applied:   9093 ms (-66.8%)
      NUM_FILES=10000, after patches 1/5 to 4/5 applied:   9016 ms (-67.1%)
      
      NUM_FILES=5000, before patchset:                     9241 ms
      NUM_FILES=5000, after patches 1/5 to 3/5 applied:    4642 ms (-49.8%)
      NUM_FILES=5000, after patches 1/5 to 4/5 applied:    4553 ms (-50.7%)
      
      NUM_FILES=2000, before patchset:                     2550 ms
      NUM_FILES=2000, after patches 1/5 to 3/5 applied:    1788 ms (-29.9%)
      NUM_FILES=2000, after patches 1/5 to 4/5 applied:    1767 ms (-30.7%)
      
      NUM_FILES=1000, before patchset:                     1088 ms
      NUM_FILES=1000, after patches 1/5 to 3/5 applied:     905 ms (-16.9%)
      NUM_FILES=1000, after patches 1/5 to 4/5 applied:     883 ms (-18.8%)
      
       The next patch in the series (5/5) also contains dbench results after
       applying the whole patchset.
      
       Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549
       Signed-off-by: Filipe Manana <fdmanana@suse.com>
       Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid logging all directory changes during renames · 88d2beec
      By Filipe Manana
      When doing a rename of a file, if the file or its old parent directory
      were logged before, we log the new name of the file and then make sure
      we log the old parent directory, to ensure that after a log replay the
      old name of the file is deleted and the new name added.
      
      The logging of the old parent directory can take some time, because it
      will scan all leaves modified in the current transaction, check which
      directory entries were already logged, copy the ones that were not
      logged before, etc. In this rename context all we need to do is make
      sure that the old name of the file is deleted on log replay, so instead
      of triggering a directory log operation, we can just delete the old
      directory entry from the log if it's there, or in case it isn't there,
      just log a range item to signal log replay that the old name must be
      deleted. So change btrfs_log_new_name() to do that.
      
      This scenario is actually not uncommon to trigger, and recently on a
      5.15 kernel, an openSUSE Tumbleweed user reported package installations
      and upgrades, with the zypper tool, were often taking a long time to
      complete, much more than usual. With strace it could be observed that
      zypper was spending over 99% of its time on rename operations, and then
      with further analysis we checked that directory logging was happening
      too frequently and causing high latencies for the rename operations.
       Taking into account that installation/upgrade of some of these packages
       needed a few thousand file renames, the slowdown was very noticeable
       for the user.
      
      The issue was caused indirectly due to an excessive number of inode
      evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14
      or a 5.16-rc8 kernel. After an inode eviction we can't tell for sure,
      in an efficient way, if an inode was previously logged in the current
      transaction, so we are pessimistic and assume it was, because in case
      it was we need to update the logged inode. More details on that in one
      of the patches in the same series (subject "btrfs: avoid inode logging
      during rename and link when possible"). Either way, in case the parent
       directory was logged before, we currently do more work than necessary
      during a rename, and this change minimizes that amount of work.
      
      The following script mimics part of what a package installation/upgrade
      with zypper does, which is basically renaming a lot of files, in some
      directory under /usr, to a name with a suffix of "-RPMDELETE":
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=10000
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        # Do some change to testdir and fsync it.
        echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
        xfs_io -c "fsync" $MNT/testdir
      
        echo "Renaming $NUM_FILES files..."
        start=$(date +%s%N)
        for ((i = 1; i <= $NUM_FILES; i++)); do
            mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
        done
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "Renames took $dur milliseconds"
      
        umount $MNT
      
       Testing this change on a box using a non-debug kernel (Debian's default
      kernel config) gave the following results:
      
      NUM_FILES=10000, before this patch: 27399 ms
      NUM_FILES=10000, after this patch:   9093 ms (-66.8%)
      
      NUM_FILES=5000, before this patch:   9241 ms
      NUM_FILES=5000, after this patch:    4642 ms (-49.8%)
      
      NUM_FILES=2000, before this patch:   2550 ms
      NUM_FILES=2000, after this patch:    1788 ms (-29.9%)
      
      NUM_FILES=1000, before this patch:   1088 ms
      NUM_FILES=1000, after this patch:     905 ms (-16.9%)
      
       Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549
       Signed-off-by: Filipe Manana <fdmanana@suse.com>
       Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: pass the dentry to btrfs_log_new_name() instead of the inode · d5f5bd54
      By Filipe Manana
      In the next patch in the series, there will be the need to access the old
      name, and its length, of an inode when logging the inode during a rename.
      So instead of passing the inode to btrfs_log_new_name() pass the dentry,
      because from the dentry we can get the inode, the name and its length.
      
      This will avoid passing 3 new parameters to btrfs_log_new_name() in the
      next patch - the name, its length and an index number. This way we end
      up passing only 1 new parameter, the index number.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: put initial index value of a directory in a constant · 528ee697
      By Filipe Manana
      At btrfs_set_inode_index_count() we refer twice to the number 2 as the
      initial index value for a directory (when it's empty), with a proper
      comment explaining the reason for that value. In the next patch I'll
      have to use that magic value in the directory logging code, so put
      the value in a #define at btrfs_inode.h, to avoid hardcoding the
      magic value again at tree-log.c.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 04 March 2022, 1 commit
    • btrfs: fallback to blocking mode when doing async dio over multiple extents · ca93e44b
      By Filipe Manana
      Some users recently reported that MariaDB was getting a read corruption
      when using io_uring on top of btrfs. This started to happen in 5.16,
      after commit 51bd9563 ("btrfs: fix deadlock due to page faults
      during direct IO reads and writes"). That changed btrfs to use the new
      iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
      iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
      corresponds to a memory mapped file region. That type of scenario is
      exercised by test case generic/647 from fstests.
      
      For this MariaDB scenario, we attempt to read 16K from file offset X
      using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
      with a size of 4K, and what happens is the following:
      
      1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
      
      2) iomap creates a struct iomap_dio object, its reference count is
         initialized to 1 and its ->size field is initialized to 0;
      
      3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds
          the first 4K extent, and sets up an iomap for this extent consisting
         of a single page;
      
      4) At iomap_dio_bio_iter(), we are able to access the first page of the
         buffer (struct iov_iter) with bio_iov_iter_get_pages() without
         triggering a page fault;
      
      5) iomap submits a bio for this 4K extent
         (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
         the refcount on the struct iomap_dio object to 2; The ->size field
         of the struct iomap_dio object is incremented to 4K;
      
       6) iomap calls btrfs_dio_iomap_begin() again, this time with a file
         offset of X + 4K. There we setup an iomap for the next extent
         that also has a size of 4K;
      
      7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
         which tries to access the next page (2nd page) of the buffer.
         This triggers a page fault and returns -EFAULT;
      
      8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
         to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
         the struct iomap_dio object has a ->size value of 4K (we submitted
         a bio for an extent already). The 'wait_for_completion' variable
         is not set to true, because our iocb has IOCB_NOWAIT set;
      
      9) At the bottom of __iomap_dio_rw(), we decrement the reference count
         of the struct iomap_dio object from 2 to 1. Because we were not
         the only ones holding a reference on it and 'wait_for_completion' is
         set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
         just returns it up the callchain, up to io_uring;
      
      10) The bio submitted for the first extent (step 5) completes and its
          bio endio function, iomap_dio_bio_end_io(), decrements the last
          reference on the struct iomap_dio object, resulting in calling
          iomap_dio_complete_work() -> iomap_dio_complete().
      
      11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
          and return 4K (the amount of io done) to iomap_dio_complete_work();
      
      12) iomap_dio_complete_work() calls the iocb completion callback,
          iocb->ki_complete() with a second argument value of 4K (total io
           done) and the iocb with the adjusted ki_pos of X + 4K. This results
          in completing the read request for io_uring, leaving it with a
          result of 4K bytes read, and only the first page of the buffer
          filled in, while the remaining 3 pages, corresponding to the other
          3 extents, were not filled;
      
      13) For the application, the result is unexpected because if we ask
          to read N bytes, it expects to get N bytes read as long as those
          N bytes don't cross the EOF (i_size).
      
      MariaDB reports this as an error, as it's not expecting a short read,
      since it knows it's asking for read operations fully within the i_size
      boundary. This is typical in many applications, but it may also be
      questionable if they should react to such short reads by issuing more
      read calls to get the remaining data. Nevertheless, the short read
      happened due to a change in btrfs regarding how it deals with page
      faults while in the middle of a read operation, and there's no reason
      why btrfs can't have the previous behaviour of returning the whole data
      that was requested by the application.
      
      The problem can also be triggered with the following simple program:
      
        /* Get O_DIRECT */
        #ifndef _GNU_SOURCE
        #define _GNU_SOURCE
        #endif
      
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <errno.h>
        #include <string.h>
        #include <liburing.h>
      
        int main(int argc, char *argv[])
        {
            char *foo_path;
            struct io_uring ring;
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;
            struct iovec iovec;
            int fd;
            long pagesize;
            void *write_buf;
            void *read_buf;
            ssize_t ret;
            int i;
      
            if (argc != 2) {
                fprintf(stderr, "Use: %s <directory>\n", argv[0]);
                return 1;
            }
      
            foo_path = malloc(strlen(argv[1]) + 5);
            if (!foo_path) {
                fprintf(stderr, "Failed to allocate memory for file path\n");
                return 1;
            }
            strcpy(foo_path, argv[1]);
            strcat(foo_path, "/foo");
      
            /*
             * Create file foo with 2 extents, each with a size matching
             * the page size. Then allocate a buffer to read both extents
             * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
             * the read with io_uring, access the first page of the buffer
             * to fault it in, so that during the read we only trigger a
             * page fault when accessing the second page of the buffer.
             */
             fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
                      O_DIRECT, 0666);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to create file 'foo': %s (errno %d)",
                         strerror(errno), errno);
                 return 1;
             }
      
             pagesize = sysconf(_SC_PAGE_SIZE);
             ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate write buffer\n");
                 return 1;
             }
      
             memset(write_buf, 0xab, pagesize);
             memset(write_buf + pagesize, 0xcd, pagesize);
      
             /* Create 2 extents, each with a size matching page size. */
             for (i = 0; i < 2; i++) {
                 ret = pwrite(fd, write_buf + i * pagesize, pagesize,
                              i * pagesize);
                 if (ret != pagesize) {
                     fprintf(stderr,
                           "Failed to write to file, ret = %ld errno %d (%s)\n",
                            ret, errno, strerror(errno));
                     return 1;
                 }
                 ret = fsync(fd);
                 if (ret != 0) {
                     fprintf(stderr, "Failed to fsync file\n");
                     return 1;
                 }
             }
      
             close(fd);
             fd = open(foo_path, O_RDONLY | O_DIRECT);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to open file 'foo': %s (errno %d)",
                         strerror(errno), errno);
                 return 1;
             }
      
             ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate read buffer\n");
                 return 1;
             }
      
             /*
              * Fault in only the first page of the read buffer.
              * We want to trigger a page fault for the 2nd page of the
              * read buffer during the read operation with io_uring
              * (O_DIRECT and IOCB_NOWAIT).
              */
             memset(read_buf, 0, 1);
      
             ret = io_uring_queue_init(1, &ring, 0);
             if (ret != 0) {
                 fprintf(stderr, "Failed to create io_uring queue\n");
                 return 1;
             }
      
             sqe = io_uring_get_sqe(&ring);
             if (!sqe) {
                 fprintf(stderr, "Failed to get io_uring sqe\n");
                 return 1;
             }
      
             iovec.iov_base = read_buf;
             iovec.iov_len = 2 * pagesize;
             io_uring_prep_readv(sqe, fd, &iovec, 1, 0);
      
             ret = io_uring_submit_and_wait(&ring, 1);
             if (ret != 1) {
                 fprintf(stderr,
                         "Failed at io_uring_submit_and_wait()\n");
                 return 1;
             }
      
             ret = io_uring_wait_cqe(&ring, &cqe);
             if (ret < 0) {
                 fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
                 return 1;
             }
      
             printf("io_uring read result for file foo:\n\n");
             printf("  cqe->res == %d (expected %ld)\n", cqe->res, 2 * pagesize);
             printf("  memcmp(read_buf, write_buf) == %d (expected 0)\n",
                    memcmp(read_buf, write_buf, 2 * pagesize));
      
             io_uring_cqe_seen(&ring, cqe);
             io_uring_queue_exit(&ring);
      
             return 0;
        }
      
      When running it on an unpatched kernel:
      
        $ gcc io_uring_test.c -luring
        $ mkfs.btrfs -f /dev/sda
        $ mount /dev/sda /mnt/sda
        $ ./a.out /mnt/sda
        io_uring read result for file foo:
      
          cqe->res == 4096 (expected 8192)
          memcmp(read_buf, write_buf) == -205 (expected 0)
      
      After this patch, the read always returns 8192 bytes, with the buffer
      filled with the correct data. Although that reproducer always triggers
      the bug in my test vms, it's possible that it will not be so reliable
      on other environments, as that can happen if the bio for the first
      extent completes and decrements the reference on the struct iomap_dio
      object before we do the atomic_dec_and_test() on the reference at
      __iomap_dio_rw().
      
      Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN
      whenever we try to satisfy a non blocking IO request (IOMAP_NOWAIT flag
      set) over a range that spans multiple extents (or a mix of extents and
      holes). This avoids returning success to the caller when we only did
      partial IO, which is not optimal for writes and for reads it's actually
       incorrect, as the caller doesn't expect to get fewer bytes read than it
      requested (unless EOF is crossed), as previously mentioned. This is also
      the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()),
      even though it doesn't use IOMAP_DIO_PARTIAL.
      
      A test case for fstests will follow soon.
      
      Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
      Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
      CC: stable@vger.kernel.org # 5.16+
       Reviewed-by: Josef Bacik <josef@toxicpanda.com>
       Signed-off-by: Filipe Manana <fdmanana@suse.com>
       Signed-off-by: David Sterba <dsterba@suse.com>
  3. 24 February 2022, 1 commit
    • btrfs: reduce extent threshold for autodefrag · 558732df
      By Qu Wenruo
       There is a big gap between inode_should_defrag() and the autodefrag
       extent size threshold.  inode_should_defrag() uses a flexible
       @small_write value: 16K for compressed extents and 64K for
       non-compressed extents.
      
      However for autodefrag extent size threshold, it's always fixed to the
      default value (256K).
      
       This means the following write sequence will cause autodefrag to
       defrag ranges which never triggered autodefrag:
      
        pwrite 0 8k
        sync
        pwrite 8k 128K
        sync
      
       The latter 128K write will also be considered a defrag target (if
       other conditions are met), while only the 8K write really triggered
       autodefrag.
      
      Such behavior can cause extra IO for autodefrag.
      
      Close the gap, by copying the @small_write value into inode_defrag, so
      that later autodefrag can use the same @small_write value which
      triggered autodefrag.
      
       Combined with the existing transid value, this allows autodefrag to
       really scan only the ranges which triggered it.
      
       Although this change mostly reduces the extent_thresh value for
       autodefrag, I believe in the future we should allow users to specify
       the autodefrag extent threshold through mount options, but that's
       another problem to consider later.
      
      CC: stable@vger.kernel.org # 5.16+
       Signed-off-by: Qu Wenruo <wqu@suse.com>
       Signed-off-by: David Sterba <dsterba@suse.com>
  4. 07 January 2022, 11 commits
  5. 03 January 2022, 7 commits
    • btrfs: stop accessing ->csum_root directly · fc28b25e
      By Josef Bacik
      We are going to have multiple csum roots in the future, so convert all
      users of ->csum_root to btrfs_csum_root() and rename ->csum_root to
      ->_csum_root so we can easily find remaining users in the future.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: set BTRFS_FS_STATE_NO_CSUMS if we fail to load the csum root · 056c8311
      By Josef Bacik
      We have a few places where we skip doing csums if we mounted with one of
      the rescue options that ignores bad csum roots.  In the future when
      there are multiple csum roots it'll be costly to check and see if there
      are any missing csum roots, so simply add a flag to indicate the fs
      should skip loading csums in case of errors.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: change root to fs_info for btrfs_reserve_metadata_bytes · 9270501c
      By Josef Bacik
      We used to need the root for btrfs_reserve_metadata_bytes to check the
      orphan cleanup state, but we no longer need that, we simply need the
      fs_info.  Change btrfs_reserve_metadata_bytes() to use the fs_info, and
      change both btrfs_block_rsv_refill() and btrfs_block_rsv_add() to do the
      same as they simply call btrfs_reserve_metadata_bytes() and then
      manipulate the block_rsv that is being used.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: get rid of root->orphan_cleanup_state · 54230013
      By Josef Bacik
      Now that we don't care about the stage of the orphan_cleanup_state,
      simply replace it with a bit on ->state to make sure we don't call the
      orphan cleanup every time we wander into this root.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make BTRFS_RESERVE_FLUSH_EVICT use the global rsv stealing code · ee6adbfd
      By Josef Bacik
      I forgot to convert this over when I introduced the global reserve
      stealing code to the space flushing code.  Evict was simply trying to
      make its reservation and then if it failed it would steal from the
       global rsv, which is racy because it's outside of the normal ticketing
      code.
      
      Fix this by setting ticket->steal if we are BTRFS_RESERVE_FLUSH_EVICT,
      and then make the priority flushing path do the steal for us.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_file_extent_inline_item_len take a slot · 437bd07e
      By Josef Bacik
      Instead of getting the btrfs_item for this, simply pass in the slot of
      the item and then use the btrfs_item_size_nr() helper inside of
      btrfs_file_extent_inline_item_len().
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range · f0bfa76a
      By Filipe Manana
      When doing a direct IO write against a file range that either has
      preallocated extents in that range or has regular extents and the file
      has the NOCOW attribute set, the write fails with -ENOSPC when all of
      the following conditions are met:
      
      1) There are no data blocks groups with enough free space matching
         the size of the write;
      
      2) There's not enough unallocated space for allocating a new data block
         group;
      
      3) The extents in the target file range are not shared, neither through
         snapshots nor through reflinks.
      
      This is wrong because a NOCOW write can be done in such case, and in fact
      it's possible to do it using a buffered IO write, since when failing to
      allocate data space, the buffered IO path checks if a NOCOW write is
      possible.
      
      The failure in direct IO write path comes from the fact that early on,
      at btrfs_dio_iomap_begin(), we try to allocate data space for the write
      and if it that fails we return the error and stop - we never check if we
      can do NOCOW. But later, at btrfs_get_blocks_direct_write(), we check
      if we can do a NOCOW write into the range, or a subset of the range, and
      then release the previously reserved data space.
      
      Fix this by doing the data reservation only if needed, when we must COW,
      at btrfs_get_blocks_direct_write() instead of doing it at
btrfs_dio_iomap_begin(). This also simplifies the logic a bit and removes
the inefficiency of doing unnecessary data reservations.
      
      The following example test script reproduces the problem:
      
        $ cat dio-nocow-enospc.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        # Use a small fixed size (1G) filesystem so that it's quick to fill
        # it up.
        # Make sure the mixed block groups feature is not enabled because we
        # later want to not have more space available for allocating data
        # extents but still have enough metadata space free for the file writes.
        mkfs.btrfs -f -b $((1024 * 1024 * 1024)) -O ^mixed-bg $DEV
        mount $DEV $MNT
      
        # Create our test file with the NOCOW attribute set.
        touch $MNT/foobar
        chattr +C $MNT/foobar
      
        # Now fill in all unallocated space with data for our test file.
        # This will allocate a data block group that will be full and leave
        # no (or a very small amount of) unallocated space in the device, so
        # that it will not be possible to allocate a new block group later.
        echo
        echo "Creating test file with initial data..."
        xfs_io -c "pwrite -S 0xab -b 1M 0 900M" $MNT/foobar
      
        # Now try a direct IO write against file range [0, 10M[.
        # This should succeed since this is a NOCOW file and an extent for the
        # range was previously allocated.
        echo
        echo "Trying direct IO write over allocated space..."
        xfs_io -d -c "pwrite -S 0xcd -b 10M 0 10M" $MNT/foobar
      
        umount $MNT
      
      When running the test:
      
        $ ./dio-nocow-enospc.sh
        (...)
      
        Creating test file with initial data...
        wrote 943718400/943718400 bytes at offset 0
        900 MiB, 900 ops; 0:00:01.43 (625.526 MiB/sec and 625.5265 ops/sec)
      
        Trying direct IO write over allocated space...
        pwrite: No space left on device
      
      A test case for fstests will follow, testing both this direct IO write
      scenario as well as the buffered IO write scenario to make it less likely
      to get future regressions on the buffered IO case.
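The reordered reservation logic can be sketched as follows; the function and parameter names here are placeholders for illustration, not the kernel API:

```c
#include <assert.h>
#include <stdbool.h>

#define ENOSPC 28

/* Toy model of the data allocator: pretend no space is left, as in the
 * reproducer above. */
static int reserve_data_space(unsigned long long len)
{
	(void)len;
	return -ENOSPC;
}

/* Sketch of the fix: check whether the range can be written NOCOW
 * first, and only reserve data space when we actually must COW. */
static int get_blocks_direct_write(bool can_nocow, unsigned long long len)
{
	if (can_nocow)
		return 0;		/* no data reservation needed */
	return reserve_data_space(len);	/* may still fail with -ENOSPC */
}
```

With this ordering, a NOCOW write over preallocated extents succeeds even when no data space is left, matching the buffered IO behavior.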
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      f0bfa76a
  6. 29 October 2021, 1 commit
  7. 27 October 2021, 12 commits
    • D
      Revert "btrfs: compression: drop kmap/kunmap from generic helpers" · 3a60f653
Authored by David Sterba
      This reverts commit 4c2bf276.
      
The kmaps in compression code are still needed and cause crashes on
32bit machines (ARM, x86). Reproducible e.g. by running fstest btrfs/004
with LZO or ZSTD compression enabled.
      with enabled LZO or ZSTD compression.
      
      Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839
Signed-off-by: David Sterba <dsterba@suse.com>
      3a60f653
    • Q
      btrfs: remove btrfs_bio::logical member · f4f39fc5
Authored by Qu Wenruo
      The member btrfs_bio::logical is only initialized by two call sites:
      
      - btrfs_repair_one_sector()
        No corresponding site to utilize it.
      
      - btrfs_submit_direct()
        The corresponding site to utilize it is btrfs_check_read_dio_bio().
      
      However for btrfs_check_read_dio_bio(), we can grab the file_offset from
      btrfs_dio_private::file_offset directly.
      
      Thus it turns out we don't really need that btrfs_bio::logical member at
      all.
      
      For btrfs_bio, the logical bytenr can be fetched from its
      bio->bi_iter.bi_sector directly.
      
      So let's just remove the member to save 8 bytes for structure btrfs_bio.
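A small sketch of where the logical bytenr comes from after the removal; the struct here is a simplified stand-in for the bio iterator, and SECTOR_SHIFT reflects the block layer's 512-byte sector units:

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the block layer */

/* Simplified stand-in for the bio's iterator: only the starting sector. */
struct bvec_iter_sketch {
	uint64_t bi_sector;
};

/* With btrfs_bio::logical removed, the logical bytenr is derived from
 * the bio's starting sector instead of being stored separately. */
static uint64_t bio_logical(const struct bvec_iter_sketch *iter)
{
	return iter->bi_sector << SECTOR_SHIFT;
}
```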
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      f4f39fc5
    • Q
      btrfs: rename btrfs_dio_private::logical_offset to file_offset · 47926ab5
Authored by Qu Wenruo
The name "logical_offset" can be confused with the logical bytenr of
the dio range.

In fact it's a file offset, and the name "file_offset" is already widely
used at all other call sites.
      
      Just do the rename to avoid confusion.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      47926ab5
    • N
      btrfs: pull up qgroup checks from delayed-ref core to init time · 681145d4
Authored by Nikolay Borisov
Instead of checking whether qgroup processing for a delayed ref has to
happen in the core of the delayed ref code, simply pull the check up to
init time of the respective delayed ref structures. This eliminates the
final use of real_root in the delayed-ref core, paving the way to making
this member optional.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      681145d4
    • N
      btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_ref · f42c5da6
Authored by Nikolay Borisov
In order to make 'real_root' used only in ref-verify, it's required to
have the necessary context to perform the same checks that this member
is used for. So add 'mod_root', which will contain the root on behalf of
which a delayed ref was created, and a 'skip_group' parameter which
will contain a callsite-specific override of skip_qgroup.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      f42c5da6
    • J
      btrfs: add a BTRFS_FS_ERROR helper · 84961539
Authored by Josef Bacik
      We have a few flags that are inconsistently used to describe the fs in
      different states of failure.  As of 5963ffca ("btrfs: always abort
      the transaction if we abort a trans handle") we will always set
      BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED
and ERROR to see if things have gone wrong.  Add a helper to check
BTRFS_FS_STATE_ERROR and then convert all checks of FS_STATE_ERROR to
use the helper.
      
The TRANS_ABORTED bit check was added in af722733 ("Btrfs: clean up
resources during umount after trans is aborted") but is not actually
specific to that case.
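The helper's general shape can be sketched like this; the real kernel version wraps test_bit() on fs_info->fs_state, which is simplified here to a plain bitmask on a stand-in struct:

```c
#include <assert.h>

/* Bit index stand-in for BTRFS_FS_STATE_ERROR. */
#define FS_STATE_ERROR_BIT 0

struct fs_info_sketch {
	unsigned long fs_state;
};

/* Sketch of the helper's shape: one predicate instead of separate
 * ABORTED and ERROR checks scattered around the code. */
static int fs_error(const struct fs_info_sketch *fs_info)
{
	return (fs_info->fs_state >> FS_STATE_ERROR_BIT) & 1;
}
```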
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      84961539
    • J
      btrfs: change error handling for btrfs_delete_*_in_log · 9a35fc95
Authored by Josef Bacik
      Currently we will abort the transaction if we get a random error (like
      -EIO) while trying to remove the directory entries from the root log
      during rename.
      
      However since these are simply log tree related errors, we can mark the
      trans as needing a full commit.  Then if the error was truly
      catastrophic we'll hit it during the normal commit and abort as
      appropriate.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      9a35fc95
    • Q
      btrfs: subpage: only allow compression if the range is fully page aligned · 0cf9b244
Authored by Qu Wenruo
      For compressed write, we use a mechanism called async COW, which unlike
      regular run_delalloc_cow() or cow_file_range() will also unlock the
      first page.
      
This mechanism allows us to continue handling the next ranges, without
waiting for the time-consuming compression.
      
      But this has a problem for subpage case, as we could have the following
      delalloc range for a page:
      
      0		32K		64K
      |	|///////|	|///////|
      		\- A		\- B
      
In the above case, if we pass both ranges to cow_file_range_async(),
both range A and range B will try to unlock the full page [0, 64K).

Whichever one finishes later will try to do other page operations like
end_page_writeback() on an unlocked page, triggering a VM layer
BUG_ON().

To make subpage compression work at least partially, here we add another
restriction: only allow compression if the delalloc range is fully page
aligned.

With that, the async extent is always guaranteed to unlock the first
page exclusively, just like it used to be for the regular sectorsize.

In theory, we only need to make sure the delalloc range fully covers its
first page, but the tail page will be locked anyway, blocking later
writeback until the compression finishes.

Thus here we choose to make sure the range is fully page aligned before
doing the compression.
      
In the future, we could optimize the situation by properly increasing
the subpage::writers number for the locked page, but that also means we
need to change how we run the delalloc ranges of a page (instead of
running each delalloc range we hit, we would need to find and lock all
delalloc ranges covering the page, then run each of them).
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      0cf9b244
    • Q
      btrfs: subpage: avoid potential deadlock with compression and delalloc · 2749f7ef
Authored by Qu Wenruo
      [BUG]
      With experimental subpage compression enabled, a simple fsstress can
      lead to self deadlock on page 720896:
      
              mkfs.btrfs -f -s 4k $dev > /dev/null
              mount $dev -o compress $mnt
              $fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156
      
      [CAUSE]
If we have a file layout like the one below:
      
      	0	32K	64K	96K	128K
      	|//|		|///////////////|
      	   4K
      
      Then we run delalloc range for the inode, it will:
      
      - Call find_lock_delalloc_range() with @delalloc_start = 0
        Then we got a delalloc range [0, 4K).
      
        This range will be COWed.
      
      - Call find_lock_delalloc_range() again with @delalloc_start = 4K
        Since find_lock_delalloc_range() never cares whether the range
        is still inside page range [0, 64K), it will return range [64K, 128K).
      
  This range meets the condition for subpage compression and will go
  through the async COW path.

  And the async COW path will return @page_started.
      
        But that @page_started is now for range [64K, 128K), not for range
        [0, 64K).
      
- writepage_delalloc() returned 1 for page [0, 64K)
  Thus page [0, 64K) will not be unlocked, nor will its dirty status
  be cleared.
      
      Next time when we try to lock page [0, 64K) we will deadlock, as there
      is no one to release page [0, 64K).
      
      This problem will never happen for regular page size as one page only
      contains one sector.  After the first find_lock_delalloc_range() call,
the @delalloc_end will go beyond @page_end no matter whether we found a
delalloc range or not.
      
      Thus this bug only happens for subpage, as now we need multiple runs to
      exhaust the delalloc range of a page.
      
      [FIX]
Fix the problem by ensuring the delalloc range we run at least starts
inside @locked_page, so that we never get an incorrect @page_started.
      
      And to prevent such problem from happening again:
      
- Make find_lock_delalloc_range() return false if the found range is
  beyond the @end value passed in.
      
        Since @end will be utilized now, add an ASSERT() to ensure we pass
        correct @end into find_lock_delalloc_range().
      
  This also means that, for selftests, we need to populate @end before
  calling find_lock_delalloc_range().
      
      - New ASSERT() in find_lock_delalloc_range()
  Now we will make sure the @start/@end passed in cover at least part
  of the page.
      
      - New ASSERT() in run_delalloc_range()
  To make sure the range at least starts inside @locked_page.
      
      - Use @delalloc_start as proper cursor, while @delalloc_end is always
        reset to @page_end.
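The core of the new check can be sketched as a predicate on the found range; this is a simplified illustration of the contract, not the kernel function itself:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the new contract: a found delalloc range only counts if it
 * starts at or before the @end the caller asked about. Ranges found
 * entirely beyond @end are rejected, so the caller never treats an
 * out-of-page range as if it belonged to the locked page. */
static bool delalloc_range_valid(uint64_t found_start, uint64_t end)
{
	return found_start <= end;
}
```

In the layout above, with @end at the end of page [0, 64K), the found range [64K, 128K) is rejected instead of producing a bogus @page_started.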
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      2749f7ef
    • Q
      btrfs: factor uncompressed async extent submission code into a new helper · 2b83a0ee
Authored by Qu Wenruo
Introduce a new helper, submit_uncompressed_range(), for the async COW
cases where we fall back to COW.
      
      There are some new updates introduced to the helper:
      
      - Proper locked_page detection
        It's possible that the async_extent range doesn't cover the locked
        page.  In that case we shouldn't unlock the locked page.
      
        In the new helper, we will ensure that we only unlock the locked page
        when:
      
        * The locked page covers part of the async_extent range
        * The locked page is not unlocked by cow_file_range() nor
          extent_write_locked_range()
      
        This also means extra comments are added focusing on the page locking.
      
- Add extra comments on a rarely used parameter
  We use @unlock_page = 0 for cow_file_range(), where only two call
  sites do the same thing, including the new helper.
      
        It's definitely worth some comments.
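The locked-page detection the helper performs can be sketched as a range-overlap check; the names here are illustrative and the ranges are inclusive [start, end]:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the locked-page detection in the new helper: only consider
 * unlocking the locked page when it actually overlaps the async extent
 * range [start, end]. Two inclusive ranges overlap iff each starts at
 * or before the other ends. */
static bool page_covers_range(uint64_t page_start, uint64_t page_end,
			      uint64_t start, uint64_t end)
{
	return page_start <= end && page_end >= start;
}
```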
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      2b83a0ee
    • Q
      btrfs: subpage: make compress_file_range() compatible · 4c162778
Authored by Qu Wenruo
In function compress_file_range(), when the compression is finished, the
function just rounds up @total_in to PAGE_SIZE.  This is fine for the
regular sectorsize == PAGE_SIZE case, but not for subpage.
      
      Just change the ALIGN(, PAGE_SIZE) to round_up(, sectorsize) so that
      both regular sectorsize and subpage sectorsize will be happy.
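A quick illustration of why the granularity matters, using a hypothetical subpage configuration (64K pages, 4K sectors); the macro mirrors the shape of the kernel's power-of-two round_up():

```c
#include <assert.h>
#include <stdint.h>

/* Power-of-two round up, same shape as the kernel's round_up(). */
#define ROUND_UP(x, y) (((x) + (y) - 1) & ~((uint64_t)(y) - 1))

#define PAGE_SIZE  65536ULL	/* 64K page, the subpage configuration */
#define SECTORSIZE 4096ULL	/* 4K sectors */

/* Rounding @total_in to the sectorsize keeps the value accurate on
 * subpage; rounding to PAGE_SIZE would massively over-report it. */
static uint64_t total_in_rounded(uint64_t total_in)
{
	return ROUND_UP(total_in, SECTORSIZE);
}
```

For example, 5000 bytes of input rounds to 8192 with the sectorsize, but would balloon to 65536 if rounded to PAGE_SIZE.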
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      4c162778
    • Q
      btrfs: cleanup for extent_write_locked_range() · 2bd0fc93
      Qu Wenruo 提交于
      There are several cleanups for extent_write_locked_range(), most of them
      are pure cleanups, but with some preparation for future subpage support.
      
      - Add a proper comment for which call sites are suitable
        Unlike regular synchronized extent write back, if async COW or zoned
        COW happens, we have all pages in the range still locked.
      
        Thus for those (only) two call sites, we need this function to submit
        page content into bios and submit them.
      
      - Remove @mode parameter
        All the existing two call sites pass WB_SYNC_ALL. No need for @mode
        parameter.
      
      - Better error handling
  Currently if we hit an error during the page iteration loop, we
  overwrite @ret, so that only the last error gets recorded.

  Here we add @found_error and @first_error variables to record whether
  we hit any error, and the first error we hit, so the first error
  won't get lost.
      
- Don't reuse @start as the cursor
  We reuse the parameter @start as the cursor to iterate the range,
  which is not a big problem, but since we're here, introduce a proper
  @cur as the cursor.
      
      - Remove impossible branch
  Since all pages are still locked after the ordered extent is inserted,
  there is no way a page can get its dirty bit cleared.
        Remove the branch where page is not dirty and replace it with an
        ASSERT().
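The first-error recording pattern described above can be sketched like this; the iteration is modeled with an array of per-page return codes rather than real page submission:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the error handling pattern from the cleanup: remember the
 * first error hit during the iteration instead of overwriting the
 * return value on every failing page. */
static int write_pages(const int *page_ret, int nr_pages)
{
	bool found_error = false;
	int first_error = 0;
	int i;

	for (i = 0; i < nr_pages; i++) {
		int ret = page_ret[i];	/* stand-in for one page submission */

		if (ret < 0 && !found_error) {
			found_error = true;
			first_error = ret;
		}
	}
	return found_error ? first_error : 0;
}
```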
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      2bd0fc93