1. 17 Mar 2022, 5 commits
  2. 16 Mar 2022, 1 commit
    • io_uring: recycle apoll_poll entries · 4d9237e3
      Committed by Jens Axboe
      Particularly for networked workloads, io_uring intensively uses its
      poll based backend to get a notification when data/space is available.
      Profiling workloads, we see 3-4% of alloc+free that is directly attributed
      to just the apoll allocation and free (and the rest being skb alloc+free).
      
      For the fast path, we have ctx->uring_lock held already for both issue
      and the inline completions, and we can utilize that to avoid any extra
      locking needed to have a basic recycling cache for the apoll entries on
      both the alloc and free side.
      
      Double poll still requires an allocation. But those are rare and not
      a fast path item.
      
      With the simple cache in place, we see a 3-4% reduction in overhead for
      the workload.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 12 Mar 2022, 1 commit
  4. 11 Mar 2022, 7 commits
  5. 10 Mar 2022, 18 commits
  6. 06 Mar 2022, 2 commits
  7. 04 Mar 2022, 2 commits
    • btrfs: fallback to blocking mode when doing async dio over multiple extents · ca93e44b
      Committed by Filipe Manana
      Some users recently reported that MariaDB was getting a read corruption
      when using io_uring on top of btrfs. This started to happen in 5.16,
      after commit 51bd9563 ("btrfs: fix deadlock due to page faults
      during direct IO reads and writes"). That changed btrfs to use the new
      iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
      iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
      corresponds to a memory mapped file region. That type of scenario is
      exercised by test case generic/647 from fstests.
      
      For this MariaDB scenario, we attempt to read 16K from file offset X
      using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
      with a size of 4K, and what happens is the following:
      
      1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
      
      2) iomap creates a struct iomap_dio object, its reference count is
         initialized to 1 and its ->size field is initialized to 0;
      
      3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds
          the first 4K extent, and sets up an iomap for this extent consisting
         of a single page;
      
      4) At iomap_dio_bio_iter(), we are able to access the first page of the
         buffer (struct iov_iter) with bio_iov_iter_get_pages() without
         triggering a page fault;
      
      5) iomap submits a bio for this 4K extent
         (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
         the refcount on the struct iomap_dio object to 2; The ->size field
         of the struct iomap_dio object is incremented to 4K;
      
       6) iomap calls btrfs_dio_iomap_begin() again, this time with a file
          offset of X + 4K. There we set up an iomap for the next extent
         that also has a size of 4K;
      
      7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
         which tries to access the next page (2nd page) of the buffer.
         This triggers a page fault and returns -EFAULT;
      
      8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
         to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
         the struct iomap_dio object has a ->size value of 4K (we submitted
         a bio for an extent already). The 'wait_for_completion' variable
         is not set to true, because our iocb has IOCB_NOWAIT set;
      
      9) At the bottom of __iomap_dio_rw(), we decrement the reference count
         of the struct iomap_dio object from 2 to 1. Because we were not
         the only ones holding a reference on it and 'wait_for_completion' is
         set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
         just returns it up the callchain, up to io_uring;
      
      10) The bio submitted for the first extent (step 5) completes and its
          bio endio function, iomap_dio_bio_end_io(), decrements the last
          reference on the struct iomap_dio object, resulting in calling
          iomap_dio_complete_work() -> iomap_dio_complete().
      
      11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
          and return 4K (the amount of io done) to iomap_dio_complete_work();
      
      12) iomap_dio_complete_work() calls the iocb completion callback,
          iocb->ki_complete() with a second argument value of 4K (total io
           done) and the iocb with the adjusted ki_pos of X + 4K. This results
          in completing the read request for io_uring, leaving it with a
          result of 4K bytes read, and only the first page of the buffer
          filled in, while the remaining 3 pages, corresponding to the other
          3 extents, were not filled;
      
      13) For the application, the result is unexpected because if we ask
          to read N bytes, it expects to get N bytes read as long as those
          N bytes don't cross the EOF (i_size).
      
      MariaDB reports this as an error, as it's not expecting a short read,
      since it knows it's asking for read operations fully within the i_size
       boundary. This is typical in many applications, though it is debatable
       whether they should react to such short reads by issuing further read
       calls to get the remaining data. Nevertheless, the short read
      happened due to a change in btrfs regarding how it deals with page
      faults while in the middle of a read operation, and there's no reason
      why btrfs can't have the previous behaviour of returning the whole data
      that was requested by the application.
      
      The problem can also be triggered with the following simple program:
      
        /* Get O_DIRECT */
        #ifndef _GNU_SOURCE
        #define _GNU_SOURCE
        #endif
      
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <errno.h>
        #include <string.h>
        #include <liburing.h>
      
        int main(int argc, char *argv[])
        {
            char *foo_path;
            struct io_uring ring;
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;
            struct iovec iovec;
            int fd;
            long pagesize;
            void *write_buf;
            void *read_buf;
            ssize_t ret;
            int i;
      
            if (argc != 2) {
                fprintf(stderr, "Use: %s <directory>\n", argv[0]);
                return 1;
            }
      
            foo_path = malloc(strlen(argv[1]) + 5);
            if (!foo_path) {
                fprintf(stderr, "Failed to allocate memory for file path\n");
                return 1;
            }
            strcpy(foo_path, argv[1]);
            strcat(foo_path, "/foo");
      
            /*
             * Create file foo with 2 extents, each with a size matching
             * the page size. Then allocate a buffer to read both extents
             * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
             * the read with io_uring, access the first page of the buffer
             * to fault it in, so that during the read we only trigger a
             * page fault when accessing the second page of the buffer.
             */
             fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
                      O_DIRECT, 0666);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to create file 'foo': %s (errno %d)",
                         strerror(errno), errno);
                 return 1;
             }
      
             pagesize = sysconf(_SC_PAGE_SIZE);
             ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate write buffer\n");
                 return 1;
             }
      
             memset(write_buf, 0xab, pagesize);
             memset(write_buf + pagesize, 0xcd, pagesize);
      
             /* Create 2 extents, each with a size matching page size. */
             for (i = 0; i < 2; i++) {
                 ret = pwrite(fd, write_buf + i * pagesize, pagesize,
                              i * pagesize);
                 if (ret != pagesize) {
                     fprintf(stderr,
                           "Failed to write to file, ret = %ld errno %d (%s)\n",
                            ret, errno, strerror(errno));
                     return 1;
                 }
                 ret = fsync(fd);
                 if (ret != 0) {
                     fprintf(stderr, "Failed to fsync file\n");
                     return 1;
                 }
             }
      
             close(fd);
             fd = open(foo_path, O_RDONLY | O_DIRECT);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to open file 'foo': %s (errno %d)",
                         strerror(errno), errno);
                 return 1;
             }
      
             ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate read buffer\n");
                 return 1;
             }
      
             /*
              * Fault in only the first page of the read buffer.
              * We want to trigger a page fault for the 2nd page of the
              * read buffer during the read operation with io_uring
              * (O_DIRECT and IOCB_NOWAIT).
              */
             memset(read_buf, 0, 1);
      
             ret = io_uring_queue_init(1, &ring, 0);
             if (ret != 0) {
                 fprintf(stderr, "Failed to create io_uring queue\n");
                 return 1;
             }
      
             sqe = io_uring_get_sqe(&ring);
             if (!sqe) {
                 fprintf(stderr, "Failed to get io_uring sqe\n");
                 return 1;
             }
      
             iovec.iov_base = read_buf;
             iovec.iov_len = 2 * pagesize;
             io_uring_prep_readv(sqe, fd, &iovec, 1, 0);
      
             ret = io_uring_submit_and_wait(&ring, 1);
             if (ret != 1) {
                 fprintf(stderr,
                         "Failed at io_uring_submit_and_wait()\n");
                 return 1;
             }
      
             ret = io_uring_wait_cqe(&ring, &cqe);
             if (ret < 0) {
                 fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
                 return 1;
             }
      
             printf("io_uring read result for file foo:\n\n");
             printf("  cqe->res == %d (expected %ld)\n", cqe->res, 2 * pagesize);
             printf("  memcmp(read_buf, write_buf) == %d (expected 0)\n",
                    memcmp(read_buf, write_buf, 2 * pagesize));
      
             io_uring_cqe_seen(&ring, cqe);
             io_uring_queue_exit(&ring);
      
             return 0;
        }
      
      When running it on an unpatched kernel:
      
        $ gcc io_uring_test.c -luring
        $ mkfs.btrfs -f /dev/sda
        $ mount /dev/sda /mnt/sda
        $ ./a.out /mnt/sda
        io_uring read result for file foo:
      
          cqe->res == 4096 (expected 8192)
          memcmp(read_buf, write_buf) == -205 (expected 0)
      
      After this patch, the read always returns 8192 bytes, with the buffer
      filled with the correct data. Although that reproducer always triggers
       the bug in my test VMs, it's possible that it will not be as reliable
      on other environments, as that can happen if the bio for the first
      extent completes and decrements the reference on the struct iomap_dio
      object before we do the atomic_dec_and_test() on the reference at
      __iomap_dio_rw().
      
      Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN
      whenever we try to satisfy a non blocking IO request (IOMAP_NOWAIT flag
      set) over a range that spans multiple extents (or a mix of extents and
      holes). This avoids returning success to the caller when we only did
      partial IO, which is not optimal for writes and for reads it's actually
       incorrect, as the caller doesn't expect to get fewer bytes read than it has
      requested (unless EOF is crossed), as previously mentioned. This is also
      the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()),
      even though it doesn't use IOMAP_DIO_PARTIAL.
      
      A test case for fstests will follow soon.
      
      Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
      Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
      CC: stable@vger.kernel.org # 5.16+
       Reviewed-by: Josef Bacik <josef@toxicpanda.com>
       Signed-off-by: Filipe Manana <fdmanana@suse.com>
       Signed-off-by: David Sterba <dsterba@suse.com>
    • cachefiles: Fix incorrect length to fallocate() · b08968f1
      Committed by David Howells
      When cachefiles_shorten_object() calls fallocate() to shape the cache
      file to match the DIO size, it passes the total file size it wants to
      achieve, not the amount of zeros that should be inserted.  Since this is
      meant to preallocate that amount of storage for the file, it can cause
      the cache to fill up the disk and hit ENOSPC.
      
      Fix this by passing the length actually required to go from the current
      EOF to the desired EOF.
      
      Fixes: 7623ed67 ("cachefiles: Implement cookie resize for truncate")
       Reported-by: Jeffle Xu <jefflexu@linux.alibaba.com>
       Signed-off-by: David Howells <dhowells@redhat.com>
       Tested-by: Jeff Layton <jlayton@kernel.org>
       Reviewed-by: Jeff Layton <jlayton@kernel.org>
       cc: linux-cachefs@redhat.com
       Link: https://lore.kernel.org/r/164630854858.3665356.17419701804248490708.stgit@warthog.procyon.org.uk # v1
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 02 Mar 2022, 4 commits
    • btrfs: add missing run of delayed items after unlink during log replay · 4751dc99
      Committed by Filipe Manana
      During log replay, whenever we need to check if a name (dentry) exists in
       a directory we do searches on the subvolume tree for inode references
       or directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY
      keys as well, before kernel 5.17). However when during log replay we
      unlink a name, through btrfs_unlink_inode(), we may not delete inode
      references and dir index keys from a subvolume tree and instead just add
      the deletions to the delayed inode's delayed items, which will only be
       run when we commit the transaction used for log replay. This means that
       after an unlink operation during log replay, if we then attempt to
       search for the same name, we will not see that the name was already
       deleted, since the deletion is recorded only in the delayed items.
      
      We run delayed items after every unlink operation during log replay,
       except at unlink_old_inode_refs() and at add_inode_ref(). This was an
       oversight, as delayed items should be run after every unlink, for
      the reasons stated above.
      
      So fix those two cases.
      
      Fixes: 0d836392 ("Btrfs: fix mount failure after fsync due to hard link recreation")
      Fixes: 1f250e92 ("Btrfs: fix log replay failure after unlink and link combination")
      CC: stable@vger.kernel.org # 4.19+
       Signed-off-by: Filipe Manana <fdmanana@suse.com>
       Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: fix deadlock between rescan worker and remove qgroup · d4aef1e1
      Committed by Sidong Yang
       The commit e804861b ("btrfs: fix deadlock between quota disable and
       qgroup rescan worker") by Kawasaki resolves a deadlock between quota
       disable and the qgroup rescan worker. However, a similar deadlock can
       also occur when enabling or disabling quotas while creating or
       removing a qgroup. It can be reproduced with the simple script below.
      
      for i in {1..100}
      do
          btrfs quota enable /mnt &
          btrfs qgroup create 1/0 /mnt &
          btrfs qgroup destroy 1/0 /mnt &
          btrfs quota disable /mnt &
      done
      
      Here's why the deadlock happens:
      
      1) The quota rescan task is running.
      
      2) Task A calls btrfs_quota_disable(), locks the qgroup_ioctl_lock
         mutex, and then calls btrfs_qgroup_wait_for_completion(), to wait for
         the quota rescan task to complete.
      
      3) Task B calls btrfs_remove_qgroup() and it blocks when trying to lock
         the qgroup_ioctl_lock mutex, because it's being held by task A. At that
         point task B is holding a transaction handle for the current transaction.
      
      4) The quota rescan task calls btrfs_commit_transaction(). This results
         in it waiting for all other tasks to release their handles on the
         transaction, but task B is blocked on the qgroup_ioctl_lock mutex
         while holding a handle on the transaction, and that mutex is being held
         by task A, which is waiting for the quota rescan task to complete,
         resulting in a deadlock between these 3 tasks.
      
      To resolve this issue, the thread disabling quota should unlock
       qgroup_ioctl_lock before waiting for rescan completion. Move
      btrfs_qgroup_wait_for_completion() after unlock of qgroup_ioctl_lock.
      
      Fixes: e804861b ("btrfs: fix deadlock between quota disable and qgroup rescan worker")
      CC: stable@vger.kernel.org # 5.4+
       Reviewed-by: Filipe Manana <fdmanana@suse.com>
       Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
       Signed-off-by: Sidong Yang <realwakka@gmail.com>
       Reviewed-by: David Sterba <dsterba@suse.com>
       Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix relocation crash due to premature return from btrfs_commit_transaction() · 5fd76bf3
      Committed by Omar Sandoval
      We are seeing crashes similar to the following trace:
      
      [38.969182] WARNING: CPU: 20 PID: 2105 at fs/btrfs/relocation.c:4070 btrfs_relocate_block_group+0x2dc/0x340 [btrfs]
      [38.973556] CPU: 20 PID: 2105 Comm: btrfs Not tainted 5.17.0-rc4 #54
      [38.974580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
      [38.976539] RIP: 0010:btrfs_relocate_block_group+0x2dc/0x340 [btrfs]
      [38.980336] RSP: 0000:ffffb0dd42e03c20 EFLAGS: 00010206
      [38.981218] RAX: ffff96cfc4ede800 RBX: ffff96cfc3ce0000 RCX: 000000000002ca14
      [38.982560] RDX: 0000000000000000 RSI: 4cfd109a0bcb5d7f RDI: ffff96cfc3ce0360
      [38.983619] RBP: ffff96cfc309c000 R08: 0000000000000000 R09: 0000000000000000
      [38.984678] R10: ffff96cec0000001 R11: ffffe84c80000000 R12: ffff96cfc4ede800
      [38.985735] R13: 0000000000000000 R14: 0000000000000000 R15: ffff96cfc3ce0360
      [38.987146] FS:  00007f11c15218c0(0000) GS:ffff96d6dfb00000(0000) knlGS:0000000000000000
      [38.988662] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [38.989398] CR2: 00007ffc922c8e60 CR3: 00000001147a6001 CR4: 0000000000370ee0
      [38.990279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [38.991219] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [38.992528] Call Trace:
      [38.992854]  <TASK>
      [38.993148]  btrfs_relocate_chunk+0x27/0xe0 [btrfs]
      [38.993941]  btrfs_balance+0x78e/0xea0 [btrfs]
      [38.994801]  ? vsnprintf+0x33c/0x520
      [38.995368]  ? __kmalloc_track_caller+0x351/0x440
      [38.996198]  btrfs_ioctl_balance+0x2b9/0x3a0 [btrfs]
      [38.997084]  btrfs_ioctl+0x11b0/0x2da0 [btrfs]
      [38.997867]  ? mod_objcg_state+0xee/0x340
      [38.998552]  ? seq_release+0x24/0x30
      [38.999184]  ? proc_nr_files+0x30/0x30
      [38.999654]  ? call_rcu+0xc8/0x2f0
      [39.000228]  ? __x64_sys_ioctl+0x84/0xc0
      [39.000872]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      [39.001973]  __x64_sys_ioctl+0x84/0xc0
      [39.002566]  do_syscall_64+0x3a/0x80
      [39.003011]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [39.003735] RIP: 0033:0x7f11c166959b
      [39.007324] RSP: 002b:00007fff2543e998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [39.008521] RAX: ffffffffffffffda RBX: 00007f11c1521698 RCX: 00007f11c166959b
      [39.009833] RDX: 00007fff2543ea40 RSI: 00000000c4009420 RDI: 0000000000000003
      [39.011270] RBP: 0000000000000003 R08: 0000000000000013 R09: 00007f11c16f94e0
      [39.012581] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff25440df3
      [39.014046] R13: 0000000000000000 R14: 00007fff2543ea40 R15: 0000000000000001
      [39.015040]  </TASK>
      [39.015418] ---[ end trace 0000000000000000 ]---
      [43.131559] ------------[ cut here ]------------
      [43.132234] kernel BUG at fs/btrfs/extent-tree.c:2717!
      [43.133031] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [43.133702] CPU: 1 PID: 1839 Comm: btrfs Tainted: G        W         5.17.0-rc4 #54
      [43.134863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
      [43.136426] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs]
      [43.139913] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246
      [43.140629] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001
      [43.141604] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff
      [43.142645] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50
      [43.143669] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000
      [43.144657] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000
      [43.145686] FS:  00007f7657dd68c0(0000) GS:ffff96d6df640000(0000) knlGS:0000000000000000
      [43.146808] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [43.147584] CR2: 00007f7fe81bf5b0 CR3: 00000001093ee004 CR4: 0000000000370ee0
      [43.148589] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [43.149581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [43.150559] Call Trace:
      [43.150904]  <TASK>
      [43.151253]  btrfs_finish_extent_commit+0x88/0x290 [btrfs]
      [43.152127]  btrfs_commit_transaction+0x74f/0xaa0 [btrfs]
      [43.152932]  ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
      [43.153786]  btrfs_ioctl+0x1edc/0x2da0 [btrfs]
      [43.154475]  ? __check_object_size+0x150/0x170
      [43.155170]  ? preempt_count_add+0x49/0xa0
      [43.155753]  ? __x64_sys_ioctl+0x84/0xc0
      [43.156437]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      [43.157456]  __x64_sys_ioctl+0x84/0xc0
      [43.157980]  do_syscall_64+0x3a/0x80
      [43.158543]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [43.159231] RIP: 0033:0x7f7657f1e59b
      [43.161819] RSP: 002b:00007ffda5cd1658 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [43.162702] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7657f1e59b
      [43.163526] RDX: 0000000000000000 RSI: 0000000000009408 RDI: 0000000000000003
      [43.164358] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
      [43.165208] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      [43.166029] R13: 00005621b91c3232 R14: 00005621b91ba580 R15: 00007ffda5cd1800
      [43.166863]  </TASK>
      [43.167125] Modules linked in: btrfs blake2b_generic xor pata_acpi ata_piix libata raid6_pq scsi_mod libcrc32c virtio_net virtio_rng net_failover rng_core failover scsi_common
      [43.169552] ---[ end trace 0000000000000000 ]---
      [43.171226] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs]
      [43.174767] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246
      [43.175600] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001
      [43.176468] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff
      [43.177357] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50
      [43.178271] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000
      [43.179178] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000
      [43.180071] FS:  00007f7657dd68c0(0000) GS:ffff96d6df800000(0000) knlGS:0000000000000000
      [43.181073] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [43.181808] CR2: 00007fe09905f010 CR3: 00000001093ee004 CR4: 0000000000370ee0
      [43.182706] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [43.183591] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      We first hit the WARN_ON(rc->block_group->pinned > 0) in
      btrfs_relocate_block_group() and then the BUG_ON(!cache) in
      unpin_extent_range(). This tells us that we are exiting relocation and
      removing the block group with bytes still pinned for that block group.
      This is supposed to be impossible: the last thing relocate_block_group()
      does is commit the transaction to get rid of pinned extents.
      
      Commit d0c2f4fa ("btrfs: make concurrent fsyncs wait less when
      waiting for a transaction commit") introduced an optimization so that
      commits from fsync don't have to wait for the previous commit to unpin
      extents. This was only intended to affect fsync, but it inadvertently
      made it possible for any commit to skip waiting for the previous commit
      to unpin. This is because if a call to btrfs_commit_transaction() finds
      that another thread is already committing the transaction, it waits for
      the other thread to complete the commit and then returns. If that other
      thread was in fsync, then it completes the commit without completing the
      previous commit. This makes the following sequence of events possible:
      
      Thread 1____________________|Thread 2 (fsync)_____________________|Thread 3 (balance)___________________
      btrfs_commit_transaction(N) |                                     |
        btrfs_run_delayed_refs    |                                     |
          pin extents             |                                     |
        ...                       |                                     |
        state = UNBLOCKED         |btrfs_sync_file                      |
                                  |  btrfs_start_transaction(N + 1)     |relocate_block_group
                                  |                                     |  btrfs_join_transaction(N + 1)
                                  |  btrfs_commit_transaction(N + 1)    |
        ...                       |  trans->state = COMMIT_START        |
                                  |                                     |  btrfs_commit_transaction(N + 1)
                                  |                                     |    wait_for_commit(N + 1, COMPLETED)
                                  |  wait_for_commit(N, SUPER_COMMITTED)|
        state = SUPER_COMMITTED   |  ...                                |
        btrfs_finish_extent_commit|                                     |
          unpin_extent_range()    |  trans->state = COMPLETED           |
                                  |                                     |    return
                                  |                                     |
          ...                     |                                     |Thread 1 isn't done, so pinned > 0
                                  |                                     |and we WARN
                                  |                                     |
                                  |                                     |btrfs_remove_block_group
          unpin_extent_range()    |                                     |
            Thread 3 removed the  |                                     |
            block group, so we BUG|                                     |
      
      There are other sequences involving SUPER_COMMITTED transactions that
      can cause a similar outcome.
      
      We could fix this by making relocation explicitly wait for unpinning,
      but there may be other cases that need it. Josef mentioned ENOSPC
      flushing and the free space cache inode as other potential victims.
      Rather than playing whack-a-mole, this fix is conservative and makes all
      commits not in fsync wait for all previous transactions, which is what
      the optimization intended.
      
      Fixes: d0c2f4fa ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
      CC: stable@vger.kernel.org # 5.15+
       Reviewed-by: Filipe Manana <fdmanana@suse.com>
       Signed-off-by: Omar Sandoval <osandov@fb.com>
       Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not start relocation until in progress drops are done · b4be6aef
      Committed by Josef Bacik
      We hit a bug with a recovering relocation on mount for one of our file
      systems in production.  I reproduced this locally by injecting errors
      into snapshot delete with balance running at the same time.  This
      presented as an error while looking up an extent item
      
        WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680
        CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8
        RIP: 0010:lookup_inline_extent_backref+0x647/0x680
        RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202
        RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
        RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001
        R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000
        R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000
        FS:  0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0
        Call Trace:
         <TASK>
         insert_inline_extent_backref+0x46/0xd0
         __btrfs_inc_extent_ref.isra.0+0x5f/0x200
         ? btrfs_merge_delayed_refs+0x164/0x190
         __btrfs_run_delayed_refs+0x561/0xfa0
         ? btrfs_search_slot+0x7b4/0xb30
         ? btrfs_update_root+0x1a9/0x2c0
         btrfs_run_delayed_refs+0x73/0x1f0
         ? btrfs_update_root+0x1a9/0x2c0
         btrfs_commit_transaction+0x50/0xa50
         ? btrfs_update_reloc_root+0x122/0x220
         prepare_to_merge+0x29f/0x320
         relocate_block_group+0x2b8/0x550
         btrfs_relocate_block_group+0x1a6/0x350
         btrfs_relocate_chunk+0x27/0xe0
         btrfs_balance+0x777/0xe60
         balance_kthread+0x35/0x50
         ? btrfs_balance+0xe60/0xe60
         kthread+0x16b/0x190
         ? set_kthread_struct+0x40/0x40
         ret_from_fork+0x22/0x30
         </TASK>
      
      Normally snapshot deletion and relocation are excluded from running at
      the same time by the fs_info->cleaner_mutex.  However if we had a
      pending balance waiting to get the ->cleaner_mutex, and a snapshot
      deletion was running, and then the box crashed, we would come up in a
      state where we have a half deleted snapshot.
      
      Again, in the normal case the snapshot deletion needs to complete before
      relocation can start, but in this case relocation could very well start
      before the snapshot deletion completes, as we simply add the root to the
      dead roots list and wait for the next time the cleaner runs to clean up
      the snapshot.
      
       Fix this by setting a flag on the fs_info if any DEAD_ROOT's have a
       pending drop_progress key, since that means we were in the middle of
       the drop operation. Balance can then wait until this flag is cleared
       before starting up again.
      
      If there are DEAD_ROOT's that don't have a drop_progress set then we're
      safe to start balance right away as we'll be properly protected by the
      cleaner_mutex.
      
      CC: stable@vger.kernel.org # 5.10+
       Reviewed-by: Filipe Manana <fdmanana@suse.com>
       Signed-off-by: Josef Bacik <josef@toxicpanda.com>
       Reviewed-by: David Sterba <dsterba@suse.com>
       Signed-off-by: David Sterba <dsterba@suse.com>