1. 09 2月, 2021 6 次提交
    • J
      btrfs: stop running all delayed refs during snapshot · dac348e9
      Josef Bacik 提交于
      This was added in commit 361048f5 ("Btrfs: fix full backref problem
      when inserting shared block reference") to address a problem where we
      hit the following BUG_ON() in alloc_reserved_tree_block
      
              if (node->type == BTRFS_SHARED_BLOCK_REF_KEY) {
                      BUG_ON(!(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF));
      
      However this BUG_ON() is bogus, and was removed by previous commit:
      
        btrfs: remove bogus BUG_ON in alloc_reserved_tree_block
      
      We no longer need to run delayed refs because of this, and can remove
      this flushing here.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dac348e9
    • J
      btrfs: move delayed ref flushing for qgroup into qgroup helper · 2a4d84c1
      Josef Bacik 提交于
      The commit d6726335 ("btrfs: qgroup: Make snapshot accounting work
      with new extent-oriented qgroup.") added a flush of the delayed refs
      during snapshot creation in order to get the qgroup accounting properly.
      However this code has changed and been moved to it's own helper that is
      skipped if qgroups are turned off.  Move the flushing to the helper, as
      we do not need it when qgroups are turned off.
      
      Also add a comment explaining why it exists, and why it doesn't actually
      save us.  This will be helpful later when we try to fix qgroup
      accounting properly.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2a4d84c1
    • J
      btrfs: only run delayed refs once before committing · ad368f33
      Josef Bacik 提交于
      We try to pre-flush the delayed refs when committing, because we want to
      do as little work as possible in the critical section of the transaction
      commit.
      
      However doing this twice can lead to very long transaction commit delays
      as other threads are allowed to continue to generate more delayed refs,
      which potentially delays the commit by multiple minutes in very extreme
      cases.
      
      So simply stick to one pre-flush, and then continue the rest of the
      transaction commit.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ad368f33
    • J
      btrfs: only let one thread pre-flush delayed refs in commit · e19eb11f
      Josef Bacik 提交于
      I've been running a stress test that runs 20 workers in their own
      subvolume, which are running an fsstress instance with 4 threads per
      worker, which is 80 total fsstress threads.  In addition to this I'm
      running balance in the background as well as creating and deleting
      snapshots.  This test takes around 12 hours to run normally, going
      slower and slower as the test goes on.
      
      The reason for this is because fsstress is running fsync sometimes, and
      because we're messing with block groups we often fall through to
      btrfs_commit_transaction, so will often have 20-30 threads all calling
      btrfs_commit_transaction at the same time.
      
      These all get stuck contending on the extent tree while they try to run
      delayed refs during the initial part of the commit.
      
      This is suboptimal, really because the extent tree is a single point of
      failure we only want one thread acting on that tree at once to reduce
      lock contention.
      
      Fix this by making the flushing mechanism a bit operation, to make it
      easy to use test_and_set_bit() in order to make sure only one task does
      this initial flush.
      
      Once we're into the transaction commit we only have one thread doing
      delayed ref running, it's just this initial pre-flush that is
      problematic.  With this patch my stress test takes around 90 minutes to
      run, instead of 12 hours.
      
      The memory barrier is not necessary for the flushing bit as it's
      ordered, unlike plain int. The transaction state accessed in
      btrfs_should_end_transaction could be affected by that too as it's not
      always used under transaction lock. Upon Nikolay's analysis in [1]
      it's not necessary:
      
        In should_end_transaction it's read without holding any locks. (U)
      
        It's modified in btrfs_cleanup_transaction without holding the
        fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set.
      
        set in cleanup_transaction under fs_info->trans_lock (L)
        set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock.(L)
        set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock.(L)
        set in btrfs_commit_trans to COMMIT_UNBLOCK under
        fs_info->trans_lock.(L)
      
        set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this
        point the transaction is finished and fs_info->running_trans is NULL (U
        but irrelevant).
      
        So by the looks of it we can have a concurrent READ race with a WRITE,
        due to reads not taking a lock. In this case what we want to ensure is
        we either see new or old state. I consulted with Will Deacon and he said
        that in such a case we'd want to annotate the accesses to ->state with
        (READ|WRITE)_ONCE so as to avoid a theoretical tear, in this case I
        don't think this could happen but I imagine at some point KCSAN would
        flag such an access as racy (which it is).
      
      [1] https://lore.kernel.org/linux-btrfs/e1fd5cc1-0f28-f670-69f4-e9958b4964e6@suse.comReviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ add comments regarding memory barrier ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e19eb11f
    • N
      btrfs: rename btrfs_find_free_objectid to btrfs_get_free_objectid · 543068a2
      Nikolay Borisov 提交于
      This better reflects the semantics of the function i.e no search is
      performed whatsoever.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      543068a2
    • J
      btrfs: fix error handling in commit_fs_roots · 4f4317c1
      Josef Bacik 提交于
      While doing error injection I would sometimes get a corrupt file system.
      This is because I was injecting errors at btrfs_search_slot, but would
      only do it one time per stack.  This uncovered a problem in
      commit_fs_roots, where if we get an error we would just break.  However
      we're in a nested loop, the first loop being a loop to find all the
      dirty fs roots, and then subsequent root updates would succeed clearing
      the error value.
      
      This isn't likely to happen in real scenarios, however we could
      potentially get a random ENOMEM once and then not again, and we'd end up
      with a corrupted file system.  Fix this by moving the error checking
      around a bit to the main loop, as this is the only place where something
      will fail, and return the error as soon as it occurs.
      
      With this patch my reproducer no longer corrupts the file system.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4f4317c1
  2. 12 1月, 2021 1 次提交
  3. 10 12月, 2020 2 次提交
    • B
      btrfs: keep sb cache_generation consistent with space_cache · 94846229
      Boris Burkov 提交于
      When mounting, btrfs uses the cache_generation in the super block to
      determine if space cache v1 is in use. However, by mounting with
      nospace_cache or space_cache=v2, it is possible to disable space cache
      v1, which does not result in un-setting cache_generation back to 0.
      
      In order to base some logic, like mount option printing in /proc/mounts,
      on the current state of the space cache rather than just the values of
      the mount option, keep the value of cache_generation consistent with the
      status of space cache v1.
      
      We ensure that cache_generation > 0 iff the file system is using
      space_cache v1. This requires committing a transaction on any mount
      which changes whether we are using v1. (v1->nospace_cache, v1->v2,
      nospace_cache->v1, v2->v1).
      
      Since the mechanism for writing out the cache generation is transaction
      commit, but we want some finer grained control over when we un-set it,
      we can't just rely on the SPACE_CACHE mount option, and introduce an
      fs_info flag that mount can use when it wants to unset the generation.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NBoris Burkov <boris@bur.io>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      94846229
    • N
      btrfs: remove inode number cache feature · 5297199a
      Nikolay Borisov 提交于
      It's been deprecated since commit b547a88e ("btrfs: start
      deprecation of mount option inode_cache") which enumerates the reasons.
      
      A filesystem that uses the feature (mount -o inode_cache) tracks the
      inode numbers in bitmaps, that data stay on the filesystem after this
      patch. The size is roughly 5MiB for 1M inodes [1], which is considered
      small enough to be left there. Removal of the change can be implemented
      in btrfs-progs if needed.
      
      [1] https://lore.kernel.org/linux-btrfs/20201127145836.GZ6430@twin.jikos.cz/Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5297199a
  4. 08 12月, 2020 7 次提交
    • N
      btrfs: return bool from btrfs_should_end_transaction · a2633b6a
      Nikolay Borisov 提交于
      Results in slightly smaller code.
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-11 (-11)
      Function                                     old     new   delta
      btrfs_should_end_transaction                  96      85     -11
      Total: Before=20070, After=20059, chg -0.05%
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a2633b6a
    • N
      8a8f4dea
    • N
    • J
      btrfs: protect fs_info->caching_block_groups by block_group_cache_lock · bbb86a37
      Josef Bacik 提交于
      I got the following lockdep splat
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0+ #101 Not tainted
        ------------------------------------------------------
        btrfs-cleaner/3445 is trying to acquire lock:
        ffff89dbec39ab48 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170
      
        but task is already holding lock:
        ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (&fs_info->commit_root_sem){++++}-{3:3}:
      	 down_write+0x3d/0x70
      	 btrfs_cache_block_group+0x2d5/0x510
      	 find_free_extent+0xb6e/0x12f0
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb1/0x330
      	 alloc_tree_block_no_bg_flush+0x4f/0x60
      	 __btrfs_cow_block+0x11d/0x580
      	 btrfs_cow_block+0x10c/0x220
      	 commit_cowonly_roots+0x47/0x2e0
      	 btrfs_commit_transaction+0x595/0xbd0
      	 sync_filesystem+0x74/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x36/0xa0
      	 cleanup_mnt+0x12d/0x190
      	 task_work_run+0x5c/0xa0
      	 exit_to_user_mode_prepare+0x1df/0x200
      	 syscall_exit_to_user_mode+0x54/0x280
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&space_info->groups_sem){++++}-{3:3}:
      	 down_read+0x40/0x130
      	 find_free_extent+0x2ed/0x12f0
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb1/0x330
      	 alloc_tree_block_no_bg_flush+0x4f/0x60
      	 __btrfs_cow_block+0x11d/0x580
      	 btrfs_cow_block+0x10c/0x220
      	 commit_cowonly_roots+0x47/0x2e0
      	 btrfs_commit_transaction+0x595/0xbd0
      	 sync_filesystem+0x74/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x36/0xa0
      	 cleanup_mnt+0x12d/0x190
      	 task_work_run+0x5c/0xa0
      	 exit_to_user_mode_prepare+0x1df/0x200
      	 syscall_exit_to_user_mode+0x54/0x280
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (btrfs-root-00){++++}-{3:3}:
      	 __lock_acquire+0x1167/0x2150
      	 lock_acquire+0xb9/0x3d0
      	 down_read_nested+0x43/0x130
      	 __btrfs_tree_read_lock+0x32/0x170
      	 __btrfs_read_lock_root_node+0x3a/0x50
      	 btrfs_search_slot+0x614/0x9d0
      	 btrfs_find_root+0x35/0x1b0
      	 btrfs_read_tree_root+0x61/0x120
      	 btrfs_get_root_ref+0x14b/0x600
      	 find_parent_nodes+0x3e6/0x1b30
      	 btrfs_find_all_roots_safe+0xb4/0x130
      	 btrfs_find_all_roots+0x60/0x80
      	 btrfs_qgroup_trace_extent_post+0x27/0x40
      	 btrfs_add_delayed_data_ref+0x3fd/0x460
      	 btrfs_free_extent+0x42/0x100
      	 __btrfs_mod_ref+0x1d7/0x2f0
      	 walk_up_proc+0x11c/0x400
      	 walk_up_tree+0xf0/0x180
      	 btrfs_drop_snapshot+0x1c7/0x780
      	 btrfs_clean_one_deleted_snapshot+0xfb/0x110
      	 cleaner_kthread+0xd4/0x140
      	 kthread+0x13a/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          btrfs-root-00 --> &space_info->groups_sem --> &fs_info->commit_root_sem
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->commit_root_sem);
      				 lock(&space_info->groups_sem);
      				 lock(&fs_info->commit_root_sem);
          lock(btrfs-root-00);
      
         *** DEADLOCK ***
      
        3 locks held by btrfs-cleaner/3445:
         #0: ffff89dbeaf28838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
         #1: ffff89dbeb6c7640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
         #2: ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
      
        stack backtrace:
        CPU: 0 PID: 3445 Comm: btrfs-cleaner Not tainted 5.9.0+ #101
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb0
         check_noncircular+0xcf/0xf0
         __lock_acquire+0x1167/0x2150
         ? __bfs+0x42/0x210
         lock_acquire+0xb9/0x3d0
         ? __btrfs_tree_read_lock+0x32/0x170
         down_read_nested+0x43/0x130
         ? __btrfs_tree_read_lock+0x32/0x170
         __btrfs_tree_read_lock+0x32/0x170
         __btrfs_read_lock_root_node+0x3a/0x50
         btrfs_search_slot+0x614/0x9d0
         ? find_held_lock+0x2b/0x80
         btrfs_find_root+0x35/0x1b0
         ? do_raw_spin_unlock+0x4b/0xa0
         btrfs_read_tree_root+0x61/0x120
         btrfs_get_root_ref+0x14b/0x600
         find_parent_nodes+0x3e6/0x1b30
         btrfs_find_all_roots_safe+0xb4/0x130
         btrfs_find_all_roots+0x60/0x80
         btrfs_qgroup_trace_extent_post+0x27/0x40
         btrfs_add_delayed_data_ref+0x3fd/0x460
         btrfs_free_extent+0x42/0x100
         __btrfs_mod_ref+0x1d7/0x2f0
         walk_up_proc+0x11c/0x400
         walk_up_tree+0xf0/0x180
         btrfs_drop_snapshot+0x1c7/0x780
         ? btrfs_clean_one_deleted_snapshot+0x73/0x110
         btrfs_clean_one_deleted_snapshot+0xfb/0x110
         cleaner_kthread+0xd4/0x140
         ? btrfs_alloc_root+0x50/0x50
         kthread+0x13a/0x150
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      while testing another lockdep fix.  This happens because we're using the
      commit_root_sem to protect fs_info->caching_block_groups, which creates
      a dependency on the groups_sem -> commit_root_sem, which is problematic
      because we will allocate blocks while holding tree roots.  Fix this by
      making the list itself protected by the fs_info->block_group_cache_lock.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bbb86a37
    • J
      btrfs: update last_byte_to_unpin in switch_commit_roots · 27d56e62
      Josef Bacik 提交于
      While writing an explanation for the need of the commit_root_sem for
      btrfs_prepare_extent_commit, I realized we have a slight hole that could
      result in leaked space if we have to do the old style caching.  Consider
      the following scenario
      
       commit root
       +----+----+----+----+----+----+----+
       |\\\\|    |\\\\|\\\\|    |\\\\|\\\\|
       +----+----+----+----+----+----+----+
       0    1    2    3    4    5    6    7
      
       new commit root
       +----+----+----+----+----+----+----+
       |    |    |    |\\\\|    |    |\\\\|
       +----+----+----+----+----+----+----+
       0    1    2    3    4    5    6    7
      
      Prior to this patch, we run btrfs_prepare_extent_commit, which updates
      the last_byte_to_unpin, and then we subsequently run
      switch_commit_roots.  In this example lets assume that
      caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which
      means that cache->last_byte_to_unpin == 1.  Then we go and do the
      switch_commit_roots(), but in the meantime the caching thread has made
      some more progress, because we drop the commit_root_sem and re-acquired
      it.  Now caching_ctl->progress == 3.  We swap out the commit root and
      carry on to unpin.
      
      The race can happen like:
      
        1) The caching thread was running using the old commit root when it
           found the extent for [2, 3);
      
        2) Then it released the commit_root_sem because it was in the last
           item of a leaf and the semaphore was contended, and set ->progress
           to 3 (value of 'last'), as the last extent item in the current leaf
           was for the extent for range [2, 3);
      
        3) Next time it gets the commit_root_sem, will start using the new
           commit root and search for a key with offset 3, so it never finds
           the hole for [2, 3).
      
        So the caching thread never saw [2, 3) as free space in any of the
        commit roots, and by the time finish_extent_commit() was called for
        the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the
        subrange [0, 1) to the free space cache, skipping [2, 3).
      
      In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1),
      but do not unpin [2,3).  However because caching_ctl->progress == 3 we
      do not see the newly freed section of [2,3), and thus do not add it to
      our free space cache.  This results in us missing a chunk of free space
      in memory (on disk too, unless we have a power failure before writing
      the free space cache to disk).
      
      Fix this by making sure the ->last_byte_to_unpin is set at the same time
      that we swap the commit roots, this ensures that we will always be
      consistent.
      
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ update changelog with Filipe's review comments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      27d56e62
    • J
      btrfs: locking: remove all the blocking helpers · ac5887c8
      Josef Bacik 提交于
      Now that we're using a rw_semaphore we no longer need to indicate if a
      lock is blocking or not, nor do we need to flip the entire path from
      blocking to spinning.  Remove these helpers and all the places they are
      called.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ac5887c8
    • F
      btrfs: do not start and wait for delalloc on snapshot roots on transaction commit · 88090ad3
      Filipe Manana 提交于
      We do not need anymore to start writeback for delalloc of roots that are
      being snapshotted and wait for it to complete. This was done in commit
      609e804d ("Btrfs: fix file corruption after snapshotting due to mix
      of buffered/DIO writes") to fix a type of file corruption where files in a
      snapshot end up having their i_size updated in a non-ordered way, leaving
      implicit file holes, when buffered IO writes that increase a file's size
      are followed by direct IO writes that also increase the file's size.
      
      This is not needed anymore because we now have a more generic mechanism
      to prevent a non-ordered i_size update since commit 9ddc959e
      ("btrfs: use the file extent tree infrastructure"), which addresses this
      scenario involving snapshots as well.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      88090ad3
  5. 07 10月, 2020 2 次提交
    • J
      btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks · 9631e4cc
      Josef Bacik 提交于
      When we COW a block we are holding a lock on the original block, and
      then we lock the new COW block.  Because our lockdep maps are based on
      root + level, this will make lockdep complain.  We need a way to
      indicate a subclass for locking the COW'ed block, so plumb through our
      btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer,
      and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks.
      
      The reason I've added all this extra infrastructure is because there
      will be need of different nesting classes in follow up patches.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9631e4cc
    • F
      btrfs: make fast fsyncs wait only for writeback · 48778179
      Filipe Manana 提交于
      Currently regardless of a full or a fast fsync we always wait for ordered
      extents to complete, and then start logging the inode after that. However
      for fast fsyncs we can just wait for the writeback to complete, we don't
      need to wait for the ordered extents to complete since we use the list of
      modified extents maps to figure out which extents we must log and we can
      get their checksums directly from the ordered extents that are still in
      flight, otherwise look them up from the checksums tree.
      
      Until commit b5e6c3e1 ("btrfs: always wait on ordered extents at
      fsync time"), for fast fsyncs, we used to start logging without even
      waiting for the writeback to complete first, we would wait for it to
      complete after logging, while holding a transaction open, which lead to
      performance issues when using cgroups and probably for other cases too,
      as wait for IO while holding a transaction handle should be avoided as
      much as possible. After that, for fast fsyncs, we started to wait for
      ordered extents to complete before starting to log, which adds some
      latency to fsyncs and we even got at least one report about a performance
      drop which bisected to that particular change:
      
      https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/
      
      This change makes fast fsyncs only wait for writeback to finish before
      starting to log the inode, instead of waiting for both the writeback to
      finish and for the ordered extents to complete. This brings back part of
      the logic we had that extracts checksums from in flight ordered extents,
      which are not yet in the checksums tree, and making sure transaction
      commits wait for the completion of ordered extents previously logged
      (by far most of the time they have already completed by the time a
      transaction commit starts, resulting in no wait at all), to avoid any
      data loss if an ordered extent completes after the transaction used to
      log an inode is committed, followed by a power failure.
      
      When there are no other tasks accessing the checksums and the subvolume
      btrees, the ordered extent completion is pretty fast, typically taking
      100 to 200 microseconds only in my observations. However when there are
      other tasks accessing these btrees, ordered extent completion can take a
      lot more time due to lock contention on nodes and leaves of these btrees.
      I've seen cases over 2 milliseconds, which starts to be significant. In
      particular when we do have concurrent fsyncs against different files there
      is a lot of contention on the checksums btree, since we have many tasks
      writing the checksums into the btree and other tasks that already started
      the logging phase are doing lookups for checksums in the btree.
      
      This change also turns all ranged fsyncs into full ranged fsyncs, which
      is something we already did when not using the NO_HOLES features or when
      doing a full fsync. This is to guarantee we never miss checksums due to
      writeback having been triggered only for a part of an extent, and we end
      up logging the full extent but only checksums for the written range, which
      results in missing checksums after log replay. Allowing ranged fsyncs to
      operate again only in the original range, when using the NO_HOLES feature
      and doing a fast fsync is doable but requires some non trivial changes to
      the writeback path, which can always be worked on later if needed, but I
      don't think they are a very common use case.
      
      Several tests were performed using fio for different numbers of concurrent
      jobs, each writing and fsyncing its own file, for both sequential and
      random file writes. The tests were run on bare metal, no virtualization,
      on a box with 12 cores (Intel i7-8700), 64Gb of RAM and a NVMe device,
      with a kernel configuration that is the default of typical distributions
      (debian in this case), without debug options enabled (kasan, kmemleak,
      slub debug, debug of page allocations, lock debugging, etc).
      
      The following script that calls fio was used:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 5 ]; then
          echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]"
          exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
        BLOCK_SIZE=$4
        WRITE_MODE=$5
      
        if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then
          echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'"
          exit 1
        fi
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=$WRITE_MODE
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=$BLOCK_SIZE
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        echo
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The results were the following:
      
      *************************
      *** sequential writes ***
      *************************
      
      ==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec
      
      After patch:
      
      WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec
      (+9.8%, -8.8% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec
      
      After patch:
      
      WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec
      (+21.5% throughput, -17.8% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec
      
      After patch:
      
      WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec
      (+28.7% throughput, -22.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec
      
      After patch:
      
      WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec
      (+35.6% throughput, -25.2% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec
      
      After patch:
      
      WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec
      (+34.1% throughput, -25.6% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec
      
      After patch:
      
      WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec
      (+19.1% throughput, -16.4% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec
      
      After patch:
      
      WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec
      (+23.1% throughput, -18.7% runtime)
      
      ************************
      ***   random writes  ***
      ************************
      
      ==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec
      
      After patch:
      
      WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec
      (+0.9% throughput, -1.7% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec
      
      After patch:
      
      WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec
      (+2.3% throughput, -2.0% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec
      
      After patch:
      
      WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec
      (+15.6% throughput, -13.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec
      
      After patch:
      
      WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec
      (+11.6% throughput, -10.7% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec
      
      After patch:
      
      WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec
      (+3.8% throughput, -3.8% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec
      
      After patch:
      
      WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec
      (+12.7% throughput, -11.2% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec
      
      After patch:
      
      WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec
      (+6.3% throughput, -6.0% runtime)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      48778179
  6. 08 9月, 2020 1 次提交
    • F
      btrfs: fix NULL pointer dereference after failure to create snapshot · 2d892ccd
      Filipe Manana 提交于
      When trying to get a new fs root for a snapshot during the transaction
      at transaction.c:create_pending_snapshot(), if btrfs_get_new_fs_root()
      fails we leave "pending->snap" pointing to an error pointer, and then
      later at ioctl.c:create_snapshot() we dereference that pointer, resulting
      in a crash:
      
        [12264.614689] BUG: kernel NULL pointer dereference, address: 00000000000007c4
        [12264.615650] #PF: supervisor write access in kernel mode
        [12264.616487] #PF: error_code(0x0002) - not-present page
        [12264.617436] PGD 0 P4D 0
        [12264.618328] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        [12264.619150] CPU: 0 PID: 2310635 Comm: fsstress Tainted: G        W         5.9.0-rc3-btrfs-next-67 #1
        [12264.619960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        [12264.621769] RIP: 0010:btrfs_mksubvol+0x438/0x4a0 [btrfs]
        [12264.622528] Code: bc ef ff ff (...)
        [12264.624092] RSP: 0018:ffffaa6fc7277cd8 EFLAGS: 00010282
        [12264.624669] RAX: 00000000fffffff4 RBX: ffff9d3e8f151a60 RCX: 0000000000000000
        [12264.625249] RDX: 0000000000000001 RSI: ffffffff9d56c9be RDI: fffffffffffffff4
        [12264.625830] RBP: ffff9d3e8f151b48 R08: 0000000000000000 R09: 0000000000000000
        [12264.626413] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffff4
        [12264.626994] R13: ffff9d3ede380538 R14: ffff9d3ede380500 R15: ffff9d3f61b2eeb8
        [12264.627582] FS:  00007f140d5d8200(0000) GS:ffff9d3fb5e00000(0000) knlGS:0000000000000000
        [12264.628176] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [12264.628773] CR2: 00000000000007c4 CR3: 000000020f8e8004 CR4: 00000000003706f0
        [12264.629379] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [12264.629994] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [12264.630594] Call Trace:
        [12264.631227]  btrfs_mksnapshot+0x7b/0xb0 [btrfs]
        [12264.631840]  __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
        [12264.632458]  btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
        [12264.633078]  btrfs_ioctl+0x1864/0x3130 [btrfs]
        [12264.633689]  ? do_sys_openat2+0x1a7/0x2d0
        [12264.634295]  ? kmem_cache_free+0x147/0x3a0
        [12264.634899]  ? __x64_sys_ioctl+0x83/0xb0
        [12264.635488]  __x64_sys_ioctl+0x83/0xb0
        [12264.636058]  do_syscall_64+0x33/0x80
        [12264.636616]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        (gdb) list *(btrfs_mksubvol+0x438)
        0x7c7b8 is in btrfs_mksubvol (fs/btrfs/ioctl.c:858).
        853		ret = 0;
        854		pending_snapshot->anon_dev = 0;
        855	fail:
        856		/* Prevent double freeing of anon_dev */
        857		if (ret && pending_snapshot->snap)
        858			pending_snapshot->snap->anon_dev = 0;
        859		btrfs_put_root(pending_snapshot->snap);
        860		btrfs_subvolume_release_metadata(root, &pending_snapshot->block_rsv);
        861	free_pending:
        862		if (pending_snapshot->anon_dev)
      
      So fix this by setting "pending->snap" to NULL if we get an error from the
      call to btrfs_get_new_fs_root() at transaction.c:create_pending_snapshot().
      
      Fixes: 2dfb1e43 ("btrfs: preallocate anon block device at first phase of snapshot creation")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2d892ccd
  7. 27 7月, 2020 3 次提交
    • J
      btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases · fbabd4a3
      Josef Bacik 提交于
      Eric reported seeing this message while running generic/475
      
        BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      Full stack trace:
      
        BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
        BTRFS info (device dm-0): forced readonly
        BTRFS warning (device dm-0): Skipping commit of aborted transaction.
        ------------[ cut here ]------------
        BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
        BTRFS: Transaction aborted (error -117)
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
        WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
        CPU: 3 PID: 23985 Comm: fsstress Tainted: G        W    L    5.8.0-rc4-default+ #1181
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
        RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
        RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
        RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
        R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
        FS:  00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
        Call Trace:
         ? finish_wait+0x90/0x90
         ? __mutex_unlock_slowpath+0x45/0x2a0
         ? lock_acquire+0xa3/0x440
         ? lockref_put_or_lock+0x9/0x30
         ? dput+0x20/0x4a0
         ? dput+0x20/0x4a0
         ? do_raw_spin_unlock+0x4b/0xc0
         ? _raw_spin_unlock+0x1f/0x30
         btrfs_sync_file+0x335/0x490 [btrfs]
         do_fsync+0x38/0x70
         __x64_sys_fsync+0x10/0x20
         do_syscall_64+0x50/0xe0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f6a6ef1b6e3
        Code: Bad RIP value.
        RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
        RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
        RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
        RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
        R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
        R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last  enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace af146e0e38433456 ]---
        BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      This ret came from btrfs_write_marked_extents().  If we get an aborted
      transaction via EIO before, we'll see it in btree_write_cache_pages()
      and return EUCLEAN, which gets printed as "Filesystem corrupted".
      
      Except we shouldn't be returning EUCLEAN here, we need to be returning
      EROFS because EUCLEAN is reserved for actual corruption, not IO errors.
      
      We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
      elsewhere, but we want to use EROFS for this particular case.  The
      original transaction abort has the real error code for why we ended up
      with an aborted transaction, all subsequent actions just need to return
      EROFS because they may not have a trans handle and have no idea about
      the original cause of the abort.
      
      After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
      stacktrace will not be dumped either.
      Reported-by: NEric Sandeen <esandeen@redhat.com>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add full test stacktrace ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fbabd4a3
    • Q
      btrfs: qgroup: remove ASYNC_COMMIT mechanism in favor of reserve retry-after-EDQUOT · adca4d94
      Qu Wenruo 提交于
      commit a514d638 ("btrfs: qgroup: Commit transaction in advance to
      reduce early EDQUOT") tries to reduce the early EDQUOT problems by
      checking the qgroup free against threshold and tries to wake up commit
      kthread to free some space.
      
      The problem of that mechanism is, it can only free qgroup per-trans
      metadata space, can't do anything to data, nor prealloc qgroup space.
      
      Now since we have the ability to flush qgroup space, and implemented
      retry-after-EDQUOT behavior, such mechanism can be completely replaced.
      
      So this patch will cleanup such mechanism in favor of
      retry-after-EDQUOT.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      adca4d94
    • Q
      btrfs: preallocate anon block device at first phase of snapshot creation · 2dfb1e43
      Qu Wenruo 提交于
      [BUG]
      When the anonymous block device pool is exhausted, subvolume/snapshot
      creation fails with EMFILE (Too many files open). This has been reported
      by a user. The allocation happens in the second phase during transaction
      commit where it's only way out is to abort the transaction
      
        BTRFS: Transaction aborted (error -24)
        WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
        RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
        Call Trace:
         create_pending_snapshots+0x82/0xa0 [btrfs]
         btrfs_commit_transaction+0x275/0x8c0 [btrfs]
         btrfs_mksubvol+0x4b9/0x500 [btrfs]
         btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
         btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
         btrfs_ioctl+0x11a4/0x2da0 [btrfs]
         do_vfs_ioctl+0xa9/0x640
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x1a/0x20
         do_syscall_64+0x5a/0x110
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace 33f2f83f3d5250e9 ]---
        BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
        BTRFS info (device sda1): forced readonly
        BTRFS warning (device sda1): Skipping commit of aborted transaction.
        BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown
      
      [CAUSE]
      When the global anonymous block device pool is exhausted, the following
      call chain will fail, and lead to transaction abort:
      
       btrfs_ioctl_snap_create_v2()
       |- btrfs_ioctl_snap_create_transid()
          |- btrfs_mksubvol()
             |- btrfs_commit_transaction()
                |- create_pending_snapshot()
                   |- btrfs_get_fs_root()
                      |- btrfs_init_fs_root()
                         |- get_anon_bdev()
      
      [FIX]
      Although we can't enlarge the anonymous block device pool, at least we
      can preallocate anon_dev for subvolume/snapshot in the first phase,
      outside of transaction context and exactly at the moment the user calls
      the creation ioctl.
      Reported-by: NGreed Rong <greedrong@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2dfb1e43
  8. 25 5月, 2020 5 次提交
    • D
      btrfs: simplify root lookup by id · 56e9357a
      David Sterba 提交于
      The main function to lookup a root by its id btrfs_get_fs_root takes the
      whole key, while only using the objectid. The value of offset is preset
      to (u64)-1 but not actually used until btrfs_find_root that does the
      actual search.
      
      Switch btrfs_get_fs_root to use only objectid and remove all local
      variables that existed just for the lookup. The actual key for search is
      set up in btrfs_get_fs_root, reusing another key variable.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      56e9357a
    • Q
      btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE · 92a7cc42
      Qu Wenruo 提交于
      The name BTRFS_ROOT_REF_COWS is not very clear about the meaning.
      
      In fact, that bit can only be set to those trees:
      
      - Subvolume roots
      - Data reloc root
      - Reloc roots for above roots
      
      All other trees won't get this bit set.  So just by the result, it is
      obvious that, roots with this bit set can have tree blocks shared with
      other trees.  Either shared by snapshots, or by reloc roots (an special
      snapshot created by relocation).
      
      This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
      make it easier to understand, and update all comment mentioning
      "reference counted" to follow the rename.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      92a7cc42
    • F
      btrfs: rename member 'trimming' of block group to a more generic name · 6b7304af
      Filipe Manana 提交于
      Back in 2014, commit 04216820 ("Btrfs: fix race between fs trimming
      and block group remove/allocation"), I added the 'trimming' member to the
      block group structure. Its purpose was to prevent races between trimming
      and block group deletion/allocation by pinning the block group in a way
      that prevents its logical address and device extents from being reused
      while trimming is in progress for a block group, so that if another task
      deletes the block group and then another task allocates a new block group
      that gets the same logical address and device extents while the trimming
      task is still in progress.
      
      After the previous fix for scrub (patch "btrfs: fix a race between scrub
      and block group removal/allocation"), scrub now also has the same needs that
      trimming has, so the member name 'trimming' no longer makes sense.
      Since there is already a 'pinned' member in the block group that refers
      to space reservations (pinned bytes), rename the member to 'frozen',
      add a comment on top of it to describe its general purpose and rename
      the helpers to increment and decrement the counter as well, to match
      the new member name.
      
      The next patch in the series will move the helpers into a more suitable
      file (from free-space-cache.c to block-group.c).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6b7304af
    • J
      btrfs: force chunk allocation if our global rsv is larger than metadata · 9c343784
      Josef Bacik 提交于
      Nikolay noticed a bunch of test failures with my global rsv steal
      patches.  At first he thought they were introduced by them, but they've
      been failing for a while with 64k nodes.
      
      The problem is with 64k nodes we have a global reserve that calculates
      out to 13MiB on a freshly made file system, which only has 8MiB of
      metadata space.  Because of changes I previously made we no longer
      account for the global reserve in the overcommit logic, which means we
      correctly allow overcommit to happen even though we are already
      overcommitted.
      
      However in some corner cases, for example btrfs/170, we will allocate
      the entire file system up with data chunks before we have enough space
      pressure to allocate a metadata chunk.  Then once the fs is full we
      ENOSPC out because we cannot overcommit and the global reserve is taking
      up all of the available space.
      
      The most ideal way to deal with this is to change our space reservation
      stuff to take into account the height of the tree's that we're
      modifying, so that our global reserve calculation does not end up so
      obscenely large.
      
      However that is a huge undertaking.  Instead fix this by forcing a chunk
      allocation if the global reserve is larger than the total metadata
      space.  This gives us essentially the same behavior that happened
      before, we get a chunk allocated and these tests can pass.
      
      This is meant to be a stop-gap measure until we can tackle the "tree
      height only" project.
      
      Fixes: 0096420a ("btrfs: do not account global reserve in can_overcommit")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9c343784
    • J
      btrfs: improve global reserve stealing logic · 7f9fe614
      Josef Bacik 提交于
      For unlink transactions and block group removal
      btrfs_start_transaction_fallback_global_rsv will first try to start an
      ordinary transaction and if it fails it will fall back to reserving the
      required amount by stealing from the global reserve. This is problematic
      because of all the same reasons we had with previous iterations of the
      ENOSPC handling, thundering herd.  We get a bunch of failures all at
      once, everybody tries to allocate from the global reserve, some win and
      some lose, we get an ENSOPC.
      
      Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's
      used to mark unlink reservation. To fix this we need to integrate this
      logic into the normal ENOSPC infrastructure.  We still go through all of
      the normal flushing work, and at the moment we begin to fail all the
      tickets we try to satisfy any tickets that are allowed to steal by
      stealing from the global reserve.  If this works we start the flushing
      system over again just like we would with a normal ticket satisfaction.
      This serializes our global reserve stealing, so we don't have the
      thundering herd problem.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7f9fe614
  9. 27 4月, 2020 1 次提交
    • Q
      btrfs: transaction: Avoid deadlock due to bad initialization timing of fs_info::journal_info · fcc99734
      Qu Wenruo 提交于
      [BUG]
      One run of btrfs/063 triggered the following lockdep warning:
        ============================================
        WARNING: possible recursive locking detected
        5.6.0-rc7-custom+ #48 Not tainted
        --------------------------------------------
        kworker/u24:0/7 is trying to acquire lock:
        ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
      
        but task is already holding lock:
        ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(sb_internal#2);
          lock(sb_internal#2);
      
         *** DEADLOCK ***
      
         May be due to missing lock nesting notation
      
        4 locks held by kworker/u24:0/7:
         #0: ffff88817b495948 ((wq_completion)btrfs-endio-write){+.+.}, at: process_one_work+0x557/0xb80
         #1: ffff888189ea7db8 ((work_completion)(&work->normal_work)){+.+.}, at: process_one_work+0x557/0xb80
         #2: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
         #3: ffff888174ca4da8 (&fs_info->reloc_mutex){+.+.}, at: btrfs_record_root_in_trans+0x83/0xd0 [btrfs]
      
        stack backtrace:
        CPU: 0 PID: 7 Comm: kworker/u24:0 Not tainted 5.6.0-rc7-custom+ #48
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
        Call Trace:
         dump_stack+0xc2/0x11a
         __lock_acquire.cold+0xce/0x214
         lock_acquire+0xe6/0x210
         __sb_start_write+0x14e/0x290
         start_transaction+0x66c/0x890 [btrfs]
         btrfs_join_transaction+0x1d/0x20 [btrfs]
         find_free_extent+0x1504/0x1a50 [btrfs]
         btrfs_reserve_extent+0xd5/0x1f0 [btrfs]
         btrfs_alloc_tree_block+0x1ac/0x570 [btrfs]
         btrfs_copy_root+0x213/0x580 [btrfs]
         create_reloc_root+0x3bd/0x470 [btrfs]
         btrfs_init_reloc_root+0x2d2/0x310 [btrfs]
         record_root_in_trans+0x191/0x1d0 [btrfs]
         btrfs_record_root_in_trans+0x90/0xd0 [btrfs]
         start_transaction+0x16e/0x890 [btrfs]
         btrfs_join_transaction+0x1d/0x20 [btrfs]
         btrfs_finish_ordered_io+0x55d/0xcd0 [btrfs]
         finish_ordered_fn+0x15/0x20 [btrfs]
         btrfs_work_helper+0x116/0x9a0 [btrfs]
         process_one_work+0x632/0xb80
         worker_thread+0x80/0x690
         kthread+0x1a3/0x1f0
         ret_from_fork+0x27/0x50
      
      It's pretty hard to reproduce, only one hit so far.
      
      [CAUSE]
      This is because we're calling btrfs_join_transaction() without re-using
      the current running one:
      
      btrfs_finish_ordered_io()
      |- btrfs_join_transaction()		<<< Call #1
         |- btrfs_record_root_in_trans()
            |- btrfs_reserve_extent()
      	 |- btrfs_join_transaction()	<<< Call #2
      
      Normally such btrfs_join_transaction() call should re-use the existing
      one, without trying to re-start a transaction.
      
      But the problem is, in btrfs_join_transaction() call #1, we call
      btrfs_record_root_in_trans() before initializing current::journal_info.
      
      And in btrfs_join_transaction() call #2, we're relying on
      current::journal_info to avoid such deadlock.
      
      [FIX]
      Call btrfs_record_root_in_trans() after we have initialized
      current::journal_info.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fcc99734
  10. 24 3月, 2020 11 次提交
  11. 19 2月, 2020 1 次提交