1. 12 Jan, 2023 4 commits
    • F
      btrfs: do not abort transaction on failure to write log tree when syncing log · 16199ad9
      Committed by Filipe Manana
      When syncing the log, if we fail to write log tree extent buffers, we mark
      the log for a full commit and abort the transaction. However we don't need
      to abort the transaction, all we really need to do is to make sure no one
      can commit a superblock pointing to new log tree roots. Just because we
      got a failure writing extent buffers for a log tree, it does not mean we
      will also fail to do a transaction commit.
      
      One particular case is if due to a bug somewhere, when writing log tree
      extent buffers, the tree checker detects some corruption and the writeout
      fails because of that. Aborting the transaction can be very disruptive for
      a user, especially if the issue happened on a root filesystem. One example
      is the scenario in the Link tag below, where an isolated corruption on log
      tree leaves was causing transaction aborts when syncing the log.
      
      Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      16199ad9
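The behaviour described in this commit message can be modelled in a few lines. This is a userspace sketch only, not kernel code: the `Transaction` class, `sync_log()`, `fsync()` and the returned strings are hypothetical names chosen for illustration. It assumes that marking the log for a full commit is enough to make later fsync calls fall back to a transaction commit.

```python
class Transaction:
    """Toy stand-in for a running transaction (hypothetical model)."""

    def __init__(self):
        self.aborted = False
        # Models the state set by btrfs_set_log_full_commit().
        self.log_full_commit = False


def sync_log(trans, write_log_buffers):
    """Try to sync the log; never abort the transaction on write failure."""
    if trans.log_full_commit:
        return "need_full_commit"
    try:
        write_log_buffers()
    except OSError:
        # Only make sure nobody can use the new log tree roots.
        trans.log_full_commit = True
        return "need_full_commit"
    return "log_synced"


def fsync(trans, write_log_buffers):
    """Fall back to a transaction commit when the log cannot be used."""
    if sync_log(trans, write_log_buffers) == "need_full_commit":
        return "transaction_commit"
    return "log_sync"
```

In this model a failed log write leaves `aborted` untouched, and every later `fsync()` in the same transaction takes the slower but safe transaction-commit path.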
    • F
      btrfs: add missing setup of log for full commit at add_conflicting_inode() · 94cd63ae
      Committed by Filipe Manana
      When logging conflicting inodes, if we reach the maximum limit of inodes,
      we return BTRFS_LOG_FORCE_COMMIT to force a transaction commit. However
      we don't mark the log for full commit (with btrfs_set_log_full_commit()),
      which means that once we leave the log transaction and before we commit
      the transaction, some other task may sync the log, which is incomplete
      as we have not logged all conflicting inodes, leading to some inconsistency
      in case that log ends up being replayed.
      
      So also call btrfs_set_log_full_commit() at add_conflicting_inode().
      
      Fixes: e09d94c9 ("btrfs: log conflicting inodes without holding log mutex of the initial inode")
      CC: stable@vger.kernel.org # 6.1
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      94cd63ae
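A minimal sketch of the fix described above, written as a userspace model rather than kernel code. The limit value, the `LogCtx` class and `other_task_may_sync_log()` are assumptions made for illustration; only the idea (set the full-commit mark before returning the force-commit value) comes from the commit message.

```python
MAX_CONFLICT_INODES = 10          # hypothetical limit for this model
LOG_FORCE_COMMIT = "force_commit"  # stands in for BTRFS_LOG_FORCE_COMMIT


class LogCtx:
    """Toy logging context (hypothetical)."""

    def __init__(self):
        self.conflict_inodes = []
        # Models the state set by btrfs_set_log_full_commit().
        self.log_full_commit = False


def add_conflicting_inode(ctx, ino):
    """Queue a conflicting inode, or force a commit once the limit is hit."""
    if len(ctx.conflict_inodes) >= MAX_CONFLICT_INODES:
        # The fix: mark the log before returning, so that no other task
        # can sync an incomplete log in the meantime.
        ctx.log_full_commit = True
        return LOG_FORCE_COMMIT
    ctx.conflict_inodes.append(ino)
    return 0


def other_task_may_sync_log(ctx):
    """A concurrent task may only sync a log not marked for full commit."""
    return not ctx.log_full_commit
```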
    • F
      btrfs: fix directory logging due to race with concurrent index key deletion · 8bb6898d
      Committed by Filipe Manana
      Sometimes we log a directory without holding its VFS lock, so while we
      are logging it, dir index entries may be added or removed. This typically
      happens when logging a dentry from a parent directory that points to a
      new directory, through log_new_dir_dentries(), or when, while logging
      some other inode, we also need to log its parent directories (through
      btrfs_log_all_parents()).
      
      This means that while we are at log_dir_items(), we may not find a dir
      index key we found before, because it was deleted in the meanwhile, so
      a call to btrfs_search_slot() may return 1 (key not found). In that case
      we return from log_dir_items() with a success value (the variable 'err'
      has a value of 0). This can lead to a few problems, especially in the case
      where the variable 'last_offset' has a value of (u64)-1 (and it's
      initialized to that when it was declared):
      
      1) By returning from log_dir_items() with success (0) and a value of
         (u64)-1 for '*last_offset_ret', we end up not logging any other dir
         index keys that follow the missing, just deleted, index key. The
         (u64)-1 value makes log_directory_changes() not call log_dir_items()
         again;
      
      2) Before returning with success (0), log_dir_items() will log a dir
         index range item covering a range from the last old dentry index
         (stored in the variable 'last_old_dentry_offset') to the value of
         'last_offset'. If 'last_offset' has a value of (u64)-1, then it means
         if the log is persisted and replayed after a power failure, it will
         cause deletion of all the directory entries that have an index number
         between last_old_dentry_offset + 1 and (u64)-1;
      
      3) We can end up returning from log_dir_items() with
         ctx->last_dir_item_offset having a lower value than
         inode->last_dir_index_offset, because the former is set to the current
         key we are processing at process_dir_items_leaf(), and at the end of
         log_directory_changes() we set inode->last_dir_index_offset to the
         current value of ctx->last_dir_item_offset. So if for example a
         deletion of a lower dir index key happened, we set
         ctx->last_dir_item_offset to that index value, then if we return from
         log_dir_items() because btrfs_search_slot() returned 1, we end up
         returning from log_dir_items() with success (0) and then
         log_directory_changes() sets inode->last_dir_index_offset to a lower
         value than it had before.
         This can result in unpredictable and unexpected behaviour when we
         need to log again the directory in the same transaction, and can result
         in ending up with a log tree leaf that has duplicated keys, as we do
         batch insertions of dir index keys into a log tree.
      
      So fix this by making log_dir_items() move on to the next dir index key
      if it does not find the one it was looking for.

      Reported-by: David Arendt <admin@prnet.org>
      Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8bb6898d
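The "move on to the next dir index key" idea can be illustrated with a sorted list standing in for the btree. This is a simplified model, not the kernel implementation: `search_slot()` and `log_dir_indexes()` are hypothetical helpers, and the only behaviour borrowed from the text is that a search returns 1 when the key is not found. The sketch assumes the returned slot is the position where the missing key would have been inserted.

```python
import bisect


def search_slot(keys, wanted):
    """Return (ret, slot): ret is 0 if found, 1 if not found.

    When not found, slot is the insertion position, which is also the
    position of the next larger key (if any).
    """
    slot = bisect.bisect_left(keys, wanted)
    if slot < len(keys) and keys[slot] == wanted:
        return 0, slot
    return 1, slot


def log_dir_indexes(keys, start):
    """Return the index keys to log, starting at `start`.

    If `start` was deleted concurrently (ret == 1), keep going from the
    next key instead of stopping with a success value.
    """
    _ret, slot = search_slot(keys, start)
    return keys[slot:]


# Index 4 was deleted after we first saw it; logging continues at 5.
print(log_dir_indexes([2, 3, 5, 8], 4))  # prints [5, 8]
```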
    • F
      btrfs: fix missing error handling when logging directory items · 6d3d970b
      Committed by Filipe Manana
      When logging a directory, at log_dir_items(), if we get an error when
      attempting to search the subvolume tree for a dir index item, we end up
      returning 0 (success) from log_dir_items() because 'err' is left with a
      value of 0.
      
      This can lead to a few problems, especially in the case the variable
      'last_offset' has a value of (u64)-1 (and it's initialized to that when
      it was declared):
      
      1) By returning from log_dir_items() with success (0) and a value of
         (u64)-1 for '*last_offset_ret', we end up not logging any other dir
         index keys that follow the missing, just deleted, index key. The
         (u64)-1 value makes log_directory_changes() not call log_dir_items()
         again;
      
      2) Before returning with success (0), log_dir_items() will log a dir
         index range item covering a range from the last old dentry index
         (stored in the variable 'last_old_dentry_offset') to the value of
         'last_offset'. If 'last_offset' has a value of (u64)-1, then it means
         if the log is persisted and replayed after a power failure, it will
         cause deletion of all the directory entries that have an index number
         between last_old_dentry_offset + 1 and (u64)-1;
      
      3) We can end up returning from log_dir_items() with
         ctx->last_dir_item_offset having a lower value than
         inode->last_dir_index_offset, because the former is set to the current
         key we are processing at process_dir_items_leaf(), and at the end of
         log_directory_changes() we set inode->last_dir_index_offset to the
         current value of ctx->last_dir_item_offset. So if for example a
         deletion of a lower dir index key happened, we set
         ctx->last_dir_item_offset to that index value, then if we return from
         log_dir_items() because btrfs_search_slot() returned an error, we end up
         returning without any error from log_dir_items() and then
         log_directory_changes() sets inode->last_dir_index_offset to a lower
         value than it had before.
         This can result in unpredictable and unexpected behaviour when we
         need to log again the directory in the same transaction, and can result
         in ending up with a log tree leaf that has duplicated keys, as we do
         batch insertions of dir index keys into a log tree.
      
      Fix this by setting 'err' to the value of 'ret' in case
      btrfs_search_slot() or btrfs_previous_item() returned an error. That will
      result in falling back to a full transaction commit.

      Reported-by: David Arendt <admin@prnet.org>
      Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/
      Fixes: e02119d5 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d3d970b
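Problem 2) in the commit message above is easier to see with concrete numbers. The sketch below is a deliberately simplified model of replaying one dir index range item, assuming that inside the logged range any directory entry missing from the log is treated as deleted. `replay_dir_range()` and its parameters are hypothetical names, not kernel functions.

```python
U64_MAX = 2**64 - 1  # the (u64)-1 value mentioned in the commit message


def replay_dir_range(subvol_indexes, logged_indexes, first, last):
    """Return the dir indexes that survive replay of one range item.

    Inside [first, last] only indexes present in the log survive;
    indexes outside the range are left untouched.
    """
    return sorted(
        i for i in subvol_indexes
        if not (first <= i <= last) or i in logged_indexes
    )


subvol = [2, 3, 4, 5, 6]
logged = {2, 3}

# Range wrongly extended to (u64)-1: entries 4, 5 and 6 are all lost.
print(replay_dir_range(subvol, logged, 4, U64_MAX))  # prints [2, 3]

# Range limited to what was actually processed: only index 4 is dropped.
print(replay_dir_range(subvol, logged, 4, 4))        # prints [2, 3, 5, 6]
```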
  2. 21 Dec, 2022 1 commit
  3. 06 Dec, 2022 20 commits
  4. 23 Nov, 2022 1 commit
    • F
      btrfs: do not modify log tree while holding a leaf from fs tree locked · 796787c9
      Committed by Filipe Manana
      When logging an inode in full mode, or when logging xattrs or when logging
      the dir index items of a directory, we are modifying the log tree while
      holding a read lock on a leaf from the fs/subvolume tree. This can lead to
      a deadlock in rare circumstances, but it is a real possibility, and it was
      recently reported by syzbot with the following trace from lockdep:
      
         WARNING: possible circular locking dependency detected
         6.1.0-rc5-next-20221116-syzkaller #0 Not tainted
         ------------------------------------------------------
         syz-executor.1/16154 is trying to acquire lock:
         ffff88807e3084a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
      
         but task is already holding lock:
         ffff88807df33078 (btrfs-log-00){++++}-{3:3}, at: __btrfs_tree_lock+0x32/0x3d0 fs/btrfs/locking.c:197
      
         which lock already depends on the new lock.
      
         the existing dependency chain (in reverse order) is:
      
         -> #2 (btrfs-log-00){++++}-{3:3}:
                down_read_nested+0x9e/0x450 kernel/locking/rwsem.c:1634
                __btrfs_tree_read_lock+0x32/0x350 fs/btrfs/locking.c:135
                btrfs_tree_read_lock fs/btrfs/locking.c:141 [inline]
                btrfs_read_lock_root_node+0x82/0x3a0 fs/btrfs/locking.c:280
                btrfs_search_slot_get_root fs/btrfs/ctree.c:1678 [inline]
                btrfs_search_slot+0x3ca/0x2c70 fs/btrfs/ctree.c:1998
                btrfs_lookup_csum+0x116/0x3f0 fs/btrfs/file-item.c:209
                btrfs_csum_file_blocks+0x40e/0x1370 fs/btrfs/file-item.c:1021
                log_csums.isra.0+0x244/0x2d0 fs/btrfs/tree-log.c:4258
                copy_items.isra.0+0xbfb/0xed0 fs/btrfs/tree-log.c:4403
                copy_inode_items_to_log+0x13d6/0x1d90 fs/btrfs/tree-log.c:5873
                btrfs_log_inode+0xb19/0x4680 fs/btrfs/tree-log.c:6495
                btrfs_log_inode_parent+0x890/0x2a20 fs/btrfs/tree-log.c:6982
                btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7083
                btrfs_sync_file+0xa41/0x13c0 fs/btrfs/file.c:1921
                vfs_fsync_range+0x13e/0x230 fs/sync.c:188
                generic_write_sync include/linux/fs.h:2856 [inline]
                iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
                btrfs_direct_write fs/btrfs/file.c:1536 [inline]
                btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
                call_write_iter include/linux/fs.h:2160 [inline]
                do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
                do_iter_write+0x182/0x700 fs/read_write.c:861
                vfs_iter_write+0x74/0xa0 fs/read_write.c:902
                iter_file_splice_write+0x745/0xc90 fs/splice.c:686
                do_splice_from fs/splice.c:764 [inline]
                direct_splice_actor+0x114/0x180 fs/splice.c:931
                splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
                do_splice_direct+0x1ab/0x280 fs/splice.c:974
                do_sendfile+0xb19/0x1270 fs/read_write.c:1255
                __do_sys_sendfile64 fs/read_write.c:1323 [inline]
                __se_sys_sendfile64 fs/read_write.c:1309 [inline]
                __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
                do_syscall_x64 arch/x86/entry/common.c:50 [inline]
                do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
                entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
         -> #1 (btrfs-tree-00){++++}-{3:3}:
                __lock_release kernel/locking/lockdep.c:5382 [inline]
                lock_release+0x371/0x810 kernel/locking/lockdep.c:5688
                up_write+0x2a/0x520 kernel/locking/rwsem.c:1614
                btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline]
                btrfs_unlock_up_safe+0x1e3/0x290 fs/btrfs/locking.c:238
                search_leaf fs/btrfs/ctree.c:1832 [inline]
                btrfs_search_slot+0x265e/0x2c70 fs/btrfs/ctree.c:2074
                btrfs_insert_empty_items+0xbd/0x1c0 fs/btrfs/ctree.c:4133
                btrfs_insert_delayed_item+0x826/0xfa0 fs/btrfs/delayed-inode.c:746
                btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline]
                __btrfs_commit_inode_delayed_items fs/btrfs/delayed-inode.c:1111 [inline]
                __btrfs_run_delayed_items+0x280/0x590 fs/btrfs/delayed-inode.c:1153
                flush_space+0x147/0xe90 fs/btrfs/space-info.c:728
                btrfs_async_reclaim_metadata_space+0x541/0xc10 fs/btrfs/space-info.c:1086
                process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
                worker_thread+0x669/0x1090 kernel/workqueue.c:2436
                kthread+0x2e8/0x3a0 kernel/kthread.c:376
                ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
      
         -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
                check_prev_add kernel/locking/lockdep.c:3097 [inline]
                check_prevs_add kernel/locking/lockdep.c:3216 [inline]
                validate_chain kernel/locking/lockdep.c:3831 [inline]
                __lock_acquire+0x2a43/0x56d0 kernel/locking/lockdep.c:5055
                lock_acquire kernel/locking/lockdep.c:5668 [inline]
                lock_acquire+0x1e3/0x630 kernel/locking/lockdep.c:5633
                __mutex_lock_common kernel/locking/mutex.c:603 [inline]
                __mutex_lock+0x12f/0x1360 kernel/locking/mutex.c:747
                __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
                __btrfs_release_delayed_node fs/btrfs/delayed-inode.c:251 [inline]
                btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline]
                btrfs_remove_delayed_node+0x52/0x60 fs/btrfs/delayed-inode.c:1285
                btrfs_evict_inode+0x511/0xf30 fs/btrfs/inode.c:5554
                evict+0x2ed/0x6b0 fs/inode.c:664
                dispose_list+0x117/0x1e0 fs/inode.c:697
                prune_icache_sb+0xeb/0x150 fs/inode.c:896
                super_cache_scan+0x391/0x590 fs/super.c:106
                do_shrink_slab+0x464/0xce0 mm/vmscan.c:843
                shrink_slab_memcg mm/vmscan.c:912 [inline]
                shrink_slab+0x388/0x660 mm/vmscan.c:991
                shrink_node_memcgs mm/vmscan.c:6088 [inline]
                shrink_node+0x93d/0x1f30 mm/vmscan.c:6117
                shrink_zones mm/vmscan.c:6355 [inline]
                do_try_to_free_pages+0x3b4/0x17a0 mm/vmscan.c:6417
                try_to_free_mem_cgroup_pages+0x3a4/0xa70 mm/vmscan.c:6732
                reclaim_high.constprop.0+0x182/0x230 mm/memcontrol.c:2393
                mem_cgroup_handle_over_high+0x190/0x520 mm/memcontrol.c:2578
                try_charge_memcg+0xe0c/0x12f0 mm/memcontrol.c:2816
                try_charge mm/memcontrol.c:2827 [inline]
                charge_memcg+0x90/0x3b0 mm/memcontrol.c:6889
                __mem_cgroup_charge+0x2b/0x90 mm/memcontrol.c:6910
                mem_cgroup_charge include/linux/memcontrol.h:667 [inline]
                __filemap_add_folio+0x615/0xf80 mm/filemap.c:852
                filemap_add_folio+0xaf/0x1e0 mm/filemap.c:934
                __filemap_get_folio+0x389/0xd80 mm/filemap.c:1976
                pagecache_get_page+0x2e/0x280 mm/folio-compat.c:104
                find_or_create_page include/linux/pagemap.h:612 [inline]
                alloc_extent_buffer+0x2b9/0x1580 fs/btrfs/extent_io.c:4588
                btrfs_init_new_buffer fs/btrfs/extent-tree.c:4869 [inline]
                btrfs_alloc_tree_block+0x2e1/0x1320 fs/btrfs/extent-tree.c:4988
                __btrfs_cow_block+0x3b2/0x1420 fs/btrfs/ctree.c:440
                btrfs_cow_block+0x2fa/0x950 fs/btrfs/ctree.c:595
                btrfs_search_slot+0x11b0/0x2c70 fs/btrfs/ctree.c:2038
                btrfs_update_root+0xdb/0x630 fs/btrfs/root-tree.c:137
                update_log_root fs/btrfs/tree-log.c:2841 [inline]
                btrfs_sync_log+0xbfb/0x2870 fs/btrfs/tree-log.c:3064
                btrfs_sync_file+0xdb9/0x13c0 fs/btrfs/file.c:1947
                vfs_fsync_range+0x13e/0x230 fs/sync.c:188
                generic_write_sync include/linux/fs.h:2856 [inline]
                iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
                btrfs_direct_write fs/btrfs/file.c:1536 [inline]
                btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
                call_write_iter include/linux/fs.h:2160 [inline]
                do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
                do_iter_write+0x182/0x700 fs/read_write.c:861
                vfs_iter_write+0x74/0xa0 fs/read_write.c:902
                iter_file_splice_write+0x745/0xc90 fs/splice.c:686
                do_splice_from fs/splice.c:764 [inline]
                direct_splice_actor+0x114/0x180 fs/splice.c:931
                splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
                do_splice_direct+0x1ab/0x280 fs/splice.c:974
                do_sendfile+0xb19/0x1270 fs/read_write.c:1255
                __do_sys_sendfile64 fs/read_write.c:1323 [inline]
                __se_sys_sendfile64 fs/read_write.c:1309 [inline]
                __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
                do_syscall_x64 arch/x86/entry/common.c:50 [inline]
                do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
                entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
         other info that might help us debug this:
      
         Chain exists of:
           &delayed_node->mutex --> btrfs-tree-00 --> btrfs-log-00
      
         Possible unsafe locking scenario:
      
                CPU0                    CPU1
                ----                    ----
           lock(btrfs-log-00);
                                        lock(btrfs-tree-00);
                                        lock(btrfs-log-00);
           lock(&delayed_node->mutex);
      
      Holding a read lock on a leaf from a fs/subvolume tree creates a nasty
      lock dependency when we are COWing extent buffers for the log tree and we
      have two tasks modifying the log tree, with each one in one of the
      following 2 scenarios:
      
      1) Modifying the log tree triggers an extent buffer allocation while
         holding a write lock on a parent extent buffer from the log tree.
         Allocating the pages for an extent buffer, or the extent buffer
         struct, can trigger inode eviction and finally the inode eviction
         will trigger a release/remove of a delayed node, which requires
         taking the delayed node's mutex;
      
      2) Allocating a metadata extent for a log tree can trigger the async
         reclaim thread and make us wait for it to release enough space and
         unblock our reservation ticket. The reclaim thread can start flushing
         delayed items, and that in turn results in the need to lock delayed
         node mutexes and in the need to write lock extent buffers of a
         subvolume tree - all this while holding a write lock on the parent
         extent buffer in the log tree.
      
      So one task in scenario 1) running in parallel with another task in
      scenario 2) could lead to a deadlock, one wanting to lock a delayed node
      mutex while having a read lock on a leaf from the subvolume, while the
      other is holding the delayed node's mutex and wants to write lock the same
      subvolume leaf for flushing delayed items.
      
      Fix this by cloning the leaf of the fs/subvolume tree, releasing/unlocking
      the fs/subvolume leaf and using the cloned leaf instead.
      
      Reported-by: syzbot+9b7c21f486f5e7f8d029@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/000000000000ccc93c05edc4d8cf@google.com/
      CC: stable@vger.kernel.org # 6.0+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      796787c9
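The lockdep report above boils down to a cycle in a lock-ordering graph. The sketch below is not how lockdep is implemented; it is a small depth-first search over hypothetical edges, where an edge (A, B) means "B was acquired while A was held". The lock names are taken from the "Chain exists of" line in the trace.

```python
def find_cycle(edges):
    """Return a list of locks forming a cycle, or None if there is none."""
    graph = {}
    for held, wanted in edges:
        graph.setdefault(held, []).append(wanted)
        graph.setdefault(wanted, [])

    state = {}  # lock -> 1 while on the DFS stack, 2 when fully explored
    path = []

    def visit(lock):
        state[lock] = 1
        path.append(lock)
        for nxt in graph[lock]:
            if state.get(nxt) == 1:
                # Back edge: the cycle runs from nxt to the current lock.
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[lock] = 2
        return None

    for lock in graph:
        if lock not in state:
            found = visit(lock)
            if found:
                return found
    return None


# Existing dependency chain reported in the trace.
existing = [
    ("delayed_node->mutex", "btrfs-tree-00"),
    ("btrfs-tree-00", "btrfs-log-00"),
]
# New acquisition: delayed node mutex taken while holding a log tree lock.
new_edge = ("btrfs-log-00", "delayed_node->mutex")

print(find_cycle(existing))               # prints None
print(find_cycle(existing + [new_edge]))  # prints the three locks in a loop
```

Once the subvolume leaf is cloned and its lock released before the log tree is modified, the dependency from the subvolume leaf to the log tree is no longer created on that path, which is what breaks the cycle in this model.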
  5. 29 Sep, 2022 1 commit
  6. 26 Sep, 2022 13 commits
    • J
      btrfs: unify the lock/unlock extent variants · 570eb97b
      Committed by Josef Bacik
      We have two variants of lock/unlock extent, one set that takes a cached
      state, another that does not.  This is slightly annoying, and generally
      speaking there are only a few places where we don't have a cached state.
      Simplify this by making lock_extent/unlock_extent the only variant and
      make it take a cached state, then convert all the callers appropriately.

      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      570eb97b
    • F
      btrfs: simplify adding and replacing references during log replay · 7059c658
      Committed by Filipe Manana
      During log replay, when adding/replacing inode references, there are two
      special cases that have special code for them:
      
      1) When we have an inode with two or more hardlinks in the same directory,
         therefore two or more names encoded in the same inode reference item,
         and one of the hard links gets renamed to the old name of another hard
         link - that is, the index number for a name changes. This was added in
         commit 0d836392 ("Btrfs: fix mount failure after fsync due to
         hard link recreation"), and is covered by test case generic/502 from
         fstests;
      
      2) When we have several inodes that got renamed to an old name of some
         other inode, in a cascading style. The code to deal with this special
         case was added in commit 6b5fc433 ("Btrfs: fix fsync after
         succession of renames of different files"), and is covered by test
         cases generic/526 and generic/527 from fstests.
      
      Both cases can be dealt with by making sure __add_inode_ref() is always
      called by add_inode_ref() for every name encoded in the inode reference
      item, and not just for the first name that has a conflict. With such
      change we no longer need that special casing for the two cases mentioned
      before. So do those changes.

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7059c658
    • F
      btrfs: use delayed items when logging a directory · 30b80f3c
      Committed by Filipe Manana
      When logging a directory we start by flushing all its delayed items.
      That results in adding dir index items to the subvolume btree, for new
      dentries, and removing dir index items from the subvolume btree for any
      dentries that were deleted.
      
      This makes it straightforward to log a directory simply by iterating over
      all the modified subvolume btree leaves, especially when we used to log
      both dir index keys and dir item keys (before commit 339d0354
      ("btrfs: only copy dir index keys when logging a directory") and when we
      used to copy old dir index entries for leaves modified in the current
      transaction (before commit 732d591a ("btrfs: stop copying old dir
      items when logging a directory")).
      
      From an efficiency point of view this has a couple of drawbacks:
      
      1) Adds extra latency, due to copying delayed items to the subvolume btree
         and deleting dir index items from the btree.
      
         Further if there are other tasks accessing the btree, which is common
         (syscalls like creat, mkdir, rename, link, unlink, truncate, reflinks,
         etc, finishing an ordered extent, etc), lock contention can cause
         further delays, both to the task logging a directory and to the other
         tasks accessing the btree;
      
      2) More time spent overall flushing delayed items, if after logging the
         directory further changes are done to the directory in the same
         transaction.
      
         For example, if we add 10 dentries to a directory, fsync it, add 10
         more dentries, fsync it again, then add 10 more dentries and fsync it
         again, then we end up inserting 3 batches of 10 items to the subvolume
         btree. With the changes from this patch, we flush all the delayed items
         to the btree only once - a single batch of 30 items, and outside the
         logging code (transaction commit or when delayed items are flushed
         asynchronously).
      
      This change simply skips the flushing of delayed items every time we log a
      directory. Instead we copy the delayed insertion items directly to the log
      tree and delete delayed deletion items directly from the log tree.
      This avoids first changing the subvolume btree and then scanning it
      for new items to copy from it to the log tree and detecting deletions
      by observing gaps in consecutive dir index keys in subvolume btree leaves.
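The gap-based deletion detection mentioned in the paragraph above can be sketched as follows. This is a simplified illustration, not the kernel code: `deleted_ranges()` is a hypothetical helper that assumes the dir index keys of a directory are available as a sorted list of index numbers.

```python
def deleted_ranges(index_keys):
    """Return inclusive (first, last) ranges missing between sorted keys.

    A gap between two consecutive dir index numbers means the entries
    that used to occupy the gap were deleted.
    """
    ranges = []
    for prev, cur in zip(index_keys, index_keys[1:]):
        if cur > prev + 1:
            ranges.append((prev + 1, cur - 1))
    return ranges


# Indexes 4-5 and 8-9 are missing, so they are reported as deleted.
print(deleted_ranges([2, 3, 6, 7, 10]))  # prints [(4, 5), (8, 9)]
```

With delayed items used directly, as the commit message describes, the deletions are already known as explicit delayed deletion items, so this kind of scan is no longer needed.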
      
      Running the following tests on a non-debug kernel (Debian's default kernel
      config), on a box with an NVMe device, a 12-core Intel CPU and 64G of RAM,
      produced the results below.
      
      The results compare a branch without this patch and all the other patches
      it depends on versus the same branch with the patchset applied.
      
      The patchset is comprised of the following patches:
      
        btrfs: don't drop dir index range items when logging a directory
        btrfs: remove the root argument from log_new_dir_dentries()
        btrfs: update stale comment for log_new_dir_dentries()
        btrfs: free list element sooner at log_new_dir_dentries()
        btrfs: avoid memory allocation at log_new_dir_dentries() for common case
        btrfs: remove root argument from btrfs_delayed_item_reserve_metadata()
        btrfs: store index number instead of key in struct btrfs_delayed_item
        btrfs: remove unused logic when looking up delayed items
        btrfs: shrink the size of struct btrfs_delayed_item
        btrfs: search for last logged dir index if it's not cached in the inode
        btrfs: move need_log_inode() to above log_conflicting_inodes()
        btrfs: move log_new_dir_dentries() above btrfs_log_inode()
        btrfs: log conflicting inodes without holding log mutex of the initial inode
        btrfs: skip logging parent dir when conflicting inode is not a dir
        btrfs: use delayed items when logging a directory
      
      Custom test script for testing time spent at btrfs_log_inode():
      
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
      
         # Total number of files to create in the test directory.
         NUM_FILES=10000
         # Fsync after creating or renaming N files.
         FSYNC_AFTER=100
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         mount -o ssd $DEV $MNT
      
         TEST_DIR=$MNT/testdir
         mkdir $TEST_DIR
      
         echo "Creating files..."
         for ((i = 1; i <= $NUM_FILES; i++)); do
                 echo -n > $TEST_DIR/file_$i
                 if (( ($i % $FSYNC_AFTER) == 0 )); then
                         xfs_io -c "fsync" $TEST_DIR
                 fi
         done
      
         sync
      
         echo "Renaming files..."
         for ((i = 1; i <= $NUM_FILES; i++)); do
                 mv $TEST_DIR/file_$i $TEST_DIR/file_$i.renamed
                 if (( ($i % $FSYNC_AFTER) == 0 )); then
                         xfs_io -c "fsync" $TEST_DIR
                 fi
         done
      
         umount $MNT
      
      And using the following bpftrace script to capture the total time that is
      spent at btrfs_log_inode():
      
         #!/usr/bin/bpftrace
      
         k:btrfs_log_inode
         {
                 @start_log_inode[tid] = nsecs;
         }
      
         kr:btrfs_log_inode
         /@start_log_inode[tid]/
         {
                 $dur = (nsecs - @start_log_inode[tid]) / 1000;
                 @btrfs_log_inode_total_time = sum($dur);
                 delete(@start_log_inode[tid]);
         }
      
         END
         {
                 clear(@start_log_inode);
         }
      
      Result before applying patchset:
      
         @btrfs_log_inode_total_time: 622642
      
      Result after applying patchset:
      
         @btrfs_log_inode_total_time: 354134    (-43.1% time spent)
      
      The following dbench script was also used for testing:
      
         #!/bin/bash
      
         NUM_JOBS=$(nproc --all)
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
         MOUNT_OPTIONS="-o ssd"
         MKFS_OPTIONS="-O no-holes -R free-space-tree"
      
         echo "performance" | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $MKFS_OPTIONS $DEV
         mount $MOUNT_OPTIONS $DEV $MNT
      
         dbench -D $MNT --skip-cleanup -t 120 -S $NUM_JOBS
      
         umount $MNT
      
      Before patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    3322265     0.034    21.032
       Close        2440562     0.002     0.994
       Rename        140664     1.150   269.633
       Unlink        670796     1.093   269.678
       Deltree           96     5.481    15.510
       Mkdir             48     0.004     0.052
       Qpathinfo    3010924     0.014     8.127
       Qfileinfo     528055     0.001     0.518
       Qfsinfo       552113     0.003     0.372
       Sfileinfo     270575     0.005     0.688
       Find         1164176     0.052    13.931
       WriteX       1658537     0.019     5.918
       ReadX        5207412     0.003     1.034
       LockX          10818     0.003     0.079
       UnlockX        10818     0.002     0.313
       Flush         232811     1.027   269.735
      
      Throughput 869.867 MB/sec (sync dirs)  12 clients  12 procs  max_latency=269.741 ms
      
      After patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4152738     0.029    20.863
       Close        3050770     0.002     1.119
       Rename        175829     0.871   211.741
       Unlink        838447     0.845   211.724
       Deltree          120     4.798    14.162
       Mkdir             60     0.003     0.005
       Qpathinfo    3763807     0.011     4.673
       Qfileinfo     660111     0.001     0.400
       Qfsinfo       690141     0.003     0.429
       Sfileinfo     338260     0.005     0.725
       Find         1455273     0.046     6.787
       WriteX       2073307     0.017     5.690
       ReadX        6509193     0.003     1.171
       LockX          13522     0.003     0.077
       UnlockX        13522     0.002     0.125
       Flush         291044     0.811   211.631
      
      Throughput 1089.27 MB/sec (sync dirs)  12 clients  12 procs  max_latency=211.750 ms
      
      (+25.2% throughput, -21.5% max latency)
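      The summary percentages can be rechecked from the two dbench result
      lines above. The helper below is only an illustrative sketch, and
      percent_change() is a hypothetical name, not part of dbench or btrfs:

```c
#include <assert.h>

/* Relative change from `before` to `after`, expressed in percent. */
static double percent_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

      Feeding it the throughput pair (869.867, 1089.27) and the max latency
      pair (269.741, 211.750) gives values that round to the figures quoted
      above.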
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      30b80f3c
    • F
      btrfs: skip logging parent dir when conflicting inode is not a dir · 5557a069
      Filipe Manana committed
      When we find a conflicting inode (an inode that had the same name and
      parent directory as the inode we are logging now) that was deleted in the
      current transaction, we always end up logging its parent directory.
      
      This is to deal with the case where the conflicting inode corresponds to
      a deleted subvolume/snapshot or a directory that had subvolumes/snapshots
      (or some subdirectory inside it had subvolumes/snapshots, etc), because
      we can't deal with dropping subvolumes/snapshots during log replay. So
      if we log the parent directory, and if we are dealing with these special
      cases, then we fall back to a transaction commit when logging the parent,
      because its last_unlink_trans will match the current transaction (which
      gets set and propagated when a subvolume/snapshot is deleted).
      
      This change skips the logging of the parent directory when the conflicting
      inode is not a directory (or a subvolume/snapshot). This is ok because in
      this case logging the current inode is enough to trigger an unlink of the
      conflicting inode during log replay.
      
      So for a case like this:
      
        $ mkdir /mnt/dir
        $ echo -n "first foo data" > /mnt/dir/foo
      
        $ sync
      
        $ rm -f /mnt/dir/foo
        $ echo -n "second foo data" > /mnt/dir/foo
        $ xfs_io -c "fsync" /mnt/dir/foo
      
      We avoid logging the parent directory "dir" when logging the new file "foo".
      In other cases it avoids falling back to a transaction commit, when the
      parent directory has a last_unlink_trans value that matches the current
      transaction, due to moving a file from it to some other directory.
      
      This is a case that happens frequently with dbench for example, where a
      new file that has the name/parent of another file that was deleted in the
      current transaction is fsynced.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5557a069
    • F
      btrfs: log conflicting inodes without holding log mutex of the initial inode · e09d94c9
      Filipe Manana committed
      When logging an inode, if we detect the inode has a reference that
      conflicts with some other inode that got renamed, we log that other inode
      while holding the log mutex of the current inode. We then find out if
      there are other inodes that conflict with the first conflicting inode,
      and log them while under the log mutex of the original inode. This is
      fine because the recursion can only happen once.
      
      For the upcoming work where we directly log delayed items without flushing
      them first to the subvolume tree, this recursion adds a lot of complexity
      and it's hard to keep lockdep happy about it.
      
      So collect a list of conflicting inodes and then log the inodes after
      unlocking the log mutex of the inode we started with.
      
      Also limit the maximum number of conflicting inodes we log to 10, to avoid
      spending too much time logging (and maybe allocating too many list
      elements too), as typically we don't have more than 1 or 2 conflicting
      inodes. If we go over the limit, simply fall back to a transaction commit.
      
      A user can intentionally create a very long list of conflicting inodes
      with a very long succession of renames like this:
      
        (...)
        rename E to F
        rename D to E
        rename C to D
        rename B to C
        rename A to B
        touch A (create a new file named A)
        fsync A
      
      If that happened for a sequence of hundreds or thousands of renames, it
      could massively slow down the logging and cause secondary effects such
      as blocking other fsync operations and transaction commits for a very
      long time (assuming it wouldn't run into -ENOSPC or -ENOMEM first).
      Such cases are very uncommon in practice, but it's better to be prepared
      for them and avoid chaos. Such a long sequence of conflicting inodes
      could be created before this change.
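      The collect-then-log flow described above can be sketched in userspace
      C. This is not the kernel code: every sketch_* and SKETCH_* name is
      hypothetical, the log mutex is modelled as a plain flag, and only the
      limit of 10 comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_CONFLICTS 10  /* limit taken from the commit message */
#define SKETCH_FORCE_COMMIT   1  /* hypothetical "fall back to a commit" code */

struct sketch_ctx {
    unsigned long conflicts[SKETCH_MAX_CONFLICTS];
    size_t num_conflicts;
    int log_mutex_held;  /* 1 while the initial inode's log mutex is held */
    size_t logged;       /* conflicting inodes logged so far */
};

/* Queue a conflicting inode instead of logging it right away. */
static int sketch_queue_conflict(struct sketch_ctx *ctx, unsigned long ino)
{
    assert(ctx->log_mutex_held);  /* conflicts are found under the mutex */
    if (ctx->num_conflicts >= SKETCH_MAX_CONFLICTS)
        return SKETCH_FORCE_COMMIT;
    ctx->conflicts[ctx->num_conflicts++] = ino;
    return 0;
}

/* Log an inode: collect conflicts under the mutex, log them after unlock. */
static int sketch_log_inode(struct sketch_ctx *ctx,
                            const unsigned long *found, size_t nfound)
{
    int ret = 0;

    ctx->log_mutex_held = 1;
    for (size_t i = 0; i < nfound && ret == 0; i++)
        ret = sketch_queue_conflict(ctx, found[i]);
    ctx->log_mutex_held = 0;  /* unlock before touching other inodes */

    if (ret)
        return ret;           /* over the limit: caller commits instead */

    for (size_t i = 0; i < ctx->num_conflicts; i++)
        ctx->logged++;        /* stands in for logging conflicts[i] */
    return 0;
}
```

      With two conflicts the sketch logs both after dropping the mutex; with
      an eleventh conflict it returns the fallback code and logs nothing.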
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e09d94c9
    • F
      btrfs: move log_new_dir_dentries() above btrfs_log_inode() · f6d86dbe
      Filipe Manana committed
      The static function log_new_dir_dentries() is currently defined below
      btrfs_log_inode(), but in an upcoming patch a new function is introduced
      that is called by btrfs_log_inode() and this new function needs to call
      log_new_dir_dentries(). So move log_new_dir_dentries() to a location
      between btrfs_log_inode() and need_log_inode() (the latter is called by
      log_new_dir_dentries()).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f6d86dbe
    • F
      btrfs: move need_log_inode() to above log_conflicting_inodes() · a3751024
      Filipe Manana committed
      The static function need_log_inode() is defined below btrfs_log_inode()
      and log_conflicting_inodes(), but in the next patches in the series we
      will need to call need_log_inode() in a couple of new functions that will be
      used by btrfs_log_inode(). So move its definition to a location above
      log_conflicting_inodes().
      
      Also make its arguments 'const', since they are not supposed to be
      modified.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a3751024
    • F
      btrfs: search for last logged dir index if it's not cached in the inode · 193df624
      Filipe Manana committed
      The key offset of the last dir index item that was logged is stored in
      the inode's last_dir_index_offset field. However that field is not
      persisted in the inode item or elsewhere, so if the inode gets evicted
      and reloaded, it gets a value of (u64)-1. In that case, when logging
      dir index items, we check if each one was logged before, to avoid
      attempting to insert duplicated keys and falling back to a transaction
      commit.
      
      Improve on this by searching for the last dir index that was logged when
      we start logging a directory if the inode's last_dir_index_offset is not
      set (has a value of (u64)-1) and it was logged before. This avoids
      checking if each dir index item we find was already logged before, and
      simplifies the logging of dir index items (process_dir_items_leaf()).
      
      This will also be needed for an upcoming change where we start logging
      delayed items directly, without flushing them first.
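      A minimal userspace sketch of the lookup described above, under stated
      assumptions: the log tree is modelled as a flat array of already logged
      dir index offsets, 0 is used as a hypothetical "nothing logged yet"
      value, and all sketch_* names are illustrative rather than kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_UNSET UINT64_MAX  /* the (u64)-1 "not cached" value */

struct sketch_inode {
    uint64_t last_dir_index_offset;  /* in-memory only, lost on eviction */
    int logged_before;               /* logged earlier in this transaction */
};

/*
 * Return the offset of the last logged dir index, searching the log once
 * if the cached value is unset, and cache the result in the inode.
 */
static uint64_t sketch_last_logged_index(struct sketch_inode *inode,
                                         const uint64_t *logged, size_t n)
{
    uint64_t last = 0;  /* hypothetical "nothing logged yet" value */

    assert(inode != NULL);
    if (inode->last_dir_index_offset != SKETCH_UNSET)
        return inode->last_dir_index_offset;  /* cached, no search needed */

    if (inode->logged_before) {
        for (size_t i = 0; i < n; i++)
            if (logged[i] > last)
                last = logged[i];
    }
    inode->last_dir_index_offset = last;
    return last;
}

/* A dir index item only needs logging if it is past the last logged one. */
static int sketch_needs_logging(uint64_t index, uint64_t last_logged)
{
    return index > last_logged;
}
```

      The point of the single up-front search is that each dir index item
      found afterwards is compared against one number instead of being
      looked up in the log individually.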
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      193df624
    • F
      btrfs: avoid memory allocation at log_new_dir_dentries() for common case · 009d9bea
      Filipe Manana committed
      At log_new_dir_dentries() we always start by allocating a list element
      for the starting inode and then do a while loop with the condition being
      a list emptiness check.
      
      This however is not needed: we can avoid allocating this initial list
      element and then just check for the list emptiness at the end of the
      loop's body. So just do that to save one memory allocation from the
      kmalloc-32 slab.
      
      This allows for not doing any memory allocation when we don't have any
      subdirectory to log, which is a very common case.
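      The loop shape described above can be sketched as follows. This is a
      userspace illustration, not the kernel function: struct sketch_dir,
      struct work_item and sketch_walk() are hypothetical, and malloc()
      stands in for the kernel allocation.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_dir {
    int nr_subdirs;
    const struct sketch_dir *subdirs[4];
};

struct work_item {
    const struct sketch_dir *dir;
    struct work_item *next;
};

/*
 * Visit `start` and all directories below it. Returns the number of
 * directories visited, or -1 on allocation failure. *nr_allocs counts
 * the list elements that had to be allocated.
 */
static int sketch_walk(const struct sketch_dir *start, int *nr_allocs)
{
    struct work_item *head = NULL;
    const struct sketch_dir *cur = start;
    int visited = 0;

    assert(start != NULL && nr_allocs != NULL);
    *nr_allocs = 0;

    while (1) {
        struct work_item *item;

        visited++;  /* stands in for logging cur's new dentries */

        for (int i = 0; i < cur->nr_subdirs; i++) {
            item = malloc(sizeof(*item));
            if (!item) {
                /* Explicit loop to release everything on error. */
                while (head) {
                    item = head->next;
                    free(head);
                    head = item;
                }
                return -1;
            }
            (*nr_allocs)++;
            item->dir = cur->subdirs[i];
            item->next = head;
            head = item;
        }

        /* Emptiness check at the end of the body, not in the condition. */
        if (!head)
            break;
        item = head;
        head = item->next;
        cur = item->dir;
        free(item);
    }
    return visited;
}
```

      For a directory without subdirectories the sketch performs no
      allocation at all, which is the common case the commit targets.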
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      009d9bea
    • F
      btrfs: free list element sooner at log_new_dir_dentries() · 40084813
      Filipe Manana committed
      At log_new_dir_dentries(), there's no need to keep the current list
      element allocated while processing the leaves with directory items for
      the current directory, and while logging other inodes. Plus in case we
      find a subdirectory, we also end up allocating a new list element while
      the current one is still allocated, temporarily using more memory than
      necessary.
      
      So free the current list element early on, before processing leaves.
      Also simplify the removal and release of all list elements in case of an
      error by eliminating the label and goto, and adding an explicit loop to
      release all list elements when an error happens.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      40084813
    • F
      btrfs: update stale comment for log_new_dir_dentries() · b96c552b
      Filipe Manana committed
      The comment refers to the function log_dir_items() in order to check why
      the inodes of new directory entries need to be logged, but the relevant
      comments are no longer at log_dir_items(), they were moved to the function
      process_dir_items_leaf() in commit eb10d85e ("btrfs: factor out the
      copying loop of dir items from log_dir_items()"). So update it with the
      current function name.
      
      Also replace references to i_mutex with "VFS lock", since the inode lock
      has not been a mutex since 2016 (it's now a rw semaphore).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b96c552b
    • F
      btrfs: remove the root argument from log_new_dir_dentries() · 8786a6d7
      Filipe Manana committed
      There's no point in passing a root argument to log_new_dir_dentries()
      because it always corresponds to the root of the given inode. So remove
      it and extract the root from the given inode.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8786a6d7
    • F
      btrfs: don't drop dir index range items when logging a directory · 04fc7d51
      Filipe Manana committed
      When logging a directory that was previously logged in the current
      transaction, we drop all the range items (BTRFS_DIR_LOG_INDEX_KEY key
      type). This is because we will process all leaves in the subvolume's tree
      that were changed in the current transaction and then add range items for
      covering new dir index items and deleted dir index items, which could
      now cover a larger range than before.
      
      We used to fail if we tried to insert a range item key that already
      exists, so we dropped all range items to avoid failing. However nowadays,
      since commit 750ee454 ("btrfs: fix assertion failure when logging
      directory key range item"), we simply update any range item that already
      exists, increasing its range's last dir index if needed. Since the range
      covered by a range item can never decrease, due to the fact that dir index
      values come from a monotonically increasing counter and are never reused,
      we can stop dropping all range items before we start logging a directory.
      By not dropping the items we can avoid having occasional tree rebalance
      operations.
      
      This will also be needed for an upcoming change where we start logging
      delayed items directly, without flushing them first.
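      The "update instead of drop" behaviour for range items can be sketched
      like this. It is an illustration only: the log is modelled as a small
      array, a range item is assumed to be keyed by its first dir index, and
      struct sketch_range and sketch_log_range() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_range {
    uint64_t first;  /* assumed key: first dir index covered */
    uint64_t last;   /* last dir index covered */
};

/*
 * Log a range item. If an item with the same key already exists, only
 * grow its end, never shrink it. Returns 0 on success, -1 if full.
 */
static int sketch_log_range(struct sketch_range *items, size_t *n,
                            size_t cap, uint64_t first, uint64_t last)
{
    assert(items != NULL && n != NULL && first <= last);

    for (size_t i = 0; i < *n; i++) {
        if (items[i].first == first) {
            if (last > items[i].last)
                items[i].last = last;  /* extend the existing range */
            return 0;
        }
    }
    if (*n == cap)
        return -1;
    items[*n].first = first;
    items[*n].last = last;
    (*n)++;
    return 0;
}
```

      Because an existing item is updated in place, logging the same
      directory again leaves the number of items unchanged, so nothing has
      to be deleted and reinserted.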
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      04fc7d51