1. 26 9月, 2022 14 次提交
    • J
      btrfs: unify the lock/unlock extent variants · 570eb97b
      Josef Bacik 提交于
      We have two variants of lock/unlock extent, one set that takes a cached
      state, another that does not.  This is slightly annoying, and generally
      speaking there are only a few places where we don't have a cached state.
      Simplify this by making lock_extent/unlock_extent the only variant and
      make it take a cached state, then convert all the callers appropriately.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      570eb97b
    • F
      btrfs: simplify adding and replacing references during log replay · 7059c658
      Filipe Manana 提交于
      During log replay, when adding/replacing inode references, there are two
      special cases that have special code for them:
      
      1) When we have an inode with two or more hardlinks in the same directory,
         therefore two or more names encoded in the same inode reference item,
         and one of the hard links gets renamed to the old name of another hard
         link - that is, the index number for a name changes. This was added in
         commit 0d836392 ("Btrfs: fix mount failure after fsync due to
         hard link recreation"), and is covered by test case generic/502 from
         fstests;
      
      2) When we have several inodes that got renamed to an old name of some
         other inode, in a cascading style. The code to deal with this special
         case was added in commit 6b5fc433 ("Btrfs: fix fsync after
         succession of renames of different files"), and is covered by test
         cases generic/526 and generic/527 from fstests.
      
      Both cases can be deal with by making sure __add_inode_ref() is always
      called by add_inode_ref() for every name encoded in the inode reference
      item, and not just for the first name that has a conflict. With such
      change we no longer need that special casing for the two cases mentioned
      before. So do those changes.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7059c658
    • F
      btrfs: use delayed items when logging a directory · 30b80f3c
      Filipe Manana 提交于
      When logging a directory we start by flushing all its delayed items.
      That results in adding dir index items to the subvolume btree, for new
      dentries, and removing dir index items from the subvolume btree for any
      dentries that were deleted.
      
      This makes it straightforward to log a directory simply by iterating over
      all the modified subvolume btree leaves, especially when we used to log
      both dir index keys and dir item keys (before commit 339d0354
      ("btrfs: only copy dir index keys when logging a directory") and when we
      used to copy old dir index entries for leaves modified in the current
      transaction (before commit 732d591a ("btrfs: stop copying old dir
      items when logging a directory")).
      
      From an efficiency point of view this has a couple of drawbacks:
      
      1) Adds extra latency, due to copying delayed items to the subvolume btree
         and deleting dir index items from the btree.
      
         Further if there are other tasks accessing the btree, which is common
         (syscalls like creat, mkdir, rename, link, unlink, truncate, reflinks,
         etc, finishing an ordered extent, etc), lock contention can cause
         further delays, both to the task logging a directory and to the other
         tasks accessing the btree;
      
      2) More time spent overall flushing delayed items, if after logging the
         directory further changes are done to the directory in the same
         transaction.
      
         For example, if we add 10 dentries to a directory, fsync it, add more
         10 dentries, fsync it again, then add more 10 dentries and fsync it
         again, then we end up inserting 3 batches of 10 items to the subvolume
         btree. With the changes from this patch, we flush all the delayed items
         to the btree only once - a single batch of 30 items, and outside the
         logging code (transaction commit or when delayed items are flushed
         asynchronously).
      
      This change simply skips the flushing of delayed items every time we log a
      directory. Instead we copy the delayed insertion items directly to the log
      tree and delete delayed deletion items directly from the log tree.
      Therefore avoiding changing first the subvolume btree and then scanning it
      for new items to copy from it to the log tree and detecting deletions
      by observing gaps in consecutive dir index keys in subvolume btree leaves.
      
      Running the following tests on a non-debug kernel (Debian's default kernel
      config), on a box with a NVMe device, a 12 cores Intel CPU and 64G of ram,
      produced the results below.
      
      The results compare a branch without this patch and all the other patches
      it depends on versus the same branch with the patchset applied.
      
      The patchset is comprised of the following patches:
      
        btrfs: don't drop dir index range items when logging a directory
        btrfs: remove the root argument from log_new_dir_dentries()
        btrfs: update stale comment for log_new_dir_dentries()
        btrfs: free list element sooner at log_new_dir_dentries()
        btrfs: avoid memory allocation at log_new_dir_dentries() for common case
        btrfs: remove root argument from btrfs_delayed_item_reserve_metadata()
        btrfs: store index number instead of key in struct btrfs_delayed_item
        btrfs: remove unused logic when looking up delayed items
        btrfs: shrink the size of struct btrfs_delayed_item
        btrfs: search for last logged dir index if it's not cached in the inode
        btrfs: move need_log_inode() to above log_conflicting_inodes()
        btrfs: move log_new_dir_dentries() above btrfs_log_inode()
        btrfs: log conflicting inodes without holding log mutex of the initial inode
        btrfs: skip logging parent dir when conflicting inode is not a dir
        btrfs: use delayed items when logging a directory
      
      Custom test script for testing time spent at btrfs_log_inode():
      
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
      
         # Total number of files to create in the test directory.
         NUM_FILES=10000
         # Fsync after creating or renaming N files.
         FSYNC_AFTER=100
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         mount -o ssd $DEV $MNT
      
         TEST_DIR=$MNT/testdir
         mkdir $TEST_DIR
      
         echo "Creating files..."
         for ((i = 1; i <= $NUM_FILES; i++)); do
                 echo -n > $TEST_DIR/file_$i
                 if (( ($i % $FSYNC_AFTER) == 0 )); then
                         xfs_io -c "fsync" $TEST_DIR
                 fi
         done
      
         sync
      
         echo "Renaming files..."
         for ((i = 1; i <= $NUM_FILES; i++)); do
                 mv $TEST_DIR/file_$i $TEST_DIR/file_$i.renamed
                 if (( ($i % $FSYNC_AFTER) == 0 )); then
                         xfs_io -c "fsync" $TEST_DIR
                 fi
         done
      
         umount $MNT
      
      And using the following bpftrace script to capture the total time that is
      spent at btrfs_log_inode():
      
         #!/usr/bin/bpftrace
      
         k:btrfs_log_inode
         {
                 @start_log_inode[tid] = nsecs;
         }
      
         kr:btrfs_log_inode
         /@start_log_inode[tid]/
         {
                 $dur = (nsecs - @start_log_inode[tid]) / 1000;
                 @btrfs_log_inode_total_time = sum($dur);
                 delete(@start_log_inode[tid]);
         }
      
         END
         {
                 clear(@start_log_inode);
         }
      
      Result before applying patchset:
      
         @btrfs_log_inode_total_time: 622642
      
      Result after applying patchset:
      
         @btrfs_log_inode_total_time: 354134    (-43.1% time spent)
      
      The following dbench script was also used for testing:
      
         #!/bin/bash
      
         NUM_JOBS=$(nproc --all)
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
         MOUNT_OPTIONS="-o ssd"
         MKFS_OPTIONS="-O no-holes -R free-space-tree"
      
         echo "performance" | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $MKFS_OPTIONS $DEV
         mount $MOUNT_OPTIONS $DEV $MNT
      
         dbench -D $MNT --skip-cleanup -t 120 -S $NUM_JOBS
      
         umount $MNT
      
      Before patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    3322265     0.034    21.032
       Close        2440562     0.002     0.994
       Rename        140664     1.150   269.633
       Unlink        670796     1.093   269.678
       Deltree           96     5.481    15.510
       Mkdir             48     0.004     0.052
       Qpathinfo    3010924     0.014     8.127
       Qfileinfo     528055     0.001     0.518
       Qfsinfo       552113     0.003     0.372
       Sfileinfo     270575     0.005     0.688
       Find         1164176     0.052    13.931
       WriteX       1658537     0.019     5.918
       ReadX        5207412     0.003     1.034
       LockX          10818     0.003     0.079
       UnlockX        10818     0.002     0.313
       Flush         232811     1.027   269.735
      
      Throughput 869.867 MB/sec (sync dirs)  12 clients  12 procs  max_latency=269.741 ms
      
      After patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4152738     0.029    20.863
       Close        3050770     0.002     1.119
       Rename        175829     0.871   211.741
       Unlink        838447     0.845   211.724
       Deltree          120     4.798    14.162
       Mkdir             60     0.003     0.005
       Qpathinfo    3763807     0.011     4.673
       Qfileinfo     660111     0.001     0.400
       Qfsinfo       690141     0.003     0.429
       Sfileinfo     338260     0.005     0.725
       Find         1455273     0.046     6.787
       WriteX       2073307     0.017     5.690
       ReadX        6509193     0.003     1.171
       LockX          13522     0.003     0.077
       UnlockX        13522     0.002     0.125
       Flush         291044     0.811   211.631
      
      Throughput 1089.27 MB/sec (sync dirs)  12 clients  12 procs  max_latency=211.750 ms
      
      (+25.2% throughput, -21.5% max latency)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      30b80f3c
    • F
      btrfs: skip logging parent dir when conflicting inode is not a dir · 5557a069
      Filipe Manana 提交于
      When we find a conflicting inode (an inode that had the same name and
      parent directory as the inode we are logging now) that was deleted in the
      current transaction, we always end up logging its parent directory.
      
      This is to deal with the case where the conflicting inode corresponds to
      a deleted subvolume/snapshot or a directory that had subvolumes/snapshots
      (or some subdirectory inside it had subvolumes/snapshots, etc), because
      we can't deal with dropping subvolumes/snapshots during log replay. So
      if we log the parent directory, and if we are dealing with these special
      cases, then we fallback to a transaction commit when logging the parent,
      because its last_unlink_trans will match the current transaction (which
      gets set and propagated when a subvolume/snapshot is deleted).
      
      This change skips the logging of the parent directory when the conflicting
      inode is not a directory (or a subvolume/snapshot). This is ok because in
      this case logging the current inode is enough to trigger an unlink of the
      conflicting inode during log replay.
      
      So for a case like this:
      
        $ mkdir /mnt/dir
        $ echo -n "first foo data" > /mnt/dir/foo
      
        $ sync
      
        $ rm -f /mnt/dir/foo
        $ echo -n "second foo data" > /mnt/dir/foo
        $ xfs_io -c "fsync" /mnt/dir/foo
      
      We avoid logging parent directory "dir" when logging the new file "foo".
      In other cases it avoids falling back to a transaction commit, when the
      parent directory has a last_unlink_trans value that matches the current
      transaction, due to moving a file from it to some other directory.
      
      This is a case that happens frequently with dbench for example, where a
      new file that has the name/parent of another file that was deleted in the
      current transaction, is fsynced.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5557a069
    • F
      btrfs: log conflicting inodes without holding log mutex of the initial inode · e09d94c9
      Filipe Manana 提交于
      When logging an inode, if we detect the inode has a reference that
      conflicts with some other inode that got renamed, we log that other inode
      while holding the log mutex of the current inode. We then find out if
      there are other inodes that conflict with the first conflicting inode,
      and log them while under the log mutex of the original inode. This is
      fine because the recursion can only happen once.
      
      For the upcoming work where we directly log delayed items without flushing
      them first to the subvolume tree, this recursion adds a lot of complexity
      and it's hard to keep lockdep happy about it.
      
      So collect a list of conflicting inodes and then log the inodes after
      unlocking the log mutex of the inode we started with.
      
      Also limit the maximum number of conflict inodes we log to 10, to avoid
      spending too much time logging (and maybe allocating too many list
      elements too), as typically we don't have more than 1 or 2 conflicting
      inodes - if we go over the limit, simply fallback to a transaction commit.
      
      It is possible to have a very long list of conflicting inodes to be
      intentionally created by a user if he/she creates a very long succession
      of renames like this:
      
        (...)
        rename E to F
        rename D to E
        rename C to D
        rename B to C
        rename A to B
        touch A (create a new file named A)
        fsync A
      
      If that happened for a sequence of hundreds or thousands of renames, it
      could massively slow down the logging and cause other secondary effects
      like for example blocking other fsync operations and transaction commits
      for a very long time (assuming it wouldn't run into -ENOSPC or -ENOMEM
      first). However such cases are very uncommon to happen in practice,
      nevertheless it's better to be prepared for them and avoid chaos.
      Such long sequence of conflicting inodes could be created before this
      change.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e09d94c9
    • F
      btrfs: move log_new_dir_dentries() above btrfs_log_inode() · f6d86dbe
      Filipe Manana 提交于
      The static function log_new_dir_dentries() is currently defined below
      btrfs_log_inode(), but in an upcoming patch a new function is introduced
      that is called by btrfs_log_inode() and this new function needs to call
      log_new_dir_dentries(). So move log_new_dir_dentries() to a location
      between btrfs_log_inode() and need_log_inode() (the later is called by
      log_new_dir_dentries()).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f6d86dbe
    • F
      btrfs: move need_log_inode() to above log_conflicting_inodes() · a3751024
      Filipe Manana 提交于
      The static function need_log_inode() is defined below btrfs_log_inode()
      and log_conflicting_inodes(), but in the next patches in the series we
      will need to call need_log_inode() in a couple new functions that will be
      used by btrfs_log_inode(). So move its definition to a location above
      log_conflicting_inodes().
      
      Also make its arguments 'const', since they are not supposed to be
      modified.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a3751024
    • F
      btrfs: search for last logged dir index if it's not cached in the inode · 193df624
      Filipe Manana 提交于
      The key offset of the last dir index item that was logged is stored in
      the inode's last_dir_index_offset field. However that field is not
      persisted in the inode item or elsewhere, so if the inode gets evicted
      and reloaded, it gets a value of (u64)-1, so that when we are logging
      dir index items we check if they were logged before, to avoid attempts
      to insert duplicated keys and fallback to a transaction commit.
      
      Improve on this by searching for the last dir index that was logged when
      we start logging a directory if the inode's last_dir_index_offset is not
      set (has a value of (u64)-1) and it was logged before. This avoids
      checking if each dir index item we find was already logged before, and
      simplifies the logging of dir index items (process_dir_items_leaf()).
      
      This will also be needed for an incoming change where we start logging
      delayed items directly, without flushing them first.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      193df624
    • F
      btrfs: avoid memory allocation at log_new_dir_dentries() for common case · 009d9bea
      Filipe Manana 提交于
      At log_new_dir_dentries() we always start by allocating a list element
      for the starting inode and then do a while loop with the condition being
      a list emptiness check.
      
      This however is not needed, we can avoid allocating this initial list
      element and then just check for the list emptiness at the end of the
      loop's body. So just do that to save one memory allocation from the
      kmalloc-32 slab.
      
      This allows for not doing any memory allocation when we don't have any
      subdirectory to log, which is a very common case.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      009d9bea
    • F
      btrfs: free list element sooner at log_new_dir_dentries() · 40084813
      Filipe Manana 提交于
      At log_new_dir_dentries(), there's no need to keep the current list
      element allocated while processing the leaves with directory items for
      the current directory, and while logging other inodes. Plus in case we
      find a subdirectory, we also end up allocating a new list element while
      the current one is still allocated, temporarily using more memory than
      necessary.
      
      So free the current list element early on, before processing leaves.
      Also make the removal and release of all list elements in case of an
      error more simple by eliminating the label and goto, adding an explicit
      loop to release all list elements in case an error happens.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      40084813
    • F
      btrfs: update stale comment for log_new_dir_dentries() · b96c552b
      Filipe Manana 提交于
      The comment refers to the function log_dir_items() in order to check why
      the inodes of new directory entries need to be logged, but the relevant
      comments are no longer at log_dir_items(), they were moved to the function
      process_dir_items_leaf() in commit eb10d85e ("btrfs: factor out the
      copying loop of dir items from log_dir_items()"). So update it with the
      current function name.
      
      Also remove references with i_mutex to "VFS lock", since the inode lock
      is no longer a mutex since 2016 (it's now a rw semaphore).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b96c552b
    • F
      btrfs: remove the root argument from log_new_dir_dentries() · 8786a6d7
      Filipe Manana 提交于
      There's no point in passing a root argument to log_new_dir_dentries()
      because it always corresponds to the root of the given inode. So remove
      it and extract the root from the given inode.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8786a6d7
    • F
      btrfs: don't drop dir index range items when logging a directory · 04fc7d51
      Filipe Manana 提交于
      When logging a directory that was previously logged in the current
      transaction, we drop all the range items (BTRFS_DIR_LOG_INDEX_KEY key
      type). This is because we will process all leaves in the subvolume's tree
      that were changed in the current transaction and then add range items for
      covering new dir index items and deleted dir index items, which could
      cover now a larger range than before.
      
      We used to fail if we tried to insert a range item key that already
      exists, so we dropped all range items to avoid failing. However nowadays,
      since commit 750ee454 ("btrfs: fix assertion failure when logging
      directory key range item"), we simply update any range item that already
      exists, increasing its range's last dir index if needed. Since the range
      covered by a range item can never decrease, due to the fact that dir index
      values come from a monotonically increasing counter and are never reused,
      we can stop dropping all range items before we start logging a directory.
      By not dropping the items we can avoid having occasional tree rebalance
      operations.
      
      This will also be needed for an incoming change where we start logging
      delayed items directly, without flushing them first.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      04fc7d51
    • O
      btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent() · d1f68ba0
      Omar Sandoval 提交于
      btrfs_insert_file_extent() is only ever used to insert holes, so rename
      it and remove the redundant parameters.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NOmar Sandoval <osandov@osandov.com>
      Signed-off-by: NSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d1f68ba0
  2. 17 8月, 2022 2 次提交
    • F
      btrfs: fix warning during log replay when bumping inode link count · 769030e1
      Filipe Manana 提交于
      During log replay, at add_link(), we may increment the link count of
      another inode that has a reference that conflicts with a new reference
      for the inode currently being processed.
      
      During log replay, at add_link(), we may drop (unlink) a reference from
      some inode in the subvolume tree if that reference conflicts with a new
      reference found in the log for the inode we are currently processing.
      
      After the unlink, If the link count has decreased from 1 to 0, then we
      increment the link count to prevent the inode from being deleted if it's
      evicted by an iput() call, because we may have references to add to that
      inode later on (and we will fixup its link count later during log replay).
      
      However incrementing the link count from 0 to 1 triggers a warning:
      
        $ cat fs/inode.c
        (...)
        void inc_nlink(struct inode *inode)
        {
              if (unlikely(inode->i_nlink == 0)) {
                       WARN_ON(!(inode->i_state & I_LINKABLE));
                       atomic_long_dec(&inode->i_sb->s_remove_count);
              }
        (...)
      
      The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's
      never set during log replay.
      
      Most of the time, the warning isn't triggered even if we dropped the last
      reference of the conflicting inode, and this is because:
      
      1) The conflicting inode was previously marked for fixup, through a call
         to link_to_fixup_dir(), which increments the inode's link count;
      
      2) And the last iput() on the inode has not triggered eviction of the
         inode, nor was eviction triggered after the iput(). So at add_link(),
         even if we unlink the last reference of the inode, its link count ends
         up being 1 and not 0.
      
      So this means that if eviction is triggered after link_to_fixup_dir() is
      called, at add_link() we will read the inode back from the subvolume tree
      and have it with a correct link count, matching the number of references
      it has on the subvolume tree. So if when we are at add_link() the inode
      has exactly one reference only, its link count is 1, and after the unlink
      its link count becomes 0.
      
      So fix this by using set_nlink() instead of inc_nlink(), as the former
      accepts a transition from 0 to 1 and it's what we use in other similar
      contexts (like at link_to_fixup_dir().
      
      Also make add_inode_ref() use set_nlink() instead of inc_nlink() to
      bump the link count from 0 to 1.
      
      The warning is actually harmless, but it may scare users. Josef also ran
      into it recently.
      
      CC: stable@vger.kernel.org # 5.1+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      769030e1
    • F
      btrfs: fix lost error handling when looking up extended ref on log replay · 7a6b75b7
      Filipe Manana 提交于
      During log replay, when processing inode references, if we get an error
      when looking up for an extended reference at __add_inode_ref(), we ignore
      it and proceed, returning success (0) if no other error happens after the
      lookup. This is obviously wrong because in case an extended reference
      exists and it encodes some name not in the log, we need to unlink it,
      otherwise the filesystem state will not match the state it had after the
      last fsync.
      
      So just make __add_inode_ref() return an error it gets from the extended
      reference lookup.
      
      Fixes: f186373f ("btrfs: extended inode refs")
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7a6b75b7
  3. 25 7月, 2022 3 次提交
    • F
      btrfs: join running log transaction when logging new name · 723df2bc
      Filipe Manana 提交于
      When logging a new name, in case of a rename, we pin the log before
      changing it. We then either delete a directory entry from the log or
      insert a key range item to mark the old name for deletion on log replay.
      
      However when doing one of those log changes we may have another task that
      started writing out the log (at btrfs_sync_log()) and it started before
      we pinned the log root. So we may end up changing a log tree while its
      writeback is being started by another task syncing the log. This can lead
      to inconsistencies in a log tree and other unexpected results during log
      replay, because we can get some committed node pointing to a node/leaf
      that ends up not getting written to disk before the next log commit.
      
      The problem, conceptually, started to happen in commit 88d2beec
      ("btrfs: avoid logging all directory changes during renames"), because
      there we started to update the log without joining its current transaction
      first.
      
      However the problem only became visible with commit 259c4b96
      ("btrfs: stop doing unnecessary log updates during a rename"), and that is
      because we used to pin the log at btrfs_rename() and then before entering
      btrfs_log_new_name(), when unlinking the old dentry, we ended up at
      btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(). Both
      of them join the current log transaction, effectively waiting for any log
      transaction writeout (due to acquiring the root's log_mutex). This made it
      safe even after leaving the current log transaction, because we remained
      with the log pinned when we called btrfs_log_new_name().
      
      Then in commit 259c4b96 ("btrfs: stop doing unnecessary log updates
      during a rename"), we removed the log pinning from btrfs_rename() and
      stopped calling btrfs_del_inode_ref_in_log() and
      btrfs_del_dir_entries_in_log() during the rename, and started to do all
      the needed work at btrfs_log_new_name(), but without joining the current
      log transaction, only pinning the log, which is racy because another task
      may have started writeout of the log tree right before we pinned the log.
      
      Both commits landed in kernel 5.18, so it doesn't make any practical
      difference which should be blamed, but I'm blaming the second commit only
      because with the first one, by chance, the problem did not happen due to
      the fact we joined the log transaction after pinning the log and unpinned
      it only after calling btrfs_log_new_name().
      
      So make btrfs_log_new_name() join the current log transaction instead of
      pinning it, so that we never do log updates if it's writeout is starting.
      
      Fixes: 259c4b96 ("btrfs: stop doing unnecessary log updates during a rename")
      CC: stable@vger.kernel.org # 5.18+
      Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Tested-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      723df2bc
    • J
      btrfs: tree-log: make the return value for log syncing consistent · f31f09f6
      Josef Bacik 提交于
      Currently we will return 1 or -EAGAIN if we decide we need to commit
      the transaction rather than sync the log.  In practice this doesn't
      really matter, we interpret any !0 and !BTRFS_NO_LOG_SYNC as needing to
      commit the transaction.  However this makes it hard to figure out what
      the correct thing to do is.
      
      Fix this up by defining BTRFS_LOG_FORCE_COMMIT and using this in all the
      places where we want to force the transaction to be committed.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f31f09f6
    • D
      btrfs: fix typos in comments · 143823cf
      David Sterba 提交于
      Codespell has found a few typos.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      143823cf
  4. 16 5月, 2022 2 次提交
  5. 06 5月, 2022 1 次提交
    • F
      btrfs: fix assertion failure when logging directory key range item · 750ee454
      Filipe Manana 提交于
      When inserting a key range item (BTRFS_DIR_LOG_INDEX_KEY) while logging
      a directory, we don't expect the insertion to fail with -EEXIST, because
      we are holding the directory's log_mutex and we have dropped all existing
      BTRFS_DIR_LOG_INDEX_KEY keys from the log tree before we started to log
      the directory. However it's possible that during the logging we attempt
      to insert the same BTRFS_DIR_LOG_INDEX_KEY key twice, but for this to
      happen we need to race with insertions of items from other inodes in the
      subvolume's tree while we are logging a directory. Here's how this can
      happen:
      
      1) We are logging a directory with inode number 1000 that has its items
         spread across 3 leaves in the subvolume's tree:
      
         leaf A - has index keys from the range 2 to 20 for example. The last
         item in the leaf corresponds to a dir item for index number 20. All
         these dir items were created in a past transaction.
      
         leaf B - has index keys from the range 22 to 100 for example. It has
         no keys from other inodes, all its keys are dir index keys for our
         directory inode number 1000. Its first key is for the dir item with
         a sequence number of 22. All these dir items were also created in a
         past transaction.
      
         leaf C - has index keys for our directory for the range 101 to 120 for
         example. This leaf also has items from other inodes, and its first
         item corresponds to the dir item for index number 101 for our directory
         with inode number 1000;
      
      2) When we finish processing the items from leaf A at log_dir_items(),
         we log a BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and a last
         offset of 21, meaning the log is authoritative for the index range
         from 21 to 21 (a single sequence number). At this point leaf B was
         not yet modified in the current transaction;
      
      3) When we return from log_dir_items() we have released our read lock on
         leaf B, and have set *last_offset_ret to 21 (index number of the first
         item on leaf B minus 1);
      
      4) Some other task inserts an item for other inode (inode number 1001 for
         example) into leaf C. That resulted in pushing some items from leaf C
         into leaf B, in order to make room for the new item, so now leaf B
         has dir index keys for the sequence number range from 22 to 102 and
         leaf C has the dir items for the sequence number range 103 to 120;
      
      5) At log_directory_changes() we call log_dir_items() again, passing it
         a 'min_offset' / 'min_key' value of 22 (*last_offset_ret from step 3
         plus 1, so 21 + 1). Then btrfs_search_forward() leaves us at slot 0
         of leaf B, since leaf B was modified in the current transaction.
      
         We have also initialized 'last_old_dentry_offset' to 20 after calling
         btrfs_previous_item() at log_dir_items(), as it left us at the last
         item of leaf A, which refers to the dir item with sequence number 20;
      
      6) We then call process_dir_items_leaf() to process the dir items of
         leaf B, and when we process the first item, corresponding to slot 0,
         sequence number 22, we notice the dir item was created in a past
         transaction and its sequence number is greater than the value of
         *last_old_dentry_offset + 1 (20 + 1), so we decide to log again a
         BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and an end range
         of 21 (key.offset - 1 == 22 - 1 == 21), which results in an -EEXIST
         error from insert_dir_log_key(), as we have already inserted that
         key at step 2, triggering the assertion at process_dir_items_leaf().
      
      The trace produced in dmesg is like the following:
      
      assertion failed: ret != -EEXIST, in fs/btrfs/tree-log.c:3857
      [198255.980839][ T7460] ------------[ cut here ]------------
      [198255.981666][ T7460] kernel BUG at fs/btrfs/ctree.h:3617!
      [198255.983141][ T7460] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
      [198255.984080][ T7460] CPU: 0 PID: 7460 Comm: repro-ghost-dir Not tainted 5.18.0-5314c78ac373-misc-next+
      [198255.986027][ T7460] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
      [198255.988600][ T7460] RIP: 0010:assertfail.constprop.0+0x1c/0x1e
      [198255.989465][ T7460] Code: 8b 4c 89 (...)
      [198255.992599][ T7460] RSP: 0018:ffffc90007387188 EFLAGS: 00010282
      [198255.993414][ T7460] RAX: 000000000000003d RBX: 0000000000000065 RCX: 0000000000000000
      [198255.996056][ T7460] RDX: 0000000000000001 RSI: ffffffff8b62b180 RDI: fffff52000e70e24
      [198255.997668][ T7460] RBP: ffffc90007387188 R08: 000000000000003d R09: ffff8881f0e16507
      [198255.999199][ T7460] R10: ffffed103e1c2ca0 R11: 0000000000000001 R12: 00000000ffffffef
      [198256.000683][ T7460] R13: ffff88813befc630 R14: ffff888116c16e70 R15: ffffc90007387358
      [198256.007082][ T7460] FS:  00007fc7f7c24640(0000) GS:ffff8881f0c00000(0000) knlGS:0000000000000000
      [198256.009939][ T7460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [198256.014133][ T7460] CR2: 0000560bb16d0b78 CR3: 0000000140b34005 CR4: 0000000000170ef0
      [198256.015239][ T7460] Call Trace:
      [198256.015674][ T7460]  <TASK>
      [198256.016313][ T7460]  log_dir_items.cold+0x16/0x2c
      [198256.018858][ T7460]  ? replay_one_extent+0xbf0/0xbf0
      [198256.025932][ T7460]  ? release_extent_buffer+0x1d2/0x270
      [198256.029658][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.031114][ T7460]  ? lock_acquired+0xbe/0x660
      [198256.032633][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.034386][ T7460]  ? lock_release+0xcf/0x8a0
      [198256.036152][ T7460]  log_directory_changes+0xf9/0x170
      [198256.036993][ T7460]  ? log_dir_items+0xba0/0xba0
      [198256.037661][ T7460]  ? do_raw_write_unlock+0x7d/0xe0
      [198256.038680][ T7460]  btrfs_log_inode+0x233b/0x26d0
      [198256.041294][ T7460]  ? log_directory_changes+0x170/0x170
      [198256.042864][ T7460]  ? btrfs_attach_transaction_barrier+0x60/0x60
      [198256.045130][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.046568][ T7460]  ? lock_release+0xcf/0x8a0
      [198256.047504][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.048712][ T7460]  ? ilookup5_nowait+0x81/0xa0
      [198256.049747][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.050652][ T7460]  ? do_raw_spin_unlock+0xa9/0x100
      [198256.051618][ T7460]  ? __might_resched+0x128/0x1c0
      [198256.052511][ T7460]  ? __might_sleep+0x66/0xc0
      [198256.053442][ T7460]  ? __kasan_check_read+0x11/0x20
      [198256.054251][ T7460]  ? iget5_locked+0xbd/0x150
      [198256.054986][ T7460]  ? run_delayed_iput_locked+0x110/0x110
      [198256.055929][ T7460]  ? btrfs_iget+0xc7/0x150
      [198256.056630][ T7460]  ? btrfs_orphan_cleanup+0x4a0/0x4a0
      [198256.057502][ T7460]  ? free_extent_buffer+0x13/0x20
      [198256.058322][ T7460]  btrfs_log_inode+0x2654/0x26d0
      [198256.059137][ T7460]  ? log_directory_changes+0x170/0x170
      [198256.060020][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.060930][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.061905][ T7460]  ? lock_contended+0x770/0x770
      [198256.062682][ T7460]  ? btrfs_log_inode_parent+0xd04/0x1750
      [198256.063582][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.064432][ T7460]  ? preempt_count_sub+0x18/0xc0
      [198256.065550][ T7460]  ? __mutex_lock+0x580/0xdc0
      [198256.066654][ T7460]  ? stack_trace_save+0x94/0xc0
      [198256.068008][ T7460]  ? __kasan_check_write+0x14/0x20
      [198256.072149][ T7460]  ? __mutex_unlock_slowpath+0x12a/0x430
      [198256.073145][ T7460]  ? mutex_lock_io_nested+0xcd0/0xcd0
      [198256.074341][ T7460]  ? wait_for_completion_io_timeout+0x20/0x20
      [198256.075345][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.076142][ T7460]  ? lock_contended+0x770/0x770
      [198256.076939][ T7460]  ? do_raw_spin_lock+0x1c0/0x1c0
      [198256.078401][ T7460]  ? btrfs_sync_file+0x5e6/0xa40
      [198256.080598][ T7460]  btrfs_log_inode_parent+0x523/0x1750
      [198256.081991][ T7460]  ? wait_current_trans+0xc8/0x240
      [198256.083320][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.085450][ T7460]  ? btrfs_end_log_trans+0x70/0x70
      [198256.086362][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.087544][ T7460]  ? lock_release+0xcf/0x8a0
      [198256.088305][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.090375][ T7460]  ? dget_parent+0x8e/0x300
      [198256.093538][ T7460]  ? do_raw_spin_lock+0x1c0/0x1c0
      [198256.094918][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.097815][ T7460]  ? do_raw_spin_unlock+0xa9/0x100
      [198256.101822][ T7460]  ? dget_parent+0xb7/0x300
      [198256.103345][ T7460]  btrfs_log_dentry_safe+0x48/0x60
      [198256.105052][ T7460]  btrfs_sync_file+0x629/0xa40
      [198256.106829][ T7460]  ? start_ordered_ops.constprop.0+0x120/0x120
      [198256.109655][ T7460]  ? __fget_files+0x161/0x230
      [198256.110760][ T7460]  vfs_fsync_range+0x6d/0x110
      [198256.111923][ T7460]  ? start_ordered_ops.constprop.0+0x120/0x120
      [198256.113556][ T7460]  __x64_sys_fsync+0x45/0x70
      [198256.114323][ T7460]  do_syscall_64+0x5c/0xc0
      [198256.115084][ T7460]  ? syscall_exit_to_user_mode+0x3b/0x50
      [198256.116030][ T7460]  ? do_syscall_64+0x69/0xc0
      [198256.116768][ T7460]  ? do_syscall_64+0x69/0xc0
      [198256.117555][ T7460]  ? do_syscall_64+0x69/0xc0
      [198256.118324][ T7460]  ? sysvec_call_function_single+0x57/0xc0
      [198256.119308][ T7460]  ? asm_sysvec_call_function_single+0xa/0x20
      [198256.120363][ T7460]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [198256.121334][ T7460] RIP: 0033:0x7fc7fe97b6ab
      [198256.122067][ T7460] Code: 0f 05 48 (...)
      [198256.125198][ T7460] RSP: 002b:00007fc7f7c23950 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
      [198256.126568][ T7460] RAX: ffffffffffffffda RBX: 00007fc7f7c239f0 RCX: 00007fc7fe97b6ab
      [198256.127942][ T7460] RDX: 0000000000000002 RSI: 000056167536bcf0 RDI: 0000000000000004
      [198256.129302][ T7460] RBP: 0000000000000004 R08: 0000000000000000 R09: 000000007ffffeb8
      [198256.130670][ T7460] R10: 00000000000001ff R11: 0000000000000293 R12: 0000000000000001
      [198256.132046][ T7460] R13: 0000561674ca8140 R14: 00007fc7f7c239d0 R15: 000056167536dab8
      [198256.133403][ T7460]  </TASK>
      
      Fix this by treating -EEXIST as expected at insert_dir_log_key() and have
      it update the item with an end offset corresponding to the maximum between
      the previously logged end offset and the new requested end offset. The end
      offsets may be different due to dir index key deletions that happened as
      part of unlink operations while we are logging a directory (triggered when
      fsyncing some other inode parented by the directory) or during renames
      which always attempt to log a single dir index deletion.
      Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Link: https://lore.kernel.org/linux-btrfs/YmyefE9mc2xl5ZMz@hungrycats.org/
      Fixes: 732d591a ("btrfs: stop copying old dir items when logging a directory")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      750ee454
  6. 28 4月, 2022 1 次提交
    • F
      btrfs: always log symlinks in full mode · d0e64a98
      Filipe Manana 提交于
      On Linux, empty symlinks are invalid, and attempting to create one with
      the system call symlink(2) results in an -ENOENT error and this is
      explicitly documented in the man page.
      
      If we rename a symlink that was created in the current transaction and its
      parent directory was logged before, we actually end up logging the symlink
      without logging its content, which is stored in an inline extent. That
      means that after a power failure we can end up with an empty symlink,
      having no content and an i_size of 0 bytes.
      
      It can be easily reproduced like this:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ mkdir /mnt/testdir
        $ sync
      
        # Create a file inside the directory and fsync the directory.
        $ touch /mnt/testdir/foo
        $ xfs_io -c "fsync" /mnt/testdir
      
        # Create a symlink inside the directory and then rename the symlink.
        $ ln -s /mnt/testdir/foo /mnt/testdir/bar
        $ mv /mnt/testdir/bar /mnt/testdir/baz
      
        # Now fsync again the directory, this persist the log tree.
        $ xfs_io -c "fsync" /mnt/testdir
      
        <power failure>
      
        $ mount /dev/sdc /mnt
        $ stat -c %s /mnt/testdir/baz
        0
        $ readlink /mnt/testdir/baz
        $
      
      Fix this by always logging symlinks in full mode (LOG_INODE_ALL), so that
      their content is also logged.
      
      A test case for fstests will follow.
      
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d0e64a98
  7. 19 4月, 2022 1 次提交
  8. 14 3月, 2022 16 次提交
    • F
      btrfs: add and use helper for unlinking inode during log replay · 313ab753
      Filipe Manana 提交于
      During log replay there is this pattern of running delayed items after
      every inode unlink. To avoid repeating this several times, move the
      logic into an helper function and use it instead of calling
      btrfs_unlink_inode() followed by btrfs_run_delayed_items().
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      313ab753
    • F
      btrfs: reset last_reflink_trans after fsyncing inode · 23e3337f
      Filipe Manana 提交于
      When an inode has a last_reflink_trans matching the current transaction,
      we have to take special care when logging its checksums in order to
      avoid getting checksum items with overlapping ranges in a log tree,
      which could result in missing checksums after log replay (more on that
      in the changelogs of commit 40e046ac ("Btrfs: fix missing data
      checksums after replaying a log tree") and commit e289f03e ("btrfs:
      fix corrupt log due to concurrent fsync of inodes with shared extents")).
      We also need to make sure a full fsync will copy all old file extent
      items it finds in modified leaves, because they might have been copied
      from some other inode.
      
      However once we fsync an inode, we don't need to keep paying the price of
      that extra special care in future fsyncs done in the same transaction,
      unless the inode is used for another reflink operation or the full sync
      flag is set on it (truncate, failure to allocate extent maps for holes,
      and other exceptional and infrequent cases).
      
      So after we fsync an inode reset its last_unlink_trans to zero. In case
      another reflink happens, we continue to update the last_reflink_trans of
      the inode, just as before. Also set last_reflink_trans to the generation
      of the last transaction that modified the inode whenever we need to set
      the full sync flag on the inode, just like when we need to load an inode
      from disk after eviction.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      23e3337f
    • F
      btrfs: voluntarily relinquish cpu when doing a full fsync · 96acb375
      Filipe Manana 提交于
      Doing a full fsync may require processing many leaves of metadata, which
      can take some time and result in a task monopolizing a cpu for too long.
      So add a cond_resched() after processing a leaf when doing a full fsync,
      while not holding any locks on any tree (a subvolume or a log tree).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      96acb375
    • F
      btrfs: hold on to less memory when logging checksums during full fsync · 5b7ce5e2
      Filipe Manana 提交于
      When doing a full fsync, at copy_items(), we iterate over all extents and
      then collect their checksums into a list. After copying all the extents to
      the log tree, we then log all the previously collected checksums.
      
      Before the previous patch in the series (subject "btrfs: stop copying old
      file extents when doing a full fsync"), we had to do it this way, because
      while we were iterating over the items in the leaf of the subvolume tree,
      we were holding a write lock on a leaf of the log tree, so logging the
      checksums for an extent right after we collected them could result in a
      deadlock, in case the checksum items ended up in the same leaf.
      
      However after the previous patch in the series we now do a first iteration
      over all the items in the leaf of the subvolume tree before locking a path
      in the log tree, so we can now log the checksums right after we have
      obtained them. This avoids holding in memory all checksums for all extents
      in the leaf while copying items from the source leaf to the log tree. The
      amount of memory used to hold all checksums of the extents in a leaf can
      be significant. For example if a leaf has 200 file extent items referring
      to 1M extents, using the default crc32c checksums, would result in using
      over 200K of memory (not accounting for the extra overhead of struct
      btrfs_ordered_sum), with smaller or less extents it would be less, but
      it could be much more with more extents per leaf and/or much larger
      extents.
      
      So change copy_items() to log the checksums for an extent after looking
      them up, and then free their memory, as they are no longer necessary.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5b7ce5e2
    • F
      btrfs: stop copying old file extents when doing a full fsync · 7f30c072
      Filipe Manana 提交于
      When logging an inode in full sync mode, we go over every leaf that was
      modified in the current transaction and has items associated to our inode,
      and then copy all those items into the log tree. This includes copying
      file extent items that were created and added to the inode in past
      transactions, which is useless and only makes use more leaf space in the
      log tree.
      
      It's common to have a file with many file extent items spanning many
      leaves where only a few file extent items are new and need to be logged,
      and in such case we log all the file extent items we find in the modified
      leaves.
      
      So change the full sync behaviour to skip over file extent items that are
      not needed. Those are the ones that match the following criteria:
      
      1) Have a generation older than the current transaction and the inode
         was not a target of a reflink operation, as that can copy file extent
         items from a past generation from some other inode into our inode, so
         we have to log them;
      
      2) Start at an offset within i_size - we must log anything at or beyond
         i_size, otherwise we would lose prealloc extents after log replay.
      
      The following script exercises a scenario where this happens, and it's
      somehow close enough to what happened often on a SQL Server workload which
      I had to debug sometime ago to fix an issue where a pattern of writes to
      prealloc extents and fsync resulted in fsync failing with -EIO (that was
      commit ea7036de ("btrfs: fix fsync failure and transaction abort
      after writes to prealloc extents")). In that particular case, we had large
      files that had random writes and were often truncated, which made the
      next fsync be a full sync.
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        MKFS_OPTIONS="-O no-holes -R free-space-tree"
        MOUNT_OPTIONS="-o ssd"
      
        FILE_SIZE=$((1 * 1024 * 1024 * 1024)) # 1G
        # FILE_SIZE=$((2 * 1024 * 1024 * 1024)) # 2G
        # FILE_SIZE=$((512 * 1024 * 1024)) # 512M
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create a file with many extents. Use direct IO to make it faster
        # to create the file - using buffered IO we would have to fsync
        # after each write (terribly slow).
        echo "Creating file with $((FILE_SIZE / 4096)) extents of 4K each..."
        xfs_io -f -d -c "pwrite -b 4K 0 $FILE_SIZE" $MNT/foobar
      
        # Commit the transaction, so every extent after this is from an
        # old generation.
        sync
      
        # Now rewrite only a few extents, which are all far spread apart from
        # each other (e.g. 1G / 32M = 32 extents).
        # After this only a few extents have a new generation, while all other
        # ones have an old generation.
        echo "Rewriting $((FILE_SIZE / (32 * 1024 * 1024))) extents..."
        for ((i = 0; i < $FILE_SIZE; i += $((32 * 1024 * 1024)))); do
            xfs_io -c "pwrite $i 4K" $MNT/foobar >/dev/null
        done
      
        # Fsync, the inode logged in full sync mode since it was never fsynced
        # before.
        echo "Fsyncing file..."
        xfs_io -c "fsync" $MNT/foobar
      
        umount $MNT
      
      And the following bpftrace program was running when executing the test
      script:
      
        $ cat bpf-script.sh
        #!/usr/bin/bpftrace
      
        k:btrfs_log_inode
        {
            @start_log_inode[tid] = nsecs;
        }
      
        kr:btrfs_log_inode
        /@start_log_inode[tid]/
        {
            @log_inode_dur[tid] = (nsecs - @start_log_inode[tid]) / 1000;
            delete(@start_log_inode[tid]);
        }
      
        k:btrfs_sync_log
        {
            @start_sync_log[tid] = nsecs;
        }
      
        kr:btrfs_sync_log
        /@start_sync_log[tid]/
        {
            $sync_log_dur = (nsecs - @start_sync_log[tid]) / 1000;
            printf("btrfs_log_inode() took %llu us\n", @log_inode_dur[tid]);
            printf("btrfs_sync_log()  took %llu us\n", $sync_log_dur);
            delete(@start_sync_log[tid]);
            delete(@log_inode_dur[tid]);
            exit();
        }
      
      With 512M test file, before this patch:
      
        btrfs_log_inode() took 15218 us
        btrfs_sync_log()  took 1328 us
      
        Log tree has 17 leaves and 1 node, its total size is 294912 bytes.
      
      With 512M test file, after this patch:
      
        btrfs_log_inode() took 14760 us
        btrfs_sync_log()  took 588 us
      
        Log tree has a single leaf, its total size is 16K.
      
      With 1G test file, before this patch:
      
        btrfs_log_inode() took 27301 us
        btrfs_sync_log()  took 1767 us
      
        Log tree has 33 leaves and 1 node, its total size is 557056 bytes.
      
      With 1G test file, after this patch:
      
        btrfs_log_inode() took 26166 us
        btrfs_sync_log()  took 593 us
      
        Log tree has a single leaf, its total size is 16K
      
      With 2G test file, before this patch:
      
        btrfs_log_inode() took 50892 us
        btrfs_sync_log()  took 3127 us
      
        Log tree has 65 leaves and 1 node, its total size is 1081344 bytes.
      
      With 2G test file, after this patch:
      
        btrfs_log_inode() took 50126 us
        btrfs_sync_log()  took 586 us
      
        Log tree has a single leaf, its total size is 16K.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7f30c072
    • F
      btrfs: prepare extents to be logged before locking a log tree path · e1f53ed8
      Filipe Manana 提交于
      When we want to log an extent, in the fast fsync path, we obtain a path
      to the leaf that will hold the file extent item either through a deletion
      search, via btrfs_drop_extents(), or through an insertion search using
      btrfs_insert_empty_item(). After that we fill the file extent item's
      fields one by one directly on the leaf.
      
      Instead of doing that, we could prepare the file extent item before
      obtaining a btree path, and then copy the prepared extent item with a
      single operation once we get the path. This helps avoid some contention
      on the log tree, since we are holding write locks for longer than
      necessary, especially in the case where the path is obtained via
      btrfs_drop_extents() through a deletion search, which always keeps a
      write lock on the nodes at levels 1 and 2 (besides the leaf).
      
      This change does that, we prepare the file extent item that is going to
      be inserted before acquiring a path, and then copy it into a leaf using
      a single copy operation once we get a path.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The following test was run to measure the impact of the whole patchset:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
        NUM_JOBS=8
        FILE_SIZE=128M
        RUN_TIME=200
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=randwrite
        fsync=1
        fallocate=none
        group_reporting=1
        direct=0
        bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
        ioengine=sync
        filesize=$FILE_SIZE
        runtime=$RUN_TIME
        time_based
        directory=$MNT
        numjobs=$NUM_JOBS
        thread
        EOF
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        echo
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        fio /tmp/fio-job.ini
      
        umount $MNT
      
      The test ran inside a VM (8 cores, 32G of RAM) with the target disk
      mapping to a raw NVMe device, and using a non-debug kernel config
      (Debian's default config).
      
      Before the patchset:
      
      WRITE: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=22.7GiB (24.4GB), run=200013-200013msec
      
      After the patchset:
      
      WRITE: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=24.3GiB (26.1GB), run=200007-200007msec
      
      A 7.8% gain on throughput and +7.0% more IO done in the same period of
      time (200 seconds).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e1f53ed8
    • F
      btrfs: remove useless path release in the fast fsync path · d8457531
      Filipe Manana 提交于
      There's no point in calling btrfs_release_path() after finishing the loop
      that logs the modified extents, since log_one_extent() returns with the
      path released. In case the list of extents is empty, the path is already
      released, so there's no need for that case as well.
      So just remove that unnecessary btrfs_release_path() call.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d8457531
    • F
      btrfs: use single variable to track return value at btrfs_log_inode() · 65faced5
      Filipe Manana 提交于
      At btrfs_log_inode(), we have two variables to track errors and the
      return value of the function, named 'ret' and 'err'. In some places we
      use 'ret' and if gets a non-zero value we assign its value to 'err'
      and then jump to the 'out' label, while in other places we use 'err'
      directly without 'ret' as an intermediary. This is inconsistent, error
      prone and not necessary. So change that to use only the 'ret' variable,
      making this consistent with most functions in btrfs.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      65faced5
    • F
      btrfs: avoid inode logging during rename and link when possible · 0f8ce498
      Filipe Manana 提交于
      During a rename or link operation, we need to determine if an inode was
      previously logged or not, and if it was, do some update to the logged
      inode. We used to rely exclusively on the logged_trans field of struct
      btrfs_inode to determine that, but that was not reliable because the
      value of that field is not persisted in the inode item, so it's lost
      when an inode is evicted and loaded back again. That led to several
      issues in the past, such as not persisting deletions (such as the case
      fixed by commit 803f0f64 ("Btrfs: fix fsync not persisting dentry
      deletions due to inode evictions")), or resulting in losing a file
      after an inode eviction followed by a rename (commit ecc64fab
      ("btrfs: fix lost inode on log replay after mix of fsync, rename and
      inode eviction")), besides other issues.
      
      So the inode_logged() helper was introduced and used to determine if an
      inode was possibly logged before in the current transaction, with the
      caveat that it could return false positives, in the sense that even if an
      inode was not logged before in the current transaction, it could still
      return true, but never to return false in case the inode was logged.
      >From a functional point of view that is fine, but from a performance
      perspective it can introduce significant latencies to rename and link
      operations, as they will end up doing inode logging even when it is not
      necessary.
      
      Recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package
      installations and upgrades, with the zypper tool, were often taking a
      long time to complete. With strace it could be observed that zypper was
      spending about 99% of its time on rename operations, and then with
      further analysis we checked that directory logging was happening too
      frequently. Taking into account that installation/upgrade of some of the
      packages needed a few thousand file renames, the slowdown was very
      noticeable for the user.
      
      The issue was caused indirectly due to an excessive number of inode
      evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or
      a 5.16-rc8 kernel. While triggering the inode evictions if something
      outside btrfs' control, btrfs could still behave better by eliminating
      the false positives from the inode_logged() helper.
      
      So change inode_logged() to actually eliminate such false positives caused
      by inode eviction and when an inode was never logged since the filesystem
      was mounted, as both cases relate to when the logged_trans field of struct
      btrfs_inode has a value of zero. When it can not determine if the inode
      was logged based only on the logged_trans value, lookup for the existence
      of the inode item in the log tree - if it's there then we known the inode
      was logged, if it's not there then it can not have been logged in the
      current transaction. Once we determine if the inode was logged, update
      the logged_trans value to avoid future calls to have to search in the log
      tree again.
      
      Alternatively, we could start storing logged_trans in the on disk inode
      item structure (struct btrfs_inode_item) in the unused space it still has,
      but that would be a bit odd because:
      
      1) We only care about logged_trans since the filesystem was mounted, we
         don't care about its value from a previous mount. Having it persisted
         in the inode item structure would not make the best use of the precious
         unused space;
      
      2) In order to get logged_trans persisted before inode eviction, we would
         have to update the delayed inode when we finish logging the inode and
         update its logged_trans in struct btrfs_inode, which makes it a bit
         cumbersome since we need to check if the delayed inode exists, if not
         create it and populate it and deal with any errors (-ENOMEM mostly).
      
      This change is part of a patchset comprised of the following patches:
      
        1/5 btrfs: add helper to delete a dir entry from a log tree
        2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
        3/5 btrfs: avoid logging all directory changes during renames
        4/5 btrfs: stop doing unnecessary log updates during a rename
        5/5 btrfs: avoid inode logging during rename and link when possible
      
      The following test script mimics part of what the zypper tool does during
      package installations/upgrades. It does not triggers inode evictions, but
      it's similar because it triggers false positives from the inode_logged()
      helper, because the inodes have a logged_trans of 0, there's a log tree
      due to a fsync of an unrelated file and the directory inode has its
      last_trans field set to the current transaction:
      
        $ cat test.sh
      
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=10000
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        # Now do some change to an unrelated file and fsync it.
        # This is just to create a log tree to make sure that inode_logged()
        # does not return false when called against "testdir".
        xfs_io -f -c "pwrite 0 4K" -c "fsync" $MNT/foo
      
        # Do some change to testdir. This is to make sure inode_logged()
        # will return true when called against "testdir", because its
        # logged_trans is 0, it was changed in the current transaction
        # and there's a log tree.
        echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
      
        echo "Renaming $NUM_FILES files..."
        start=$(date +%s%N)
        for ((i = 1; i <= $NUM_FILES; i++)); do
            mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
        done
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "Renames took $dur milliseconds"
      
        umount $MNT
      
      Testing this change on a box using a non-debug kernel (Debian's default
      kernel config) gave the following results:
      
      NUM_FILES=10000, before patchset:                   27837 ms
      NUM_FILES=10000, after patches 1/5 to 4/5 applied:   9236 ms (-66.8%)
      NUM_FILES=10000, after whole patchset applied:       8902 ms (-68.0%)
      
      NUM_FILES=5000, before patchset:                     9127 ms
      NUM_FILES=5000, after patches 1/5 to 4/5 applied:    4640 ms (-49.2%)
      NUM_FILES=5000, after whole patchset applied:        4441 ms (-51.3%)
      
      NUM_FILES=2000, before patchset:                     2528 ms
      NUM_FILES=2000, after patches 1/5 to 4/5 applied:    1983 ms (-21.6%)
      NUM_FILES=2000, after whole patchset applied:        1747 ms (-30.9%)
      
      NUM_FILES=1000, before patchset:                     1085 ms
      NUM_FILES=1000, after patches 1/5 to 4/5 applied:     893 ms (-17.7%)
      NUM_FILES=1000, after whole patchset applied:         867 ms (-20.1%)
      
      Running dbench on the same physical machine with the following script:
      
        $ cat run-dbench.sh
        #!/bin/bash
      
        NUM_JOBS=$(nproc --all)
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-O no-holes -R free-space-tree"
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 120 $NUM_JOBS
      
        umount $MNT
      
      Before patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    3761352     0.032   143.843
       Close        2762770     0.002     2.273
       Rename        159304     0.291    67.037
       Unlink        759784     0.207   143.998
       Deltree           72     4.028    15.977
       Mkdir             36     0.003     0.006
       Qpathinfo    3409780     0.013     9.678
       Qfileinfo     596772     0.001     0.878
       Qfsinfo       625189     0.003     1.245
       Sfileinfo     306443     0.006     1.840
       Find         1318106     0.063    19.798
       WriteX       1871137     0.021     8.532
       ReadX        5897325     0.003     3.567
       LockX          12252     0.003     0.258
       UnlockX        12252     0.002     0.100
       Flush         263666     3.327   155.632
      
      Throughput 980.047 MB/sec  12 clients  12 procs  max_latency=155.636 ms
      
      After whole patchset applied:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4195584     0.033   107.742
       Close        3081932     0.002     1.935
       Rename        177641     0.218    14.905
       Unlink        847333     0.166   107.822
       Deltree          118     5.315    15.247
       Mkdir             59     0.004     0.048
       Qpathinfo    3802612     0.014    10.302
       Qfileinfo     666748     0.001     1.034
       Qfsinfo       697329     0.003     0.944
       Sfileinfo     341712     0.006     2.099
       Find         1470365     0.065     9.359
       WriteX       2093921     0.021     8.087
       ReadX        6576234     0.003     3.407
       LockX          13660     0.003     0.308
       UnlockX        13660     0.002     0.114
       Flush         294090     2.906   115.539
      
      Throughput 1093.11 MB/sec  12 clients  12 procs  max_latency=115.544 ms
      
      +11.5% throughput    -25.8% max latency   rename max latency -77.8%
      
      Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0f8ce498
    • F
      btrfs: stop doing unnecessary log updates during a rename · 259c4b96
      Filipe Manana 提交于
      During a rename, we call __btrfs_unlink_inode(), which will call
      btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(), in order
      to remove an inode reference and a directory entry from the log. These
      are necessary when __btrfs_unlink_inode() is called from the unlink path,
      but not necessary when it's called from a rename context, because:
      
      1) For the btrfs_del_inode_ref_in_log() call, it's pointless to delete the
         inode reference related to the old name, because later in the rename
         path we call btrfs_log_new_name(), which will drop all inode references
         from the log and copy all inode references from the subvolume tree to
         the log tree. So we are doing one unnecessary btree operation which
         adds additional latency and lock contention in case there are other
         tasks accessing the log tree;
      
      2) For the btrfs_del_dir_entries_in_log() call, we are now doing the
         equivalent at btrfs_log_new_name() since the previous patch in the
         series, that has the subject "btrfs: avoid logging all directory
         changes during renames". In fact, having __btrfs_unlink_inode() call
         this function not only adds additional latency and lock contention due
         to the extra btree operation, but also can make btrfs_log_new_name()
         unnecessarily log a range item to track the deletion of the old name,
         since it has no way to known that the directory entry related to the
         old name was previously logged and already deleted by
         __btrfs_unlink_inode() through its call to
         btrfs_del_dir_entries_in_log().
      
      So skip those calls at __btrfs_unlink_inode() when we are doing a rename.
      Skipping them also allows us now to reduce the duration of time we are
      pinning a log transaction during renames, which is always beneficial as
      it's not delaying so much other tasks trying to sync the log tree, in
      particular we end up not holding the log transaction pinned while adding
      the new name (adding inode ref, directory entry, etc).
      
      This change is part of a patchset comprised of the following patches:
      
        1/5 btrfs: add helper to delete a dir entry from a log tree
        2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
        3/5 btrfs: avoid logging all directory changes during renames
        4/5 btrfs: stop doing unnecessary log updates during a rename
        5/5 btrfs: avoid inode logging during rename and link when possible
      
      Just like the previous patch in the series, "btrfs: avoid logging all
      directory changes during renames", the following script mimics part of
      what a package installation/upgrade with zypper does, which is basically
      renaming a lot of files, in some directory under /usr, to a name with a
      suffix of "-RPMDELETE":
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=10000
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        # Do some change to testdir and fsync it.
        echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
        xfs_io -c "fsync" $MNT/testdir
      
        echo "Renaming $NUM_FILES files..."
        start=$(date +%s%N)
        for ((i = 1; i <= $NUM_FILES; i++)); do
            mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
        done
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "Renames took $dur milliseconds"
      
        umount $MNT
      
      Testing this change on box a using a non-debug kernel (Debian's default
      kernel config) gave the following results:
      
      NUM_FILES=10000, before patchset:                   27399 ms
      NUM_FILES=10000, after patches 1/5 to 3/5 applied:   9093 ms (-66.8%)
      NUM_FILES=10000, after patches 1/5 to 4/5 applied:   9016 ms (-67.1%)
      
      NUM_FILES=5000, before patchset:                     9241 ms
      NUM_FILES=5000, after patches 1/5 to 3/5 applied:    4642 ms (-49.8%)
      NUM_FILES=5000, after patches 1/5 to 4/5 applied:    4553 ms (-50.7%)
      
      NUM_FILES=2000, before patchset:                     2550 ms
      NUM_FILES=2000, after patches 1/5 to 3/5 applied:    1788 ms (-29.9%)
      NUM_FILES=2000, after patches 1/5 to 4/5 applied:    1767 ms (-30.7%)
      
      NUM_FILES=1000, before patchset:                     1088 ms
      NUM_FILES=1000, after patches 1/5 to 3/5 applied:     905 ms (-16.9%)
      NUM_FILES=1000, after patches 1/5 to 4/5 applied:     883 ms (-18.8%)
      
      The next patch in the series (5/5), also contains dbench results after
      applying to whole patchset.
      
      Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      259c4b96
    • F
      btrfs: avoid logging all directory changes during renames · 88d2beec
      Filipe Manana 提交于
      When doing a rename of a file, if the file or its old parent directory
      were logged before, we log the new name of the file and then make sure
      we log the old parent directory, to ensure that after a log replay the
      old name of the file is deleted and the new name added.
      
      The logging of the old parent directory can take some time, because it
      will scan all leaves modified in the current transaction, check which
      directory entries were already logged, copy the ones that were not
      logged before, etc. In this rename context all we need to do is make
      sure that the old name of the file is deleted on log replay, so instead
      of triggering a directory log operation, we can just delete the old
      directory entry from the log if it's there, or in case it isn't there,
      just log a range item to signal log replay that the old name must be
      deleted. So change btrfs_log_new_name() to do that.
      
      This scenario is actually not uncommon to trigger, and recently on a
      5.15 kernel, an openSUSE Tumbleweed user reported package installations
      and upgrades, with the zypper tool, were often taking a long time to
      complete, much more than usual. With strace it could be observed that
      zypper was spending over 99% of its time on rename operations, and then
      with further analysis we checked that directory logging was happening
      too frequently and causing high latencies for the rename operations.
      Taking into account that installation/upgrade of some of these packages
      needed about a few thousand file renames, the slowdown was very noticeable
      for the user.
      
      The issue was caused indirectly due to an excessive number of inode
      evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14
      or a 5.16-rc8 kernel. After an inode eviction we can't tell for sure,
      in an efficient way, if an inode was previously logged in the current
      transaction, so we are pessimistic and assume it was, because in case
      it was we need to update the logged inode. More details on that in one
      of the patches in the same series (subject "btrfs: avoid inode logging
      during rename and link when possible"). Either way, in case the parent
      directory was logged before, we currently do more work then necessary
      during a rename, and this change minimizes that amount of work.
      
      The following script mimics part of what a package installation/upgrade
      with zypper does, which is basically renaming a lot of files, in some
      directory under /usr, to a name with a suffix of "-RPMDELETE":
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=10000
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        # Do some change to testdir and fsync it.
        echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
        xfs_io -c "fsync" $MNT/testdir
      
        echo "Renaming $NUM_FILES files..."
        start=$(date +%s%N)
        for ((i = 1; i <= $NUM_FILES; i++)); do
            mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
        done
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "Renames took $dur milliseconds"
      
        umount $MNT
      
      Testing this change on box using a non-debug kernel (Debian's default
      kernel config) gave the following results:
      
      NUM_FILES=10000, before this patch: 27399 ms
      NUM_FILES=10000, after this patch:   9093 ms (-66.8%)
      
      NUM_FILES=5000, before this patch:   9241 ms
      NUM_FILES=5000, after this patch:    4642 ms (-49.8%)
      
      NUM_FILES=2000, before this patch:   2550 ms
      NUM_FILES=2000, after this patch:    1788 ms (-29.9%)
      
      NUM_FILES=1000, before this patch:   1088 ms
      NUM_FILES=1000, after this patch:     905 ms (-16.9%)
      
      Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      88d2beec
    • F
      btrfs: pass the dentry to btrfs_log_new_name() instead of the inode · d5f5bd54
      Filipe Manana 提交于
      In the next patch in the series, there will be the need to access the old
      name, and its length, of an inode when logging the inode during a rename.
      So instead of passing the inode to btrfs_log_new_name() pass the dentry,
      because from the dentry we can get the inode, the name and its length.
      
      This will avoid passing 3 new parameters to btrfs_log_new_name() in the
      next patch - the name, its length and an index number. This way we end
      up passing only 1 new parameter, the index number.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d5f5bd54
    • F
      btrfs: add helper to delete a dir entry from a log tree · 839061fe
      Filipe Manana 提交于
      Move the code that finds and deletes a logged dir entry out of
      btrfs_del_dir_entries_in_log() into a helper function. This new helper
      function will be used by another patch in the same series, and serves
      to avoid having duplicated logic.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      839061fe
    • F
      btrfs: stop trying to log subdirectories created in past transactions · de6bc7f5
      Filipe Manana 提交于
      When logging a directory we are trying to log subdirectories that were
      changed in the current transaction and created in a past transaction.
      This type of behaviour was introduced by commit 2f2ff0ee ("Btrfs:
      fix metadata inconsistencies after directory fsync"), to fix some metadata
      inconsistencies that in the meanwhile no longer need this behaviour due to
      numerous other changes that happened throughout the years.
      
      This behaviour, besides not needed anymore, it's also undesirable because:
      
      1) It's not reliable because it's only triggered for the directories
         of dentries (dir items) that happen to be present on a leaf that
         was changed in the current transaction. If a dentry that points to
         a directory resides on a leaf that was not changed in the current
         transaction, then it's not logged, as at log_dir_items() and
         log_new_dir_dentries() we use btrfs_search_forward();
      
      2) It's not required by posix or any standard, it's undefined territory.
         The only way to guarantee a subdirectory is logged, it to explicitly
         fsync it;
      
      Making the behaviour guaranteed would require scanning all directory
      items, check which point to a directory, and then fsync each subdirectory
      which was modified in the current transaction. This could be very
      expensive for large directories with many subdirectories and/or large
      subdirectories.
      
      So remove that obsolete logic.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      de6bc7f5
    • F
      btrfs: stop copying old dir items when logging a directory · 732d591a
      Filipe Manana 提交于
      When logging a directory, we go over every leaf of the subvolume tree that
      was changed in the current transaction and copy all its dir index keys to
      the log tree.
      
      That includes copying dir index keys created in past transactions. This is
      done mostly for simplicity, as after logging the keys we log an item that
      specifies the start and end ranges of the keys we logged. That item is
      then used during log replay to figure out which keys need to be deleted -
      every key in that range that we find in the subvolume tree and is not in
      the log tree, needs to be deleted.
      
      Now that we log only dir index keys, and not dir item keys anymore, when
      we remove dentries from a directory (due to unlink and rename operations),
      we can get entire leaves that we changed only for deleting old dir index
      keys, or that have few dir index keys that are new - this is due to the
      fact that the offset for new index keys comes from a monotonically
      increasing counter.
      
      We can avoid logging dir index keys from past transactions, and in order
      to track the deletions, only log range items (BTRFS_DIR_LOG_INDEX_KEY key
      type) when we find gaps between consecutive index keys. This massively
      reduces the amount of logged metadata when we have deleted directory
      entries, even if it's a small percentage of the total number of entries.
      The reduction comes from both less items that are logged and instead of
      logging many dir index items (struct btrfs_dir_item), which have a size
      of 30 bytes plus a file name, we typically log just a few range items
      (struct btrfs_dir_log_item), which take only 8 bytes each.
      
      Even if no entries were deleted from a directory and only new entries
      were added, we typically still get a reduction on the amount of logged
      metadata, because it's very likely the first leaf that got the new
      dir index entries also has several old dir index entries.
      
      So change the logging logic to not log dir index keys created in past
      transactions and log a range item for every gap it finds between each
      pair of consecutive index keys, to ensure deletions are tracked and
      replayed on log replay.
      
      This patch is part of a patchset comprised of the following patches:
      
       1/4 btrfs: don't log unnecessary boundary keys when logging directory
       2/4 btrfs: put initial index value of a directory in a constant
       3/4 btrfs: stop copying old dir items when logging a directory
       4/4 btrfs: stop trying to log subdirectories created in past transactions
      
      The following test was run on a branch without this patchset and on a
      branch with the first three patches applied:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=1000000
        NUM_FILE_DELETES=10000
      
        MKFS_OPTIONS="-O no-holes -R free-space-tree"
        MOUNT_OPTIONS="-o ssd"
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        del_inc=$(( $NUM_FILES / $NUM_FILE_DELETES ))
        for ((i = 1; i <= $NUM_FILES; i += $del_inc)); do
            rm -f $MNT/testdir/file_$i
        done
      
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/testdir
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
        echo
      
        umount $MNT
      
      The test was run on a non-debug kernel (Debian's default kernel config),
      and the results were the following for various values of NUM_FILES and
      NUM_FILE_DELETES:
      
      ** before, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
      
      dir fsync took 585 ms after deleting 10000 files
      
      ** after, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
      
      dir fsync took 34 ms after deleting 10000 files   (-94.2%)
      
      ** before, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
      
      dir fsync took 50 ms after deleting 1000 files
      
      ** after, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
      
      dir fsync took 7 ms after deleting 1000 files    (-86.0%)
      
      ** before, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
      
      dir fsync took 9 ms after deleting 100 files
      
      ** after, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
      
      dir fsync took 5 ms after deleting 100 files     (-44.4%)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      732d591a
    • F
      btrfs: don't log unnecessary boundary keys when logging directory · a450a4af
      Filipe Manana 提交于
      Before we start to log dir index keys from a leaf, we check if there is a
      previous index key, which normally is at the end of a leaf that was not
      changed in the current transaction. Then we log that key and set the start
      of logged range (item of type BTRFS_DIR_LOG_INDEX_KEY) to the offset of
      that key. This is to ensure that if there were deleted index keys between
      that key and the first key we are going to log, those deletions are
      replayed in case we need to replay to the log after a power failure.
      However we really don't need to log that previous key, we can just set the
      start of the logged range to that key's offset plus 1. This achieves the
      same and avoids logging one dir index key.
      
      The same logic is performed when we finish logging the index keys of a
      leaf and we find that the next leaf has index keys and was not changed in
      the current transaction. We are logging the first key of that next leaf
      and use its offset as the end of range we log. This is just to ensure that
      if there were deleted index keys between the last index key we logged and
      the first key of that next leaf, those index keys are deleted if we end
      up replaying the log. However that is not necessary, we can avoid logging
      that first index key of the next leaf and instead set the end of the
      logged range to match the offset of that index key minus 1.
      
      So avoid logging those index keys at the boundaries and adjust the start
      and end offsets of the logged ranges as described above.
      
      This patch is part of a patchset comprised of the following patches:
      
        1/4 btrfs: don't log unnecessary boundary keys when logging directory
        2/4 btrfs: put initial index value of a directory in a constant
        3/4 btrfs: stop copying old dir items when logging a directory
        4/4 btrfs: stop trying to log subdirectories created in past transactions
      
      Performance test results are listed in the changelog of patch 3/4.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a450a4af