1. 17 8月, 2022 1 次提交
  2. 25 7月, 2022 3 次提交
    • F
      btrfs: join running log transaction when logging new name · 723df2bc
      Filipe Manana 提交于
      When logging a new name, in case of a rename, we pin the log before
      changing it. We then either delete a directory entry from the log or
      insert a key range item to mark the old name for deletion on log replay.
      
      However when doing one of those log changes we may have another task that
      started writing out the log (at btrfs_sync_log()) and it started before
      we pinned the log root. So we may end up changing a log tree while its
      writeback is being started by another task syncing the log. This can lead
      to inconsistencies in a log tree and other unexpected results during log
      replay, because we can get some committed node pointing to a node/leaf
      that ends up not getting written to disk before the next log commit.
      
      The problem, conceptually, started to happen in commit 88d2beec
      ("btrfs: avoid logging all directory changes during renames"), because
      there we started to update the log without joining its current transaction
      first.
      
      However the problem only became visible with commit 259c4b96
      ("btrfs: stop doing unnecessary log updates during a rename"), and that is
      because we used to pin the log at btrfs_rename() and then before entering
      btrfs_log_new_name(), when unlinking the old dentry, we ended up at
      btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(). Both
      of them join the current log transaction, effectively waiting for any log
      transaction writeout (due to acquiring the root's log_mutex). This made it
      safe even after leaving the current log transaction, because we remained
      with the log pinned when we called btrfs_log_new_name().
      
      Then in commit 259c4b96 ("btrfs: stop doing unnecessary log updates
      during a rename"), we removed the log pinning from btrfs_rename() and
      stopped calling btrfs_del_inode_ref_in_log() and
      btrfs_del_dir_entries_in_log() during the rename, and started to do all
      the needed work at btrfs_log_new_name(), but without joining the current
      log transaction, only pinning the log, which is racy because another task
      may have started writeout of the log tree right before we pinned the log.
      
      Both commits landed in kernel 5.18, so it doesn't make any practical
      difference which should be blamed, but I'm blaming the second commit only
      because with the first one, by chance, the problem did not happen due to
      the fact we joined the log transaction after pinning the log and unpinned
      it only after calling btrfs_log_new_name().
      
      So make btrfs_log_new_name() join the current log transaction instead of
      pinning it, so that we never do log updates if it's writeout is starting.
      
      Fixes: 259c4b96 ("btrfs: stop doing unnecessary log updates during a rename")
      CC: stable@vger.kernel.org # 5.18+
      Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Tested-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      723df2bc
    • J
      btrfs: tree-log: make the return value for log syncing consistent · f31f09f6
      Josef Bacik 提交于
      Currently we will return 1 or -EAGAIN if we decide we need to commit
      the transaction rather than sync the log.  In practice this doesn't
      really matter, we interpret any !0 and !BTRFS_NO_LOG_SYNC as needing to
      commit the transaction.  However this makes it hard to figure out what
      the correct thing to do is.
      
      Fix this up by defining BTRFS_LOG_FORCE_COMMIT and using this in all the
      places where we want to force the transaction to be committed.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f31f09f6
    • D
      btrfs: fix typos in comments · 143823cf
      David Sterba 提交于
      Codespell has found a few typos.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      143823cf
  3. 16 5月, 2022 2 次提交
  4. 06 5月, 2022 1 次提交
    • F
      btrfs: fix assertion failure when logging directory key range item · 750ee454
      Filipe Manana 提交于
      When inserting a key range item (BTRFS_DIR_LOG_INDEX_KEY) while logging
      a directory, we don't expect the insertion to fail with -EEXIST, because
      we are holding the directory's log_mutex and we have dropped all existing
      BTRFS_DIR_LOG_INDEX_KEY keys from the log tree before we started to log
      the directory. However it's possible that during the logging we attempt
      to insert the same BTRFS_DIR_LOG_INDEX_KEY key twice, but for this to
      happen we need to race with insertions of items from other inodes in the
      subvolume's tree while we are logging a directory. Here's how this can
      happen:
      
      1) We are logging a directory with inode number 1000 that has its items
         spread across 3 leaves in the subvolume's tree:
      
         leaf A - has index keys from the range 2 to 20 for example. The last
         item in the leaf corresponds to a dir item for index number 20. All
         these dir items were created in a past transaction.
      
         leaf B - has index keys from the range 22 to 100 for example. It has
         no keys from other inodes, all its keys are dir index keys for our
         directory inode number 1000. Its first key is for the dir item with
         a sequence number of 22. All these dir items were also created in a
         past transaction.
      
         leaf C - has index keys for our directory for the range 101 to 120 for
         example. This leaf also has items from other inodes, and its first
         item corresponds to the dir item for index number 101 for our directory
         with inode number 1000;
      
      2) When we finish processing the items from leaf A at log_dir_items(),
         we log a BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and a last
         offset of 21, meaning the log is authoritative for the index range
         from 21 to 21 (a single sequence number). At this point leaf B was
         not yet modified in the current transaction;
      
      3) When we return from log_dir_items() we have released our read lock on
         leaf B, and have set *last_offset_ret to 21 (index number of the first
         item on leaf B minus 1);
      
      4) Some other task inserts an item for other inode (inode number 1001 for
         example) into leaf C. That resulted in pushing some items from leaf C
         into leaf B, in order to make room for the new item, so now leaf B
         has dir index keys for the sequence number range from 22 to 102 and
         leaf C has the dir items for the sequence number range 103 to 120;
      
      5) At log_directory_changes() we call log_dir_items() again, passing it
         a 'min_offset' / 'min_key' value of 22 (*last_offset_ret from step 3
         plus 1, so 21 + 1). Then btrfs_search_forward() leaves us at slot 0
         of leaf B, since leaf B was modified in the current transaction.
      
         We have also initialized 'last_old_dentry_offset' to 20 after calling
         btrfs_previous_item() at log_dir_items(), as it left us at the last
         item of leaf A, which refers to the dir item with sequence number 20;
      
      6) We then call process_dir_items_leaf() to process the dir items of
         leaf B, and when we process the first item, corresponding to slot 0,
         sequence number 22, we notice the dir item was created in a past
         transaction and its sequence number is greater than the value of
         *last_old_dentry_offset + 1 (20 + 1), so we decide to log again a
         BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and an end range
         of 21 (key.offset - 1 == 22 - 1 == 21), which results in an -EEXIST
         error from insert_dir_log_key(), as we have already inserted that
         key at step 2, triggering the assertion at process_dir_items_leaf().
      
      The trace produced in dmesg is like the following:
      
      assertion failed: ret != -EEXIST, in fs/btrfs/tree-log.c:3857
      [198255.980839][ T7460] ------------[ cut here ]------------
      [198255.981666][ T7460] kernel BUG at fs/btrfs/ctree.h:3617!
      [198255.983141][ T7460] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
      [198255.984080][ T7460] CPU: 0 PID: 7460 Comm: repro-ghost-dir Not tainted 5.18.0-5314c78ac373-misc-next+
      [198255.986027][ T7460] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
      [198255.988600][ T7460] RIP: 0010:assertfail.constprop.0+0x1c/0x1e
      [198255.989465][ T7460] Code: 8b 4c 89 (...)
      [198255.992599][ T7460] RSP: 0018:ffffc90007387188 EFLAGS: 00010282
      [198255.993414][ T7460] RAX: 000000000000003d RBX: 0000000000000065 RCX: 0000000000000000
      [198255.996056][ T7460] RDX: 0000000000000001 RSI: ffffffff8b62b180 RDI: fffff52000e70e24
      [198255.997668][ T7460] RBP: ffffc90007387188 R08: 000000000000003d R09: ffff8881f0e16507
      [198255.999199][ T7460] R10: ffffed103e1c2ca0 R11: 0000000000000001 R12: 00000000ffffffef
      [198256.000683][ T7460] R13: ffff88813befc630 R14: ffff888116c16e70 R15: ffffc90007387358
      [198256.007082][ T7460] FS:  00007fc7f7c24640(0000) GS:ffff8881f0c00000(0000) knlGS:0000000000000000
      [198256.009939][ T7460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [198256.014133][ T7460] CR2: 0000560bb16d0b78 CR3: 0000000140b34005 CR4: 0000000000170ef0
      [198256.015239][ T7460] Call Trace:
      [198256.015674][ T7460]  <TASK>
      [198256.016313][ T7460]  log_dir_items.cold+0x16/0x2c
      [198256.018858][ T7460]  ? replay_one_extent+0xbf0/0xbf0
      [198256.025932][ T7460]  ? release_extent_buffer+0x1d2/0x270
      [198256.029658][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.031114][ T7460]  ? lock_acquired+0xbe/0x660
      [198256.032633][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.034386][ T7460]  ? lock_release+0xcf/0x8a0
      [198256.036152][ T7460]  log_directory_changes+0xf9/0x170
      [198256.036993][ T7460]  ? log_dir_items+0xba0/0xba0
      [198256.037661][ T7460]  ? do_raw_write_unlock+0x7d/0xe0
      [198256.038680][ T7460]  btrfs_log_inode+0x233b/0x26d0
      [198256.041294][ T7460]  ? log_directory_changes+0x170/0x170
      [198256.042864][ T7460]  ? btrfs_attach_transaction_barrier+0x60/0x60
      [198256.045130][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.046568][ T7460]  ? lock_release+0xcf/0x8a0
      [198256.047504][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.048712][ T7460]  ? ilookup5_nowait+0x81/0xa0
      [198256.049747][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.050652][ T7460]  ? do_raw_spin_unlock+0xa9/0x100
      [198256.051618][ T7460]  ? __might_resched+0x128/0x1c0
      [198256.052511][ T7460]  ? __might_sleep+0x66/0xc0
      [198256.053442][ T7460]  ? __kasan_check_read+0x11/0x20
      [198256.054251][ T7460]  ? iget5_locked+0xbd/0x150
      [198256.054986][ T7460]  ? run_delayed_iput_locked+0x110/0x110
      [198256.055929][ T7460]  ? btrfs_iget+0xc7/0x150
      [198256.056630][ T7460]  ? btrfs_orphan_cleanup+0x4a0/0x4a0
      [198256.057502][ T7460]  ? free_extent_buffer+0x13/0x20
      [198256.058322][ T7460]  btrfs_log_inode+0x2654/0x26d0
      [198256.059137][ T7460]  ? log_directory_changes+0x170/0x170
      [198256.060020][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.060930][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.061905][ T7460]  ? lock_contended+0x770/0x770
      [198256.062682][ T7460]  ? btrfs_log_inode_parent+0xd04/0x1750
      [198256.063582][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.064432][ T7460]  ? preempt_count_sub+0x18/0xc0
      [198256.065550][ T7460]  ? __mutex_lock+0x580/0xdc0
      [198256.066654][ T7460]  ? stack_trace_save+0x94/0xc0
      [198256.068008][ T7460]  ? __kasan_check_write+0x14/0x20
      [198256.072149][ T7460]  ? __mutex_unlock_slowpath+0x12a/0x430
      [198256.073145][ T7460]  ? mutex_lock_io_nested+0xcd0/0xcd0
      [198256.074341][ T7460]  ? wait_for_completion_io_timeout+0x20/0x20
      [198256.075345][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.076142][ T7460]  ? lock_contended+0x770/0x770
      [198256.076939][ T7460]  ? do_raw_spin_lock+0x1c0/0x1c0
      [198256.078401][ T7460]  ? btrfs_sync_file+0x5e6/0xa40
      [198256.080598][ T7460]  btrfs_log_inode_parent+0x523/0x1750
      [198256.081991][ T7460]  ? wait_current_trans+0xc8/0x240
      [198256.083320][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.085450][ T7460]  ? btrfs_end_log_trans+0x70/0x70
      [198256.086362][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.087544][ T7460]  ? lock_release+0xcf/0x8a0
      [198256.088305][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.090375][ T7460]  ? dget_parent+0x8e/0x300
      [198256.093538][ T7460]  ? do_raw_spin_lock+0x1c0/0x1c0
      [198256.094918][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.097815][ T7460]  ? do_raw_spin_unlock+0xa9/0x100
      [198256.101822][ T7460]  ? dget_parent+0xb7/0x300
      [198256.103345][ T7460]  btrfs_log_dentry_safe+0x48/0x60
      [198256.105052][ T7460]  btrfs_sync_file+0x629/0xa40
      [198256.106829][ T7460]  ? start_ordered_ops.constprop.0+0x120/0x120
      [198256.109655][ T7460]  ? __fget_files+0x161/0x230
      [198256.110760][ T7460]  vfs_fsync_range+0x6d/0x110
      [198256.111923][ T7460]  ? start_ordered_ops.constprop.0+0x120/0x120
      [198256.113556][ T7460]  __x64_sys_fsync+0x45/0x70
      [198256.114323][ T7460]  do_syscall_64+0x5c/0xc0
      [198256.115084][ T7460]  ? syscall_exit_to_user_mode+0x3b/0x50
      [198256.116030][ T7460]  ? do_syscall_64+0x69/0xc0
      [198256.116768][ T7460]  ? do_syscall_64+0x69/0xc0
      [198256.117555][ T7460]  ? do_syscall_64+0x69/0xc0
      [198256.118324][ T7460]  ? sysvec_call_function_single+0x57/0xc0
      [198256.119308][ T7460]  ? asm_sysvec_call_function_single+0xa/0x20
      [198256.120363][ T7460]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [198256.121334][ T7460] RIP: 0033:0x7fc7fe97b6ab
      [198256.122067][ T7460] Code: 0f 05 48 (...)
      [198256.125198][ T7460] RSP: 002b:00007fc7f7c23950 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
      [198256.126568][ T7460] RAX: ffffffffffffffda RBX: 00007fc7f7c239f0 RCX: 00007fc7fe97b6ab
      [198256.127942][ T7460] RDX: 0000000000000002 RSI: 000056167536bcf0 RDI: 0000000000000004
      [198256.129302][ T7460] RBP: 0000000000000004 R08: 0000000000000000 R09: 000000007ffffeb8
      [198256.130670][ T7460] R10: 00000000000001ff R11: 0000000000000293 R12: 0000000000000001
      [198256.132046][ T7460] R13: 0000561674ca8140 R14: 00007fc7f7c239d0 R15: 000056167536dab8
      [198256.133403][ T7460]  </TASK>
      
      Fix this by treating -EEXIST as expected at insert_dir_log_key() and have
      it update the item with an end offset corresponding to the maximum between
      the previously logged end offset and the new requested end offset. The end
      offsets may be different due to dir index key deletions that happened as
      part of unlink operations while we are logging a directory (triggered when
      fsyncing some other inode parented by the directory) or during renames
      which always attempt to log a single dir index deletion.
      Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Link: https://lore.kernel.org/linux-btrfs/YmyefE9mc2xl5ZMz@hungrycats.org/
      Fixes: 732d591a ("btrfs: stop copying old dir items when logging a directory")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      750ee454
  5. 28 4月, 2022 1 次提交
    • F
      btrfs: always log symlinks in full mode · d0e64a98
      Filipe Manana 提交于
      On Linux, empty symlinks are invalid, and attempting to create one with
      the system call symlink(2) results in an -ENOENT error and this is
      explicitly documented in the man page.
      
      If we rename a symlink that was created in the current transaction and its
      parent directory was logged before, we actually end up logging the symlink
      without logging its content, which is stored in an inline extent. That
      means that after a power failure we can end up with an empty symlink,
      having no content and an i_size of 0 bytes.
      
      It can be easily reproduced like this:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ mkdir /mnt/testdir
        $ sync
      
        # Create a file inside the directory and fsync the directory.
        $ touch /mnt/testdir/foo
        $ xfs_io -c "fsync" /mnt/testdir
      
        # Create a symlink inside the directory and then rename the symlink.
        $ ln -s /mnt/testdir/foo /mnt/testdir/bar
        $ mv /mnt/testdir/bar /mnt/testdir/baz
      
        # Now fsync again the directory, this persist the log tree.
        $ xfs_io -c "fsync" /mnt/testdir
      
        <power failure>
      
        $ mount /dev/sdc /mnt
        $ stat -c %s /mnt/testdir/baz
        0
        $ readlink /mnt/testdir/baz
        $
      
      Fix this by always logging symlinks in full mode (LOG_INODE_ALL), so that
      their content is also logged.
      
      A test case for fstests will follow.
      
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d0e64a98
  6. 19 4月, 2022 1 次提交
  7. 14 3月, 2022 17 次提交
    • F
      btrfs: add and use helper for unlinking inode during log replay · 313ab753
      Filipe Manana 提交于
      During log replay there is this pattern of running delayed items after
      every inode unlink. To avoid repeating this several times, move the
      logic into an helper function and use it instead of calling
      btrfs_unlink_inode() followed by btrfs_run_delayed_items().
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      313ab753
    • F
      btrfs: reset last_reflink_trans after fsyncing inode · 23e3337f
      Filipe Manana 提交于
      When an inode has a last_reflink_trans matching the current transaction,
      we have to take special care when logging its checksums in order to
      avoid getting checksum items with overlapping ranges in a log tree,
      which could result in missing checksums after log replay (more on that
      in the changelogs of commit 40e046ac ("Btrfs: fix missing data
      checksums after replaying a log tree") and commit e289f03e ("btrfs:
      fix corrupt log due to concurrent fsync of inodes with shared extents")).
      We also need to make sure a full fsync will copy all old file extent
      items it finds in modified leaves, because they might have been copied
      from some other inode.
      
      However once we fsync an inode, we don't need to keep paying the price of
      that extra special care in future fsyncs done in the same transaction,
      unless the inode is used for another reflink operation or the full sync
      flag is set on it (truncate, failure to allocate extent maps for holes,
      and other exceptional and infrequent cases).
      
      So after we fsync an inode reset its last_unlink_trans to zero. In case
      another reflink happens, we continue to update the last_reflink_trans of
      the inode, just as before. Also set last_reflink_trans to the generation
      of the last transaction that modified the inode whenever we need to set
      the full sync flag on the inode, just like when we need to load an inode
      from disk after eviction.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      23e3337f
    • F
      btrfs: voluntarily relinquish cpu when doing a full fsync · 96acb375
      Filipe Manana 提交于
      Doing a full fsync may require processing many leaves of metadata, which
      can take some time and result in a task monopolizing a cpu for too long.
      So add a cond_resched() after processing a leaf when doing a full fsync,
      while not holding any locks on any tree (a subvolume or a log tree).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      96acb375
    • F
      btrfs: hold on to less memory when logging checksums during full fsync · 5b7ce5e2
      Filipe Manana 提交于
      When doing a full fsync, at copy_items(), we iterate over all extents and
      then collect their checksums into a list. After copying all the extents to
      the log tree, we then log all the previously collected checksums.
      
      Before the previous patch in the series (subject "btrfs: stop copying old
      file extents when doing a full fsync"), we had to do it this way, because
      while we were iterating over the items in the leaf of the subvolume tree,
      we were holding a write lock on a leaf of the log tree, so logging the
      checksums for an extent right after we collected them could result in a
      deadlock, in case the checksum items ended up in the same leaf.
      
      However after the previous patch in the series we now do a first iteration
      over all the items in the leaf of the subvolume tree before locking a path
      in the log tree, so we can now log the checksums right after we have
      obtained them. This avoids holding in memory all checksums for all extents
      in the leaf while copying items from the source leaf to the log tree. The
      amount of memory used to hold all checksums of the extents in a leaf can
      be significant. For example if a leaf has 200 file extent items referring
      to 1M extents, using the default crc32c checksums, would result in using
      over 200K of memory (not accounting for the extra overhead of struct
      btrfs_ordered_sum), with smaller or less extents it would be less, but
      it could be much more with more extents per leaf and/or much larger
      extents.
      
      So change copy_items() to log the checksums for an extent after looking
      them up, and then free their memory, as they are no longer necessary.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5b7ce5e2
    • F
      btrfs: stop copying old file extents when doing a full fsync · 7f30c072
      Filipe Manana 提交于
      When logging an inode in full sync mode, we go over every leaf that was
      modified in the current transaction and has items associated to our inode,
      and then copy all those items into the log tree. This includes copying
      file extent items that were created and added to the inode in past
      transactions, which is useless and only makes use more leaf space in the
      log tree.
      
      It's common to have a file with many file extent items spanning many
      leaves where only a few file extent items are new and need to be logged,
      and in such case we log all the file extent items we find in the modified
      leaves.
      
      So change the full sync behaviour to skip over file extent items that are
      not needed. Those are the ones that match the following criteria:
      
      1) Have a generation older than the current transaction and the inode
         was not a target of a reflink operation, as that can copy file extent
         items from a past generation from some other inode into our inode, so
         we have to log them;
      
      2) Start at an offset within i_size - we must log anything at or beyond
         i_size, otherwise we would lose prealloc extents after log replay.
      
      The following script exercises a scenario where this happens, and it's
      somehow close enough to what happened often on a SQL Server workload which
      I had to debug sometime ago to fix an issue where a pattern of writes to
      prealloc extents and fsync resulted in fsync failing with -EIO (that was
      commit ea7036de ("btrfs: fix fsync failure and transaction abort
      after writes to prealloc extents")). In that particular case, we had large
      files that had random writes and were often truncated, which made the
      next fsync be a full sync.
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        MKFS_OPTIONS="-O no-holes -R free-space-tree"
        MOUNT_OPTIONS="-o ssd"
      
        FILE_SIZE=$((1 * 1024 * 1024 * 1024)) # 1G
        # FILE_SIZE=$((2 * 1024 * 1024 * 1024)) # 2G
        # FILE_SIZE=$((512 * 1024 * 1024)) # 512M
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create a file with many extents. Use direct IO to make it faster
        # to create the file - using buffered IO we would have to fsync
        # after each write (terribly slow).
        echo "Creating file with $((FILE_SIZE / 4096)) extents of 4K each..."
        xfs_io -f -d -c "pwrite -b 4K 0 $FILE_SIZE" $MNT/foobar
      
        # Commit the transaction, so every extent after this is from an
        # old generation.
        sync
      
        # Now rewrite only a few extents, which are all far spread apart from
        # each other (e.g. 1G / 32M = 32 extents).
        # After this only a few extents have a new generation, while all other
        # ones have an old generation.
        echo "Rewriting $((FILE_SIZE / (32 * 1024 * 1024))) extents..."
        for ((i = 0; i < $FILE_SIZE; i += $((32 * 1024 * 1024)))); do
            xfs_io -c "pwrite $i 4K" $MNT/foobar >/dev/null
        done
      
        # Fsync, the inode logged in full sync mode since it was never fsynced
        # before.
        echo "Fsyncing file..."
        xfs_io -c "fsync" $MNT/foobar
      
        umount $MNT
      
      And the following bpftrace program was running when executing the test
      script:
      
        $ cat bpf-script.sh
        #!/usr/bin/bpftrace
      
        k:btrfs_log_inode
        {
            @start_log_inode[tid] = nsecs;
        }
      
        kr:btrfs_log_inode
        /@start_log_inode[tid]/
        {
            @log_inode_dur[tid] = (nsecs - @start_log_inode[tid]) / 1000;
            delete(@start_log_inode[tid]);
        }
      
        k:btrfs_sync_log
        {
            @start_sync_log[tid] = nsecs;
        }
      
        kr:btrfs_sync_log
        /@start_sync_log[tid]/
        {
            $sync_log_dur = (nsecs - @start_sync_log[tid]) / 1000;
            printf("btrfs_log_inode() took %llu us\n", @log_inode_dur[tid]);
            printf("btrfs_sync_log()  took %llu us\n", $sync_log_dur);
            delete(@start_sync_log[tid]);
            delete(@log_inode_dur[tid]);
            exit();
        }
      
      With 512M test file, before this patch:
      
        btrfs_log_inode() took 15218 us
        btrfs_sync_log()  took 1328 us
      
        Log tree has 17 leaves and 1 node, its total size is 294912 bytes.
      
      With 512M test file, after this patch:
      
        btrfs_log_inode() took 14760 us
        btrfs_sync_log()  took 588 us
      
        Log tree has a single leaf, its total size is 16K.
      
      With 1G test file, before this patch:
      
        btrfs_log_inode() took 27301 us
        btrfs_sync_log()  took 1767 us
      
        Log tree has 33 leaves and 1 node, its total size is 557056 bytes.
      
      With 1G test file, after this patch:
      
        btrfs_log_inode() took 26166 us
        btrfs_sync_log()  took 593 us
      
        Log tree has a single leaf, its total size is 16K
      
      With 2G test file, before this patch:
      
        btrfs_log_inode() took 50892 us
        btrfs_sync_log()  took 3127 us
      
        Log tree has 65 leaves and 1 node, its total size is 1081344 bytes.
      
      With 2G test file, after this patch:
      
        btrfs_log_inode() took 50126 us
        btrfs_sync_log()  took 586 us
      
        Log tree has a single leaf, its total size is 16K.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7f30c072
    • F
      btrfs: prepare extents to be logged before locking a log tree path · e1f53ed8
      Filipe Manana 提交于
      When we want to log an extent, in the fast fsync path, we obtain a path
      to the leaf that will hold the file extent item either through a deletion
      search, via btrfs_drop_extents(), or through an insertion search using
      btrfs_insert_empty_item(). After that we fill the file extent item's
      fields one by one directly on the leaf.
      
      Instead of doing that, we could prepare the file extent item before
      obtaining a btree path, and then copy the prepared extent item with a
      single operation once we get the path. This helps avoid some contention
      on the log tree, since we are holding write locks for longer than
      necessary, especially in the case where the path is obtained via
      btrfs_drop_extents() through a deletion search, which always keeps a
      write lock on the nodes at levels 1 and 2 (besides the leaf).
      
      This change does that, we prepare the file extent item that is going to
      be inserted before acquiring a path, and then copy it into a leaf using
      a single copy operation once we get a path.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The following test was run to measure the impact of the whole patchset:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
        NUM_JOBS=8
        FILE_SIZE=128M
        RUN_TIME=200
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=randwrite
        fsync=1
        fallocate=none
        group_reporting=1
        direct=0
        bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
        ioengine=sync
        filesize=$FILE_SIZE
        runtime=$RUN_TIME
        time_based
        directory=$MNT
        numjobs=$NUM_JOBS
        thread
        EOF
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        echo
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        fio /tmp/fio-job.ini
      
        umount $MNT
      
      The test ran inside a VM (8 cores, 32G of RAM) with the target disk
      mapping to a raw NVMe device, and using a non-debug kernel config
      (Debian's default config).
      
      Before the patchset:
      
      WRITE: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=22.7GiB (24.4GB), run=200013-200013msec
      
      After the patchset:
      
      WRITE: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=24.3GiB (26.1GB), run=200007-200007msec
      
      A 7.8% gain on throughput and +7.0% more IO done in the same period of
      time (200 seconds).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e1f53ed8
    • F
      btrfs: remove useless path release in the fast fsync path · d8457531
      Filipe Manana 提交于
      There's no point in calling btrfs_release_path() after finishing the loop
      that logs the modified extents, since log_one_extent() returns with the
      path released. In case the list of extents is empty, the path is already
      released, so there's no need for that case as well.
      So just remove that unnecessary btrfs_release_path() call.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d8457531
    • F
      btrfs: use single variable to track return value at btrfs_log_inode() · 65faced5
      Filipe Manana 提交于
      At btrfs_log_inode(), we have two variables to track errors and the
      return value of the function, named 'ret' and 'err'. In some places we
      use 'ret' and if gets a non-zero value we assign its value to 'err'
      and then jump to the 'out' label, while in other places we use 'err'
      directly without 'ret' as an intermediary. This is inconsistent, error
      prone and not necessary. So change that to use only the 'ret' variable,
      making this consistent with most functions in btrfs.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      65faced5
    • F
      btrfs: avoid inode logging during rename and link when possible · 0f8ce498
      Filipe Manana 提交于
      During a rename or link operation, we need to determine if an inode was
      previously logged or not, and if it was, do some update to the logged
      inode. We used to rely exclusively on the logged_trans field of struct
      btrfs_inode to determine that, but that was not reliable because the
      value of that field is not persisted in the inode item, so it's lost
      when an inode is evicted and loaded back again. That led to several
      issues in the past, such as not persisting deletions (such as the case
      fixed by commit 803f0f64 ("Btrfs: fix fsync not persisting dentry
      deletions due to inode evictions")), or resulting in losing a file
      after an inode eviction followed by a rename (commit ecc64fab
      ("btrfs: fix lost inode on log replay after mix of fsync, rename and
      inode eviction")), besides other issues.
      
      So the inode_logged() helper was introduced and used to determine if an
      inode was possibly logged before in the current transaction, with the
      caveat that it could return false positives, in the sense that even if an
      inode was not logged before in the current transaction, it could still
      return true, but never to return false in case the inode was logged.
      >From a functional point of view that is fine, but from a performance
      perspective it can introduce significant latencies to rename and link
      operations, as they will end up doing inode logging even when it is not
      necessary.
      
      Recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package
      installations and upgrades, with the zypper tool, were often taking a
      long time to complete. With strace it could be observed that zypper was
      spending about 99% of its time on rename operations, and then with
      further analysis we checked that directory logging was happening too
      frequently. Taking into account that installation/upgrade of some of the
      packages needed a few thousand file renames, the slowdown was very
      noticeable for the user.
      
      The issue was caused indirectly due to an excessive number of inode
      evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or
      a 5.16-rc8 kernel. While triggering the inode evictions if something
      outside btrfs' control, btrfs could still behave better by eliminating
      the false positives from the inode_logged() helper.
      
      So change inode_logged() to actually eliminate such false positives caused
      by inode eviction and when an inode was never logged since the filesystem
      was mounted, as both cases relate to when the logged_trans field of struct
      btrfs_inode has a value of zero. When it can not determine if the inode
      was logged based only on the logged_trans value, lookup for the existence
      of the inode item in the log tree - if it's there then we known the inode
      was logged, if it's not there then it can not have been logged in the
      current transaction. Once we determine if the inode was logged, update
      the logged_trans value to avoid future calls to have to search in the log
      tree again.
      
      Alternatively, we could start storing logged_trans in the on disk inode
      item structure (struct btrfs_inode_item) in the unused space it still has,
      but that would be a bit odd because:
      
      1) We only care about logged_trans since the filesystem was mounted, we
         don't care about its value from a previous mount. Having it persisted
         in the inode item structure would not make the best use of the precious
         unused space;
      
      2) In order to get logged_trans persisted before inode eviction, we would
         have to update the delayed inode when we finish logging the inode and
         update its logged_trans in struct btrfs_inode, which makes it a bit
         cumbersome since we need to check if the delayed inode exists, if not
         create it and populate it and deal with any errors (-ENOMEM mostly).
      
      This change is part of a patchset comprised of the following patches:
      
        1/5 btrfs: add helper to delete a dir entry from a log tree
        2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
        3/5 btrfs: avoid logging all directory changes during renames
        4/5 btrfs: stop doing unnecessary log updates during a rename
        5/5 btrfs: avoid inode logging during rename and link when possible
      
      The following test script mimics part of what the zypper tool does during
      package installations/upgrades. It does not triggers inode evictions, but
      it's similar because it triggers false positives from the inode_logged()
      helper, because the inodes have a logged_trans of 0, there's a log tree
      due to a fsync of an unrelated file and the directory inode has its
      last_trans field set to the current transaction:
      
        $ cat test.sh
      
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=10000
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        # Now do some change to an unrelated file and fsync it.
        # This is just to create a log tree to make sure that inode_logged()
        # does not return false when called against "testdir".
        xfs_io -f -c "pwrite 0 4K" -c "fsync" $MNT/foo
      
        # Do some change to testdir. This is to make sure inode_logged()
        # will return true when called against "testdir", because its
        # logged_trans is 0, it was changed in the current transaction
        # and there's a log tree.
        echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
      
        echo "Renaming $NUM_FILES files..."
        start=$(date +%s%N)
        for ((i = 1; i <= $NUM_FILES; i++)); do
            mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
        done
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "Renames took $dur milliseconds"
      
        umount $MNT
      
      Testing this change on a box using a non-debug kernel (Debian's default
      kernel config) gave the following results:
      
      NUM_FILES=10000, before patchset:                   27837 ms
      NUM_FILES=10000, after patches 1/5 to 4/5 applied:   9236 ms (-66.8%)
      NUM_FILES=10000, after whole patchset applied:       8902 ms (-68.0%)
      
      NUM_FILES=5000, before patchset:                     9127 ms
      NUM_FILES=5000, after patches 1/5 to 4/5 applied:    4640 ms (-49.2%)
      NUM_FILES=5000, after whole patchset applied:        4441 ms (-51.3%)
      
      NUM_FILES=2000, before patchset:                     2528 ms
      NUM_FILES=2000, after patches 1/5 to 4/5 applied:    1983 ms (-21.6%)
      NUM_FILES=2000, after whole patchset applied:        1747 ms (-30.9%)
      
      NUM_FILES=1000, before patchset:                     1085 ms
      NUM_FILES=1000, after patches 1/5 to 4/5 applied:     893 ms (-17.7%)
      NUM_FILES=1000, after whole patchset applied:         867 ms (-20.1%)
      
      Running dbench on the same physical machine with the following script:
      
        $ cat run-dbench.sh
        #!/bin/bash
      
        NUM_JOBS=$(nproc --all)
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-O no-holes -R free-space-tree"
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 120 $NUM_JOBS
      
        umount $MNT
      
      Before patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    3761352     0.032   143.843
       Close        2762770     0.002     2.273
       Rename        159304     0.291    67.037
       Unlink        759784     0.207   143.998
       Deltree           72     4.028    15.977
       Mkdir             36     0.003     0.006
       Qpathinfo    3409780     0.013     9.678
       Qfileinfo     596772     0.001     0.878
       Qfsinfo       625189     0.003     1.245
       Sfileinfo     306443     0.006     1.840
       Find         1318106     0.063    19.798
       WriteX       1871137     0.021     8.532
       ReadX        5897325     0.003     3.567
       LockX          12252     0.003     0.258
       UnlockX        12252     0.002     0.100
       Flush         263666     3.327   155.632
      
      Throughput 980.047 MB/sec  12 clients  12 procs  max_latency=155.636 ms
      
      After whole patchset applied:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4195584     0.033   107.742
       Close        3081932     0.002     1.935
       Rename        177641     0.218    14.905
       Unlink        847333     0.166   107.822
       Deltree          118     5.315    15.247
       Mkdir             59     0.004     0.048
       Qpathinfo    3802612     0.014    10.302
       Qfileinfo     666748     0.001     1.034
       Qfsinfo       697329     0.003     0.944
       Sfileinfo     341712     0.006     2.099
       Find         1470365     0.065     9.359
       WriteX       2093921     0.021     8.087
       ReadX        6576234     0.003     3.407
       LockX          13660     0.003     0.308
       UnlockX        13660     0.002     0.114
       Flush         294090     2.906   115.539
      
      Throughput 1093.11 MB/sec  12 clients  12 procs  max_latency=115.544 ms
      
      +11.5% throughput    -25.8% max latency   rename max latency -77.8%
      
      Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0f8ce498
    • F
      btrfs: stop doing unnecessary log updates during a rename · 259c4b96
      Filipe Manana 提交于
      During a rename, we call __btrfs_unlink_inode(), which will call
      btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(), in order
      to remove an inode reference and a directory entry from the log. These
      are necessary when __btrfs_unlink_inode() is called from the unlink path,
      but not necessary when it's called from a rename context, because:
      
      1) For the btrfs_del_inode_ref_in_log() call, it's pointless to delete the
         inode reference related to the old name, because later in the rename
         path we call btrfs_log_new_name(), which will drop all inode references
         from the log and copy all inode references from the subvolume tree to
         the log tree. So we are doing one unnecessary btree operation which
         adds additional latency and lock contention in case there are other
         tasks accessing the log tree;
      
      2) For the btrfs_del_dir_entries_in_log() call, we are now doing the
         equivalent at btrfs_log_new_name() since the previous patch in the
         series, that has the subject "btrfs: avoid logging all directory
         changes during renames". In fact, having __btrfs_unlink_inode() call
         this function not only adds additional latency and lock contention due
         to the extra btree operation, but also can make btrfs_log_new_name()
         unnecessarily log a range item to track the deletion of the old name,
         since it has no way to known that the directory entry related to the
         old name was previously logged and already deleted by
         __btrfs_unlink_inode() through its call to
         btrfs_del_dir_entries_in_log().
      
      So skip those calls at __btrfs_unlink_inode() when we are doing a rename.
      Skipping them also allows us now to reduce the duration of time we are
      pinning a log transaction during renames, which is always beneficial as
      it's not delaying so much other tasks trying to sync the log tree, in
      particular we end up not holding the log transaction pinned while adding
      the new name (adding inode ref, directory entry, etc).
      
      This change is part of a patchset comprised of the following patches:
      
        1/5 btrfs: add helper to delete a dir entry from a log tree
        2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
        3/5 btrfs: avoid logging all directory changes during renames
        4/5 btrfs: stop doing unnecessary log updates during a rename
        5/5 btrfs: avoid inode logging during rename and link when possible
      
      Just like the previous patch in the series, "btrfs: avoid logging all
      directory changes during renames", the following script mimics part of
      what a package installation/upgrade with zypper does, which is basically
      renaming a lot of files, in some directory under /usr, to a name with a
      suffix of "-RPMDELETE":
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=10000
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        # Do some change to testdir and fsync it.
        echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
        xfs_io -c "fsync" $MNT/testdir
      
        echo "Renaming $NUM_FILES files..."
        start=$(date +%s%N)
        for ((i = 1; i <= $NUM_FILES; i++)); do
            mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
        done
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "Renames took $dur milliseconds"
      
        umount $MNT
      
      Testing this change on box a using a non-debug kernel (Debian's default
      kernel config) gave the following results:
      
      NUM_FILES=10000, before patchset:                   27399 ms
      NUM_FILES=10000, after patches 1/5 to 3/5 applied:   9093 ms (-66.8%)
      NUM_FILES=10000, after patches 1/5 to 4/5 applied:   9016 ms (-67.1%)
      
      NUM_FILES=5000, before patchset:                     9241 ms
      NUM_FILES=5000, after patches 1/5 to 3/5 applied:    4642 ms (-49.8%)
      NUM_FILES=5000, after patches 1/5 to 4/5 applied:    4553 ms (-50.7%)
      
      NUM_FILES=2000, before patchset:                     2550 ms
      NUM_FILES=2000, after patches 1/5 to 3/5 applied:    1788 ms (-29.9%)
      NUM_FILES=2000, after patches 1/5 to 4/5 applied:    1767 ms (-30.7%)
      
      NUM_FILES=1000, before patchset:                     1088 ms
      NUM_FILES=1000, after patches 1/5 to 3/5 applied:     905 ms (-16.9%)
      NUM_FILES=1000, after patches 1/5 to 4/5 applied:     883 ms (-18.8%)
      
      The next patch in the series (5/5), also contains dbench results after
      applying to whole patchset.
      
      Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      259c4b96
    • F
      btrfs: avoid logging all directory changes during renames · 88d2beec
      Filipe Manana 提交于
      When doing a rename of a file, if the file or its old parent directory
      were logged before, we log the new name of the file and then make sure
      we log the old parent directory, to ensure that after a log replay the
      old name of the file is deleted and the new name added.
      
      The logging of the old parent directory can take some time, because it
      will scan all leaves modified in the current transaction, check which
      directory entries were already logged, copy the ones that were not
      logged before, etc. In this rename context all we need to do is make
      sure that the old name of the file is deleted on log replay, so instead
      of triggering a directory log operation, we can just delete the old
      directory entry from the log if it's there, or in case it isn't there,
      just log a range item to signal log replay that the old name must be
      deleted. So change btrfs_log_new_name() to do that.
      
      This scenario is actually not uncommon to trigger, and recently on a
      5.15 kernel, an openSUSE Tumbleweed user reported package installations
      and upgrades, with the zypper tool, were often taking a long time to
      complete, much more than usual. With strace it could be observed that
      zypper was spending over 99% of its time on rename operations, and then
      with further analysis we checked that directory logging was happening
      too frequently and causing high latencies for the rename operations.
      Taking into account that installation/upgrade of some of these packages
      needed about a few thousand file renames, the slowdown was very noticeable
      for the user.
      
      The issue was caused indirectly due to an excessive number of inode
      evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14
      or a 5.16-rc8 kernel. After an inode eviction we can't tell for sure,
      in an efficient way, if an inode was previously logged in the current
      transaction, so we are pessimistic and assume it was, because in case
      it was we need to update the logged inode. More details on that in one
      of the patches in the same series (subject "btrfs: avoid inode logging
      during rename and link when possible"). Either way, in case the parent
      directory was logged before, we currently do more work then necessary
      during a rename, and this change minimizes that amount of work.
      
      The following script mimics part of what a package installation/upgrade
      with zypper does, which is basically renaming a lot of files, in some
      directory under /usr, to a name with a suffix of "-RPMDELETE":
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=10000
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        # Do some change to testdir and fsync it.
        echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
        xfs_io -c "fsync" $MNT/testdir
      
        echo "Renaming $NUM_FILES files..."
        start=$(date +%s%N)
        for ((i = 1; i <= $NUM_FILES; i++)); do
            mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
        done
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "Renames took $dur milliseconds"
      
        umount $MNT
      
      Testing this change on box using a non-debug kernel (Debian's default
      kernel config) gave the following results:
      
      NUM_FILES=10000, before this patch: 27399 ms
      NUM_FILES=10000, after this patch:   9093 ms (-66.8%)
      
      NUM_FILES=5000, before this patch:   9241 ms
      NUM_FILES=5000, after this patch:    4642 ms (-49.8%)
      
      NUM_FILES=2000, before this patch:   2550 ms
      NUM_FILES=2000, after this patch:    1788 ms (-29.9%)
      
      NUM_FILES=1000, before this patch:   1088 ms
      NUM_FILES=1000, after this patch:     905 ms (-16.9%)
      
      Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      88d2beec
    • F
      btrfs: pass the dentry to btrfs_log_new_name() instead of the inode · d5f5bd54
      Filipe Manana 提交于
      In the next patch in the series, there will be the need to access the old
      name, and its length, of an inode when logging the inode during a rename.
      So instead of passing the inode to btrfs_log_new_name() pass the dentry,
      because from the dentry we can get the inode, the name and its length.
      
      This will avoid passing 3 new parameters to btrfs_log_new_name() in the
      next patch - the name, its length and an index number. This way we end
      up passing only 1 new parameter, the index number.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d5f5bd54
    • F
      btrfs: add helper to delete a dir entry from a log tree · 839061fe
      Filipe Manana 提交于
      Move the code that finds and deletes a logged dir entry out of
      btrfs_del_dir_entries_in_log() into a helper function. This new helper
      function will be used by another patch in the same series, and serves
      to avoid having duplicated logic.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      839061fe
    • F
      btrfs: stop trying to log subdirectories created in past transactions · de6bc7f5
      Filipe Manana 提交于
      When logging a directory we are trying to log subdirectories that were
      changed in the current transaction and created in a past transaction.
      This type of behaviour was introduced by commit 2f2ff0ee ("Btrfs:
      fix metadata inconsistencies after directory fsync"), to fix some metadata
      inconsistencies that in the meanwhile no longer need this behaviour due to
      numerous other changes that happened throughout the years.
      
      This behaviour, besides not needed anymore, it's also undesirable because:
      
      1) It's not reliable because it's only triggered for the directories
         of dentries (dir items) that happen to be present on a leaf that
         was changed in the current transaction. If a dentry that points to
         a directory resides on a leaf that was not changed in the current
         transaction, then it's not logged, as at log_dir_items() and
         log_new_dir_dentries() we use btrfs_search_forward();
      
      2) It's not required by posix or any standard, it's undefined territory.
         The only way to guarantee a subdirectory is logged, it to explicitly
         fsync it;
      
      Making the behaviour guaranteed would require scanning all directory
      items, check which point to a directory, and then fsync each subdirectory
      which was modified in the current transaction. This could be very
      expensive for large directories with many subdirectories and/or large
      subdirectories.
      
      So remove that obsolete logic.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      de6bc7f5
    • F
      btrfs: stop copying old dir items when logging a directory · 732d591a
      Filipe Manana 提交于
      When logging a directory, we go over every leaf of the subvolume tree that
      was changed in the current transaction and copy all its dir index keys to
      the log tree.
      
      That includes copying dir index keys created in past transactions. This is
      done mostly for simplicity, as after logging the keys we log an item that
      specifies the start and end ranges of the keys we logged. That item is
      then used during log replay to figure out which keys need to be deleted -
      every key in that range that we find in the subvolume tree and is not in
      the log tree, needs to be deleted.
      
      Now that we log only dir index keys, and not dir item keys anymore, when
      we remove dentries from a directory (due to unlink and rename operations),
      we can get entire leaves that we changed only for deleting old dir index
      keys, or that have few dir index keys that are new - this is due to the
      fact that the offset for new index keys comes from a monotonically
      increasing counter.
      
      We can avoid logging dir index keys from past transactions, and in order
      to track the deletions, only log range items (BTRFS_DIR_LOG_INDEX_KEY key
      type) when we find gaps between consecutive index keys. This massively
      reduces the amount of logged metadata when we have deleted directory
      entries, even if it's a small percentage of the total number of entries.
      The reduction comes from both less items that are logged and instead of
      logging many dir index items (struct btrfs_dir_item), which have a size
      of 30 bytes plus a file name, we typically log just a few range items
      (struct btrfs_dir_log_item), which take only 8 bytes each.
      
      Even if no entries were deleted from a directory and only new entries
      were added, we typically still get a reduction on the amount of logged
      metadata, because it's very likely the first leaf that got the new
      dir index entries also has several old dir index entries.
      
      So change the logging logic to not log dir index keys created in past
      transactions and log a range item for every gap it finds between each
      pair of consecutive index keys, to ensure deletions are tracked and
      replayed on log replay.
      
      This patch is part of a patchset comprised of the following patches:
      
       1/4 btrfs: don't log unnecessary boundary keys when logging directory
       2/4 btrfs: put initial index value of a directory in a constant
       3/4 btrfs: stop copying old dir items when logging a directory
       4/4 btrfs: stop trying to log subdirectories created in past transactions
      
      The following test was run on a branch without this patchset and on a
      branch with the first three patches applied:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=1000000
        NUM_FILE_DELETES=10000
      
        MKFS_OPTIONS="-O no-holes -R free-space-tree"
        MOUNT_OPTIONS="-o ssd"
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        del_inc=$(( $NUM_FILES / $NUM_FILE_DELETES ))
        for ((i = 1; i <= $NUM_FILES; i += $del_inc)); do
            rm -f $MNT/testdir/file_$i
        done
      
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/testdir
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
        echo
      
        umount $MNT
      
      The test was run on a non-debug kernel (Debian's default kernel config),
      and the results were the following for various values of NUM_FILES and
      NUM_FILE_DELETES:
      
      ** before, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
      
      dir fsync took 585 ms after deleting 10000 files
      
      ** after, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
      
      dir fsync took 34 ms after deleting 10000 files   (-94.2%)
      
      ** before, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
      
      dir fsync took 50 ms after deleting 1000 files
      
      ** after, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
      
      dir fsync took 7 ms after deleting 1000 files    (-86.0%)
      
      ** before, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
      
      dir fsync took 9 ms after deleting 100 files
      
      ** after, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
      
      dir fsync took 5 ms after deleting 100 files     (-44.4%)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      732d591a
    • F
      btrfs: don't log unnecessary boundary keys when logging directory · a450a4af
      Filipe Manana 提交于
      Before we start to log dir index keys from a leaf, we check if there is a
      previous index key, which normally is at the end of a leaf that was not
      changed in the current transaction. Then we log that key and set the start
      of logged range (item of type BTRFS_DIR_LOG_INDEX_KEY) to the offset of
      that key. This is to ensure that if there were deleted index keys between
      that key and the first key we are going to log, those deletions are
      replayed in case we need to replay to the log after a power failure.
      However we really don't need to log that previous key, we can just set the
      start of the logged range to that key's offset plus 1. This achieves the
      same and avoids logging one dir index key.
      
      The same logic is performed when we finish logging the index keys of a
      leaf and we find that the next leaf has index keys and was not changed in
      the current transaction. We are logging the first key of that next leaf
      and use its offset as the end of range we log. This is just to ensure that
      if there were deleted index keys between the last index key we logged and
      the first key of that next leaf, those index keys are deleted if we end
      up replaying the log. However that is not necessary, we can avoid logging
      that first index key of the next leaf and instead set the end of the
      logged range to match the offset of that index key minus 1.
      
      So avoid logging those index keys at the boundaries and adjust the start
      and end offsets of the logged ranges as described above.
      
      This patch is part of a patchset comprised of the following patches:
      
        1/4 btrfs: don't log unnecessary boundary keys when logging directory
        2/4 btrfs: put initial index value of a directory in a constant
        3/4 btrfs: stop copying old dir items when logging a directory
        4/4 btrfs: stop trying to log subdirectories created in past transactions
      
      Performance test results are listed in the changelog of patch 3/4.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a450a4af
    • F
      btrfs: remove write and wait of struct walk_control · c816d705
      Filipe Manana 提交于
      The ->write and ->wait fields of struct walk_control, used for log trees,
      are not used since 2008, more specifically since commit d0c803c4
      ("Btrfs: Record dirty pages tree-log pages in an extent_io tree") and
      since commit d0c803c4 ("Btrfs: Record dirty pages tree-log pages in
      an extent_io tree"). So just remove them, along with the function
      btrfs_write_tree_block(), which is also not used anymore after removing
      the ->write member.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c816d705
  8. 02 3月, 2022 2 次提交
    • F
      btrfs: add missing run of delayed items after unlink during log replay · 4751dc99
      Filipe Manana 提交于
      During log replay, whenever we need to check if a name (dentry) exists in
      a directory we do searches on the subvolume tree for inode references or
      or directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY
      keys as well, before kernel 5.17). However when during log replay we
      unlink a name, through btrfs_unlink_inode(), we may not delete inode
      references and dir index keys from a subvolume tree and instead just add
      the deletions to the delayed inode's delayed items, which will only be
      run when we commit the transaction used for log replay. This means that
      after an unlink operation during log replay, if we attempt to search for
      the same name during log replay, we will not see that the name was already
      deleted, since the deletion is recorded only on the delayed items.
      
      We run delayed items after every unlink operation during log replay,
      except at unlink_old_inode_refs() and at add_inode_ref(). This was due
      to an overlook, as delayed items should be run after evert unlink, for
      the reasons stated above.
      
      So fix those two cases.
      
      Fixes: 0d836392 ("Btrfs: fix mount failure after fsync due to hard link recreation")
      Fixes: 1f250e92 ("Btrfs: fix log replay failure after unlink and link combination")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4751dc99
    • F
      btrfs: fix lost prealloc extents beyond eof after full fsync · d9947887
      Filipe Manana 提交于
      When doing a full fsync, if we have prealloc extents beyond (or at) eof,
      and the leaves that contain them were not modified in the current
      transaction, we end up not logging them. This results in losing those
      extents when we replay the log after a power failure, since the inode is
      truncated to the current value of the logged i_size.
      
      Just like for the fast fsync path, we need to always log all prealloc
      extents starting at or beyond i_size. The fast fsync case was fixed in
      commit 471d557a ("Btrfs: fix loss of prealloc extents past i_size
      after fsync log replay") but it missed the full fsync path. The problem
      exists since the very early days, when the log tree was added by
      commit e02119d5 ("Btrfs: Add a write ahead tree log to optimize
      synchronous operations").
      
      Example reproducer:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        # Create our test file with many file extent items, so that they span
        # several leaves of metadata, even if the node/page size is 64K. Use
        # direct IO and not fsync/O_SYNC because it's both faster and it avoids
        # clearing the full sync flag from the inode - we want the fsync below
        # to trigger the slow full sync code path.
        $ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo
      
        # Now add two preallocated extents to our file without extending the
        # file's size. One right at i_size, and another further beyond, leaving
        # a gap between the two prealloc extents.
        $ xfs_io -c "falloc -k 16M 1M" /mnt/foo
        $ xfs_io -c "falloc -k 20M 1M" /mnt/foo
      
        # Make sure everything is durably persisted and the transaction is
        # committed. This makes all created extents to have a generation lower
        # than the generation of the transaction used by the next write and
        # fsync.
        sync
      
        # Now overwrite only the first extent, which will result in modifying
        # only the first leaf of metadata for our inode. Then fsync it. This
        # fsync will use the slow code path (inode full sync bit is set) because
        # it's the first fsync since the inode was created/loaded.
        $ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo
      
        # Extent list before power failure.
        $ xfs_io -c "fiemap -v" /mnt/foo
        /mnt/foo:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..7]:          2178048..2178055     8   0x0
           1: [8..16383]:      26632..43007     16376   0x0
           2: [16384..32767]:  2156544..2172927 16384   0x0
           3: [32768..34815]:  2172928..2174975  2048 0x800
           4: [34816..40959]:  hole              6144
           5: [40960..43007]:  2174976..2177023  2048 0x801
      
        <power fail>
      
        # Mount fs again, trigger log replay.
        $ mount /dev/sdc /mnt
      
        # Extent list after power failure and log replay.
        $ xfs_io -c "fiemap -v" /mnt/foo
        /mnt/foo:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..7]:          2178048..2178055     8   0x0
           1: [8..16383]:      26632..43007     16376   0x0
           2: [16384..32767]:  2156544..2172927 16384   0x1
      
        # The prealloc extents at file offsets 16M and 20M are missing.
      
      So fix this by calling btrfs_log_prealloc_extents() when we are doing a
      full fsync, so that we always log all prealloc extents beyond eof.
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d9947887
  9. 31 1月, 2022 1 次提交
    • F
      btrfs: skip reserved bytes warning on unmount after log cleanup failure · 40cdc509
      Filipe Manana 提交于
      After the recent changes made by commit c2e39305 ("btrfs: clear
      extent buffer uptodate when we fail to write it") and its followup fix,
      commit 651740a5 ("btrfs: check WRITE_ERR when trying to read an
      extent buffer"), we can now end up not cleaning up space reservations of
      log tree extent buffers after a transaction abort happens, as well as not
      cleaning up still dirty extent buffers.
      
      This happens because if writeback for a log tree extent buffer failed,
      then we have cleared the bit EXTENT_BUFFER_UPTODATE from the extent buffer
      and we have also set the bit EXTENT_BUFFER_WRITE_ERR on it. Later on,
      when trying to free the log tree with free_log_tree(), which iterates
      over the tree, we can end up getting an -EIO error when trying to read
      a node or a leaf, since read_extent_buffer_pages() returns -EIO if an
      extent buffer does not have EXTENT_BUFFER_UPTODATE set and has the
      EXTENT_BUFFER_WRITE_ERR bit set. Getting that -EIO means that we return
      immediately as we can not iterate over the entire tree.
      
      In that case we never update the reserved space for an extent buffer in
      the respective block group and space_info object.
      
      When this happens we get the following traces when unmounting the fs:
      
      [174957.284509] BTRFS: error (device dm-0) in cleanup_transaction:1913: errno=-5 IO failure
      [174957.286497] BTRFS: error (device dm-0) in free_log_tree:3420: errno=-5 IO failure
      [174957.399379] ------------[ cut here ]------------
      [174957.402497] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:127 btrfs_put_block_group+0x77/0xb0 [btrfs]
      [174957.407523] Modules linked in: btrfs overlay dm_zero (...)
      [174957.424917] CPU: 2 PID: 3206883 Comm: umount Tainted: G        W         5.16.0-rc5-btrfs-next-109 #1
      [174957.426689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [174957.428716] RIP: 0010:btrfs_put_block_group+0x77/0xb0 [btrfs]
      [174957.429717] Code: 21 48 8b bd (...)
      [174957.432867] RSP: 0018:ffffb70d41cffdd0 EFLAGS: 00010206
      [174957.433632] RAX: 0000000000000001 RBX: ffff8b09c3848000 RCX: ffff8b0758edd1c8
      [174957.434689] RDX: 0000000000000001 RSI: ffffffffc0b467e7 RDI: ffff8b0758edd000
      [174957.436068] RBP: ffff8b0758edd000 R08: 0000000000000000 R09: 0000000000000000
      [174957.437114] R10: 0000000000000246 R11: 0000000000000000 R12: ffff8b09c3848148
      [174957.438140] R13: ffff8b09c3848198 R14: ffff8b0758edd188 R15: dead000000000100
      [174957.439317] FS:  00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
      [174957.440402] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [174957.441164] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
      [174957.442117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [174957.443076] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [174957.443948] Call Trace:
      [174957.444264]  <TASK>
      [174957.444538]  btrfs_free_block_groups+0x255/0x3c0 [btrfs]
      [174957.445238]  close_ctree+0x301/0x357 [btrfs]
      [174957.445803]  ? call_rcu+0x16c/0x290
      [174957.446250]  generic_shutdown_super+0x74/0x120
      [174957.446832]  kill_anon_super+0x14/0x30
      [174957.447305]  btrfs_kill_super+0x12/0x20 [btrfs]
      [174957.447890]  deactivate_locked_super+0x31/0xa0
      [174957.448440]  cleanup_mnt+0x147/0x1c0
      [174957.448888]  task_work_run+0x5c/0xa0
      [174957.449336]  exit_to_user_mode_prepare+0x1e5/0x1f0
      [174957.449934]  syscall_exit_to_user_mode+0x16/0x40
      [174957.450512]  do_syscall_64+0x48/0xc0
      [174957.450980]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [174957.451605] RIP: 0033:0x7f328fdc4a97
      [174957.452059] Code: 03 0c 00 f7 (...)
      [174957.454320] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [174957.455262] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
      [174957.456131] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
      [174957.457118] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
      [174957.458005] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
      [174957.459113] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
      [174957.460193]  </TASK>
      [174957.460534] irq event stamp: 0
      [174957.461003] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [174957.461947] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.463147] softirqs last  enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.465116] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [174957.466323] ---[ end trace bc7ee0c490bce3af ]---
      [174957.467282] ------------[ cut here ]------------
      [174957.468184] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:3976 btrfs_free_block_groups+0x330/0x3c0 [btrfs]
      [174957.470066] Modules linked in: btrfs overlay dm_zero (...)
      [174957.483137] CPU: 2 PID: 3206883 Comm: umount Tainted: G        W         5.16.0-rc5-btrfs-next-109 #1
      [174957.484691] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [174957.486853] RIP: 0010:btrfs_free_block_groups+0x330/0x3c0 [btrfs]
      [174957.488050] Code: 00 00 00 ad de (...)
      [174957.491479] RSP: 0018:ffffb70d41cffde0 EFLAGS: 00010206
      [174957.492520] RAX: ffff8b08d79310b0 RBX: ffff8b09c3848000 RCX: 0000000000000000
      [174957.493868] RDX: 0000000000000001 RSI: fffff443055ee600 RDI: ffffffffb1131846
      [174957.495183] RBP: ffff8b08d79310b0 R08: 0000000000000000 R09: 0000000000000000
      [174957.496580] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8b08d7931000
      [174957.498027] R13: ffff8b09c38492b0 R14: dead000000000122 R15: dead000000000100
      [174957.499438] FS:  00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
      [174957.500990] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [174957.502117] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
      [174957.503513] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [174957.504864] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [174957.506167] Call Trace:
      [174957.506654]  <TASK>
      [174957.507047]  close_ctree+0x301/0x357 [btrfs]
      [174957.507867]  ? call_rcu+0x16c/0x290
      [174957.508567]  generic_shutdown_super+0x74/0x120
      [174957.509447]  kill_anon_super+0x14/0x30
      [174957.510194]  btrfs_kill_super+0x12/0x20 [btrfs]
      [174957.511123]  deactivate_locked_super+0x31/0xa0
      [174957.511976]  cleanup_mnt+0x147/0x1c0
      [174957.512610]  task_work_run+0x5c/0xa0
      [174957.513309]  exit_to_user_mode_prepare+0x1e5/0x1f0
      [174957.514231]  syscall_exit_to_user_mode+0x16/0x40
      [174957.515069]  do_syscall_64+0x48/0xc0
      [174957.515718]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [174957.516688] RIP: 0033:0x7f328fdc4a97
      [174957.517413] Code: 03 0c 00 f7 d8 (...)
      [174957.521052] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [174957.522514] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
      [174957.523950] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
      [174957.525375] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
      [174957.526763] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
      [174957.528058] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
      [174957.529404]  </TASK>
      [174957.529843] irq event stamp: 0
      [174957.530256] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [174957.531061] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.532075] softirqs last  enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.533083] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [174957.533865] ---[ end trace bc7ee0c490bce3b0 ]---
      [174957.534452] BTRFS info (device dm-0): space_info 4 has 1070841856 free, is not full
      [174957.535404] BTRFS info (device dm-0): space_info total=1073741824, used=2785280, pinned=0, reserved=49152, may_use=0, readonly=65536 zone_unusable=0
      [174957.537029] BTRFS info (device dm-0): global_block_rsv: size 0 reserved 0
      [174957.537859] BTRFS info (device dm-0): trans_block_rsv: size 0 reserved 0
      [174957.538697] BTRFS info (device dm-0): chunk_block_rsv: size 0 reserved 0
      [174957.539552] BTRFS info (device dm-0): delayed_block_rsv: size 0 reserved 0
      [174957.540403] BTRFS info (device dm-0): delayed_refs_rsv: size 0 reserved 0
      
      This also means that in case we have log tree extent buffers that are
      still dirty, we can end up not cleaning them up in case we find an
      extent buffer with EXTENT_BUFFER_WRITE_ERR set on it, as in that case
      we have no way for iterating over the rest of the tree.
      
      This issue is very often triggered with test cases generic/475 and
      generic/648 from fstests.
      
      The issue could almost be fixed by iterating over the io tree attached to
      each log root which keeps tracks of the range of allocated extent buffers,
      log_root->dirty_log_pages, however that does not work and has some
      inconveniences:
      
      1) After we sync the log, we clear the range of the extent buffers from
         the io tree, so we can't find them after writeback. We could keep the
         ranges in the io tree, with a separate bit to signal they represent
         extent buffers already written, but that means we need to hold into
         more memory until the transaction commits.
      
         How much more memory is used depends a lot on whether we are able to
         allocate contiguous extent buffers on disk (and how often) for a log
         tree - if we are able to, then a single extent state record can
         represent multiple extent buffers, otherwise we need multiple extent
         state record structures to track each extent buffer.
         In fact, my earlier approach did that:
      
         https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bbd.1641300250.git.fdmanana@suse.com/
      
         However that can cause a very significant negative impact on
         performance, not only due to the extra memory usage but also because
         we get a larger and deeper dirty_log_pages io tree.
         We got a report that, on beefy machines at least, we can get such
         performance drop with fsmark for example:
      
         https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9020/
      
      2) We would be doing it only to deal with an unexpected and exceptional
         case, which is basically failure to read an extent buffer from disk
         due to IO failures. On a healthy system we don't expect transaction
         aborts to happen after all;
      
      3) Instead of relying on iterating the log tree or tracking the ranges
         of extent buffers in the dirty_log_pages io tree, using the radix
         tree that tracks extent buffers (fs_info->buffer_radix) to find all
         log tree extent buffers is not reliable either, because after writeback
         of an extent buffer it can be evicted from memory by the release page
         callback of the btree inode (btree_releasepage()).
      
      Since there's no way to be able to properly cleanup a log tree without
      being able to read its extent buffers from disk and without using more
      memory to track the logical ranges of the allocated extent buffers do
      the following:
      
      1) When we fail to cleanup a log tree, setup a flag that indicates that
         failure;
      
      2) Trigger writeback of all log tree extent buffers that are still dirty,
         and wait for the writeback to complete. This is just to cleanup their
         state, page states, page leaks, etc;
      
      3) When unmounting the fs, ignore if the number of bytes reserved in a
         block group and in a space_info is not 0 if, and only if, we failed to
         cleanup a log tree. Also ignore only for metadata block groups and the
         metadata space_info object.
      
      This is far from a perfect solution, but it serves to silence test
      failures such as those from generic/475 and generic/648. However having
      a non-zero value for the reserved bytes counters on unmount after a
      transaction abort, is not such a terrible thing and it's completely
      harmless, it does not affect the filesystem integrity in any way.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      40cdc509
  10. 07 1月, 2022 6 次提交
  11. 03 1月, 2022 4 次提交
    • J
      btrfs: stop accessing ->csum_root directly · fc28b25e
      Josef Bacik 提交于
      We are going to have multiple csum roots in the future, so convert all
      users of ->csum_root to btrfs_csum_root() and rename ->csum_root to
      ->_csum_root so we can easily find remaining users in the future.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fc28b25e
    • J
      btrfs: drop the _nr from the item helpers · 3212fa14
      Josef Bacik 提交于
      Now that all call sites are using the slot number to modify item values,
      rename the SETGET helpers to raw_item_*(), and then rework the _nr()
      helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then
      rename all of the callers to the new helpers.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3212fa14
    • F
      btrfs: remove no longer needed logic for replaying directory deletes · ccae4a19
      Filipe Manana 提交于
      Now that we log only dir index keys when logging a directory, we no longer
      need to deal with dir item keys in the log replay code for replaying
      directory deletes. This is also true for the case when we replay a log
      tree created by a kernel that still logs dir items.
      
      So remove the remaining code of the replay of directory deletes algorithm
      that deals with dir item keys.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ccae4a19
    • F
      btrfs: only copy dir index keys when logging a directory · 339d0354
      Filipe Manana 提交于
      Currently, when logging a directory, we copy both dir items and dir index
      items from the fs/subvolume tree to the log tree. Both items have exactly
      the same data (same struct btrfs_dir_item), the difference lies in the key
      values, where a dir index key contains the index number of a directory
      entry while the dir item key does not, as it's used for doing fast lookups
      of an entry by name, while the former is used for sorting entries when
      listing a directory.
      
      We can exploit that and log only the dir index items, since they contain
      all the information needed to correctly add, replace and delete directory
      entries when replaying a log tree. Logging only the dir index items is
      also backward and forward compatible: an unpatched kernel (without this
      change) can correctly replay a log tree generated by a patched kernel
      (with this patch), and a patched kernel can correctly replay a log tree
      generated by an unpatched kernel.
      
      The backward compatibility is ensured because:
      
      1) For inserting a new dentry: a dentry is only inserted when we find a
         new dir index key - we can only insert if we know the dir index offset,
         which is encoded in the dir index key's offset;
      
      2) For deleting dentries: during log replay, before adding or replacing
         dentries, we first replay dentry deletions. Whenever we find a dir item
         key or a dir index key in the subvolume/fs tree that is not logged in
         a range for which the log tree is authoritative, we do the unlink of
         the dentry, which removes both the existing dir item key and the dir
         index key. Therefore logging just dir index keys is enough to ensure
         dentry deletions are correctly replayed;
      
      3) For dentry replacements: they work when we log only dir index keys
         and this is mostly due to a combination of 1) and 2). If we replace a
         dentry with name "foobar" to point from inode A to inode B, then we
         know the dir index key for the new dentry is different from the old
         one, as it has an index number (key offset) larger than the old one.
         This results in replaying a deletion, through replay_dir_deletes(),
         that causes the old dentry to be removed, both the dir item key and
         the dir index key, as mentioned at 2). Then when processing the new
         dir index key, we add the new dentry, adding both a new dir item key
         and a new index key pointing to inode B, as stated in 1).
      
      The forward compatibility, the ability for a patched kernel to replay a
      log created by an older, unpatched kernel, comes from the changes required
      for making sure we are able to replay a log that only contains dir index
      keys - we simply ignore every dir item key we find.
      
      So modify directory logging to log only dir index items, and modify the
      log replay process to ignore dir item keys, from log trees created by an
      unpatched kernel, and process only with dir index keys. This reduces the
      amount of logged metadata by about half, and therefore the time spent
      logging or fsyncing large directories (less CPU time and less IO).
      
      The following test script was used to measure this change:
      
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
      
         NUM_NEW_FILES=1000000
         NUM_FILE_DELETES=10000
      
         mkfs.btrfs -f $DEV
         mount -o ssd $DEV $MNT
      
         mkdir $MNT/testdir
      
         for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
                 echo -n > $MNT/testdir/file_$i
         done
      
         start=$(date +%s%N)
         xfs_io -c "fsync" $MNT/testdir
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000 ))
         echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
      
         # sync to force transaction commit and wipeout the log.
         sync
      
         del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
         for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
                 rm -f $MNT/testdir/file_$i
         done
      
         start=$(date +%s%N)
         xfs_io -c "fsync" $MNT/testdir
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000 ))
         echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
         echo
      
         umount $MNT
      
      The tests were run on a physical machine, with a non-debug kernel (Debian's
      default kernel config), for different values of $NUM_NEW_FILES and
      $NUM_FILE_DELETES, and the results were the following:
      
      ** Before patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 **
      
      dir fsync took 8412 ms after adding 1000000 files
      dir fsync took 500 ms after deleting 10000 files
      
      ** After patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 **
      
      dir fsync took 4252 ms after adding 1000000 files   (-49.5%)
      dir fsync took 269 ms after deleting 10000 files    (-46.2%)
      
      ** Before patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 **
      
      dir fsync took 745 ms after adding 100000 files
      dir fsync took 59 ms after deleting 1000 files
      
      ** After patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 **
      
      dir fsync took 404 ms after adding 100000 files   (-45.8%)
      dir fsync took 31 ms after deleting 1000 files    (-47.5%)
      
      ** Before patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 **
      
      dir fsync took 67 ms after adding 10000 files
      dir fsync took 9 ms after deleting 1000 files
      
      ** After patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 **
      
      dir fsync took 36 ms after adding 10000 files   (-46.3%)
      dir fsync took 5 ms after deleting 1000 files   (-44.4%)
      
      ** Before patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 **
      
      dir fsync took 9 ms after adding 1000 files
      dir fsync took 4 ms after deleting 100 files
      
      ** After patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 **
      
      dir fsync took 7 ms after adding 1000 files     (-22.2%)
      dir fsync took 3 ms after deleting 100 files    (-25.0%)
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      339d0354
  12. 14 12月, 2021 1 次提交
    • F
      btrfs: fix missing last dir item offset update when logging directory · 1b2e5e5c
      Filipe Manana 提交于
      When logging a directory, once we finish processing a leaf that is full
      of dir items, if we find the next leaf was not modified in the current
      transaction, we grab the first key of that next leaf and log it as to
      mark the end of a key range boundary.
      
      However we did not update the value of ctx->last_dir_item_offset, which
      tracks the offset of the last logged key. This can result in subsequent
      logging of the same directory in the current transaction to not realize
      that key was already logged, and then add it to the middle of a batch
      that starts with a lower key, resulting later in a leaf with one key
      that is duplicated and at non-consecutive slots. When that happens we get
      an error later when writing out the leaf, reporting that there is a pair
      of keys in wrong order. The report is something like the following:
      
      Dec 13 21:44:50 kernel: BTRFS critical (device dm-0): corrupt leaf:
      root=18446744073709551610 block=118444032 slot=21, bad key order, prev
      (704687 84 4146773349) current (704687 84 1063561078)
      Dec 13 21:44:50 kernel: BTRFS info (device dm-0): leaf 118444032 gen
      91449 total ptrs 39 free space 546 owner 18446744073709551610
      Dec 13 21:44:50 kernel:         item 0 key (704687 1 0) itemoff 3835
      itemsize 160
      Dec 13 21:44:50 kernel:                 inode generation 35532 size
      1026 mode 40755
      Dec 13 21:44:50 kernel:         item 1 key (704687 12 704685) itemoff
      3822 itemsize 13
      Dec 13 21:44:50 kernel:         item 2 key (704687 24 3817753667)
      itemoff 3736 itemsize 86
      Dec 13 21:44:50 kernel:         item 3 key (704687 60 0) itemoff 3728 itemsize 8
      Dec 13 21:44:50 kernel:         item 4 key (704687 72 0) itemoff 3720 itemsize 8
      Dec 13 21:44:50 kernel:         item 5 key (704687 84 140445108)
      itemoff 3666 itemsize 54
      Dec 13 21:44:50 kernel:                 dir oid 704793 type 1
      Dec 13 21:44:50 kernel:         item 6 key (704687 84 298800632)
      itemoff 3599 itemsize 67
      Dec 13 21:44:50 kernel:                 dir oid 707849 type 2
      Dec 13 21:44:50 kernel:         item 7 key (704687 84 476147658)
      itemoff 3532 itemsize 67
      Dec 13 21:44:50 kernel:                 dir oid 707901 type 2
      Dec 13 21:44:50 kernel:         item 8 key (704687 84 633818382)
      itemoff 3471 itemsize 61
      Dec 13 21:44:50 kernel:                 dir oid 704694 type 2
      Dec 13 21:44:50 kernel:         item 9 key (704687 84 654256665)
      itemoff 3403 itemsize 68
      Dec 13 21:44:50 kernel:                 dir oid 707841 type 1
      Dec 13 21:44:50 kernel:         item 10 key (704687 84 995843418)
      itemoff 3331 itemsize 72
      Dec 13 21:44:50 kernel:                 dir oid 2167736 type 1
      Dec 13 21:44:50 kernel:         item 11 key (704687 84 1063561078)
      itemoff 3278 itemsize 53
      Dec 13 21:44:50 kernel:                 dir oid 704799 type 2
      Dec 13 21:44:50 kernel:         item 12 key (704687 84 1101156010)
      itemoff 3225 itemsize 53
      Dec 13 21:44:50 kernel:                 dir oid 704696 type 1
      Dec 13 21:44:50 kernel:         item 13 key (704687 84 2521936574)
      itemoff 3173 itemsize 52
      Dec 13 21:44:50 kernel:                 dir oid 704704 type 2
      Dec 13 21:44:50 kernel:         item 14 key (704687 84 2618368432)
      itemoff 3112 itemsize 61
      Dec 13 21:44:50 kernel:                 dir oid 704738 type 1
      Dec 13 21:44:50 kernel:         item 15 key (704687 84 2676316190)
      itemoff 3046 itemsize 66
      Dec 13 21:44:50 kernel:                 dir oid 2167729 type 1
      Dec 13 21:44:50 kernel:         item 16 key (704687 84 3319104192)
      itemoff 2986 itemsize 60
      Dec 13 21:44:50 kernel:                 dir oid 704745 type 2
      Dec 13 21:44:50 kernel:         item 17 key (704687 84 3908046265)
      itemoff 2929 itemsize 57
      Dec 13 21:44:50 kernel:                 dir oid 2167734 type 1
      Dec 13 21:44:50 kernel:         item 18 key (704687 84 3945713089)
      itemoff 2857 itemsize 72
      Dec 13 21:44:50 kernel:                 dir oid 2167730 type 1
      Dec 13 21:44:50 kernel:         item 19 key (704687 84 4077169308)
      itemoff 2795 itemsize 62
      Dec 13 21:44:50 kernel:                 dir oid 704688 type 1
      Dec 13 21:44:50 kernel:         item 20 key (704687 84 4146773349)
      itemoff 2727 itemsize 68
      Dec 13 21:44:50 kernel:                 dir oid 707892 type 1
      Dec 13 21:44:50 kernel:         item 21 key (704687 84 1063561078)
      itemoff 2674 itemsize 53
      Dec 13 21:44:50 kernel:                 dir oid 704799 type 2
      Dec 13 21:44:50 kernel:         item 22 key (704687 96 2) itemoff 2612
      itemsize 62
      Dec 13 21:44:50 kernel:         item 23 key (704687 96 6) itemoff 2551
      itemsize 61
      Dec 13 21:44:50 kernel:         item 24 key (704687 96 7) itemoff 2498
      itemsize 53
      Dec 13 21:44:50 kernel:         item 25 key (704687 96 12) itemoff
      2446 itemsize 52
      Dec 13 21:44:50 kernel:         item 26 key (704687 96 14) itemoff
      2385 itemsize 61
      Dec 13 21:44:50 kernel:         item 27 key (704687 96 18) itemoff
      2325 itemsize 60
      Dec 13 21:44:50 kernel:         item 28 key (704687 96 24) itemoff
      2271 itemsize 54
      Dec 13 21:44:50 kernel:         item 29 key (704687 96 28) itemoff
      2218 itemsize 53
      Dec 13 21:44:50 kernel:         item 30 key (704687 96 62) itemoff
      2150 itemsize 68
      Dec 13 21:44:50 kernel:         item 31 key (704687 96 66) itemoff
      2083 itemsize 67
      Dec 13 21:44:50 kernel:         item 32 key (704687 96 75) itemoff
      2015 itemsize 68
      Dec 13 21:44:50 kernel:         item 33 key (704687 96 79) itemoff
      1948 itemsize 67
      Dec 13 21:44:50 kernel:         item 34 key (704687 96 82) itemoff
      1882 itemsize 66
      Dec 13 21:44:50 kernel:         item 35 key (704687 96 83) itemoff
      1810 itemsize 72
      Dec 13 21:44:50 kernel:         item 36 key (704687 96 85) itemoff
      1753 itemsize 57
      Dec 13 21:44:50 kernel:         item 37 key (704687 96 87) itemoff
      1681 itemsize 72
      Dec 13 21:44:50 kernel:         item 38 key (704694 1 0) itemoff 1521
      itemsize 160
      Dec 13 21:44:50 kernel:                 inode generation 35534 size 30
      mode 40755
      Dec 13 21:44:50 kernel: BTRFS error (device dm-0): block=118444032
      write time tree block corruption detected
      
      So fix that by adding the missing update of ctx->last_dir_item_offset with
      the offset of the boundary key.
      Reported-by: NChris Murphy <lists@colorremedies.com>
      Link: https://lore.kernel.org/linux-btrfs/CAJCQCtT+RSzpUjbMq+UfzNUMe1X5+1G+DnAGbHC=OZ=iRS24jg@mail.gmail.com/
      Fixes: dc287224 ("btrfs: keep track of the last logged keys when logging a directory")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1b2e5e5c