1. 16 5月, 2022 14 次提交
  2. 28 4月, 2022 1 次提交
  3. 19 4月, 2022 2 次提交
  4. 06 4月, 2022 2 次提交
    • N
      btrfs: release correct delalloc amount in direct IO write path · 6d82ad13
      Naohiro Aota 提交于
      Running generic/406 causes the following WARNING in btrfs_destroy_inode()
      which tells there are outstanding extents left.
      
      In btrfs_get_blocks_direct_write(), we reserve a temporary outstanding
      extents with btrfs_delalloc_reserve_metadata() (or indirectly from
      btrfs_delalloc_reserve_space(()). We then release the outstanding extents
      with btrfs_delalloc_release_extents(). However, the "len" can be modified
      in the COW case, which releases fewer outstanding extents than expected.
      
      Fix it by calling btrfs_delalloc_release_extents() for the original length.
      
      To reproduce the warning, the filesystem should be 1 GiB.  It's
      triggering a short-write, due to not being able to allocate a large
      extent and instead allocating a smaller one.
      
        WARNING: CPU: 0 PID: 757 at fs/btrfs/inode.c:8848 btrfs_destroy_inode+0x1e6/0x210 [btrfs]
        Modules linked in: btrfs blake2b_generic xor lzo_compress
        lzo_decompress raid6_pq zstd zstd_decompress zstd_compress xxhash zram
        zsmalloc
        CPU: 0 PID: 757 Comm: umount Not tainted 5.17.0-rc8+ #101
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS d55cb5a 04/01/2014
        RIP: 0010:btrfs_destroy_inode+0x1e6/0x210 [btrfs]
        RSP: 0018:ffffc9000327bda8 EFLAGS: 00010206
        RAX: 0000000000000000 RBX: ffff888100548b78 RCX: 0000000000000000
        RDX: 0000000000026900 RSI: 0000000000000000 RDI: ffff888100548b78
        RBP: ffff888100548940 R08: 0000000000000000 R09: ffff88810b48aba8
        R10: 0000000000000001 R11: ffff8881004eb240 R12: ffff88810b48a800
        R13: ffff88810b48ec08 R14: ffff88810b48ed00 R15: ffff888100490c68
        FS:  00007f8549ea0b80(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f854a09e733 CR3: 000000010a2e9003 CR4: 0000000000370eb0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <TASK>
         destroy_inode+0x33/0x70
         dispose_list+0x43/0x60
         evict_inodes+0x161/0x1b0
         generic_shutdown_super+0x2d/0x110
         kill_anon_super+0xf/0x20
         btrfs_kill_super+0xd/0x20 [btrfs]
         deactivate_locked_super+0x27/0x90
         cleanup_mnt+0x12c/0x180
         task_work_run+0x54/0x80
         exit_to_user_mode_prepare+0x152/0x160
         syscall_exit_to_user_mode+0x12/0x30
         do_syscall_64+0x42/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         RIP: 0033:0x7f854a000fb7
      
      Fixes: f0bfa76a ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range")
      CC: stable@vger.kernel.org # 5.17
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Tested-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6d82ad13
    • H
      btrfs: zoned: remove redundant condition in btrfs_run_delalloc_range · 9435be73
      Haowen Bai 提交于
      The logic !A || A && B is equivalent to !A || B. so we can
      make code clear.
      
      Note: though it's preferred to be in the more human readable form, there
      have been repeated reports and patches as the expression is detected by
      tools so apply it to reduce the load.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NHaowen Bai <baihaowen@meizu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9435be73
  5. 02 4月, 2022 1 次提交
  6. 25 3月, 2022 2 次提交
    • K
      btrfs: prevent subvol with swapfile from being deleted · 60021bd7
      Kaiwen Hu 提交于
      A subvolume with an active swapfile must not be deleted otherwise it
      would not be possible to deactivate it.
      
      After the subvolume is deleted, we cannot swapoff the swapfile in this
      deleted subvolume because the path is unreachable.  The swapfile is
      still active and holding references, the filesystem cannot be unmounted.
      
      The test looks like this:
      
        mkfs.btrfs -f $dev > /dev/null
        mount $dev $mnt
      
        btrfs sub create $mnt/subvol
        touch $mnt/subvol/swapfile
        chmod 600 $mnt/subvol/swapfile
        chattr +C $mnt/subvol/swapfile
        dd if=/dev/zero of=$mnt/subvol/swapfile bs=1K count=4096
        mkswap $mnt/subvol/swapfile
        swapon $mnt/subvol/swapfile
      
        btrfs sub delete $mnt/subvol
        swapoff $mnt/subvol/swapfile  # failed: No such file or directory
        swapoff --all
      
        unmount $mnt                  # target is busy.
      
      To prevent above issue, we simply check that whether the subvolume
      contains any active swapfile, and stop the deleting process.  This
      behavior is like snapshot ioctl dealing with a swapfile.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NKaiwen Hu <kevinhu@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      60021bd7
    • J
      btrfs: do not warn for free space inode in cow_file_range · a7d16d9a
      Josef Bacik 提交于
      This is a long time leftover from when I originally added the free space
      inode, the point was to catch cases where we weren't honoring the NOCOW
      flag.  However there exists a race with relocation, if we allocate our
      free space inode in a block group that is about to be relocated, we
      could trigger the COW path before the relocation has the opportunity to
      find the extents and delete the free space cache.  In production where
      we have auto-relocation enabled we're seeing this WARN_ON_ONCE() around
      5k times in a 2 week period, so not super common but enough that it's at
      the top of our metrics.
      
      We're properly handling the error here, and with us phasing out v1 space
      cache anyway just drop the WARN_ON_ONCE.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a7d16d9a
  7. 23 3月, 2022 1 次提交
  8. 15 3月, 2022 2 次提交
  9. 14 3月, 2022 15 次提交
    • F
      btrfs: reset last_reflink_trans after fsyncing inode · 23e3337f
      Filipe Manana 提交于
      When an inode has a last_reflink_trans matching the current transaction,
      we have to take special care when logging its checksums in order to
      avoid getting checksum items with overlapping ranges in a log tree,
      which could result in missing checksums after log replay (more on that
      in the changelogs of commit 40e046ac ("Btrfs: fix missing data
      checksums after replaying a log tree") and commit e289f03e ("btrfs:
      fix corrupt log due to concurrent fsync of inodes with shared extents")).
      We also need to make sure a full fsync will copy all old file extent
      items it finds in modified leaves, because they might have been copied
      from some other inode.
      
      However once we fsync an inode, we don't need to keep paying the price of
      that extra special care in future fsyncs done in the same transaction,
      unless the inode is used for another reflink operation or the full sync
      flag is set on it (truncate, failure to allocate extent maps for holes,
      and other exceptional and infrequent cases).
      
      So after we fsync an inode reset its last_unlink_trans to zero. In case
      another reflink happens, we continue to update the last_reflink_trans of
      the inode, just as before. Also set last_reflink_trans to the generation
      of the last transaction that modified the inode whenever we need to set
      the full sync flag on the inode, just like when we need to load an inode
      from disk after eviction.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      23e3337f
    • J
      btrfs: do not double complete bio on errors during compressed reads · f9f15de8
      Josef Bacik 提交于
      I hit some weird panics while fixing up the error handling from
      btrfs_lookup_bio_sums().  Turns out the compression path will complete
      the bio we use if we set up any of the compression bios and then return
      an error, and then btrfs_submit_data_bio() will also call bio_endio() on
      the bio.
      
      Fix this by making btrfs_submit_compressed_read() responsible for
      calling bio_endio() on the bio if there are any errors.  Currently it
      was only doing it if we created the compression bios, otherwise it was
      depending on btrfs_submit_data_bio() to do the right thing.  This
      creates the above problem, so fix up btrfs_submit_compressed_read() to
      always call bio_endio() in case of an error, and then simply return from
      btrfs_submit_data_bio() if we had to call
      btrfs_submit_compressed_read().
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f9f15de8
    • O
      btrfs: add BTRFS_IOC_ENCODED_WRITE · 7c0c7269
      Omar Sandoval 提交于
      The implementation resembles direct I/O: we have to flush any ordered
      extents, invalidate the page cache, and do the io tree/delalloc/extent
      map/ordered extent dance. From there, we can reuse the compression code
      with a minor modification to distinguish the write from writeback. This
      also creates inline extents when possible.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7c0c7269
    • O
      btrfs: add BTRFS_IOC_ENCODED_READ ioctl · 1881fba8
      Omar Sandoval 提交于
      There are 4 main cases:
      
      1. Inline extents: we copy the data straight out of the extent buffer.
      2. Hole/preallocated extents: we fill in zeroes.
      3. Regular, uncompressed extents: we read the sectors we need directly
         from disk.
      4. Regular, compressed extents: we read the entire compressed extent
         from disk and indicate what subset of the decompressed extent is in
         the file.
      
      This initial implementation simplifies a few things that can be improved
      in the future:
      
      - Cases 1, 3, and 4 allocate temporary memory to read into before
        copying out to userspace.
      - We don't do read repair, because it turns out that read repair is
        currently broken for compressed data.
      - We hold the inode lock during the operation.
      
      Note that we don't need to hold the mmap lock. We may race with
      btrfs_page_mkwrite() and read the old data from before the page was
      dirtied:
      
      btrfs_page_mkwrite         btrfs_encoded_read
      ---------------------------------------------------
      (enter)                    (enter)
                                 btrfs_wait_ordered_range
      lock_extent_bits
      btrfs_page_set_dirty
      unlock_extent_cached
      (exit)
                                 lock_extent_bits
                                 read extent (dirty page hasn't been flushed,
                                              so this is the old data)
                                 unlock_extent_cached
                                 (exit)
      
      we read the old data from before the page was dirtied. But, that's true
      even if we were to hold the mmap lock:
      
      btrfs_page_mkwrite               btrfs_encoded_read
      -------------------------------------------------------------------
      (enter)                          (enter)
                                       btrfs_inode_lock(BTRFS_ILOCK_MMAP)
      down_read(i_mmap_lock) (blocked)
                                       btrfs_wait_ordered_range
                                       lock_extent_bits
      				 read extent (page hasn't been dirtied,
                                                    so this is the old data)
                                       unlock_extent_cached
                                       btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
      down_read(i_mmap_lock) returns
      lock_extent_bits
      btrfs_page_set_dirty
      unlock_extent_cached
      
      In other words, this is inherently racy, so it's fine that we return the
      old data in this tiny window.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1881fba8
    • O
      btrfs: optionally extend i_size in cow_file_range_inline() · d9496e8a
      Omar Sandoval 提交于
      Currently, an inline extent is always created after i_size is extended
      from btrfs_dirty_pages(). However, for encoded writes, we only want to
      update i_size after we successfully created the inline extent. Add an
      update_i_size parameter to cow_file_range_inline() and
      insert_inline_extent() and pass in the size of the extent rather than
      determining it from i_size.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ reformat comment ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d9496e8a
    • O
      btrfs: clean up cow_file_range_inline() · 8dd9872d
      Omar Sandoval 提交于
      The start parameter to cow_file_range_inline() (and
      insert_inline_extent()) is always 0, so get rid of it and simplify the
      logic in those two functions. Pass btrfs_inode to insert_inline_extent()
      and remove the redundant root parameter. Also document the requirements
      for creating an inline extent. No functional change.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8dd9872d
    • O
      btrfs: support different disk extent size for delalloc · 28c9b1e7
      Omar Sandoval 提交于
      Currently, we always reserve the same extent size in the file and extent
      size on disk for delalloc because the former is the worst case for the
      latter. For BTRFS_IOC_ENCODED_WRITE writes, we know the exact size of
      the extent on disk, which may be less than or greater than (for
      bookends) the size in the file. Add a disk_num_bytes parameter to
      btrfs_delalloc_reserve_metadata() so that we can reserve the correct
      amount of csum bytes. No functional change.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      28c9b1e7
    • O
      btrfs: add ram_bytes and offset to btrfs_ordered_extent · cb36a9bb
      Omar Sandoval 提交于
      Currently, we only create ordered extents when ram_bytes == num_bytes
      and offset == 0. However, BTRFS_IOC_ENCODED_WRITE writes may create
      extents which only refer to a subset of the full unencoded extent, so we
      need to plumb these fields through the ordered extent infrastructure and
      pass them down to insert_reserved_file_extent().
      
      Since we're changing the btrfs_add_ordered_extent* signature, let's get
      rid of the trivial wrappers and add a kernel-doc.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cb36a9bb
    • O
      btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() · e331f6b1
      Omar Sandoval 提交于
      btrfs_csum_one_bio() loops over each filesystem block in the bio while
      keeping a cursor of its current logical position in the file in order to
      look up the ordered extent to add the checksums to. However, this
      doesn't make much sense for compressed extents, as a sector on disk does
      not correspond to a sector of decompressed file data. It happens to work
      because:
      
      1) the compressed bio always covers one ordered extent
      2) the size of the bio is always less than the size of the ordered
         extent
      
      However, the second point will not always be true for encoded writes.
      
      Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that
      it can assume that the bio only covers one ordered extent. Since we're
      already changing the signature, let's get rid of the contig parameter
      and make it implied by the offset parameter, similar to the change we
      recently made to btrfs_lookup_bio_sums(). Additionally, let's rename
      nr_sectors to blockcount to make it clear that it's the number of
      filesystem blocks, not the number of 512-byte sectors.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e331f6b1
    • F
      btrfs: fix lost error return value when reading a data page · bbf0ea7e
      Filipe Manana 提交于
      At btrfs_do_readpage(), if we get an error when trying to lookup for an
      extent map, we end up marking the page with the error bit, clearing
      the uptodate bit on it, and doing everything else that should be done.
      However we return success (0) to the caller, when we should return the
      error encoded in the extent map pointer. So fix that by returning the
      error encoded in the pointer.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bbf0ea7e
    • D
      btrfs: replace BUILD_BUG_ON by static_assert · a55e65b8
      David Sterba 提交于
      The static_assert introduced in 6bab69c6 ("build_bug.h: add wrapper
      for _Static_assert") has been supported by compilers for a long time
      (gcc 4.6, clang 3.0) and can be used in header files. We don't need to
      put BUILD_BUG_ON to random functions but rather keep it next to the
      definition.
      
      The exception here is the UAPI header btrfs_tree.h that could be
      potentially included by userspace code and the static assert is not
      defined (nor used in any other header).
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a55e65b8
    • J
      btrfs: disable snapshot creation/deletion for extent tree v2 · 813febdb
      Josef Bacik 提交于
      When we stop tracking metadata blocks all of snapshotting will break, so
      disable it until I add the snapshot root and drop tree support.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      813febdb
    • F
      btrfs: stop doing unnecessary log updates during a rename · 259c4b96
      Filipe Manana 提交于
      During a rename, we call __btrfs_unlink_inode(), which will call
      btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(), in order
      to remove an inode reference and a directory entry from the log. These
      are necessary when __btrfs_unlink_inode() is called from the unlink path,
      but not necessary when it's called from a rename context, because:
      
      1) For the btrfs_del_inode_ref_in_log() call, it's pointless to delete the
         inode reference related to the old name, because later in the rename
         path we call btrfs_log_new_name(), which will drop all inode references
         from the log and copy all inode references from the subvolume tree to
         the log tree. So we are doing one unnecessary btree operation which
         adds additional latency and lock contention in case there are other
         tasks accessing the log tree;
      
      2) For the btrfs_del_dir_entries_in_log() call, we are now doing the
         equivalent at btrfs_log_new_name() since the previous patch in the
         series, that has the subject "btrfs: avoid logging all directory
         changes during renames". In fact, having __btrfs_unlink_inode() call
         this function not only adds additional latency and lock contention due
         to the extra btree operation, but also can make btrfs_log_new_name()
         unnecessarily log a range item to track the deletion of the old name,
         since it has no way to known that the directory entry related to the
         old name was previously logged and already deleted by
         __btrfs_unlink_inode() through its call to
         btrfs_del_dir_entries_in_log().
      
      So skip those calls at __btrfs_unlink_inode() when we are doing a rename.
      Skipping them also allows us now to reduce the duration of time we are
      pinning a log transaction during renames, which is always beneficial as
      it's not delaying so much other tasks trying to sync the log tree, in
      particular we end up not holding the log transaction pinned while adding
      the new name (adding inode ref, directory entry, etc).
      
      This change is part of a patchset comprised of the following patches:
      
        1/5 btrfs: add helper to delete a dir entry from a log tree
        2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
        3/5 btrfs: avoid logging all directory changes during renames
        4/5 btrfs: stop doing unnecessary log updates during a rename
        5/5 btrfs: avoid inode logging during rename and link when possible
      
      Just like the previous patch in the series, "btrfs: avoid logging all
      directory changes during renames", the following script mimics part of
      what a package installation/upgrade with zypper does, which is basically
      renaming a lot of files, in some directory under /usr, to a name with a
      suffix of "-RPMDELETE":
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=10000
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        # Do some change to testdir and fsync it.
        echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
        xfs_io -c "fsync" $MNT/testdir
      
        echo "Renaming $NUM_FILES files..."
        start=$(date +%s%N)
        for ((i = 1; i <= $NUM_FILES; i++)); do
            mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
        done
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "Renames took $dur milliseconds"
      
        umount $MNT
      
      Testing this change on box a using a non-debug kernel (Debian's default
      kernel config) gave the following results:
      
      NUM_FILES=10000, before patchset:                   27399 ms
      NUM_FILES=10000, after patches 1/5 to 3/5 applied:   9093 ms (-66.8%)
      NUM_FILES=10000, after patches 1/5 to 4/5 applied:   9016 ms (-67.1%)
      
      NUM_FILES=5000, before patchset:                     9241 ms
      NUM_FILES=5000, after patches 1/5 to 3/5 applied:    4642 ms (-49.8%)
      NUM_FILES=5000, after patches 1/5 to 4/5 applied:    4553 ms (-50.7%)
      
      NUM_FILES=2000, before patchset:                     2550 ms
      NUM_FILES=2000, after patches 1/5 to 3/5 applied:    1788 ms (-29.9%)
      NUM_FILES=2000, after patches 1/5 to 4/5 applied:    1767 ms (-30.7%)
      
      NUM_FILES=1000, before patchset:                     1088 ms
      NUM_FILES=1000, after patches 1/5 to 3/5 applied:     905 ms (-16.9%)
      NUM_FILES=1000, after patches 1/5 to 4/5 applied:     883 ms (-18.8%)
      
      The next patch in the series (5/5), also contains dbench results after
      applying to whole patchset.
      
      Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      259c4b96
    • F
      btrfs: avoid logging all directory changes during renames · 88d2beec
      Filipe Manana 提交于
      When doing a rename of a file, if the file or its old parent directory
      were logged before, we log the new name of the file and then make sure
      we log the old parent directory, to ensure that after a log replay the
      old name of the file is deleted and the new name added.
      
      The logging of the old parent directory can take some time, because it
      will scan all leaves modified in the current transaction, check which
      directory entries were already logged, copy the ones that were not
      logged before, etc. In this rename context all we need to do is make
      sure that the old name of the file is deleted on log replay, so instead
      of triggering a directory log operation, we can just delete the old
      directory entry from the log if it's there, or in case it isn't there,
      just log a range item to signal log replay that the old name must be
      deleted. So change btrfs_log_new_name() to do that.
      
      This scenario is actually not uncommon to trigger, and recently on a
      5.15 kernel, an openSUSE Tumbleweed user reported package installations
      and upgrades, with the zypper tool, were often taking a long time to
      complete, much more than usual. With strace it could be observed that
      zypper was spending over 99% of its time on rename operations, and then
      with further analysis we checked that directory logging was happening
      too frequently and causing high latencies for the rename operations.
      Taking into account that installation/upgrade of some of these packages
      needed about a few thousand file renames, the slowdown was very noticeable
      for the user.
      
      The issue was caused indirectly due to an excessive number of inode
      evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14
      or a 5.16-rc8 kernel. After an inode eviction we can't tell for sure,
      in an efficient way, if an inode was previously logged in the current
      transaction, so we are pessimistic and assume it was, because in case
      it was we need to update the logged inode. More details on that in one
      of the patches in the same series (subject "btrfs: avoid inode logging
      during rename and link when possible"). Either way, in case the parent
      directory was logged before, we currently do more work then necessary
      during a rename, and this change minimizes that amount of work.
      
      The following script mimics part of what a package installation/upgrade
      with zypper does, which is basically renaming a lot of files, in some
      directory under /usr, to a name with a suffix of "-RPMDELETE":
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=10000
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        # Do some change to testdir and fsync it.
        echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
        xfs_io -c "fsync" $MNT/testdir
      
        echo "Renaming $NUM_FILES files..."
        start=$(date +%s%N)
        for ((i = 1; i <= $NUM_FILES; i++)); do
            mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
        done
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "Renames took $dur milliseconds"
      
        umount $MNT
      
      Testing this change on box using a non-debug kernel (Debian's default
      kernel config) gave the following results:
      
      NUM_FILES=10000, before this patch: 27399 ms
      NUM_FILES=10000, after this patch:   9093 ms (-66.8%)
      
      NUM_FILES=5000, before this patch:   9241 ms
      NUM_FILES=5000, after this patch:    4642 ms (-49.8%)
      
      NUM_FILES=2000, before this patch:   2550 ms
      NUM_FILES=2000, after this patch:    1788 ms (-29.9%)
      
      NUM_FILES=1000, before this patch:   1088 ms
      NUM_FILES=1000, after this patch:     905 ms (-16.9%)
      
      Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      88d2beec
    • F
      btrfs: pass the dentry to btrfs_log_new_name() instead of the inode · d5f5bd54
      Filipe Manana 提交于
      In the next patch in the series, there will be the need to access the old
      name, and its length, of an inode when logging the inode during a rename.
      So instead of passing the inode to btrfs_log_new_name() pass the dentry,
      because from the dentry we can get the inode, the name and its length.
      
      This will avoid passing 3 new parameters to btrfs_log_new_name() in the
      next patch - the name, its length and an index number. This way we end
      up passing only 1 new parameter, the index number.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d5f5bd54