1. 27 Oct, 2021 40 commits
    • Q
      btrfs: remove btrfs_bio_alloc() helper · cd8e0cca
      Qu Wenruo committed
      The helper btrfs_bio_alloc() is almost the same as btrfs_io_bio_alloc(),
      except it's allocating using BIO_MAX_VECS as @nr_iovecs, and initializes
      bio->bi_iter.bi_sector.
      
      However the naming itself is not using "btrfs_io_bio" to indicate its
      parameter is "struct btrfs_io_bio" and can be easily confused with
      "struct btrfs_bio".
      
      Considering assigning bio->bi_iter.bi_sector is such simple work and
      there are already tons of call sites doing that manually, there is no
      need to do that in a helper.
      
      Remove btrfs_bio_alloc() helper, and enhance btrfs_io_bio_alloc()
      function to provide a fail-safe value for its @nr_iovecs.
      
      And then replace all btrfs_bio_alloc() callers with
      btrfs_io_bio_alloc().
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cd8e0cca
    • Q
      btrfs: rename btrfs_bio to btrfs_io_context · 4c664611
      Qu Wenruo committed
      The structure btrfs_bio is used by two different sites:
      
      - bio->bi_private for mirror based profiles
        For those profiles (SINGLE/DUP/RAID1*/RAID10), this structure records
        how many mirrors are still pending, and saves the original endio
        function of the bio.
      
      - RAID56 code
        In that case, RAID56 only utilizes the stripes info, and no longer uses
        that to trace the pending mirrors.
      
      So btrfs_bio is not always bound to a bio, and contains more info for IO
      context, thus renaming it will make the naming less confusing.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4c664611
    • F
      btrfs: keep track of the last logged keys when logging a directory · dc287224
      Filipe Manana committed
      After the first time we log a directory in the current transaction, for
      each directory item in a changed leaf of the subvolume tree, we have to
      check if we previously logged the item, in order to overwrite it in case
      its data changed or skip it in case its data hasn't changed.
      
      Checking if we have logged each item before not only wastes time, but it
      also adds lock contention on the log tree. So in order to minimize the
      number of times we do such checks, keep track of the offset of the last
      key we logged for a directory and, on the next time we log the directory,
      skip the checks for any new keys that have an offset greater than the
      offset we have previously saved. This is especially effective for index
      keys, because the offset for these keys comes from a monotonically
      increasing counter.
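
The tracking described above can be modelled in a few lines of Python. This is only an illustrative sketch, not the kernel code: the `DirLogState` class, its fields and `log_dir_keys()` are hypothetical names, the log tree is reduced to a set of key offsets, and the "overwrite if the data changed" case is left out.

```python
from dataclasses import dataclass, field


@dataclass
class DirLogState:
    """Hypothetical per-directory logging state (not a kernel structure)."""
    logged: set = field(default_factory=set)  # key offsets already in the log
    last_logged_offset: int = -1              # highest offset logged so far
    lookups: int = 0                          # "was it logged?" checks done


def log_dir_keys(state, offsets):
    """Log key offsets (sorted ascending), skipping redundant lookups."""
    for off in offsets:
        if off > state.last_logged_offset:
            # Beyond anything logged before, so it cannot be in the log yet.
            state.logged.add(off)
            state.last_logged_offset = off
            continue
        # May have been logged earlier: the expensive lookup is still needed.
        state.lookups += 1
        if off not in state.logged:
            state.logged.add(off)
    return state


state = DirLogState()
log_dir_keys(state, [2, 3, 4])     # first log in the transaction
log_dir_keys(state, [3, 4, 5, 6])  # later log: only 3 and 4 need a lookup
print(state.lookups, state.last_logged_offset)
```

Because index key offsets only grow, every newly added dentry takes the cheap branch, which is why the commit message singles out index keys.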
      
      This patch is part of a patchset comprised of the following 5 patches:
      
        btrfs: remove root argument from btrfs_log_inode() and its callees
        btrfs: remove redundant log root assignment from log_dir_items()
        btrfs: factor out the copying loop of dir items from log_dir_items()
        btrfs: insert items in batches when logging a directory when possible
        btrfs: keep track of the last logged keys when logging a directory
      
      This is patch 5/5.
      
      The following test was used on a non-debug kernel to measure the impact
      it has on a directory fsync:
      
        $ cat test-dir-fsync.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_NEW_FILES=100000
        NUM_FILE_DELETES=1000
      
        mkfs.btrfs -f $DEV
        mount -o ssd $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        # fsync the directory, this will log the new dir items and the inodes
        # they point to, because these are new inodes.
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/testdir
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
      
        # sync to force transaction commit and wipeout the log.
        sync
      
        del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
        for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
            rm -f $MNT/testdir/file_$i
        done
      
        # fsync the directory, this will only log dir items, there are no
        # dentries pointing to new inodes.
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/testdir
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
      
        umount $MNT
      
      Test results with NUM_NEW_FILES set to 100 000 and 1 000 000:
      
      **** before patchset, 100 000 files, 1000 deletes ****
      
      dir fsync took 848 ms after adding 100000 files
      dir fsync took 175 ms after deleting 1000 files
      
      **** after patchset, 100 000 files, 1000 deletes ****
      
      dir fsync took 758 ms after adding 100000 files  (-11.2%)
      dir fsync took 63 ms after deleting 1000 files   (-94.1%)
      
      **** before patchset, 1 000 000 files, 1000 deletes ****
      
      dir fsync took 9945 ms after adding 1000000 files
      dir fsync took 473 ms after deleting 1000 files
      
      **** after patchset, 1 000 000 files, 1000 deletes ****
      
      dir fsync took 8677 ms after adding 1000000 files (-13.6%)
      dir fsync took 146 ms after deleting 1000 files   (-105.6%)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      dc287224
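
A note on reading the percentages in the results above: a reduction of more than 100% is not possible when measured against the "before" value, so the figures appear to be symmetric percent differences (the difference divided by the mean of the two measurements). That interpretation is an assumption, not something the commit message states; the sketch below shows that it reproduces the annotations to within about 0.1 percentage points. The function names are illustrative.

```python
def symmetric_pct_diff(before, after):
    """Signed difference relative to the mean of both values, in percent."""
    return (after - before) / ((before + after) / 2) * 100


def relative_pct_change(before, after):
    """Signed change relative to the 'before' value, in percent."""
    return (after - before) / before * 100


# (before_ms, after_ms) pairs copied from the results above.
results = {
    "add 100000 files": (848, 758),
    "delete 1000 of 100000 files": (175, 63),
    "add 1000000 files": (9945, 8677),
    "delete 1000 of 1000000 files": (473, 146),
}

for name, (before, after) in results.items():
    print(f"{name}: symmetric {symmetric_pct_diff(before, after):+.1f}%, "
          f"relative {relative_pct_change(before, after):+.1f}%")
```

Measured against the "before" value instead, the directory fsync after deletes drops by roughly 64% (175 ms to 63 ms) and 69% (473 ms to 146 ms).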
    • F
      btrfs: insert items in batches when logging a directory when possible · 086dcbfa
      Filipe Manana committed
      When logging a directory, we scan its directory items from the subvolume
      tree and then copy one by one into the log tree. This is not efficient
      since we generally are able to insert several items in a batch, using a
      single btree operation for adding several items at once. The reason we
      copy items one by one is that we must check if each item was previously
      logged in the current transaction, and if it was we either overwrite it
      or skip it in case its content did not change in the subvolume tree (this
      can happen only for dir item keys, but not for dir index keys), and doing
      such check makes it a bit cumbersome to attempt batch insertions.
      
      However the chances for doing batch insertions are very frequent and
      always happen when:
      
      1) Logging the directory for the first time in the current transaction,
         as none of the items exist in the log tree yet;
      
      2) Logging new dir index keys, because the offset for new dir index keys
         comes from a monotonically increasing counter. This means if we keep
         adding dentries to a directory, through creation of new files and
         sub-directories or by adding new links or renaming from some other
         directory into the one we are logging, all the new dir index keys
         have a new offset that is greater than the offset of any previously
         logged index keys, so we can insert them in batches into the log tree.
      
      For dir item keys, since their offset depends on the result of a hash
      function against the dentry's name, unless the directory is being logged
      for the first time in the current transaction, the chance of being able
      to insert the items in the log using batches is pretty much random and
      not predictable, as it depends on the names of the dentries, but it still
      happens often enough.
      
      So change directory logging to keep track of consecutive directory items
      that don't exist yet in the log and batch insert them.
      
      This patch is part of a patchset comprised of the following 5 patches:
      
        btrfs: remove root argument from btrfs_log_inode() and its callees
        btrfs: remove redundant log root assignment from log_dir_items()
        btrfs: factor out the copying loop of dir items from log_dir_items()
        btrfs: insert items in batches when logging a directory when possible
        btrfs: keep track of the last logged keys when logging a directory
      
      This is patch 4/5. The change log of the last patch (5/5) has performance
      results.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      086dcbfa
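
The batching idea in the commit above, collecting runs of consecutive directory items that are not yet in the log and inserting each run in one operation, can be sketched as follows. This is a simplified model under stated assumptions, not the kernel implementation: `plan_batches()` is a hypothetical name, keys are plain integers, and the log tree lookup is reduced to set membership.

```python
def plan_batches(leaf_keys, logged_keys):
    """Group consecutive leaf keys that are missing from the log into batches.

    leaf_keys:   keys in the order they appear in the subvolume tree leaf
    logged_keys: set of keys already present in the log tree
    Returns a list of ("insert", [keys...]) and ("check", key) operations.
    """
    ops, batch = [], []
    for key in leaf_keys:
        if key in logged_keys:
            # An already logged item ends the current run of new items.
            if batch:
                ops.append(("insert", batch))
                batch = []
            ops.append(("check", key))  # overwrite or skip, decided elsewhere
        else:
            batch.append(key)
    if batch:
        ops.append(("insert", batch))
    return ops


# First log in a transaction: nothing is in the log, so one batch suffices.
print(plan_batches([1, 2, 3], set()))
# A previously logged key in the middle splits the leaf into two batches.
print(plan_batches([10, 11, 12, 13, 14], {12}))
```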
    • F
      btrfs: factor out the copying loop of dir items from log_dir_items() · eb10d85e
      Filipe Manana committed
      In preparation for the next change, move the loop that processes a leaf
      and copies its directory items to the log, into a separate helper
      function. This makes the next change simpler and it also helps make
      log_dir_items() a bit shorter (especially after the next change).
      
      This patch is part of a patchset comprised of the following 5 patches:
      
        btrfs: remove root argument from btrfs_log_inode() and its callees
        btrfs: remove redundant log root assignment from log_dir_items()
        btrfs: factor out the copying loop of dir items from log_dir_items()
        btrfs: insert items in batches when logging a directory when possible
        btrfs: keep track of the last logged keys when logging a directory
      
      This is patch 3/5. The change log of the last patch (5/5) has performance
      results.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      eb10d85e
    • F
      btrfs: remove redundant log root assignment from log_dir_items() · d46fb845
      Filipe Manana committed
      At log_dir_items() we are assigning the exact same value to the local
      variable 'log', once when it's declared and once again shortly after.
      Remove the later assignment as it's pointless.
      
      This patch is part of a patchset comprised of the following 5 patches:
      
        btrfs: remove root argument from btrfs_log_inode() and its callees
        btrfs: remove redundant log root assignment from log_dir_items()
        btrfs: factor out the copying loop of dir items from log_dir_items()
        btrfs: insert items in batches when logging a directory when possible
        btrfs: keep track of the last logged keys when logging a directory
      
      This is patch 2/5. The change log of the last patch (5/5) has performance
      results.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d46fb845
    • F
      btrfs: remove root argument from btrfs_log_inode() and its callees · 90d04510
      Filipe Manana committed
      The root argument passed to btrfs_log_inode() is unnecessary, as it is
      always the root of the inode we are going to log. This root also gets
      unnecessarily propagated to several functions called by btrfs_log_inode(),
      and all of them take the inode as an argument as well. So just remove
      the root argument from these functions and have them get the root from
      the inode where needed.
      
      This patch is part of a patchset comprised of the following 5 patches:
      
        btrfs: remove root argument from btrfs_log_inode() and its callees
        btrfs: remove redundant log root assignment from log_dir_items()
        btrfs: factor out the copying loop of dir items from log_dir_items()
        btrfs: insert items in batches when logging a directory when possible
        btrfs: keep track of the last logged keys when logging a directory
      
      This is patch 1/5. The change log of the last patch (5/5) has performance
      results.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      90d04510
    • J
      btrfs: zoned: let the for_treelog test in the allocator stand out · 2d81eb1c
      Johannes Thumshirn committed
      The statement which decides if an extent allocation on a zoned device is
      for the dedicated tree-log block group or not and if we can use the block
      group we picked for this allocation is not easy to read but an important
      part of the allocator.
      
      Rewrite into an if condition instead of a plain boolean test to make it
      stand out more, like the version which tests for the dedicated
      data-relocation block group.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2d81eb1c
    • J
      btrfs: rename setup_extent_mapping in relocation code · 4b01c44f
      Johannes Thumshirn committed
      In btrfs code we have two functions called setup_extent_mapping, one in
      the extent_map code and one in the relocation code. While both are
      private to their respective implementation, this can still be confusing
      for the reader.
      
      So rename the version in relocation.c to setup_relocation_extent_mapping.
      No functional changes.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4b01c44f
    • J
      btrfs: zoned: allow preallocation for relocation inodes · 960a3166
      Johannes Thumshirn committed
      Now that we use a dedicated block group and regular writes for data
      relocation, we can preallocate the space needed for a relocated inode,
      just like we do in regular mode.
      
      Essentially this reverts commit 32430c61 ("btrfs: zoned: enable
      relocation on a zoned filesystem") as it is not needed anymore.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      960a3166
    • J
      btrfs: check for relocation inodes on zoned btrfs in should_nocow · 2adada88
      Johannes Thumshirn committed
      Prepare for allowing preallocation for relocation inodes.
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2adada88
    • J
      btrfs: zoned: use regular writes for relocation · e6d261e3
      Johannes Thumshirn committed
      Now that we have a dedicated block group for relocation, we can use
      REQ_OP_WRITE instead of REQ_OP_ZONE_APPEND for writing out the data on
      relocation.
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e6d261e3
    • J
      btrfs: zoned: only allow one process to add pages to a relocation inode · 35156d85
      Johannes Thumshirn committed
      Don't allow more than one process to add pages to a relocation inode on
      a zoned filesystem, otherwise we cannot guarantee the sequential write
      rule once we're filling preallocated extents on a zoned filesystem.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      35156d85
    • J
      btrfs: zoned: add a dedicated data relocation block group · c2707a25
      Johannes Thumshirn committed
      Relocation in a zoned filesystem can fail with a transaction abort with
      error -22 (EINVAL). This happens because the relocation code assumes that
      the extents we relocated the data to have the same size the source extents
      had and ensures this by preallocating the extents.
      
      But in a zoned filesystem we currently can't preallocate the extents as
      this would break the sequential write required rule. Therefore it can
      happen that the writeback process kicks in while we're still adding pages
      to a delalloc range and starts writing out dirty pages.
      
      This then creates destination extents that are smaller than the source
      extents, triggering the following safety check in get_new_location():
      
        if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {
                ret = -EINVAL;
                goto out;
        }
      
      Temporarily create a dedicated block group for the relocation process, so
      no non-relocation data writes can interfere with the relocation writes.
      
      This is needed so that we can switch the relocation process on a zoned
      filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a scheme
      like in a non-zoned filesystem using REQ_OP_WRITE and preallocation.
      
      Fixes: 32430c61 ("btrfs: zoned: enable relocation on a zoned filesystem")
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c2707a25
    • J
      btrfs: introduce btrfs_is_data_reloc_root · 37f00a6d
      Johannes Thumshirn committed
      There are several places in our codebase where we check if a root is the
      root of the data reloc tree and subsequent patches will introduce more.
      
      Factor out the check into a small helper function instead of open coding
      it multiple times.
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      37f00a6d
    • Q
      btrfs: unexport repair_io_failure() · 38d5e541
      Qu Wenruo committed
      Function repair_io_failure() is no longer used outside of extent_io.c since
      commit 8b9b6f25 ("btrfs: scrub: cleanup the remaining nodatasum
      fixup code"), which removes the last external caller.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      38d5e541
    • F
      btrfs: do not commit delayed inode when logging a file in full sync mode · f6df27dd
      Filipe Manana committed
      When logging a regular file in full sync mode, we are currently committing
      its delayed inode item. This is to ensure that we never miss copying the
      inode item, with its most up to date data, into the log tree.
      
      However that is not necessary since commit e4545de5 ("Btrfs: fix fsync
      data loss after append write"), because even if we don't find the leaf
      with the inode item when looking for leaves that changed in the current
      transaction, we end up logging the inode item later using the in-memory
      content. In case we find the leaf containing the inode item, we already
      end up using the in-memory inode for filling the inode item in the log
      tree, and not the inode item that is in the fs/subvolume tree, as it
      might not be up to date (copy_items() -> fill_inode_item()).
      
      So don't commit the delayed inode item, which brings a couple of benefits:
      
      1) Avoid writing the inode item to the fs/subvolume btree, saving time and
         reducing lock contention on the btree;
      
      2) In case no other item for the inode was changed, added or deleted in
         the same leaf where the inode item is located, we ended up copying
         all the items in that leaf to the log tree - it's harmless from a
         functional point of view, but it wastes time and log tree space.
      
      This patch is part of a patch set comprised of the following patches:
      
        btrfs: check if a log tree exists at inode_logged()
        btrfs: remove no longer needed checks for NULL log context
        btrfs: do not log new dentries when logging that a new name exists
        btrfs: always update the logged transaction when logging new names
        btrfs: avoid expensive search when dropping inode items from log
        btrfs: add helper to truncate inode items when logging inode
        btrfs: avoid expensive search when truncating inode items from the log
        btrfs: avoid search for logged i_size when logging inode if possible
        btrfs: avoid attempt to drop extents when logging inode for the first time
        btrfs: do not commit delayed inode when logging a file in full sync mode
      
      This is patch 10/10 and the following test results compare a branch with
      the whole patch set applied versus a branch without any of the patches
      applied.
      
      The following script was used to test dbench with 8 and 16 jobs on a
      machine with 12 cores, 64G of RAM, an NVMe device and using a non-debug
      kernel config (Debian's default):
      
        $ cat test.sh
        #!/bin/bash
      
        if [ $# -ne 1 ]; then
            echo "Use $0 NUM_JOBS"
            exit 1
        fi
      
        NUM_JOBS=$1
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-m single -d single"
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 120 $NUM_JOBS
      
        umount $MNT
      
      The results were the following:
      
      8 jobs, before patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4113896     0.009   238.665
       Close        3021699     0.001     0.590
       Rename        174215     0.082   238.733
       Unlink        830977     0.049   238.642
       Deltree           96     2.232     8.022
       Mkdir             48     0.003     0.005
       Qpathinfo    3729013     0.005     2.672
       Qfileinfo     653206     0.001     0.152
       Qfsinfo       683866     0.002     0.526
       Sfileinfo     335055     0.004     1.571
       Find         1441800     0.016     4.288
       WriteX       2049644     0.010     3.982
       ReadX        6449786     0.003     0.969
       LockX          13400     0.002     0.043
       UnlockX        13400     0.001     0.075
       Flush         288349     2.521   245.516
      
      Throughput 1075.73 MB/sec  8 clients  8 procs  max_latency=245.520 ms
      
      8 jobs, after patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4154282     0.009   156.675
       Close        3051450     0.001     0.843
       Rename        175912     0.072     4.444
       Unlink        839067     0.048    66.050
       Deltree           96     2.131     5.979
       Mkdir             48     0.002     0.004
       Qpathinfo    3765575     0.005     3.079
       Qfileinfo     659582     0.001     0.099
       Qfsinfo       690474     0.002     0.155
       Sfileinfo     338366     0.004     1.419
       Find         1455816     0.016     3.423
       WriteX       2069538     0.010     4.328
       ReadX        6512429     0.003     0.840
       LockX          13530     0.002     0.078
       UnlockX        13530     0.001     0.051
       Flush         291158     2.500   163.468
      
      Throughput 1105.45 MB/sec  8 clients  8 procs  max_latency=163.474 ms
      
      +2.7% throughput, -40.1% max latency
      
      16 jobs, before patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    5457602     0.033   337.098
       Close        4008979     0.002     2.018
       Rename        231051     0.323   254.054
       Unlink       1102209     0.202   337.243
       Deltree          160     6.521    31.720
       Mkdir             80     0.003     0.007
       Qpathinfo    4946147     0.014     6.988
       Qfileinfo     867440     0.001     1.642
       Qfsinfo       907081     0.003     1.821
       Sfileinfo     444433     0.005     2.053
       Find         1912506     0.067     7.854
       WriteX       2724852     0.018     7.428
       ReadX        8553883     0.003     2.059
       LockX          17770     0.003     0.350
       UnlockX        17770     0.002     0.627
       Flush         382533     2.810   353.691
      
      Throughput 1413.09 MB/sec  16 clients  16 procs  max_latency=353.696 ms
      
      16 jobs, after patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    5393156     0.034   303.181
       Close        3961986     0.002     1.502
       Rename        228359     0.320   253.379
       Unlink       1088920     0.206   303.409
       Deltree          160     6.419    30.088
       Mkdir             80     0.003     0.004
       Qpathinfo    4887967     0.015     7.722
       Qfileinfo     857408     0.001     1.651
       Qfsinfo       896343     0.002     2.147
       Sfileinfo     439317     0.005     4.298
       Find         1890018     0.073     8.347
       WriteX       2693356     0.018     6.373
       ReadX        8453485     0.003     3.836
       LockX          17562     0.003     0.486
       UnlockX        17562     0.002     0.635
       Flush         378023     2.802   315.904
      
      Throughput 1454.46 MB/sec  16 clients  16 procs  max_latency=315.910 ms
      
      +2.9% throughput, -11.3% max latency
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f6df27dd
    • F
      btrfs: avoid attempt to drop extents when logging inode for the first time · 5328b2a7
      Filipe Manana committed
      When logging an extent, in the fast fsync path, we always attempt to drop
      or trim any existing extents with a range that matches or overlaps the range
      of the extent we are about to log. We do that through a call to
      btrfs_drop_extents().
      
      However this is not needed when we are logging the inode for the first
      time in the current transaction, since we have no inode items of the
      inode in the log tree. Calling btrfs_drop_extents() does a deletion search
      on the log tree, which is expensive when we have concurrent tasks
      accessing the log tree because a deletion search always acquires a write
      lock on the extent buffers at levels 2, 1 and 0, adding significant lock
      contention, especially taking into account the height of a log tree rarely
      (if ever) goes beyond 2 or 3, due to its short life.
      
      So skip the call to btrfs_drop_extents() when the inode was not previously
      logged in the current transaction.
      
      This patch is part of a patch set comprised of the following patches:
      
        btrfs: check if a log tree exists at inode_logged()
        btrfs: remove no longer needed checks for NULL log context
        btrfs: do not log new dentries when logging that a new name exists
        btrfs: always update the logged transaction when logging new names
        btrfs: avoid expensive search when dropping inode items from log
        btrfs: add helper to truncate inode items when logging inode
        btrfs: avoid expensive search when truncating inode items from the log
        btrfs: avoid search for logged i_size when logging inode if possible
        btrfs: avoid attempt to drop extents when logging inode for the first time
        btrfs: do not commit delayed inode when logging a file in full sync mode
      
      This is patch 9/10 and test results are listed in the change log of the
      last patch in the set.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5328b2a7
    • F
      btrfs: avoid search for logged i_size when logging inode if possible · a5c733a4
      Filipe Manana committed
      If we are logging that an inode exists and the inode was not logged
      before, we can avoid searching in the log tree for the inode item since we
      know it does not exist. Such a search wastes time and adds more lock contention on
      the extent buffers of the log tree when there are other tasks that are
      logging other inodes.
      
      This patch is part of a patch set comprised of the following patches:
      
        btrfs: check if a log tree exists at inode_logged()
        btrfs: remove no longer needed checks for NULL log context
        btrfs: do not log new dentries when logging that a new name exists
        btrfs: always update the logged transaction when logging new names
        btrfs: avoid expensive search when dropping inode items from log
        btrfs: add helper to truncate inode items when logging inode
        btrfs: avoid expensive search when truncating inode items from the log
        btrfs: avoid search for logged i_size when logging inode if possible
        btrfs: avoid attempt to drop extents when logging inode for the first time
        btrfs: do not commit delayed inode when logging a file in full sync mode
      
      This is patch 8/10 and test results are listed in the change log of the
      last patch in the set.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a5c733a4
    • F
      btrfs: avoid expensive search when truncating inode items from the log · 4934a815
      Filipe Manana committed
      Whenever we are logging a file inode in full sync mode we call
      btrfs_truncate_inode_items() to delete items of the inode we may have
      previously logged.
      
      That results in doing a btree search for deletion, which is expensive
      because it always acquires write locks for extent buffers at levels 2, 1
      and 0, and it balances any node that is less than half full. Acquiring
      the write locks can block the task if the extent buffers are already
      locked by another task or block other tasks attempting to lock them,
      which is especially bad in case of log trees since they are small due to
      their short life, with a root node at a level typically not greater than
      level 2.
      
      If we know that we are logging the inode for the first time in the current
      transaction, we can skip the call to btrfs_truncate_inode_items(), avoiding
      the deletion search. This change does that.
      
      This patch is part of a patch set comprised of the following patches:
      
        btrfs: check if a log tree exists at inode_logged()
        btrfs: remove no longer needed checks for NULL log context
        btrfs: do not log new dentries when logging that a new name exists
        btrfs: always update the logged transaction when logging new names
        btrfs: avoid expensive search when dropping inode items from log
        btrfs: add helper to truncate inode items when logging inode
        btrfs: avoid expensive search when truncating inode items from the log
        btrfs: avoid search for logged i_size when logging inode if possible
        btrfs: avoid attempt to drop extents when logging inode for the first time
        btrfs: do not commit delayed inode when logging a file in full sync mode
      
      This is patch 7/10 and test results are listed in the change log of the
      last patch in the set.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4934a815
    • F
      btrfs: add helper to truncate inode items when logging inode · 8a2b3da1
      Filipe Manana authored
      Move the call to btrfs_truncate_inode_items(), and the surrounding retry
      loop, into a local helper function. This avoids some repetition and keeps
      the next change from becoming awkward due to excessive indentation.
      
      This patch is part of a patch set comprised of the following patches:
      
        btrfs: check if a log tree exists at inode_logged()
        btrfs: remove no longer needed checks for NULL log context
        btrfs: do not log new dentries when logging that a new name exists
        btrfs: always update the logged transaction when logging new names
        btrfs: avoid expensive search when dropping inode items from log
        btrfs: add helper to truncate inode items when logging inode
        btrfs: avoid expensive search when truncating inode items from the log
        btrfs: avoid search for logged i_size when logging inode if possible
        btrfs: avoid attempt to drop extents when logging inode for the first time
        btrfs: do not commit delayed inode when logging a file in full sync mode
      
      This is patch 6/10 and test results are listed in the change log of the
      last patch in the set.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8a2b3da1
    • F
      btrfs: avoid expensive search when dropping inode items from log · 88e221cd
      Filipe Manana authored
      Whenever we are logging a directory inode, logging that an inode exists or
      logging an inode that has changes in its references or xattrs, we attempt
      to delete items of this inode we may have previously logged (through calls
      to drop_objectid_items()).
      
      That attempt does a btree search for deletion, which is expensive because
      it always acquires write locks for extent buffers at levels 2, 1 and 0,
      and it balances any node that is less than half full. Acquiring the write
      locks can block the task if the extent buffers are already locked, or
      block other tasks attempting to lock them, which is especially bad in the
      case of log trees since they are small due to their short life, with a
      root node at a level typically not greater than level 2.
      
      If we know that we are logging the inode for the first time in the current
      transaction, we can skip the search. This change does that.
      
      This patch is part of a patch set comprised of the following patches:
      
        btrfs: check if a log tree exists at inode_logged()
        btrfs: remove no longer needed checks for NULL log context
        btrfs: do not log new dentries when logging that a new name exists
        btrfs: always update the logged transaction when logging new names
        btrfs: avoid expensive search when dropping inode items from log
        btrfs: add helper to truncate inode items when logging inode
        btrfs: avoid expensive search when truncating inode items from the log
        btrfs: avoid search for logged i_size when logging inode if possible
        btrfs: avoid attempt to drop extents when logging inode for the first time
        btrfs: do not commit delayed inode when logging a file in full sync mode
      
      This is patch 5/10 and test results are listed in the change log of the
      last patch in the set.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      88e221cd
    • F
      btrfs: always update the logged transaction when logging new names · 130341be
      Filipe Manana authored
      When we are logging a new name for an inode, due to a link or rename
      operation, if the inode has ancestor inodes that are new, created in the
      current transaction, we need to log that these inodes exist. To ensure
      that a subsequent explicit fsync on one of these ancestor inodes does
      sync the log, we don't set the logged_trans field of these inodes.
      This was done in commit 75b463d2 ("btrfs: do not commit logs and
      transactions during link and rename operations"), to avoid syncing a
      log after a rename or link operation.
      
      In order to allow for future changes to do some optimizations, change
      this behaviour to always update the logged_trans of any logged inode
      and to not update the last_log_commit of the inode if we are logging
      that it exists. This accomplishes the same objective with simpler
      logic, allowing for some optimizations in the next patches.
      
      So just do that simplification.
      
      This patch is part of a patch set comprised of the following patches:
      
        btrfs: check if a log tree exists at inode_logged()
        btrfs: remove no longer needed checks for NULL log context
        btrfs: do not log new dentries when logging that a new name exists
        btrfs: always update the logged transaction when logging new names
        btrfs: avoid expensive search when dropping inode items from log
        btrfs: add helper to truncate inode items when logging inode
        btrfs: avoid expensive search when truncating inode items from the log
        btrfs: avoid search for logged i_size when logging inode if possible
        btrfs: avoid attempt to drop extents when logging inode for the first time
        btrfs: do not commit delayed inode when logging a file in full sync mode
      
      This is patch 4/10 and test results are listed in the change log of the
      last patch in the set.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      130341be
    • F
      btrfs: do not log new dentries when logging that a new name exists · c48792c6
      Filipe Manana authored
      When logging a new name for an inode, due to a link or rename operation,
      we don't need to log all new dentries of the parent directories and their
      subdirectories. We only want to log the names of the inode and that any
      new parent directories exist. So in this case don't trigger logging of
      the new dentries, which is only needed when doing an explicit fsync on a
      directory or on a file which requires logging its parent directories.
      
      This avoids unnecessary work and reduces contention on the extent buffers
      of a log tree.
      
      This patch is part of a patch set comprised of the following patches:
      
        btrfs: check if a log tree exists at inode_logged()
        btrfs: remove no longer needed checks for NULL log context
        btrfs: do not log new dentries when logging that a new name exists
        btrfs: always update the logged transaction when logging new names
        btrfs: avoid expensive search when dropping inode items from log
        btrfs: add helper to truncate inode items when logging inode
        btrfs: avoid expensive search when truncating inode items from the log
        btrfs: avoid search for logged i_size when logging inode if possible
        btrfs: avoid attempt to drop extents when logging inode for the first time
        btrfs: do not commit delayed inode when logging a file in full sync mode
      
      This is patch 3/10 and test results are listed in the change log of the
      last patch in the set.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c48792c6
    • F
      btrfs: remove no longer needed checks for NULL log context · 289cffcb
      Filipe Manana authored
      Since commit 75b463d2 ("btrfs: do not commit logs and transactions
      during link and rename operations"), we always pass a non-NULL log context
      to btrfs_log_inode_parent() and therefore to all the functions that it
      calls. So remove the checks we have all over the place that test for a
      NULL log context, making the code shorter and easier to read, as well as
      reducing the size of the generated code.
      
      This patch is part of a patch set comprised of the following patches:
      
        btrfs: check if a log tree exists at inode_logged()
        btrfs: remove no longer needed checks for NULL log context
        btrfs: do not log new dentries when logging that a new name exists
        btrfs: always update the logged transaction when logging new names
        btrfs: avoid expensive search when dropping inode items from log
        btrfs: add helper to truncate inode items when logging inode
        btrfs: avoid expensive search when truncating inode items from the log
        btrfs: avoid search for logged i_size when logging inode if possible
        btrfs: avoid attempt to drop extents when logging inode for the first time
        btrfs: do not commit delayed inode when logging a file in full sync mode
      
      This is patch 2/10 and test results are listed in the change log of the
      last patch in the set.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      289cffcb
    • F
      btrfs: check if a log tree exists at inode_logged() · 1e0860f3
      Filipe Manana authored
      In case an inode was never logged since it was loaded from disk and was
      modified in the current transaction (its ->last_trans matches the ID of
      the current transaction), inode_logged() returns true even if there's no
      existing log tree. In this case we can simply check if a log tree exists
      and return false if it does not. This avoids a caller of inode_logged()
      doing some unnecessary, but harmless, work.
      
      For btrfs_log_new_name(), it avoids logging an inode in case it was
      never logged since it was loaded from disk and there is currently no log
      tree for the inode's root. For the remaining callers of inode_logged(),
      btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log(), it has
      no effect since they already check if a log tree exists through their
      calls to join_running_log_trans().
      
      So just add a check to inode_logged() to verify if a log tree exists, and
      return false if it does not.
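A rough model of the resulting check order is sketched below. The `sketch_` types and the function are hypothetical simplifications, not the kernel's definitions; the conservative fallback for an inode that was modified in the current transaction but has no recorded log state follows the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real structures carry far more state. */
struct sketch_root {
	bool has_log_tree;	/* does a log tree currently exist for this root? */
};

struct sketch_inode {
	const struct sketch_root *root;
	uint64_t logged_trans;	/* transaction that last logged the inode, 0 if unknown */
	uint64_t last_trans;	/* transaction that last modified the inode */
};

/* Returns true if the inode may have been logged in transaction cur_transid. */
static bool sketch_inode_logged(const struct sketch_inode *inode, uint64_t cur_transid)
{
	assert(inode->root != NULL);

	/* Known to have been logged in this transaction. */
	if (inode->logged_trans == cur_transid)
		return true;

	/* The added check: without a log tree, nothing can have been logged. */
	if (!inode->root->has_log_tree)
		return false;

	/*
	 * The inode may have been evicted and reloaded, losing its logged_trans.
	 * If it was modified in this transaction, conservatively assume it was logged.
	 */
	if (inode->logged_trans == 0 && inode->last_trans == cur_transid)
		return true;

	return false;
}
```

With this ordering, an inode modified in the current transaction is reported as not logged whenever its root has no log tree, which is the case the patch targets.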
      
      This patch is part of a patch set comprised of the following patches:
      
        btrfs: check if a log tree exists at inode_logged()
        btrfs: remove no longer needed checks for NULL log context
        btrfs: do not log new dentries when logging that a new name exists
        btrfs: always update the logged transaction when logging new names
        btrfs: avoid expensive search when dropping inode items from log
        btrfs: add helper to truncate inode items when logging inode
        btrfs: avoid expensive search when truncating inode items from the log
        btrfs: avoid search for logged i_size when logging inode if possible
        btrfs: avoid attempt to drop extents when logging inode for the first time
        btrfs: do not commit delayed inode when logging a file in full sync mode
      
      This is patch 1/10 and test results are listed in the change log of the
      last patch in the set.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1e0860f3
    • A
      btrfs: remove stale comment about the btrfs_show_devname · cdccc03a
      Anand Jain authored
      There were a few lockdep warnings because btrfs_show_devname() was using
      device_list_mutex, as recorded in the commits:
      
        0ccd0528 ("btrfs: fix a possible umount deadlock")
        779bf3fe ("btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex")
      
      And finally, commit 88c14590 ("btrfs: use RCU in btrfs_show_devname
      for device list traversal") removed the device_list_mutex from
      btrfs_show_devname for performance reasons.
      
      This patch removes a stale comment about the function
      btrfs_show_devname and device_list_mutex.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cdccc03a
    • A
      btrfs: update latest_dev when we create a sprout device · b7cb29e6
      Anand Jain authored
      When we add a device to a seed filesystem (sprouting), the result is a
      new filesystem (with a new fsid) on the added device. Update latest_dev
      so that /proc/self/mounts shows the correct device.
      
      Example:
      
        $ btrfstune -S1 /dev/vg/seed
        $ mount /dev/vg/seed /btrfs
        mount: /btrfs: WARNING: device write-protected, mounted read-only.
      
        $ cat /proc/self/mounts | grep btrfs
        /dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0
      
        $ btrfs dev add -f /dev/vg/new /btrfs
      
      Before:
      
        $ cat /proc/self/mounts | grep btrfs
        /dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0
      
      After:
      
        $ cat /proc/self/mounts | grep btrfs
        /dev/mapper/vg-new /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0
      Tested-by: Su Yue <l@damenly.su>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b7cb29e6
    • A
      btrfs: use latest_dev in btrfs_show_devname · 6605fd2f
      Anand Jain authored
      The test case btrfs/238 reports the warning below:
      
       WARNING: CPU: 3 PID: 481 at fs/btrfs/super.c:2509 btrfs_show_devname+0x104/0x1e8 [btrfs]
       CPU: 2 PID: 1 Comm: systemd Tainted: G        W  O 5.14.0-rc1-custom #72
       Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
       Call trace:
         btrfs_show_devname+0x108/0x1b4 [btrfs]
         show_mountinfo+0x234/0x2c4
         m_show+0x28/0x34
         seq_read_iter+0x12c/0x3c4
         vfs_read+0x29c/0x2c8
         ksys_read+0x80/0xec
         __arm64_sys_read+0x28/0x34
         invoke_syscall+0x50/0xf8
         do_el0_svc+0x88/0x138
         el0_svc+0x2c/0x8c
         el0t_64_sync_handler+0x84/0xe4
         el0t_64_sync+0x198/0x19c
      
      Reason:
      While btrfs_prepare_sprout() moves fs_devices::devices into
      fs_devices::seed_list, btrfs_show_devname() searches for the devices
      and finds none, leading to the warning above.
      
      Fix:
      latest_dev is updated according to the changes to the device list.
      That means we can use latest_dev->name to show the device name in
      /proc/self/mounts. The pointer will always be valid, as it's assigned
      before the device is deleted from the list in remove or replace.
      The RCU protection is sufficient as the device structure is freed after
      synchronization.
      Reported-by: Su Yue <l@damenly.su>
      Tested-by: Su Yue <l@damenly.su>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6605fd2f
    • A
      btrfs: convert latest_bdev type to btrfs_device and rename · d24fa5c1
      Anand Jain authored
      In preparation for fixing a bug in btrfs_show_devname().
      
      Convert the type of fs_devices::latest_bdev from struct block_device to
      struct btrfs_device and rename the member to fs_devices::latest_dev, so
      that btrfs_show_devname() can use fs_devices::latest_dev::name.
      Tested-by: Su Yue <l@damenly.su>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d24fa5c1
    • N
      btrfs: zoned: finish relocating block group · 7ae9bd18
      Naohiro Aota authored
      We will no longer write to a relocating block group. So, we can finish it
      now.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7ae9bd18
    • N
      btrfs: zoned: finish fully written block group · be1a1d7a
      Naohiro Aota authored
      If we have written to the zone capacity, the device automatically
      deactivates the zone. Sync up the block group side (the active BG list
      and the zone_is_active flag) with it.
      
      We need to do it for both data BGs and metadata BGs. On the data side,
      we add a hook to btrfs_finish_ordered_io(). On the metadata side, we use
      end_extent_buffer_writeback().
      
      To reduce excess lookups of a block group, we mark the last extent buffer
      in a block group with the EXTENT_BUFFER_ZONE_FINISH flag. This cannot be
      done for data (ordered_extent), because the address may change due to
      REQ_OP_ZONE_APPEND.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      be1a1d7a
    • N
      btrfs: zoned: avoid chunk allocation if active block group has enough space · a85f05e5
      Naohiro Aota authored
      The current extent allocator tries to allocate a new block group when the
      existing block groups do not have enough space. On a ZNS device, a new
      block group means a new active zone. If the number of active zones has
      already reached max_active_zones, activating a new zone requires
      finishing an existing zone, which wastes the free space left in it.
      
      So, instead, it should reuse the existing active block groups as much as
      possible when we can't activate any other zones without sacrificing an
      already activated block group.
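The decision described above might be sketched as follows. This is an illustrative model only: `toy_bg`, `toy_should_alloc_chunk()` and their parameters are hypothetical names, and the real allocator weighs more conditions than the two modeled here (remaining active zones and free space in already active block groups).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a block group on a zoned device. */
struct toy_bg {
	bool active;		/* are the underlying zones active? */
	uint64_t free_bytes;	/* space still available for allocation */
};

/*
 * Returns true if allocating a new chunk (and thus activating a new zone)
 * is acceptable, false if the caller should reuse an active block group.
 */
static bool toy_should_alloc_chunk(const struct toy_bg *bgs, size_t nr_bgs,
				   int active_zones_left, uint64_t min_alloc_size)
{
	size_t i;

	/* A zone can still be activated without finishing another one. */
	if (active_zones_left > 0)
		return true;

	/* No zone budget left: prefer any active block group with enough room. */
	for (i = 0; i < nr_bgs; i++) {
		if (bgs[i].active && bgs[i].free_bytes >= min_alloc_size)
			return false;
	}

	/* Nothing reusable; a new chunk is needed even at the cost of a finish. */
	return true;
}
```

In this model a request that fits into an active block group never triggers a new chunk once the active zone budget is exhausted, so no partially written zone has to be finished early.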
      
      While at it, I converted find_free_extent_update_loop() to check the
      found_extent() case early and made the other conditions simpler.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a85f05e5
    • N
      btrfs: move ffe_ctl one level up · a12b0dc0
      Naohiro Aota authored
      We are passing too many variables as it is from btrfs_reserve_extent() to
      find_free_extent(). The next commit will add min_alloc_size to ffe_ctl, and
      that means another pass-through argument. Take this opportunity to move
      ffe_ctl one level up and drop the redundant arguments.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a12b0dc0
    • N
      btrfs: zoned: activate new block group · eb66a010
      Naohiro Aota authored
      Activate a new block group at btrfs_make_block_group(). We do not check
      the return value. If activation fails, we can try again later during the
      actual extent allocation phase.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      eb66a010
    • N
      btrfs: zoned: activate block group on allocation · 2e654e4b
      Naohiro Aota authored
      Activate a block group when trying to allocate an extent from it. We check
      the read-only and no-space-left cases before trying to activate a block
      group, so as not to consume active zones needlessly.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2e654e4b
    • N
      btrfs: zoned: load active zone info for block group · 68a384b5
      Naohiro Aota authored
      Load the activeness of the underlying zones of a block group. When the
      underlying zones are active, we add the block group to the
      fs_info->zone_active_bgs list.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      68a384b5
    • N
      btrfs: zoned: implement active zone tracking · afba2bc0
      Naohiro Aota authored
      Add a zone_is_active flag to btrfs_block_group. This flag indicates the
      underlying zones are all active. Such zone active block groups are tracked
      by the fs_info->zone_active_bgs list.
      
      btrfs_dev_{set,clear}_active_zone() take responsibility for the underlying
      device part. They set/clear the bitmap to indicate zone activeness and
      track how many more zones we can still activate.
      
      btrfs_zone_{activate,finish}() take responsibility for the logical part and
      the list management. In addition, btrfs_zone_finish() waits for any writes
      on it and sends REQ_OP_ZONE_FINISH to the zone.
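The device-side bookkeeping (a bitmap of active zones plus a count of activations left) can be sketched roughly like this. The `toy_` names and the fixed 64-zone bitmap are assumptions made for illustration; they are not the kernel's data structures or function signatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_ZONES 64	/* assumption: one 64-bit word is enough for the sketch */

/* Hypothetical per-device zone state. */
struct toy_zoned_dev {
	uint64_t active_zones;	/* bit i set means zone i is active */
	int active_zones_left;	/* how many more zones may be activated */
};

static void toy_dev_init(struct toy_zoned_dev *dev, int max_active_zones)
{
	dev->active_zones = 0;
	dev->active_zones_left = max_active_zones;
}

/* Marks a zone active; returns false if the active zone limit is reached. */
static bool toy_dev_set_active_zone(struct toy_zoned_dev *dev, unsigned int zone)
{
	uint64_t bit;

	assert(zone < TOY_NR_ZONES);
	bit = UINT64_C(1) << zone;

	if (dev->active_zones & bit)
		return true;	/* already active, consumes nothing */
	if (dev->active_zones_left == 0)
		return false;	/* limit reached, caller must finish a zone first */

	dev->active_zones |= bit;
	dev->active_zones_left--;
	return true;
}

/* Marks a zone inactive (for example after it was finished) and returns its slot. */
static void toy_dev_clear_active_zone(struct toy_zoned_dev *dev, unsigned int zone)
{
	uint64_t bit;

	assert(zone < TOY_NR_ZONES);
	bit = UINT64_C(1) << zone;

	if (!(dev->active_zones & bit))
		return;

	dev->active_zones &= ~bit;
	dev->active_zones_left++;
}
```

In this model, activating an already active zone is a no-op, and clearing a zone gives its slot back so that another zone can be activated.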
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      afba2bc0
    • N
      btrfs: zoned: introduce physical_map to btrfs_block_group · dafc340d
      Naohiro Aota authored
      We will use a block group's physical location to track active zones and
      finish fully written zones in the following commits. Since zone
      activation is done in the extent allocation context, which already holds
      the tree locks, we can't query the chunk tree for the physical locations.
      So, copy the location info into a block group and use it for activation.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      dafc340d
    • N
      btrfs: zoned: load active zone information from devices · ea6f8ddc
      Naohiro Aota authored
      The ZNS specification defines a limit on the number of zones that can be in
      the implicit open, explicit open or closed conditions. Any zone in such a
      condition is defined as an active zone and corresponds to a zone that is
      being written or that has been only partially written. If the maximum
      number of active zones is reached, we must either reset or finish some
      active zones before being able to choose other zones for storing data.
      
      Load queue_max_active_zones() and track the number of active zones left on
      the device.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ea6f8ddc