1. 26 September 2022, 5 commits
• btrfs: use delayed items when logging a directory · 30b80f3c
  By Filipe Manana
      When logging a directory we start by flushing all its delayed items.
      That results in adding dir index items to the subvolume btree, for new
      dentries, and removing dir index items from the subvolume btree for any
      dentries that were deleted.
      
This makes it straightforward to log a directory simply by iterating over
all the modified subvolume btree leaves, especially when we used to log
both dir index keys and dir item keys (before commit 339d0354
("btrfs: only copy dir index keys when logging a directory")) and when we
used to copy old dir index entries for leaves modified in the current
transaction (before commit 732d591a ("btrfs: stop copying old dir
items when logging a directory")).
      
      From an efficiency point of view this has a couple of drawbacks:
      
1) Adds extra latency, due to copying delayed items to the subvolume btree
   and deleting dir index items from the btree.

   Further, if other tasks are accessing the btree, which is common
   (syscalls like creat, mkdir, rename, link, unlink and truncate, reflink
   operations, finishing an ordered extent, etc.), lock contention can
   cause further delays, both for the task logging a directory and for
   the other tasks accessing the btree;
      
      2) More time spent overall flushing delayed items, if after logging the
         directory further changes are done to the directory in the same
         transaction.
      
   For example, if we add 10 dentries to a directory, fsync it, add 10
   more dentries, fsync it again, then add 10 more dentries and fsync it
   again, then we end up inserting 3 batches of 10 items to the subvolume
   btree. With the changes from this patch, we flush all the delayed items
   to the btree only once - a single batch of 30 items, and outside the
   logging code (at transaction commit or when delayed items are flushed
   asynchronously).
      
This change simply skips the flushing of delayed items every time we log a
directory. Instead we copy the delayed insertion items directly to the log
tree and delete delayed deletion items directly from the log tree. This
avoids first changing the subvolume btree and then scanning it for new
items to copy from it to the log tree, and detecting deletions by
observing gaps in consecutive dir index keys in subvolume btree leaves.
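
A rough sketch of the new flow (the helper names here are illustrative,
not the exact kernel functions):

   /* Log a directory's delayed items without flushing them first. */
   static int log_delayed_items(struct btrfs_trans_handle *trans,
                                struct btrfs_root *log,
                                struct list_head *ins_list,
                                struct list_head *del_list)
   {
           struct btrfs_delayed_item *item;

           /* Pending dir index insertions: copy them into the log tree. */
           list_for_each_entry(item, ins_list, log_list)
                   copy_delayed_item_to_log(trans, log, item);

           /* Pending dir index deletions: drop them from the log tree. */
           list_for_each_entry(item, del_list, log_list)
                   delete_delayed_item_from_log(trans, log, item);

           return 0;
   }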
      
Running the following tests on a non-debug kernel (Debian's default kernel
config), on a box with an NVMe device, a 12-core Intel CPU and 64G of RAM,
      produced the results below.
      
      The results compare a branch without this patch and all the other patches
      it depends on versus the same branch with the patchset applied.
      
      The patchset is comprised of the following patches:
      
        btrfs: don't drop dir index range items when logging a directory
        btrfs: remove the root argument from log_new_dir_dentries()
        btrfs: update stale comment for log_new_dir_dentries()
        btrfs: free list element sooner at log_new_dir_dentries()
        btrfs: avoid memory allocation at log_new_dir_dentries() for common case
        btrfs: remove root argument from btrfs_delayed_item_reserve_metadata()
        btrfs: store index number instead of key in struct btrfs_delayed_item
        btrfs: remove unused logic when looking up delayed items
        btrfs: shrink the size of struct btrfs_delayed_item
        btrfs: search for last logged dir index if it's not cached in the inode
        btrfs: move need_log_inode() to above log_conflicting_inodes()
        btrfs: move log_new_dir_dentries() above btrfs_log_inode()
        btrfs: log conflicting inodes without holding log mutex of the initial inode
        btrfs: skip logging parent dir when conflicting inode is not a dir
        btrfs: use delayed items when logging a directory
      
      Custom test script for testing time spent at btrfs_log_inode():
      
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
      
         # Total number of files to create in the test directory.
         NUM_FILES=10000
         # Fsync after creating or renaming N files.
         FSYNC_AFTER=100
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         mount -o ssd $DEV $MNT
      
         TEST_DIR=$MNT/testdir
         mkdir $TEST_DIR
      
         echo "Creating files..."
         for ((i = 1; i <= $NUM_FILES; i++)); do
                 echo -n > $TEST_DIR/file_$i
                 if (( ($i % $FSYNC_AFTER) == 0 )); then
                         xfs_io -c "fsync" $TEST_DIR
                 fi
         done
      
         sync
      
         echo "Renaming files..."
         for ((i = 1; i <= $NUM_FILES; i++)); do
                 mv $TEST_DIR/file_$i $TEST_DIR/file_$i.renamed
                 if (( ($i % $FSYNC_AFTER) == 0 )); then
                         xfs_io -c "fsync" $TEST_DIR
                 fi
         done
      
         umount $MNT
      
      And using the following bpftrace script to capture the total time that is
      spent at btrfs_log_inode():
      
         #!/usr/bin/bpftrace
      
         k:btrfs_log_inode
         {
                 @start_log_inode[tid] = nsecs;
         }
      
         kr:btrfs_log_inode
         /@start_log_inode[tid]/
         {
                 $dur = (nsecs - @start_log_inode[tid]) / 1000;
                 @btrfs_log_inode_total_time = sum($dur);
                 delete(@start_log_inode[tid]);
         }
      
         END
         {
                 clear(@start_log_inode);
         }
      
      Result before applying patchset:
      
         @btrfs_log_inode_total_time: 622642
      
      Result after applying patchset:
      
         @btrfs_log_inode_total_time: 354134    (-43.1% time spent)
      
      The following dbench script was also used for testing:
      
         #!/bin/bash
      
         NUM_JOBS=$(nproc --all)
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
         MOUNT_OPTIONS="-o ssd"
         MKFS_OPTIONS="-O no-holes -R free-space-tree"
      
         echo "performance" | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $MKFS_OPTIONS $DEV
         mount $MOUNT_OPTIONS $DEV $MNT
      
         dbench -D $MNT --skip-cleanup -t 120 -S $NUM_JOBS
      
         umount $MNT
      
      Before patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    3322265     0.034    21.032
       Close        2440562     0.002     0.994
       Rename        140664     1.150   269.633
       Unlink        670796     1.093   269.678
       Deltree           96     5.481    15.510
       Mkdir             48     0.004     0.052
       Qpathinfo    3010924     0.014     8.127
       Qfileinfo     528055     0.001     0.518
       Qfsinfo       552113     0.003     0.372
       Sfileinfo     270575     0.005     0.688
       Find         1164176     0.052    13.931
       WriteX       1658537     0.019     5.918
       ReadX        5207412     0.003     1.034
       LockX          10818     0.003     0.079
       UnlockX        10818     0.002     0.313
       Flush         232811     1.027   269.735
      
      Throughput 869.867 MB/sec (sync dirs)  12 clients  12 procs  max_latency=269.741 ms
      
      After patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4152738     0.029    20.863
       Close        3050770     0.002     1.119
       Rename        175829     0.871   211.741
       Unlink        838447     0.845   211.724
       Deltree          120     4.798    14.162
       Mkdir             60     0.003     0.005
       Qpathinfo    3763807     0.011     4.673
       Qfileinfo     660111     0.001     0.400
       Qfsinfo       690141     0.003     0.429
       Sfileinfo     338260     0.005     0.725
       Find         1455273     0.046     6.787
       WriteX       2073307     0.017     5.690
       ReadX        6509193     0.003     1.171
       LockX          13522     0.003     0.077
       UnlockX        13522     0.002     0.125
       Flush         291044     0.811   211.631
      
      Throughput 1089.27 MB/sec (sync dirs)  12 clients  12 procs  max_latency=211.750 ms
      
      (+25.2% throughput, -21.5% max latency)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: shrink the size of struct btrfs_delayed_item · 4c469798
  By Filipe Manana
      Currently struct btrfs_delayed_item has a base size of 96 bytes, but its
      size can be decreased by doing the following 2 tweaks:
      
      1) Change data_len from u32 to u16. Our maximum possible leaf size is 64K,
         so the data_len can never be larger than that, and in fact it is always
         much smaller than that. The max length for a dentry's name is ensured
         at the VFS level (PATH_MAX, 4096 bytes) and in struct btrfs_inode_ref
         and btrfs_dir_item we use a u16 to store the name's length;
      
2) Change 'ins_or_del' to a 1 bit enum, which is all we need since it
   can only have 2 values. After this there is also no longer a need to
   BUG_ON() before using 'ins_or_del' in several places. Also rename the
   field from 'ins_or_del' to 'type', which is clearer.
      
      These two tweaks decrease the size of struct btrfs_delayed_item from 96
      bytes down to 88 bytes. A previous patch already reduced the size of this
      structure by 16 bytes, but an upcoming change will increase its size by
      16 bytes (adding a struct list_head element).
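
The resulting layout is, roughly, the following sketch (only the fields
touched by this patch are shown; the rest of the structure is omitted):

   enum btrfs_delayed_item_type {
           BTRFS_DELAYED_INSERTION_ITEM,
           BTRFS_DELAYED_DELETION_ITEM,
   };

   struct btrfs_delayed_item {
           /* ... rbtree node, list heads, reservation, refcount ... */
           u16 data_len;                            /* was u32 */
           enum btrfs_delayed_item_type type : 1;   /* was int ins_or_del */
           char data[];
   };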
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: remove unused logic when looking up delayed items · 4cbf37f5
  By Filipe Manana
      All callers pass NULL to the 'prev' and 'next' arguments of the function
      __btrfs_lookup_delayed_item(), so remove these arguments. Also, remove
      the unnecessary wrapper __btrfs_lookup_delayed_insertion_item(), making
      btrfs_delete_delayed_insertion_item() directly call
      __btrfs_lookup_delayed_item().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: store index number instead of key in struct btrfs_delayed_item · 96d89923
  By Filipe Manana
All delayed items are for dir index keys, so there's really no point in
having an embedded struct btrfs_key in struct btrfs_delayed_item, which
makes the structure use more space than necessary (and adds a hole of 7
bytes).
      
      So replace the key field with an index number (u64), which reduces the
      size of struct btrfs_delayed_item from 112 bytes down to 96 bytes.
      
      Some upcoming work will increase the structure size by 16 bytes, so this
      change compensates for that future size increase.
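
A sketch of the change (surrounding fields omitted):

   struct btrfs_delayed_item {
           /*
            * Replaces 'struct btrfs_key key': the objectid is always the
            * directory's inode number and the type is always
            * BTRFS_DIR_INDEX_KEY, so only the key offset (the dir index
            * number) needs to be stored.
            */
           u64 index;
           /* ... */
   };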
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: remove root argument from btrfs_delayed_item_reserve_metadata() · df492881
  By Filipe Manana
      The root argument of btrfs_delayed_item_reserve_metadata() is used only
      to get the fs_info object, but we already have a transaction handle, which
      we can use to get the fs_info. So remove the root argument.
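
A sketch of the simplified signature, with the fs_info taken from the
transaction handle:

   static int btrfs_delayed_item_reserve_metadata(struct btrfs_trans_handle *trans,
                                                  struct btrfs_delayed_item *item)
   {
           struct btrfs_fs_info *fs_info = trans->fs_info;
           /* ... */
   }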
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2. 25 July 2022, 11 commits
• btrfs: batch up release of reserved metadata for delayed items used for deletion · 1f4f639f
  By Nikolay Borisov
With Filipe's recent rework of the delayed inode code, one aspect which
isn't batched is the release of the reserved metadata of a delayed inode's
deletion items. With this patch on top of Filipe's rework, and running the
same test as provided in the description of the patch titled
"btrfs: improve batch deletion of delayed dir index items", I observe
the following change in the number of calls to btrfs_block_rsv_release:
      
      Before this change:
      - block_rsv_release:                      1004
      - btrfs_delete_delayed_items_total_time: 14602
      - delete_batches:                          505
      
      After:
      - block_rsv_release:                       510
      - btrfs_delete_delayed_items_total_time: 13643
      - delete_batches:                          507
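
A sketch of the batched release (simplified; the list and field names
follow the delayed item structure, the rest is illustrative):

   /* Sum the reservations of the whole deletion batch and release them
    * with a single call instead of once per item. */
   u64 total_reserved = 0;

   list_for_each_entry(item, &batch_list, tree_list)
           total_reserved += item->bytes_reserved;

   btrfs_block_rsv_release(fs_info, &fs_info->delayed_block_rsv,
                           total_reserved, NULL);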
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: do not batch insert non-consecutive dir indexes during log replay · 71b68e9e
  By Josef Bacik
While running generic/475 in a loop I got the following error:
      
      BTRFS critical (device dm-11): corrupt leaf: root=5 block=31096832 slot=69, bad key order, prev (263 96 531) current (263 96 524)
      <snip>
       item 65 key (263 96 517) itemoff 14132 itemsize 33
       item 66 key (263 96 523) itemoff 14099 itemsize 33
       item 67 key (263 96 525) itemoff 14066 itemsize 33
       item 68 key (263 96 531) itemoff 14033 itemsize 33
       item 69 key (263 96 524) itemoff 14000 itemsize 33
      
      As you can see here we have 3 dir index keys with the dir index value of
      523, 524, and 525 inserted between 517 and 524.  This occurs because our
      dir index insertion code will bulk insert all dir index items on the
      node regardless of their actual key value.
      
      This makes sense on a normally running system, because if there's a gap
      in between the items there was a deletion before the item was inserted,
      so there's not going to be an overlap of the dir index items that need
      to be inserted and what exists on disk.
      
      However during log replay this isn't necessarily true, we could have any
      number of dir indexes in the tree already.
      
Fix this by seeing if we're replaying the log, and if we are, simply
skip batching if there's a gap in the key space.
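
A sketch of the check (BTRFS_FS_LOG_RECOVERING is the existing flag that
is set while the log is being replayed; the batching loop is simplified):

   const bool continuous_keys_only = test_bit(BTRFS_FS_LOG_RECOVERING,
                                              &fs_info->flags);
   /* ... while extending the insertion batch with the next item ... */
   if (continuous_keys_only && next->key.offset != curr->key.offset + 1)
           break;  /* gap in the index space: do not batch across it */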
      
The file system was left broken by the fstest. I tested this patch
against the broken fs to make sure it replayed the log properly, and
then ran btrfs check on the file system after the log replay to verify
everything was ok.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: reduce amount of reserved metadata for delayed item insertion · 763748b2
  By Filipe Manana
Whenever we want to create a new dir index item (when creating an inode,
creating a hard link, or renaming a file) we reserve 1 unit of metadata space
      for it in a transaction (that's 256K for a node/leaf size of 16K), and
      then create a delayed insertion item for it to be added later to the
      subvolume's tree. That unit of metadata is kept until the delayed item
      is inserted into the subvolume tree, which may take a while to happen
      (in the worst case, it's done only when the transaction commits). If we
      have multiple dir index items to insert for the same directory, say N
      index items, and they all fit in a single leaf of metadata, then we are
      holding N units of reserved metadata space when all we need is 1 unit.
      
This change addresses that: whenever a new delayed dir index item is
added, we release the unit of metadata the caller reserved when it
started the transaction, if adding that new dir index item does not
result in touching one more metadata leaf; otherwise the reservation
is kept by transferring it from the transaction block reserve to the
delayed items block reserve, just like before. Given that with a leaf
      size of 16K we can have a few hundred dir index items in a single leaf
      (the exact value depends on file name lengths), this reduces pressure on
      metadata reservation by releasing unnecessary space much sooner.
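
A sketch of the decision (the condition helper is hypothetical, the two
block reserve calls are the existing ones):

   if (!item_starts_new_metadata_leaf(node, item)) {
           /* The already reserved delayed space covers the leaf this
            * item lands in: give the caller's unit back right away. */
           btrfs_block_rsv_release(fs_info, trans_rsv, num_bytes, NULL);
   } else {
           /* First item touching one more leaf: keep the reservation by
            * moving it to the delayed items block reserve, as before. */
           ret = btrfs_block_rsv_migrate(trans_rsv, delayed_rsv,
                                         num_bytes, true);
   }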
      
The following fs_mark test showed some improvement when creating many
files in parallel on a machine running a non-debug kernel (Debian's
default kernel config) with 12 cores:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        FILES=100000
        THREADS=$(nproc --all)
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        OPTS="-S 0 -L 10 -n $FILES -s 0 -t $THREADS -k"
        for ((i = 1; i <= $THREADS; i++)); do
            OPTS="$OPTS -d $MNT/d$i"
        done
      
        fs_mark $OPTS
      
        umount $MNT
      
      Before:
      
      FSUse%        Count         Size    Files/sec     App Overhead
           2      1200000            0     225991.3          5465891
           4      2400000            0     345728.1          5512106
           4      3600000            0     346959.5          5557653
           8      4800000            0     329643.0          5587548
           8      6000000            0     312657.4          5606717
           8      7200000            0     281707.5          5727985
          12      8400000            0      88309.8          5020422
          12      9600000            0      85835.9          5207496
          16     10800000            0      81039.2          5404964
          16     12000000            0      58548.6          5842468
      
      After:
      
      FSUse%        Count         Size    Files/sec     App Overhead
           2      1200000            0     230604.5          5778375
           4      2400000            0     348908.3          5508072
           4      3600000            0     357028.7          5484337
           6      4800000            0     342898.3          5565703
           6      6000000            0     314670.8          5751555
           8      7200000            0     282548.2          5778177
          12      8400000            0      90844.9          5306819
          12      9600000            0      86963.1          5304689
          16     10800000            0      89113.2          5455248
          16     12000000            0      86693.5          5518933
      
      The "after" results are after applying this patch and all the other
      patches in the same patchset, which is comprised of the following
      changes:
      
        btrfs: balance btree dirty pages and delayed items after a rename
        btrfs: free the path earlier when creating a new inode
        btrfs: balance btree dirty pages and delayed items after clone and dedupe
        btrfs: add assertions when deleting batches of delayed items
        btrfs: deal with deletion errors when deleting delayed items
        btrfs: refactor the delayed item deletion entry point
        btrfs: improve batch deletion of delayed dir index items
        btrfs: assert that delayed item is a dir index item when adding it
        btrfs: improve batch insertion of delayed dir index items
        btrfs: do not BUG_ON() on failure to reserve metadata for delayed item
        btrfs: set delayed item type when initializing it
        btrfs: reduce amount of reserved metadata for delayed item insertion
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: set delayed item type when initializing it · c9d02ab4
  By Filipe Manana
Currently we set the type of a delayed item only after successfully
inserting it into its respective rbtree. This is fine, as the type
is not used anywhere before that point, but the next patch in the
series will need to check the type of a delayed item before inserting
it into an rbtree.

So set the type of a delayed item immediately after allocating it.
This also makes the trivial wrappers for adding insertion and deletion
items useless, so remove them as well.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: do not BUG_ON() on failure to reserve metadata for delayed item · 3bae13e9
  By Filipe Manana
      At btrfs_insert_delayed_dir_index(), we don't expect the metadata
      reservation for the delayed dir index item insertion to fail, because the
      caller is supposed to have reserved 1 unit of metadata space for that.
      All callers are able to deal with an error in case that happens, so there
      is no need for something so drastic as a BUG_ON() in case of failure.
Instead just emit a warning, so that it's easily noticed during development
(fstests in particular), and return the error to the caller.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: improve batch insertion of delayed dir index items · 06ac264f
  By Filipe Manana
      Currently we group delayed dir index items for insertion as a single batch
      (a single btree operation) as long as their keys are sequential in the key
      space.
      
      For example we have delayed index items for the following index keys:
      
         10, 11, 12, 15, 16, 20, 21
      
      We end up building three batches:
      
      1) First one for index keys 10, 11 and 12;
      2) Second one for index keys 15 and 16;
      3) Third one for index keys 20 and 21.
      
      However, since the dir index numbers come from a monotonically increasing
      counter and are never reused, we could group all these items into a single
      batch. The existence of holes in the sequence happens only when we had
      delayed dir index items for insertion that got deleted before they were
      flushed to the subvolume's tree.
      
The delayed items are stored in an rbtree based on their key order, so
we can just group items into a batch as long as they all fit in a leaf,
and ignore any gap in the key offsets (index numbers) between two
consecutive items. This is more efficient and reduces the amount of
time spent running delayed items when there are gaps between dir
index items.
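
In pseudo-C, the batching condition now looks roughly like this (names
are illustrative):

   u32 total_size = curr_item_size;

   while ((next = next_delayed_item(curr))) {
           if (total_size + item_size(next) > leaf_capacity)
                   break;  /* the batch already fills a leaf */
           /* No 'next->key.offset == curr->key.offset + 1' check
            * anymore: gaps in the index sequence are fine. */
           total_size += item_size(next);
           curr = next;
           nbatched++;
   }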
      
      For example running the following test script:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        NUM_FILES=100
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $NUM_FILES; i++)); do
             echo -n > $MNT/testdir/file_$i
        done
      
        # Now delete every other file, to create gaps in the dir index keys.
        for ((i = 1; i <= $NUM_FILES; i += 2)); do
            rm -f $MNT/testdir/file_$i
        done
      
        start=$(date +%s%N)
        sync
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
      
        echo -e "\nsync took $dur milliseconds"
      
        umount $MNT
      
      While having the following bpftrace script running in another shell:
      
        $ cat bpf-delayed-items-inserts.sh
        #!/usr/bin/bpftrace
      
        /* Must add 'noinline' to btrfs_insert_delayed_items(). */
        k:btrfs_insert_delayed_items
        {
            @start_insert_delayed_items[tid] = nsecs;
        }
      
        k:btrfs_insert_empty_items
        /@start_insert_delayed_items[tid]/
        {
           @insert_batches = count();
        }
      
        kr:btrfs_insert_delayed_items
        /@start_insert_delayed_items[tid]/
        {
            $dur = (nsecs - @start_insert_delayed_items[tid]) / 1000;
            @btrfs_insert_delayed_items_total_time = sum($dur);
            delete(@start_insert_delayed_items[tid]);
        }
      
      Before this change:
      
      @btrfs_insert_delayed_items_total_time: 576
      @insert_batches: 51
      
      After this change:
      
      @btrfs_insert_delayed_items_total_time: 174
      @insert_batches: 2
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: assert that delayed item is a dir index item when adding it · a176affe
  By Filipe Manana
All delayed items are for dir index items; we don't support any other item
types at the moment. So simplify __btrfs_add_delayed_item() and add an
assertion that checks the item's key type. This also allows the next
change to be simpler and avoid checking key types. In case we add support
for different item types in the future, we'll hit the assertion
during development and be able to adjust any code that assumes delayed
items are always associated with dir index items.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: improve batch deletion of delayed dir index items · 4bd02d90
  By Filipe Manana
      Currently we group delayed dir index items for deletion in a single batch
      (single btree operation) as long as they all exist in the same leaf and as
      long as their keys are sequential in the key space. For example if we have
      a leaf that has dir index items with offsets:
      
          2, 3, 4, 6, 7, 10
      
      And we have delayed dir index items for deleting all these indexes, and
      no delayed items for any other index keys in between, then we end up
      deleting in 3 batches:
      
      1) First batch for indexes 2, 3 and 4;
      2) Second batch for indexes 6 and 7;
      3) Third batch for index 10.
      
      This is a waste because we can delete all the index keys in a single
      batch. What matters is that each consecutive delayed index key matches
      each consecutive dir index key in a leaf.
      
So update the logic at btrfs_batch_delete_items() to check only for a
key match between delayed dir index items and dir index items in a leaf.
Also avoid the useless first iteration, which compares the key of the
first slot to delete with the key of the first delayed item; they always
match, since the delayed item's key was used for the btree search that
gave us the path we have.

This is more efficient and reduces the runtime of running delayed items,
as well as lock contention on the subvolume's tree.
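
A sketch of the updated matching loop (btrfs_item_key_to_cpu() and
btrfs_del_items() are the existing helpers; the rest is illustrative):

   /* The first slot is known to match the first delayed item (it came
    * from the btree search), so start comparing at the next slot. */
   while (slot + 1 < btrfs_header_nritems(leaf) && next_item) {
           struct btrfs_key key;

           btrfs_item_key_to_cpu(leaf, &key, slot + 1);
           if (key.offset != next_item->key.offset)
                   break;  /* leaf item without a matching delayed item */
           slot++;
           nitems++;
           next_item = next_delayed_item(next_item);
   }
   ret = btrfs_del_items(trans, root, path, first_slot, nitems);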
      
      For example, the following test script:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        NUM_FILES=1000
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        # Now delete every other file, to create gaps in the dir index keys.
        for ((i = 1; i <= $NUM_FILES; i += 2)); do
            rm -f $MNT/testdir/file_$i
        done
      
        # Sync to force any delayed items to be flushed to the tree.
        sync
      
        start=$(date +%s%N)
        rm -fr $MNT/testdir
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
      
        echo -e "\nrm -fr took $dur milliseconds"
      
        umount $MNT
      
      Running that test script while having the following bpftrace script
      running in another shell:
      
        $ cat bpf-measure.sh
        #!/usr/bin/bpftrace
      
        /* Add 'noinline' to btrfs_delete_delayed_items()'s definition. */
        k:btrfs_delete_delayed_items
        {
            @start_delete_delayed_items[tid] = nsecs;
        }
      
        k:btrfs_del_items
        /@start_delete_delayed_items[tid]/
        {
            @delete_batches = count();
        }
      
        kr:btrfs_delete_delayed_items
        /@start_delete_delayed_items[tid]/
        {
            $dur = (nsecs - @start_delete_delayed_items[tid]) / 1000;
            @btrfs_delete_delayed_items_total_time = sum($dur);
            delete(@start_delete_delayed_items[tid]);
        }
      
      Before this change:
      
      @btrfs_delete_delayed_items_total_time: 9563
      @delete_batches: 1001
      
      After this change:
      
      @btrfs_delete_delayed_items_total_time: 7328
      @delete_batches: 509
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: refactor the delayed item deletion entry point · 36baa2c7
  By Filipe Manana
      The delayed item deletion entry point, btrfs_delete_delayed_items(), is a
      bit convoluted for a few reasons:
      
      1) It's really a loop disguised with labels and goto statements;
      
2) There's a 'delete_fail' label which isn't only for error cases; we can
   jump to that label even if no error happened, if we simply don't have
   more delayed items to delete;

3) It unnecessarily keeps track of the current and previous items, as
   after getting the next item and releasing the current one, it just
   jumps to the 'again' label to look again for the first delayed item;

4) When a delayed item is not in the tree (because it was already deleted
   before), it releases the item while holding a path locked, which is
   not necessary and adds more contention to the tree, especially taking
   into account that the path came from a deletion search, meaning we have
   write locks for nodes at levels 2, 1 and 0. And releasing the item is
   not computationally trivial (an rbtree deletion, a kfree() and some
   other minor work).
      
So refactor it to use a while loop and add some comments to make it more
obvious why we can have delayed items without a matching item in the tree,
as well as why we don't keep the delayed node locked all the time while
running all its deletion items. This is also a preparation for some
upcoming work involving delayed items.
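
Schematically, the refactored entry point becomes a plain while loop
(a simplified sketch, not the exact code):

   while ((item = first_delayed_deletion_item(node))) {
           ret = search_slot_for_item(trans, root, path, item);
           if (ret > 0) {
                   /* Already gone from the tree: drop the path (and its
                    * locks) before releasing the delayed item. */
                   btrfs_release_path(path);
                   release_delayed_item(item);
                   continue;
           }
           ret = btrfs_batch_delete_items(trans, root, path, item);
           btrfs_release_path(path);
           if (ret)
                   return ret;
   }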
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: deal with deletion errors when deleting delayed items · 2b1d260d
  By Filipe Manana
      Currently, btrfs_delete_delayed_items() ignores any errors returned from
      btrfs_batch_delete_items(). This looks fishy but it's not a problem at
      the moment because:
      
      1) Two of the errors returned from btrfs_batch_delete_items() are for
         impossible cases, cases where a delayed item does not match any item
         in the leaf the path points to - btrfs_delete_delayed_items() always
         calls btrfs_batch_delete_items() with a path that points to a leaf
         that contains an item matching a delayed item;
      
      2) btrfs_batch_delete_items() may return an error from btrfs_del_items(),
         in which case it does not release the delayed items of the batch.
      
   At the moment this is harmless, because btrfs_del_items() is actually
   always able to delete items, even if it returns an error - when it
   returns an error it's because it ended up with a mostly empty leaf
   (less than 1/3 full) and failed to migrate items from that leaf into
   its neighbour leaves. This is not critical: all the items were
   deleted, we just left the tree a bit unbalanced, but it's still a
   valid tree and causes no harm, and future operations on the tree will
   eventually balance it.
      
   So even if we get an error from btrfs_del_items(), the delayed items
   will not be released, but the next time we run delayed items we will
   find out, at btrfs_delete_delayed_items(), that they are not present
   in the tree anymore, and then release them.
      
      This is all a bit subtle, and it's certainly prone to be a disaster in
      case btrfs_del_items() changes one day and may return errors before being
      able to delete all the requested items, in which case we could leave the
      filesystem in an inconsistent state as we would commit a transaction
      despite a failure from deleting items from the tree.
      
      So make btrfs_delete_delayed_items() check for any errors from the call
      to btrfs_batch_delete_items().
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: add assertions when deleting batches of delayed items · 659192e6
  By Filipe Manana
      There are a few impossible cases that btrfs_batch_delete_items() tries to
      deal with:
      
      1) Getting a path pointing to a NULL leaf;
      2) The leaf slot is pointing beyond the last item in the leaf;
      3) We can't find a single item to delete.
      
      The first case is impossible because the given path was returned by a
      successful call to btrfs_search_slot(). Replace the BUG_ON() with an
      ASSERT for this.
      
      The second case is impossible because we are always called when a delayed
      item matches an item in the given leaf. So add an ASSERT() for that and
      if that condition is not satisfied, trigger a warning and return an error.
      
      The third case is impossible exactly because of the same reason as the
      second case. The given delayed item matches one item in the leaf, so we
      know that our batch always has at least one item. Add an ASSERT to check
      that, trigger a warning if that expectation fails and return an error.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
3. 16 July 2022, 1 commit
4. 16 May 2022, 1 commit
5. 07 January 2022, 1 commit
6. 03 January 2022, 1 commit
7. 27 October 2021, 1 commit
• btrfs: loop only once over data sizes array when inserting an item batch · b7ef5f3a
  By Filipe Manana
      When inserting a batch of items into a btree, we end up looping over the
      data sizes array 3 times:
      
      1) Once in the caller of btrfs_insert_empty_items(), when it populates the
         array with the data sizes for each item;
      
      2) Once at btrfs_insert_empty_items() to sum the elements of the data
         sizes array and compute the total data size;
      
      3) And then once again at setup_items_for_insert(), where we do exactly
         the same as what we do at btrfs_insert_empty_items(), to compute the
         total data size.
      
      That is not bad for small arrays, but when the arrays have hundreds of
      elements, the time spent on looping is not negligible. For example when
      doing batch inserts of delayed items for dir index items or when logging
      a directory, it's common to have 200 to 260 dir index items in a single
      batch when using a leaf size of 16K and using file names between 8 and 12
      characters. For a 64K leaf size, multiply that by 4. Taking into account
      that during directory logging or when flushing delayed dir index items we
      can have many of those large batches, the time spent on the looping adds
      up quickly.
      
      It's also more important to avoid it at setup_items_for_insert(), since
      we are holding a write lock on a leaf and, in some cases, on upper nodes
      of the btree, which causes us to block other tasks that want to access
      the leaf and nodes for longer than necessary.
      
      So change the code so that setup_items_for_insert() and
      btrfs_insert_empty_items() no longer compute the total data size, and
      instead rely on the caller to supply it. This makes us loop over the
      array only once, where we can both populate the data size array and
      compute the total data size, taking advantage of spatial and temporal
      locality. To make this more manageable, use a structure to contain
      all the relevant details for a batch of items (keys array, data sizes
      array, total data size, number of items), and use it as an argument
      for btrfs_insert_empty_items() and setup_items_for_insert().
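
The batch descriptor introduced here looks roughly like this
(paraphrased):

   struct btrfs_item_batch {
           const struct btrfs_key *keys;   /* one key per item */
           const u32 *data_sizes;          /* data size of each item */
           u32 total_data_size;            /* precomputed sum of data_sizes */
           int nr;                         /* number of items in the batch */
   };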
      
      This patch is part of a small patchset that is comprised of the following
      patches:
      
        btrfs: loop only once over data sizes array when inserting an item batch
        btrfs: unexport setup_items_for_insert()
        btrfs: use single bulk copy operations when logging directories
      
      This is patch 1/3 and performance results, and the specific tests, are
      included in the changelog of patch 3/3.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8. 23 August 2021, 3 commits
• btrfs: add ro compat flags to inodes · 77eea05e
  By Boris Burkov
      Currently, inode flags are fully backwards incompatible in btrfs. If we
      introduce a new inode flag, then tree-checker will detect it and fail.
      This can even cause us to fail to mount entirely. To make it possible to
      introduce new flags which can be read-only compatible, like VERITY, we
      add new ro flags to btrfs without treating them quite so harshly in
      tree-checker. A read-only file system can survive an unexpected flag,
      and can be mounted.
      
      As for the implementation, it unfortunately gets a little complicated.
      
      The on-disk representation of the inode, btrfs_inode_item, has an __le64
      for flags but the in-memory representation, btrfs_inode, uses a u32.
      David Sterba had the nice idea that we could reclaim those wasted 32 bits
      on disk and use them for the new ro_compat flags.
      
      It turns out that the tree-checker code which checks for unknown flags
      is broken, and ignores the upper 32 bits we are hoping to use. The issue
      is that the flags use the literal 1 rather than 1ULL, so the flags are
      signed ints, and one of them is specifically (1 << 31). As a result, the
      mask which ORs the flags is a negative integer on machines where int is
      32 bit twos complement. When tree-checker evaluates the expression:
      
        btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK)
      
      The mask is something like 0x80000abc, which gets promoted to u64 with
      sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
      all the upper bits zeroed, and we can't detect unexpected flags.
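
The effect is easy to reproduce in userspace (a standalone demonstration
mirroring the problematic flag definitions, not kernel code):

   #include <stdio.h>
   #include <stdint.h>
   #include <inttypes.h>

   #define FLAG_LOW  (1 << 0)
   #define FLAG_HIGH (1 << 31)               /* a negative signed int */
   #define FLAG_MASK (FLAG_LOW | FLAG_HIGH)  /* also negative */

   int main(void)
   {
           uint64_t flags = UINT64_C(1) << 40;  /* pretend: a new ro flag */

           /* The int mask sign-extends when promoted to 64 bits... */
           printf("mask:    0x%016" PRIx64 "\n", (uint64_t)FLAG_MASK);
           /* ...so ~mask has all upper bits clear and the unknown flag
            * in bit 40 goes undetected (this prints zero). */
           printf("unknown: 0x%016" PRIx64 "\n",
                  flags & ~(uint64_t)FLAG_MASK);
           return 0;
   }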
      
      This suggests that we can't use those bits after all. Luckily, we have
good reason to believe that they are zero anyway. Inode flags are
metadata, which is always checksummed, so any bit flips that would
introduce 1s would cause a checksum failure anyway (excluding the
improbable case of a corruption that happens to keep the checksum valid).
      
      Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit
      inode flag should preserve its value and not add leading zeroes
      (at least for twos complement). The only place that flag
      (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
      the root item, and indeed for that inode we see 0xffffffff80000000 as
      the flags on disk. However, that inode is never seen by tree checker,
      nor is it used in a context where verity might be meaningful.
      Theoretically, a future ro flag might cause trouble on that inode, so we
      should proactively clean up that mess before it does.
      
      With the introduction of the new ro flags, keep two separate unsigned
      masks and check them against the appropriate u32. Since we no longer run
      afoul of sign extension, this also stops writing out 0xffffffff80000000
      in root_item inodes going forward.
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: stop doing GFP_KERNEL memory allocations in the ref verify tool · 5a656c36
  By Filipe Manana
      In commit 351cbf6e ("btrfs: use nofs allocations for running delayed
      items") we wrapped all btree updates when running delayed items with
      memalloc_nofs_save() and memalloc_nofs_restore(), due to a lock inversion
      detected by lockdep involving reclaim and the mutex of delayed nodes.
      
      The problem is because the ref verify tool does some memory allocations
      with GFP_KERNEL, which can trigger reclaim and reclaim can trigger inode
      eviction, which requires locking the mutex of an inode's delayed node.
      On the other hand the ref verify tool is called when allocating metadata
      extents as part of operations that modify a btree, which is a problem when
      running delayed nodes, where we do btree updates while holding the mutex
      of a delayed node. This is what caused the lockdep warning.
      
      Instead of wrapping every btree update when running delayed nodes, change
      the ref verify tool to never do GFP_KERNEL allocations, because:
      
1) We get less repeated code which, at the moment, does not even have a
   comment mentioning why we need to set up the NOFS context, which is a
   recommended good practice as mentioned at
   Documentation/core-api/gfp_mask-from-fs-io.rst;
      
      2) The ref verify tool is something meant only for debugging and not
         something that should be enabled on non-debug / non-development
         kernels;
      
3) We may have yet more places outside delayed-inode.c with a similar
   problem: doing btree updates while holding some lock and then having
   the GFP_KERNEL memory allocations, from the ref verify tool, trigger
   reclaim and trying again to acquire the same lock through the reclaim
   path. Or we could get more such cases in the future, therefore this
   change prevents getting into similar cases when using the ref verify
   tool.
      
Curiously, most of the memory allocations done by the ref verify tool
were already using GFP_NOFS, except a few, for no apparent reason.
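
The change itself is mostly mechanical, along these lines (illustrative):

   /* Ref verify allocations must not recurse into reclaim, which can
    * evict inodes and take delayed node mutexes. */
   ref = kzalloc(sizeof(*ref), GFP_NOFS);  /* was GFP_KERNEL */
   if (!ref)
           return -ENOMEM;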
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: improve the batch insertion of delayed items · 506650dc
  By Filipe Manana
      When we insert the delayed items of an inode, which corresponds to the
      directory index keys for a directory (key type BTRFS_DIR_INDEX_KEY), we
      do the following:
      
      1) Pick the first delayed item from the rbtree and insert it into the
         fs/subvolume btree, using btrfs_insert_empty_item() for that;
      
      2) Without releasing the path returned by btrfs_insert_empty_item(),
         keep collecting as many consecutive delayed items from the rbtree
         as possible, as long as each one's BTRFS_DIR_INDEX_KEY key is the
         immediate successor of the previously picked item and as long as
         they fit in the available space of the leaf the path points to;
      
      3) Then insert all the collected items into the leaf;
      
4) Release the reserved metadata space for each collected item and
   release each item (which implies deleting it from the rbtree);
      
      5) Unlock the path.
      
      While this is much better than inserting items one by one, it can be
      improved in a few aspects:
      
1) Instead of adding items based on the remaining free space of the
   leaf, collect as many items as can fit in a leaf and bulk insert
   them. This results in fewer and larger batches, reducing the total
   amount of time to insert the delayed items. For example when adding
   100K files to a directory, we ended up creating 1658 batches with
   very variable sizes ranging from 1 item to 118 items, on a filesystem
   with a node/leaf size of 16K. After this change, we end up with 839
   batches, with the vast majority of them having exactly 120 items;
      
      2) We do the search for more items to batch, by iterating the rbtree,
         while holding a write lock on the leaf;
      
      3) While still holding the leaf locked, we are releasing the reserved
         metadata for each item and then deleting each item, keeping a write
         lock on the leaf for longer than necessary. Releasing the delayed items
         one by one can take a significant amount of time, because deleting
         them from the rbtree can often be a bit slow when the deletion results
         in rebalancing the rbtree.
      
      So change this so that we try to create larger batches, with a total
      item size up to the maximum a leaf can support, and by unlocking the leaf
      immediately after inserting the items, releasing the reserved metadata
      space of each item and releasing each item without holding the write lock
      on the leaf.
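
Schematically, the new flow is (helper names are illustrative):

   /* 1) Collect as many consecutive delayed items as fit in one leaf. */
   nr = build_insertion_batch(node, first_item, &batch);

   /* 2) Bulk insert them and copy their data while the leaf is locked. */
   ret = insert_delayed_item_batch(trans, root, path, &batch);

   /* 3) Drop the leaf lock first... */
   btrfs_release_path(path);

   /* 4) ...then release reservations and delayed items unlocked. */
   release_batch_items(first_item, nr);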
      
      The following script that runs fs_mark was used to test this change:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-m single -d single"
        FILES=1000000
        THREADS=16
        FILE_SIZE=0
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        umount $DEV &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t 16"
        for ((i = 1; i <= $THREADS; i++)); do
            OPTS="$OPTS -d $MNT/d$i"
        done
      
        fs_mark $OPTS
      
        umount $MNT
      
It was run on a machine with 12 cores, 64G of RAM, using an NVMe device
and a non-debug kernel config (Debian's default config).
      
      Results before this change:
      
      FSUse%        Count         Size    Files/sec         App Overhead
           1     16000000            0      76182.1             72223046
           3     32000000            0      62746.9             80776528
           5     48000000            0      77029.0             93022381
           6     64000000            0      73691.6             95251075
           8     80000000            0      66288.0             85089634
      
      Results after this change:
      
      FSUse%        Count         Size    Files/sec         App Overhead
           1     16000000            0      79049.5 (+3.7%)     69700824
           3     32000000            0      65248.9 (+3.9%)     80583693
           5     48000000            0      77991.4 (+1.2%)     90040908
           6     64000000            0      75096.8 (+1.9%)     89862241
           8     80000000            0      66926.8 (+1.0%)     84429169
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
9. 21 June 2021, 4 commits
10. 19 April 2021, 3 commits
11. 03 March 2021, 1 commit
• btrfs: don't flush from btrfs_delayed_inode_reserve_metadata · 4d14c5cd
  By Nikolay Borisov
Calling btrfs_qgroup_reserve_meta_prealloc from
btrfs_delayed_inode_reserve_metadata can result in flushing delalloc
while holding a transaction and delayed node locks. This is deadlock
prone. In the past, multiple commits:

 * ae5e070e ("btrfs: qgroup: don't try to wait flushing if we're
   already holding a transaction")

 * 6f23277a ("btrfs: qgroup: don't commit transaction when we already
   hold the handle")

tried to solve various aspects of this, but it was always a
whack-a-mole game. Unfortunately those 2 fixes don't solve a deadlock
scenario involving btrfs_delayed_node::mutex. Namely, one thread
      can call btrfs_dirty_inode as a result of reading a file and modifying
      its atime:
      
        PID: 6963   TASK: ffff8c7f3f94c000  CPU: 2   COMMAND: "test"
        #0  __schedule at ffffffffa529e07d
        #1  schedule at ffffffffa529e4ff
        #2  schedule_timeout at ffffffffa52a1bdd
        #3  wait_for_completion at ffffffffa529eeea             <-- sleeps with delayed node mutex held
        #4  start_delalloc_inodes at ffffffffc0380db5
        #5  btrfs_start_delalloc_snapshot at ffffffffc0393836
        #6  try_flush_qgroup at ffffffffc03f04b2
        #7  __btrfs_qgroup_reserve_meta at ffffffffc03f5bb6     <-- tries to reserve space and starts delalloc inodes.
        #8  btrfs_delayed_update_inode at ffffffffc03e31aa      <-- acquires delayed node mutex
        #9  btrfs_update_inode at ffffffffc0385ba8
 #10  btrfs_dirty_inode at ffffffffc038627b               <-- TRANSACTION OPENED
       #11  touch_atime at ffffffffa4cf0000
       #12  generic_file_read_iter at ffffffffa4c1f123
       #13  new_sync_read at ffffffffa4ccdc8a
       #14  vfs_read at ffffffffa4cd0849
       #15  ksys_read at ffffffffa4cd0bd1
       #16  do_syscall_64 at ffffffffa4a052eb
       #17  entry_SYSCALL_64_after_hwframe at ffffffffa540008c
      
This will trigger asynchronous work to flush the delalloc inodes, and
that work can try to acquire the same delayed_node mutex:
      
        PID: 455    TASK: ffff8c8085fa4000  CPU: 5   COMMAND: "kworker/u16:30"
        #0  __schedule at ffffffffa529e07d
        #1  schedule at ffffffffa529e4ff
        #2  schedule_preempt_disabled at ffffffffa529e80a
        #3  __mutex_lock at ffffffffa529fdcb                    <-- goes to sleep, never wakes up.
        #4  btrfs_delayed_update_inode at ffffffffc03e3143      <-- tries to acquire the mutex
        #5  btrfs_update_inode at ffffffffc0385ba8              <-- this is the same inode that pid 6963 is holding
        #6  cow_file_range_inline.constprop.78 at ffffffffc0386be7
        #7  cow_file_range at ffffffffc03879c1
        #8  btrfs_run_delalloc_range at ffffffffc038894c
        #9  writepage_delalloc at ffffffffc03a3c8f
       #10  __extent_writepage at ffffffffc03a4c01
       #11  extent_write_cache_pages at ffffffffc03a500b
       #12  extent_writepages at ffffffffc03a6de2
       #13  do_writepages at ffffffffa4c277eb
       #14  __filemap_fdatawrite_range at ffffffffa4c1e5bb
       #15  btrfs_run_delalloc_work at ffffffffc0380987         <-- starts running delayed nodes
       #16  normal_work_helper at ffffffffc03b706c
       #17  process_one_work at ffffffffa4aba4e4
       #18  worker_thread at ffffffffa4aba6fd
       #19  kthread at ffffffffa4ac0a3d
       #20  ret_from_fork at ffffffffa54001ff
      
To fully address those cases, the complete fix is to never issue any
flushing while holding the transaction or the delayed node lock. This
patch achieves that by calling qgroup_reserve_meta directly, which will
either succeed without flushing or fail and return -EDQUOT. In the
latter case that return value is propagated to
btrfs_dirty_inode, which will fall back to starting a new transaction.
That's fine, as the majority of the time we expect the inode to have the
BTRFS_DELAYED_NODE_INODE_DIRTY flag set, which results in directly
copying the in-memory state.
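
A sketch of the fallback in btrfs_dirty_inode() (simplified):

   ret = btrfs_update_inode(trans, root, inode);
   if (ret == -ENOSPC || ret == -EDQUOT) {
           /* The non-flushing reservation failed: retry with a full
            * transaction, which is allowed to flush. */
           btrfs_end_transaction(trans);
           trans = btrfs_start_transaction(root, 1);
           if (IS_ERR(trans))
                   return PTR_ERR(trans);
           ret = btrfs_update_inode_fallback(trans, root, inode);
   }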
      
      Fixes: c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
      CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
12. 02 March 2021, 1 commit
13. 09 February 2021, 1 commit
14. 08 December 2020, 3 commits
15. 07 October 2020, 3 commits