    btrfs: cache sharedness of the last few data extents during fiemap · 73e339e6
    Committed by Filipe Manana
    During fiemap we process all the file extent items of an inode, by their
    file offset order (left to right b+tree order), and then check if the data
    extent they point at is shared or not. Until now we didn't cache those
    results, we only did it for b+tree nodes/leaves since for each unique
    b+tree path we have access to hundreds of file extent items. However, it
    is also common to repeat checking the sharedness of a particular data
    extent in a very short time window, and the cases that lead to that are
    the following:
    
    1) COW writes.
    
       If we have a file extent item like this:
    
                      [ bytenr X, offset = 0, num_bytes = 512K ]
       file offset    0                                        512K
    
       Then a 4K write into file offset 64K happens, we end up with the
       following file extent item layout:
    
                      [ bytenr X, offset = 0, num_bytes = 64K ]
       file offset    0                                       64K
    
                      [ bytenr Y, offset = 0, num_bytes = 4K ]
       file offset   64K                                     68K
    
                      [ bytenr X, offset = 68K, num_bytes = 444K ]
       file offset   68K                                         512K
    
       So during fiemap we will check for the sharedness of the data extent
       with bytenr X twice. Typically for COW writes and for at least
       moderately updated files, we end up with many file extent items that
       point to different sections of the same data extent.
    
    2) Writing into a NOCOW file after a snapshot is taken.
    
       This happens if the target extent was created in a generation older
       than the generation where the last snapshot for the root (the tree the
       inode belongs to) was made.
    
       This leads to a scenario like the previous one.
    
    3) Writing into sections of a preallocated extent.
    
       For example if a file has the following layout:
    
       [ bytenr X, offset = 0, num_bytes = 1M, type = prealloc ]
       0                                                       1M
    
       After doing a 4K write into file offset 0 and another 4K write into
       offset 512K, we get the following layout:
    
          [ bytenr X, offset = 0, num_bytes = 4K, type = regular ]
          0                                                      4K
    
          [ bytenr X, offset = 4K, num_bytes = 508K, type = prealloc ]
         4K                                                          512K
    
          [ bytenr X, offset = 512K, num_bytes = 4K, type = regular ]
       512K                                                         516K
    
          [ bytenr X, offset = 516K, num_bytes = 508K, type = prealloc ]
       516K                                                            1M
    
       So we end up with 4 consecutive file extent items pointing to the data
       extent at bytenr X.
    
    4) Hole punching in the middle of an extent.
    
       For example if a file has the following file extent item:
    
       [ bytenr X, offset = 0, num_bytes = 8M ]
       0                                      8M
    
       And then a hole is punched for the file range [4M, 6M[, our file
       extent item is split into two:
    
       [ bytenr X, offset = 0, num_bytes = 4M  ]
       0                                       4M
    
       [ 2M hole, implicit or explicit depending on NO_HOLES feature ]
       4M                                                            6M
    
       [ bytenr X, offset = 6M, num_bytes = 2M  ]
       6M                                       8M
    
       Again, we end up with two file extent items pointing to the same
       data extent.
    
    5) When reflinking (clone and deduplication) within the same file.
       This is probably the least common case of all.
    
    In cases 1, 2, 3 and 4, when we have multiple file extent items that point
    to the same data extent, their distance is usually short, typically
    separated by a few slots in a b+tree leaf (or across sibling leaves). For
    case 5, the distance can vary a lot, but it is typically the least common
    case.
    
    This change caches the result of the sharedness checks for data extents,
    but only for the last 8 extents that we notice that our inode refers to
    with multiple file extent items. Whenever we want to check if a data
    extent is shared, we look up the cache, which consists of doing a linear
    scan of an 8 element array, and if we find the data extent there, we
    return the result and don't check the extent tree and delayed refs.
    
    The array/cache is small so that doing the search has no noticeable
    negative impact on the performance in case we don't have file extent items
    within a distance of 8 slots that point to the same data extent.
    
    Slots in the cache/array are overwritten in a simple round robin fashion,
    as that approach fits very well the short-distance access pattern
    described above.
    
    Using this simple approach with only the last 8 data extents seen is
    effective because usually, when multiple file extent items point to the
    same data extent, their distance is within 8 slots. It also uses very
    little memory, and the time to cache a result or look up the cache is
    negligible.
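    The caching scheme described above can be sketched in plain C. This is an
    illustrative model only, not the actual kernel code: the real structures
    live in fs/btrfs/backref.{c,h} and only cache extents seen via multiple
    file extent items; all names below (shared_cache, shared_cache_lookup,
    shared_cache_store) are hypothetical.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    #define SHARED_CACHE_SIZE 8

    /* One cached sharedness result for a data extent. */
    struct shared_cache_entry {
        uint64_t bytenr;   /* logical start of the data extent */
        bool is_shared;    /* cached result of the sharedness check */
    };

    struct shared_cache {
        struct shared_cache_entry entries[SHARED_CACHE_SIZE];
        int count;         /* number of valid entries */
        int next;          /* next slot to overwrite (round robin) */
    };

    /* Linear scan of at most 8 entries; cheap even on a miss. */
    bool shared_cache_lookup(const struct shared_cache *cache,
                             uint64_t bytenr, bool *is_shared)
    {
        for (int i = 0; i < cache->count; i++) {
            if (cache->entries[i].bytenr == bytenr) {
                *is_shared = cache->entries[i].is_shared;
                return true;
            }
        }
        return false;
    }

    /* Once the array is full, overwrite slots in round-robin order. */
    void shared_cache_store(struct shared_cache *cache,
                            uint64_t bytenr, bool is_shared)
    {
        cache->entries[cache->next].bytenr = bytenr;
        cache->entries[cache->next].is_shared = is_shared;
        cache->next = (cache->next + 1) % SHARED_CACHE_SIZE;
        if (cache->count < SHARED_CACHE_SIZE)
            cache->count++;
    }
    ```

    A fiemap-style caller would try shared_cache_lookup() first and fall back
    to the expensive extent tree / delayed refs check only on a miss, storing
    the result for subsequent file extent items that point into the same
    data extent.
    
    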
    
    The following test was run on non-debug kernel (Debian's default kernel
    config) to measure the impact in the case of COW writes (first example
    given above), where we run fiemap after overwriting 33% of the blocks of
    a file:
    
       $ cat test.sh
       #!/bin/bash
    
       DEV=/dev/sdi
       MNT=/mnt/sdi
    
       umount $DEV &> /dev/null
       mkfs.btrfs -f $DEV
       mount $DEV $MNT
    
       FILE_SIZE=$((1 * 1024 * 1024  * 1024))
    
       # Create the file full of 1M extents.
       xfs_io -f -s -c "pwrite -b 1M -S 0xab 0 $FILE_SIZE" $MNT/foobar
    
       block_count=$((FILE_SIZE / 4096))
       # Overwrite about 33% of the file blocks.
       overwrite_count=$((block_count / 3))
    
       echo -e "\nOverwriting $overwrite_count 4K blocks (out of $block_count)..."
       RANDOM=123
       for ((i = 1; i <= $overwrite_count; i++)); do
           off=$(((RANDOM % block_count) * 4096))
           xfs_io -c "pwrite -S 0xcd $off 4K" $MNT/foobar > /dev/null
           echo -ne "\r$i blocks overwritten..."
       done
       echo -e "\n"
    
       # Unmount and mount to clear all cached metadata.
       umount $MNT
       mount $DEV $MNT
    
       start=$(date +%s%N)
       filefrag $MNT/foobar
       end=$(date +%s%N)
       dur=$(( (end - start) / 1000000 ))
       echo "fiemap took $dur milliseconds"
    
       umount $MNT
    
    Result before applying this patch:
    
       fiemap took 128 milliseconds
    
    Result after applying this patch:
    
       fiemap took 92 milliseconds   (-28.1%)
    
    The test is somewhat limited in the sense that the gains may be higher in
    practice, because in the test the filesystem is small, so we have small
    fs and extent trees, and there is no concurrent access to the trees,
    therefore no lock contention there.
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>