    btrfs: only reserve the needed data space amount during fallocate · 47e1d1c7
    Committed by Filipe Manana
    During a plain fallocate, we always start by reserving an amount of data
    space that matches the length of the range passed to fallocate. When we
    already have extents allocated in that range, we may end up trying to
    reserve a lot more data space than we need, which can result in several
    undesired behaviours:
    
    1) We fail with -ENOSPC. For example, the passed range has a length
       of 1G, but there's only one hole with a size of 1M in that range;
    
    2) We temporarily reserve excessive data space that could be used by
       other operations happening concurrently;
    
    3) By reserving much more data space than we need, we can end up
       doing expensive things like triggering delalloc for other inodes,
       waiting for the ordered extents to complete, triggering transaction
       commits, allocating new block groups, etc.
    
    Example:
    
      $ cat test.sh
      #!/bin/bash
    
      DEV=/dev/sdj
      MNT=/mnt/sdj
    
      mkfs.btrfs -f -b 1g $DEV
      mount $DEV $MNT
    
      # Create a file with a size of 600M and two holes, one at [200M, 201M[
      # and another at [401M, 402M[
      xfs_io -f -c "pwrite -S 0xab 0 200M" \
                -c "pwrite -S 0xcd 201M 200M" \
                -c "pwrite -S 0xef 402M 198M" \
                $MNT/foobar
    
      # Now call fallocate against the whole file range, see if it fails
      # with -ENOSPC or not - it shouldn't since we only need to allocate
      # 2M of data space.
      xfs_io -c "falloc 0 600M" $MNT/foobar
    
      umount $MNT
    
      $ ./test.sh
      (...)
      wrote 209715200/209715200 bytes at offset 0
      200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec)
      wrote 209715200/209715200 bytes at offset 210763776
      200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec)
      wrote 207618048/207618048 bytes at offset 421527552
      198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec)
      fallocate: No space left on device
      $
    
    So fix this by not allocating an amount of data space that matches the
    length of the range passed to fallocate. Instead allocate an amount of
    data space that corresponds to the sum of the sizes of each hole found
    in the range. This reservation now happens after we have locked the file
    range, which is safe since at that point we know there's no delalloc
    in the range: we have taken the inode's VFS lock in exclusive mode,
    taken the inode's i_mmap_lock in exclusive mode, flushed delalloc and
    waited for all ordered extents in the range to complete.
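    The reservation logic described above amounts to walking the allocated
    extents inside the locked range and summing only the gaps between them.
    The sketch below is a simplified, hypothetical model of that computation
    (the names `struct extent` and `sum_hole_bytes` are illustrative, not the
    actual btrfs code, which iterates extent maps in fs/btrfs/file.c):

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative stand-in for an allocated file extent in [offset, offset+len). */
    struct extent {
        uint64_t offset;
        uint64_t len;
    };

    /*
     * Given the allocated extents inside [start, end), sorted by offset and
     * non-overlapping, return the total number of hole bytes -- the amount
     * of data space a fallocate over [start, end) actually needs to reserve,
     * instead of reserving (end - start) up front.
     */
    static uint64_t sum_hole_bytes(uint64_t start, uint64_t end,
                                   const struct extent *extents, size_t n)
    {
        uint64_t holes = 0;
        uint64_t cur = start;

        for (size_t i = 0; i < n; i++) {
            /* Gap before this extent is a hole that needs reservation. */
            if (extents[i].offset > cur)
                holes += extents[i].offset - cur;
            cur = extents[i].offset + extents[i].len;
        }
        /* Trailing gap after the last extent, up to the end of the range. */
        if (cur < end)
            holes += end - cur;
        return holes;
    }

    int main(void)
    {
        const uint64_t M = 1024ULL * 1024;
        /* Layout from the reproducer: data at [0, 200M), [201M, 401M) and
         * [402M, 600M), so the holes are [200M, 201M) and [401M, 402M). */
        const struct extent extents[] = {
            { 0,       200 * M },
            { 201 * M, 200 * M },
            { 402 * M, 198 * M },
        };

        /* Only 2M of data space is needed, not 600M. */
        assert(sum_hole_bytes(0, 600 * M, extents, 3) == 2 * M);
        return 0;
    }
    ```

    With this per-hole accounting, the reproducer's fallocate only needs a 2M
    reservation on the 1g filesystem and no longer fails with -ENOSPC.
    
    
    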
    
    This type of failure actually seems to happen in practice with systemd,
    and we had at least one report about this in a very long thread which
    is referenced by the Link tag below.
    
    Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>