1. 08 Dec, 2020 5 commits
  2. 14 Nov, 2020 1 commit
    • F
      btrfs: fix missing delalloc new bit for new delalloc ranges · c3347309
      Filipe Manana committed
      When doing a buffered write, through one of the write family syscalls, we
      look for ranges which currently don't have allocated extents and set the
      'delalloc new' bit on them, so that we can report a correct number of used
      blocks to the stat(2) syscall until delalloc is flushed and ordered extents
      complete.
      
      However there are a few other places where we can do a buffered write
      against a range that is mapped to a hole (no extent allocated) and where
      we do not set the 'delalloc new' bit. Those places are:
      
      - Doing a memory mapped write against a hole;
      
      - Cloning an inline extent into a hole starting at file offset 0;
      
      - Calling btrfs_cont_expand() when the i_size of the file is not aligned
        to the sector size and is located in a hole. For example when cloning
        to a destination offset beyond EOF.
      
      So after such cases, until the corresponding delalloc range is flushed and
      the respective ordered extents complete, we can report an incorrect number
      of blocks used through the stat(2) syscall.
      
      In some cases we can end up reporting 0 used blocks to stat(2), which is a
      particularly bad value to report as it may mislead tools to think a file is
      completely sparse when its i_size is not zero, making them skip reading
      any data, an undesired consequence for tools such as archivers and other
      backup tools, as reported a long time ago in the following thread (and
      other past threads):
      
        https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
      
      Example reproducer:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f $DEV > /dev/null
        # mkfs.ext4 -F $DEV > /dev/null
        # mkfs.f2fs -f $DEV > /dev/null
        mount $DEV $MNT
      
        xfs_io -f -c "truncate 64K"   \
            -c "mmap -w 0 64K"        \
            -c "mwrite -S 0xab 0 64K" \
            -c "munmap"               \
            $MNT/foo
      
        blocks_used=$(stat -c %b $MNT/foo)
        echo "blocks used: $blocks_used"
      
        if [ $blocks_used -eq 0 ]; then
            echo "ERROR: blocks used is 0"
        fi
      
        umount $DEV
      
        $ ./reproducer.sh
        blocks used: 0
        ERROR: blocks used is 0
      
      So move the logic that decides to set the 'delalloc new' bit into the
      function btrfs_set_extent_delalloc(), since that is what we use for all
      those missing cases as well as for the cases that currently work well.
      
      This change is also preparatory work for an upcoming patch that fixes
      other problems related to tracking and reporting the number of bytes used
      by an inode.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
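
The zero-block case matters because a tool deciding whether to read file data may look only at the block count and size returned by stat(2). A minimal sketch of such a check follows. The helper names `looks_fully_sparse` and `file_looks_fully_sparse` are hypothetical, and this illustrates the heuristic described in the commit message rather than code from tar or any other tool:

```bash
#!/bin/bash

# Hypothetical helper (not taken from any real tool): succeeds when a file
# would look completely sparse to a tool that only inspects stat(2) results,
# i.e. zero allocated blocks but a non-zero size.
looks_fully_sparse() {
    local blocks=$1 size=$2
    [ "$blocks" -eq 0 ] && [ "$size" -gt 0 ]
}

# Convenience wrapper that reads both values from a path with stat(1).
file_looks_fully_sparse() {
    local path=$1
    looks_fully_sparse "$(stat -c %b "$path")" "$(stat -c %s "$path")"
}
```

Called on `$MNT/foo` right after the mmap write in the reproducer, `file_looks_fully_sparse` would be expected to succeed on an unpatched kernel, which is the misleading result the patch is meant to prevent.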
  3. 27 Oct, 2020 1 commit
    • J
      btrfs: don't fallback to buffered read if we don't need to · 0425e7ba
      Johannes Thumshirn committed
      Since we switched to the iomap infrastructure in f85781fb ("btrfs:
      switch to iomap for direct IO") we're calling generic_file_buffered_read()
      directly and not via generic_file_read_iter() anymore.
      
      If the direct read already read everything, there is no need to call
      generic_file_buffered_read() afterwards, which is how
      generic_file_read_iter() handles this case.
      
      If we call generic_file_buffered_read() in this case we can hit a
      situation where we do an invalid readahead and cause this UBSAN splat
      in fstest generic/091:
      
        run fstests generic/091 at 2020-10-21 10:52:32
        ================================================================================
        UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
        shift exponent 64 is too large for 64-bit type 'long unsigned int'
        CPU: 0 PID: 656 Comm: fsx Not tainted 5.9.0-rc7+ #821
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
        Call Trace:
         __dump_stack lib/dump_stack.c:77
         dump_stack+0x57/0x70 lib/dump_stack.c:118
         ubsan_epilogue+0x5/0x40 lib/ubsan.c:148
         __ubsan_handle_shift_out_of_bounds.cold+0x61/0xe9 lib/ubsan.c:395
         __roundup_pow_of_two ./include/linux/log2.h:57
         get_init_ra_size mm/readahead.c:318
         ondemand_readahead.cold+0x16/0x2c mm/readahead.c:530
         generic_file_buffered_read+0x3ac/0x840 mm/filemap.c:2199
         call_read_iter ./include/linux/fs.h:1876
         new_sync_read+0x102/0x180 fs/read_write.c:415
         vfs_read+0x11c/0x1a0 fs/read_write.c:481
         ksys_read+0x4f/0xc0 fs/read_write.c:615
         do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9 arch/x86/entry/entry_64.S:118
        RIP: 0033:0x7fe87fee992e
        RSP: 002b:00007ffe01605278 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
        RAX: ffffffffffffffda RBX: 000000000004f000 RCX: 00007fe87fee992e
        RDX: 0000000000004000 RSI: 0000000001677000 RDI: 0000000000000003
        RBP: 000000000004f000 R08: 0000000000004000 R09: 000000000004f000
        R10: 0000000000053000 R11: 0000000000000246 R12: 0000000000004000
        R13: 0000000000000000 R14: 000000000007a120 R15: 0000000000000000
        ================================================================================
        BTRFS info (device nullb0): has skinny extents
        BTRFS info (device nullb0): ZONED mode enabled, zone size 268435456 B
        BTRFS info (device nullb0): enabling ssd optimizations
      
      Fixes: f85781fb ("btrfs: switch to iomap for direct IO")
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
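
The "shift exponent 64" in the splat is consistent with how the kernel rounds up to a power of two: `__roundup_pow_of_two(n)` in include/linux/log2.h computes `1UL << fls_long(n - 1)`, so an argument of 0 wraps `n - 1` to all ones and asks for a shift by 64. The sketch below reproduces only the exponent arithmetic, assuming 64-bit shell integers. The function names mirror the kernel helpers but are reimplemented here purely for illustration:

```bash
#!/bin/bash

# Position of the most significant set bit, 1-based; 0 when no bit is set.
# Bash integers are assumed to be signed 64-bit, so a negative value means
# bit 63 is set.
fls_long() {
    local v=$1 bits=0
    if (( v < 0 )); then
        echo 64
        return
    fi
    while (( v > 0 )); do
        v=$(( v >> 1 ))
        bits=$(( bits + 1 ))
    done
    echo "$bits"
}

# Exponent that a round-up to the next power of two would shift by.
roundup_shift_exponent() {
    local n=$1
    fls_long $(( n - 1 ))
}

roundup_shift_exponent 5   # prints 3, so 1 << 3 = 8
roundup_shift_exponent 0   # prints 64, out of range for a 64-bit type
```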
  4. 07 Oct, 2020 16 commits
    • N
      btrfs: rename BTRFS_INODE_ORDERED_DATA_CLOSE flag · 1fd4033d
      Nikolay Borisov committed
      Commit 8d875f95 ("btrfs: disable strict file flushes for
      renames and truncates") eliminated the notion of ordered operations and
      instead BTRFS_INODE_ORDERED_DATA_CLOSE only remained as a flag
      indicating that a file's content should be synced to disk in case a
      file is truncated and any writes happen to it concurrently. In fact
      this intended behavior was broken until it was fixed in
      f6dc45c7 ("Btrfs: fix filemap_flush call in btrfs_file_release").
      
      All things considered let's give the flag a more descriptive name. Also
      slightly reword comments.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: remove inode argument from btrfs_start_ordered_extent · c0a43603
      Nikolay Borisov committed
      The passed in ordered_extent struct is always well-formed and contains
      the inode making the explicit argument redundant.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: sink total_data parameter in setup_items_for_insert · fc0d82e1
      Nikolay Borisov committed
      That parameter can easily be derived from the "data_size" and "nr"
      parameters. Exploit this fact to simplify the function's signature. No
      functional changes.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: eliminate total_size parameter from setup_items_for_insert · 3dc9dc89
      Nikolay Borisov committed
      The value of this argument can be derived from the total_data as it's
      simply the value of the data size + size of btrfs_items being touched.
      Move the parameter calculation inside the function. This results in a
      simpler interface and also a minor size reduction:
      
      ./scripts/bloat-o-meter ctree.original fs/btrfs/ctree.o
      add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-34 (-34)
      Function                                     old     new   delta
      btrfs_duplicate_item                         260     259      -1
      setup_items_for_insert                      1200    1190     -10
      btrfs_insert_empty_items                     177     154     -23
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
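
Taken together, the two setup_items_for_insert() cleanups rely on simple arithmetic: total_data is the sum of the per-item data sizes, and total_size adds one item header per item. A minimal sketch of that derivation follows. The 25-byte header size is an assumption standing in for sizeof(struct btrfs_item), and the function names are chosen for illustration only:

```bash
#!/bin/bash

# Assumed on-disk size of struct btrfs_item (key + offset + size fields).
ITEM_HEADER_SIZE=25

# Sum of the data sizes of the items being inserted.
total_data() {
    local sum=0 size
    for size in "$@"; do
        sum=$(( sum + size ))
    done
    echo "$sum"
}

# Data bytes plus one item header for each item.
total_size() {
    local nr=$#
    echo $(( $(total_data "$@") + nr * ITEM_HEADER_SIZE ))
}

total_size 10 20 30   # 60 data bytes + 3 * 25 header bytes = 135
```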
    • F
      btrfs: rename btrfs_insert_clone_extent() to a more generic name · 0cbb5bdf
      Filipe Manana committed
      Now that we use the same mechanism to replace all the extents in a file
      range with either a hole, an existing extent (when cloning) or a new
      extent (when using fallocate), the name of btrfs_insert_clone_extent()
      no longer reflects its genericity.
      
      So rename it to btrfs_insert_replace_extent(), since what it does is
      to either insert an existing extent or a new extent into a file range.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      btrfs: rename btrfs_punch_hole_range() to a more generic name · 306bfec0
      Filipe Manana committed
      The function btrfs_punch_hole_range() is now used to replace all the file
      extents in a given file range with an extent described in the given struct
      btrfs_replace_extent_info argument. This extent can either be an existing
      extent that is being cloned or it can be a new extent (namely a prealloc
      extent). When that argument is NULL it only punches a hole (drops all the
      existing extents) in the file range.
      
      So rename the function to btrfs_replace_file_extents().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      btrfs: rename struct btrfs_clone_extent_info to a more generic name · bf385648
      Filipe Manana committed
      Now that we can use btrfs_clone_extent_info to convey information for a
      new prealloc extent as well, and not just for existing extents that are
      being cloned, rename it to btrfs_replace_extent_info, which reflects the
      fact that this is now more generic and it is used to replace all existing
      extents in a file range with the extent described by the structure.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      btrfs: remove item_size member of struct btrfs_clone_extent_info · fb870f6c
      Filipe Manana committed
      The value of item_size of struct btrfs_clone_extent_info is always set to
      the size of a non-inline file extent item, and in fact the infrastructure
      that uses this structure (btrfs_punch_hole_range()) does not work with
      inline file extents at all (and it is not supposed to).
      
      So just remove that field from the structure and use directly
      sizeof(struct btrfs_file_extent_item) instead. Also assert that the
      file extent type is not inline at btrfs_insert_clone_extent().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      btrfs: fix metadata reservation for fallocate that leads to transaction aborts · 8fccebfa
      Filipe Manana committed
      When doing an fallocate(), especially a zero range operation, we assume
      that reserving 3 units of metadata space is enough, that at most we touch
      one leaf in subvolume/fs tree for removing existing file extent items and
      inserting a new file extent item. This assumption is generally true for
      most common use cases. However when we end up needing to remove file extent
      items from multiple leaves, we can end up failing with -ENOSPC and abort
      the current transaction, turning the filesystem to RO mode. When this
      happens a stack trace like the following is dumped in dmesg/syslog:
      
      [ 1500.620934] ------------[ cut here ]------------
      [ 1500.620938] BTRFS: Transaction aborted (error -28)
      [ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs]
      [ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...)
      [ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G        W         5.9.0-rc3-btrfs-next-67 #1
      [ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      [ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs]
      [ 1500.621026] Code: 8b 40 50 f0 48 (...)
      [ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286
      [ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000
      [ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
      [ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001
      [ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000
      [ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60
      [ 1500.621039] FS:  00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000
      [ 1500.621041] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0
      [ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1500.621049] Call Trace:
      [ 1500.621076]  btrfs_prealloc_file_range+0x10/0x20 [btrfs]
      [ 1500.621087]  btrfs_fallocate+0xccd/0x1280 [btrfs]
      [ 1500.621108]  vfs_fallocate+0x14d/0x290
      [ 1500.621112]  ksys_fallocate+0x3a/0x70
      [ 1500.621117]  __x64_sys_fallocate+0x1a/0x20
      [ 1500.621120]  do_syscall_64+0x33/0x80
      [ 1500.621123]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [ 1500.621126] RIP: 0033:0x7fb5b248c477
      [ 1500.621128] Code: 89 7c 24 08 (...)
      [ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
      [ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477
      [ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003
      [ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000
      [ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010
      [ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003
      [ 1500.621151] irq event stamp: 1026217
      [ 1500.621154] hardirqs last  enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0
      [ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0
      [ 1500.621159] softirqs last  enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606
      [ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20
      [ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]---
      [ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left
      
      When we use fallocate() internally, for reserving an extent for a space
      cache, inode cache or relocation, we can't hit this problem since either
      there aren't any file extent items to remove from the subvolume tree or
      there is at most one.
      
      When using plain fallocate() it's very unlikely, since that would require
      having many file extent items representing holes for the target range and
      crossing multiple leaves - we attempt to increase the range (merge) of such
      file extent items when punching holes, so at most we end up with 2 file
      extent items for holes at leaf boundaries.
      
      However when using the zero range operation of fallocate() for a large
      range (100+ MiB for example) that's fairly easy to trigger. The following
      example reproducer triggers the issue:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        umount /dev/sdj &> /dev/null
        mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null
        mount /dev/sdj /mnt/sdj
      
        # Create a 100M file with many file extent items. Punch a hole every 8K
        # just to speed up the file creation - we could do 4K sequential writes
        # followed by fsync (or O_SYNC) as well, but that takes a lot of time.
        file_size=$((100 * 1024 * 1024))
        xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar
        for ((i = 0; i < $file_size; i += 8192)); do
            xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar
        done
      
        # Force a transaction commit, so the zero range operation will be forced
        # to COW all metadata extents it needs to touch.
        sync
      
        xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar
      
        umount /mnt/sdj
      
        $ ./reproducer.sh
        wrote 104857600/104857600 bytes at offset 0
        100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec)
        fallocate: No space left on device
      
        $ dmesg
        <shows the same stack trace pasted before>
      
      To fix this use the existing infrastructure that hole punching and
      extent cloning use for replacing a file range with another extent. This
      deals with doing the removal of file extent items and inserting the new
      one using an incremental approach, reserving more space when needed and
      always ensuring we don't leave an implicit hole in the range in case
      we need to do multiple iterations and a crash happens between iterations.
      
      A test case for fstests will follow up soon.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: convert btrfs_inode_sectorsize to take btrfs_inode · 6fee248d
      Nikolay Borisov committed
      It's counterintuitive to have a function named btrfs_inode_xxx which
      takes a generic inode. Also move the function to btrfs_inode.h so that
      it has access to the definition of struct btrfs_inode.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: dio iomap DSYNC workaround · 0eb79294
      Josef Bacik committed
      iomap dio will run generic_write_sync() for us if the iocb is DSYNC.
      This is problematic for us for two reasons:
      
      1. we hold the inode_lock() during this operation, and we take it in
         generic_write_sync()
      2. we hold a read lock on the dio_sem but take the write lock in fsync
      
      Since we don't want to rip out this code right now, but reworking the
      locking is a bit much to do at this point, work around this problem with
      this masterpiece of a patch.
      
      First, we clear DSYNC on the iocb so that the iomap stuff doesn't know
      that it needs to handle the sync.  We save this fact in
      current->journal_info, because we need to do special things once
      we're in iomap_begin, and we have no way to pass private information
      into iomap_dio_rw().
      
      Next we specify a separate iomap_dio_ops for sync, which implements an
      ->end_io() callback that gets called when the dio completes.  This is
      important for AIO, because we really do need to run generic_write_sync()
      if we complete asynchronously.  However if we're still in the submitting
      context when we enter ->end_io() we clear the flag so that the submitter
      knows they're the ones that need to run generic_write_sync().
      
      This is meant to be temporary.  We need to work out how to eliminate the
      inode_lock() and the dio_sem in our fsync and use another mechanism to
      protect these operations.
      Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • G
      btrfs: switch to iomap for direct IO · f85781fb
      Goldwyn Rodrigues committed
      We're using direct io implementation based on buffer heads. This patch
      switches to the new iomap infrastructure.
      
      Switch from __blockdev_direct_IO() to iomap_dio_rw().  Rename
      btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it as
      iomap_begin() for iomap direct I/O functions. This function allocates
      and locks all the blocks required for the I/O.  btrfs_submit_direct() is
      used as the submit_io() hook for direct I/O ops.
      
      Since we need direct I/O reads to go through iomap_dio_rw(), we change
      file_operations.read_iter() to a btrfs_file_read_iter() which calls
      btrfs_direct_IO() for direct reads and falls back to
      generic_file_buffered_read() for incomplete reads and buffered reads.
      
      We don't need address_space.direct_IO() anymore: set it to noop.
      
      Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
      capable of direct I/O reads from a hole, so we don't need to return
      -ENOENT.
      
      Btrfs direct I/O is now done under i_rwsem, shared in case of reads and
      exclusive in case of writes. This guards against simultaneous truncates.
      
      Use iomap->iomap_end() to check for failed or incomplete direct I/O:
      
        - for writes, call __endio_write_update_ordered()
        - for reads, unlock extents
      
      btrfs_dio_data is now hooked in iomap->private and not
      current->journal_info. It carries the reservation variable and the
      amount of data submitted, so we can calculate the amount of data to call
      __endio_write_update_ordered in case of an error.
      
      This patch removes last use of struct buffer_head from btrfs.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      btrfs: make fast fsyncs wait only for writeback · 48778179
      Filipe Manana committed
      Currently regardless of a full or a fast fsync we always wait for ordered
      extents to complete, and then start logging the inode after that. However
      for fast fsyncs we can just wait for the writeback to complete, we don't
      need to wait for the ordered extents to complete since we use the list of
      modified extent maps to figure out which extents we must log and we can
      get their checksums directly from the ordered extents that are still in
      flight, otherwise look them up from the checksums tree.
      
      Until commit b5e6c3e1 ("btrfs: always wait on ordered extents at
      fsync time"), for fast fsyncs, we used to start logging without even
      waiting for the writeback to complete first, we would wait for it to
      complete after logging, while holding a transaction open, which led to
      performance issues when using cgroups and probably for other cases too,
      as waiting for IO while holding a transaction handle should be avoided as
      much as possible. After that, for fast fsyncs, we started to wait for
      ordered extents to complete before starting to log, which adds some
      latency to fsyncs and we even got at least one report about a performance
      drop which bisected to that particular change:
      
      https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/
      
      This change makes fast fsyncs only wait for writeback to finish before
      starting to log the inode, instead of waiting for both the writeback to
      finish and for the ordered extents to complete. This brings back part of
      the logic we had that extracts checksums from in flight ordered extents,
      which are not yet in the checksums tree, and making sure transaction
      commits wait for the completion of ordered extents previously logged
      (by far most of the time they have already completed by the time a
      transaction commit starts, resulting in no wait at all), to avoid any
      data loss if an ordered extent completes after the transaction used to
      log an inode is committed, followed by a power failure.
      
      When there are no other tasks accessing the checksums and the subvolume
      btrees, the ordered extent completion is pretty fast, typically taking
      100 to 200 microseconds only in my observations. However when there are
      other tasks accessing these btrees, ordered extent completion can take a
      lot more time due to lock contention on nodes and leaves of these btrees.
      I've seen cases over 2 milliseconds, which starts to be significant. In
      particular when we do have concurrent fsyncs against different files there
      is a lot of contention on the checksums btree, since we have many tasks
      writing the checksums into the btree and other tasks that already started
      the logging phase are doing lookups for checksums in the btree.
      
      This change also turns all ranged fsyncs into full ranged fsyncs, which
      is something we already did when not using the NO_HOLES feature or when
      doing a full fsync. This is to guarantee we never miss checksums due to
      writeback having been triggered only for a part of an extent, and we end
      up logging the full extent but only checksums for the written range, which
      results in missing checksums after log replay. Allowing ranged fsyncs to
      operate again only in the original range, when using the NO_HOLES feature
      and doing a fast fsync is doable but requires some non-trivial changes to
      the writeback path, which can always be worked on later if needed, but I
      don't think they are a very common use case.
      
      Several tests were performed using fio for different numbers of concurrent
      jobs, each writing and fsyncing its own file, for both sequential and
      random file writes. The tests were run on bare metal, no virtualization,
      on a box with 12 cores (Intel i7-8700), 64GB of RAM and an NVMe device,
      with a kernel configuration that is the default of typical distributions
      (Debian in this case), without debug options enabled (kasan, kmemleak,
      slub debug, debug of page allocations, lock debugging, etc).
      
      The following script that calls fio was used:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 5 ]; then
          echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]"
          exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
        BLOCK_SIZE=$4
        WRITE_MODE=$5
      
        if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then
          echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'"
          exit 1
        fi
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=$WRITE_MODE
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=$BLOCK_SIZE
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        echo
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The results were the following:
      
      *************************
      *** sequential writes ***
      *************************
      
      ==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec
      
      After patch:
      
      WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec
      (+9.8% throughput, -8.8% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec
      
      After patch:
      
      WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec
      (+21.5% throughput, -17.8% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec
      
      After patch:
      
      WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec
      (+28.7% throughput, -22.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec
      
      After patch:
      
      WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec
      (+35.6% throughput, -25.2% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec
      
      After patch:
      
      WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec
      (+34.1% throughput, -25.6% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec
      
      After patch:
      
      WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec
      (+19.1% throughput, -16.4% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec
      
      After patch:
      
      WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec
      (+23.1% throughput, -18.7% runtime)
      
      ************************
      ***   random writes  ***
      ************************
      
      ==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec
      
      After patch:
      
      WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec
      (+0.9% throughput, -1.7% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec
      
      After patch:
      
      WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec
      (+2.3% throughput, -2.0% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec
      
      After patch:
      
      WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec
      (+15.6% throughput, -13.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec
      
      After patch:
      
      WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec
      (+11.6% throughput, -10.7% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec
      
      After patch:
      
      WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec
      (+3.8% throughput, -3.8% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec
      
      After patch:
      
      WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec
      (+12.7% throughput, -11.2% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec
      
      After patch:
      
      WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec
      (+6.3% throughput, -6.0% runtime)
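
The percentage lines under each result pair follow from the reported bandwidth and runtime values. A minimal sketch of that arithmetic, assuming a hypothetical helper named `percent_change` (it is not part of fio or the patch):

```c
#include <assert.h>

/*
 * Relative change from `before` to `after`, in percent.
 * Positive means an increase, negative a decrease.
 * Hypothetical helper, for illustration only.
 */
static double percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

For the last pair above, `percent_change(45.9, 48.8)` is about 6.3 (throughput) and `percent_change(714614, 672087)` is about -6.0 (runtime), matching the summary line after rounding to one decimal.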
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Q
      btrfs: cleanup calculation of lockend in lock_and_cleanup_extent_if_need() · e21139c6
      Qu Wenruo committed
      We're just rounding up to sectorsize to calculate lockend. There is
      no need for the extra length calculation; a direct round_up() is
      enough.
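
The equivalence behind this cleanup can be sketched in userspace C. The "before" shape below (round the length up relative to an aligned start) is an assumption about what the removed calculation looked like, and `round_up_pow2`, `lockend_via_length` and `lockend_direct` are hypothetical names, not kernel functions:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for round_up(); `align` must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/* Assumed "before" shape: round the length up, then add the aligned start. */
static uint64_t lockend_via_length(uint64_t pos, uint64_t write_bytes,
                                   uint64_t sectorsize)
{
    uint64_t start = pos & ~(sectorsize - 1); /* round down to a sector */
    uint64_t len = round_up_pow2(pos + write_bytes - start, sectorsize);

    return start + len - 1;
}

/* "After" shape: round the end offset up directly. */
static uint64_t lockend_direct(uint64_t pos, uint64_t write_bytes,
                               uint64_t sectorsize)
{
    return round_up_pow2(pos + write_bytes, sectorsize) - 1;
}
```

Because the start is already sector aligned, rounding the length up and adding the start gives the same result as rounding the end offset up, so both functions return the same lockend.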
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 21 Aug, 2020 1 commit
    • B
      btrfs: detect nocow for swap after snapshot delete · a84d5d42
      Boris Burkov committed
      can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for
      detecting a must cow condition which is not exactly accurate, but saves
      unnecessary tree traversal. The incorrect assumption is that if the
      extent was created in a generation smaller than the last snapshot
      generation, it must be referenced by that snapshot. That is true, except
      the snapshot could have since been deleted, without affecting the last
      snapshot generation.
      
      The original patch claimed a performance win from this check, but it
      also leads to a bug where you are unable to use a swapfile if you ever
      snapshotted the subvolume it's in. Make the check slower and more strict
      for the swapon case, without modifying the general cow checks as a
      compromise. Turning swap on does not seem to be a particularly
      performance sensitive operation, so incurring a possibly unnecessary
      btrfs_search_slot seems worthwhile for the added usability.
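
The difference between the generation shortcut and the stricter check can be illustrated with a toy model. Everything below is a sketch under stated assumptions: `toy_extent`, `toy_must_cow` and the `strict` flag are hypothetical and only mirror the reasoning in this message, not the actual kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of an extent; both fields are illustrative, not kernel structures. */
struct toy_extent {
    uint64_t generation;  /* generation in which the extent was created */
    bool shared_on_disk;  /* what a full reference lookup would report */
};

/*
 * Decide whether a write to the extent must COW.
 * strict == false: take the generation shortcut and skip the lookup.
 * strict == true:  ignore the shortcut and rely on the slower lookup,
 *                  as proposed here for the swapon case.
 */
static bool toy_must_cow(const struct toy_extent *e,
                         uint64_t last_snapshot_gen, bool strict)
{
    assert(e != NULL);
    if (!strict && e->generation < last_snapshot_gen)
        return true; /* assumed referenced by a snapshot, possibly wrongly */
    return e->shared_on_disk;
}
```

With an extent created in generation 5, a last snapshot generation of 10 and a snapshot that has since been deleted and fully cleaned up (`shared_on_disk == false`), the shortcut still answers "must COW" while the strict variant answers "no COW needed".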
      
      Note: Until the snapshot is completely cleaned up after deletion,
      check_committed_refs will still cause the logic to think that cow is
      necessary, so the user must wait until 'btrfs subvolume sync' has
      finished before activating the swapfile with swapon.
      
      CC: stable@vger.kernel.org # 5.4+
      Suggested-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 27 Jul, 2020 12 commits
  7. 10 Jul, 2020 1 commit
  8. 22 Jun, 2020 1 commit
  9. 17 Jun, 2020 2 commits
    • F
      btrfs: fix RWF_NOWAIT writes blocking on extent locks and waiting for IO · 5dbb75ed
      Filipe Manana committed
      A RWF_NOWAIT write is not supposed to wait on filesystem locks that can be
      held for a long time or for ongoing IO to complete.
      
      However when calling check_can_nocow(), if the inode has prealloc extents
      or has the NOCOW flag set, we can block on extent (file range) locks
      through the call to btrfs_lock_and_flush_ordered_range(). Such a lock can
      take a significant amount of time to be available. For example, a fiemap
      task may be running, and iterating through the entire file range checking
      all extents and doing backref walking to determine if they are shared,
      or a readpage operation may be in progress.
      
      Also at btrfs_lock_and_flush_ordered_range(), called by check_can_nocow(),
      after locking the file range we wait for any existing ordered extent that
      is in progress to complete. This is another operation that can take a
      significant amount of time and defeat the purpose of RWF_NOWAIT.
      
      So fix this by trying to lock the file range and if it's currently locked
      return -EAGAIN to user space. If we are able to lock the file range without
      waiting and there is an ordered extent in the range, return -EAGAIN as
      well, instead of waiting for it to complete. Finally, don't bother trying
      to lock the snapshot lock of the root when attempting a RWF_NOWAIT write,
      as that is only important for buffered writes.
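
The non-blocking behavior described above can be modeled with a small single-threaded sketch. `toy_range` and `try_lock_range_nowait` are hypothetical names used only for illustration; the real code operates on btrfs extent locks and ordered extents:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy state of a file range; illustrative only. */
struct toy_range {
    bool locked;                  /* some other task holds the range lock */
    bool ordered_extent_pending;  /* an ordered extent is still in flight */
};

/*
 * NOWAIT path: never wait. Return -EAGAIN if the range is already
 * locked, or if it can be locked but an ordered extent is pending.
 * On success (0) the caller owns the range lock.
 */
static int try_lock_range_nowait(struct toy_range *r)
{
    assert(r != NULL);
    if (r->locked)
        return -EAGAIN;       /* would have to wait for the lock */
    r->locked = true;
    if (r->ordered_extent_pending) {
        r->locked = false;    /* drop the lock instead of waiting for IO */
        return -EAGAIN;
    }
    return 0;
}
```

In both failure cases the caller gets -EAGAIN immediately, and user space can retry the write without RWF_NOWAIT.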
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      btrfs: fix RWF_NOWAIT write not failing when we need to cow · 260a6339
      Filipe Manana committed
      If we attempt to do a RWF_NOWAIT write against a file range for which we
      can only do NOCOW for a part of it, due to the existence of holes or
      shared extents for example, we proceed with the write as if it were
      possible to NOCOW the whole range.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ touch /mnt/bar
        $ chattr +C /mnt/bar
      
        $ xfs_io -d -c "pwrite -S 0xab -b 256K 0 256K" /mnt/bar
        wrote 262144/262144 bytes at offset 0
        256 KiB, 1 ops; 0.0003 sec (694.444 MiB/sec and 2777.7778 ops/sec)
      
        $ xfs_io -c "fpunch 64K 64K" /mnt/bar
        $ sync
      
        $ xfs_io -d -c "pwrite -N -V 1 -b 128K -S 0xfe 0 128K" /mnt/bar
        wrote 131072/131072 bytes at offset 0
        128 KiB, 1 ops; 0.0007 sec (160.051 MiB/sec and 1280.4097 ops/sec)
      
      This last write should fail with -EAGAIN since the file range from 64K to
      128K is a hole. On xfs it fails, as expected, but on ext4 it currently
      succeeds because apparently it is expensive to check if there are extents
      allocated for the whole range, but I'll check with the ext4 people.
      
      Fix the issue by checking if check_can_nocow() returns a number of
      NOCOW'able bytes smaller than the requested number of bytes, and if it
      does, return -EAGAIN.
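
The check described above amounts to a single comparison. A minimal sketch, assuming a hypothetical helper `nowait_nocow_check` that receives the requested length and the number of bytes that can be written without COW (how that number is obtained is left out):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Return 0 if the whole requested range can be written without COW,
 * or -EAGAIN if only part of it (or none) can.
 * Hypothetical helper, for illustration only.
 */
static int nowait_nocow_check(uint64_t requested_bytes, uint64_t nocow_bytes)
{
    assert(requested_bytes > 0);
    if (nocow_bytes < requested_bytes)
        return -EAGAIN;
    return 0;
}
```

In the example above the write asks for 128K but only the first 64K precede the punched hole, so `nowait_nocow_check(128 * 1024, 64 * 1024)` returns -EAGAIN.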
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>