1. 07 10月, 2020 40 次提交
    • Q
      btrfs: qgroup: fix wrong qgroup metadata reserve for delayed inode · b4c5d8fd
      Qu Wenruo 提交于
      For delayed inode facility, qgroup metadata is reserved for it, and
      later freed.
      
      However we're freeing more bytes than we reserved.
      In btrfs_delayed_inode_reserve_metadata():
      
      	num_bytes = btrfs_calc_metadata_size(fs_info, 1);
      	...
      		ret = btrfs_qgroup_reserve_meta_prealloc(root,
      				fs_info->nodesize, true);
      		...
      		if (!ret) {
      			node->bytes_reserved = num_bytes;
      
      But in btrfs_delayed_inode_release_metadata():
      
      	if (qgroup_free)
      		btrfs_qgroup_free_meta_prealloc(node->root,
      				node->bytes_reserved);
      	else
      		btrfs_qgroup_convert_reserved_meta(node->root,
      				node->bytes_reserved);
      
      This means, we're always releasing more qgroup metadata rsv than we have
      reserved.
      
      This won't trigger selftest warning, as btrfs qgroup metadata rsv has
      extra protection against cases like quota enabled half-way.
      
      But we still need to fix this problem any way.
      
      This patch will use the same num_bytes for qgroup metadata rsv so we
      could handle it correctly.
      
      Fixes: f218ea6c ("btrfs: delayed-inode: Remove wrong qgroup meta reservation calls")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b4c5d8fd
    • J
      btrfs: do not hold device_list_mutex when closing devices · 425c6ed6
      Josef Bacik 提交于
      The following lockdep splat
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-rc7-00169-g87212851a027-dirty #929 Not tainted
      ------------------------------------------------------
      fsstress/8739 is trying to acquire lock:
      ffff88bfd0eb0c90 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x43/0x70
      
      but task is already holding lock:
      ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #10 (sb_pagefaults){.+.+}-{0:0}:
             __sb_start_write+0x129/0x210
             btrfs_page_mkwrite+0x6a/0x4a0
             do_page_mkwrite+0x4d/0xc0
             handle_mm_fault+0x103c/0x1730
             exc_page_fault+0x340/0x660
             asm_exc_page_fault+0x1e/0x30
      
      -> #9 (&mm->mmap_lock#2){++++}-{3:3}:
             __might_fault+0x68/0x90
             _copy_to_user+0x1e/0x80
             perf_read+0x141/0x2c0
             vfs_read+0xad/0x1b0
             ksys_read+0x5f/0xe0
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #8 (&cpuctx_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             perf_event_init_cpu+0x88/0x150
             perf_event_init+0x1db/0x20b
             start_kernel+0x3ae/0x53c
             secondary_startup_64+0xa4/0xb0
      
      -> #7 (pmus_lock){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             perf_event_init_cpu+0x4f/0x150
             cpuhp_invoke_callback+0xb1/0x900
             _cpu_up.constprop.26+0x9f/0x130
             cpu_up+0x7b/0xc0
             bringup_nonboot_cpus+0x4f/0x60
             smp_init+0x26/0x71
             kernel_init_freeable+0x110/0x258
             kernel_init+0xa/0x103
             ret_from_fork+0x1f/0x30
      
      -> #6 (cpu_hotplug_lock){++++}-{0:0}:
             cpus_read_lock+0x39/0xb0
             kmem_cache_create_usercopy+0x28/0x230
             kmem_cache_create+0x12/0x20
             bioset_init+0x15e/0x2b0
             init_bio+0xa3/0xaa
             do_one_initcall+0x5a/0x2e0
             kernel_init_freeable+0x1f4/0x258
             kernel_init+0xa/0x103
             ret_from_fork+0x1f/0x30
      
      -> #5 (bio_slab_lock){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             bioset_init+0xbc/0x2b0
             __blk_alloc_queue+0x6f/0x2d0
             blk_mq_init_queue_data+0x1b/0x70
             loop_add+0x110/0x290 [loop]
             fq_codel_tcf_block+0x12/0x20 [sch_fq_codel]
             do_one_initcall+0x5a/0x2e0
             do_init_module+0x5a/0x220
             load_module+0x2459/0x26e0
             __do_sys_finit_module+0xba/0xe0
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #4 (loop_ctl_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             lo_open+0x18/0x50 [loop]
             __blkdev_get+0xec/0x570
             blkdev_get+0xe8/0x150
             do_dentry_open+0x167/0x410
             path_openat+0x7c9/0xa80
             do_filp_open+0x93/0x100
             do_sys_openat2+0x22a/0x2e0
             do_sys_open+0x4b/0x80
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             blkdev_put+0x1d/0x120
             close_fs_devices.part.31+0x84/0x130
             btrfs_close_devices+0x44/0xb0
             close_ctree+0x296/0x2b2
             generic_shutdown_super+0x69/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             btrfs_run_dev_stats+0x49/0x480
             commit_cowonly_roots+0xb5/0x2a0
             btrfs_commit_transaction+0x516/0xa60
             sync_filesystem+0x6b/0x90
             generic_shutdown_super+0x22/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             btrfs_commit_transaction+0x4bb/0xa60
             sync_filesystem+0x6b/0x90
             generic_shutdown_super+0x22/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
             __lock_acquire+0x1272/0x2310
             lock_acquire+0x9e/0x360
             __mutex_lock+0x9f/0x930
             btrfs_record_root_in_trans+0x43/0x70
             start_transaction+0xd1/0x5d0
             btrfs_dirty_inode+0x42/0xd0
             file_update_time+0xc8/0x110
             btrfs_page_mkwrite+0x10c/0x4a0
             do_page_mkwrite+0x4d/0xc0
             handle_mm_fault+0x103c/0x1730
             exc_page_fault+0x340/0x660
             asm_exc_page_fault+0x1e/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(sb_pagefaults);
                                     lock(&mm->mmap_lock#2);
                                     lock(sb_pagefaults);
        lock(&fs_info->reloc_mutex);
      
       *** DEADLOCK ***
      
      3 locks held by fsstress/8739:
       #0: ffff88bee66eeb68 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x173/0x660
       #1: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
       #2: ffff88bfbd16e630 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3da/0x5d0
      
      stack backtrace:
      CPU: 17 PID: 8739 Comm: fsstress Kdump: loaded Not tainted 5.8.0-rc7-00169-g87212851a027-dirty #929
      Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
      Call Trace:
       dump_stack+0x78/0xa0
       check_noncircular+0x165/0x180
       __lock_acquire+0x1272/0x2310
       ? btrfs_get_alloc_profile+0x150/0x210
       lock_acquire+0x9e/0x360
       ? btrfs_record_root_in_trans+0x43/0x70
       __mutex_lock+0x9f/0x930
       ? btrfs_record_root_in_trans+0x43/0x70
       ? lock_acquire+0x9e/0x360
       ? join_transaction+0x5d/0x450
       ? find_held_lock+0x2d/0x90
       ? btrfs_record_root_in_trans+0x43/0x70
       ? join_transaction+0x3d5/0x450
       ? btrfs_record_root_in_trans+0x43/0x70
       btrfs_record_root_in_trans+0x43/0x70
       start_transaction+0xd1/0x5d0
       btrfs_dirty_inode+0x42/0xd0
       file_update_time+0xc8/0x110
       btrfs_page_mkwrite+0x10c/0x4a0
       ? handle_mm_fault+0x5e/0x1730
       do_page_mkwrite+0x4d/0xc0
       ? __do_fault+0x32/0x150
       handle_mm_fault+0x103c/0x1730
       exc_page_fault+0x340/0x660
       ? asm_exc_page_fault+0x8/0x30
       asm_exc_page_fault+0x1e/0x30
      RIP: 0033:0x7faa6c9969c4
      
      Was seen in testing.  The fix is similar to that of
      
        btrfs: open device without device_list_mutex
      
      where we're holding the device_list_mutex and then grab the bd_mutex,
      which pulls in a bunch of dependencies under the bd_mutex.  We only ever
      call btrfs_close_devices() on mount failure or unmount, so we're save to
      not have the device_list_mutex here.  We're already holding the
      uuid_mutex which keeps us safe from any external modification of the
      fs_devices.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      425c6ed6
    • J
      btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks · 62cf5391
      Josef Bacik 提交于
      When closing and freeing the source device we could end up doing our
      final blkdev_put() on the bdev, which will grab the bd_mutex.  As such
      we want to be holding as few locks as possible, so move this call
      outside of the dev_replace->lock_finishing_cancel_unmount lock.  Since
      we're modifying the fs_devices we need to make sure we're holding the
      uuid_mutex here, so take that as well.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      62cf5391
    • N
      btrfs: remove alloc_list splice in btrfs_prepare_sprout · 68abf360
      Nikolay Borisov 提交于
      btrfs_prepare_sprout is called when the first rw device is added to a
      seed filesystem. This means the filesystem can't have its alloc_list
      be non-empty, since seed filesystems are read only. Simply remove the
      code altogether.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      68abf360
    • N
      btrfs: document some invariants of seed code · 427c8fdd
      Nikolay Borisov 提交于
      Without good understanding of how seed devices works it's hard to grok
      some of what the code in open_seed_devices or btrfs_prepare_sprout does.
      
      Add comments hopefully reducing some of the cognitive load.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      427c8fdd
    • N
      btrfs: switch seed device to list api · 944d3f9f
      Nikolay Borisov 提交于
      While this patch touches a bunch of files the conversion is
      straighforward. Instead of using the implicit linked list anchored at
      btrfs_fs_devices::seed the code is switched to using
      list_for_each_entry.
      
      Previous patches in the series already factored out code that processed
      both main and seed devices so in those cases the factored out functions
      are called on the main fs_devices and then on every seed dev inside
      list_for_each_entry.
      
      Using list api also allows to simplify deletion from the seed dev list
      performed in btrfs_rm_device and btrfs_rm_dev_replace_free_srcdev by
      substituting a while() loop with a simple list_del_init.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      944d3f9f
    • N
      btrfs: simplify setting/clearing fs_info to btrfs_fs_devices · c4989c2f
      Nikolay Borisov 提交于
      It makes no sense to have sysfs-related routines be responsible for
      properly initialising the fs_info pointer of struct btrfs_fs_device.
      Instead this can be streamlined by making it the responsibility of
      btrfs_init_devices_late to initialize it. That function already
      initializes fs_info of every individual device in btrfs_fs_devices.
      
      As far as clearing it is concerned it makes sense to move it to
      close_fs_devices. That function is only called when struct
      btrfs_fs_devices is no longer in use - either for holding seeds or
      main devices for a mounted filesystem.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c4989c2f
    • N
      btrfs: make close_fs_devices return void · 54eed6ae
      Nikolay Borisov 提交于
      The return value of this function conveys absolutely no information.
      All callers already check the state of fs_devices->opened to decide how
      to proceed. So convert the function to returning void. While at it make
      btrfs_close_devices also return void.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      54eed6ae
    • N
      btrfs: factor out loop logic from btrfs_free_extra_devids · 3712ccb7
      Nikolay Borisov 提交于
      This prepares the code to switching seeds devices to a proper list.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3712ccb7
    • N
      btrfs: factor out reada loop in __reada_start_machine · dc0ab488
      Nikolay Borisov 提交于
      This is in preparation for moving fs_devices to proper lists.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dc0ab488
    • N
      btrfs: remove err variable from btrfs_get_extent · 1028d1c4
      Nikolay Borisov 提交于
      There's no practical reason too use 'err' as a variable to convey
      errors. In fact it's value is either set explicitly in the beginning of
      the function or it simply takes the value of 'ret'. Not conforming to
      the usual pattern of having ret be the only variable used to convey
      errors makes the code more error prone to bugs. In fact one such bug
      was introduced by 6bf9e4bd ("btrfs: inode: Verify inode mode toi
      avoid NULL pointer dereference") by assigning the error value to 'ret'
      and not 'err'.
      
      Let's fix that issue and make the function less tricky by leaving only
      ret to convey error values.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1028d1c4
    • J
      btrfs: dio iomap DSYNC workaround · 0eb79294
      Josef Bacik 提交于
      iomap dio will run generic_write_sync() for us if the iocb is DSYNC.
      This is problematic for us because of 2 reasons:
      
      1. we hold the inode_lock() during this operation, and we take it in
         generic_write_sync()
      2. we hold a read lock on the dio_sem but take the write lock in fsync
      
      Since we don't want to rip out this code right now, but reworking the
      locking is a bit much to do at this point, work around this problem with
      this masterpiece of a patch.
      
      First, we clear DSYNC on the iocb so that the iomap stuff doesn't know
      that it needs to handle the sync.  We save this fact in
      current->journal_info, because we need to see do special things once
      we're in iomap_begin, and we have no way to pass private information
      into iomap_dio_rw().
      
      Next we specify a separate iomap_dio_ops for sync, which implements an
      ->end_io() callback that gets called when the dio completes.  This is
      important for AIO, because we really do need to run generic_write_sync()
      if we complete asynchronously.  However if we're still in the submitting
      context when we enter ->end_io() we clear the flag so that the submitter
      knows they're the ones that needs to run generic_write_sync().
      
      This is meant to be temporary.  We need to work out how to eliminate the
      inode_lock() and the dio_sem in our fsync and use another mechanism to
      protect these operations.
      Tested-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0eb79294
    • G
      btrfs: switch to iomap for direct IO · f85781fb
      Goldwyn Rodrigues 提交于
      We're using direct io implementation based on buffer heads. This patch
      switches to the new iomap infrastructure.
      
      Switch from __blockdev_direct_IO() to iomap_dio_rw().  Rename
      btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it as
      iomap_begin() for iomap direct I/O functions. This function allocates
      and locks all the blocks required for the I/O.  btrfs_submit_direct() is
      used as the submit_io() hook for direct I/O ops.
      
      Since we need direct I/O reads to go through iomap_dio_rw(), we change
      file_operations.read_iter() to a btrfs_file_read_iter() which calls
      btrfs_direct_IO() for direct reads and falls back to
      generic_file_buffered_read() for incomplete reads and buffered reads.
      
      We don't need address_space.direct_IO() anymore: set it to noop.
      
      Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
      capable of direct I/O reads from a hole, so we don't need to return
      -ENOENT.
      
      Btrfs direct I/O is now done under i_rwsem, shared in case of reads and
      exclusive in case of writes. This guards against simultaneous truncates.
      
      Use iomap->iomap_end() to check for failed or incomplete direct I/O:
      
        - for writes, call __endio_write_update_ordered()
        - for reads, unlock extents
      
      btrfs_dio_data is now hooked in iomap->private and not
      current->journal_info. It carries the reservation variable and the
      amount of data submitted, so we can calculate the amount of data to call
      __endio_write_update_ordered in case of an error.
      
      This patch removes last use of struct buffer_head from btrfs.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f85781fb
    • Q
      btrfs: add owner and fs_info to alloc_state io_tree · 154f7cb8
      Qu Wenruo 提交于
      Commit 1c11b63e ("btrfs: replace pending/pinned chunks lists with io
      tree") introduced btrfs_device::alloc_state extent io tree, but it
      doesn't initialize the fs_info and owner member.
      
      This means the following features are not properly supported:
      
      - Fs owner report for insert_state() error
        Without fs_info initialized, although btrfs_err() won't panic, it
        won't output which fs is causing the error.
      
      - Wrong owner for trace events
        alloc_state will get the owner as pinned extents.
      
      Fix this by assiging proper fs_info and owner for
      btrfs_device::alloc_state.
      
      Fixes: 1c11b63e ("btrfs: replace pending/pinned chunks lists with io tree")
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      154f7cb8
    • M
      btrfs: make read_block_group_item return void · 4c448ce8
      Marcos Paulo de Souza 提交于
      Since it's inclusion on 9afc6649 ("btrfs: block-group: refactor how
      we read one block group item") this function always returned 0, so there
      is no need to check for the returned value.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4c448ce8
    • L
      btrfs: sysfs: fix unused-but-set-variable warnings · 24646481
      Leon Romanovsky 提交于
      The compilation with W=1 generates the following warnings:
       fs/btrfs/sysfs.c:1630:6: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
        1630 |  int ret;
             |      ^~~
       fs/btrfs/sysfs.c:1629:6: warning: variable 'features' set but not used [-Wunused-but-set-variable]
        1629 |  u64 features;
             |      ^~~~~~~~
      
      [ The unused variables are leftover from e410e34f ("Revert "btrfs:
        synchronize incompat feature bits with sysfs files""), which needs
        to be properly fixed by moving feature bit manipulation from the sysfs
        context.  Silence the warning to save pepople time, we got several
        reports. ]
      Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      24646481
    • F
      btrfs: make fast fsyncs wait only for writeback · 48778179
      Filipe Manana 提交于
      Currently regardless of a full or a fast fsync we always wait for ordered
      extents to complete, and then start logging the inode after that. However
      for fast fsyncs we can just wait for the writeback to complete, we don't
      need to wait for the ordered extents to complete since we use the list of
      modified extents maps to figure out which extents we must log and we can
      get their checksums directly from the ordered extents that are still in
      flight, otherwise look them up from the checksums tree.
      
      Until commit b5e6c3e1 ("btrfs: always wait on ordered extents at
      fsync time"), for fast fsyncs, we used to start logging without even
      waiting for the writeback to complete first, we would wait for it to
      complete after logging, while holding a transaction open, which lead to
      performance issues when using cgroups and probably for other cases too,
      as wait for IO while holding a transaction handle should be avoided as
      much as possible. After that, for fast fsyncs, we started to wait for
      ordered extents to complete before starting to log, which adds some
      latency to fsyncs and we even got at least one report about a performance
      drop which bisected to that particular change:
      
      https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/
      
      This change makes fast fsyncs only wait for writeback to finish before
      starting to log the inode, instead of waiting for both the writeback to
      finish and for the ordered extents to complete. This brings back part of
      the logic we had that extracts checksums from in flight ordered extents,
      which are not yet in the checksums tree, and making sure transaction
      commits wait for the completion of ordered extents previously logged
      (by far most of the time they have already completed by the time a
      transaction commit starts, resulting in no wait at all), to avoid any
      data loss if an ordered extent completes after the transaction used to
      log an inode is committed, followed by a power failure.
      
      When there are no other tasks accessing the checksums and the subvolume
      btrees, the ordered extent completion is pretty fast, typically taking
      100 to 200 microseconds only in my observations. However when there are
      other tasks accessing these btrees, ordered extent completion can take a
      lot more time due to lock contention on nodes and leaves of these btrees.
      I've seen cases over 2 milliseconds, which starts to be significant. In
      particular when we do have concurrent fsyncs against different files there
      is a lot of contention on the checksums btree, since we have many tasks
      writing the checksums into the btree and other tasks that already started
      the logging phase are doing lookups for checksums in the btree.
      
      This change also turns all ranged fsyncs into full ranged fsyncs, which
      is something we already did when not using the NO_HOLES features or when
      doing a full fsync. This is to guarantee we never miss checksums due to
      writeback having been triggered only for a part of an extent, and we end
      up logging the full extent but only checksums for the written range, which
      results in missing checksums after log replay. Allowing ranged fsyncs to
      operate again only in the original range, when using the NO_HOLES feature
      and doing a fast fsync is doable but requires some non trivial changes to
      the writeback path, which can always be worked on later if needed, but I
      don't think they are a very common use case.
      
      Several tests were performed using fio for different numbers of concurrent
      jobs, each writing and fsyncing its own file, for both sequential and
      random file writes. The tests were run on bare metal, no virtualization,
      on a box with 12 cores (Intel i7-8700), 64Gb of RAM and a NVMe device,
      with a kernel configuration that is the default of typical distributions
      (debian in this case), without debug options enabled (kasan, kmemleak,
      slub debug, debug of page allocations, lock debugging, etc).
      
      The following script that calls fio was used:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 5 ]; then
          echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]"
          exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
        BLOCK_SIZE=$4
        WRITE_MODE=$5
      
        if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then
          echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'"
          exit 1
        fi
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=$WRITE_MODE
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=$BLOCK_SIZE
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        echo
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The results were the following:
      
      *************************
      *** sequential writes ***
      *************************
      
      ==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec
      
      After patch:
      
      WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec
      (+9.8%, -8.8% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec
      
      After patch:
      
      WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec
      (+21.5% throughput, -17.8% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec
      
      After patch:
      
      WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec
      (+28.7% throughput, -22.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec
      
      After patch:
      
      WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec
      (+35.6% throughput, -25.2% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec
      
      After patch:
      
      WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec
      (+34.1% throughput, -25.6% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec
      
      After patch:
      
      WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec
      (+19.1% throughput, -16.4% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec
      
      After patch:
      
      WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec
      (+23.1% throughput, -18.7% runtime)
      
      ************************
      ***   random writes  ***
      ************************
      
      ==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec
      
      After patch:
      
      WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec
      (+0.9% throughput, -1.7% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec
      
      After patch:
      
      WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec
      (+2.3% throughput, -2.0% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec
      
      After patch:
      
      WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec
      (+15.6% throughput, -13.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec
      
      After patch:
      
      WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec
      (+11.6% throughput, -10.7% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec
      
      After patch:
      
      WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec
      (+3.8% throughput, -3.8% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec
      
      After patch:
      
      WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec
      (+12.7% throughput, -11.2% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec
      
      After patch:
      
      WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec
      (+6.3% throughput, -6.0% runtime)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      48778179
    • F
      btrfs: do not commit logs and transactions during link and rename operations · 75b463d2
      Filipe Manana 提交于
      Since commit d4682ba0 ("Btrfs: sync log after logging new name") we
      started to commit logs, and fallback to transaction commits when we failed
      to log the new names or commit the logs, after link and rename operations
      when the target inodes (or their parents) were previously logged in the
      current transaction. This was to avoid losing directories despite an
      explicit fsync on them when they are ancestors of some inode that got a
      new named logged, due to a link or rename operation. However that adds the
      cost of starting IO and waiting for it to complete, which can cause higher
      latencies for applications.
      
      Instead of doing that, just make sure that when we log a new name for an
      inode we don't mark any of its ancestors as logged, so that if any one
      does an fsync against any of them, without doing any other change on them,
      the fsync commits the log. This way we only pay the cost of a log commit
      (or a transaction commit if something goes wrong or a new block group was
      created) if the application explicitly asks to fsync any of the parent
      directories.
      
      Using dbench, which mixes several filesystems operations including renames,
      revealed some significant latency gains. The following script that uses
      dbench was used to test this:
      
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-m single -d single"
        THREADS=16
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -t 300 -D $MNT $THREADS
      
        umount $MNT
      
      The test was run on bare metal, no virtualization, on a box with 12 cores
      (Intel i7-8700), 64Gb of RAM and using a NVMe device, with a kernel
      configuration that is the default of typical distributions (debian in this
      case), without debug options enabled (kasan, kmemleak, slub debug, debug
      of page allocations, lock debugging, etc).
      
      Results before this patch:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    10750455     0.011   155.088
       Close         7896674     0.001     0.243
       Rename         455222     2.158  1101.947
       Unlink        2171189     0.067   121.638
       Deltree           256     2.425     7.816
       Mkdir             128     0.002     0.003
       Qpathinfo     9744323     0.006    21.370
       Qfileinfo     1707092     0.001     0.146
       Qfsinfo       1786756     0.001    11.228
       Sfileinfo      875612     0.003    21.263
       Find          3767281     0.025     9.617
       WriteX        5356924     0.011   211.390
       ReadX        16852694     0.003     9.442
       LockX           35008     0.002     0.119
       UnlockX         35008     0.001     0.138
       Flush          753458     4.252  1102.249
      
      Throughput 1128.35 MB/sec  16 clients  16 procs  max_latency=1102.255 ms
      
      Results after this patch:
      
      16 clients, after
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    11471098     0.012   448.281
       Close         8426396     0.001     0.925
       Rename         485746     0.123   267.183
       Unlink        2316477     0.080    63.433
       Deltree           288     2.830    11.144
       Mkdir             144     0.003     0.010
       Qpathinfo    10397420     0.006    10.288
       Qfileinfo     1822039     0.001     0.169
       Qfsinfo       1906497     0.002    14.039
       Sfileinfo      934433     0.004     2.438
       Find          4019879     0.026    10.200
       WriteX        5718932     0.011   200.985
       ReadX        17981671     0.003    10.036
       LockX           37352     0.002     0.076
       UnlockX         37352     0.001     0.109
       Flush          804018     5.015   778.033
      
      Throughput 1201.98 MB/sec  16 clients  16 procs  max_latency=778.036 ms
      (+6.5% throughput, -29.4% max latency, -75.8% rename latency)
      
      Test case generic/498 from fstests tests the scenario that the previously
      mentioned commit fixed.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      75b463d2
    • F
      btrfs: do not take the log_mutex of the subvolume when pinning the log · 5522a27e
      Filipe Manana 提交于
      During a rename we pin the log to make sure no one commits a log that
      reflects an ongoing rename operation, as it might result in a committed
      log where it recorded the unlink of the old name without having recorded
      the new name. However we are taking the subvolume's log_mutex before
      incrementing the log_writers counter, which is not necessary since that
      counter is atomic and we only remove the old name from the log and add
      the new name to the log after we have incremented log_writers, ensuring
      that no one can commit the log after we have removed the old name from
      the log and before we added the new name to the log.
      
      By taking the log_mutex lock we are just adding unnecessary contention on
      the lock, which can become visible for workloads that mix renames with
      fsyncs, writes for files opened with O_SYNC and unlink operations (if the
      inode or its parent were fsynced before in the current transaction).
      
      So just remove the lock and unlock of the subvolume's log_mutex at
      btrfs_pin_log_trans().
      
      Using dbench, which mixes different types of operations that end up taking
      that mutex (fsyncs, renames, unlinks and writes into files opened with
      O_SYNC) revealed some small gains. The following script that calls dbench
      was used:
      
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-m single -d single"
        THREADS=32
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -s -t 600 -D $MNT $THREADS
      
        umount $MNT
      
      The test was run on bare metal, no virtualization, on a box with 12 cores
      (Intel i7-8700), 64Gb of RAM and using a NVMe device, with a kernel
      configuration that is the default of typical distributions (debian in this
      case), without debug options enabled (kasan, kmemleak, slub debug, debug
      of page allocations, lock debugging, etc).
      
      Results before this patch:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4410848     0.017   738.640
       Close        3240222     0.001     0.834
       Rename        186850     7.478  1272.476
       Unlink        890875     0.128   785.018
       Deltree          128     2.846    12.081
       Mkdir             64     0.002     0.003
       Qpathinfo    3997659     0.009    11.171
       Qfileinfo     701307     0.001     0.478
       Qfsinfo       733494     0.002     1.103
       Sfileinfo     359362     0.004     3.266
       Find         1546226     0.041     4.128
       WriteX       2202803     7.905  1376.989
       ReadX        6917775     0.003     3.887
       LockX          14392     0.002     0.043
       UnlockX        14392     0.001     0.085
       Flush         309225     0.128  1033.936
      
      Throughput 231.555 MB/sec (sync open)  32 clients  32 procs  max_latency=1376.993 ms
      
      Results after this patch:
      
      Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4603244     0.017   232.776
       Close        3381299     0.001     1.041
       Rename        194871     7.251  1073.165
       Unlink        929730     0.133   119.233
       Deltree          128     2.871    10.199
       Mkdir             64     0.002     0.004
       Qpathinfo    4171343     0.009    11.317
       Qfileinfo     731227     0.001     1.635
       Qfsinfo       765079     0.002     3.568
       Sfileinfo     374881     0.004     1.220
       Find         1612964     0.041     4.675
       WriteX       2296720     7.569  1178.204
       ReadX        7213633     0.003     3.075
       LockX          14976     0.002     0.076
       UnlockX        14976     0.001     0.061
       Flush         322635     0.102   579.505
      
      Throughput 241.4 MB/sec (sync open)  32 clients  32 procs  max_latency=1178.207 ms
      (+4.3% throughput, -14.4% max latency)
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5522a27e
    • D
      btrfs: send: remove indirect callback parameter for changed_cb · 1b51d6fc
      David Sterba 提交于
      There's a custom callback passed to btrfs_compare_trees which happens to
      be named exactly same as the existing function implementing it. This is
      confusing and the indirection is not necessary for our needs. Compiler
      is clever enough to call it directly so there's effectively no change.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1b51d6fc
    • D
      btrfs: scrub: rename ratelimit state varaible to avoid shadowing · 8bb1cf1b
      David Sterba 提交于
      There's already defined _rs within ctree.h:btrfs_printk_ratelimited,
      local variables should not use _ to avoid such name clashes with
      macro-local variables.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8bb1cf1b
    • D
      btrfs: remove unnecessarily shadowed variables · 0af447d0
      David Sterba 提交于
      In btrfs_orphan_cleanup, there's another instance of fs_info, but it's
      the same as the one we already have.
      
      In btrfs_backref_finish_upper_links, rb_node is same type and used
      as temporary cursor to the tree.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0af447d0
    • D
      btrfs: compression: move declarations to header · cb4c9198
      David Sterba 提交于
      The declarations of compression algorithm callbacks are defined in the
      .c file as they're used from there. Compiler warns that there are no
      declarations for public functions when compiling lzo.c/zlib.c/zstd.c.
      Fix that by moving the declarations to the header as it's the common
      place for all of them.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cb4c9198
    • D
      btrfs: remove const from btrfs_feature_set_name · 9e6df7ce
      David Sterba 提交于
      The function btrfs_feature_set_name returns a const char pointer, the
      second const is not necessary and reported as a warning:
      
       In file included from fs/btrfs/space-info.c:6:
       fs/btrfs/sysfs.h:16:1: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
          16 | const char * const btrfs_feature_set_name(enum btrfs_feature_set set);
             | ^~~~~
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9e6df7ce
    • Q
      btrfs: cleanup calculation of lockend in lock_and_cleanup_extent_if_need() · e21139c6
      Qu Wenruo 提交于
      We're just doing rounding up to sectorsize to calculate the lockend.
      There is no need to do the unnecessary length calculation, just direct
      round_up() is enough.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e21139c6
    • J
      btrfs: fix possible infinite loop in data async reclaim · c4923027
      Josef Bacik 提交于
      Dave reported an issue where generic/102 would sometimes hang.  This
      turned out to be because we'd get into this spot where we were no longer
      making progress on data reservations because our exit condition was not
      met.  The log is basically
      
      while (!space_info->full && !list_empty(&space_info->tickets))
      	flush_space(space_info, flush_state);
      
      where flush state is our various flush states, but doesn't include
      ALLOC_CHUNK_FORCE.  This is because we actually lead with allocating
      chunks, and so the assumption was that once you got to the actual
      flushing states you could no longer allocate chunks.  This was a stupid
      assumption, because you could have deleted block groups that would be
      reclaimed by a transaction commit, thus unsetting space_info->full.
      This is essentially what happens with generic/102, and so sometimes
      you'd get stuck in the flushing loop because we weren't allocating
      chunks, but flushing space wasn't giving us what we needed to make
      progress.
      
      Fix this by adding ALLOC_CHUNK_FORCE to the end of our flushing states,
      that way we will eventually bail out because we did end up with
      space_info->full if we free'd a chunk previously.  Otherwise, as is the
      case for this test, we'll allocate our chunk and continue on our happy
      merry way.
      Reported-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c4923027
    • J
      btrfs: add a comment explaining the data flush steps · 1a7a92c8
      Josef Bacik 提交于
      The data flushing steps are not obvious to people other than myself and
      Chris.  Write a giant comment explaining the reasoning behind each flush
      step for data as well as why it is in that particular order.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1a7a92c8
    • J
      btrfs: do async reclaim for data reservations · 57056740
      Josef Bacik 提交于
      Now that we have the data ticketing stuff in place, move normal data
      reservations to use an async reclaim helper to satisfy tickets.  Before
      we could have multiple tasks race in and both allocate chunks, resulting
      in more data chunks than we would necessarily need.  Serializing these
      allocations and making a single thread responsible for flushing will
      only allocate chunks as needed, as well as cut down on transaction
      commits and other flush related activities.
      
      Priority reservations will still work as they have before, simply
      trying to allocate a chunk until they can make their reservation.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      57056740
    • J
      btrfs: flush delayed refs when trying to reserve data space · cb3e3930
      Josef Bacik 提交于
      We can end up with freed extents in the delayed refs, and thus
      may_commit_transaction() may not think we have enough pinned space to
      commit the transaction and we'll ENOSPC early.  Handle this by running
      the delayed refs in order to make sure pinned is uptodate before we try
      to commit the transaction.
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cb3e3930
    • J
      btrfs: run delayed iputs before committing the transaction for data · 327feeeb
      Josef Bacik 提交于
      Before we were waiting on iputs after we committed the transaction, but
      this doesn't really make much sense.  We want to reclaim any space we
      may have in order to be more likely to commit the transaction, due to
      pinned space being added by running the delayed iputs.  Fix this by
      making delayed iputs run before committing the transaction.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      327feeeb
    • J
      btrfs: don't force commit if we are data · bb86bd3d
      Josef Bacik 提交于
      We used to unconditionally commit the transaction at least 2 times and
      then on the 3rd try check against pinned space to make sure committing
      the transaction was worth the effort.  This is overkill, we know nobody
      is going to steal our reservation, and if we can't make our reservation
      with the pinned amount simply bail out.
      
      This also cleans up the passing of bytes_needed to
      may_commit_transaction, as that was the thing we added into place in
      order to accomplish this behavior.  We no longer need it so remove that
      mess.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bb86bd3d
    • J
      btrfs: drop the commit_cycles stuff for data reservations · 02827001
      Josef Bacik 提交于
      This was an old wart left over from how we previously did data
      reservations.  Before we could have people race in and take a
      reservation while we were flushing space, so we needed to make sure we
      looped a few times before giving up.  Now that we're using the ticketing
      infrastructure we don't have to worry about this and can drop the logic
      altogether.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      02827001
    • J
      btrfs: use the same helper for data and metadata reservations · f3bda421
      Josef Bacik 提交于
      Now that data reservations follow the same pattern as metadata
      reservations we can simply rename __reserve_metadata_bytes to
      __reserve_bytes and use that helper for data reservations.
      
      Things to keep in mind, btrfs_can_overcommit() returns 0 for data,
      because we can never overcommit.  We also will never pass in FLUSH_ALL
      for data, so we'll simply be added to the priority list and go straight
      into handle_reserve_ticket.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f3bda421
    • J
      btrfs: serialize data reservations if we are flushing · 0532a6f8
      Josef Bacik 提交于
      Nikolay reported a problem where generic/371 would fail sometimes with a
      slow drive.  The gist of the test is that we fallocate a file in
      parallel with a pwrite of a different file.  These two files combined
      are smaller than the file system, but sometimes the pwrite would ENOSPC.
      
      A fair bit of investigation uncovered the fact that the fallocate
      workload was racing in and grabbing the free space that the pwrite
      workload was trying to free up so it could make its own reservation.
      After a few loops of this eventually the pwrite workload would error out
      with an ENOSPC.
      
      We've had the same problem with metadata as well, and we serialized all
      metadata allocations to satisfy this problem.  This wasn't usually a
      problem with data because data reservations are more straightforward,
      but obviously could still happen.
      
      Fix this by not allowing reservations to occur if there are any pending
      tickets waiting to be satisfied on the space info.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0532a6f8
    • J
      btrfs: use ticketing for data space reservations · 1004f686
      Josef Bacik 提交于
      Now that we have all the infrastructure in place, use the ticketing
      infrastructure to make data allocations.  This still maintains the exact
      same flushing behavior, but now we're using tickets to get our
      reservations satisfied.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1004f686
    • J
      btrfs: add btrfs_reserve_data_bytes and use it · 8698fc4e
      Josef Bacik 提交于
      Create a new function btrfs_reserve_data_bytes() in order to handle data
      reservations.  This uses the new flush types and flush states to handle
      making data reservations.
      
      This patch specifically does not change any functionality, and is
      purposefully not cleaned up in order to make bisection easier for the
      future patches.  The new helper is identical to the old helper in how it
      handles data reservations.  We first try to force a chunk allocation,
      and then we run through the flush states all at once and in the same
      order that they were done with the old helper.
      
      Subsequent patches will clean this up and change the behavior of the
      flushing, and it is important to keep those changes separate so we can
      easily bisect down to the patch that caused the regression, rather than
      the patch that made us start using the new infrastructure.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8698fc4e
    • J
      btrfs: add the data transaction commit logic into may_commit_transaction · a1ed0a82
      Josef Bacik 提交于
      Data space flushing currently unconditionally commits the transaction
      twice in a row, and the last time it checks if there's enough pinned
      extents to satisfy its reservation before deciding to commit the
      transaction for the 3rd and final time.
      
      Encode this logic into may_commit_transaction().  In the next patch we
      will pass in U64_MAX for bytes_needed the first two times, and the final
      time we will pass in the actual bytes we need so the normal logic will
      apply.
      
      This patch exists solely to make the logical changes I will make to the
      flushing state machine separate to make it easier to bisect any
      performance related regressions.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a1ed0a82
    • J
      btrfs: add flushing states for handling data reservations · 058e6d1d
      Josef Bacik 提交于
      Currently the way we do data reservations is by seeing if we have enough
      space in our space_info.  If we do not and we're a normal inode we'll
      
      1) Attempt to force a chunk allocation until we can't anymore.
      2) If that fails we'll flush delalloc, then commit the transaction, then
         run the delayed iputs.
      
      If we are a free space inode we're only allowed to force a chunk
      allocation.  In order to use the normal flushing mechanism we need to
      encode this into a flush state array for normal inodes.  Since both will
      start with allocating chunks until the space info is full there is no
      need to add this as a flush state, this will be handled specially.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      058e6d1d
    • J
      btrfs: check tickets after waiting on ordered extents · 448b966b
      Josef Bacik 提交于
      Right now if the space is freed up after the ordered extents complete
      (which is likely since the reservations are held until they complete),
      we would do extra delalloc flushing before we'd notice that we didn't
      have any more tickets.  Fix this by moving the tickets check after our
      wait_ordered_extents check.
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      448b966b
    • J
      btrfs: use btrfs_start_delalloc_roots in shrink_delalloc · 38d715f4
      Josef Bacik 提交于
      The original iteration of flushing had us flushing delalloc and then
      checking to see if we could make our reservation, thus we were very
      careful about how many pages we would flush at once.
      
      But now that everything is async and we satisfy tickets as the space
      becomes available we don't have to keep track of any of this, simply
      try and flush the number of dirty inodes we may have in order to
      reclaim space to make our reservation.  This cleans up our delalloc
      flushing significantly.
      
      The async_pages stuff is dropped because btrfs_start_delalloc_roots()
      handles the case that we generate async extents for us, so we no longer
      require this extra logic.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      38d715f4