1. 27 7月, 2020 40 次提交
    • F
      btrfs: do not set the full sync flag on the inode during page release · 5e548b32
      Filipe Manana 提交于
      When removing an extent map at try_release_extent_mapping(), called through
      the page release callback (btrfs_releasepage()), we always set the full
      sync flag on the inode, which forces the next fsync to use a slower code
      path.
      
      This hurts performance for workloads that dirty an amount of data that
      exceeds or is very close to the system's RAM memory and do frequent fsync
      operations (like database servers can for example). In particular if there
      are concurrent fsyncs against different files, by falling back to a full
      fsync we do a lot more checksum lookups in the checksums btree, as we do
      it for all the extents created in the current transaction, instead of only
      the new ones since the last fsync. These checksums lookups not only take
      some time but, more importantly, they also cause contention on the
      checksums btree locks due to the concurrency with checksum insertions in
      the btree by ordered extents from other inodes.
      
      We actually don't need to set the full sync flag on the inode, because we
      only remove extent maps that are in the list of modified extents if they
      were created in a past transaction, in which case an fsync skips them as
      it's pointless to log them. So stop setting the full fsync flag on the
      inode whenever we remove an extent map.
      
      This patch is part of a patchset that consists of 3 patches, which have
      the following subjects:
      
      1/3 btrfs: fix race between page release and a fast fsync
      2/3 btrfs: release old extent maps during page release
      3/3 btrfs: do not set the full sync flag on the inode during page release
      
      Performance tests were ran against a branch (misc-next) containing the
      whole patchset. The test exercises a workload where there are multiple
      processes writing to files and fsyncing them (each writing and fsyncing
      its own file), and in total the amount of data dirtied ranges from 2x to
      4x the system's RAM memory (16GiB), so that the page release callback is
      invoked frequently.
      
      The following script, using fio, was used to perform the tests:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 3 ]; then
            echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
            exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=write
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=64k
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        thread
        EOF
      
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The tests were performed for different numbers of jobs, file sizes and
      fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
      12 cores, with cpu governance set to performance mode on all cores), 16GiB
      of ram (the host has 64GiB) and using a NVMe device directly (without an
      intermediary filesystem in the host). While running the tests, the host
      was not used for anything else, to avoid disturbing the tests.
      
      The obtained results were the following, and the last line printed by
      fio is pasted (includes aggregated throughput and test run time).
      
          *****************************************************
          ****     1 job, 32GiB file, fsync frequency 1     ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec
      
      After patchset:
      
      WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec
      (+0.7% throughput, -0.8% run time)
      
          *****************************************************
          ****     2 jobs, 16GiB files, fsync frequency 1   ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec
      
      After patchset:
      
      WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec
      (+19.1% throughput, -16.1% runtime)
      
          *****************************************************
          ****     4 jobs, 8GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec
      
      After patchset:
      
      WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec
      (+37.8% throughput, -27.5% runtime)
      
          *****************************************************
          ****     8 jobs, 4GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec
      
      After patchset:
      
      WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec
      (+74.7% throughput, -43.0% run time)
      
          *****************************************************
          ****    16 jobs, 2GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec
      
      After patchset:
      
      WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec
      (+146.3% throughput, -59.5% run time)
      
          *****************************************************
          ****    32 jobs, 2GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec
      
      After patchset:
      
      WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec
      (+212.7% throughput, -68.0% run time)
      
          *****************************************************
          ****    64 jobs, 1GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec
      
      After patchset:
      
      WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec
      (+197.6% throughput, -66.4% run time)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5e548b32
    • F
      btrfs: release old extent maps during page release · fbc2bd7e
      Filipe Manana 提交于
      When removing an extent map at try_release_extent_mapping(), called through
      the page release callback (btrfs_releasepage()), we never release an extent
      map that is in the list of modified extents. This is to prevent races with
      a concurrent fsync using the fast path, which could lead to not logging an
      extent created in the current transaction.
      
      However we can safely remove an extent map created in a past transaction
      that is still in the list of modified extents (because no one fsynced yet
      the inode after that transaction got commited), because such extents are
      skipped during an fsync as it is pointless to log them. This change does
      that.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fbc2bd7e
    • F
      btrfs: fix race between page release and a fast fsync · 3d6448e6
      Filipe Manana 提交于
      When releasing an extent map, done through the page release callback, we
      can race with an ongoing fast fsync and cause the fsync to miss a new
      extent and not log it. The steps for this to happen are the following:
      
      1) A page is dirtied for some inode I;
      
      2) Writeback for that page is triggered by a path other than fsync, for
         example by the system due to memory pressure;
      
      3) When the ordered extent for the extent (a single 4K page) finishes,
         we unpin the corresponding extent map and set its generation to N,
         the current transaction's generation;
      
      4) The btrfs_releasepage() callback is invoked by the system due to
         memory pressure for that no longer dirty page of inode I;
      
      5) At the same time, some task calls fsync on inode I, joins transaction
         N, and at btrfs_log_inode() it sees that the inode does not have the
         full sync flag set, so we proceed with a fast fsync. But before we get
         into btrfs_log_changed_extents() and lock the inode's extent map tree:
      
      6) Through btrfs_releasepage() we end up at try_release_extent_mapping()
         and we remove the extent map for the new 4Kb extent, because it is
         neither pinned anymore nor locked. By calling remove_extent_mapping(),
         we remove the extent map from the list of modified extents, since the
         extent map does not have the logging flag set. We unlock the inode's
         extent map tree;
      
      7) The task doing the fast fsync now enters btrfs_log_changed_extents(),
         locks the inode's extent map tree and iterates its list of modified
         extents, which no longer has the 4Kb extent in it, so it does not log
         the extent;
      
      8) The fsync finishes;
      
      9) Before transaction N is committed, a power failure happens. After
         replaying the log, the 4K extent of inode I will be missing, since
         it was not logged due to the race with try_release_extent_mapping().
      
      So fix this by teaching try_release_extent_mapping() to not remove an
      extent map if it's still in the list of modified extents.
      
      Fixes: ff44c6e3 ("Btrfs: do not hold the write_lock on the extent tree while logging")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3d6448e6
    • J
      btrfs: open-code remount flag setting in btrfs_remount · 88c4703f
      Johannes Thumshirn 提交于
      When we're (re)mounting a btrfs filesystem we set the
      BTRFS_FS_STATE_REMOUNTING state in fs_info to serialize against async
      reclaim or defrags.
      
      This flag is set in btrfs_remount_prepare() called by btrfs_remount().
      As btrfs_remount_prepare() does nothing but setting this flag and
      doesn't have a second caller, we can just open-code the flag setting in
      btrfs_remount().
      
      Similarly do for so clearing of the flag by moving it out of
      btrfs_remount_cleanup() into btrfs_remount() to be symmetrical.
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      88c4703f
    • J
      btrfs: if we're restriping, use the target restripe profile · 162e0a16
      Josef Bacik 提交于
      Previously we depended on some weird behavior in our chunk allocator to
      force the allocation of new stripes, so by the time we got to doing the
      reduce we would usually already have a chunk with the proper target.
      
      However that behavior causes other problems and needs to be removed.
      First however we need to remove this check to only restripe if we
      already have those available profiles, because if we're allocating our
      first chunk it obviously will not be available.  Simply use the target
      as specified, and if that fails it'll be because we're out of space.
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      162e0a16
    • J
      btrfs: don't adjust bg flags and use default allocation profiles · 349e120e
      Josef Bacik 提交于
      btrfs/061 has been failing consistently for me recently with a
      transaction abort.  We run out of space in the system chunk array, which
      means we've allocated way too many system chunks than we need.
      
      Chris added this a long time ago for balance as a poor mans restriping.
      If you had a single disk and then added another disk and then did a
      balance, update_block_group_flags would then figure out which RAID level
      you needed.
      
      Fast forward to today and we have restriping behavior, so we can
      explicitly tell the fs that we're trying to change the raid level.  This
      is accomplished through the normal get_alloc_profile path.
      
      Furthermore this code actually causes btrfs/061 to fail, because we do
      things like mkfs -m dup -d single with multiple devices.  This trips
      this check
      
      alloc_flags = update_block_group_flags(fs_info, cache->flags);
      if (alloc_flags != cache->flags) {
      	ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
      
      in btrfs_inc_block_group_ro.  Because we're balancing and scrubbing, but
      not actually restriping, we keep forcing chunk allocation of RAID1
      chunks.  This eventually causes us to run out of system space and the
      file system aborts and flips read only.
      
      We don't need this poor mans restriping any more, simply use the normal
      get_alloc_profile helper, which will get the correct alloc_flags and
      thus make the right decision for chunk allocation.  This keeps us from
      allocating a billion system chunks and falling over.
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      349e120e
    • J
      btrfs: fix lockdep splat from btrfs_dump_space_info · ab0db043
      Josef Bacik 提交于
      When running with -o enospc_debug you can get the following splat if one
      of the dump_space_info's trip
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc5+ #20 Tainted: G           OE
        ------------------------------------------------------
        dd/563090 is trying to acquire lock:
        ffff9e7dbf4f1e18 (&ctl->tree_lock){+.+.}-{2:2}, at: btrfs_dump_free_space+0x2b/0xa0 [btrfs]
      
        but task is already holding lock:
        ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (&cache->lock){+.+.}-{2:2}:
      	 _raw_spin_lock+0x25/0x30
      	 btrfs_add_reserved_bytes+0x3c/0x3c0 [btrfs]
      	 find_free_extent+0x7ef/0x13b0 [btrfs]
      	 btrfs_reserve_extent+0x9b/0x180 [btrfs]
      	 btrfs_alloc_tree_block+0xc1/0x340 [btrfs]
      	 alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
      	 __btrfs_cow_block+0x122/0x530 [btrfs]
      	 btrfs_cow_block+0x106/0x210 [btrfs]
      	 commit_cowonly_roots+0x55/0x300 [btrfs]
      	 btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
      	 sync_filesystem+0x74/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20 [btrfs]
      	 deactivate_locked_super+0x36/0x70
      	 cleanup_mnt+0x104/0x160
      	 task_work_run+0x5f/0x90
      	 __prepare_exit_to_usermode+0x1bd/0x1c0
      	 do_syscall_64+0x5e/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (&space_info->lock){+.+.}-{2:2}:
      	 _raw_spin_lock+0x25/0x30
      	 btrfs_block_rsv_release+0x1a6/0x3f0 [btrfs]
      	 btrfs_inode_rsv_release+0x4f/0x170 [btrfs]
      	 btrfs_clear_delalloc_extent+0x155/0x480 [btrfs]
      	 clear_state_bit+0x81/0x1a0 [btrfs]
      	 __clear_extent_bit+0x25c/0x5d0 [btrfs]
      	 clear_extent_bit+0x15/0x20 [btrfs]
      	 btrfs_invalidatepage+0x2b7/0x3c0 [btrfs]
      	 truncate_cleanup_page+0x47/0xe0
      	 truncate_inode_pages_range+0x238/0x840
      	 truncate_pagecache+0x44/0x60
      	 btrfs_setattr+0x202/0x5e0 [btrfs]
      	 notify_change+0x33b/0x490
      	 do_truncate+0x76/0xd0
      	 path_openat+0x687/0xa10
      	 do_filp_open+0x91/0x100
      	 do_sys_openat2+0x215/0x2d0
      	 do_sys_open+0x44/0x80
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&tree->lock#2){+.+.}-{2:2}:
      	 _raw_spin_lock+0x25/0x30
      	 find_first_extent_bit+0x32/0x150 [btrfs]
      	 write_pinned_extent_entries.isra.0+0xc5/0x100 [btrfs]
      	 __btrfs_write_out_cache+0x172/0x480 [btrfs]
      	 btrfs_write_out_cache+0x7a/0xf0 [btrfs]
      	 btrfs_write_dirty_block_groups+0x286/0x3b0 [btrfs]
      	 commit_cowonly_roots+0x245/0x300 [btrfs]
      	 btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
      	 close_ctree+0xf9/0x2f5 [btrfs]
      	 generic_shutdown_super+0x6c/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20 [btrfs]
      	 deactivate_locked_super+0x36/0x70
      	 cleanup_mnt+0x104/0x160
      	 task_work_run+0x5f/0x90
      	 __prepare_exit_to_usermode+0x1bd/0x1c0
      	 do_syscall_64+0x5e/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&ctl->tree_lock){+.+.}-{2:2}:
      	 __lock_acquire+0x1240/0x2460
      	 lock_acquire+0xab/0x360
      	 _raw_spin_lock+0x25/0x30
      	 btrfs_dump_free_space+0x2b/0xa0 [btrfs]
      	 btrfs_dump_space_info+0xf4/0x120 [btrfs]
      	 btrfs_reserve_extent+0x176/0x180 [btrfs]
      	 __btrfs_prealloc_file_range+0x145/0x550 [btrfs]
      	 cache_save_setup+0x28d/0x3b0 [btrfs]
      	 btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
      	 btrfs_commit_transaction+0xcc/0xac0 [btrfs]
      	 btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
      	 btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
      	 btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
      	 btrfs_file_write_iter+0x3cf/0x610 [btrfs]
      	 new_sync_write+0x11e/0x1b0
      	 vfs_write+0x1c9/0x200
      	 ksys_write+0x68/0xe0
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          &ctl->tree_lock --> &space_info->lock --> &cache->lock
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&cache->lock);
      				 lock(&space_info->lock);
      				 lock(&cache->lock);
          lock(&ctl->tree_lock);
      
         *** DEADLOCK ***
      
        6 locks held by dd/563090:
         #0: ffff9e7e21d18448 (sb_writers#14){.+.+}-{0:0}, at: vfs_write+0x195/0x200
         #1: ffff9e7dd0410ed8 (&sb->s_type->i_mutex_key#19){++++}-{3:3}, at: btrfs_file_write_iter+0x86/0x610 [btrfs]
         #2: ffff9e7e21d18638 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40b/0x5b0 [btrfs]
         #3: ffff9e7e1f05d688 (&cur_trans->cache_write_mutex){+.+.}-{3:3}, at: btrfs_start_dirty_block_groups+0x158/0x4f0 [btrfs]
         #4: ffff9e7e2284ddb8 (&space_info->groups_sem){++++}-{3:3}, at: btrfs_dump_space_info+0x69/0x120 [btrfs]
         #5: ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]
      
        stack backtrace:
        CPU: 3 PID: 563090 Comm: dd Tainted: G           OE     5.8.0-rc5+ #20
        Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
        Call Trace:
         dump_stack+0x96/0xd0
         check_noncircular+0x162/0x180
         __lock_acquire+0x1240/0x2460
         ? wake_up_klogd.part.0+0x30/0x40
         lock_acquire+0xab/0x360
         ? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
         _raw_spin_lock+0x25/0x30
         ? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
         btrfs_dump_free_space+0x2b/0xa0 [btrfs]
         btrfs_dump_space_info+0xf4/0x120 [btrfs]
         btrfs_reserve_extent+0x176/0x180 [btrfs]
         __btrfs_prealloc_file_range+0x145/0x550 [btrfs]
         ? btrfs_qgroup_reserve_data+0x1d/0x60 [btrfs]
         cache_save_setup+0x28d/0x3b0 [btrfs]
         btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
         btrfs_commit_transaction+0xcc/0xac0 [btrfs]
         ? start_transaction+0xe0/0x5b0 [btrfs]
         btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
         btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
         btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
         ? ktime_get_coarse_real_ts64+0xa8/0xd0
         ? trace_hardirqs_on+0x1c/0xe0
         btrfs_file_write_iter+0x3cf/0x610 [btrfs]
         new_sync_write+0x11e/0x1b0
         vfs_write+0x1c9/0x200
         ksys_write+0x68/0xe0
         do_syscall_64+0x52/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is because we're holding the block_group->lock while trying to dump
      the free space cache.  However we don't need this lock, we just need it
      to read the values for the printk, so move the free space cache dumping
      outside of the block group lock.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ab0db043
    • J
      btrfs: move the chunk_mutex in btrfs_read_chunk_tree · 01d01caf
      Josef Bacik 提交于
      We are currently getting this lockdep splat in btrfs/161:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc5+ #20 Tainted: G            E
        ------------------------------------------------------
        mount/678048 is trying to acquire lock:
        ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs]
      
        but task is already holding lock:
        ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x8b/0x8f0
      	 btrfs_init_new_device+0x2d2/0x1240 [btrfs]
      	 btrfs_ioctl+0x1de/0x2d20 [btrfs]
      	 ksys_ioctl+0x87/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x1240/0x2460
      	 lock_acquire+0xab/0x360
      	 __mutex_lock+0x8b/0x8f0
      	 clone_fs_devices+0x4d/0x170 [btrfs]
      	 btrfs_read_chunk_tree+0x330/0x800 [btrfs]
      	 open_ctree+0xb7c/0x18ce [btrfs]
      	 btrfs_mount_root.cold+0x13/0xfa [btrfs]
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 fc_mount+0xe/0x40
      	 vfs_kern_mount.part.0+0x71/0x90
      	 btrfs_mount+0x13b/0x3e0 [btrfs]
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 do_mount+0x7de/0xb30
      	 __x64_sys_mount+0x8e/0xd0
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->chunk_mutex);
      				 lock(&fs_devs->device_list_mutex);
      				 lock(&fs_info->chunk_mutex);
          lock(&fs_devs->device_list_mutex);
      
         *** DEADLOCK ***
      
        3 locks held by mount/678048:
         #0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
         #1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs]
         #2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
      
        stack backtrace:
        CPU: 2 PID: 678048 Comm: mount Tainted: G            E     5.8.0-rc5+ #20
        Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
        Call Trace:
         dump_stack+0x96/0xd0
         check_noncircular+0x162/0x180
         __lock_acquire+0x1240/0x2460
         ? asm_sysvec_apic_timer_interrupt+0x12/0x20
         lock_acquire+0xab/0x360
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         __mutex_lock+0x8b/0x8f0
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         ? rcu_read_lock_sched_held+0x52/0x60
         ? cpumask_next+0x16/0x20
         ? module_assert_mutex_or_preempt+0x14/0x40
         ? __module_address+0x28/0xf0
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         ? static_obj+0x4f/0x60
         ? lockdep_init_map_waits+0x43/0x200
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         clone_fs_devices+0x4d/0x170 [btrfs]
         btrfs_read_chunk_tree+0x330/0x800 [btrfs]
         open_ctree+0xb7c/0x18ce [btrfs]
         ? super_setup_bdi_name+0x79/0xd0
         btrfs_mount_root.cold+0x13/0xfa [btrfs]
         ? vfs_parse_fs_string+0x84/0xb0
         ? rcu_read_lock_sched_held+0x52/0x60
         ? kfree+0x2b5/0x310
         legacy_get_tree+0x30/0x50
         vfs_get_tree+0x28/0xc0
         fc_mount+0xe/0x40
         vfs_kern_mount.part.0+0x71/0x90
         btrfs_mount+0x13b/0x3e0 [btrfs]
         ? cred_has_capability+0x7c/0x120
         ? rcu_read_lock_sched_held+0x52/0x60
         ? legacy_get_tree+0x30/0x50
         legacy_get_tree+0x30/0x50
         vfs_get_tree+0x28/0xc0
         do_mount+0x7de/0xb30
         ? memdup_user+0x4e/0x90
         __x64_sys_mount+0x8e/0xd0
         do_syscall_64+0x52/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is because btrfs_read_chunk_tree() can come upon DEV_EXTENT's and
      then read the device, which takes the device_list_mutex.  The
      device_list_mutex needs to be taken before the chunk_mutex, so this is a
      problem.  We only really need the chunk mutex around adding the chunk,
      so move the mutex around read_one_chunk.
      
      An argument could be made that we don't even need the chunk_mutex here
      as it's during mount, and we are protected by various other locks.
      However we already have special rules for ->device_list_mutex, and I'd
      rather not have another special case for ->chunk_mutex.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      01d01caf
    • J
      btrfs: open device without device_list_mutex · 18c850fd
      Josef Bacik 提交于
      There's long existed a lockdep splat because we open our bdev's under
      the ->device_list_mutex at mount time, which acquires the bd_mutex.
      Usually this goes unnoticed, but if you do loopback devices at all
      suddenly the bd_mutex comes with a whole host of other dependencies,
      which results in the splat when you mount a btrfs file system.
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
      ------------------------------------------------------
      systemd-journal/509 is trying to acquire lock:
      ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
      
      but task is already holding lock:
      ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
       -> #6 (sb_pagefaults){.+.+}-{0:0}:
             __sb_start_write+0x13e/0x220
             btrfs_page_mkwrite+0x59/0x560 [btrfs]
             do_page_mkwrite+0x4f/0x130
             do_wp_page+0x3b0/0x4f0
             handle_mm_fault+0xf47/0x1850
             do_user_addr_fault+0x1fc/0x4b0
             exc_page_fault+0x88/0x300
             asm_exc_page_fault+0x1e/0x30
      
       -> #5 (&mm->mmap_lock#2){++++}-{3:3}:
             __might_fault+0x60/0x80
             _copy_from_user+0x20/0xb0
             get_sg_io_hdr+0x9a/0xb0
             scsi_cmd_ioctl+0x1ea/0x2f0
             cdrom_ioctl+0x3c/0x12b4
             sr_block_ioctl+0xa4/0xd0
             block_ioctl+0x3f/0x50
             ksys_ioctl+0x82/0xc0
             __x64_sys_ioctl+0x16/0x20
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #4 (&cd->lock){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             sr_block_open+0xa2/0x180
             __blkdev_get+0xdd/0x550
             blkdev_get+0x38/0x150
             do_dentry_open+0x16b/0x3e0
             path_openat+0x3c9/0xa00
             do_filp_open+0x75/0x100
             do_sys_openat2+0x8a/0x140
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             __blkdev_get+0x6a/0x550
             blkdev_get+0x85/0x150
             blkdev_get_by_path+0x2c/0x70
             btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
             open_fs_devices+0x88/0x240 [btrfs]
             btrfs_open_devices+0x92/0xa0 [btrfs]
             btrfs_mount_root+0x250/0x490 [btrfs]
             legacy_get_tree+0x30/0x50
             vfs_get_tree+0x28/0xc0
             vfs_kern_mount.part.0+0x71/0xb0
             btrfs_mount+0x119/0x380 [btrfs]
             legacy_get_tree+0x30/0x50
             vfs_get_tree+0x28/0xc0
             do_mount+0x8c6/0xca0
             __x64_sys_mount+0x8e/0xd0
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             btrfs_run_dev_stats+0x36/0x420 [btrfs]
             commit_cowonly_roots+0x91/0x2d0 [btrfs]
             btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
             btrfs_sync_file+0x38a/0x480 [btrfs]
             __x64_sys_fdatasync+0x47/0x80
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
             btrfs_sync_file+0x38a/0x480 [btrfs]
             __x64_sys_fdatasync+0x47/0x80
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
             __lock_acquire+0x1241/0x20c0
             lock_acquire+0xb0/0x400
             __mutex_lock+0x7b/0x820
             btrfs_record_root_in_trans+0x44/0x70 [btrfs]
             start_transaction+0xd2/0x500 [btrfs]
             btrfs_dirty_inode+0x44/0xd0 [btrfs]
             file_update_time+0xc6/0x120
             btrfs_page_mkwrite+0xda/0x560 [btrfs]
             do_page_mkwrite+0x4f/0x130
             do_wp_page+0x3b0/0x4f0
             handle_mm_fault+0xf47/0x1850
             do_user_addr_fault+0x1fc/0x4b0
             exc_page_fault+0x88/0x300
             asm_exc_page_fault+0x1e/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
      
      Possible unsafe locking scenario:
      
           CPU0                    CPU1
           ----                    ----
       lock(sb_pagefaults);
                                   lock(&mm->mmap_lock#2);
                                   lock(sb_pagefaults);
       lock(&fs_info->reloc_mutex);
      
       *** DEADLOCK ***
      
      3 locks held by systemd-journal/509:
       #0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
       #1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
       #2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
      
      stack backtrace:
      CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      Call Trace:
       dump_stack+0x92/0xc8
       check_noncircular+0x134/0x150
       __lock_acquire+0x1241/0x20c0
       lock_acquire+0xb0/0x400
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       ? lock_acquire+0xb0/0x400
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       __mutex_lock+0x7b/0x820
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       ? kvm_sched_clock_read+0x14/0x30
       ? sched_clock+0x5/0x10
       ? sched_clock_cpu+0xc/0xb0
       btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       start_transaction+0xd2/0x500 [btrfs]
       btrfs_dirty_inode+0x44/0xd0 [btrfs]
       file_update_time+0xc6/0x120
       btrfs_page_mkwrite+0xda/0x560 [btrfs]
       ? sched_clock+0x5/0x10
       do_page_mkwrite+0x4f/0x130
       do_wp_page+0x3b0/0x4f0
       handle_mm_fault+0xf47/0x1850
       do_user_addr_fault+0x1fc/0x4b0
       exc_page_fault+0x88/0x300
       ? asm_exc_page_fault+0x8/0x30
       asm_exc_page_fault+0x1e/0x30
      RIP: 0033:0x7fa3972fdbfe
      Code: Bad RIP value.
      
      Fix this by not holding the ->device_list_mutex at this point.  The
      device_list_mutex exists to protect us from modifying the device list
      while the file system is running.
      
      However it can also be modified by doing a scan on a device.  But this
      action is specifically protected by the uuid_mutex, which we are holding
      here.  We cannot race with opening at this point because we have the
      ->s_mount lock held during the mount.  Not having the
      ->device_list_mutex here is perfectly safe as we're not going to change
      the devices at this point.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add some comments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      18c850fd
    • J
      btrfs: sysfs: use NOFS for device creation · a47bd78d
      Josef Bacik 提交于
      Dave hit this splat during testing btrfs/078:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc6-default+ #1191 Not tainted
        ------------------------------------------------------
        kswapd0/75 is trying to acquire lock:
        ffffa040e9d04ff8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
      
        but task is already holding lock:
        ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (fs_reclaim){+.+.}-{0:0}:
      	 __lock_acquire+0x56f/0xaa0
      	 lock_acquire+0xa3/0x440
      	 fs_reclaim_acquire.part.0+0x25/0x30
      	 __kmalloc_track_caller+0x49/0x330
      	 kstrdup+0x2e/0x60
      	 __kernfs_new_node.constprop.0+0x44/0x250
      	 kernfs_new_node+0x25/0x50
      	 kernfs_create_link+0x34/0xa0
      	 sysfs_do_create_link_sd+0x5e/0xd0
      	 btrfs_sysfs_add_devices_dir+0x65/0x100 [btrfs]
      	 btrfs_init_new_device+0x44c/0x12b0 [btrfs]
      	 btrfs_ioctl+0xc3c/0x25c0 [btrfs]
      	 ksys_ioctl+0x68/0xa0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0xe0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x56f/0xaa0
      	 lock_acquire+0xa3/0x440
      	 __mutex_lock+0xa0/0xaf0
      	 btrfs_chunk_alloc+0x137/0x3e0 [btrfs]
      	 find_free_extent+0xb44/0xfb0 [btrfs]
      	 btrfs_reserve_extent+0x9b/0x180 [btrfs]
      	 btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
      	 alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
      	 __btrfs_cow_block+0x143/0x7a0 [btrfs]
      	 btrfs_cow_block+0x15f/0x310 [btrfs]
      	 push_leaf_right+0x150/0x240 [btrfs]
      	 split_leaf+0x3cd/0x6d0 [btrfs]
      	 btrfs_search_slot+0xd14/0xf70 [btrfs]
      	 btrfs_insert_empty_items+0x64/0xc0 [btrfs]
      	 __btrfs_commit_inode_delayed_items+0xb2/0x840 [btrfs]
      	 btrfs_async_run_delayed_root+0x10e/0x1d0 [btrfs]
      	 btrfs_work_helper+0x2f9/0x650 [btrfs]
      	 process_one_work+0x22c/0x600
      	 worker_thread+0x50/0x3b0
      	 kthread+0x137/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 check_prev_add+0x98/0xa20
      	 validate_chain+0xa8c/0x2a00
      	 __lock_acquire+0x56f/0xaa0
      	 lock_acquire+0xa3/0x440
      	 __mutex_lock+0xa0/0xaf0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
      	 btrfs_evict_inode+0x3bf/0x560 [btrfs]
      	 evict+0xd6/0x1c0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x54/0x80
      	 super_cache_scan+0x121/0x1a0
      	 do_shrink_slab+0x175/0x420
      	 shrink_slab+0xb1/0x2e0
      	 shrink_node+0x192/0x600
      	 balance_pgdat+0x31f/0x750
      	 kswapd+0x206/0x510
      	 kthread+0x137/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(&fs_info->chunk_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/75:
         #0: ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffff8b0b50b8 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
         #2: ffffa040e057c0e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50
      
        stack backtrace:
        CPU: 2 PID: 75 Comm: kswapd0 Not tainted 5.8.0-rc6-default+ #1191
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x16f/0x190
         check_prev_add+0x98/0xa20
         validate_chain+0xa8c/0x2a00
         __lock_acquire+0x56f/0xaa0
         lock_acquire+0xa3/0x440
         ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
         __mutex_lock+0xa0/0xaf0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
         ? __lock_acquire+0x56f/0xaa0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
         ? lock_acquire+0xa3/0x440
         ? btrfs_evict_inode+0x138/0x560 [btrfs]
         ? btrfs_evict_inode+0x2fe/0x560 [btrfs]
         ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
         __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
         btrfs_evict_inode+0x3bf/0x560 [btrfs]
         evict+0xd6/0x1c0
         dispose_list+0x48/0x70
         prune_icache_sb+0x54/0x80
         super_cache_scan+0x121/0x1a0
         do_shrink_slab+0x175/0x420
         shrink_slab+0xb1/0x2e0
         shrink_node+0x192/0x600
         balance_pgdat+0x31f/0x750
         kswapd+0x206/0x510
         ? _raw_spin_unlock_irqrestore+0x3e/0x50
         ? finish_wait+0x90/0x90
         ? balance_pgdat+0x750/0x750
         kthread+0x137/0x150
         ? kthread_stop+0x2a0/0x2a0
         ret_from_fork+0x1f/0x30
      
      This is because we're holding the chunk_mutex while adding this device
      and adding its sysfs entries.  We actually hold different locks in
      different places when calling this function, the dev_replace semaphore
      for instance in dev replace, so instead of moving this call around
      simply wrap it's operations in NOFS.
      
      CC: stable@vger.kernel.org # 4.14+
      Reported-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a47bd78d
    • J
      btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases · fbabd4a3
      Josef Bacik 提交于
      Eric reported seeing this message while running generic/475
      
        BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      Full stack trace:
      
        BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
        BTRFS info (device dm-0): forced readonly
        BTRFS warning (device dm-0): Skipping commit of aborted transaction.
        ------------[ cut here ]------------
        BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
        BTRFS: Transaction aborted (error -117)
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
        WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
        CPU: 3 PID: 23985 Comm: fsstress Tainted: G        W    L    5.8.0-rc4-default+ #1181
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
        RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
        RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
        RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
        R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
        FS:  00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
        Call Trace:
         ? finish_wait+0x90/0x90
         ? __mutex_unlock_slowpath+0x45/0x2a0
         ? lock_acquire+0xa3/0x440
         ? lockref_put_or_lock+0x9/0x30
         ? dput+0x20/0x4a0
         ? dput+0x20/0x4a0
         ? do_raw_spin_unlock+0x4b/0xc0
         ? _raw_spin_unlock+0x1f/0x30
         btrfs_sync_file+0x335/0x490 [btrfs]
         do_fsync+0x38/0x70
         __x64_sys_fsync+0x10/0x20
         do_syscall_64+0x50/0xe0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f6a6ef1b6e3
        Code: Bad RIP value.
        RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
        RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
        RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
        RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
        R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
        R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last  enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace af146e0e38433456 ]---
        BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      This ret came from btrfs_write_marked_extents().  If we get an aborted
      transaction via EIO before, we'll see it in btree_write_cache_pages()
      and return EUCLEAN, which gets printed as "Filesystem corrupted".
      
      Except we shouldn't be returning EUCLEAN here, we need to be returning
      EROFS because EUCLEAN is reserved for actual corruption, not IO errors.
      
      We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
      elsewhere, but we want to use EROFS for this particular case.  The
      original transaction abort has the real error code for why we ended up
      with an aborted transaction, all subsequent actions just need to return
      EROFS because they may not have a trans handle and have no idea about
      the original cause of the abort.
      
      After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
      stacktrace will not be dumped either.
      Reported-by: NEric Sandeen <esandeen@redhat.com>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add full test stacktrace ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fbabd4a3
    • J
      btrfs: document special case error codes for fs errors · 59131393
      Josef Bacik 提交于
      We've had some discussions about what to do in certain scenarios for
      error codes, specifically EUCLEAN and EROFS.  Document these near the
      error handling code so its clear what their intentions are.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      59131393
    • J
      btrfs: don't WARN if we abort a transaction with EROFS · f95ebdbe
      Josef Bacik 提交于
      If we got some sort of corruption via a read and call
      btrfs_handle_fs_error() we'll set BTRFS_FS_STATE_ERROR on the fs and
      complain.  If a subsequent trans handle trips over this it'll get EROFS
      and then abort.  However at that point we're not aborting for the
      original reason, we're aborting because we've been flipped read only.
      We do not need to WARN_ON() here.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f95ebdbe
    • F
      btrfs: reduce contention on log trees when logging checksums · 3ebac17c
      Filipe Manana 提交于
      The possibility of extents being shared (through clone and deduplication
      operations) requires special care when logging data checksums, to avoid
      having a log tree with different checksum items that cover ranges which
      overlap (which resulted in missing checksums after replaying a log tree).
      Such problems were fixed in the past by the following commits:
      
      commit 40e046ac ("Btrfs: fix missing data checksums after replaying a
                            log tree")
      
      commit e289f03e ("btrfs: fix corrupt log due to concurrent fsync of
                            inodes with shared extents")
      
      Test case generic/588 exercises the scenario solved by the first commit
      (purely sequential and deterministic) while test case generic/457 often
      triggered the case fixed by the second commit (not deterministic, requires
      specific timings under concurrency).
      
      The problems were addressed by deleting, from the log tree, any existing
      checksums before logging the new ones. And also by doing the deletion and
      logging of the cheksums while locking the checksum range in an extent io
      tree (root->log_csum_range), to deal with the case where we have concurrent
      fsyncs against files with shared extents.
      
      That however causes more contention on the leaves of a log tree where we
      store checksums (and all the nodes in the paths leading to them), even
      when we do not have shared extents, or all the shared extents were created
      by past transactions. It also adds a bit of contention on the spin lock of
      the log_csums_range extent io tree of the log root.
      
      This change adds a 'last_reflink_trans' field to the inode to keep track
      of the last transaction where a new extent was shared between inodes
      (through clone and deduplication operations). It is updated for both the
      source and destination inodes of reflink operations whenever a new extent
      (created in the current transaction) becomes shared by the inodes. This
      field is kept in memory only, not persisted in the inode item, similar
      to other existing fields (last_unlink_trans, logged_trans).
      
      When logging checksums for an extent, if the value of 'last_reflink_trans'
      is smaller then the current transaction's generation/id, we skip locking
      the extent range and deletion of checksums from the log tree, since we
      know we do not have new shared extents. This reduces contention on the
      log tree's leaves where checksums are stored.
      
      The following script, which uses fio, was used to measure the impact of
      this change:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 3 ]; then
            echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
            exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=write
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=64k
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The tests were performed for different numbers of jobs, file sizes and
      fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
      12 cores, with cpu governance set to performance mode on all cores), 16GiB
      of ram (the host has 64GiB) and using a NVMe device directly (without an
      intermediary filesystem in the host). While running the tests, the host
      was not used for anything else, to avoid disturbing the tests.
      
      The obtained results were the following (the last line of fio's output was
      pasted). Starting with 16 jobs is where a significant difference is
      observable in this particular setup and hardware (differences highlighted
      below). The very small differences for tests with less than 16 jobs are
      possibly just noise and random.
      
          **** 1 job, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=23.8MiB/s (24.9MB/s), 23.8MiB/s-23.8MiB/s (24.9MB/s-24.9MB/s), io=1024MiB (1074MB), run=43075-43075msec
      
      after this change:
      
      WRITE: bw=24.4MiB/s (25.6MB/s), 24.4MiB/s-24.4MiB/s (25.6MB/s-25.6MB/s), io=1024MiB (1074MB), run=41938-41938msec
      
          **** 2 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=37.7MiB/s (39.5MB/s), 37.7MiB/s-37.7MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54351-54351msec
      
      after this change:
      
      WRITE: bw=37.7MiB/s (39.5MB/s), 37.6MiB/s-37.6MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54428-54428msec
      
          **** 4 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=67.5MiB/s (70.8MB/s), 67.5MiB/s-67.5MiB/s (70.8MB/s-70.8MB/s), io=4096MiB (4295MB), run=60669-60669msec
      
      after this change:
      
      WRITE: bw=68.6MiB/s (71.0MB/s), 68.6MiB/s-68.6MiB/s (71.0MB/s-71.0MB/s), io=4096MiB (4295MB), run=59678-59678msec
      
          **** 8 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=8192MiB (8590MB), run=64048-64048msec
      
      after this change:
      
      WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=8192MiB (8590MB), run=63405-63405msec
      
          **** 16 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=78.5MiB/s (82.3MB/s), 78.5MiB/s-78.5MiB/s (82.3MB/s-82.3MB/s), io=16.0GiB (17.2GB), run=208676-208676msec
      
      after this change:
      
      WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=16.0GiB (17.2GB), run=149295-149295msec
      (+40.1% throughput, -28.5% runtime)
      
          **** 32 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=58.8MiB/s (61.7MB/s), 58.8MiB/s-58.8MiB/s (61.7MB/s-61.7MB/s), io=32.0GiB (34.4GB), run=557134-557134msec
      
      after this change:
      
      WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430550-430550msec
      (+29.4% throughput, -22.7% runtime)
      
          **** 64 jobs, file size 512M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=65.8MiB/s (68.0MB/s), 65.8MiB/s-65.8MiB/s (68.0MB/s-68.0MB/s), io=32.0GiB (34.4GB), run=498055-498055msec
      
      after this change:
      
      WRITE: bw=85.1MiB/s (89.2MB/s), 85.1MiB/s-85.1MiB/s (89.2MB/s-89.2MB/s), io=32.0GiB (34.4GB), run=385116-385116msec
      (+29.3% throughput, -22.7% runtime)
      
          **** 128 jobs, file size 256M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=54.7MiB/s (57.3MB/s), 54.7MiB/s-54.7MiB/s (57.3MB/s-57.3MB/s), io=32.0GiB (34.4GB), run=599373-599373msec
      
      after this change:
      
      WRITE: bw=121MiB/s (126MB/s), 121MiB/s-121MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=271907-271907msec
      (+121.2% throughput, -54.6% runtime)
      
          **** 256 jobs, file size 256M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=69.2MiB/s (72.5MB/s), 69.2MiB/s-69.2MiB/s (72.5MB/s-72.5MB/s), io=64.0GiB (68.7GB), run=947536-947536msec
      
      after this change:
      
      WRITE: bw=121MiB/s (127MB/s), 121MiB/s-121MiB/s (127MB/s-127MB/s), io=64.0GiB (68.7GB), run=541916-541916msec
      (+74.9% throughput, -42.8% runtime)
      
          **** 512 jobs, file size 128M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s (89.5MB/s-89.5MB/s), io=64.0GiB (68.7GB), run=767734-767734msec
      
      after this change:
      
      WRITE: bw=141MiB/s (147MB/s), 141MiB/s-141MiB/s (147MB/s-147MB/s), io=64.0GiB (68.7GB), run=466022-466022msec
      (+65.1% throughput, -39.3% runtime)
      
          **** 1024 jobs, file size 128M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=115MiB/s (120MB/s), 115MiB/s-115MiB/s (120MB/s-120MB/s), io=128GiB (137GB), run=1143775-1143775msec
      
      after this change:
      
      WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=128GiB (137GB), run=764843-764843msec
      (+48.7% throughput, -33.1% runtime)
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3ebac17c
    • N
      btrfs: remove done label in writepage_delalloc · b69d1ee9
      Nikolay Borisov 提交于
      Since there is not common cleanup run after the label it makes it
      somewhat redundant.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b69d1ee9
    • Q
      btrfs: add comments for btrfs_reserve_flush_enum · fd7fb634
      Qu Wenruo 提交于
      This enum is the interface exposed to developers.
      
      Although we have a detailed comment explaining the whole idea of space
      flushing at the beginning of space-info.c, the exposed enum interface
      doesn't have any comment.
      
      Some corner cases, like BTRFS_RESERVE_FLUSH_ALL and
      BTRFS_RESERVE_FLUSH_ALL_STEAL can be interrupted by fatal signals, are
      not explained at all.
      
      So add some simple comments for these enums as a quick reference.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fd7fb634
    • Q
      btrfs: relocation: review the call sites which can be interrupted by signal · 44d354ab
      Qu Wenruo 提交于
      Since most metadata reservation calls can return -EINTR when get
      interrupted by fatal signal, we need to review the all the metadata
      reservation call sites.
      
      In relocation code, the metadata reservation happens in the following
      sites:
      
      - btrfs_block_rsv_refill() in merge_reloc_root()
        merge_reloc_root() is a pretty critical section, we don't want to be
        interrupted by signal, so change the flush status to
        BTRFS_RESERVE_FLUSH_LIMIT, so it won't get interrupted by signal.
        Since such change can be ENPSPC-prone, also shrink the amount of
        metadata to reserve least amount avoid deadly ENOSPC there.
      
      - btrfs_block_rsv_refill() in reserve_metadata_space()
        It calls with BTRFS_RESERVE_FLUSH_LIMIT, which won't get interrupted
        by signal.
      
      - btrfs_block_rsv_refill() in prepare_to_relocate()
      
      - btrfs_block_rsv_add() in prepare_to_relocate()
      
      - btrfs_block_rsv_refill() in relocate_block_group()
      
      - btrfs_delalloc_reserve_metadata() in relocate_file_extent_cluster()
      
      - btrfs_start_transaction() in relocate_block_group()
      
      - btrfs_start_transaction() in create_reloc_inode()
        Can be interrupted by fatal signal and we can handle it easily.
        For these call sites, just catch the -EINTR value in btrfs_balance()
        and count them as canceled.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      44d354ab
    • Q
      btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree · f3e3d9cc
      Qu Wenruo 提交于
      [BUG]
      There is a bug report about bad signal timing could lead to read-only
      fs during balance:
      
        BTRFS info (device xvdb): balance: start -d -m -s
        BTRFS info (device xvdb): relocating block group 73001861120 flags metadata
        BTRFS info (device xvdb): found 12236 extents, stage: move data extents
        BTRFS info (device xvdb): relocating block group 71928119296 flags data
        BTRFS info (device xvdb): found 3 extents, stage: move data extents
        BTRFS info (device xvdb): found 3 extents, stage: update data pointers
        BTRFS info (device xvdb): relocating block group 60922265600 flags metadata
        BTRFS: error (device xvdb) in btrfs_drop_snapshot:5505: errno=-4 unknown
        BTRFS info (device xvdb): forced readonly
        BTRFS info (device xvdb): balance: ended with status: -4
      
      [CAUSE]
      The direct cause is the -EINTR from the following call chain when a
      fatal signal is pending:
      
       relocate_block_group()
       |- clean_dirty_subvols()
          |- btrfs_drop_snapshot()
             |- btrfs_start_transaction()
                |- btrfs_delayed_refs_rsv_refill()
                   |- btrfs_reserve_metadata_bytes()
                      |- __reserve_metadata_bytes()
                         |- wait_reserve_ticket()
                            |- prepare_to_wait_event();
                            |- ticket->error = -EINTR;
      
      Normally this behavior is fine for most btrfs_start_transaction()
      callers, as they need to catch any other error, same for the signal, and
      exit ASAP.
      
      However for balance, especially for the clean_dirty_subvols() case, we're
      already doing cleanup works, getting -EINTR from btrfs_drop_snapshot()
      could cause a lot of unexpected problems.
      
      From the mentioned forced read-only report, to later balance error due
      to half dropped reloc trees.
      
      [FIX]
      Fix this problem by using btrfs_join_transaction() if
      btrfs_drop_snapshot() is called from relocation context.
      
      Since btrfs_join_transaction() won't get interrupted by signal, we can
      continue the cleanup.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>3
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f3e3d9cc
    • Q
      btrfs: relocation: allow signal to cancel balance · 5cb502f4
      Qu Wenruo 提交于
      Although btrfs balance can be canceled with "btrfs balance cancel"
      command, it's still almost muscle memory to press Ctrl-C to cancel a
      long running btrfs balance.
      
      So allow btrfs balance to check signal to determine if it should exit.
      The cancellation points are in known location and we're only adding one
      more reason, so this should be safe.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5cb502f4
    • N
      btrfs: raid56: remove out label in __raid56_parity_recover · 813f8a0e
      Nikolay Borisov 提交于
      There's no cleanup that occurs so we can simply return 0 directly.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      813f8a0e
    • D
      btrfs: add missing check for nocow and compression inode flags · f37c563b
      David Sterba 提交于
      User Forza reported on IRC that some invalid combinations of file
      attributes are accepted by chattr.
      
      The NODATACOW and compression file flags/attributes are mutually
      exclusive, but they could be set by 'chattr +c +C' on an empty file. The
      nodatacow will be in effect because it's checked first in
      btrfs_run_delalloc_range.
      
      Extend the flag validation to catch the following cases:
      
        - input flags are conflicting
        - old and new flags are conflicting
        - initialize the local variable with inode flags after inode ls locked
      
      Inode attributes take precedence over mount options and are an
      independent setting.
      
      Nocompress would be a no-op with nodatacow, but we don't want to mix
      any compression-related options with nodatacow.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f37c563b
    • A
      btrfs: don't traverse into the seed devices in show_devname · 4faf55b0
      Anand Jain 提交于
      ->show_devname currently shows the lowest devid in the list. As the seed
      devices have the lowest devid in the sprouted filesystem, the userland
      tool such as findmnt end up seeing seed device instead of the device from
      the read-writable sprouted filesystem. As shown below.
      
       mount /dev/sda /btrfs
       mount: /btrfs: WARNING: device write-protected, mounted read-only.
      
       findmnt --output SOURCE,TARGET,UUID /btrfs
       SOURCE   TARGET UUID
       /dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
      
       btrfs dev add -f /dev/sdb /btrfs
      
       umount /btrfs
       mount /dev/sdb /btrfs
      
       findmnt --output SOURCE,TARGET,UUID /btrfs
       SOURCE   TARGET UUID
       /dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
      
      All sprouts from a single seed will show the same seed device and the
      same fsid. That's confusing.
      This is causing problems in our prototype as there isn't any reference
      to the sprout file-system(s) which is being used for actual read and
      write.
      
      This was added in the patch which implemented the show_devname in btrfs
      commit 9c5085c1 ("Btrfs: implement ->show_devname").
      I tried to look for any particular reason that we need to show the seed
      device, there isn't any.
      
      So instead, do not traverse through the seed devices, just show the
      lowest devid in the sprouted fsid.
      
      After the patch:
      
       mount /dev/sda /btrfs
       mount: /btrfs: WARNING: device write-protected, mounted read-only.
      
       findmnt --output SOURCE,TARGET,UUID /btrfs
       SOURCE   TARGET UUID
       /dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
      
       btrfs dev add -f /dev/sdb /btrfs
       mount -o rw,remount /dev/sdb /btrfs
      
       findmnt --output SOURCE,TARGET,UUID /btrfs
       SOURCE   TARGET UUID
       /dev/sdb /btrfs 595ca0e6-b82e-46b5-b9e2-c72a6928be48
      
       mount /dev/sda /btrfs1
       mount: /btrfs1: WARNING: device write-protected, mounted read-only.
      
       btrfs dev add -f /dev/sdc /btrfs1
      
       findmnt --output SOURCE,TARGET,UUID /btrfs1
       SOURCE   TARGET  UUID
       /dev/sdc /btrfs1 ca1dbb7a-8446-4f95-853c-a20f3f82bdbb
      
       cat /proc/self/mounts | grep btrfs
       /dev/sdb /btrfs btrfs rw,relatime,noacl,space_cache,subvolid=5,subvol=/ 0 0
       /dev/sdc /btrfs1 btrfs ro,relatime,noacl,space_cache,subvolid=5,subvol=/ 0 0
      Reported-by: NMartin K. Petersen <martin.petersen@oracle.com>
      CC: stable@vger.kernel.org # 4.19+
      Tested-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4faf55b0
    • Q
      btrfs: qgroup: free per-trans reserved space when a subvolume gets dropped · a3cf0e43
      Qu Wenruo 提交于
      [BUG]
      Sometime fsstress could lead to qgroup warning for case like
      generic/013:
      
        BTRFS warning (device dm-3): qgroup 0/259 has unreleased space, type 1 rsv 81920
        ------------[ cut here ]------------
        WARNING: CPU: 9 PID: 24535 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
        Call Trace:
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         __prepare_exit_to_usermode+0x1bc/0x1c0
         __syscall_return_slowpath+0x47/0x230
         do_syscall_64+0x64/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace 6c341cdf9b6cc3c1 ]---
        BTRFS error (device dm-3): qgroup reserved space leaked
      
      While that subvolume 259 is no longer in that filesystem.
      
      [CAUSE]
      Normally per-trans qgroup reserved space is freed when a transaction is
      committed, in commit_fs_roots().
      
      However for completely dropped subvolume, that subvolume is completely
      gone, thus is no longer in the fs_roots_radix, and its per-trans
      reserved qgroup will never be freed.
      
      Since the subvolume is already gone, leaked per-trans space won't cause
      any trouble for end users.
      
      [FIX]
      Just call btrfs_qgroup_free_meta_all_pertrans() before a subvolume is
      completely dropped.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a3cf0e43
    • T
      btrfs: ref-verify: fix memory leak in add_block_entry · d60ba8de
      Tom Rix 提交于
      clang static analysis flags this error
      
      fs/btrfs/ref-verify.c:290:3: warning: Potential leak of memory pointed to by 're' [unix.Malloc]
                      kfree(be);
                      ^~~~~
      
      The problem is in this block of code:
      
      	if (root_objectid) {
      		struct root_entry *exist_re;
      
      		exist_re = insert_root_entry(&exist->roots, re);
      		if (exist_re)
      			kfree(re);
      	}
      
      There is no 'else' block freeing when root_objectid is 0. Add the
      missing kfree to the else branch.
      
      Fixes: fd708b81 ("Btrfs: add a extent ref verify tool")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: NTom Rix <trix@redhat.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d60ba8de
    • D
      btrfs: prefetch chunk tree leaves at mount · d85327b1
      David Sterba 提交于
      The whole chunk tree is read at mount time so we can utilize readahead
      to get the tree blocks to memory before we read the items. The idea is
      from Robbie, but instead of updating search slot readahead, this patch
      implements the chunk tree readahead manually from nodes on level 1.
      
      We've decided to do specific readahead optimizations and then unify them
      under a common API so we don't break everything by changing the search
      slot readahead logic.
      
      Higher chunk trees grow on large filesystems (many terabytes), and
      prefetching just level 1 seems to be sufficient. Provided example was
      from a 200TiB filesystem with chunk tree level 2.
      
      CC: Robbie Ko <robbieko@synology.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d85327b1
    • J
      btrfs: add metadata_uuid to FS_INFO ioctl · 49bac897
      Johannes Thumshirn 提交于
      Add retrieval of the filesystem's metadata UUID to the fsinfo ioctl.
      This is driven by setting the BTRFS_FS_INFO_FLAG_METADATA_UUID flag in
      btrfs_ioctl_fs_info_args::flags.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      49bac897
    • J
      btrfs: add filesystem generation to FS_INFO ioctl · 0fb408a5
      Johannes Thumshirn 提交于
      Add retrieval of the filesystem's generation to the fsinfo ioctl. This is
      driven by setting the BTRFS_FS_INFO_FLAG_GENERATION flag in
      btrfs_ioctl_fs_info_args::flags.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0fb408a5
    • J
      btrfs: pass checksum type via BTRFS_IOC_FS_INFO ioctl · 137c5418
      Johannes Thumshirn 提交于
      With the recent addition of filesystem checksum types other than CRC32c,
      it is not anymore hard-coded which checksum type a btrfs filesystem uses.
      
      Up to now there is no good way to read the filesystem checksum, apart from
      reading the filesystem UUID and then query sysfs for the checksum type.
      
      Add a new csum_type and csum_size fields to the BTRFS_IOC_FS_INFO ioctl
      command which usually is used to query filesystem features. Also add a
      flags member indicating that the kernel responded with a set csum_type and
      csum_size field.
      
      For compatibility reasons, only return the csum_type and csum_size if
      the BTRFS_FS_INFO_FLAG_CSUM_INFO flag was passed to the kernel. Also
      clear any unknown flags so we don't pass false positives to user-space
      newer than the kernel.
      
      To simplify further additions to the ioctl, also switch the padding to a
      u8 array. Pahole was used to verify the result of this switch:
      
      The csum members are added before flags, which might look odd, but this
      is to keep the alignment requirements and not to introduce holes in the
      structure.
      
        $ pahole -C btrfs_ioctl_fs_info_args fs/btrfs/btrfs.ko
        struct btrfs_ioctl_fs_info_args {
      	  __u64                      max_id;               /*     0     8 */
      	  __u64                      num_devices;          /*     8     8 */
      	  __u8                       fsid[16];             /*    16    16 */
      	  __u32                      nodesize;             /*    32     4 */
      	  __u32                      sectorsize;           /*    36     4 */
      	  __u32                      clone_alignment;      /*    40     4 */
      	  __u16                      csum_type;            /*    44     2 */
      	  __u16                      csum_size;            /*    46     2 */
      	  __u64                      flags;                /*    48     8 */
      	  __u8                       reserved[968];        /*    56   968 */
      
      	  /* size: 1024, cachelines: 16, members: 10 */
        };
      
      Fixes: 3951e7f0 ("btrfs: add xxhash64 to checksumming algorithms")
      Fixes: 3831bf00 ("btrfs: add sha256 to checksumming algorithm")
      CC: stable@vger.kernel.org # 5.5+
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      137c5418
    • Q
      btrfs: qgroup: remove ASYNC_COMMIT mechanism in favor of reserve retry-after-EDQUOT · adca4d94
      Qu Wenruo 提交于
      commit a514d638 ("btrfs: qgroup: Commit transaction in advance to
      reduce early EDQUOT") tries to reduce the early EDQUOT problems by
      checking the qgroup free against threshold and tries to wake up commit
      kthread to free some space.
      
      The problem of that mechanism is, it can only free qgroup per-trans
      metadata space, can't do anything to data, nor prealloc qgroup space.
      
      Now since we have the ability to flush qgroup space, and implemented
      retry-after-EDQUOT behavior, such mechanism can be completely replaced.
      
      So this patch will cleanup such mechanism in favor of
      retry-after-EDQUOT.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      adca4d94
    • Q
      btrfs: qgroup: try to flush qgroup space when we get -EDQUOT · c53e9653
      Qu Wenruo 提交于
      [PROBLEM]
      There are known problem related to how btrfs handles qgroup reserved
      space.  One of the most obvious case is the the test case btrfs/153,
      which do fallocate, then write into the preallocated range.
      
        btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
            --- tests/btrfs/153.out     2019-10-22 15:18:14.068965341 +0800
            +++ xfstests-dev/results//btrfs/153.out.bad      2020-07-01 20:24:40.730000089 +0800
            @@ -1,2 +1,5 @@
             QA output created by 153
            +pwrite: Disk quota exceeded
            +/mnt/scratch/testfile2: Disk quota exceeded
            +/mnt/scratch/testfile2: Disk quota exceeded
             Silence is golden
            ...
            (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad'  to see the entire diff)
      
      [CAUSE]
      Since commit c6887cd1 ("Btrfs: don't do nocow check unless we have to"),
      we always reserve space no matter if it's COW or not.
      
      Such behavior change is mostly for performance, and reverting it is not
      a good idea anyway.
      
      For preallcoated extent, we reserve qgroup data space for it already,
      and since we also reserve data space for qgroup at buffered write time,
      it needs twice the space for us to write into preallocated space.
      
      This leads to the -EDQUOT in buffered write routine.
      
      And we can't follow the same solution, unlike data/meta space check,
      qgroup reserved space is shared between data/metadata.
      The EDQUOT can happen at the metadata reservation, so doing NODATACOW
      check after qgroup reservation failure is not a solution.
      
      [FIX]
      To solve the problem, we don't return -EDQUOT directly, but every time
      we got a -EDQUOT, we try to flush qgroup space:
      
      - Flush all inodes of the root
        NODATACOW writes will free the qgroup reserved at run_dealloc_range().
        However we don't have the infrastructure to only flush NODATACOW
        inodes, here we flush all inodes anyway.
      
      - Wait for ordered extents
        This would convert the preallocated metadata space into per-trans
        metadata, which can be freed in later transaction commit.
      
      - Commit transaction
        This will free all per-trans metadata space.
      
      Also we don't want to trigger flush multiple times, so here we introduce
      a per-root wait list and a new root status, to ensure only one thread
      starts the flushing.
      
      Fixes: c6887cd1 ("Btrfs: don't do nocow check unless we have to")
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c53e9653
    • Q
      btrfs: qgroup: allow to unreserve range without releasing other ranges · 263da812
      Qu Wenruo 提交于
      [PROBLEM]
      Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
      reserved space of the changeset.
      
      For example:
      	ret = btrfs_qgroup_reserve_data(inode, changeset, 0, SZ_1M);
      	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_1M, SZ_1M);
      	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_2M, SZ_1M);
      
      If the last btrfs_qgroup_reserve_data() failed, it will release the
      entire [0, 3M) range.
      
      This behavior is kind of OK for now, as when we hit -EDQUOT, we normally
      go error handling and need to release all reserved ranges anyway.
      
      But this also means the following call is not possible:
      
      	ret = btrfs_qgroup_reserve_data();
      	if (ret == -EDQUOT) {
      		/* Do something to free some qgroup space */
      		ret = btrfs_qgroup_reserve_data();
      	}
      
      As if the first btrfs_qgroup_reserve_data() fails, it will free all
      reserved qgroup space.
      
      [CAUSE]
      This is because we release all reserved ranges when
      btrfs_qgroup_reserve_data() fails.
      
      [FIX]
      This patch will implement a new function, qgroup_unreserve_range(), to
      iterate through the ulist nodes, to find any nodes in the failure range,
      and remove the EXTENT_QGROUP_RESERVED bits from the io_tree, and
      decrease the extent_changeset::bytes_changed, so that we can revert to
      previous state.
      
      This allows later patches to retry btrfs_qgroup_reserve_data() if EDQUOT
      happens.
      Suggested-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      263da812
    • J
      btrfs: convert block group refcount to refcount_t · 48aaeebe
      Josef Bacik 提交于
      We have refcount_t now with the associated library to handle refcounts,
      which gives us extra debugging around reference count mistakes that may
      be made.  For example it'll warn on any transition from 0->1 or 0->-1,
      which is handy for noticing cases where we've messed up reference
      counting.  Convert the block group ref counting from an atomic_t to
      refcount_t and use the appropriate helpers.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      48aaeebe
    • M
      btrfs: add multi-statement protection to btrfs_set/clear_and_info macros · 60f8667b
      Marcos Paulo de Souza 提交于
      Multi-statement macros should be enclosed in do/while(0) block to make
      their use safe in single statement if conditions. All current uses of
      the macros are safe, so this change is for future protection.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      60f8667b
    • N
      93c4c033
    • N
    • N
      btrfs: raid56: use in_range where applicable · 83025863
      Nikolay Borisov 提交于
      While at it use the opportunity to simplify find_logical_bio_stripe by
      reducing the scope of 'stripe_start' variable and squash the
      sector-to-bytes conversion on one line.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      83025863
    • N
      btrfs: raid56: assign bio in while() when using bio_list_pop · bf28a605
      Nikolay Borisov 提交于
      Unify the style in the file such that return value of bio_list_pop is
      assigned directly in the while loop. This is in line with the rest of
      the kernel.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bf28a605
    • N
      btrfs: raid56: remove redundant device check in rbio_add_io_page · f90ae76a
      Nikolay Borisov 提交于
      The merging logic is always executed if the current stripe's device
      is not missing. So there's no point in duplicating the check. Simply
      remove it, while at it reduce the scope of the 'last_end' variable.
      If the current stripe's device is missing we fail the stripe early on.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f90ae76a
    • N
      btrfs: always initialize btrfs_bio::tgtdev_map/raid_map pointers · 608769a4
      Nikolay Borisov 提交于
      Since btrfs_bio always contains the extra space for the tgtdev_map and
      raid_maps it's pointless to make the assignment iff specific conditions
      are met.
      
      Instead, always assign the pointers to their correct value at allocation
      time. To accommodate this change also move code a bit in
      __btrfs_map_block so that btrfs_bio::stripes array is always initialized
      before the raid_map, subsequently move the call to sort_parity_stripes
      in the 'if' building the raid_map, retaining the old behavior.
      
      To better understand the change, there are 2 aspects to this:
      
      1. The original code is harder to grasp because the calculations for
         initializing raid_map/tgtdev ponters are apart from the initial
         allocation of memory. Having them predicated on 2 separate checks
         doesn't help that either... So by moving the initialisation in
         alloc_btrfs_bio puts everything together.
      
      2. tgtdev/raid_maps are now always initialized despite sometimes they
         might be equal i.e __btrfs_map_block_for_discard calls
         alloc_btrfs_bio with tgtdev = 0 but their usage should be predicated
         on external checks i.e. just because those pointers are non-null
         doesn't mean they are valid per-se. And actually while taking another
         look at __btrfs_map_block I saw a discrepancy:
      
         Original code initialised tgtdev_map if the following check is true:
      
      	   if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL)
      
         However, further down tgtdev_map is only used if the following check
         is true:
      
      	if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL && need_full_stripe(op))
      
        e.g. the additional need_full_stripe(op) predicate is there.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ copy more details from mail discussion ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      608769a4
    • N
      btrfs: sysfs: add bdi link to the fsid directory · 3092c68f
      Nikolay Borisov 提交于
      Since BTRFS uses a private bdi it makes sense to create a link to this
      bdi under /sys/fs/btrfs/<UUID>/bdi. This allows size of read ahead to
      be controlled. Without this patch it's not possible to uniquely identify
      which bdi pertains to which btrfs filesystem in the case of multiple
      btrfs filesystems.
      
      It's fine to simply call sysfs_remove_link without checking if the
      link indeed has been created. The call path
      
      sysfs_remove_link
       kernfs_remove_by_name
        kernfs_remove_by_name_ns
      
      will simply return -ENOENT in case it doesn't exist.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3092c68f