1. 07 Oct 2020, 5 commits
    • btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations · e85fde51
      Qu Wenruo authored
      [BUG]
      When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:
      
        generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)
      
      And with the following metadata leak:
      
        BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
        Call Trace:
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         __prepare_exit_to_usermode+0x1bc/0x1c0
         __syscall_return_slowpath+0x47/0x230
         do_syscall_64+0x64/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace a6cfd45ba80e4e06 ]---
        BTRFS error (device dm-3): qgroup reserved space leaked
        BTRFS info (device dm-3): disk space caching is enabled
        BTRFS info (device dm-3): has skinny extents
      
      [CAUSE]
      The qgroup preallocated meta rsv operations of that offending root are:
      
        btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
        btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
        btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
        btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
        btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
      
      We reserve qgroup meta rsv in btrfs_subvolume_reserve_metadata(), but
      there are no corresponding release/convert calls in
      btrfs_subvolume_release_metadata().
      
      This leads to the leakage.
      
      [FIX]
      To fix this bug, we should follow what we're doing in
      btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and
      add it to block_rsv->qgroup_rsv_reserved.
      
      And free the qgroup reserved metadata space when releasing the
      block_rsv.
      
      To do this, we need to change the btrfs_subvolume_release_metadata() to
      accept btrfs_root, and record the qgroup_to_release number, and call
      btrfs_qgroup_convert_reserved_meta() for it.
      
      Fixes: 733e03a0 ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch to iomap for direct IO · f85781fb
      Goldwyn Rodrigues authored
      We're using a direct I/O implementation based on buffer heads. This
      patch switches to the new iomap infrastructure.
      
      Switch from __blockdev_direct_IO() to iomap_dio_rw().  Rename
      btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it as
      iomap_begin() for iomap direct I/O functions. This function allocates
      and locks all the blocks required for the I/O.  btrfs_submit_direct() is
      used as the submit_io() hook for direct I/O ops.
      
      Since we need direct I/O reads to go through iomap_dio_rw(), we change
      file_operations.read_iter() to a btrfs_file_read_iter() which calls
      btrfs_direct_IO() for direct reads and falls back to
      generic_file_buffered_read() for incomplete reads and buffered reads.
      
      We don't need address_space.direct_IO() anymore: set it to noop.
      
      Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
      capable of direct I/O reads from a hole, so we don't need to return
      -ENOENT.
      
      Btrfs direct I/O is now done under i_rwsem, shared in case of reads and
      exclusive in case of writes. This guards against simultaneous truncates.
      
      Use iomap->iomap_end() to check for failed or incomplete direct I/O:
      
        - for writes, call __endio_write_update_ordered()
        - for reads, unlock extents
      
      btrfs_dio_data is now hooked in iomap->private and not
      current->journal_info. It carries the reservation variable and the
      amount of data submitted, so we can calculate the amount of data to call
      __endio_write_update_ordered in case of an error.
      
      This patch removes last use of struct buffer_head from btrfs.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do async reclaim for data reservations · 57056740
      Josef Bacik authored
      Now that we have the data ticketing stuff in place, move normal data
      reservations to use an async reclaim helper to satisfy tickets.  Before
      we could have multiple tasks race in and both allocate chunks, resulting
      in more data chunks than we would necessarily need.  Serializing these
      allocations and making a single thread responsible for flushing will
      only allocate chunks as needed, as well as cut down on transaction
      commits and other flush related activities.
      
      Priority reservations will still work as they have before, simply
      trying to allocate a chunk until they can make their reservation.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add flushing states for handling data reservations · 058e6d1d
      Josef Bacik authored
      Currently the way we do data reservations is by seeing if we have enough
      space in our space_info.  If we do not and we're a normal inode we'll
      
      1) Attempt to force a chunk allocation until we can't anymore.
      2) If that fails we'll flush delalloc, then commit the transaction, then
         run the delayed iputs.
      
      If we are a free space inode we're only allowed to force a chunk
      allocation.  In order to use the normal flushing mechanism we need to
      encode this into a flush state array for normal inodes.  Since both will
      start with allocating chunks until the space info is full there is no
      need to add this as a flush state, this will be handled specially.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: change nr to u64 in btrfs_start_delalloc_roots · b4912139
      Josef Bacik authored
      We have btrfs_wait_ordered_roots() which takes a u64 for nr, but
      btrfs_start_delalloc_roots() that takes an int for nr, which makes using
      them in conjunction, especially for something like (u64)-1, annoying and
      inconsistent.  Fix btrfs_start_delalloc_roots() to take a u64 for nr and
      adjust start_delalloc_inodes() and its callers appropriately.
      
      This means we've adjusted start_delalloc_inodes() to take a pointer to
      nr, since we want to preserve the ability for start_delalloc_inodes() to
      return an error, so simply make it do the nr adjusting as necessary.
      
      Part of adjusting the callers to this means changing
      btrfs_writeback_inodes_sb_nr() to take a u64 for items.  This may be
      confusing because it seems unrelated, but the caller of
      btrfs_writeback_inodes_sb_nr() already passes in a u64, it's just the
      function variable that needs to be changed.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 21 Aug 2020, 1 commit
    • btrfs: detect nocow for swap after snapshot delete · a84d5d42
      Boris Burkov authored
      can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for
      detecting a must cow condition which is not exactly accurate, but saves
      unnecessary tree traversal. The incorrect assumption is that if the
      extent was created in a generation smaller than the last snapshot
      generation, it must be referenced by that snapshot. That is true, except
      the snapshot could have since been deleted, without affecting the last
      snapshot generation.
      
      The original patch claimed a performance win from this check, but it
      also leads to a bug where you are unable to use a swapfile if you ever
      snapshotted the subvolume it's in. Make the check slower and more strict
      for the swapon case, without modifying the general cow checks as a
      compromise. Turning swap on does not seem to be a particularly
      performance sensitive operation, so incurring a possibly unnecessary
      btrfs_search_slot seems worthwhile for the added usability.
      
      Note: Until the snapshot is completely cleaned after deletion,
      check_committed_refs will still cause the logic to think that cow is
      necessary, so the user must wait until 'btrfs subvolume sync' has
      finished before activating the swapfile with swapon.
      
      CC: stable@vger.kernel.org # 5.4+
      Suggested-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 20 Aug 2020, 1 commit
  4. 27 Jul 2020, 19 commits
    • btrfs: don't WARN if we abort a transaction with EROFS · f95ebdbe
      Josef Bacik authored
      If we got some sort of corruption via a read and call
      btrfs_handle_fs_error() we'll set BTRFS_FS_STATE_ERROR on the fs and
      complain.  If a subsequent trans handle trips over this it'll get EROFS
      and then abort.  However at that point we're not aborting for the
      original reason, we're aborting because we've been flipped read only.
      We do not need to WARN_ON() here.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add comments for btrfs_reserve_flush_enum · fd7fb634
      Qu Wenruo authored
      This enum is the interface exposed to developers.
      
      Although we have a detailed comment explaining the whole idea of space
      flushing at the beginning of space-info.c, the exposed enum interface
      doesn't have any comment.
      
      Some corner cases, like the fact that BTRFS_RESERVE_FLUSH_ALL and
      BTRFS_RESERVE_FLUSH_ALL_STEAL can be interrupted by fatal signals, are
      not explained at all.
      
      So add some simple comments for these enums as a quick reference.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: remove ASYNC_COMMIT mechanism in favor of reserve retry-after-EDQUOT · adca4d94
      Qu Wenruo authored
      commit a514d638 ("btrfs: qgroup: Commit transaction in advance to
      reduce early EDQUOT") tries to reduce the early EDQUOT problems by
      checking the qgroup free against threshold and tries to wake up commit
      kthread to free some space.
      
      The problem with that mechanism is that it can only free qgroup
      per-trans metadata space; it can't do anything for data, nor for
      prealloc qgroup space.
      
      Now since we have the ability to flush qgroup space, and implemented
      retry-after-EDQUOT behavior, such mechanism can be completely replaced.
      
      So this patch cleans up that mechanism in favor of
      retry-after-EDQUOT.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: try to flush qgroup space when we get -EDQUOT · c53e9653
      Qu Wenruo authored
      [PROBLEM]
      There are known problems related to how btrfs handles qgroup reserved
      space.  One of the most obvious cases is the test case btrfs/153,
      which does fallocate, then writes into the preallocated range.
      
        btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
            --- tests/btrfs/153.out     2019-10-22 15:18:14.068965341 +0800
            +++ xfstests-dev/results//btrfs/153.out.bad      2020-07-01 20:24:40.730000089 +0800
            @@ -1,2 +1,5 @@
             QA output created by 153
            +pwrite: Disk quota exceeded
            +/mnt/scratch/testfile2: Disk quota exceeded
            +/mnt/scratch/testfile2: Disk quota exceeded
             Silence is golden
            ...
            (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad'  to see the entire diff)
      
      [CAUSE]
      Since commit c6887cd1 ("Btrfs: don't do nocow check unless we have to"),
      we always reserve space no matter if it's COW or not.
      
      Such behavior change is mostly for performance, and reverting it is not
      a good idea anyway.
      
      For a preallocated extent, we have already reserved qgroup data space
      for it, and since we also reserve qgroup data space at buffered write
      time, writing into the preallocated range needs twice the space.
      
      This leads to the -EDQUOT in buffered write routine.
      
      And we can't follow the same solution: unlike the data/meta space
      check, qgroup reserved space is shared between data and metadata.
      The EDQUOT can happen at the metadata reservation, so doing a NODATACOW
      check after qgroup reservation failure is not a solution.
      
      [FIX]
      To solve the problem, we don't return -EDQUOT directly; instead, every
      time we get a -EDQUOT, we try to flush qgroup space:
      
      - Flush all inodes of the root
        NODATACOW writes will free the qgroup reserved space at
        run_delalloc_range().
        However we don't have the infrastructure to only flush NODATACOW
        inodes, here we flush all inodes anyway.
      
      - Wait for ordered extents
        This would convert the preallocated metadata space into per-trans
        metadata, which can be freed in later transaction commit.
      
      - Commit transaction
        This will free all per-trans metadata space.
      
      Also we don't want to trigger flush multiple times, so here we introduce
      a per-root wait list and a new root status, to ensure only one thread
      starts the flushing.
      
      Fixes: c6887cd1 ("Btrfs: don't do nocow check unless we have to")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add multi-statement protection to btrfs_set/clear_and_info macros · 60f8667b
      Marcos Paulo de Souza authored
      Multi-statement macros should be enclosed in do/while(0) block to make
      their use safe in single statement if conditions. All current uses of
      the macros are safe, so this change is for future protection.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove no longer needed use of log_writers for the log root tree · a93e0168
      Filipe Manana authored
      When syncing the log, we used to update the log root tree without
      holding either the log_mutex of the subvolume root or the log_mutex of
      the log root tree.
      
      We used to have two critical sections delimited by the log_mutex of the
      log root tree, so in the first one we incremented the log_writers of the
      log root tree and on the second one we decremented it and waited for the
      log_writers counter to go down to zero. This was because the update of
      the log root tree happened between the two critical sections.
      
      The use of two critical sections allowed a little bit more of parallelism
      and required the use of the log_writers counter, necessary to make sure
      we didn't miss any log root tree update when we have multiple tasks trying
      to sync the log in parallel.
      
      However after commit 06989c79 ("Btrfs: fix race updating log root
      item during fsync") the log root tree update was moved into a critical
      section delimited by the subvolume's log_mutex. Later another commit
      moved the log tree update from that critical section into the second
      critical section delimited by the log_mutex of the log root tree. Both
      commits addressed different bugs.
      
      The end result is that the first critical section delimited by the
      log_mutex of the log root tree became pointless, since there's nothing
      done between it and the second critical section, we just have an unlock
      of the log_mutex followed by a lock operation. This means we can merge
      both critical sections, as the first one does almost nothing now, and we
      can stop using the log_writers counter of the log root tree, which was
      incremented in the first critical section and decremented in the second
      critical section, used to make sure no one in the second critical section
      started writeback of the log root tree before some other task updated it.
      
      So just remove the mutex_unlock() followed by mutex_lock() of the log root
      tree, as well as the use of the log_writers counter for the log root tree.
      
      This patch is part of a series that has the following patches:
      
      1/4 btrfs: only commit the delayed inode when doing a full fsync
      2/4 btrfs: only commit delayed items at fsync if we are logging a directory
      3/4 btrfs: stop incrementing log_batch for the log root tree when syncing log
      4/4 btrfs: remove no longer needed use of log_writers for the log root tree
      
      After the entire patchset applied I saw about 12% decrease on max latency
      reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
      ram, using kvm and using a raw NVMe device directly (no intermediary fs on
      the host). The test was invoked like the following:
      
        mkfs.btrfs -f /dev/sdk
        mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
        dbench -D /mnt/sdk -t 300 8
        umount /mnt/sdk
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: stop incrementing log_batch for the log root tree when syncing log · 28a95795
      Filipe Manana authored
      We are incrementing the log_batch atomic counter of the log root tree
      but we never use that counter; it's used only for the log trees of subvolume
      roots. We started doing it when we moved the log_batch and log_write
      counters from the global, per fs, btrfs_fs_info structure, into the
      btrfs_root structure in commit 7237f183 ("Btrfs: fix tree logs
      parallel sync").
      
      So just stop doing it for the log root tree and add a comment to the
      field declaration to note that it's used only for log trees of subvolume
      roots.
      
      This patch is part of a series that has the following patches:
      
      1/4 btrfs: only commit the delayed inode when doing a full fsync
      2/4 btrfs: only commit delayed items at fsync if we are logging a directory
      3/4 btrfs: stop incrementing log_batch for the log root tree when syncing log
      4/4 btrfs: remove no longer needed use of log_writers for the log root tree
      
      After the entire patchset applied I saw about 12% decrease on max latency
      reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
      ram, using kvm and using a raw NVMe device directly (no intermediary fs on
      the host). The test was invoked like the following:
      
        mkfs.btrfs -f /dev/sdk
        mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
        dbench -D /mnt/sdk -t 300 8
        umount /mnt/sdk
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: export qgroups in sysfs · 49e5fb46
      Qu Wenruo authored
      This patch will add the following sysfs interface:
      
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/referenced
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/exclusive
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_referenced
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_exclusive
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/limit_flags
      
      These are also available in the output of "btrfs qgroup show".
      
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_data
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_pertrans
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_prealloc
      
      The last 3 rsv related members are not visible to users, but can be very
      useful to debug qgroup limit related bugs.
      
      Also, to avoid '/' used in <qgroup_id>, the separator between qgroup
      level and qgroup id is changed to '_'.
      
      The interface is not hidden behind 'debug' as we want this interface to
      be included into production build and to provide another way to read the
      qgroup information besides the ioctls.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_dirty_pages take btrfs_inode · 088545f6
      Nikolay Borisov authored
      There is a single use of the generic vfs_inode, so let's take btrfs_inode
      as a parameter and remove a couple of redundant BTRFS_I() calls.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_set_extent_delalloc take btrfs_inode · c2566f22
      Nikolay Borisov authored
      Preparation to make btrfs_dirty_pages take btrfs_inode as parameter.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_run_delalloc_range take btrfs_inode · 98456b9c
      Nikolay Borisov authored
      All children now take btrfs_inode so convert it to taking it as a
      parameter as well.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: refactor btrfs_check_can_nocow() into two variants · 38d37aa9
      Qu Wenruo authored
      The function btrfs_check_can_nocow() now has two completely different
      call patterns.
      
      For the nowait variant, callers don't need to do any cleanup.  For the
      wait variant, callers need to release the lock if they can do a nocow
      write.
      
      This is somewhat confusing, and is already a problem for the exported
      btrfs_check_can_nocow().
      
      So this patch separates the different patterns into different
      functions.
      For the nowait variant, the function will be called check_nocow_nolock().
      For the wait variant, the function pair will be btrfs_check_nocow_lock()
      and btrfs_check_nocow_unlock().
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation · 6d4572a9
      Qu Wenruo authored
      [BUG]
      When data space is exhausted, even if the inode has the NOCOW
      attribute, we will still refuse to truncate an unaligned range due to
      ENOSPC.
      
      The following script can reproduce it pretty easily:
        #!/bin/bash
      
        dev=/dev/test/test
        mnt=/mnt/btrfs
      
        umount $dev &> /dev/null
        umount $mnt &> /dev/null
      
        mkfs.btrfs -f $dev -b 1G
        mount -o nospace_cache $dev $mnt
        touch $mnt/foobar
        chattr +C $mnt/foobar
      
        xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null
        xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null
        sync
      
        xfs_io -c "fpunch 0 2k" $mnt/foobar
        umount $mnt
      
      Currently this will fail at the fpunch part.
      
      [CAUSE]
      Because btrfs_truncate_block() always reserves space without checking
      the NOCOW attribute.
      
      Since the writeback path follows the NOCOW bit, we only need to deal
      with the space reservation code in btrfs_truncate_block().
      
      [FIX]
      Make btrfs_truncate_block() follow btrfs_buffered_write() to try to
      reserve data space first, and fall back to NOCOW check only when we
      don't have enough space.
      
      This try-reserve-first approach is an optimization introduced in
      btrfs_buffered_write(), to avoid the expensive btrfs_check_can_nocow()
      call.
      
      This patch will export check_can_nocow() as btrfs_check_can_nocow(), and
      use it in btrfs_truncate_block() to fix the problem.
      Reported-by: Martin Doucha <martin.doucha@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused btrfs_root::defrag_trans_start · a2570ef3
      David Sterba authored
      Last touched in 2013 by commit de78b51a ("btrfs: remove cache only
      arguments from defrag path") that was the only code that used the value.
      Now it's only set but never used for anything, so we can remove it.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make __btrfs_drop_extents take btrfs_inode · 906c448c
      Nikolay Borisov authored
      It has only 4 uses of a vfs_inode, all for inode_sub_bytes(), but this
      unifies the interface with the non-__-prefixed version and will also
      make converting its callers to btrfs_inode easier.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_csum_one_bio take btrfs_inode · bd242a08
      Nikolay Borisov authored
      Will enable converting btrfs_submit_compressed_write to btrfs_inode more
      easily.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_reloc_clone_csums take btrfs_inode · 7bfa9535
      Nikolay Borisov authored
      It really wants btrfs_inode and not a vfs inode.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add little-endian optimized key helpers · ce6ef5ab
      David Sterba authored
      The CPU and on-disk keys are mapped to two different structures because
      of the endianness. There's an intermediate buffer used to do the
      conversion, but this is not necessary when CPU and on-disk endianness
      match.
      
      Add optimized versions of helpers that take disk_key and use the buffer
      directly for CPU keys or drop the intermediate buffer and conversion.
      
      This saves a lot of stack space across many functions and removes about
      6K of generated binary code:
      
         text    data     bss     dec     hex filename
      1090439   17468   14912 1122819  112203 pre/btrfs.ko
      1084613   17456   14912 1116981  110b35 post/btrfs.ko
      
      Delta: -5826
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: inode: refactor the parameters of insert_reserved_file_extent() · 203f44c5
      Qu Wenruo authored
      Function insert_reserved_file_extent() takes a long list of parameters,
      which are all for btrfs_file_extent_item, even including two reserved
      members, encryption and other_encoding.
      
      This makes the parameter list unnecessary long for a function which only
      gets called twice.
      
      This patch will refactor the parameter list, by using
      btrfs_file_extent_item as parameter directly to hugely reduce the number
      of parameters.
      
      There are only two callers: one in btrfs_finish_ordered_io(), which
      inserts the file extent for an ordered extent, and one in
      __btrfs_prealloc_file_range().
      
      These two call sites have completely different contexts: the ordered
      extent can be compressed but will always be a regular extent, while the
      preallocated one is never compressed and always has the PREALLOC type.
      
      So use two small wrappers for these two different call sites to improve
      readability.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 17 Jun 2020, 1 commit
    • btrfs: check if a log root exists before locking the log_mutex on unlink · e7a79811
      Filipe Manana authored
      This brings back an optimization that commit e678934c ("btrfs:
      Remove unnecessary check from join_running_log_trans") removed, but in
      a different form. So it's almost equivalent to a revert.
      
      That commit removed an optimization where we avoid locking a root's
      log_mutex when there is no log tree created in the current transaction.
      The affected code path is triggered through unlink operations.
      
      That commit was based on the assumption that the optimization was not
      necessary because we used to have the following checks when the patch
      was authored:
      
        int btrfs_del_dir_entries_in_log(...)
        {
              (...)
              if (dir->logged_trans < trans->transid)
                  return 0;
      
              ret = join_running_log_trans(root);
              (...)
         }
      
         int btrfs_del_inode_ref_in_log(...)
         {
              (...)
              if (inode->logged_trans < trans->transid)
                  return 0;
      
              ret = join_running_log_trans(root);
              (...)
         }
      
      However before that patch was merged, another patch was merged first which
      replaced those checks because they were buggy.
      
      That other patch corresponds to commit 803f0f64 ("Btrfs: fix fsync
      not persisting dentry deletions due to inode evictions"). The assumption
      that if the logged_trans field of an inode had a smaller value than the
      current transaction's generation (transid) meant that the inode was not
      logged in the current transaction was only correct if the inode was not
      evicted and reloaded in the current transaction. So the corresponding bug
      fix changed those checks and replaced them with the following helper
      function:
      
        static bool inode_logged(struct btrfs_trans_handle *trans,
                                 struct btrfs_inode *inode)
        {
              if (inode->logged_trans == trans->transid)
                      return true;
      
              if (inode->last_trans == trans->transid &&
                  test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) &&
                  !test_bit(BTRFS_FS_LOG_RECOVERING, &trans->fs_info->flags))
                      return true;
      
              return false;
        }
      
      So if we have a subvolume without a log tree in the current transaction
      (because we had no fsyncs), every time we unlink an inode we can end up
      trying to lock the log_mutex of the root through join_running_log_trans()
      twice, once for the inode being unlinked (by btrfs_del_inode_ref_in_log())
      and once for the parent directory (with btrfs_del_dir_entries_in_log()).
      
      This means if we have several unlink operations happening in parallel for
      inodes in the same subvolume, and those inodes and/or their parent
      inode were changed in the current transaction, we end up having a lot of
      contention on the log_mutex.
      
      The test robots from Intel reported a -30.7% performance regression for
      a REAIM test after commit e678934c ("btrfs: Remove unnecessary check
      from join_running_log_trans").
      
      So just bring back the optimization to join_running_log_trans() where we
      check first if a log root exists before trying to lock the log_mutex. This
      is done by checking for a bit that is set on the root when a log tree is
      created and removed when a log tree is freed (at transaction commit time).
      
      Commit e678934c ("btrfs: Remove unnecessary check from
      join_running_log_trans") was merged in the 5.4 merge window while commit
      803f0f64 ("Btrfs: fix fsync not persisting dentry deletions due to
      inode evictions") was merged in the 5.3 merge window. But the first
      commit was actually authored before the second commit (May 23 2019 vs
      June 19 2019).
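      The restored optimization can be sketched as a user-space analogue (illustrative names, not the kernel code): test a "log tree exists" bit lock-free first, and only take the mutex when the bit is set, re-checking under the lock since the log tree may be freed at transaction commit.

```c
#include <pthread.h>
#include <stdatomic.h>

struct root {
    atomic_ulong state;          /* bit 0: a log tree exists */
    pthread_mutex_t log_mutex;
};
#define ROOT_HAS_LOG_TREE 0x1UL

void root_init(struct root *r)
{
    atomic_init(&r->state, 0);
    pthread_mutex_init(&r->log_mutex, NULL);
}

/* set when a log tree is created, cleared when it is freed */
void root_set_log(struct root *r)
{
    atomic_fetch_or(&r->state, ROOT_HAS_LOG_TREE);
}

/* returns 0 if we joined a running log transaction, -1 if none exists */
int join_running_log_trans(struct root *root)
{
    /* fast path: no log tree in this transaction, skip the mutex entirely */
    if (!(atomic_load(&root->state) & ROOT_HAS_LOG_TREE))
        return -1;

    pthread_mutex_lock(&root->log_mutex);
    /* re-check under the lock: the log tree may have been freed meanwhile */
    if (!(atomic_load(&root->state) & ROOT_HAS_LOG_TREE)) {
        pthread_mutex_unlock(&root->log_mutex);
        return -1;
    }
    /* ... join the running log transaction here ... */
    pthread_mutex_unlock(&root->log_mutex);
    return 0;
}
```

      With no fsyncs in the current transaction the bit is clear, so parallel unlinks never touch log_mutex and the contention disappears.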
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Link: https://lore.kernel.org/lkml/20200611090233.GL12456@shao2-debian/
      Fixes: e678934c ("btrfs: Remove unnecessary check from join_running_log_trans")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e7a79811
  6. 14 June 2020, 1 commit
    • D
      Revert "btrfs: switch to iomap_dio_rw() for dio" · 55e20bd1
      Committed by David Sterba
      This reverts commit a43a67a2.
      
      This patch reverts the main part of switching the direct io
      implementation to the iomap infrastructure. There's a problem with page
      invalidation, a regression that couldn't be solved in this development
      cycle.
      
      The problem occurs when buffered and direct io are mixed, and the ranges
      overlap. Although this is not recommended, filesystems implement
      measures or fallbacks to make it somehow work. In this case, fallback to
      buffered IO would be an option for btrfs (this already happens when
      direct io is done on compressed data), but the change would be needed in
      the iomap code, bringing new semantics to other filesystems.
      
      Another problem arises when again the buffered and direct ios are mixed,
      invalidation fails, then -EIO is set on the mapping and fsync will fail,
      though there's no real error.
      
      There have been discussions how to fix that, but revert seems to be the
      least intrusive option.
      
      Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/
      Signed-off-by: David Sterba <dsterba@suse.com>
      55e20bd1
  7. 10 June 2020, 1 commit
  8. 28 May 2020, 2 commits
    • C
      btrfs: split btrfs_direct_IO to read and write part · d8f3e735
      Committed by Christoph Hellwig
      The read and write versions don't have anything in common except for the
      call to iomap_dio_rw.  So split this function, and merge each half into
      its only caller.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d8f3e735
    • G
      btrfs: switch to iomap_dio_rw() for dio · a43a67a2
      Committed by Goldwyn Rodrigues
      Switch from __blockdev_direct_IO() to iomap_dio_rw().
      Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
      as iomap_begin() for iomap direct I/O functions. This function
      allocates and locks all the blocks required for the I/O.
      btrfs_submit_direct() is used as the submit_io() hook for direct I/O
      ops.
      
      Since we need direct I/O reads to go through iomap_dio_rw(), we change
      file_operations.read_iter() to a btrfs_file_read_iter() which calls
      btrfs_direct_IO() for direct reads and falls back to
      generic_file_buffered_read() for incomplete reads and buffered reads.
      
      We don't need address_space.direct_IO() anymore so set it to noop.
      Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
      capable of direct I/O reads from a hole, so we don't need to return
      -ENOENT.
      
      BTRFS direct I/O is now done under i_rwsem, shared in case of reads and
      exclusive in case of writes. This guards against simultaneous truncates.
      
      Use iomap->iomap_end() to check for failed or incomplete direct I/O:
       - for writes, call __endio_write_update_ordered()
       - for reads, unlock extents
      
      btrfs_dio_data is now hooked in iomap->private and not
      current->journal_info. It carries the reservation variable and the
      amount of data submitted, so we can calculate the amount of data to pass
      to __endio_write_update_ordered in case of an error.
      
      This patch removes last use of struct buffer_head from btrfs.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a43a67a2
  9. 25 May 2020, 9 commits
    • F
      btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents · e289f03e
      Committed by Filipe Manana
      When we have extents shared amongst different inodes in the same subvolume,
      if we fsync them in parallel we can end up with checksum items in the log
      tree that represent ranges which overlap.
      
      For example, consider we have inodes A and B, both sharing an extent that
      covers the logical range from X to X + 64KiB:
      
      1) Task A starts an fsync on inode A;
      
      2) Task B starts an fsync on inode B;
      
      3) Task A calls btrfs_csum_file_blocks(), and the first search in the
         log tree, through btrfs_lookup_csum(), returns -EFBIG because it
         finds an existing checksum item that covers the range from X - 64KiB
         to X;
      
      4) Task A checks that the checksum item has not reached the maximum
         possible size (MAX_CSUM_ITEMS) and then releases the search path
         before it does another path search for insertion (through a direct
         call to btrfs_search_slot());
      
      5) As soon as task A releases the path and before it does the search
         for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
         too, because there is an existing checksum item that has an end
         offset that matches the start offset (X) of the checksum range we want
         to log;
      
      6) Task B releases the path;
      
      7) Task A does the path search for insertion (through btrfs_search_slot())
         and then verifies that the checksum item that ends at offset X still
         exists and extends its size to insert the checksums for the range from
         X to X + 64KiB;
      
      8) Task A releases the path and returns from btrfs_csum_file_blocks(),
         having inserted the checksums into an existing checksum item that got
         its size extended. At this point we have one checksum item in the log
         tree that covers the logical range from X - 64KiB to X + 64KiB;
      
      9) Task B now does a search for insertion using btrfs_search_slot() too,
         but it finds that the previous checksum item no longer ends at the
         offset X, it now ends at offset X + 64KiB, so it leaves that item
         untouched.
      
         Then it releases the path and calls btrfs_insert_empty_item()
         that inserts a checksum item with a key offset corresponding to X and
         a size for inserting a single checksum (4 bytes in case of crc32c).
         Subsequent iterations end up extending this new checksum item so that
         it contains the checksums for the range from X to X + 64KiB.
      
         So after task B returns from btrfs_csum_file_blocks() we end up with
         two checksum items in the log tree that have overlapping ranges, one
         for the range from X - 64KiB to X + 64KiB, and another for the range
         from X to X + 64KiB.
      
      Having checksum items that represent ranges which overlap, regardless of
      being in the log tree or in the checksums tree, can lead to problems where
      checksums for a file range end up not being found. This type of problem
      has happened a few times in the past and the following commits fixed them
      and explain in detail why having checksum items with overlapping ranges is
      problematic:
      
        27b9a812 "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
        b84b8390 "Btrfs: fix file read corruption after extent cloning and fsync"
        40e046ac "Btrfs: fix missing data checksums after replaying a log tree"
      
      Since this specific instance of the problem can only happen when logging
      inodes, because it is the only case where concurrent attempts to insert
      checksums for the same range can happen, fix the issue by using an extent
      io tree as a range lock to serialize checksum insertion during inode
      logging.
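      The serialization idea can be sketched as a range lock in user-space C. This is an illustration only: btrfs uses an extent io tree for this, while the sketch below keeps held ranges in a small array and exposes a try-lock (a real implementation would sleep and retry on conflict instead of returning busy).

```c
#include <pthread.h>

#define MAX_LOCKED 16
struct range { unsigned long long start, end; int used; };
static struct range locked[MAX_LOCKED];
static pthread_mutex_t lock_mutex = PTHREAD_MUTEX_INITIALIZER;

static int overlaps(unsigned long long s1, unsigned long long e1,
                    unsigned long long s2, unsigned long long e2)
{
    return s1 <= e2 && s2 <= e1;   /* closed intervals intersect */
}

/* try to lock [start, end]; 0 on success, -1 if an overlapping range is
 * already held by another csum-insertion task */
int range_trylock(unsigned long long start, unsigned long long end)
{
    int i, slot = -1;
    pthread_mutex_lock(&lock_mutex);
    for (i = 0; i < MAX_LOCKED; i++) {
        if (locked[i].used) {
            if (overlaps(start, end, locked[i].start, locked[i].end)) {
                pthread_mutex_unlock(&lock_mutex);
                return -1;         /* conflicting insertion in flight */
            }
        } else if (slot < 0) {
            slot = i;
        }
    }
    if (slot < 0) {                /* table full: treat as busy */
        pthread_mutex_unlock(&lock_mutex);
        return -1;
    }
    locked[slot].start = start;
    locked[slot].end = end;
    locked[slot].used = 1;
    pthread_mutex_unlock(&lock_mutex);
    return 0;
}

void range_unlock(unsigned long long start)
{
    int i;
    pthread_mutex_lock(&lock_mutex);
    for (i = 0; i < MAX_LOCKED; i++)
        if (locked[i].used && locked[i].start == start)
            locked[i].used = 0;
    pthread_mutex_unlock(&lock_mutex);
}
```

      With the range held across the lookup-release-insert window, task B in the scenario above cannot start its own search for the same range until task A has finished extending the existing checksum item, so no overlapping items can be created.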
      
      This issue could often be reproduced by the test case generic/457 from
      fstests. When it happens it produces the following trace:
      
       BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
       BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
       BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
            item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
            item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
            item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
            item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
            item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
            item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
       (...)
       BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
       Modules linked in: btrfs dm_thin_pool ...
       CPU: 1 PID: 15884 Comm: fsx Tainted: G        W         5.6.0-rc7-btrfs-next-58 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
       Code: c7 c7 ...
       RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
       RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
       RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
       R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
       R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
       FS:  00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btree_submit_bio_hook+0x67/0xc0 [btrfs]
        submit_one_bio+0x31/0x50 [btrfs]
        btree_write_cache_pages+0x2db/0x4b0 [btrfs]
        ? __filemap_fdatawrite_range+0xb1/0x110
        do_writepages+0x23/0x80
        __filemap_fdatawrite_range+0xd2/0x110
        btrfs_write_marked_extents+0x15e/0x180 [btrfs]
        btrfs_sync_log+0x206/0x10a0 [btrfs]
        ? kmem_cache_free+0x315/0x3b0
        ? btrfs_log_inode+0x1e8/0xf90 [btrfs]
        ? __mutex_unlock_slowpath+0x45/0x2a0
        ? lockref_put_or_lock+0x9/0x30
        ? dput+0x2d/0x580
        ? dput+0xb5/0x580
        ? btrfs_sync_file+0x464/0x4d0 [btrfs]
        btrfs_sync_file+0x464/0x4d0 [btrfs]
        do_fsync+0x38/0x60
        __x64_sys_fsync+0x10/0x20
        do_syscall_64+0x5c/0x280
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7fb41953a6d0
       Code: 48 3d ...
       RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
       RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
       RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
       RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
       R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
       R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
       softirqs last  enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace d543fc76f5ad7fd8 ]---
      
      In that trace the tree checker detected the overlapping checksum items at
      the time when we triggered writeback for the log tree when syncing the
      log.
      
      Another trace that can happen is due to BUG_ON() when deleting checksum
      items while logging an inode:
      
       BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
       BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
       BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
        item 0 key (257 1 0) itemoff 16123 itemsize 160
                inode generation 7 size 262144 mode 100600
        item 1 key (257 12 256) itemoff 16103 itemsize 20
        item 2 key (257 108 0) itemoff 16050 itemsize 53
                extent data disk bytenr 13631488 nr 4096
                extent data offset 0 nr 131072 ram 131072
       (...)
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/ctree.c:3153!
       invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
       CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
       Code: 0f b6 ...
       RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
       RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
       RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
       R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
       R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
       FS:  00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_del_csums+0x2f4/0x540 [btrfs]
        copy_items+0x4b5/0x560 [btrfs]
        btrfs_log_inode+0x910/0xf90 [btrfs]
        btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
        ? dget_parent+0x5/0x370
        btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        btrfs_sync_file+0x42b/0x4d0 [btrfs]
        __x64_sys_msync+0x199/0x200
        do_syscall_64+0x5c/0x280
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7fe586c65760
       Code: 00 f7 ...
       RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
       RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
       RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
       RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
       R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
       R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
       Modules linked in: dm_log_writes ...
       ---[ end trace c92a7f447a8515f5 ]---
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e289f03e
    • D
      btrfs: simplify iget helpers · 0202e83f
      Committed by David Sterba
      The inode lookup starting at btrfs_iget takes the full location key,
      while only the objectid is used to match the inode, because the lookup
      happens inside the given root, where the inode number is unique.
      The entire location key is properly set up in btrfs_init_locked_inode.
      
      Simplify the helpers and pass only the inode number, renaming it to
      'ino' instead of 'objectid'. This allows removing the temporary key
      variables, saving some stack space.
      Signed-off-by: David Sterba <dsterba@suse.com>
      0202e83f
    • Q
      btrfs: don't set SHAREABLE flag for data reloc tree · aeb935a4
      Committed by Qu Wenruo
      The SHAREABLE flag is set for subvolumes because users can create
      snapshots of subvolumes, thus sharing their tree blocks.
      
      But the data reloc tree is not exposed to user space, as it's only an
      internal tree for data relocation, so it doesn't need the full path
      replacement handling at all.
      
      This patch makes the data reloc tree a non-shareable tree, and adds
      btrfs_fs_info::data_reloc_root for the data reloc tree, so relocation
      code can grab it from fs_info directly.
      
      This slightly improves tree relocation, as the data reloc tree can now
      go through the regular COW routine to get relocated, without involving
      the complex reloc tree routines.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      aeb935a4
    • Q
      btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE · 92a7cc42
      Committed by Qu Wenruo
      The name BTRFS_ROOT_REF_COWS is not very clear about its meaning.
      
      In fact, that bit can only be set on the following trees:
      
      - Subvolume roots
      - Data reloc root
      - Reloc roots for above roots
      
      All other trees won't get this bit set.  So just by the result, it is
      obvious that roots with this bit set can have tree blocks shared with
      other trees, either shared by snapshots or by reloc roots (a special
      snapshot created by relocation).
      
      This patch renames BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
      make it easier to understand, and updates all comments mentioning
      "reference counted" to follow the rename.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      92a7cc42
    • D
      btrfs: constify extent_buffer in the API functions · 2b48966a
      Committed by David Sterba
      There are many helpers around extent buffers, found in extent_io.h and
      ctree.h. Most of them can be converted to take constified eb as there
      are no changes to the extent buffer structure itself but rather the
      pages.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2b48966a
    • D
      btrfs: preset set/get token with first page and drop condition · 870b388d
      Committed by David Sterba
      All the set/get helpers first check if the token contains a cached
      address. After first use the address is always valid, but the extra
      check is done for each call.
      
      The token initialization can optimistically set it to the first extent
      buffer page, which we know always exists. Then the condition in all
      btrfs_token_*/btrfs_set_token_* can be simplified by removing the
      address check from the condition, but for development the assertion
      still makes sure it's valid.
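      A user-space sketch of the idea (illustrative stand-ins, not the actual btrfs structures or helper names): the token caches a page address at init time, set optimistically to the first page, so the per-call helper only needs an assertion plus a "right page?" check rather than a "was it ever set?" branch.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096

struct eb { char pages[2][PAGE_SIZE]; };       /* stand-in extent buffer */
struct token { struct eb *eb; char *kaddr; };  /* kaddr: cached page address */

void token_init(struct token *tok, struct eb *eb)
{
    tok->eb = eb;
    /* optimistically preset to the first page, which always exists */
    tok->kaddr = eb->pages[0];
}

char token_get_byte(struct token *tok, size_t off)
{
    /* the address is always valid now; assert instead of branching on NULL */
    assert(tok->kaddr);
    /* refresh the cache only when the access lands on a different page */
    if (tok->kaddr != tok->eb->pages[off / PAGE_SIZE])
        tok->kaddr = tok->eb->pages[off / PAGE_SIZE];
    return tok->kaddr[off % PAGE_SIZE];
}
```

      The common case (repeated accesses to the same page) takes the cached pointer directly; only a page crossing pays for the recomputation.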
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      870b388d
    • D
      btrfs: drop eb parameter from set/get token helpers · cc4c13d5
      Committed by David Sterba
      Now that all set/get helpers use the eb from the token, we don't need to
      pass it to many btrfs_token_*/btrfs_set_token_* helpers, saving some
      stack space.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cc4c13d5
    • F
      btrfs: move the block group freeze/unfreeze helpers into block-group.c · 684b752b
      Committed by Filipe Manana
      The helpers btrfs_freeze_block_group() and btrfs_unfreeze_block_group()
      used to be named btrfs_get_block_group_trimming() and
      btrfs_put_block_group_trimming() respectively.
      
      They were added to free-space-cache.c by commit e33e17ee
      ("btrfs: add missing discards when unpinning extents with -o discard"),
      because at the time all the trimming-related functions were in
      free-space-cache.c.
      
      Now that the helpers were renamed and are used in scrub context as well,
      move them to block-group.c, a much more logical location for them.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      684b752b
    • F
      btrfs: rename member 'trimming' of block group to a more generic name · 6b7304af
      Committed by Filipe Manana
      Back in 2014, commit 04216820 ("Btrfs: fix race between fs trimming
      and block group remove/allocation"), I added the 'trimming' member to the
      block group structure. Its purpose was to prevent races between trimming
      and block group deletion/allocation by pinning the block group in a way
      that prevents its logical address and device extents from being reused
      while trimming is in progress, so that no other task can delete the
      block group and then have yet another task allocate a new block group
      that gets the same logical address and device extents while the trimming
      task is still in progress.
      
      After the previous fix for scrub (patch "btrfs: fix a race between scrub
      and block group removal/allocation"), scrub now also has the same needs that
      trimming has, so the member name 'trimming' no longer makes sense.
      Since there is already a 'pinned' member in the block group that refers
      to space reservations (pinned bytes), rename the member to 'frozen',
      add a comment on top of it to describe its general purpose and rename
      the helpers to increment and decrement the counter as well, to match
      the new member name.
      
      The next patch in the series will move the helpers into a more suitable
      file (from free-space-cache.c to block-group.c).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6b7304af