1. 27 7月, 2020 14 次提交
    • F
      btrfs: remove no longer needed use of log_writers for the log root tree · a93e0168
      Filipe Manana 提交于
      When syncing the log, we used to update the log root tree without holding
      neither the log_mutex of the subvolume root nor the log_mutex of log root
      tree.
      
      We used to have two critical sections delimited by the log_mutex of the
      log root tree, so in the first one we incremented the log_writers of the
      log root tree and on the second one we decremented it and waited for the
      log_writers counter to go down to zero. This was because the update of
      the log root tree happened between the two critical sections.
      
      The use of two critical sections allowed a little bit more of parallelism
      and required the use of the log_writers counter, necessary to make sure
      we didn't miss any log root tree update when we have multiple tasks trying
      to sync the log in parallel.
      
      However after commit 06989c79 ("Btrfs: fix race updating log root
      item during fsync") the log root tree update was moved into a critical
      section delimited by the subvolume's log_mutex. Later another commit
      moved the log tree update from that critical section into the second
      critical section delimited by the log_mutex of the log root tree. Both
      commits addressed different bugs.
      
      The end result is that the first critical section delimited by the
      log_mutex of the log root tree became pointless, since there's nothing
      done between it and the second critical section, we just have an unlock
      of the log_mutex followed by a lock operation. This means we can merge
      both critical sections, as the first one does almost nothing now, and we
      can stop using the log_writers counter of the log root tree, which was
      incremented in the first critical section and decremented in the second
      criticial section, used to make sure no one in the second critical section
      started writeback of the log root tree before some other task updated it.
      
      So just remove the mutex_unlock() followed by mutex_lock() of the log root
      tree, as well as the use of the log_writers counter for the log root tree.
      
      This patch is part of a series that has the following patches:
      
      1/4 btrfs: only commit the delayed inode when doing a full fsync
      2/4 btrfs: only commit delayed items at fsync if we are logging a directory
      3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
      4/4 btrfs: remove no longer needed use of log_writers for the log root tree
      
      After the entire patchset applied I saw about 12% decrease on max latency
      reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
      ram, using kvm and using a raw NVMe device directly (no intermediary fs on
      the host). The test was invoked like the following:
      
        mkfs.btrfs -f /dev/sdk
        mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
        dbench -D /mnt/sdk -t 300 8
        umount /mnt/dsk
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a93e0168
    • F
      btrfs: stop incremening log_batch for the log root tree when syncing log · 28a95795
      Filipe Manana 提交于
      We are incrementing the log_batch atomic counter of the root log tree but
      we never use that counter, it's used only for the log trees of subvolume
      roots. We started doing it when we moved the log_batch and log_write
      counters from the global, per fs, btrfs_fs_info structure, into the
      btrfs_root structure in commit 7237f183 ("Btrfs: fix tree logs
      parallel sync").
      
      So just stop doing it for the log root tree and add a comment over the
      field declaration so inform it's used only for log trees of subvolume
      roots.
      
      This patch is part of a series that has the following patches:
      
      1/4 btrfs: only commit the delayed inode when doing a full fsync
      2/4 btrfs: only commit delayed items at fsync if we are logging a directory
      3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
      4/4 btrfs: remove no longer needed use of log_writers for the log root tree
      
      After the entire patchset applied I saw about 12% decrease on max latency
      reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
      ram, using kvm and using a raw NVMe device directly (no intermediary fs on
      the host). The test was invoked like the following:
      
        mkfs.btrfs -f /dev/sdk
        mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
        dbench -D /mnt/sdk -t 300 8
        umount /mnt/dsk
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      28a95795
    • Q
      btrfs: qgroup: export qgroups in sysfs · 49e5fb46
      Qu Wenruo 提交于
      This patch will add the following sysfs interface:
      
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/referenced
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/exclusive
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_referenced
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_exclusive
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/limit_flags
      
      Which is also available in output of "btrfs qgroup show".
      
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_data
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_pertrans
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_prealloc
      
      The last 3 rsv related members are not visible to users, but can be very
      useful to debug qgroup limit related bugs.
      
      Also, to avoid '/' used in <qgroup_id>, the separator between qgroup
      level and qgroup id is changed to '_'.
      
      The interface is not hidden behind 'debug' as we want this interface to
      be included into production build and to provide another way to read the
      qgroup information besides the ioctls.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      49e5fb46
    • N
      btrfs: make btrfs_dirty_pages take btrfs_inode · 088545f6
      Nikolay Borisov 提交于
      There is a single use of the generic vfs_inode so let's take btrfs_inode
      as a parameter and remove couple of redundant BTRFS_I() calls.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      088545f6
    • N
      btrfs: make btrfs_set_extent_delalloc take btrfs_inode · c2566f22
      Nikolay Borisov 提交于
      Preparation to make btrfs_dirty_pages take btrfs_inode as parameter.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c2566f22
    • N
      btrfs: make btrfs_run_delalloc_range take btrfs_inode · 98456b9c
      Nikolay Borisov 提交于
      All children now take btrfs_inode so convert it to taking it as a
      parameter as well.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      98456b9c
    • Q
      btrfs: refactor btrfs_check_can_nocow() into two variants · 38d37aa9
      Qu Wenruo 提交于
      The function btrfs_check_can_nocow() now has two completely different
      call patterns.
      
      For nowait variant, callers don't need to do any cleanup.  While for
      wait variant, callers need to release the lock if they can do nocow
      write.
      
      This is somehow confusing, and is already a problem for the exported
      btrfs_check_can_nocow().
      
      So this patch will separate the different patterns into different
      functions.
      For nowait variant, the function will be called check_nocow_nolock().
      For wait variant, the function pair will be btrfs_check_nocow_lock()
      btrfs_check_nocow_unlock().
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      38d37aa9
    • Q
      btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation · 6d4572a9
      Qu Wenruo 提交于
      [BUG]
      When the data space is exhausted, even if the inode has NOCOW attribute,
      we will still refuse to truncate unaligned range due to ENOSPC.
      
      The following script can reproduce it pretty easily:
        #!/bin/bash
      
        dev=/dev/test/test
        mnt=/mnt/btrfs
      
        umount $dev &> /dev/null
        umount $mnt &> /dev/null
      
        mkfs.btrfs -f $dev -b 1G
        mount -o nospace_cache $dev $mnt
        touch $mnt/foobar
        chattr +C $mnt/foobar
      
        xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null
        xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null
        sync
      
        xfs_io -c "fpunch 0 2k" $mnt/foobar
        umount $mnt
      
      Currently this will fail at the fpunch part.
      
      [CAUSE]
      Because btrfs_truncate_block() always reserves space without checking
      the NOCOW attribute.
      
      Since the writeback path follows NOCOW bit, we only need to bother the
      space reservation code in btrfs_truncate_block().
      
      [FIX]
      Make btrfs_truncate_block() follow btrfs_buffered_write() to try to
      reserve data space first, and fall back to NOCOW check only when we
      don't have enough space.
      
      Such always-try-reserve is an optimization introduced in
      btrfs_buffered_write(), to avoid expensive btrfs_check_can_nocow() call.
      
      This patch will export check_can_nocow() as btrfs_check_can_nocow(), and
      use it in btrfs_truncate_block() to fix the problem.
      Reported-by: NMartin Doucha <martin.doucha@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6d4572a9
    • D
      btrfs: remove unused btrfs_root::defrag_trans_start · a2570ef3
      David Sterba 提交于
      Last touched in 2013 by commit de78b51a ("btrfs: remove cache only
      arguments from defrag path") that was the only code that used the value.
      Now it's only set but never used for anything, so we can remove it.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a2570ef3
    • N
      btrfs: make __btrfs_drop_extents take btrfs_inode · 906c448c
      Nikolay Borisov 提交于
      It has only 4 uses of a vfs_inode for inode_sub_bytes but unifies the
      interface with the non  __ prefixed version. Will also makes converting
      its callers to btrfs_inode easier.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      906c448c
    • N
      btrfs: make btrfs_csum_one_bio takae btrfs_inode · bd242a08
      Nikolay Borisov 提交于
      Will enable converting btrfs_submit_compressed_write to btrfs_inode more
      easily.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bd242a08
    • N
      btrfs: make btrfs_reloc_clone_csums take btrfs_inode · 7bfa9535
      Nikolay Borisov 提交于
      It really wants btrfs_inode and not a vfs inode.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7bfa9535
    • D
      btrfs: add little-endian optimized key helpers · ce6ef5ab
      David Sterba 提交于
      The CPU and on-disk keys are mapped to two different structures because
      of the endianness. There's an intermediate buffer used to do the
      conversion, but this is not necessary when CPU and on-disk endianness
      match.
      
      Add optimized versions of helpers that take disk_key and use the buffer
      directly for CPU keys or drop the intermediate buffer and conversion.
      
      This saves a lot of stack space accross many functions and removes about
      6K of generated binary code:
      
         text    data     bss     dec     hex filename
      1090439   17468   14912 1122819  112203 pre/btrfs.ko
      1084613   17456   14912 1116981  110b35 post/btrfs.ko
      
      Delta: -5826
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ce6ef5ab
    • Q
      btrfs: inode: refactor the parameters of insert_reserved_file_extent() · 203f44c5
      Qu Wenruo 提交于
      Function insert_reserved_file_extent() takes a long list of parameters,
      which are all for btrfs_file_extent_item, even including two reserved
      members, encryption and other_encoding.
      
      This makes the parameter list unnecessary long for a function which only
      gets called twice.
      
      This patch will refactor the parameter list, by using
      btrfs_file_extent_item as parameter directly to hugely reduce the number
      of parameters.
      
      Also, since there are only two callers, one in btrfs_finish_ordered_io()
      which inserts file extent for ordered extent, and one
      __btrfs_prealloc_file_range().
      
      These two call sites have completely different context, where ordered
      extent can be compressed, but will always be regular extent, while the
      preallocated one is never going to be compressed and always has PREALLOC
      type.
      
      So use two small wrapper for these two different call sites to improve
      readability.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      203f44c5
  2. 17 6月, 2020 1 次提交
    • F
      btrfs: check if a log root exists before locking the log_mutex on unlink · e7a79811
      Filipe Manana 提交于
      This brings back an optimization that commit e678934c ("btrfs:
      Remove unnecessary check from join_running_log_trans") removed, but in
      a different form. So it's almost equivalent to a revert.
      
      That commit removed an optimization where we avoid locking a root's
      log_mutex when there is no log tree created in the current transaction.
      The affected code path is triggered through unlink operations.
      
      That commit was based on the assumption that the optimization was not
      necessary because we used to have the following checks when the patch
      was authored:
      
        int btrfs_del_dir_entries_in_log(...)
        {
              (...)
              if (dir->logged_trans < trans->transid)
                  return 0;
      
              ret = join_running_log_trans(root);
              (...)
         }
      
         int btrfs_del_inode_ref_in_log(...)
         {
              (...)
              if (inode->logged_trans < trans->transid)
                  return 0;
      
              ret = join_running_log_trans(root);
              (...)
         }
      
      However before that patch was merged, another patch was merged first which
      replaced those checks because they were buggy.
      
      That other patch corresponds to commit 803f0f64 ("Btrfs: fix fsync
      not persisting dentry deletions due to inode evictions"). The assumption
      that if the logged_trans field of an inode had a smaller value then the
      current transaction's generation (transid) meant that the inode was not
      logged in the current transaction was only correct if the inode was not
      evicted and reloaded in the current transaction. So the corresponding bug
      fix changed those checks and replaced them with the following helper
      function:
      
        static bool inode_logged(struct btrfs_trans_handle *trans,
                                 struct btrfs_inode *inode)
        {
              if (inode->logged_trans == trans->transid)
                      return true;
      
              if (inode->last_trans == trans->transid &&
                  test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) &&
                  !test_bit(BTRFS_FS_LOG_RECOVERING, &trans->fs_info->flags))
                      return true;
      
              return false;
        }
      
      So if we have a subvolume without a log tree in the current transaction
      (because we had no fsyncs), every time we unlink an inode we can end up
      trying to lock the log_mutex of the root through join_running_log_trans()
      twice, once for the inode being unlinked (by btrfs_del_inode_ref_in_log())
      and once for the parent directory (with btrfs_del_dir_entries_in_log()).
      
      This means if we have several unlink operations happening in parallel for
      inodes in the same subvolume, and the those inodes and/or their parent
      inode were changed in the current transaction, we end up having a lot of
      contention on the log_mutex.
      
      The test robots from intel reported a -30.7% performance regression for
      a REAIM test after commit e678934c ("btrfs: Remove unnecessary check
      from join_running_log_trans").
      
      So just bring back the optimization to join_running_log_trans() where we
      check first if a log root exists before trying to lock the log_mutex. This
      is done by checking for a bit that is set on the root when a log tree is
      created and removed when a log tree is freed (at transaction commit time).
      
      Commit e678934c ("btrfs: Remove unnecessary check from
      join_running_log_trans") was merged in the 5.4 merge window while commit
      803f0f64 ("Btrfs: fix fsync not persisting dentry deletions due to
      inode evictions") was merged in the 5.3 merge window. But the first
      commit was actually authored before the second commit (May 23 2019 vs
      June 19 2019).
      Reported-by: Nkernel test robot <rong.a.chen@intel.com>
      Link: https://lore.kernel.org/lkml/20200611090233.GL12456@shao2-debian/
      Fixes: e678934c ("btrfs: Remove unnecessary check from join_running_log_trans")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e7a79811
  3. 14 6月, 2020 1 次提交
    • D
      Revert "btrfs: switch to iomap_dio_rw() for dio" · 55e20bd1
      David Sterba 提交于
      This reverts commit a43a67a2.
      
      This patch reverts the main part of switching direct io implementation
      to iomap infrastructure. There's a problem in invalidate page that
      couldn't be solved as regression in this development cycle.
      
      The problem occurs when buffered and direct io are mixed, and the ranges
      overlap. Although this is not recommended, filesystems implement
      measures or fallbacks to make it somehow work. In this case, fallback to
      buffered IO would be an option for btrfs (this already happens when
      direct io is done on compressed data), but the change would be needed in
      the iomap code, bringing new semantics to other filesystems.
      
      Another problem arises when again the buffered and direct ios are mixed,
      invalidation fails, then -EIO is set on the mapping and fsync will fail,
      though there's no real error.
      
      There have been discussions how to fix that, but revert seems to be the
      least intrusive option.
      
      Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/Signed-off-by: NDavid Sterba <dsterba@suse.com>
      55e20bd1
  4. 10 6月, 2020 1 次提交
  5. 28 5月, 2020 2 次提交
    • C
      btrfs: split btrfs_direct_IO to read and write part · d8f3e735
      Christoph Hellwig 提交于
      The read and write versions don't have anything in common except for the
      call to iomap_dio_rw.  So split this function, and merge each half into
      its only caller.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d8f3e735
    • G
      btrfs: switch to iomap_dio_rw() for dio · a43a67a2
      Goldwyn Rodrigues 提交于
      Switch from __blockdev_direct_IO() to iomap_dio_rw().
      Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
      as iomap_begin() for iomap direct I/O functions. This function
      allocates and locks all the blocks required for the I/O.
      btrfs_submit_direct() is used as the submit_io() hook for direct I/O
      ops.
      
      Since we need direct I/O reads to go through iomap_dio_rw(), we change
      file_operations.read_iter() to a btrfs_file_read_iter() which calls
      btrfs_direct_IO() for direct reads and falls back to
      generic_file_buffered_read() for incomplete reads and buffered reads.
      
      We don't need address_space.direct_IO() anymore so set it to noop.
      Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
      capable of direct I/O reads from a hole, so we don't need to return
      -ENOENT.
      
      BTRFS direct I/O is now done under i_rwsem, shared in case of reads and
      exclusive in case of writes. This guards against simultaneous truncates.
      
      Use iomap->iomap_end() to check for failed or incomplete direct I/O:
       - for writes, call __endio_write_update_ordered()
       - for reads, unlock extents
      
      btrfs_dio_data is now hooked in iomap->private and not
      current->journal_info. It carries the reservation variable and the
      amount of data submitted, so we can calculate the amount of data to call
      __endio_write_update_ordered in case of an error.
      
      This patch removes last use of struct buffer_head from btrfs.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a43a67a2
  6. 25 5月, 2020 16 次提交
    • F
      btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents · e289f03e
      Filipe Manana 提交于
      When we have extents shared amongst different inodes in the same subvolume,
      if we fsync them in parallel we can end up with checksum items in the log
      tree that represent ranges which overlap.
      
      For example, consider we have inodes A and B, both sharing an extent that
      covers the logical range from X to X + 64KiB:
      
      1) Task A starts an fsync on inode A;
      
      2) Task B starts an fsync on inode B;
      
      3) Task A calls btrfs_csum_file_blocks(), and the first search in the
         log tree, through btrfs_lookup_csum(), returns -EFBIG because it
         finds an existing checksum item that covers the range from X - 64KiB
         to X;
      
      4) Task A checks that the checksum item has not reached the maximum
         possible size (MAX_CSUM_ITEMS) and then releases the search path
         before it does another path search for insertion (through a direct
         call to btrfs_search_slot());
      
      5) As soon as task A releases the path and before it does the search
         for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
         too, because there is an existing checksum item that has an end
         offset that matches the start offset (X) of the checksum range we want
         to log;
      
      6) Task B releases the path;
      
      7) Task A does the path search for insertion (through btrfs_search_slot())
         and then verifies that the checksum item that ends at offset X still
         exists and extends its size to insert the checksums for the range from
         X to X + 64KiB;
      
      8) Task A releases the path and returns from btrfs_csum_file_blocks(),
         having inserted the checksums into an existing checksum item that got
         its size extended. At this point we have one checksum item in the log
         tree that covers the logical range from X - 64KiB to X + 64KiB;
      
      9) Task B now does a search for insertion using btrfs_search_slot() too,
         but it finds that the previous checksum item no longer ends at the
         offset X, it now ends at an of offset X + 64KiB, so it leaves that item
         untouched.
      
         Then it releases the path and calls btrfs_insert_empty_item()
         that inserts a checksum item with a key offset corresponding to X and
         a size for inserting a single checksum (4 bytes in case of crc32c).
         Subsequent iterations end up extending this new checksum item so that
         it contains the checksums for the range from X to X + 64KiB.
      
         So after task B returns from btrfs_csum_file_blocks() we end up with
         two checksum items in the log tree that have overlapping ranges, one
         for the range from X - 64KiB to X + 64KiB, and another for the range
         from X to X + 64KiB.
      
      Having checksum items that represent ranges which overlap, regardless of
      being in the log tree or in the chekcsums tree, can lead to problems where
      checksums for a file range end up not being found. This type of problem
      has happened a few times in the past and the following commits fixed them
      and explain in detail why having checksum items with overlapping ranges is
      problematic:
      
        27b9a812 "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
        b84b8390 "Btrfs: fix file read corruption after extent cloning and fsync"
        40e046ac "Btrfs: fix missing data checksums after replaying a log tree"
      
      Since this specific instance of the problem can only happen when logging
      inodes, because it is the only case where concurrent attempts to insert
      checksums for the same range can happen, fix the issue by using an extent
      io tree as a range lock to serialize checksum insertion during inode
      logging.
      
      This issue could often be reproduced by the test case generic/457 from
      fstests. When it happens it produces the following trace:
      
       BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
       BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
       BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
            item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
            item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
            item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
            item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
            item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
            item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
       (...)
       BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
       Modules linked in: btrfs dm_thin_pool ...
       CPU: 1 PID: 15884 Comm: fsx Tainted: G        W         5.6.0-rc7-btrfs-next-58 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
       Code: c7 c7 ...
       RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
       RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
       RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
       R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
       R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
       FS:  00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btree_submit_bio_hook+0x67/0xc0 [btrfs]
        submit_one_bio+0x31/0x50 [btrfs]
        btree_write_cache_pages+0x2db/0x4b0 [btrfs]
        ? __filemap_fdatawrite_range+0xb1/0x110
        do_writepages+0x23/0x80
        __filemap_fdatawrite_range+0xd2/0x110
        btrfs_write_marked_extents+0x15e/0x180 [btrfs]
        btrfs_sync_log+0x206/0x10a0 [btrfs]
        ? kmem_cache_free+0x315/0x3b0
        ? btrfs_log_inode+0x1e8/0xf90 [btrfs]
        ? __mutex_unlock_slowpath+0x45/0x2a0
        ? lockref_put_or_lock+0x9/0x30
        ? dput+0x2d/0x580
        ? dput+0xb5/0x580
        ? btrfs_sync_file+0x464/0x4d0 [btrfs]
        btrfs_sync_file+0x464/0x4d0 [btrfs]
        do_fsync+0x38/0x60
        __x64_sys_fsync+0x10/0x20
        do_syscall_64+0x5c/0x280
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7fb41953a6d0
       Code: 48 3d ...
       RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
       RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
       RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
       RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
       R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
       R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
       softirqs last  enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace d543fc76f5ad7fd8 ]---
      
      In that trace the tree checker detected the overlapping checksum items at
      the time when we triggered writeback for the log tree when syncing the
      log.
      
      Another trace that can happen is due to BUG_ON() when deleting checksum
      items while logging an inode:
      
       BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
       BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
       BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
        item 0 key (257 1 0) itemoff 16123 itemsize 160
                inode generation 7 size 262144 mode 100600
        item 1 key (257 12 256) itemoff 16103 itemsize 20
        item 2 key (257 108 0) itemoff 16050 itemsize 53
                extent data disk bytenr 13631488 nr 4096
                extent data offset 0 nr 131072 ram 131072
       (...)
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/ctree.c:3153!
       invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
       CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
       Code: 0f b6 ...
       RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
       RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
       RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
       R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
       R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
       FS:  00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_del_csums+0x2f4/0x540 [btrfs]
        copy_items+0x4b5/0x560 [btrfs]
        btrfs_log_inode+0x910/0xf90 [btrfs]
        btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
        ? dget_parent+0x5/0x370
        btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        btrfs_sync_file+0x42b/0x4d0 [btrfs]
        __x64_sys_msync+0x199/0x200
        do_syscall_64+0x5c/0x280
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7fe586c65760
       Code: 00 f7 ...
       RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
       RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
       RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
       RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
       R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
       R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
       Modules linked in: dm_log_writes ...
       ---[ end trace c92a7f447a8515f5 ]---
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e289f03e
    • D
      btrfs: simplify iget helpers · 0202e83f
      David Sterba 提交于
      The inode lookup starting at btrfs_iget takes the full location key,
      while only the objectid is used to match the inode, because the lookup
      happens inside the given root thus the inode number is unique.
      The entire location key is properly set up in btrfs_init_locked_inode.
      
      Simplify the helpers and pass only inode number, renaming it to 'ino'
      instead of 'objectid'. This allows to remove temporary variables key,
      saving some stack space.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0202e83f
    • Q
      btrfs: don't set SHAREABLE flag for data reloc tree · aeb935a4
      Qu Wenruo 提交于
      SHAREABLE flag is set for subvolumes because users can create snapshot
      for subvolumes, thus sharing tree blocks of them.
      
      But data reloc tree is not exposed to user space, as it's only an
      internal tree for data relocation, thus it doesn't need the full path
      replacement handling at all.
      
      This patch will make data reloc tree a non-shareable tree, and add
      btrfs_fs_info::data_reloc_root for data reloc tree, so relocation code
      can grab it from fs_info directly.
      
      This would slightly improve tree relocation, as now data reloc tree
      can go through regular COW routine to get relocated, without bothering
      the complex tree reloc tree routine.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      aeb935a4
    • Q
      btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE · 92a7cc42
      Qu Wenruo 提交于
      The name BTRFS_ROOT_REF_COWS is not very clear about the meaning.
      
      In fact, that bit can only be set to those trees:
      
      - Subvolume roots
      - Data reloc root
      - Reloc roots for above roots
      
      All other trees won't get this bit set.  So just by the result, it is
      obvious that, roots with this bit set can have tree blocks shared with
      other trees.  Either shared by snapshots, or by reloc roots (an special
      snapshot created by relocation).
      
      This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
      make it easier to understand, and update all comment mentioning
      "reference counted" to follow the rename.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      92a7cc42
    • D
      btrfs: constify extent_buffer in the API functions · 2b48966a
      David Sterba 提交于
      There are many helpers around extent buffers, found in extent_io.h and
      ctree.h. Most of them can be converted to take constified eb as there
      are no changes to the extent buffer structure itself but rather the
      pages.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2b48966a
    • D
      btrfs: preset set/get token with first page and drop condition · 870b388d
      David Sterba 提交于
      All the set/get helpers first check if the token contains a cached
      address. After first use the address is always valid, but the extra
      check is done for each call.
      
      The token initialization can optimistically set it to the first extent
      buffer page, that we know always exists. Then the condition in all
      btrfs_token_*/btrfs_set_token_* can be simplified by removing the
      address check from the condition, but for development the assertion
      still makes sure it's valid.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      870b388d
    • D
      btrfs: drop eb parameter from set/get token helpers · cc4c13d5
      David Sterba 提交于
      Now that all set/get helpers use the eb from the token, we don't need to
      pass it to many btrfs_token_*/btrfs_set_token_* helpers, saving some
      stack space.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cc4c13d5
    • F
      btrfs: move the block group freeze/unfreeze helpers into block-group.c · 684b752b
      Filipe Manana 提交于
      The helpers btrfs_freeze_block_group() and btrfs_unfreeze_block_group()
      used to be named btrfs_get_block_group_trimming() and
      btrfs_put_block_group_trimming() respectively.
      
      At the time they were added to free-space-cache.c, by commit e33e17ee
      ("btrfs: add missing discards when unpinning extents with -o discard")
      because all the trimming related functions were in free-space-cache.c.
      
      Now that the helpers were renamed and are used in scrub context as well,
      move them to block-group.c, a much more logical location for them.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      684b752b
    • F
      btrfs: rename member 'trimming' of block group to a more generic name · 6b7304af
      Filipe Manana 提交于
      Back in 2014, commit 04216820 ("Btrfs: fix race between fs trimming
      and block group remove/allocation"), I added the 'trimming' member to the
      block group structure. Its purpose was to prevent races between trimming
      and block group deletion/allocation by pinning the block group in a way
      that prevents its logical address and device extents from being reused
      while trimming is in progress for a block group, so that if another task
      deletes the block group and then another task allocates a new block group
      that gets the same logical address and device extents while the trimming
      task is still in progress.
      
      After the previous fix for scrub (patch "btrfs: fix a race between scrub
      and block group removal/allocation"), scrub now also has the same needs that
      trimming has, so the member name 'trimming' no longer makes sense.
      Since there is already a 'pinned' member in the block group that refers
      to space reservations (pinned bytes), rename the member to 'frozen',
      add a comment on top of it to describe its general purpose and rename
      the helpers to increment and decrement the counter as well, to match
      the new member name.
      
      The next patch in the series will move the helpers into a more suitable
      file (from free-space-cache.c to block-group.c).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6b7304af
    • D
      btrfs: remove more obsolete v0 extent ref declarations · 31344b2f
      David Sterba 提交于
      The extent references v0 have been superseded long time go, there are
      some unused declarations of access helpers. We can safely remove them
      now. The struct btrfs_extent_ref_v0 is not used anywhere, but struct
      btrfs_extent_item_v0 is still part of a backward compatibility check in
      relocation.c and thus not removed.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      31344b2f
    • Y
      btrfs: remove unused function btrfs_dev_extent_chunk_tree_uuid · 943aeb0d
      YueHaibing 提交于
      There's no callers in-tree anymore since
      commit d24ee97b ("btrfs: use new helpers to set uuids in eb")
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      943aeb0d
    • O
      btrfs: get rid of endio_repair_workers · 5c047a69
      Omar Sandoval 提交于
      This was originally added in commit 8b110e39 ("Btrfs: implement
      repair function when direct read fails") to avoid a deadlock. In that
      commit, the direct I/O read endio executes on the endio_workers
      workqueue, submits a repair bio, and waits for it to complete. The
      repair bio endio must execute on a different workqueue, otherwise it
      could block on the endio_workers workqueue becoming available, which
      won't happen because the original endio is blocked on the repair bio.
      
      As of the previous commit, the original endio doesn't wait for the
      repair bio, so this separate workqueue is unnecessary.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5c047a69
    • Q
      btrfs: remove the redundant parameter level in btrfs_bin_search() · e3b83361
      Qu Wenruo 提交于
      All callers pass the eb::level so we can get read it directly inside the
      btrfs_bin_search and key_search.
      
      This is inspired by the work of Marek in U-boot.
      
      CC: Marek Behun <marek.behun@nic.cz>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e3b83361
    • J
      btrfs: improve global reserve stealing logic · 7f9fe614
      Josef Bacik 提交于
      For unlink transactions and block group removal
      btrfs_start_transaction_fallback_global_rsv will first try to start an
      ordinary transaction and if it fails it will fall back to reserving the
      required amount by stealing from the global reserve. This is problematic
      because of all the same reasons we had with previous iterations of the
      ENOSPC handling, thundering herd.  We get a bunch of failures all at
      once, everybody tries to allocate from the global reserve, some win and
      some lose, we get an ENSOPC.
      
      Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's
      used to mark unlink reservation. To fix this we need to integrate this
      logic into the normal ENOSPC infrastructure.  We still go through all of
      the normal flushing work, and at the moment we begin to fail all the
      tickets we try to satisfy any tickets that are allowed to steal by
      stealing from the global reserve.  If this works we start the flushing
      system over again just like we would with a normal ticket satisfaction.
      This serializes our global reserve stealing, so we don't have the
      thundering herd problem.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7f9fe614
    • Q
      btrfs: backref: rename and move should_ignore_root() · 55465730
      Qu Wenruo 提交于
      This function is mostly single purpose to relocation backref cache, but
      since we're moving the main part of backref cache to backref.c, we need
      to export such function.
      
      And to avoid confusion, rename the function to
      btrfs_should_ignore_reloc_root() make the name a little more clear.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      55465730
    • Q
      btrfs: reloc: make reloc root search-specific for relocation backref cache · 2433bea5
      Qu Wenruo 提交于
      find_reloc_root() searches reloc_control::reloc_root_tree to find the
      reloc root.  This behavior is only useful for relocation backref cache.
      
      For the incoming more generic purpose backref cache, we don't care
      about who owns the reloc root, but only care if it's a reloc root.
      
      So this patch makes the following modifications to make the reloc root
      search more specific to relocation backref:
      
      - Add backref_node::is_reloc_root
        This will be an extra indicator for generic purposed backref cache.
        User doesn't need to read root key from backref_node::root to
        determine if it's a reloc root.
        Also for reloc tree root, it's useless and will be queued to useless
        list.
      
      - Add backref_cache::is_reloc
        This will allow backref cache code to do different behavior for
        generic purpose backref cache and relocation backref cache.
      
      - Pass fs_info to find_reloc_root()
      
      - Export find_reloc_root()
        So backref.c can utilize this function.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2433bea5
  7. 24 3月, 2020 5 次提交