1. 16 May, 2022 16 commits
    • btrfs: use a normal workqueue for rmw_workers · 385de0ef
      Christoph Hellwig committed
      rmw_workers doesn't need ordered execution or thread disabling threshold
      (as the thresh parameter is less than DFT_THRESHOLD).
      
      Just switch to the normal workqueues that use a lot less resources,
      especially in the work_struct vs btrfs_work structures.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use normal workqueues for scrub · be539518
      Christoph Hellwig committed
      All three scrub workqueues don't need ordered execution or thread
      disabling threshold (as the thresh parameter is less than DFT_THRESHOLD).
      Just switch to the normal workqueues that use a lot less resources,
      especially in the work_struct vs btrfs_work structures.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify WQ_HIGHPRI handling in struct btrfs_workqueue · a31b4a43
      Christoph Hellwig committed
      Just let the one caller that wants optional WQ_HIGHPRI handling allocate
      a separate btrfs_workqueue for that.  This allows renaming struct
      __btrfs_workqueue to btrfs_workqueue, removing a pointer indirection and
      a separate allocation for all btrfs_workqueue users, and generally
      simplifying the code.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not return errors from submit_bio_hook_t instances · ad357938
      Christoph Hellwig committed
      Both btrfs_repair_one_sector and submit_one_bio, as the direct callers
      of one of the instances, ignore errors as they expect the methods
      themselves to call ->bi_end_io on error.  Remove the unused and
      dangerous return value.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move btrfs_readpage to extent_io.c · 7aab8b32
      Christoph Hellwig committed
      Keep btrfs_readpage next to btrfs_do_readpage and the other address
      space operations.  This allows keeping submit_one_bio and
      struct btrfs_bio_ctrl file local in extent_io.c.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use a read/write lock for protecting the block groups tree · 16b0c258
      Filipe Manana committed
      Currently we use a spin lock to protect the red black tree that we use to
      track block groups. Most accesses to that tree are actually read only and
      for large filesystems, with thousands of block groups, it actually has
      a bad impact on performance, as concurrent read only searches on the tree
      are serialized.
      
      Read only searches on the tree are very frequent and done when:
      
      1) Pinning and unpinning extents, as we need to look up the respective
         block group from the tree;
      
      2) Freeing the last reference of a tree block, regardless of whether we
         pin the underlying extent or add it back to free space cache/tree;
      
      3) During NOCOW writes, both buffered IO and direct IO, we need to check
         if the block group that contains an extent is read only or not and to
         increment the number of NOCOW writers in the block group. For those
         operations we need to search for the block group in the tree.
         Similarly, after creating the ordered extent for the NOCOW write, we
         need to decrement the number of NOCOW writers from the same block
         group, which requires searching for it in the tree;
      
      4) Decreasing the number of extent reservations in a block group;
      
      5) When allocating extents and freeing reserved extents;
      
      6) Adding and removing free space to the free space tree;
      
      7) When releasing delalloc bytes during ordered extent completion;
      
      8) When relocating a block group;
      
      9) During fitrim, to iterate over the block groups;
      
      10) etc;
      
      Write accesses to the tree, to add or remove block groups, are much less
      frequent as they happen only when allocating a new block group or when
      deleting a block group.
      
      We also use the same spin lock to protect the list of currently caching
      block groups. Additions to this list are made when we need to cache a
      block group, because we don't have a free space cache for it (or we have
      but it's invalid), and removals from this list are done when caching of
      the block group's free space finishes. These cases are also not very
      common, but when they happen, they happen only once when the filesystem
      is mounted.
      
      So switch the lock that protects the tree of block groups from a spinning
      lock to a read/write lock.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use rbtree with leftmost node cached for tracking lowest block group · 08dddb29
      Filipe Manana committed
      We keep track of the start offset of the block group with the lowest start
      offset at fs_info->first_logical_byte. This requires explicitly updating
      that field every time we add, delete or look up a block group to/from the
      red black tree at fs_info->block_group_cache_tree.
      
      Since the block group with the lowest start address happens to always be
      the one that is the leftmost node of the tree, we can use a red black tree
      that caches the leftmost node. Then when we need the start address of
      that block group, we can just quickly get the leftmost node in the tree
      and extract the start offset of that node's block group. This avoids the
      need to explicitly keep track of that address in the dedicated member
      fs_info->first_logical_byte, and it also allows the next patch in the
      series to switch the lock that protects the red black tree from a spin
      lock to a read/write lock - without this change it would be tricky
      because block group searches also update fs_info->first_logical_byte.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
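A minimal userspace sketch of the idea in 08dddb29, assuming a plain unbalanced binary search tree instead of the kernel's red black tree. The names `bg_node`, `bg_tree`, `bg_insert` and `bg_first_logical_byte` are hypothetical and exist only for this illustration. The leftmost node is cached at insert time, so reading the lowest start offset needs no tree walk and no separately maintained field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a block group keyed by its start offset. */
struct bg_node {
    uint64_t start;
    struct bg_node *left, *right;
};

/* Tree root plus a cached pointer to the leftmost (lowest start) node. */
struct bg_tree {
    struct bg_node *root;
    struct bg_node *leftmost;
};

/* Insert a node; the cache changes only if the descent never went right. */
static void bg_insert(struct bg_tree *tree, struct bg_node *node)
{
    struct bg_node **link;
    int is_leftmost = 1;

    assert(tree != NULL && node != NULL);
    link = &tree->root;
    node->left = node->right = NULL;
    while (*link) {
        if (node->start < (*link)->start) {
            link = &(*link)->left;
        } else {
            link = &(*link)->right;
            is_leftmost = 0;
        }
    }
    *link = node;
    if (is_leftmost)
        tree->leftmost = node;
}

/* O(1) read of the lowest start offset; 0 for an empty tree (sketch choice). */
static uint64_t bg_first_logical_byte(const struct bg_tree *tree)
{
    return tree->leftmost ? tree->leftmost->start : 0;
}
```

Because lookups only read the cached pointer and never write to it, a read-only search no longer modifies shared state, which is the property the follow-up read/write lock change relies on.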
    • btrfs: remove the zoned/zone_size union in struct btrfs_fs_info · 8e010b3d
      Christoph Hellwig committed
      Reading a value from a different member of a union is not just a great
      way to obfuscate code, but also creates an aliasing violation.  Switch
      btrfs_is_zoned to look at ->zone_size and remove the union.
      
      Note: the union was meant to simplify the detection of a zoned
      filesystem, but now this is wrapped behind btrfs_is_zoned so we can drop
      the union.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
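A small sketch of the shape that 8e010b3d describes, using a simplified stand-in struct. `fs_info_sketch` and `is_zoned_sketch` are hypothetical names, not the kernel's definitions, and treating a zone size of 0 as "not zoned" is an assumption of this sketch. A single `zone_size` member carries the information and the helper derives the boolean from it, instead of reading a different member of a union.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct fs_info_sketch {
    /* Zone size in bytes; assumed to be 0 when the filesystem is not zoned. */
    uint64_t zone_size;
};

/* Derive "is zoned" from the one member instead of aliasing through a union. */
static inline bool is_zoned_sketch(const struct fs_info_sketch *fs_info)
{
    assert(fs_info != NULL);
    return fs_info->zone_size > 0;
}
```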
    • btrfs: avoid blocking on space revervation when doing nowait dio writes · d4135134
      Filipe Manana committed
      When doing a NOWAIT direct IO write, if we can NOCOW then it means we can
      proceed with the non-blocking, NOWAIT path. However, reserving the metadata
      space and qgroup meta space can often result in blocking - flushing
      delalloc, waiting for ordered extents to complete, triggering transaction
      commits, etc. - going against the semantics of a NOWAIT write.
      
      So make the NOWAIT write path try to reserve all the metadata it needs
      without blocking - if we get -ENOSPC or -EDQUOT then return -EAGAIN to
      make the caller fall back to a blocking direct IO write.
      
      This is part of a patchset comprised of the following patches:
      
        btrfs: avoid blocking on page locks with nowait dio on compressed range
        btrfs: avoid blocking nowait dio when locking file range
        btrfs: avoid double nocow check when doing nowait dio writes
        btrfs: stop allocating a path when checking if cross reference exists
        btrfs: free path at can_nocow_extent() before checking for checksum items
        btrfs: release path earlier at can_nocow_extent()
        btrfs: avoid blocking when allocating context for nowait dio read/write
        btrfs: avoid blocking on space revervation when doing nowait dio writes
      
      The following test was run before and after applying this patchset:
      
        $ cat io-uring-nodatacow-test.sh
        #!/bin/bash
      
        DEV=/dev/sdc
        MNT=/mnt/sdc
      
        MOUNT_OPTIONS="-o ssd -o nodatacow"
        MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
        NUM_JOBS=4
        FILE_SIZE=8G
        RUN_TIME=300
      
        cat <<EOF > /tmp/fio-job.ini
        [io_uring_rw]
        rw=randrw
        fsync=0
        fallocate=posix
        group_reporting=1
        direct=1
        ioengine=io_uring
        iodepth=64
        bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
        filesize=$FILE_SIZE
        runtime=$RUN_TIME
        time_based
        filename=foobar
        directory=$MNT
        numjobs=$NUM_JOBS
        thread
        EOF
      
        echo performance | \
           tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        fio /tmp/fio-job.ini
      
        umount $MNT
      
      The test was run on a 12-core box with 64G of RAM, using a non-debug kernel
      config (Debian's default config) and a spinning disk.
      
      Result before the patchset:
      
       READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
      WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
      
      Result after the patchset:
      
       READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec
      WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec
      
      That's about +7.2% throughput for reads and +6.9% for writes.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
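The error handling described in d4135134 can be sketched as follows. `nowait_reserve_result()` is a hypothetical helper, not a function from the patch; it only illustrates how the result of a non-blocking reservation attempt would be translated for the caller.

```c
#include <assert.h>
#include <errno.h>

/*
 * Map the result of a non-flushing reservation attempt for a NOWAIT write.
 * -ENOSPC and -EDQUOT mean that making space would require blocking, so the
 * caller is told to retry through the blocking path by returning -EAGAIN.
 * Any other value is passed through unchanged.
 */
static int nowait_reserve_result(int ret)
{
    /* Kernel style: 0 on success or a negative errno on failure. */
    assert(ret <= 0);

    if (ret == -ENOSPC || ret == -EDQUOT)
        return -EAGAIN;
    return ret;
}
```

A caller that sees -EAGAIN would then fall back to a regular, blocking direct IO write, as the commit message describes.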
    • btrfs: stop allocating a path when checking if cross reference exists · 1a89f173
      Filipe Manana committed
      At btrfs_cross_ref_exist() we always allocate a path, but we really don't
      need to because all its callers (only 2) already have an allocated path
      that is not being used when they call btrfs_cross_ref_exist(). So change
      btrfs_cross_ref_exist() to take a path as an argument and update both
      its callers to pass in the unused path they have when they call
      btrfs_cross_ref_exist().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add messages to printk index · b0a66a31
      Jonathan Lassoff committed
      In order for end users to quickly react to new issues that come up in
      production, it is proving useful to leverage this printk indexing
      system. This printk index enables kernel developers to use calls to
      printk() with changeable ad-hoc format strings, while still enabling end
      users to detect changes and develop a semi-stable interface for
      detecting and parsing these messages.
      
      So that detailed Btrfs messages are captured by this printk index, this
      patch wraps btrfs_printk and btrfs_handle_fs_error with macros.
      
      Example of the generated list:
      https://lore.kernel.org/lkml/12588e13d51a9c3bf59467d3fc1ac2162f1275c1.1647539056.git.jof@thejof.com
      Signed-off-by: Jonathan Lassoff <jof@thejof.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add and use helper to assert an inode range is clean · 63c34cb4
      Filipe Manana committed
      We have four different scenarios where we don't expect to find ordered
      extents after locking a file range:
      
      1) During plain fallocate;
      2) During hole punching;
      3) During zero range;
      4) During reflinks (both cloning and deduplication).
      
      This is because in all these cases we follow the pattern:
      
      1) Lock the inode's VFS lock in exclusive mode;
      
      2) Lock the inode's i_mmap_lock in exclusive mode, to serialize with
         mmap writes;
      
      3) Flush delalloc in a file range and wait for all ordered extents
         to complete - both done through btrfs_wait_ordered_range();
      
      4) Lock the file range in the inode's io_tree.
      
      So add a helper that asserts that we don't have ordered extents for a
      given range. Make the four scenarios listed above use this helper after
      locking the respective file range.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move common inode creation code into btrfs_create_new_inode() · caae78e0
      Omar Sandoval committed
      All of our inode creation code paths duplicate the calls to
      btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation
      additionally duplicates property inheritance and the call to
      btrfs_set_inode_index(). Fix this by moving the common code into
      btrfs_create_new_inode(). This accomplishes a few things at once:
      
      1. It reduces code duplication.
      
      2. It allows us to set up the inode completely before inserting the
         inode item, removing calls to btrfs_update_inode().
      
      3. It fixes a leak of an inode on disk in some error cases. For example,
         in btrfs_create(), if btrfs_new_inode() succeeds, then we have
         inserted an inode item and its inode ref. However, if something after
         that fails (e.g., btrfs_init_inode_security()), then we end the
         transaction and then decrement the link count on the inode. If the
         transaction is committed and the system crashes before the failed
         inode is deleted, then we leak that inode on disk. Instead, this
         refactoring aborts the transaction when we can't recover more
         gracefully.
      
      4. It exposes various ways that subvolume creation diverges from mkdir
         in terms of inheriting flags, properties, permissions, and POSIX
         ACLs, a lot of which appears to be accidental. This patch explicitly
         does _not_ change the existing non-standard behavior, but it makes
         those differences more clear in the code and documents them so that
         we can discuss whether they should be changed.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reserve correct number of items for inode creation · 3538d68d
      Omar Sandoval committed
      The various inode creation code paths do not account for the compression
      property, POSIX ACLs, or the parent inode item when starting a
      transaction. Fix it by refactoring all of these code paths to use a new
      function, btrfs_new_inode_prepare(), which computes the correct number
      of items. To do so, it needs to know whether POSIX ACLs will be created,
      so move the ACL creation into that function. To reduce the number of
      arguments that need to be passed around for inode creation, define
      struct btrfs_new_inode_args containing all of the relevant information.
      
      btrfs_new_inode_prepare() will also be a good place to set up the
      fscrypt context and encrypted filename in the future.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate inode outside of btrfs_new_inode() · a1fd0c35
      Omar Sandoval committed
      Instead of calling new_inode() and inode_init_owner() inside of
      btrfs_new_inode(), do it in the callers. This allows us to pass in just
      the inode instead of the mnt_userns and mode and removes the need for
      memalloc_nofs_{save,restore}() since we do it before starting a
      transaction. In create_subvol(), it also means we no longer have to look
      up the inode again to instantiate it. This also paves the way for some
      more cleanups in later patches.
      
      This also removes the comments about Smack checking i_op, which are no
      longer true since commit 5d6c3191 ("xattr: Add
      __vfs_{get,set,remove}xattr helpers"). Now it checks inode->i_opflags &
      IOP_XATTR, which is set based on sb->s_xattr.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce btrfs_for_each_slot iterator macro · 62142be3
      Gabriel Niebler committed
      There is a common pattern when searching for a key in btrfs:
      
      * Call btrfs_search_slot to find the slot for the key
      * Enter an endless loop:
        * If the found slot is larger than the number of items in the current
          leaf, check the next leaf
        * If it's still not found in the next leaf, terminate the loop
        * Otherwise do something with the found key
        * Increment the current slot and continue
      
      To reduce code duplication, we can replace this code pattern with an
      iterator macro, similar to the existing for_each_X macros found
      elsewhere in the kernel.  This also makes the code easier to understand
      for newcomers by putting a name to the encapsulated functionality.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
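The loop pattern listed in 62142be3 can be modelled in userspace roughly like this. It is a sketch under simplifying assumptions: leaves are a singly linked list of small key arrays, and `leaf_sketch`, `cursor_sketch`, `cursor_settle` and `for_each_slot_sketch` are hypothetical names, not the kernel's btrfs_for_each_slot implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified leaf: a few keys plus a link to the next leaf. */
struct leaf_sketch {
    int nritems;
    int keys[4];
    struct leaf_sketch *next;
};

/* Hypothetical cursor standing in for a path: current leaf and slot. */
struct cursor_sketch {
    struct leaf_sketch *leaf;
    int slot;
};

/*
 * Make the cursor point at a valid item, moving on to following leaves when
 * the slot is past the end of the current one. Returns 1 if an item is
 * available and 0 when the iteration is over.
 */
static int cursor_settle(struct cursor_sketch *c)
{
    while (c->leaf && c->slot >= c->leaf->nritems) {
        c->leaf = c->leaf->next;
        c->slot = 0;
    }
    return c->leaf != NULL;
}

/* Iterator macro hiding the "check slot, go to next leaf, advance" loop. */
#define for_each_slot_sketch(c, key)                                    \
    for (; cursor_settle(c) && ((key) = (c)->leaf->keys[(c)->slot], 1); \
         (c)->slot++)

/* Example user: sum all keys from the given starting position onwards. */
static int sum_keys_sketch(struct leaf_sketch *first, int start_slot)
{
    struct cursor_sketch c = { .leaf = first, .slot = start_slot };
    int key, sum = 0;

    assert(start_slot >= 0);
    for_each_slot_sketch(&c, key)
        sum += key;
    return sum;
}
```

The caller's loop body shrinks to the "do something with the found key" step, which is the code duplication the commit message wants to remove.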
  2. 21 April, 2022 1 commit
    • btrfs: zoned: use dedicated lock for data relocation · 5f0addf7
      Naohiro Aota committed
      Currently, we use btrfs_inode_{lock,unlock}() to grant an exclusive
      writeback of the relocation data inode in
      btrfs_zoned_data_reloc_{lock,unlock}(). However, that can cause a deadlock
      in the following path.
      
      Thread A takes btrfs_inode_lock() and waits for metadata reservation by,
      e.g., waiting for writeback:
      
      prealloc_file_extent_cluster()
        - btrfs_inode_lock(&inode->vfs_inode, 0);
        - btrfs_prealloc_file_range()
        ...
          - btrfs_replace_file_extents()
            - btrfs_start_transaction
            ...
              - btrfs_reserve_metadata_bytes()
      
      Thread B (e.g., doing writeback work) needs to wait for the inode lock to
      continue the writeback process:
      
      do_writepages
        - btrfs_writepages
          - extent_writepages
            - btrfs_zoned_data_reloc_lock(BTRFS_I(inode));
              - btrfs_inode_lock()
      
      The deadlock is caused by relying on the vfs_inode's lock. By using it, we
      introduced unnecessary exclusion of writeback and
      btrfs_prealloc_file_range(). Also, the lock at this point is useless as we
      don't have any dirty pages in the inode yet.
      
      Introduce fs_info->zoned_data_reloc_io_lock and use it for the exclusive
      writeback.
      
      Fixes: 35156d85 ("btrfs: zoned: only allow one process to add pages to a relocation inode")
      CC: stable@vger.kernel.org # 5.16.x: 869f4cdc: btrfs: zoned: encapsulate inode locking for zoned relocation
      CC: stable@vger.kernel.org # 5.16.x
      CC: stable@vger.kernel.org # 5.17
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 15 March, 2022 1 commit
  4. 14 March, 2022 10 commits
    • btrfs: pass btrfs_fs_info to btrfs_recover_relocation · 7eefae6b
      Josef Bacik committed
      We don't need a root here, we just need the btrfs_fs_info, we can just
      get the specific roots we need from fs_info.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add filesystems state details to error messages · c067da87
      Sweet Tea Dorminy committed
      When a filesystem goes read-only due to an error, multiple errors tend
      to be reported, some of which are knock-on failures. Logging fs_states
      in btrfs_handle_fs_error() and btrfs_printk() helps distinguish the
      first error from subsequent messages which may only exist due to an
      error state.
      
      Under the new format, most initial errors will look like:
      `BTRFS: error (device loop0) in ...`
      while subsequent errors will begin with:
      `error (device loop0: state E) in ...`
      
      An initial transaction abort error will look like
      `error (device loop0: state A) in ...`
      and subsequent messages will contain
      `(device loop0: state EA) in ...`
      
      In addition to the error states we can also print other states that are
      temporary, like remounting, device replace, or indicate a global state
      that may affect functionality.
      
      Now implemented:
      
      E - filesystem error detected
      A - transaction aborted
      L - log tree errors
      
      M - remounting in progress
      R - device replace in progress
      C - data checksums not verified (mounted with ignoredatacsums)
      Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
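The message prefix format described in c067da87 can be sketched like this. Only the state letters and the `(device loop0: state EA)` shape come from the commit message; the bit positions, the enum and the function `format_device_tag()` are hypothetical and chosen for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bit positions; only the letters come from the commit message. */
enum fs_state_sketch {
    STATE_ERROR         = 1 << 0, /* E - filesystem error detected */
    STATE_TRANS_ABORTED = 1 << 1, /* A - transaction aborted */
    STATE_LOG_ERROR     = 1 << 2, /* L - log tree errors */
    STATE_REMOUNTING    = 1 << 3, /* M - remounting in progress */
    STATE_DEV_REPLACING = 1 << 4, /* R - device replace in progress */
    STATE_NO_CSUMS      = 1 << 5, /* C - data checksums not verified */
};

/* Letter for each bit, in the same order as the enum above. */
static const char state_chars_sketch[] = "EALMRC";

/* Write "(device <dev>)" or "(device <dev>: state <letters>)" into buf. */
static void format_device_tag(char *buf, size_t size, const char *dev,
                              unsigned int states)
{
    char letters[sizeof(state_chars_sketch)];
    size_t n = 0;

    assert(buf != NULL && size > 0 && dev != NULL);
    for (size_t bit = 0; bit < strlen(state_chars_sketch); bit++) {
        if (states & (1u << bit))
            letters[n++] = state_chars_sketch[bit];
    }
    letters[n] = '\0';

    if (n == 0)
        snprintf(buf, size, "(device %s)", dev);
    else
        snprintf(buf, size, "(device %s: state %s)", dev, letters);
}
```

With no state bits set the tag stays in the old short form, so only messages logged after the first error carry the extra letters.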
    • btrfs: add BTRFS_IOC_ENCODED_WRITE · 7c0c7269
      Omar Sandoval committed
      The implementation resembles direct I/O: we have to flush any ordered
      extents, invalidate the page cache, and do the io tree/delalloc/extent
      map/ordered extent dance. From there, we can reuse the compression code
      with a minor modification to distinguish the write from writeback. This
      also creates inline extents when possible.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add BTRFS_IOC_ENCODED_READ ioctl · 1881fba8
      Omar Sandoval committed
      There are 4 main cases:
      
      1. Inline extents: we copy the data straight out of the extent buffer.
      2. Hole/preallocated extents: we fill in zeroes.
      3. Regular, uncompressed extents: we read the sectors we need directly
         from disk.
      4. Regular, compressed extents: we read the entire compressed extent
         from disk and indicate what subset of the decompressed extent is in
         the file.
      
      This initial implementation simplifies a few things that can be improved
      in the future:
      
      - Cases 1, 3, and 4 allocate temporary memory to read into before
        copying out to userspace.
      - We don't do read repair, because it turns out that read repair is
        currently broken for compressed data.
      - We hold the inode lock during the operation.
      
      Note that we don't need to hold the mmap lock. We may race with
      btrfs_page_mkwrite() and read the old data from before the page was
      dirtied:
      
      btrfs_page_mkwrite         btrfs_encoded_read
      ---------------------------------------------------
      (enter)                    (enter)
                                 btrfs_wait_ordered_range
      lock_extent_bits
      btrfs_page_set_dirty
      unlock_extent_cached
      (exit)
                                 lock_extent_bits
                                 read extent (dirty page hasn't been flushed,
                                              so this is the old data)
                                 unlock_extent_cached
                                 (exit)
      
      In that case we read the old data from before the page was dirtied. But
      that's true even if we were to hold the mmap lock:
      
      btrfs_page_mkwrite               btrfs_encoded_read
      -------------------------------------------------------------------
      (enter)                          (enter)
                                       btrfs_inode_lock(BTRFS_ILOCK_MMAP)
      down_read(i_mmap_lock) (blocked)
                                       btrfs_wait_ordered_range
                                       lock_extent_bits
                                       read extent (page hasn't been dirtied,
                                                    so this is the old data)
                                       unlock_extent_cached
                                       btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
      down_read(i_mmap_lock) returns
      lock_extent_bits
      btrfs_page_set_dirty
      unlock_extent_cached
      
      In other words, this is inherently racy, so it's fine that we return the
      old data in this tiny window.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: support different disk extent size for delalloc · 28c9b1e7
      Omar Sandoval committed
      Currently, we always reserve the same extent size in the file and extent
      size on disk for delalloc because the former is the worst case for the
      latter. For BTRFS_IOC_ENCODED_WRITE writes, we know the exact size of
      the extent on disk, which may be less than or greater than (for
      bookends) the size in the file. Add a disk_num_bytes parameter to
      btrfs_delalloc_reserve_metadata() so that we can reserve the correct
      amount of csum bytes. No functional change.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() · e331f6b1
      Omar Sandoval committed
      btrfs_csum_one_bio() loops over each filesystem block in the bio while
      keeping a cursor of its current logical position in the file in order to
      look up the ordered extent to add the checksums to. However, this
      doesn't make much sense for compressed extents, as a sector on disk does
      not correspond to a sector of decompressed file data. It happens to work
      because:
      
      1) the compressed bio always covers one ordered extent
      2) the size of the bio is always less than the size of the ordered
         extent
      
      However, the second point will not always be true for encoded writes.
      
      Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that
      it can assume that the bio only covers one ordered extent. Since we're
      already changing the signature, let's get rid of the contig parameter
      and make it implied by the offset parameter, similar to the change we
      recently made to btrfs_lookup_bio_sums(). Additionally, let's rename
      nr_sectors to blockcount to make it clear that it's the number of
      filesystem blocks, not the number of 512-byte sectors.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace BUILD_BUG_ON by static_assert · a55e65b8
      David Sterba committed
      The static_assert introduced in 6bab69c6 ("build_bug.h: add wrapper
      for _Static_assert") has been supported by compilers for a long time
      (gcc 4.6, clang 3.0) and can be used in header files. We don't need to
      put BUILD_BUG_ON to random functions but rather keep it next to the
      definition.
      
      The exception here is the UAPI header btrfs_tree.h, which could
      potentially be included by userspace code where static_assert is not
      defined (nor used in any other header).
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
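To illustrate the point of a55e65b8 in userspace, the sketch below uses the C11 `_Static_assert` keyword directly, since the commit message describes the kernel's static_assert as a wrapper for it. `disk_key_sketch` is a hypothetical structure, not one of the btrfs on-disk formats.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical on-disk style structure with a fixed, padding-free layout. */
struct disk_key_sketch {
    uint64_t objectid;
    uint64_t offset;
};

/* Build-time check kept next to the definition, usable in a header. */
_Static_assert(sizeof(struct disk_key_sketch) == 16,
               "disk_key_sketch must stay 16 bytes");

/*
 * For contrast: a check that has to live inside a function body, which is
 * where BUILD_BUG_ON style checks had to be placed.
 */
static void check_layout_in_function(void)
{
    assert(sizeof(struct disk_key_sketch) == 16);
}
```

If the structure ever changes size, the file-scope check fails the build at the definition itself rather than in an unrelated function.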
    • btrfs: add support for multiple global roots · f7238e50
      Josef Bacik committed
      With extent tree v2 you will be able to create multiple csum, extent,
      and free space trees.  They will be used based on the block group, which
      will now use the block_group_item->chunk_objectid to point to the set of
      global roots that it will use.  When allocating new block groups we'll
      simply mod the gigabyte offset of the block group against the number of
      global roots we have and that will be the block group's global id.
      
      From there we can take the bytenr that we're modifying in the respective
      tree, look up the block group and get that block group's corresponding
      global root id.  From there we can get to the appropriate global root
      for that bytenr.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
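The assignment rule from f7238e50 ("mod the gigabyte offset of the block group against the number of global roots") could look like this in a userspace sketch. It assumes that "gigabyte" means 2^30 bytes and that the offset is the block group's start address; `global_root_id_sketch()` and `SZ_1G_SKETCH` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption of this sketch: one gigabyte is 2^30 bytes. */
#define SZ_1G_SKETCH (1024ULL * 1024 * 1024)

/*
 * Pick the global root set for a new block group: take the block group's
 * start offset in whole gigabytes and reduce it modulo the number of
 * global roots.
 */
static uint64_t global_root_id_sketch(uint64_t bg_start,
                                      uint64_t nr_global_roots)
{
    assert(nr_global_roots > 0);
    return (bg_start / SZ_1G_SKETCH) % nr_global_roots;
}
```

For example, with 4 global roots a block group starting at 5 GiB would map to global root id 1 under these assumptions.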
    • btrfs: add code to support the block group root · 9c54e80d
      Josef Bacik committed
      This code adds the on disk structures for the block group root, which
      will hold the block group items for extent tree v2.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add definition for EXTENT_TREE_V2 · 2c7d2a23
      Josef Bacik committed
      This adds the initial definition of the EXTENT_TREE_V2 incompat feature
      flag.  This also hides the support behind CONFIG_BTRFS_DEBUG.
      
      THIS IS AN IN-DEVELOPMENT FORMAT CHANGE, DO NOT USE UNLESS YOU ARE A
      DEVELOPER OR A TESTER.
      
      The format is in flux and will be added in stages, any fs will need to
      be re-made between updates to the format.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 02 March, 2022 1 commit
    • btrfs: do not start relocation until in progress drops are done · b4be6aef
      Josef Bacik committed
      We hit a bug with a recovering relocation on mount for one of our file
      systems in production.  I reproduced this locally by injecting errors
      into snapshot delete with balance running at the same time.  This
      showed up as an error while looking up an extent item:
      
        WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680
        CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8
        RIP: 0010:lookup_inline_extent_backref+0x647/0x680
        RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202
        RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
        RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001
        R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000
        R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000
        FS:  0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0
        Call Trace:
         <TASK>
         insert_inline_extent_backref+0x46/0xd0
         __btrfs_inc_extent_ref.isra.0+0x5f/0x200
         ? btrfs_merge_delayed_refs+0x164/0x190
         __btrfs_run_delayed_refs+0x561/0xfa0
         ? btrfs_search_slot+0x7b4/0xb30
         ? btrfs_update_root+0x1a9/0x2c0
         btrfs_run_delayed_refs+0x73/0x1f0
         ? btrfs_update_root+0x1a9/0x2c0
         btrfs_commit_transaction+0x50/0xa50
         ? btrfs_update_reloc_root+0x122/0x220
         prepare_to_merge+0x29f/0x320
         relocate_block_group+0x2b8/0x550
         btrfs_relocate_block_group+0x1a6/0x350
         btrfs_relocate_chunk+0x27/0xe0
         btrfs_balance+0x777/0xe60
         balance_kthread+0x35/0x50
         ? btrfs_balance+0xe60/0xe60
         kthread+0x16b/0x190
         ? set_kthread_struct+0x40/0x40
         ret_from_fork+0x22/0x30
         </TASK>
      
      Normally snapshot deletion and relocation are excluded from running at
      the same time by the fs_info->cleaner_mutex.  However if we had a
      pending balance waiting to get the ->cleaner_mutex, and a snapshot
      deletion was running, and then the box crashed, we would come up in a
      state where we have a half deleted snapshot.
      
      Again, in the normal case the snapshot deletion needs to complete before
      relocation can start, but in this case relocation could very well start
      before the snapshot deletion completes, as we simply add the root to the
      dead roots list and wait for the next time the cleaner runs to clean up
      the snapshot.
      
      Fix this by checking whether any DEAD_ROOT has a pending drop_progress
      key.  If one does, we know we were in the middle of the drop operation,
      so set a flag on the fs_info.  Balance can then wait until this flag is
      cleared before starting up again.
      
      If there are DEAD_ROOTs that don't have a drop_progress set then we're
      safe to start balance right away as we'll be properly protected by the
      cleaner_mutex.
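
      A rough model of that flag logic is sketched below. All class and
      function names are hypothetical stand-ins, not kernel identifiers; the
      boolean drop_in_progress stands in for "has a drop_progress key set".

```python
from dataclasses import dataclass, field


@dataclass
class DeadRoot:
    root_id: int
    drop_in_progress: bool  # models a pending drop_progress key


@dataclass
class FsInfo:
    dead_roots: list = field(default_factory=list)
    unfinished_drops: bool = False  # models the bit set on the fs_info


def add_dead_roots(fs: FsInfo, roots) -> None:
    """Queue dead roots; flag the fs if any drop was interrupted midway."""
    for root in roots:
        fs.dead_roots.append(root)
        if root.drop_in_progress:
            fs.unfinished_drops = True


def cleaner_drop_one(fs: FsInfo) -> None:
    """Finish dropping one dead root; clear the flag once none are mid-drop."""
    if fs.dead_roots:
        fs.dead_roots.pop(0)
    if not any(r.drop_in_progress for r in fs.dead_roots):
        fs.unfinished_drops = False


def balance_may_start(fs: FsInfo) -> bool:
    """Balance has to wait while interrupted drops are still pending."""
    return not fs.unfinished_drops
```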
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b4be6aef
  6. 24 Feb, 2022 1 commit
    • Q
      btrfs: reduce extent threshold for autodefrag · 558732df
      Qu Wenruo committed
      There is a big gap between inode_should_defrag() and the autodefrag
      extent size threshold.  inode_should_defrag() uses a flexible
      @small_write value: 16K for compressed extents and 64K for
      non-compressed extents.
      
      However, the autodefrag extent size threshold is always fixed to the
      default value (256K).
      
      This means the following write sequence will make autodefrag defrag
      ranges which didn't trigger autodefrag:
      
        pwrite 0 8k
        sync
        pwrite 8k 128K
        sync
      
      The latter 128K write will also be considered as a defrag target (if
      other conditions are met), even though only the 8K write actually
      triggered autodefrag.
      
      Such behavior can cause extra IO for autodefrag.
      
      Close the gap, by copying the @small_write value into inode_defrag, so
      that later autodefrag can use the same @small_write value which
      triggered autodefrag.
      
      Together with the existing transid value, this allows autodefrag to scan
      the ranges which really triggered autodefrag.
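
      The effect on the write sequence above can be illustrated with a small
      model. This is a sketch only: the function names are hypothetical, and
      it assumes an extent is a defrag candidate when it is strictly smaller
      than the threshold in use.

```python
KIB = 1024
DEFAULT_EXTENT_THRESH = 256 * KIB  # fixed autodefrag threshold before the change


def small_write_limit(compressed: bool) -> int:
    """The @small_write values quoted in the commit message."""
    return 16 * KIB if compressed else 64 * KIB


def defrag_targets(extent_sizes, extent_thresh):
    """Assumed rule: extents smaller than the threshold are candidates."""
    return [size for size in extent_sizes if size < extent_thresh]


writes = [8 * KIB, 128 * KIB]  # the two pwrite calls, non-compressed

old_targets = defrag_targets(writes, DEFAULT_EXTENT_THRESH)
new_targets = defrag_targets(writes, small_write_limit(compressed=False))
```

      With the fixed 256K threshold both extents qualify; with the copied
      @small_write value only the 8K extent does.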
      
      Although this behavior change mostly reduces the extent_thresh value
      for autodefrag, I believe in the future we should allow users to specify
      the autodefrag extent threshold through mount options, but that's
      another problem to consider in the future.
      
      CC: stable@vger.kernel.org # 5.16+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      558732df
  7. 31 Jan, 2022 1 commit
    • F
      btrfs: skip reserved bytes warning on unmount after log cleanup failure · 40cdc509
      Filipe Manana committed
      After the recent changes made by commit c2e39305 ("btrfs: clear
      extent buffer uptodate when we fail to write it") and its followup fix,
      commit 651740a5 ("btrfs: check WRITE_ERR when trying to read an
      extent buffer"), we can now end up not cleaning up space reservations of
      log tree extent buffers after a transaction abort happens, as well as not
      cleaning up still dirty extent buffers.
      
      This happens because if writeback for a log tree extent buffer failed,
      then we have cleared the bit EXTENT_BUFFER_UPTODATE from the extent buffer
      and we have also set the bit EXTENT_BUFFER_WRITE_ERR on it. Later on,
      when trying to free the log tree with free_log_tree(), which iterates
      over the tree, we can end up getting an -EIO error when trying to read
      a node or a leaf, since read_extent_buffer_pages() returns -EIO if an
      extent buffer does not have EXTENT_BUFFER_UPTODATE set and has the
      EXTENT_BUFFER_WRITE_ERR bit set. Getting that -EIO means that we return
      immediately as we can not iterate over the entire tree.
      
      In that case we never update the reserved space for an extent buffer in
      the respective block group and space_info object.
      
      When this happens we get the following traces when unmounting the fs:
      
      [174957.284509] BTRFS: error (device dm-0) in cleanup_transaction:1913: errno=-5 IO failure
      [174957.286497] BTRFS: error (device dm-0) in free_log_tree:3420: errno=-5 IO failure
      [174957.399379] ------------[ cut here ]------------
      [174957.402497] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:127 btrfs_put_block_group+0x77/0xb0 [btrfs]
      [174957.407523] Modules linked in: btrfs overlay dm_zero (...)
      [174957.424917] CPU: 2 PID: 3206883 Comm: umount Tainted: G        W         5.16.0-rc5-btrfs-next-109 #1
      [174957.426689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [174957.428716] RIP: 0010:btrfs_put_block_group+0x77/0xb0 [btrfs]
      [174957.429717] Code: 21 48 8b bd (...)
      [174957.432867] RSP: 0018:ffffb70d41cffdd0 EFLAGS: 00010206
      [174957.433632] RAX: 0000000000000001 RBX: ffff8b09c3848000 RCX: ffff8b0758edd1c8
      [174957.434689] RDX: 0000000000000001 RSI: ffffffffc0b467e7 RDI: ffff8b0758edd000
      [174957.436068] RBP: ffff8b0758edd000 R08: 0000000000000000 R09: 0000000000000000
      [174957.437114] R10: 0000000000000246 R11: 0000000000000000 R12: ffff8b09c3848148
      [174957.438140] R13: ffff8b09c3848198 R14: ffff8b0758edd188 R15: dead000000000100
      [174957.439317] FS:  00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
      [174957.440402] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [174957.441164] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
      [174957.442117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [174957.443076] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [174957.443948] Call Trace:
      [174957.444264]  <TASK>
      [174957.444538]  btrfs_free_block_groups+0x255/0x3c0 [btrfs]
      [174957.445238]  close_ctree+0x301/0x357 [btrfs]
      [174957.445803]  ? call_rcu+0x16c/0x290
      [174957.446250]  generic_shutdown_super+0x74/0x120
      [174957.446832]  kill_anon_super+0x14/0x30
      [174957.447305]  btrfs_kill_super+0x12/0x20 [btrfs]
      [174957.447890]  deactivate_locked_super+0x31/0xa0
      [174957.448440]  cleanup_mnt+0x147/0x1c0
      [174957.448888]  task_work_run+0x5c/0xa0
      [174957.449336]  exit_to_user_mode_prepare+0x1e5/0x1f0
      [174957.449934]  syscall_exit_to_user_mode+0x16/0x40
      [174957.450512]  do_syscall_64+0x48/0xc0
      [174957.450980]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [174957.451605] RIP: 0033:0x7f328fdc4a97
      [174957.452059] Code: 03 0c 00 f7 (...)
      [174957.454320] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [174957.455262] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
      [174957.456131] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
      [174957.457118] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
      [174957.458005] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
      [174957.459113] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
      [174957.460193]  </TASK>
      [174957.460534] irq event stamp: 0
      [174957.461003] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [174957.461947] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.463147] softirqs last  enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.465116] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [174957.466323] ---[ end trace bc7ee0c490bce3af ]---
      [174957.467282] ------------[ cut here ]------------
      [174957.468184] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:3976 btrfs_free_block_groups+0x330/0x3c0 [btrfs]
      [174957.470066] Modules linked in: btrfs overlay dm_zero (...)
      [174957.483137] CPU: 2 PID: 3206883 Comm: umount Tainted: G        W         5.16.0-rc5-btrfs-next-109 #1
      [174957.484691] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [174957.486853] RIP: 0010:btrfs_free_block_groups+0x330/0x3c0 [btrfs]
      [174957.488050] Code: 00 00 00 ad de (...)
      [174957.491479] RSP: 0018:ffffb70d41cffde0 EFLAGS: 00010206
      [174957.492520] RAX: ffff8b08d79310b0 RBX: ffff8b09c3848000 RCX: 0000000000000000
      [174957.493868] RDX: 0000000000000001 RSI: fffff443055ee600 RDI: ffffffffb1131846
      [174957.495183] RBP: ffff8b08d79310b0 R08: 0000000000000000 R09: 0000000000000000
      [174957.496580] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8b08d7931000
      [174957.498027] R13: ffff8b09c38492b0 R14: dead000000000122 R15: dead000000000100
      [174957.499438] FS:  00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
      [174957.500990] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [174957.502117] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
      [174957.503513] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [174957.504864] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [174957.506167] Call Trace:
      [174957.506654]  <TASK>
      [174957.507047]  close_ctree+0x301/0x357 [btrfs]
      [174957.507867]  ? call_rcu+0x16c/0x290
      [174957.508567]  generic_shutdown_super+0x74/0x120
      [174957.509447]  kill_anon_super+0x14/0x30
      [174957.510194]  btrfs_kill_super+0x12/0x20 [btrfs]
      [174957.511123]  deactivate_locked_super+0x31/0xa0
      [174957.511976]  cleanup_mnt+0x147/0x1c0
      [174957.512610]  task_work_run+0x5c/0xa0
      [174957.513309]  exit_to_user_mode_prepare+0x1e5/0x1f0
      [174957.514231]  syscall_exit_to_user_mode+0x16/0x40
      [174957.515069]  do_syscall_64+0x48/0xc0
      [174957.515718]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [174957.516688] RIP: 0033:0x7f328fdc4a97
      [174957.517413] Code: 03 0c 00 f7 d8 (...)
      [174957.521052] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [174957.522514] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
      [174957.523950] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
      [174957.525375] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
      [174957.526763] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
      [174957.528058] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
      [174957.529404]  </TASK>
      [174957.529843] irq event stamp: 0
      [174957.530256] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [174957.531061] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.532075] softirqs last  enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.533083] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [174957.533865] ---[ end trace bc7ee0c490bce3b0 ]---
      [174957.534452] BTRFS info (device dm-0): space_info 4 has 1070841856 free, is not full
      [174957.535404] BTRFS info (device dm-0): space_info total=1073741824, used=2785280, pinned=0, reserved=49152, may_use=0, readonly=65536 zone_unusable=0
      [174957.537029] BTRFS info (device dm-0): global_block_rsv: size 0 reserved 0
      [174957.537859] BTRFS info (device dm-0): trans_block_rsv: size 0 reserved 0
      [174957.538697] BTRFS info (device dm-0): chunk_block_rsv: size 0 reserved 0
      [174957.539552] BTRFS info (device dm-0): delayed_block_rsv: size 0 reserved 0
      [174957.540403] BTRFS info (device dm-0): delayed_refs_rsv: size 0 reserved 0
      
      This also means that in case we have log tree extent buffers that are
      still dirty, we can end up not cleaning them up in case we find an
      extent buffer with EXTENT_BUFFER_WRITE_ERR set on it, as in that case
      we have no way for iterating over the rest of the tree.
      
      This issue is very often triggered with test cases generic/475 and
      generic/648 from fstests.
      
      The issue could almost be fixed by iterating over the io tree attached to
      each log root which keeps track of the range of allocated extent buffers,
      log_root->dirty_log_pages, however that does not work and has some
      inconveniences:
      
      1) After we sync the log, we clear the range of the extent buffers from
         the io tree, so we can't find them after writeback. We could keep the
         ranges in the io tree, with a separate bit to signal they represent
         extent buffers already written, but that means we need to hold on to
         more memory until the transaction commits.
      
         How much more memory is used depends a lot on whether we are able to
         allocate contiguous extent buffers on disk (and how often) for a log
         tree - if we are able to, then a single extent state record can
         represent multiple extent buffers, otherwise we need multiple extent
         state record structures to track each extent buffer.
         In fact, my earlier approach did that:
      
         https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bbd.1641300250.git.fdmanana@suse.com/
      
         However that can cause a very significant negative impact on
         performance, not only due to the extra memory usage but also because
         we get a larger and deeper dirty_log_pages io tree.
         We got a report that, on beefy machines at least, we can get such a
         performance drop with fsmark for example:
      
         https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9020/
      
      2) We would be doing it only to deal with an unexpected and exceptional
         case, which is basically failure to read an extent buffer from disk
         due to IO failures. On a healthy system we don't expect transaction
         aborts to happen after all;
      
      3) Instead of relying on iterating the log tree or tracking the ranges
         of extent buffers in the dirty_log_pages io tree, using the radix
         tree that tracks extent buffers (fs_info->buffer_radix) to find all
         log tree extent buffers is not reliable either, because after writeback
         of an extent buffer it can be evicted from memory by the release page
         callback of the btree inode (btree_releasepage()).
      
      Since there's no way to properly clean up a log tree without being able
      to read its extent buffers from disk and without using more memory to
      track the logical ranges of the allocated extent buffers, do the
      following:
      
      1) When we fail to clean up a log tree, set a flag that indicates that
         failure;
      
      2) Trigger writeback of all log tree extent buffers that are still dirty,
         and wait for the writeback to complete. This is just to clean up their
         state, page states, page leaks, etc;
      
      3) When unmounting the fs, ignore if the number of bytes reserved in a
         block group and in a space_info is not 0 if, and only if, we failed to
         cleanup a log tree. Also ignore only for metadata block groups and the
         metadata space_info object.
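
      Step 3 amounts to a narrow exemption in the unmount-time check. A
      minimal sketch, with hypothetical names and a simplified space_info:

```python
from dataclasses import dataclass


@dataclass
class SpaceInfo:
    is_metadata: bool
    bytes_reserved: int


def should_warn_on_unmount(space_info: SpaceInfo, log_cleanup_failed: bool) -> bool:
    """Return True if leftover reserved bytes should trigger a warning."""
    if space_info.bytes_reserved == 0:
        return False  # nothing leaked, nothing to report
    if log_cleanup_failed and space_info.is_metadata:
        return False  # expected leftover after a failed log tree cleanup
    return True
```

      Data space_info objects, and metadata ones when no log tree cleanup
      failed, are still checked as before in this model.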
      
      This is far from a perfect solution, but it serves to silence test
      failures such as those from generic/475 and generic/648. Having a
      non-zero value for the reserved bytes counters on unmount after a
      transaction abort is not such a terrible thing, however: it's completely
      harmless and does not affect the filesystem integrity in any way.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      40cdc509
  8. 07 Jan, 2022 5 commits
    • Q
      btrfs: remove reada infrastructure · f26c9238
      Qu Wenruo committed
      Currently there is only one user for btrfs metadata readahead, and
      that's scrub.
      
      But even for that single user, it's not providing the functionality
      scrub needs, as scrub needs reada for the commit root, which the
      current readahead can't provide. (Although it's pretty easy to add such
      a feature).
      
      Despite this, there are some extra problems related to metadata
      readahead:
      
      - Duplicated feature with btrfs_path::reada
      
      - Partly duplicated feature of btrfs_fs_info::buffer_radix
        Btrfs already caches its metadata in buffer_radix, while readahead
        tries to read the tree block no matter if it's already cached.
      
      - Poor layer separation
        Metadata readahead works more or less at the device level.
        This is definitely not the correct layer: metadata lives in the btrfs
        logical address space, so readahead should not deal with devices at all.
      
        This brings extra chances for bugs to sneak in and adds unnecessary
        complexity.
      
      - Dead code
        In the very beginning of scrub.c we have #undef DEBUG, rendering all
        the debug related code useless and untestable.
      
      Thus I propose to remove the metadata readahead mechanism completely.
      
      [BENCHMARK]
      There is a full benchmark for the scrub performance difference using the
      old btrfs_reada_add() and btrfs_path::reada.
      
      For the worst case (no dirty metadata, slow HDD), there could be a 5%
      performance drop for scrub.
      For other cases (even SATA SSD), there is no distinguishable performance
      difference.
      
      The numbers are the reported scrub speed, in MiB/s.
      The resolution is limited by the reported duration, which only has a
      resolution of 1 second.
      
      	Old		New		Diff
      SSD	455.3		466.332		+2.42%
      HDD	103.927 	98.012		-5.69%
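
      The Diff column follows from the Old and New columns as a relative
      change. A short check of that arithmetic (the helper name is
      hypothetical):

```python
def percent_diff(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Reported scrub speeds in MiB/s, taken from the table above.
results = {"SSD": (455.3, 466.332), "HDD": (103.927, 98.012)}

formatted = {name: f"{percent_diff(old, new):+.2f}%"
             for name, (old, new) in results.items()}
```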
      
      Comprehensive test methodology is in the cover letter of the patch.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f26c9238
    • J
      btrfs: move btrfs_truncate_inode_items to inode-item.c · 54f03ab1
      Josef Bacik committed
      This is an inode item related manipulation with a few vfs related
      adjustments.  I'm going to remove the vfs related code from this helper
      and simplify it a lot, but I want those changes to be easily seen via
      git blame, so move this function now and then the simplification work
      can be done.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      54f03ab1
    • J
      btrfs: add an inode-item.h · 26c2c454
      Josef Bacik committed
      We have a few helpers in inode-item.c, and I'm going to make a few
      changes to how we do truncate in the future, so break out these
      definitions into their own header file to trim down ctree.h some and
      make it easier to do the work on truncate in the future.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      26c2c454
    • N
      btrfs: introduce exclusive operation BALANCE_PAUSED state · efc0e69c
      Nikolay Borisov committed
      The current set of exclusive operation states is not sufficient to handle
      all practical use cases. In particular there is a need to be able to add
      a device to a filesystem that has a paused balance. Currently there is no
      way to distinguish between a running and a paused balance. Fix this by
      introducing BTRFS_EXCLOP_BALANCE_PAUSED, which is going to be set on two
      occasions:
      
      1. When a filesystem is mounted with skip_balance and there is an
         unfinished balance, it will now be put into the BALANCE_PAUSED state
         instead of the plain BALANCE state.
      
      2. When a running balance is paused.
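
      The two transitions can be modelled as a tiny state machine. This is an
      illustrative sketch only: the enum and function names are hypothetical,
      and can_add_device() assumes that a follow-up change lets device add
      proceed while balance is paused, which is the motivation stated above
      rather than something this commit itself implements.

```python
from enum import Enum, auto


class ExclOp(Enum):
    NONE = auto()
    BALANCE = auto()
    BALANCE_PAUSED = auto()


def state_after_mount(skip_balance: bool, unfinished_balance: bool) -> ExclOp:
    """Occasion 1: mounting with skip_balance and an unfinished balance."""
    if not unfinished_balance:
        return ExclOp.NONE
    return ExclOp.BALANCE_PAUSED if skip_balance else ExclOp.BALANCE


def pause_balance(state: ExclOp) -> ExclOp:
    """Occasion 2: a running balance is paused."""
    return ExclOp.BALANCE_PAUSED if state is ExclOp.BALANCE else state


def can_add_device(state: ExclOp) -> bool:
    """Assumed policy: only a running balance blocks adding a device."""
    return state in (ExclOp.NONE, ExclOp.BALANCE_PAUSED)
```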
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      efc0e69c
    • F
      btrfs: make send work with concurrent block group relocation · d96b3424
      Filipe Manana committed
      We don't allow send and balance/relocation to run in parallel in order
      to prevent send failing or silently producing some bad stream. This is
      because while send is using an extent (especially metadata) or about to
      read a metadata extent and expecting it belongs to a specific parent
      node, relocation can run, the transaction used for the relocation is
      committed and the extent gets reallocated while send is still using the
      extent, so it ends up with a different content than expected. This can
      result in just failing to read a metadata extent due to failure of the
      validation checks (parent transid, level, etc), failure to find a
      backreference for a data extent, and other unexpected failures. Besides
      reallocation, there's also a similar problem of an extent getting
      discarded when it's unpinned after the transaction used for block group
      relocation is committed.
      
      The restriction between balance and send was added in commit 9e967495
      ("Btrfs: prevent send failures and crashes due to concurrent relocation"),
      kernel 5.3, while the more general restriction between send and relocation
      was added in commit 1cea5cf0 ("btrfs: ensure relocation never runs
      while we have send operations running"), kernel 5.14.
      
      Both send and relocation can be very long running operations. Relocation
      because it has to do a lot of IO and expensive backreference lookups in
      case there are many snapshots, and send due to read IO when operating on
      very large trees. This makes it inconvenient for users and tools to deal
      with scheduling both operations.
      
      For zoned filesystems we also have automatic block group relocation, so
      send can fail with -EAGAIN when users least expect it or send can end up
      delaying the block group relocation for too long. In the future we might
      also get the automatic block group relocation for non zoned filesystems.
      
      This change makes it possible for send and relocation to run in parallel.
      This is achieved in the following way:
      
      1) For all tree searches, send acquires a read lock on the commit root
         semaphore;
      
      2) After each tree search, and before releasing the commit root semaphore,
         the leaf is cloned and placed in the search path (struct btrfs_path);
      
      3) After releasing the commit root semaphore, the changed_cb() callback
         is invoked, which operates on the leaf and writes commands to the pipe
         (or file in case send/receive is not used with a pipe). It's important
         here to not hold a lock on the commit root semaphore, because if we did
         we could deadlock when sending and receiving to the same filesystem
         using a pipe - the send task blocks on the pipe because it's full, the
         receive task, which is the only consumer of the pipe, triggers a
         transaction commit when attempting to create a subvolume or reserve
         space for a write operation for example, but the transaction commit
         blocks trying to write lock the commit root semaphore, resulting in a
         deadlock;
      
      4) Before moving to the next key, or advancing to the next change in case
         of an incremental send, check if a transaction used for relocation was
         committed (or is about to finish its commit). If so, release the search
         path(s) and restart the search, to where we were before, so that we
         don't operate on stale extent buffers. The search restarts are always
         possible because both the send and parent roots are RO, and no one can
         add, remove or update keys (change their offset) in RO trees - the
         only exception is deduplication, but that is still not allowed to run
         in parallel with send;
      
      5) Periodically check if there is contention on the commit root semaphore,
         which means there is a transaction commit trying to write lock it, and
         release the semaphore and reschedule if there is contention, so as to
         avoid causing any significant delays to transaction commits.
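
      Steps 1 to 4 can be sketched as a toy model. Everything here is a
      hypothetical stand-in: a sorted list of integer keys plays the read-only
      tree, a plain lock plays the commit root semaphore, and a counter plays
      "a relocation transaction was committed". Step 5 (yielding under
      contention) is left out.

```python
import threading


class ReadOnlyTree:
    def __init__(self, keys):
        self.keys = sorted(keys)                 # RO tree: keys never change
        self.commit_root_sem = threading.Lock()  # stand-in for the semaphore
        self.last_reloc_trans = 0                # bumped by relocation commits


def search_from(tree, first_key):
    """Steps 1-2: search under the lock and return a private copy (clone)."""
    with tree.commit_root_sem:
        leaf = [k for k in tree.keys if k >= first_key]
        return leaf, tree.last_reloc_trans


def send_all(tree, changed_cb, after_each=None):
    """Process every key once; restart the search after a relocation commit."""
    restarts = 0
    leaf, seen_reloc = search_from(tree, 0)
    i = 0
    while i < len(leaf):
        key = leaf[i]
        changed_cb(key)              # step 3: lock is not held here
        if after_each is not None:
            after_each(tree, key)    # hook used below to fake a relocation
        i += 1
        if tree.last_reloc_trans != seen_reloc:
            # Step 4: drop the stale copy and resume just past this key.
            leaf, seen_reloc = search_from(tree, key + 1)
            i = 0
            restarts += 1
    return restarts


sent = []


def fake_relocation(tree, key):
    if key == 2:
        tree.last_reloc_trans += 1   # pretend a relocation commit happened


tree = ReadOnlyTree([1, 2, 3, 4, 5])
restarts = send_all(tree, sent.append, fake_relocation)
```

      Because the tree is read-only, resuming from the key after the last one
      processed neither skips nor repeats any key in this model.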
      
      This leaves some room for optimizations for send to have fewer path
      releases and less re-searching of the trees when there's relocation
      running, but for now it's kept simple as it performs quite well (on very
      large trees with resulting send streams in the order of a few hundred
      gigabytes).
      
      Test case btrfs/187, from fstests, stresses relocation, send and
      deduplication attempting to run in parallel, but without verifying if send
      succeeds and if it produces correct streams. A new test case will be added
      that exercises relocation happening in parallel with send and then checks
      that send succeeds and the resulting streams are correct.
      
      A final note is that for now this still leaves the mutual exclusion
      between send operations and deduplication on files belonging to a root
      used by send operations. A solution for that will be slightly more complex
      but it will eventually be built on top of this change.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d96b3424
  9. 03 Jan, 2022 4 commits