1. 24 3月, 2020 40 次提交
    • J
      btrfs: track reloc roots based on their commit root bytenr · ea287ab1
      Josef Bacik 提交于
      We always search the commit root of the extent tree for looking up back
      references, however we track the reloc roots based on their current
      bytenr.
      
      This is wrong, if we commit the transaction between relocating tree
      blocks we could end up in this code in build_backref_tree
      
        if (key.objectid == key.offset) {
      	  /*
      	   * Only root blocks of reloc trees use backref
      	   * pointing to itself.
      	   */
      	  root = find_reloc_root(rc, cur->bytenr);
      	  ASSERT(root);
      	  cur->root = root;
      	  break;
        }
      
      find_reloc_root() is looking based on the bytenr we had in the commit
      root, but if we've COWed this reloc root we will not find that bytenr,
      and we will trip over the ASSERT(root).
      
      Fix this by using the commit_root->start bytenr for indexing the commit
      root.  Then we change the __update_reloc_root() caller to be used when
      we switch the commit root for the reloc root during commit.
      
      This fixes the panic I was seeing when we started throttling relocation
      for delayed refs.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ea287ab1
    • J
      btrfs: restart relocate_tree_blocks properly · 50dbbb71
      Josef Bacik 提交于
      There are two bugs here, but fixing them independently would just result
      in pain if you happened to bisect between the two patches.
      
      First is how we handle the -EAGAIN from relocate_tree_block().  We don't
      set error, unless we happen to be the first node, which makes no sense,
      I have no idea what the code was trying to accomplish here.
      
      We in fact _do_ want err set here so that we know we need to restart in
      relocate_block_group().  Also we need finish_pending_nodes() to not
      actually call link_to_upper(), because we didn't actually relocate the
      block.
      
      And then if we do get -EAGAIN we do not want to set our backref cache
      last_trans to the one before ours.  This would force us to update our
      backref cache if we didn't cross transaction ids, which would mean we'd
      have some nodes updated to their new_bytenr, but still able to find
      their old bytenr because we're searching the same commit root as the
      last time we went through relocate_tree_blocks.
      
      Fixing these two things keeps us from panicing when we start breaking
      out of relocate_tree_blocks() either for delayed ref flushing or enospc.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      50dbbb71
    • J
      btrfs: reloc: reorder reservation before root selection · 5f6b2e5c
      Josef Bacik 提交于
      Since we're not only checking for metadata reservations but also if we
      need to throttle our delayed ref generation, reorder
      reserve_metadata_space() above the select_one_root() call in
      relocate_tree_block().
      
      The reason we want this is because select_reloc_root() will mess with
      the backref cache, and if we're going to bail we want to be able to
      cleanly remove this node from the backref cache and come back along to
      regenerate it.  Move it up so this is the first thing we do to make
      restarting cleaner.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5f6b2e5c
    • J
      btrfs: do not readahead in build_backref_tree · d7ff00f6
      Josef Bacik 提交于
      Here we are just searching down to the bytenr we're building the backref
      tree for, and all of it's paths to the roots.  These bytenrs are not
      guaranteed to be anywhere near each other, so readahead just generates
      extra latency.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d7ff00f6
    • J
      btrfs: do not use readahead for running delayed refs · cd22a51c
      Josef Bacik 提交于
      Readahead will generate a lot of extra reads for adjacent nodes, but
      when running delayed refs we have no idea if the next ref is going to be
      adjacent or not, so this potentially just generates a lot of extra IO.
      To make matters worse each ref is truly just looking for one item, it
      doesn't generally search forward, so we simply don't need it here.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cd22a51c
    • N
      btrfs: Remove async_transid from btrfs_mksubvol/create_subvol/create_snapshot · 9babda9f
      Nikolay Borisov 提交于
      With BTRFS_SUBVOL_CREATE_ASYNC support remove it's no longer required to
      pass the async_transid parameter so remove it and any code using it.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9babda9f
    • N
      btrfs: Remove transid argument from btrfs_ioctl_snap_create_transid · 5d54c67e
      Nikolay Borisov 提交于
      btrfs_ioctl_snap_create_transid no longer takes a transid argument, so
      remove it and rename the function to __btrfs_ioctl_snap_create to
      reflect it's an internal, worker function.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5d54c67e
    • N
      btrfs: Remove BTRFS_SUBVOL_CREATE_ASYNC support · 9c1036fd
      Nikolay Borisov 提交于
      This functionality was deprecated in kernel 5.4. Since no one has
      complained of the impending removal it's time we did so.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add comment ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9c1036fd
    • J
      btrfs: kill the subvol_srcu · c75e8394
      Josef Bacik 提交于
      Now that we have proper root ref counting everywhere we can kill the
      subvol_srcu.
      
      * removal of fs_info::subvol_srcu reduces size of fs_info by 1176 bytes
      
      * the refcount_t used for the references checks for accidental 0->1
        in cases where the root lifetime would not be properly protected
      
      * there's a leak detector for roots to catch unfreed roots at umount
        time
      
      * SRCU served us well over the years but is was not a proper
        synchronization mechanism for some cases
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c75e8394
    • J
      btrfs: make btrfs_cleanup_fs_roots use the radix tree lock · efc34534
      Josef Bacik 提交于
      The radix root is primarily protected by the fs_roots_radix_lock, so use
      that to lookup and get a ref on all of our fs roots in
      btrfs_cleanup_fs_roots. The tree reference is taken in the protected
      section as before.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      efc34534
    • J
      btrfs: don't take an extra root ref at allocation time · 4785e24f
      Josef Bacik 提交于
      Now that all the users of roots take references for them we can drop the
      extra root ref we've been taking.  Before we had roots at 2 refs for the
      life of the file system, one for the radix tree, and one simply for
      existing.  Now that we have proper ref accounting in all places that use
      roots we can drop this extra ref simply for existing as we no longer
      need it.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4785e24f
    • J
      btrfs: hold a ref on the root on the dead roots list · dc9492c1
      Josef Bacik 提交于
      At the point we add a root to the dead roots list we have no open inodes
      for that root, so we need to hold a ref on that root to keep it from
      disappearing.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dc9492c1
    • J
      btrfs: make inodes hold a ref on their roots · 5c8fd99f
      Josef Bacik 提交于
      If we make sure all the inodes have refs on their root we don't have to
      worry about the root disappearing while we have open inodes.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5c8fd99f
    • J
      btrfs: move the root freeing stuff into btrfs_put_root · 8c38938c
      Josef Bacik 提交于
      There are a few different ways to free roots, either you allocated them
      yourself and you just do
      
      free_extent_buffer(root->node);
      free_extent_buffer(root->commit_node);
      btrfs_put_root(root);
      
      Which is the pattern for log roots.  Or for snapshots/subvolumes that
      are being dropped you simply call btrfs_free_fs_root() which does all
      the cleanup for you.
      
      Unify this all into btrfs_put_root(), so that we don't free up things
      associated with the root until the last reference is dropped.  This
      makes the root freeing code much more significant.
      
      The only caveat is at close_ctree() time we have to free the extent
      buffers for all of our main roots (extent_root, chunk_root, etc) because
      we have to drop the btree_inode and we'll run into issues if we hold
      onto those nodes until ->kill_sb() time.  This will be addressed in the
      future when we kill the btree_inode.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8c38938c
    • J
      btrfs: move ino_cache_inode dropping out of btrfs_free_fs_root · 0e996e7f
      Josef Bacik 提交于
      We are going to make root life be controlled soley by refcounting, and
      inodes will be one of the things that hold a ref on the root.  This
      means we need to handle dropping the ino_cache_inode outside of the root
      freeing logic, so move it into btrfs_drop_and_free_fs_root() so it is
      cleaned up properly on unmount.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0e996e7f
    • J
      btrfs: make the extent buffer leak check per fs info · 3fd63727
      Josef Bacik 提交于
      I'm going to make the entire destruction of btrfs_root's controlled by
      their refcount, so it will be helpful to notice if we're leaking their
      eb's on umount.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3fd63727
    • J
      btrfs: remove a BUG_ON() from merge_reloc_roots() · 7b7b7431
      Josef Bacik 提交于
      This was pretty subtle, we default to reloc roots having 0 root refs, so
      if we crash in the middle of the relocation they can just be deleted.
      If we successfully complete the relocation operations we'll set our root
      refs to 1 in prepare_to_merge() and then go on to merge_reloc_roots().
      
      At prepare_to_merge() time if any of the reloc roots have a 0 reference
      still, we will remove that reloc root from our reloc root rb tree, and
      then clean it up later.
      
      However this only happens if we successfully start a transaction.  If
      we've aborted previously we will skip this step completely, and only
      have reloc roots with a reference count of 0, but were never properly
      removed from the reloc control's rb tree.
      
      This isn't a problem per-se, our references are held by the list the
      reloc roots are on, and by the original root the reloc root belongs to.
      If we end up in this situation all the reloc roots will be added to the
      dirty_reloc_list, and then properly dropped at that point.  The reloc
      control will be free'd and the rb tree is no longer used.
      
      There were two options when fixing this, one was to remove the BUG_ON(),
      the other was to make prepare_to_merge() handle the case where we
      couldn't start a trans handle.
      
      IMO this is the cleaner solution.  I started with handling the error in
      prepare_to_merge(), but it turned out super ugly.  And in the end this
      BUG_ON() simply doesn't matter, the cleanup was happening properly, we
      were just panicing because this BUG_ON() only matters in the success
      case.  So I've opted to just remove it and add a comment where it was.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7b7b7431
    • J
      btrfs: hold a ref on the root->reloc_root · f44deb74
      Josef Bacik 提交于
      We previously were relying on root->reloc_root to be cleaned up by the
      drop snapshot, or the error handling.  However if btrfs_drop_snapshot()
      failed it wouldn't drop the ref for the root.  Also we sort of depend on
      the right thing to happen with moving reloc roots between lists and the
      fs root they belong to, which makes it hard to figure out who owns the
      reference.
      
      Fix this by explicitly holding a reference on the reloc root for
      roo->reloc_root.  This means that we hold two references on reloc roots,
      one for whichever reloc_roots list it's attached to, and the
      root->reloc_root we're on.
      
      This makes it easier to reason out who owns a reference on the root, and
      when it needs to be dropped.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f44deb74
    • J
      btrfs: clear DEAD_RELOC_TREE before dropping the reloc root · f28de8d8
      Josef Bacik 提交于
      The DEAD_RELOC_TREE flag is in place in order to avoid a use after free
      in init_reloc_root, tracking the presence of reloc_root.  However adding
      the explicit tree references in previous patches makes the use after
      free impossible because at this point we no longer have a reloc_control
      set on the fs_info and thus cannot enter the function.
      
      So move this to be coupled with clearing the root->reloc_root so we're
      consistent with all other operations of the reloc root.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f28de8d8
    • J
      btrfs: free the reloc_control in a consistent way · 1a0afa0e
      Josef Bacik 提交于
      If we have an error while processing the reloc roots we could leak roots
      that were added to rc->reloc_roots before we hit the error.  We could
      have also not removed the reloc tree mapping from our rb_tree, so clean
      up any remaining nodes in the reloc root rb_tree.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ use rbtree_postorder_for_each_entry_safe ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1a0afa0e
    • J
      btrfs: do not init a reloc root if we aren't relocating · 2abc726a
      Josef Bacik 提交于
      We previously were checking if the root had a dead root before accessing
      root->reloc_root in order to avoid a use-after-free type bug.  However
      this scenario happens after we've unset the reloc control, so we would
      have been saved if we'd simply checked for fs_info->reloc_control.  At
      this point during relocation we no longer need to be creating new reloc
      roots, so simply move this check above the reloc_root checks to avoid
      any future races and confusion.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2abc726a
    • J
      btrfs: reloc: clean dirty subvols if we fail to start a transaction · 6217b0fa
      Josef Bacik 提交于
      If we do merge_reloc_roots() we could insert a few roots onto the dirty
      subvol roots list, where we hold a ref on them.  If we fail to start the
      transaction we need to run clean_dirty_subvols() in order to cleanup the
      refs.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6217b0fa
    • J
      btrfs: unset reloc control if we fail to recover · fb2d83ee
      Josef Bacik 提交于
      If we fail to load an fs root, or fail to start a transaction we can
      bail without unsetting the reloc control, which leads to problems later
      when we free the reloc control but still have it attached to the file
      system.
      
      In the normal path we'll end up calling unset_reloc_control() twice, but
      all it does is set fs_info->reloc_control = NULL, and we can only have
      one balance at a time so it's not racey.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fb2d83ee
    • J
      btrfs: drop block from cache on error in relocation · 8e19c973
      Josef Bacik 提交于
      If we have an error while building the backref tree in relocation we'll
      process all the pending edges and then free the node.  However if we
      integrated some edges into the cache we'll lose our link to those edges
      by simply freeing this node, which means we'll leak memory and
      references to any roots that we've found.
      
      Instead we need to use remove_backref_node(), which walks through all of
      the edges that are still linked to this node and free's them up and
      drops any root references we may be holding.
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8e19c973
    • Q
      btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves · 19b546d7
      Qu Wenruo 提交于
      In relocation, we need to locate all parent tree leaves referring to one
      data extent, thus we have a complex mechanism to iterate throught extent
      tree and subvolume trees to locate the related leaves.
      
      However this is already done in backref.c, we have
      btrfs_find_all_leafs(), which can return a ulist containing all leaves
      referring to that data extent.
      
      Use btrfs_find_all_leafs() to replace find_data_references().
      
      There is a special handling for v1 space cache data extents, where we
      need to delete the v1 space cache data extents, to avoid those data
      extents to hang the data relocation.
      
      In this patch, the special handling is done by re-iterating the root
      tree leaf.  Although it's a little less efficient than the old handling,
      considering we can reuse a lot of code, it should be acceptable.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      19b546d7
    • J
      btrfs: fix ref-verify to catch operations on 0 ref extents · b39c8f5a
      Josef Bacik 提交于
      While debugging I noticed I wasn't getting ref verify errors before
      everything blew up.  Turns out it's because we don't warn when we try to
      add a normal ref via btrfs_inc_ref() if the block entry exists but has 0
      references.  This is incorrect, we should never be doing anything other
      than adding a new extent once a block entry drops to 0 references.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b39c8f5a
    • F
      btrfs: make ranged full fsyncs more efficient · 0a8068a3
      Filipe Manana 提交于
      Commit 0c713cba ("Btrfs: fix race between ranged fsync and writeback
      of adjacent ranges") fixed a bug where we could end up with file extent
      items in a log tree that represent file ranges that overlap due to a race
      between the hole detection of a ranged full fsync and writeback for a
      different file range.
      
      The problem was solved by forcing any ranged full fsync to become a
      non-ranged full fsync - setting the range start to 0 and the end offset to
      LLONG_MAX. This was a simple solution because the code that detected and
      marked holes was very complex, it used to be done at copy_items() and
      implied several searches on the fs/subvolume tree. The drawback of that
      solution was that we started to flush delalloc for the entire file and
      wait for all the ordered extents to complete for ranged full fsyncs
      (including ordered extents covering ranges completely outside the given
      range). Fortunatelly ranged full fsyncs are not the most common case
      (hopefully for most workloads).
      
      However a later fix for detecting and marking holes was made by commit
      0e56315c ("Btrfs: fix missing hole after hole punching and fsync
      when using NO_HOLES") and it simplified a lot the detection of holes,
      and now copy_items() no longer does it and we do it in a much more simple
      way at btrfs_log_holes().
      
      This makes it now possible to simply make the code that detects holes to
      operate only on the initial range and no longer need to operate on the
      whole file, while also avoiding the need to flush delalloc for the entire
      file and wait for ordered extents that cover ranges that don't overlap the
      given range.
      
      Another special care is that we must skip file extent items that fall
      entirely outside the fsync range when copying inode items from the
      fs/subvolume tree into the log tree - this is to avoid races with ordered
      extent completion for extents falling outside the fsync range, which could
      cause us to end up with file extent items in the log tree that have
      overlapping ranges - for example if the fsync range is [1Mb, 2Mb], when
      we copy inode items we could copy an extent item for the range [0, 512K],
      then release the search path and before moving to the next leaf, an
      ordered extent for a range of [256Kb, 512Kb] completes - this would
      cause us to copy the new extent item for range [256Kb, 512Kb] into the
      log tree after we have copied one for the range [0, 512Kb] - the extents
      overlap, resulting in a corruption.
      
      So this change just does these steps:
      
      1) When the NO_HOLES feature is enabled it leaves the initial range
         intact - no longer sets it to [0, LLONG_MAX] when the full sync bit
         is set in the inode. If NO_HOLES is not enabled, always set the range
         to a full, just like before this change, to avoid missing file extent
         items representing holes after replaying the log (for both full and
         fast fsyncs);
      
      2) Make the hole detection code to operate only on the fsync range;
      
      3) Make the code that copies items from the fs/subvolume tree to skip
         copying file extent items that cover a range completely outside the
         range of the fsync.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0a8068a3
    • F
      btrfs: factor out inode items copy loop from btrfs_log_inode() · da447009
      Filipe Manana 提交于
      The function btrfs_log_inode() is quite large and so is its loop which
      iterates the inode items from the fs/subvolume tree and copies them into
      a log tree. Because this is a large loop inside a very large function
      and because an upcoming patch in this series needs to add some more logic
      inside that loop, move the loop into a helper function to make it a bit
      more manageable.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      da447009
    • F
      btrfs: add helper to get the end offset of a file extent item · a5eeb3d1
      Filipe Manana 提交于
      Getting the end offset for a file extent item requires a bit of code since
      the extent can be either inline or regular/prealloc. There are some places
      all over the code base that open code this logic and in another patch
      later in this series it will be needed again. Therefore encapsulate this
      logic in a helper function and use it.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a5eeb3d1
    • F
      btrfs: fix missing file extent item for hole after ranged fsync · 95418ed1
      Filipe Manana 提交于
      When doing a fast fsync for a range that starts at an offset greater than
      zero, we can end up with a log that when replayed causes the respective
      inode miss a file extent item representing a hole if we are not using the
      NO_HOLES feature. This is because for fast fsyncs we don't log any extents
      that cover a range different from the one requested in the fsync.
      
      Example scenario to trigger it:
      
        $ mkfs.btrfs -O ^no-holes -f /dev/sdd
        $ mount /dev/sdd /mnt
      
        # Create a file with a single 256K and fsync it to clear to full sync
        # bit in the inode - we want the msync below to trigger a fast fsync.
        $ xfs_io -f -c "pwrite -S 0xab 0 256K" -c "fsync" /mnt/foo
      
        # Force a transaction commit and wipe out the log tree.
        $ sync
      
        # Dirty 768K of data, increasing the file size to 1Mb, and flush only
        # the range from 256K to 512K without updating the log tree
        # (sync_file_range() does not trigger fsync, it only starts writeback
        # and waits for it to finish).
      
        $ xfs_io -c "pwrite -S 0xcd 256K 768K" /mnt/foo
        $ xfs_io -c "sync_range -abw 256K 256K" /mnt/foo
      
        # Now dirty the range from 768K to 1M again and sync that range.
        $ xfs_io -c "mmap -w 768K 256K"        \
                 -c "mwrite -S 0xef 768K 256K" \
                 -c "msync -s 768K 256K"       \
                 -c "munmap"                   \
                 /mnt/foo
      
        <power fail>
      
        # Mount to replay the log.
        $ mount /dev/sdd /mnt
        $ umount /mnt
      
        $ btrfs check /dev/sdd
        Opening filesystem to check...
        Checking filesystem on /dev/sdd
        UUID: 482fb574-b288-478e-a190-a9c44a78fca6
        [1/7] checking root items
        [2/7] checking extents
        [3/7] checking free space cache
        [4/7] checking fs roots
        root 5 inode 257 errors 100, file extent discount
        Found file extent holes:
             start: 262144, len: 524288
        ERROR: errors found in fs roots
        found 720896 bytes used, error(s) found
        total csum bytes: 512
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 123514
        file data blocks allocated: 589824
          referenced 589824
      
      Fix this issue by setting the range to full (0 to LLONG_MAX) when the
      NO_HOLES feature is not enabled. This results in extra work being done
      but it gives the guarantee we don't end up with missing holes after
      replaying the log.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      95418ed1
    • N
      btrfs: account ticket size at add/delete time · db161806
      Nikolay Borisov 提交于
      Instead of iterating all pending tickets on the normal/priority list to
      sum their total size the cost can be amortized across ticket addition/
      removal. This turns O(n) + O(m) (where n is the size of the normal list
      and m of the priority list) into O(1). This will mostly have effect in
      workloads that experience heavy flushing.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      db161806
    • R
      btrfs: implement migratepage callback for data pages · f8e66081
      Roman Gushchin 提交于
      Currently btrfs doesn't provide a migratepage callback for data pages.
      It means that fallback_migrate_page() is used to migrate btrfs pages.
      
      fallback_migrate_page() cannot move dirty pages, instead it tries to
      flush them (in sync mode) or just fails (in async mode).
      
      In the sync mode pages which are scheduled to be processed by
      btrfs_writepage_fixup_worker() can't be effectively flushed by the
      migration code, because there is no established way to wait for the
      completion of the delayed work.
      
      It all leads to page migration failures.
      
      To fix it the patch implements a btrs-specific migratepage callback,
      which is similar to iomap_migrate_page() used by some other fs, except
      it does take care of the PagePrivate2 flag which is used for data
      ordering purposes.
      Reviewed-by: NChris Mason <clm@fb.com>
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f8e66081
    • N
      btrfs: Remove block_rsv parameter from btrfs_drop_snapshot · 0078a9f9
      Nikolay Borisov 提交于
      It's no longer used following 30d40577 ("btrfs: reloc: Also queue
      orphan reloc tree for cleanup to avoid BUG_ON()"), so just remove it.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0078a9f9
    • N
      btrfs: Remove __ prefix from btrfs_block_rsv_release · 63f018be
      Nikolay Borisov 提交于
      Currently the non-prefixed version is a simple wrapper used to hide
      the 4th argument of the prefixed version. This doesn't bring much value
      in practice and only makes the code harder to follow by adding another
      level of indirection. Rectify this by removing the __ prefix and
      have only one public function to release bytes from a block reservation.
      No semantic changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      63f018be
    • Q
      btrfs: relocation: Check cancel request after each extent found · f31ea088
      Qu Wenruo 提交于
      When relocating data block groups with tons of small extents, or large
      metadata block groups, there can be over 200,000 extents.
      
      We will iterate all extents of such block group in relocate_block_group(),
      where iteration itself can be kinda time-consuming.
      
      So when user want to cancel the balance, the extent iteration loop can
      be another target.
      
      This patch will add the cancelling check in the extent iteration loop of
      relocate_block_group() to make balance cancelling faster.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f31ea088
    • Q
      btrfs: relocation: Check cancel request after each data page read · 7f913c7c
      Qu Wenruo 提交于
      When relocating a data extents with large large data extents, we spend
      most of our time in relocate_file_extent_cluster() at stage "moving data
      extents":
      
       1)               |  btrfs_relocate_block_group [btrfs]() {
       1)               |    relocate_file_extent_cluster [btrfs]() {
       1) $ 6586769 us  |    }
       1) + 18.260 us   |    relocate_file_extent_cluster [btrfs]();
       1) + 15.770 us   |    relocate_file_extent_cluster [btrfs]();
       1) $ 8916340 us  |  }
       1)               |  btrfs_relocate_block_group [btrfs]() {
       1)               |    relocate_file_extent_cluster [btrfs]() {
       1) $ 11611586 us |    }
       1) + 16.930 us   |    relocate_file_extent_cluster [btrfs]();
       1) + 15.870 us   |    relocate_file_extent_cluster [btrfs]();
       1) $ 14986130 us |  }
      
      To make data relocation cancelling quicker, add extra balance cancelling
      check after each page read in relocate_file_extent_cluster().
      
      Cleanup and error handling uses the same mechanism as if the whole
      process finished
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7f913c7c
    • Q
      btrfs: relocation: add error injection points for cancelling balance · 726a3421
      Qu Wenruo 提交于
      Introduce a new error injection point, should_cancel_balance().
      
      It's just a wrapper of atomic_read(&fs_info->balance_cancel_req), but
      allows us to override the return value.
      
      Currently there are only one locations using this function:
      
      - btrfs_balance()
        It checks cancel before each block group.
      
      There are other locations checking fs_info->balance_cancel_req, but they
      are not used as an indicator to exit, so there is no need to use the
      wrapper.
      
      But there will be more locations coming, and some locations can cause
      kernel panic if not handled properly.  So introduce this error injection
      to provide better test interface.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      726a3421
    • F
      Btrfs: implement full reflink support for inline extents · 05a5a762
      Filipe Manana 提交于
      There are a few cases where we don't allow cloning an inline extent into
      the destination inode, returning -EOPNOTSUPP to user space. This was done
      to prevent several types of file corruption and because it's not very
      straightforward to deal with these cases, as they can't rely on simply
      copying the inline extent between leaves. Such cases require copying the
      inline extent's data into the respective page of the destination inode.
      
      Not supporting these cases makes it harder and more cumbersome to write
      applications/libraries that work on any filesystem with reflink support,
      since all these cases for which btrfs fails with -EOPNOTSUPP work just
      fine on xfs for example. These unsupported cases are also not documented
      anywhere and explaining which exact cases fail require a bit of too
      technical understanding of btrfs's internal (inline extents and when and
      where can they exist in a file), so it's not really user friendly.
      
      Also some test cases from fstests that use fsx, such as generic/522 for
      example, can sporadically fail because they trigger one of these cases,
      and fsx expects all operations to succeed.
      
      This change adds supports for cloning all these cases by copying the
      inline extent's data into the respective page of the destination inode.
      
      With this change test case btrfs/112 from fstests fails because it
      expects some clone operations to fail, so it will be updated. Also a
      new test case that exercises all these previously unsupported cases
      will be added to fstests.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      05a5a762
    • F
      Btrfs: simplify inline extent handling when doing reflinks · a61e1e0d
      Filipe Manana 提交于
      We can not reflink parts of an inline extent, we must always reflink the
      whole inline extent. We know that inline extents always start at file
      offset 0 and that can never represent an amount of data larger then the
      filesystem's sector size (both compressed and uncompressed). We also have
      had the constraints that reflink operations must have a start offset that
      is aligned to the sector size and an end offset that is also aligned or
      it ends the inode's i_size, so there's no way for user space to be able
      to do a reflink operation that will refer to only a part of an inline
      extent.
      
      Initially there was a bug in the inlining code that could allow compressed
      inline extents that encoded more than 1 page, but that was fixed in 2008
      by commit 70b99e69 ("Btrfs: Compression corner fixes") since that
      was problematic.
      
      So remove all the extent cloning code that deals with the possibility
      of cloning only partial inline extents.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a61e1e0d
    • F
      Btrfs: move all reflink implementation code into its own file · 6a177381
      Filipe Manana 提交于
      The reflink code is quite large and has been living in ioctl.c since ever.
      It has grown over the years after many bug fixes and improvements, and
      since I'm planning on making some further improvements on it, it's time
      to get it better organized by moving into its own file, reflink.c
      (similar to what xfs does for example).
      
      This change only moves the code out of ioctl.c into the new file, it
      doesn't do any other change.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6a177381