1. 11 4月, 2015 1 次提交
    • J
      Btrfs: account for crcs in delayed ref processing · 1262133b
      Josef Bacik 提交于
      As we delete large extents, we end up doing huge amounts of COW in order
      to delete the corresponding crcs.  This adds accounting so that we keep
      track of that space and flushing of delayed refs so that we don't build
      up too much delayed crc work.
      
      This helps limit the delayed work that must be done at commit time and
      tries to avoid ENOSPC aborts because the crcs eat all the global
      reserves.
      Signed-off-by: NChris Mason <clm@fb.com>
      1262133b
  2. 10 6月, 2014 1 次提交
    • J
      Btrfs: rework qgroup accounting · fcebe456
      Josef Bacik 提交于
      Currently qgroups account for space by intercepting delayed ref updates to fs
      trees.  It does this by adding sequence numbers to delayed ref updates so that
      it can figure out how the tree looked before the update so we can adjust the
      counters properly.  The problem with this is that it does not allow delayed refs
      to be merged, so if you say are defragging an extent with 5k snapshots pointing
      to it we will thrash the delayed ref lock because we need to go back and
      manually merge these things together.  Instead we want to process quota changes
      when we know they are going to happen, like when we first allocate an extent, we
      free a reference for an extent, we add new references etc.  This patch
      accomplishes this by only adding qgroup operations for real ref changes.  We
      only modify the sequence number when we need to lookup roots for bytenrs, this
      reduces the amount of churn on the sequence number and allows us to merge
      delayed refs as we add them most of the time.  This patch encompasses a bunch of
      architectural changes
      
      1) qgroup ref operations: instead of tracking qgroup operations through the
      delayed refs we simply add new ref operations whenever we notice that we need to
      when we've modified the refs themselves.
      
      2) tree mod seq:  we no longer have this separation of major/minor counters.
      this makes the sequence number stuff much more sane and we can remove some
      locking that was needed to protect the counter.
      
      3) delayed ref seq: we now read the tree mod seq number and use that as our
      sequence.  This means each new delayed ref doesn't have it's own unique sequence
      number, rather whenever we go to lookup backrefs we inc the sequence number so
      we can make sure to keep any new operations from screwing up our world view at
      that given point.  This allows us to merge delayed refs during runtime.
      
      With all of these changes the delayed ref stuff is a little saner and the qgroup
      accounting stuff no longer goes negative in some cases like it was before.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      fcebe456
  3. 29 1月, 2014 2 次提交
    • J
      Btrfs: attach delayed ref updates to delayed ref heads · d7df2c79
      Josef Bacik 提交于
      Currently we have two rb-trees, one for delayed ref heads and one for all of the
      delayed refs, including the delayed ref heads.  When we process the delayed refs
      we have to hold onto the delayed ref lock for all of the selecting and merging
      and such, which results in quite a bit of lock contention.  This was solved by
      having a waitqueue and only one flusher at a time, however this hurts if we get
      a lot of delayed refs queued up.
      
      So instead just have an rb tree for the delayed ref heads, and then attach the
      delayed ref updates to an rb tree that is per delayed ref head.  Then we only
      need to take the delayed ref lock when adding new delayed refs and when
      selecting a delayed ref head to process, all the rest of the time we deal with a
      per delayed ref head lock which will be much less contentious.
      
      The locking rules for this get a little more complicated since we have to lock
      up to 3 things to properly process delayed refs, but I will address that problem
      later.  For now this passes all of xfstests and my overnight stress tests.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d7df2c79
    • L
      Btrfs: introduce a head ref rbtree · c46effa6
      Liu Bo 提交于
      The way how we process delayed refs is
      1) get a bunch of head refs,
      2) pick up one head ref,
      3) go one node back for any delayed ref updates.
      
      The head ref is also linked in the same rbtree as the delayed ref is,
      so in 1) stage, we have to walk one by one including not only head refs, but
      delayed refs.
      
      When we have a great number of delayed refs pending to process,
      this'll cost time a lot.
      
      Here we introduce a head ref specific rbtree, it only has head refs, so troubles
      go away.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c46effa6
  4. 18 5月, 2013 1 次提交
    • J
      Btrfs: handle running extent ops with skinny metadata · b1c79e09
      Josef Bacik 提交于
      Chris hit a bug where we weren't finding extent records when running extent ops.
      This is because we use the delayed_ref_head when running the extent op, which
      means we can't use the ->type checks to see if we are metadata.  We also lose
      the level of the metadata we are working on.  So to fix this we can just check
      the ->is_data section of the extent_op, and we can store the level of the buffer
      we were modifying in the extent_op.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      b1c79e09
  5. 20 2月, 2013 2 次提交
  6. 02 2月, 2013 1 次提交
    • C
      Btrfs: reduce CPU contention while waiting for delayed extent operations · bb721703
      Chris Mason 提交于
      We batch up operations to the extent allocation tree, which allows
      us to deal with the recursive nature of using the extent allocation
      tree to allocate extents to the extent allocation tree.
      
      It also provides a mechanism to sort and collect extent
      operations, which makes it much more efficient to record extents
      that are close together.
      
      The delayed extent operations must all be finished before the
      running transaction commits, so we have code to make sure and run a few
      of the batched operations when closing our transaction handles.
      
      This creates a great deal of contention for the locks in the
      delayed extent operation tree, and also contention for the lock on the
      extent allocation tree itself.  All the extra contention just slows
      down the operations and doesn't get things done any faster.
      
      This commit changes things to use a wait queue instead.  As procs
      want to run the delayed operations, one of them races in and gets
      permission to hit the tree, and the others step back and wait for
      progress to be made.
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      bb721703
  7. 21 9月, 2012 1 次提交
  8. 29 8月, 2012 1 次提交
    • J
      Btrfs: allow delayed refs to be merged · ae1e206b
      Josef Bacik 提交于
      Daniel Blueman reported a bug with fio+balance on a ramdisk setup.
      Basically what happens is the balance relocates a tree block which will drop
      the implicit refs for all of its children and adds a full backref.  Once the
      block is relocated we have to add the implicit refs back, so when we cow the
      block again we add the implicit refs for its children back.  The problem
      comes when the original drop ref doesn't get run before we add the implicit
      refs back.  The delayed ref stuff will specifically prefer ADD operations
      over DROP to keep us from freeing up an extent that will have references to
      it, so we try to add the implicit ref before it is actually removed and we
      panic.  This worked fine before because the add would have just canceled the
      drop out and we would have been fine.  But the backref walking work needs to
      be able to freeze the delayed ref stuff in time so we have this ever
      increasing sequence number that gets attached to all new delayed ref updates
      which makes us not merge refs and we run into this issue.
      
      So to fix this we need to merge delayed refs.  So everytime we run a
      clustered ref we need to try and merge all of its delayed refs.  The backref
      walking stuff locks the delayed ref head before processing, so if we have it
      locked we are safe to merge any refs inside of the sequence number.  If
      there is no sequence number we can merge all refs.  Doing this not only
      fixes our bug but keeps the delayed ref code from adding and removing
      useless refs and batching together multiple refs into one search instead of
      one search per delayed ref, which will really help our commit times.  I ran
      this with Daniels test and 276 and I haven't seen any problems.  Thanks,
      Reported-by: NDaniel J Blueman <daniel@quora.org>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      ae1e206b
  9. 12 7月, 2012 1 次提交
  10. 10 7月, 2012 1 次提交
  11. 31 5月, 2012 1 次提交
    • J
      Btrfs: use delayed ref sequence numbers for all fs-tree updates · 95a06077
      Jan Schmidt 提交于
      The sequence number for delayed refs is needed to postpone certain delayed
      refs for a very short period while walking backrefs. Before the tree
      modification log, we thought we'd only have to hold back those references
      that don't have a counter operation.
      
      While now we've the tree mod log, we're rewinding fs tree blocks to a
      defined consistent state. We cannot know in advance for which tree block
      we'll be doing rewind operations later. Therefore, we must postpone all the
      delayed refs for fs-tree blocks, even those having a counter operation.
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      95a06077
  12. 26 5月, 2012 1 次提交
  13. 04 1月, 2012 2 次提交
    • J
      Btrfs: add waitqueue instead of doing busy waiting for more delayed refs · a168650c
      Jan Schmidt 提交于
      Now that we may be holding back delayed refs for a limited period, we
      might end up having no runnable delayed refs. Without this commit, we'd
      do busy waiting in that thread until another (runnable) ref arives.
      Instead, we're detecting this situation and use a waitqueue, such that
      we only try to run more refs after
      	a) another runnable ref was added  or
      	b) delayed refs are no longer held back
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      a168650c
    • A
      Btrfs: add sequence numbers to delayed refs · 00f04b88
      Arne Jansen 提交于
      Sequence numbers are needed to reconstruct the backrefs of a given extent to
      a certain point in time. The total set of backrefs consist of the set of
      backrefs recorded on disk plus the enqueued delayed refs for it that existed
      at that moment.
      
      This patch also adds a list that records all delayed refs which are
      currently in the process of being added.
      
      When walking all refs of an extent in btrfs_find_all_roots(), we freeze the
      current state of delayed refs, honor anythinh up to this point and prevent
      processing newer delayed refs to assert consistency.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      00f04b88
  14. 22 12月, 2011 2 次提交
    • A
      Btrfs: always save ref_root in delayed refs · eebe063b
      Arne Jansen 提交于
      For consistent backref walking and (later) qgroup calculation the
      information to which root a delayed ref belongs is useful even for shared
      refs.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      eebe063b
    • A
      Btrfs: mark delayed refs as for cow · 66d7e7f0
      Arne Jansen 提交于
      Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
      from every call site. The for_cow parameter will later on be used to
      determine if a ref will change anything with respect to qgroups.
      
      Delayed refs coming from relocation are always counted as for_cow, as they
      don't change subvol quota.
      
      Also pass in the fs_info for later use.
      
      btrfs_find_all_roots() will use this as an optimization, as changes that are
      for_cow will not change anything with respect to which root points to a
      certain leaf. Thus, we don't need to add the current sequence number to
      those delayed refs.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      66d7e7f0
  15. 06 5月, 2011 1 次提交
  16. 04 5月, 2011 1 次提交
  17. 25 5月, 2010 1 次提交
  18. 10 6月, 2009 1 次提交
    • Y
      Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) · 5d4f98a2
      Yan Zheng 提交于
      This commit introduces a new kind of back reference for btrfs metadata.
      Once a filesystem has been mounted with this commit, IT WILL NO LONGER
      BE MOUNTABLE BY OLDER KERNELS.
      
      When a tree block in subvolume tree is cow'd, the reference counts of all
      extents it points to are increased by one.  At transaction commit time,
      the old root of the subvolume is recorded in a "dead root" data structure,
      and the btree it points to is later walked, dropping reference counts
      and freeing any blocks where the reference count goes to 0.
      
      The increments done during cow and decrements done after commit cancel out,
      and the walk is a very expensive way to go about freeing the blocks that
      are no longer referenced by the new btree root.  This commit reduces the
      transaction overhead by avoiding the need for dead root records.
      
      When a non-shared tree block is cow'd, we free the old block at once, and the
      new block inherits old block's references. When a tree block with reference
      count > 1 is cow'd, we increase the reference counts of all extents
      the new block points to by one, and decrease the old block's reference count by
      one.
      
      This dead tree avoidance code removes the need to modify the reference
      counts of lower level extents when a non-shared tree block is cow'd.
      But we still need to update back ref for all pointers in the block.
      This is because the location of the block is recorded in the back ref
      item.
      
      We can solve this by introducing a new type of back ref. The new
      back ref provides information about pointer's key, level and in which
      tree the pointer lives. This information allow us to find the pointer
      by searching the tree. The shortcoming of the new back ref is that it
      only works for pointers in tree blocks referenced by their owner trees.
      
      This is mostly a problem for snapshots, where resolving one of these
      fuzzy back references would be O(number_of_snapshots) and quite slow.
      The solution used here is to use the fuzzy back references in the common
      case where a given tree block is only referenced by one root,
      and use the full back references when multiple roots have a reference
      on a given block.
      
      This commit adds per subvolume red-black tree to keep trace of cached
      inodes. The red-black tree helps the balancing code to find cached
      inodes whose inode numbers within a given range.
      
      This commit improves the balancing code by introducing several data
      structures to keep the state of balancing. The most important one
      is the back ref cache. It caches how the upper level tree blocks are
      referenced. This greatly reduce the overhead of checking back ref.
      
      The improved balancing code scales significantly better with a large
      number of snapshots.
      
      This is a very large commit and was written in a number of
      pieces.  But, they depend heavily on the disk format change and were
      squashed together to make sure git bisect didn't end up in a
      bad state wrt space balancing or the format change.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5d4f98a2
  19. 25 3月, 2009 4 次提交
    • C
      Btrfs: make sure btrfs_update_delayed_ref doesn't increase ref_mod · 1a81af4d
      Chris Mason 提交于
      btrfs_update_delayed_ref is optimized to add and remove different
      references in one pass through the delayed ref tree.  It is a zero
      sum on the total number of refs on a given extent.
      
      But, the code was recording an extra ref in the head node.  This
      never made it down to the disk but was used when deciding if it was
      safe to free the extent while dropping snapshots.
      
      The fix used here is to make sure the ref_mod count is unchanged
      on the head ref when btrfs_update_delayed_ref is called.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1a81af4d
    • C
      Btrfs: process the delayed reference queue in clusters · c3e69d58
      Chris Mason 提交于
      The delayed reference queue maintains pending operations that need to
      be done to the extent allocation tree.  These are processed by
      finding records in the tree that are not currently being processed one at
      a time.
      
      This is slow because it uses lots of time searching through the rbtree
      and because it creates lock contention on the extent allocation tree
      when lots of different procs are running delayed refs at the same time.
      
      This commit changes things to grab a cluster of refs for processing,
      using a cursor into the rbtree as the starting point of the next search.
      This way we walk smoothly through the rbtree.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c3e69d58
    • C
      Btrfs: try to cleanup delayed refs while freeing extents · 1887be66
      Chris Mason 提交于
      When extents are freed, it is likely that we've removed the last
      delayed reference update for the extent.  This checks the delayed
      ref tree when things are freed, and if no ref updates area left it
      immediately processes the delayed ref.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1887be66
    • C
      Btrfs: do extent allocation and reference count updates in the background · 56bec294
      Chris Mason 提交于
      The extent allocation tree maintains a reference count and full
      back reference information for every extent allocated in the
      filesystem.  For subvolume and snapshot trees, every time
      a block goes through COW, the new copy of the block adds a reference
      on every block it points to.
      
      If a btree node points to 150 leaves, then the COW code needs to go
      and add backrefs on 150 different extents, which might be spread all
      over the extent allocation tree.
      
      These updates currently happen during btrfs_cow_block, and most COWs
      happen during btrfs_search_slot.  btrfs_search_slot has locks held
      on both the parent and the node we are COWing, and so we really want
      to avoid IO during the COW if we can.
      
      This commit adds an rbtree of pending reference count updates and extent
      allocations.  The tree is ordered by byte number of the extent and byte number
      of the parent for the back reference.  The tree allows us to:
      
      1) Modify back references in something close to disk order, reducing seeks
      2) Significantly reduce the number of modifications made as block pointers
      are balanced around
      3) Do all of the extent insertion and back reference modifications outside
      of the performance critical btrfs_search_slot code.
      
      #3 has the added benefit of greatly reducing the btrfs stack footprint.
      The extent allocation tree modifications are done without the deep
      (and somewhat recursive) call chains used in the past.
      
      These delayed back reference updates must be done before the transaction
      commits, and so the rbtree is tied to the transaction.  Throttling is
      implemented to help keep the queue of backrefs at a reasonable size.
      
      Since there was a similar mechanism in place for the extent tree
      extents, that is removed and replaced by the delayed reference tree.
      
      Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      56bec294