1. 10 6月, 2009 1 次提交
    • Y
      Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) · 5d4f98a2
      Yan Zheng 提交于
      This commit introduces a new kind of back reference for btrfs metadata.
      Once a filesystem has been mounted with this commit, IT WILL NO LONGER
      BE MOUNTABLE BY OLDER KERNELS.
      
      When a tree block in subvolume tree is cow'd, the reference counts of all
      extents it points to are increased by one.  At transaction commit time,
      the old root of the subvolume is recorded in a "dead root" data structure,
      and the btree it points to is later walked, dropping reference counts
      and freeing any blocks where the reference count goes to 0.
      
      The increments done during cow and decrements done after commit cancel out,
      and the walk is a very expensive way to go about freeing the blocks that
      are no longer referenced by the new btree root.  This commit reduces the
      transaction overhead by avoiding the need for dead root records.
      
      When a non-shared tree block is cow'd, we free the old block at once, and the
      new block inherits old block's references. When a tree block with reference
      count > 1 is cow'd, we increase the reference counts of all extents
      the new block points to by one, and decrease the old block's reference count by
      one.
      
      This dead tree avoidance code removes the need to modify the reference
      counts of lower level extents when a non-shared tree block is cow'd.
      But we still need to update back ref for all pointers in the block.
      This is because the location of the block is recorded in the back ref
      item.
      
      We can solve this by introducing a new type of back ref. The new
      back ref provides information about pointer's key, level and in which
      tree the pointer lives. This information allow us to find the pointer
      by searching the tree. The shortcoming of the new back ref is that it
      only works for pointers in tree blocks referenced by their owner trees.
      
      This is mostly a problem for snapshots, where resolving one of these
      fuzzy back references would be O(number_of_snapshots) and quite slow.
      The solution used here is to use the fuzzy back references in the common
      case where a given tree block is only referenced by one root,
      and use the full back references when multiple roots have a reference
      on a given block.
      
      This commit adds per subvolume red-black tree to keep trace of cached
      inodes. The red-black tree helps the balancing code to find cached
      inodes whose inode numbers within a given range.
      
      This commit improves the balancing code by introducing several data
      structures to keep the state of balancing. The most important one
      is the back ref cache. It caches how the upper level tree blocks are
      referenced. This greatly reduce the overhead of checking back ref.
      
      The improved balancing code scales significantly better with a large
      number of snapshots.
      
      This is a very large commit and was written in a number of
      pieces.  But, they depend heavily on the disk format change and were
      squashed together to make sure git bisect didn't end up in a
      bad state wrt space balancing or the format change.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5d4f98a2
  2. 15 5月, 2009 1 次提交
    • C
      Btrfs: Don't loop forever on metadata IO failures · 76a05b35
      Chris Mason 提交于
      When a btrfs metadata read fails, the first thing we try to do is find
      a good copy on another mirror of the block.  If this fails, read_tree_block()
      ends up returning a buffer that isn't up to date.
      
      The btrfs btree reading code was reworked to drop locks and repeat
      the search when IO was done, but the changes didn't add a check for failed
      reads.  The end result was looping forever on buffers that were never
      going to become up to date.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      76a05b35
  3. 21 4月, 2009 1 次提交
    • C
      Btrfs: use the right node in reada_for_balance · 8c594ea8
      Chris Mason 提交于
      reada_for_balance was using the wrong index into the path node array,
      so it wasn't reading the right blocks.  We never directly used the
      results of the read done by this function because the btree search is
      started over at the end.
      
      This fixes reada_for_balance to reada in the correct node and to
      avoid searching past the last slot in the node.  It also makes sure to
      hold the parent lock while we are finding the nodes to read.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      8c594ea8
  4. 03 4月, 2009 3 次提交
  5. 25 3月, 2009 5 次提交
    • C
      Btrfs: limit balancing work while flushing delayed refs · a4b6e07d
      Chris Mason 提交于
      The delayed reference mechanism is responsible for all updates to the
      extent allocation trees, including those updates created while processing
      the delayed references.
      
      This commit tries to limit the amount of work that gets created during
      the final run of delayed refs before a commit.  It avoids cowing new blocks
      unless it is required to finish the commit, and so it avoids new allocations
      that were not really required.
      
      The goal is to avoid infinite loops where we are always making more work
      on the final run of delayed refs.  Over the long term we'll make a
      special log for the last delayed ref updates as well.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a4b6e07d
    • C
      Btrfs: leave btree locks spinning more often · b9473439
      Chris Mason 提交于
      btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
      for the buffers it was dirtying.  This may require a kmalloc and it
      was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
      set any btree locks they were holding to blocking first.
      
      This commit changes dirty tracking for extent buffers to just use a flag
      in the extent buffer.  Now that we have one and only one extent buffer
      per page, this can be safely done without losing dirty bits along the way.
      
      This also introduces a path->leave_spinning flag that callers of
      btrfs_search_slot can use to indicate they will properly deal with a
      path returned where all the locks are spinning instead of blocking.
      
      Many of the btree search callers now expect spinning paths,
      resulting in better btree concurrency overall.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b9473439
    • C
      Btrfs: reduce stack usage in some crucial tree balancing functions · 44871b1b
      Chris Mason 提交于
      Many of the tree balancing functions follow the same pattern.
      
      1) cow a block
      2) do something to the result
      
      This commit breaks them up into two functions so the variables and
      code required for part two don't suck down stack during part one.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      44871b1b
    • C
      Btrfs: do extent allocation and reference count updates in the background · 56bec294
      Chris Mason 提交于
      The extent allocation tree maintains a reference count and full
      back reference information for every extent allocated in the
      filesystem.  For subvolume and snapshot trees, every time
      a block goes through COW, the new copy of the block adds a reference
      on every block it points to.
      
      If a btree node points to 150 leaves, then the COW code needs to go
      and add backrefs on 150 different extents, which might be spread all
      over the extent allocation tree.
      
      These updates currently happen during btrfs_cow_block, and most COWs
      happen during btrfs_search_slot.  btrfs_search_slot has locks held
      on both the parent and the node we are COWing, and so we really want
      to avoid IO during the COW if we can.
      
      This commit adds an rbtree of pending reference count updates and extent
      allocations.  The tree is ordered by byte number of the extent and byte number
      of the parent for the back reference.  The tree allows us to:
      
      1) Modify back references in something close to disk order, reducing seeks
      2) Significantly reduce the number of modifications made as block pointers
      are balanced around
      3) Do all of the extent insertion and back reference modifications outside
      of the performance critical btrfs_search_slot code.
      
      #3 has the added benefit of greatly reducing the btrfs stack footprint.
      The extent allocation tree modifications are done without the deep
      (and somewhat recursive) call chains used in the past.
      
      These delayed back reference updates must be done before the transaction
      commits, and so the rbtree is tied to the transaction.  Throttling is
      implemented to help keep the queue of backrefs at a reasonable size.
      
      Since there was a similar mechanism in place for the extent tree
      extents, that is removed and replaced by the delayed reference tree.
      
      Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      56bec294
    • C
      Btrfs: don't preallocate metadata blocks during btrfs_search_slot · 9fa8cfe7
      Chris Mason 提交于
      In order to avoid doing expensive extent management with tree locks held,
      btrfs_search_slot will preallocate tree blocks for use by COW without
      any tree locks held.
      
      A later commit moves all of the extent allocation work for COW into
      a delayed update mechanism, and this preallocation will no longer be
      required.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9fa8cfe7
  6. 09 3月, 2009 1 次提交
    • C
      Btrfs: fix spinlock assertions on UP systems · b9447ef8
      Chris Mason 提交于
      btrfs_tree_locked was being used to make sure a given extent_buffer was
      properly locked in a few places.  But, it wasn't correct for UP compiled
      kernels.
      
      This switches it to using assert_spin_locked instead, and renames it to
      btrfs_assert_tree_locked to better reflect how it was really being used.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b9447ef8
  7. 13 2月, 2009 2 次提交
    • C
      Btrfs: make a lockdep class for the extent buffer locks · 4008c04a
      Chris Mason 提交于
      Btrfs is currently using spin_lock_nested with a nested value based
      on the tree depth of the block.  But, this doesn't quite work because
      the max tree depth is bigger than what spin_lock_nested can deal with,
      and because locks are sometimes taken before the level field is filled in.
      
      The solution here is to use lockdep_set_class_and_name instead, and to
      set the class before unlocking the pages when the block is read from the
      disk and just after init of a freshly allocated tree block.
      
      btrfs_clear_path_blocking is also changed to take the locks in the proper
      order, and it also makes sure all the locks currently held are properly
      set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
      gets upset about bad lock orderin.
      
      The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4008c04a
    • J
      Btrfs: remove btrfs_init_path · e00f7308
      Jeff Mahoney 提交于
      btrfs_init_path was initially used when the path objects were on the
      stack.  Now all the work is done by btrfs_alloc_path and btrfs_init_path
      isn't required.
      
      This patch removes it, and just uses kmem_cache_zalloc to zero out the object.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      e00f7308
  8. 12 2月, 2009 1 次提交
  9. 10 2月, 2009 1 次提交
    • C
      Btrfs: don't use spin_is_contended · 284b066a
      Chris Mason 提交于
      Btrfs was using spin_is_contended to see if it should drop locks before
      doing extent allocations during btrfs_search_slot.  The idea was to avoid
      expensive searches in the tree unless the lock was actually contended.
      
      But, spin_is_contended is specific to the ticket spinlocks on x86, so this
      is causing compile errors everywhere else.
      
      In practice, the contention could easily appear some time after we started
      doing the extent allocation, and it makes more sense to always drop the lock
      instead.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      284b066a
  10. 04 2月, 2009 5 次提交
    • C
      Btrfs: Only prep for btree deletion balances when nodes are mostly empty · 7b78c170
      Chris Mason 提交于
      Whenever an item deletion is done, we need to balance all the nodes
      in the tree to make sure we don't end up with an empty node if a pointer
      is deleted.  This balance prep happens from the root of the tree down
      so we can drop our locks as we go.
      
      reada_for_balance was triggering read-ahead on neighboring nodes even
      when no balancing was required.  This adds an extra check to avoid
      calling balance_level() and avoid reada_for_balance() when a balance
      won't be required.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      7b78c170
    • C
      Btrfs: fix btrfs_unlock_up_safe to walk the entire path · 12f4dacc
      Chris Mason 提交于
      btrfs_unlock_up_safe would break out at the first NULL node entry or
      unlocked node it found in the path.
      
      Some of the callers have missing nodes at the lower levels of the path, so this
      commit fixes things to check all the nodes in the path before returning.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      12f4dacc
    • C
      Btrfs: change btrfs_del_leaf to drop locks earlier · 4d081c41
      Chris Mason 提交于
      btrfs_del_leaf does two things.  First it removes the pointer in the
      parent, and then it frees the block that has the leaf.  It has the
      parent node locked for both operations.
      
      But, it only needs the parent locked while it is deleting the pointer.
      After that it can safely free the block without the parent locked.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4d081c41
    • C
      Btrfs: Change btree locking to use explicit blocking points · b4ce94de
      Chris Mason 提交于
      Most of the btrfs metadata operations can be protected by a spinlock,
      but some operations still need to schedule.
      
      So far, btrfs has been using a mutex along with a trylock loop,
      most of the time it is able to avoid going for the full mutex, so
      the trylock loop is a big performance gain.
      
      This commit is step one for getting rid of the blocking locks entirely.
      btrfs_tree_lock takes a spinlock, and the code explicitly switches
      to a blocking lock when it starts an operation that can schedule.
      
      We'll be able get rid of the blocking locks in smaller pieces over time.
      Tracing allows us to find the most common cause of blocking, so we
      can start with the hot spots first.
      
      The basic idea is:
      
      btrfs_tree_lock() returns with the spin lock held
      
      btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
      the extent buffer flags, and then drops the spin lock.  The buffer is
      still considered locked by all of the btrfs code.
      
      If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
      the spin lock and waits on a wait queue for the blocking bit to go away.
      
      Much of the code that needs to set the blocking bit finishes without actually
      blocking a good percentage of the time.  So, an adaptive spin is still
      used against the blocking bit to avoid very high context switch rates.
      
      btrfs_clear_lock_blocking() clears the blocking bit and returns
      with the spinlock held again.
      
      btrfs_tree_unlock() can be called on either blocking or spinning locks,
      it does the right thing based on the blocking bit.
      
      ctree.c has a helper function to set/clear all the locked buffers in a
      path as blocking.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b4ce94de
    • C
      Btrfs: hash_lock is no longer needed · c487685d
      Chris Mason 提交于
      Before metadata is written to disk, it is updated to reflect that writeout
      has begun.  Once this update is done, the block must be cow'd before it
      can be modified again.
      
      This update was originally synchronized by using a per-fs spinlock.  Today
      the buffers for the metadata blocks are locked before writeout begins,
      and everyone that tests the flag has the buffer locked as well.
      
      So, the per-fs spinlock (called hash_lock for no good reason) is no
      longer required.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c487685d
  11. 22 1月, 2009 1 次提交
    • C
      Btrfs: do less aggressive btree readahead · a7175319
      Chris Mason 提交于
      Just before reading a leaf, btrfs scans the node for blocks that are
      close by and reads them too.  It tries to build up a large window
      of IO looking for blocks that are within a max distance from the top
      and bottom of the IO window.
      
      This patch changes things to just look for blocks within 64k of the
      target block.  It will trigger less IO and make for lower latencies on
      the read size.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a7175319
  12. 06 1月, 2009 1 次提交
  13. 17 12月, 2008 1 次提交
  14. 16 12月, 2008 1 次提交
    • C
      Btrfs: Fix compressed writes on truncated pages · 42dc7bab
      Chris Mason 提交于
      The compression code was using isize to limit the amount of data it
      sent through zlib.  But, it wasn't properly limiting the looping to
      just the pages inside i_size.  The end result was trying to compress
      too many pages, including those that had not been setup and properly locked
      down.  This made the compression code oops while trying find_get_page on a
      page that didn't exist.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      42dc7bab
  15. 10 12月, 2008 1 次提交
    • C
      Btrfs: Delete csum items when freeing extents · 459931ec
      Chris Mason 提交于
      This finishes off the new checksumming code by removing csum items
      for extents that are no longer in use.
      
      The trick is doing it without racing because a single csum item may
      hold csums for more than one extent.  Extra checks are added to
      btrfs_csum_file_blocks to make sure that we are using the correct
      csum item after dropping locks.
      
      A new btrfs_split_item is added to split a single csum item so it
      can be split without dropping the leaf lock.  This is used to
      remove csum bytes from the middle of an item.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      459931ec
  16. 09 12月, 2008 1 次提交
  17. 02 12月, 2008 1 次提交
  18. 19 11月, 2008 1 次提交
    • L
      Btrfs: Some fixes for batching extent insert. · b4eec2ca
      Liu Hui 提交于
      In insert_extents(), when ret==1 and last is not zero, it should
      check if the current inserted item is the last item in this batching
      inserts. If so, it should just break from loop. If not, 'cur =
      insert_list->next' will make no sense because the list is empty now,
      and 'op' will point to an unexpectable place.
      
      There are also some trivial fixs in this patch including one comment
      typo error and deleting two redundant lines.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b4eec2ca
  19. 18 11月, 2008 1 次提交
    • Y
      Btrfs: Seed device support · 2b82032c
      Yan Zheng 提交于
      Seed device is a special btrfs with SEEDING super flag
      set and can only be mounted in read-only mode. Seed
      devices allow people to create new btrfs on top of it.
      
      The new FS contains the same contents as the seed device,
      but it can be mounted in read-write mode.
      
      This patch does the following:
      
      1) split code in btrfs_alloc_chunk into two parts. The first part does makes
      the newly allocated chunk usable, but does not do any operation that modifies
      the chunk tree. The second part does the the chunk tree modifications. This
      division is for the bootstrap step of adding storage to the seed device.
      
      2) Update device management code to handle seed device.
      The basic idea is: For an FS grown from seed devices, its
      seed devices are put into a list. Seed devices are
      opened on demand at mounting time. If any seed device is
      missing or has been changed, btrfs kernel module will
      refuse to mount the FS.
      
      3) make btrfs_find_block_group not return NULL when all
      block groups are read-only.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      2b82032c
  20. 13 11月, 2008 2 次提交
    • J
      Btrfs: batch extent inserts/updates/deletions on the extent root · f3465ca4
      Josef Bacik 提交于
      While profiling the allocator I noticed a good amount of time was being spent in
      finish_current_insert and del_pending_extents, and as the filesystem filled up
      more and more time was being spent in those functions.  This patch aims to try
      and reduce that problem.  This happens two ways
      
      1) track if we tried to delete an extent that we are going to update or insert.
      Once we get into finish_current_insert we discard any of the extents that were
      marked for deletion.  This saves us from doing unnecessary work almost every
      time finish_current_insert runs.
      
      2) Batch insertion/updates/deletions.  Instead of doing a btrfs_search_slot for
      each individual extent and doing the needed operation, we instead keep the leaf
      around and see if there is anything else we can do on that leaf.  On the insert
      case I introduced a btrfs_insert_some_items, which will take an array of keys
      with an array of data_sizes and try and squeeze in as many of those keys as
      possible, and then return how many keys it was able to insert.  In the update
      case we search for an extent ref, update the ref and then loop through the leaf
      to see if any of the other refs we are looking to update are on that leaf, and
      then once we are done we release the path and search for the next ref we need to
      update.  And finally for the deletion we try and delete the extent+ref in pairs,
      so we will try to find extent+ref pairs next to the extent we are trying to free
      and free them in bulk if possible.
      
      This along with the other cluster fix that Chris pushed out a bit ago helps make
      the allocator preform more uniformly as it fills up the disk.  There is still a
      slight drop as we fill up the disk since we start having to stick new blocks in
      odd places which results in more COW's than on a empty fs, but the drop is not
      nearly as severe as it was before.
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      f3465ca4
    • C
      Btrfs: Improve metadata read latencies · 6f3577bd
      Chris Mason 提交于
      This fixes latency problems on metadata reads by making sure they
      don't go through the async submit queue, and by tuning down the amount
      of readahead done during btree searches.
      
      Also, the btrfs bdi congestion function is tuned to ignore the
      number of pending async bios and checksums pending.  There is additional
      code that throttles new async bios now and the congestion function
      doesn't need to worry about it anymore.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      6f3577bd
  21. 30 10月, 2008 2 次提交
    • J
      Btrfs: nuke fs wide allocation mutex V2 · 25179201
      Josef Bacik 提交于
      This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
      of little locks.
      
      There is now a pinned_mutex, which is used when messing with the pinned_extents
      extent io tree, and the extent_ins_mutex which is used with the pending_del and
      extent_ins extent io trees.
      
      The locking for the extent tree stuff was inspired by a patch that Yan Zheng
      wrote to fix a race condition, I cleaned it up some and changed the locking
      around a little bit, but the idea remains the same.  Basically instead of
      holding the extent_ins_mutex throughout the processing of an extent on the
      extent_ins or pending_del trees, we just hold it while we're searching and when
      we clear the bits on those trees, and lock the extent for the duration of the
      operations on the extent.
      
      Also to keep from getting hung up waiting to lock an extent, I've added a
      try_lock_extent so if we cannot lock the extent, move on to the next one in the
      tree and we'll come back to that one.  I have tested this heavily and it does
      not appear to break anything.  This has to be applied on top of my
      find_free_extent redo patch.
      
      I tested this patch on top of Yan's space reblancing code and it worked fine.
      The only thing that has changed since the last version is I pulled out all my
      debugging stuff, apparently I forgot to run guilt refresh before I sent the
      last patch out.  Thank you,
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      
      25179201
    • Y
      Btrfs: Improve space balancing code · f82d02d9
      Yan Zheng 提交于
      This patch improves the space balancing code to keep more sharing
      of tree blocks. The only case that breaks sharing of tree blocks is
      data extents get fragmented during balancing. The main changes in
      this patch are:
      
      Add a 'drop sub-tree' function. This solves the problem in old code
      that BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree block.
      
      Remove relocation mapping tree. Relocation mappings are stored in
      struct btrfs_ref_path and updated dynamically during walking up/down
      the reference path. This reduces CPU usage and simplifies code.
      
      This patch also fixes a bug. Root items for reloc trees should be
      updated in btrfs_free_reloc_root.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      
      f82d02d9
  22. 09 10月, 2008 1 次提交
    • Y
      Btrfs: Remove offset field from struct btrfs_extent_ref · 3bb1a1bc
      Yan Zheng 提交于
      The offset field in struct btrfs_extent_ref records the position
      inside file that file extent is referenced by. In the new back
      reference system, tree leaves holding references to file extent
      are recorded explicitly. We can scan these tree leaves very quickly, so the
      offset field is not required.
      
      This patch also makes the back reference system check the objectid
      when extents are in deleting.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      3bb1a1bc
  23. 02 10月, 2008 1 次提交
    • C
      Btrfs: don't read leaf blocks containing only checksums during truncate · 323ac95b
      Chris Mason 提交于
      Checksum items take up a significant portion of the metadata for large files.
      It is possible to avoid reading them during truncates by checking the keys in
      the higher level nodes.
      
      If a given leaf is followed by another leaf where the lowest key is a checksum
      item from the same file, we know we can safely delete the leaf without
      reading it.
      
      For a 32GB file on a 6 drive raid0 array, Btrfs needs 8s to delete
      the file with a cold cache.  It is read bound during the run.
      
      With this change, Btrfs is able to delete the file in 0.5s
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      323ac95b
  24. 30 9月, 2008 1 次提交
    • C
      Btrfs: add and improve comments · d352ac68
      Chris Mason 提交于
      This improves the comments at the top of many functions.  It didn't
      dive into the guts of functions because I was trying to
      avoid merging problems with the new allocator and back reference work.
      
      extent-tree.c and volumes.c were both skipped, and there is definitely
      more work todo in cleaning and commenting the code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      d352ac68
  25. 26 9月, 2008 2 次提交
    • Z
      Btrfs: update space balancing code · 1a40e23b
      Zheng Yan 提交于
      This patch updates the space balancing code to utilize the new
      backref format.  Before, btrfs-vol -b would break any COW links
      on data blocks or metadata.  This was slow and caused the amount
      of space used to explode if a large number of snapshots were present.
      
      The new code can keeps the sharing of all data extents and
      most of the tree blocks.
      
      To maintain the sharing of data extents, the space balance code uses
      a seperate inode hold data extent pointers, then updates the references
      to point to the new location.
      
      To maintain the sharing of tree blocks, the space balance code uses
      reloc trees to relocate tree blocks in reference counted roots.
      There is one reloc tree for each subvol, and all reloc trees share
      same root key objectid. Reloc trees are snapshots of the latest
      committed roots of subvols (root->commit_root).
      
      To relocate a tree block referenced by a subvol, there are two steps.
      COW the block through subvol's reloc tree, then update block pointer in
      the subvol to point to the new block. Since all reloc trees share
      same root key objectid, doing special handing for tree blocks
      owned by them is easy. Once a tree block has been COWed in one
      reloc tree, we can use the resulting new block directly when the
      same block is required to COW again through other reloc trees.
      In this way, relocated tree blocks are shared between reloc trees,
      so they are also shared between subvols.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1a40e23b
    • Z
      Btrfs: extent_map and data=ordered fixes for space balancing · 5b21f2ed
      Zheng Yan 提交于
      * Add an EXTENT_BOUNDARY state bit to keep the writepage code
      from merging data extents that are in the process of being
      relocated.  This allows us to do accounting for them properly.
      
      * The balancing code relocates data extents indepdent of the underlying
      inode.  The extent_map code was modified to properly account for
      things moving around (invalidating extent_map caches in the inode).
      
      * Don't take the drop_mutex in the create_subvol ioctl.  It isn't
      required.
      
      * Fix walking of the ordered extent list to avoid races with sys_unlink
      
      * Change the lock ordering rules.  Transaction start goes outside
      the drop_mutex.  This allows btrfs_commit_transaction to directly
      drop the relocation trees.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5b21f2ed
  26. 25 9月, 2008 1 次提交
    • Z
      Btrfs: Full back reference support · 31840ae1
      Zheng Yan 提交于
      This patch makes the back reference system to explicit record the
      location of parent node for all types of extents. The location of
      parent node is placed into the offset field of backref key. Every
      time a tree block is balanced, the back references for the affected
      lower level extents are updated.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      31840ae1