1. 25 3月, 2009 4 次提交
    • C
      Btrfs: reduce stalls during transaction commit · b7ec40d7
      Chris Mason 提交于
      To avoid deadlocks and reduce latencies during some critical operations, some
      transaction writers are allowed to jump into the running transaction and make
      it run a little longer, while others sit around and wait for the commit to
      finish.
      
      This is a bit unfair, especially when the callers that jump in do a bunch
      of IO that makes all the others procs on the box wait.  This commit
      reduces the stalls this produces by pre-reading file extent pointers
      during btrfs_finish_ordered_io before the transaction is joined.
      
      It also tunes the drop_snapshot code to politely wait for transactions
      that have started writing out their delayed refs to finish.  This avoids
      new delayed refs being flooded into the queue while we're trying to
      close off the transaction.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b7ec40d7
    • C
      Btrfs: process the delayed reference queue in clusters · c3e69d58
      Chris Mason 提交于
      The delayed reference queue maintains pending operations that need to
      be done to the extent allocation tree.  These are processed by
      finding records in the tree that are not currently being processed one at
      a time.
      
      This is slow because it uses lots of time searching through the rbtree
      and because it creates lock contention on the extent allocation tree
      when lots of different procs are running delayed refs at the same time.
      
      This commit changes things to grab a cluster of refs for processing,
      using a cursor into the rbtree as the starting point of the next search.
      This way we walk smoothly through the rbtree.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c3e69d58
    • C
      Btrfs: try to cleanup delayed refs while freeing extents · 1887be66
      Chris Mason 提交于
      When extents are freed, it is likely that we've removed the last
      delayed reference update for the extent.  This checks the delayed
      ref tree when things are freed, and if no ref updates area left it
      immediately processes the delayed ref.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1887be66
    • C
      Btrfs: do extent allocation and reference count updates in the background · 56bec294
      Chris Mason 提交于
      The extent allocation tree maintains a reference count and full
      back reference information for every extent allocated in the
      filesystem.  For subvolume and snapshot trees, every time
      a block goes through COW, the new copy of the block adds a reference
      on every block it points to.
      
      If a btree node points to 150 leaves, then the COW code needs to go
      and add backrefs on 150 different extents, which might be spread all
      over the extent allocation tree.
      
      These updates currently happen during btrfs_cow_block, and most COWs
      happen during btrfs_search_slot.  btrfs_search_slot has locks held
      on both the parent and the node we are COWing, and so we really want
      to avoid IO during the COW if we can.
      
      This commit adds an rbtree of pending reference count updates and extent
      allocations.  The tree is ordered by byte number of the extent and byte number
      of the parent for the back reference.  The tree allows us to:
      
      1) Modify back references in something close to disk order, reducing seeks
      2) Significantly reduce the number of modifications made as block pointers
      are balanced around
      3) Do all of the extent insertion and back reference modifications outside
      of the performance critical btrfs_search_slot code.
      
      #3 has the added benefit of greatly reducing the btrfs stack footprint.
      The extent allocation tree modifications are done without the deep
      (and somewhat recursive) call chains used in the past.
      
      These delayed back reference updates must be done before the transaction
      commits, and so the rbtree is tied to the transaction.  Throttling is
      implemented to help keep the queue of backrefs at a reasonable size.
      
      Since there was a similar mechanism in place for the extent tree
      extents, that is removed and replaced by the delayed reference tree.
      
      Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      56bec294
  2. 11 3月, 2009 1 次提交
    • C
      Btrfs: Fix locking around adding new space_info · 4184ea7f
      Chris Mason 提交于
      Storage allocated to different raid levels in btrfs is tracked by
      a btrfs_space_info structure, and all of the current space_infos are
      collected into a list_head.
      
      Most filesystems have 3 or 4 of these structs total, and the list is
      only changed when new raid levels are added or at unmount time.
      
      This commit adds rcu locking on the list head, and properly frees
      things at unmount time.  It also clears the space_info->full flag
      whenever new space is added to the FS.
      
      The locking for the space info list goes like this:
      
      reads: protected by rcu_read_lock()
      writes: protected by the chunk_mutex
      
      At unmount time we don't need special locking because all the readers
      are gone.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4184ea7f
  3. 09 3月, 2009 1 次提交
    • C
      Btrfs: fix spinlock assertions on UP systems · b9447ef8
      Chris Mason 提交于
      btrfs_tree_locked was being used to make sure a given extent_buffer was
      properly locked in a few places.  But, it wasn't correct for UP compiled
      kernels.
      
      This switches it to using assert_spin_locked instead, and renames it to
      btrfs_assert_tree_locked to better reflect how it was really being used.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b9447ef8
  4. 20 2月, 2009 1 次提交
    • J
      Btrfs: try committing transaction before returning ENOSPC · 4e06bdd6
      Josef Bacik 提交于
      This fixes a problem where we could return -ENOSPC when we may actually have
      plenty of space, the space is just pinned.  Instead of returning -ENOSPC
      immediately, commit the transaction first and then try and do the allocation
      again.
      
      This patch also does chunk allocation for metadata if we pass the 80%
      threshold for metadata space.  This will help with stack usage since the chunk
      allocation will happen early on, instead of when the allocation is happening.
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      
      4e06bdd6
  5. 21 2月, 2009 1 次提交
    • J
      Btrfs: add better -ENOSPC handling · 6a63209f
      Josef Bacik 提交于
      This is a step in the direction of better -ENOSPC handling.  Instead of
      checking the global bytes counter we check the space_info bytes counters to
      make sure we have enough space.
      
      If we don't we go ahead and try to allocate a new chunk, and then if that fails
      we return -ENOSPC.  This patch adds two counters to btrfs_space_info,
      bytes_delalloc and bytes_may_use.
      
      bytes_delalloc account for extents we've actually setup for delalloc and will
      be allocated at some point down the line. 
      
      bytes_may_use is to keep track of how many bytes we may use for delalloc at
      some point.  When we actually set the extent_bit for the delalloc bytes we
      subtract the reserved bytes from the bytes_may_use counter.  This keeps us from
      not actually being able to allocate space for any delalloc bytes.
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      
      
      
      6a63209f
  6. 13 2月, 2009 2 次提交
    • Y
      Btrfs: hold trans_mutex when using btrfs_record_root_in_trans · 24562425
      Yan Zheng 提交于
      btrfs_record_root_in_trans needs the trans_mutex held to make sure two
      callers don't race to setup the root in a given transaction.  This adds
      it to all the places that were missing it.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      24562425
    • C
      Btrfs: make a lockdep class for the extent buffer locks · 4008c04a
      Chris Mason 提交于
      Btrfs is currently using spin_lock_nested with a nested value based
      on the tree depth of the block.  But, this doesn't quite work because
      the max tree depth is bigger than what spin_lock_nested can deal with,
      and because locks are sometimes taken before the level field is filled in.
      
      The solution here is to use lockdep_set_class_and_name instead, and to
      set the class before unlocking the pages when the block is read from the
      disk and just after init of a freshly allocated tree block.
      
      btrfs_clear_path_blocking is also changed to take the locks in the proper
      order, and it also makes sure all the locks currently held are properly
      set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
      gets upset about bad lock orderin.
      
      The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4008c04a
  7. 12 2月, 2009 2 次提交
    • C
      Btrfs: use larger metadata clusters in ssd mode · 536ac8ae
      Chris Mason 提交于
      Larger metadata clusters can significantly improve writeback performance
      on ssd drives with large erasure blocks.  The larger clusters make it
      more likely a given IO will completely overwrite the ssd block, so it
      doesn't have to do an internal rwm cycle.
      
      On spinning media, lager metadata clusters end up spreading out the
      metadata more over time, which makes fsck slower, so we don't want this
      to be the default.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      536ac8ae
    • J
      Btrfs: make sure all pending extent operations are complete · eb099670
      Josef Bacik 提交于
      Theres a slight problem with finish_current_insert, if we set all to 1 and then
      go through and don't actually skip any of the extents on the pending list, we
      could exit right after we've added new extents.
      
      This is a problem because by inserting the new extents we could have gotten new
      COW's to happen and such, so we may have some pending updates to do or even
      more inserts to do after that.
      
      So this patch will only exit if we have never skipped any of the extents in the
      pending list, and we have no extents to insert, this will make sure that all of
      the pending work is truly done before we return.  I've been running with this
      patch for a few days with all of my other testing and have not seen issues.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      eb099670
  8. 05 2月, 2009 1 次提交
  9. 04 2月, 2009 3 次提交
    • C
      Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks · bd56b302
      Chris Mason 提交于
      Every transaction in btrfs creates a new snapshot, and then schedules the
      snapshot from the last transaction for deletion.  Snapshot deletion
      works by walking down the btree and dropping the reference counts
      on each btree block during the walk.
      
      If if a given leaf or node has a reference count greater than one,
      the reference count is decremented and the subtree pointed to by that
      node is ignored.
      
      If the reference count is one, walking continues down into that node
      or leaf, and the references of everything it points to are decremented.
      
      The old code would try to work in small pieces, walking down the tree
      until it found the lowest leaf or node to free and then returning.  This
      was very friendly to the rest of the FS because it didn't have a huge
      impact on other operations.
      
      But it wouldn't always keep up with the rate that new commits added new
      snapshots for deletion, and it wasn't very optimal for the extent
      allocation tree because it wasn't finding leaves that were close together
      on disk and processing them at the same time.
      
      This changes things to walk down to a level 1 node and then process it
      in bulk.  All the leaf pointers are sorted and the leaves are dropped
      in order based on their extent number.
      
      The extent allocation tree and commit code are now fast enough for
      this kind of bulk processing to work without slowing the rest of the FS
      down.  Overall it does less IO and is better able to keep up with
      snapshot deletions under high load.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      bd56b302
    • C
      Btrfs: Change btree locking to use explicit blocking points · b4ce94de
      Chris Mason 提交于
      Most of the btrfs metadata operations can be protected by a spinlock,
      but some operations still need to schedule.
      
      So far, btrfs has been using a mutex along with a trylock loop,
      most of the time it is able to avoid going for the full mutex, so
      the trylock loop is a big performance gain.
      
      This commit is step one for getting rid of the blocking locks entirely.
      btrfs_tree_lock takes a spinlock, and the code explicitly switches
      to a blocking lock when it starts an operation that can schedule.
      
      We'll be able get rid of the blocking locks in smaller pieces over time.
      Tracing allows us to find the most common cause of blocking, so we
      can start with the hot spots first.
      
      The basic idea is:
      
      btrfs_tree_lock() returns with the spin lock held
      
      btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
      the extent buffer flags, and then drops the spin lock.  The buffer is
      still considered locked by all of the btrfs code.
      
      If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
      the spin lock and waits on a wait queue for the blocking bit to go away.
      
      Much of the code that needs to set the blocking bit finishes without actually
      blocking a good percentage of the time.  So, an adaptive spin is still
      used against the blocking bit to avoid very high context switch rates.
      
      btrfs_clear_lock_blocking() clears the blocking bit and returns
      with the spinlock held again.
      
      btrfs_tree_unlock() can be called on either blocking or spinning locks,
      it does the right thing based on the blocking bit.
      
      ctree.c has a helper function to set/clear all the locked buffers in a
      path as blocking.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b4ce94de
    • C
      Btrfs: sort references by byte number during btrfs_inc_ref · b7a9f29f
      Chris Mason 提交于
      When a block goes through cow, we update the reference counts of
      everything that block points to.  The internal pointers of the block
      can be in just about any order, and it is likely to have clusters of
      things that are close together and clusters of things that are not.
      
      To help reduce the seeks that come with updating all of these reference
      counts, sort them by byte number before actual updates are done.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b7a9f29f
  10. 22 1月, 2009 1 次提交
    • Y
      Btrfs: fix tree logs parallel sync · 7237f183
      Yan Zheng 提交于
      To improve performance, btrfs_sync_log merges tree log sync
      requests. But it wrongly merges sync requests for different
      tree logs. If multiple tree logs are synced at the same time,
      only one of them actually gets synced.
      
      This patch has following changes to fix the bug:
      
      Move most tree log related fields in btrfs_fs_info to
      btrfs_root. This allows merging sync requests separately
      for each tree log.
      
      Don't insert root item into the log root tree immediately
      after log tree is allocated. Root item for log tree is
      inserted when log tree get synced for the first time. This
      allows syncing the log root tree without first syncing all
      log trees.
      
      At tree-log sync, btrfs_sync_log first sync the log tree;
      then updates corresponding root item in the log root tree;
      sync the log root tree; then update the super block.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      7237f183
  11. 21 1月, 2009 6 次提交
  12. 07 1月, 2009 1 次提交
    • Y
      Btrfs: tree logging checksum fixes · 07d400a6
      Yan Zheng 提交于
      This patch contains following things.
      
      1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE.  This
      struct is kmalloced so we want to keep it reasonable.
      
      2) Replace copy_extent_csums by btrfs_lookup_csums_range.  This was
      duplicated code in tree-log.c
      
      3) Remove replay_one_csum. csum items are replayed at the same time as
         replaying file extents. This guarantees we only replay useful csums.
      
      4) nbytes accounting fix.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      07d400a6
  13. 06 1月, 2009 3 次提交
    • C
    • C
      Btrfs: Fix checkpatch.pl warnings · d397712b
      Chris Mason 提交于
      There were many, most are fixed now.  struct-funcs.c generates some warnings
      but these are bogus.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      d397712b
    • L
      Btrfs: Fix free block discard calls down to the block layer · 1f3c79a2
      Liu Hui 提交于
      This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
      We can improve FTL's performance by telling it which sectors are freed by file
      system. But if we don't tell FTL the information of free sectors in proper
      time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
      roll back the previous transaction under the power loss condition.
      
      There are some problems in the old implementation:
      1, In __free_extent(), the pinned down extents should not be discarded.
      2, In free_extents(), the free extents are all pinned, so they need to
      be discarded in transaction committing time instead of free_extents().
      3, The reserved extent used by log tree should be discard too.
      
      This patch change discard behavior as follows:
      1, For the extents which need to be free at once,
         we discard them in update_block_group().
      2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
         when committing transaction.
      3, Remove discarding from free_extents() and __free_extent()
      4, Add discard interface into btrfs_free_reserved_extent()
      5, Discard sectors before updating the free space cache, otherwise,
         FTL will destroy file system data.
      1f3c79a2
  14. 19 12月, 2008 2 次提交
  15. 17 12月, 2008 1 次提交
    • C
      Btrfs: delete checksum items before marking blocks free · dcbdd4dc
      Chris Mason 提交于
      Btrfs maintains a cache of blocks available for allocation in ram.  The
      code that frees extents was marking the extents free and then deleting
      the checksum items.
      
      This meant it was possible the extent would be reallocated before the
      checksum item was actually deleted, leading to races and other
      problems as the checksums were updated for the newly allocated extent.
      
      The fix is to delete the checksum before marking the extent free.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      dcbdd4dc
  16. 16 12月, 2008 1 次提交
  17. 12 12月, 2008 3 次提交
    • Y
      Btrfs: fix nodatasum handling in balancing code · 17d217fe
      Yan Zheng 提交于
      Checksums on data can be disabled by mount option, so it's
      possible some data extents don't have checksums or have
      invalid checksums. This causes trouble for data relocation.
      This patch contains following things to make data relocation
      work.
      
      1) make nodatasum/nodatacow mount option only affects new
      files. Checksums and COW on data are only controlled by the
      inode flags.
      
      2) check the existence of checksum in the nodatacow checker.
      If checksums exist, force COW the data extent. This ensure that
      checksum for a given block is either valid or does not exist.
      
      3) update data relocation code to properly handle the case
      of checksum missing.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      17d217fe
    • Y
      Btrfs: shared seed device · e4404d6e
      Yan Zheng 提交于
      This patch makes seed device possible to be shared by
      multiple mounted file systems. The sharing is achieved
      by cloning seed device's btrfs_fs_devices structure.
      Thanks you,
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      e4404d6e
    • Y
      Btrfs: fix leaking block group on balance · d2fb3437
      Yan Zheng 提交于
      The block group structs are referenced in many different
      places, and it's not safe to free while balancing.  So, those block
      group structs were simply leaked instead.
      
      This patch replaces the block group pointer in the inode with the starting byte
      offset of the block group and adds reference counting to the block group
      struct.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      d2fb3437
  18. 11 12月, 2008 1 次提交
  19. 10 12月, 2008 1 次提交
    • C
      Btrfs: Delete csum items when freeing extents · 459931ec
      Chris Mason 提交于
      This finishes off the new checksumming code by removing csum items
      for extents that are no longer in use.
      
      The trick is doing it without racing because a single csum item may
      hold csums for more than one extent.  Extra checks are added to
      btrfs_csum_file_blocks to make sure that we are using the correct
      csum item after dropping locks.
      
      A new btrfs_split_item is added to split a single csum item so it
      can be split without dropping the leaf lock.  This is used to
      remove csum bytes from the middle of an item.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      459931ec
  20. 09 12月, 2008 1 次提交
    • Y
      Btrfs: superblock duplication · a512bbf8
      Yan Zheng 提交于
      This patch implements superblock duplication. Superblocks
      are stored at offset 16K, 64M and 256G on every devices.
      Spaces used by superblocks are preserved by the allocator,
      which uses a reverse mapping function to find the logical
      addresses that correspond to superblocks. Thank you,
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      a512bbf8
  21. 02 12月, 2008 1 次提交
  22. 21 11月, 2008 1 次提交
  23. 20 11月, 2008 1 次提交