1. 13 Feb, 2009 1 commit
    • C
      Btrfs: make a lockdep class for the extent buffer locks · 4008c04a
      Chris Mason committed
      Btrfs is currently using spin_lock_nested with a nested value based
      on the tree depth of the block.  But, this doesn't quite work because
      the max tree depth is bigger than what spin_lock_nested can deal with,
      and because locks are sometimes taken before the level field is filled in.
      
      The solution here is to use lockdep_set_class_and_name instead, and to
      set the class before unlocking the pages when the block is read from the
      disk and just after init of a freshly allocated tree block.
      
      btrfs_clear_path_blocking is also changed to take the locks in the proper
      order, and it also makes sure all the locks currently held are properly
      set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
      gets upset about bad lock ordering.
      
      The lockdep magic came from Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  2. 12 Feb, 2009 2 commits
    • C
      Btrfs: use larger metadata clusters in ssd mode · 536ac8ae
      Chris Mason committed
      Larger metadata clusters can significantly improve writeback performance
      on ssd drives with large erasure blocks.  The larger clusters make it
      more likely a given IO will completely overwrite the ssd block, so it
      doesn't have to do an internal rmw (read-modify-write) cycle.
      
      On spinning media, larger metadata clusters end up spreading out the
      metadata more over time, which makes fsck slower, so we don't want this
      to be the default.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • J
      Btrfs: make sure all pending extent operations are complete · eb099670
      Josef Bacik committed
      There's a slight problem with finish_current_insert: if we set all to 1 and then
      go through and don't actually skip any of the extents on the pending list, we
      could exit right after we've added new extents.
      
      This is a problem because inserting the new extents could have caused new
      COWs to happen, so we may have some pending updates to do or even
      more inserts to do after that.
      
      So this patch will only exit if we have never skipped any of the extents in the
      pending list and we have no extents to insert.  This makes sure that all of
      the pending work is truly done before we return.  I've been running with this
      patch for a few days with all of my other testing and have not seen issues.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
  3. 05 Feb, 2009 1 commit
  4. 04 Feb, 2009 3 commits
    • C
      Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks · bd56b302
      Chris Mason committed
      Every transaction in btrfs creates a new snapshot, and then schedules the
      snapshot from the last transaction for deletion.  Snapshot deletion
      works by walking down the btree and dropping the reference counts
      on each btree block during the walk.
      
      If a given leaf or node has a reference count greater than one,
      the reference count is decremented and the subtree pointed to by that
      node is ignored.
      
      If the reference count is one, walking continues down into that node
      or leaf, and the references of everything it points to are decremented.
      
      The old code would try to work in small pieces, walking down the tree
      until it found the lowest leaf or node to free and then returning.  This
      was very friendly to the rest of the FS because it didn't have a huge
      impact on other operations.
      
      But it wouldn't always keep up with the rate that new commits added new
      snapshots for deletion, and it wasn't very optimal for the extent
      allocation tree because it wasn't finding leaves that were close together
      on disk and processing them at the same time.
      
      This changes things to walk down to a level 1 node and then process it
      in bulk.  All the leaf pointers are sorted and the leaves are dropped
      in order based on their extent number.
      
      The extent allocation tree and commit code are now fast enough for
      this kind of bulk processing to work without slowing the rest of the FS
      down.  Overall it does less IO and is better able to keep up with
      snapshot deletions under high load.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • C
      Btrfs: Change btree locking to use explicit blocking points · b4ce94de
      Chris Mason committed
      Most of the btrfs metadata operations can be protected by a spinlock,
      but some operations still need to schedule.
      
      So far, btrfs has been using a mutex along with a trylock loop.
      Most of the time it is able to avoid going for the full mutex, so
      the trylock loop is a big performance gain.
      
      This commit is step one for getting rid of the blocking locks entirely.
      btrfs_tree_lock takes a spinlock, and the code explicitly switches
      to a blocking lock when it starts an operation that can schedule.
      
      We'll be able to get rid of the blocking locks in smaller pieces over time.
      Tracing allows us to find the most common cause of blocking, so we
      can start with the hot spots first.
      
      The basic idea is:
      
      btrfs_tree_lock() returns with the spin lock held
      
      btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
      the extent buffer flags, and then drops the spin lock.  The buffer is
      still considered locked by all of the btrfs code.
      
      If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
      the spin lock and waits on a wait queue for the blocking bit to go away.
      
      Much of the code that needs to set the blocking bit finishes without actually
      blocking a good percentage of the time.  So, an adaptive spin is still
      used against the blocking bit to avoid very high context switch rates.
      
      btrfs_clear_lock_blocking() clears the blocking bit and returns
      with the spinlock held again.
      
      btrfs_tree_unlock() can be called on either blocking or spinning locks;
      it does the right thing based on the blocking bit.
      
      ctree.c has a helper function to set/clear all the locked buffers in a
      path as blocking.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
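      The spin-then-block protocol described in this commit can be sketched in
      userspace. This is only an illustration, not the kernel code: a pthread
      mutex stands in for the spinlock, a condition variable stands in for the
      wait queue, and every name (eb_lock, eb_tree_lock, and so on) is
      hypothetical. The adaptive spin against the blocking bit is not modeled.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for an extent buffer lock. */
struct eb_lock {
    pthread_mutex_t spin;     /* stands in for the kernel spinlock */
    pthread_cond_t  wq;       /* stands in for the wait queue */
    bool            blocking; /* stands in for EXTENT_BUFFER_BLOCKING */
};

void eb_lock_init(struct eb_lock *l)
{
    pthread_mutex_init(&l->spin, NULL);
    pthread_cond_init(&l->wq, NULL);
    l->blocking = false;
}

/* Returns with the "spinlock" held and the blocking bit clear. */
void eb_tree_lock(struct eb_lock *l)
{
    pthread_mutex_lock(&l->spin);
    /* If the owner went blocking, drop the lock and sleep until it is done. */
    while (l->blocking)
        pthread_cond_wait(&l->wq, &l->spin);
}

/* Owner is about to sleep: set the bit, then drop the "spinlock".
 * The buffer still counts as locked because the bit is set. */
void eb_set_lock_blocking(struct eb_lock *l)
{
    l->blocking = true;
    pthread_mutex_unlock(&l->spin);
}

/* Owner is done sleeping: retake the "spinlock" and clear the bit. */
void eb_clear_lock_blocking(struct eb_lock *l)
{
    pthread_mutex_lock(&l->spin);
    l->blocking = false;
}

/* Works for both states: picks the path based on the blocking bit. */
void eb_tree_unlock(struct eb_lock *l)
{
    /* Only the owner writes the bit, so the owner may read it unlocked. */
    if (l->blocking) {
        pthread_mutex_lock(&l->spin);
        l->blocking = false;
    }
    pthread_cond_broadcast(&l->wq); /* wake anyone waiting on the bit */
    pthread_mutex_unlock(&l->spin);
}
```

      A caller would take eb_tree_lock(), call eb_set_lock_blocking() before
      anything that may sleep, and finish with eb_tree_unlock() in either state.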
    • C
      Btrfs: sort references by byte number during btrfs_inc_ref · b7a9f29f
      Chris Mason committed
      When a block goes through cow, we update the reference counts of
      everything that block points to.  The internal pointers of the block
      can be in just about any order, and it is likely to have clusters of
      things that are close together and clusters of things that are not.
      
      To help reduce the seeks that come with updating all of these reference
      counts, sort them by byte number before actual updates are done.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
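      A minimal sketch of the sorting step, assuming each pointer in the block
      is first copied into a small array. The struct ref_sort record and the
      function names are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record: one pointer found in the block being COWed. */
struct ref_sort {
    uint64_t bytenr; /* disk byte number the pointer refers to */
    uint32_t slot;   /* original slot of the pointer in the block */
};

/* Order by byte number; compare explicitly to avoid 64-bit truncation. */
int ref_sort_cmp(const void *a, const void *b)
{
    const struct ref_sort *ra = a;
    const struct ref_sort *rb = b;

    if (ra->bytenr < rb->bytenr)
        return -1;
    if (ra->bytenr > rb->bytenr)
        return 1;
    return 0;
}

/* Sort so that reference updates walk the disk in ascending order. */
void sort_refs_by_bytenr(struct ref_sort *refs, size_t nr)
{
    qsort(refs, nr, sizeof(*refs), ref_sort_cmp);
}
```

      After sorting, the caller would walk the array in order and use the
      saved slot to find each pointer again in the block.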
  5. 22 Jan, 2009 1 commit
    • Y
      Btrfs: fix tree logs parallel sync · 7237f183
      Yan Zheng committed
      To improve performance, btrfs_sync_log merges tree log sync
      requests. But it wrongly merges sync requests for different
      tree logs. If multiple tree logs are synced at the same time,
      only one of them actually gets synced.
      
      This patch makes the following changes to fix the bug:
      
      Move most tree log related fields in btrfs_fs_info to
      btrfs_root. This allows merging sync requests separately
      for each tree log.
      
      Don't insert the root item into the log root tree immediately
      after the log tree is allocated. The root item for a log tree is
      inserted when the log tree gets synced for the first time. This
      allows syncing the log root tree without first syncing all
      log trees.
      
      At tree-log sync, btrfs_sync_log first syncs the log tree,
      then updates the corresponding root item in the log root tree,
      syncs the log root tree, and then updates the super block.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  6. 21 Jan, 2009 6 commits
  7. 07 Jan, 2009 1 commit
    • Y
      Btrfs: tree logging checksum fixes · 07d400a6
      Yan Zheng committed
      This patch contains the following changes.
      
      1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE.  This
      struct is kmalloced so we want to keep it reasonable.
      
      2) Replace copy_extent_csums with btrfs_lookup_csums_range.  This was
      duplicated code in tree-log.c.
      
      3) Remove replay_one_csum. csum items are replayed at the same time as
         replaying file extents. This guarantees we only replay useful csums.
      
      4) nbytes accounting fix.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  8. 06 Jan, 2009 3 commits
    • C
    • C
      Btrfs: Fix checkpatch.pl warnings · d397712b
      Chris Mason committed
      There were many; most are fixed now.  struct-funcs.c generates some warnings,
      but these are bogus.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • L
      Btrfs: Fix free block discard calls down to the block layer · 1f3c79a2
      Liu Hui committed
      This patch fixes the discard semantics so that Btrfs works with FTLs and SSDs.
      We can improve an FTL's performance by telling it which sectors the file
      system has freed. But if we don't pass this information at the proper
      time, the transaction mechanism of Btrfs breaks and Btrfs cannot
      roll back to the previous transaction after a power loss.
      
      There are some problems in the old implementation:
      1. In __free_extent(), the pinned-down extents should not be discarded.
      2. In free_extents(), the free extents are all pinned, so they need to
         be discarded at transaction commit time instead of in free_extents().
      3. The reserved extents used by the log tree should be discarded too.
      
      This patch changes the discard behavior as follows:
      1. For the extents which need to be freed at once,
         we discard them in update_block_group().
      2. Delay discarding the pinned extents until btrfs_finish_extent_commit()
         when committing the transaction.
      3. Remove discarding from free_extents() and __free_extent().
      4. Add a discard interface to btrfs_free_reserved_extent().
      5. Discard sectors before updating the free space cache; otherwise, the
         FTL will destroy file system data.
  9. 19 Dec, 2008 2 commits
  10. 17 Dec, 2008 1 commit
    • C
      Btrfs: delete checksum items before marking blocks free · dcbdd4dc
      Chris Mason committed
      Btrfs maintains a cache of blocks available for allocation in ram.  The
      code that frees extents was marking the extents free and then deleting
      the checksum items.
      
      This meant it was possible the extent would be reallocated before the
      checksum item was actually deleted, leading to races and other
      problems as the checksums were updated for the newly allocated extent.
      
      The fix is to delete the checksum before marking the extent free.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  11. 16 Dec, 2008 1 commit
  12. 12 Dec, 2008 3 commits
    • Y
      Btrfs: fix nodatasum handling in balancing code · 17d217fe
      Yan Zheng committed
      Checksums on data can be disabled by mount option, so it's
      possible some data extents don't have checksums or have
      invalid checksums. This causes trouble for data relocation.
      This patch makes the following changes so that data relocation
      works.
      
      1) Make the nodatasum/nodatacow mount options affect only new
      files. Checksums and COW on data are controlled only by the
      inode flags.
      
      2) Check for the existence of checksums in the nodatacow checker.
      If checksums exist, force COW of the data extent. This ensures that
      the checksum for a given block is either valid or does not exist.
      
      3) Update the data relocation code to properly handle the case
      of missing checksums.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Y
      Btrfs: shared seed device · e4404d6e
      Yan Zheng committed
      This patch makes it possible for a seed device to be shared by
      multiple mounted file systems. The sharing is achieved
      by cloning the seed device's btrfs_fs_devices structure.
      Thank you,
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Y
      Btrfs: fix leaking block group on balance · d2fb3437
      Yan Zheng committed
      The block group structs are referenced in many different
      places, and it's not safe to free them while balancing.  So, those block
      group structs were simply leaked instead.
      
      This patch replaces the block group pointer in the inode with the starting byte
      offset of the block group and adds reference counting to the block group
      struct.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
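      The reference counting idea can be sketched as follows. This is a
      userspace illustration under stated assumptions: the struct layout and
      the bg_alloc/bg_get/bg_put names are hypothetical, and C11 atomics stand
      in for whatever counter the kernel code uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical block group with a reference count. */
struct block_group {
    uint64_t   start; /* starting byte offset, the value an inode would store */
    atomic_int refs;  /* number of holders; the struct is freed at zero */
};

/* Allocate a block group that starts with one reference for the creator. */
struct block_group *bg_alloc(uint64_t start)
{
    struct block_group *bg = malloc(sizeof(*bg));

    if (!bg)
        return NULL;
    bg->start = start;
    atomic_init(&bg->refs, 1);
    return bg;
}

/* Take an extra reference before using the struct outside the owner. */
void bg_get(struct block_group *bg)
{
    atomic_fetch_add(&bg->refs, 1);
}

/* Drop a reference; returns 1 if this was the last one and bg was freed. */
int bg_put(struct block_group *bg)
{
    if (atomic_fetch_sub(&bg->refs, 1) == 1) {
        free(bg);
        return 1;
    }
    return 0;
}
```

      Because an inode keeps only the start offset, it would look the block
      group up when needed and hold a reference just for the duration of use.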
  13. 11 Dec, 2008 1 commit
  14. 10 Dec, 2008 1 commit
    • C
      Btrfs: Delete csum items when freeing extents · 459931ec
      Chris Mason committed
      This finishes off the new checksumming code by removing csum items
      for extents that are no longer in use.
      
      The trick is doing it without racing because a single csum item may
      hold csums for more than one extent.  Extra checks are added to
      btrfs_csum_file_blocks to make sure that we are using the correct
      csum item after dropping locks.
      
      A new btrfs_split_item is added to split a single csum item so it
      can be split without dropping the leaf lock.  This is used to
      remove csum bytes from the middle of an item.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  15. 09 Dec, 2008 1 commit
    • Y
      Btrfs: superblock duplication · a512bbf8
      Yan Zheng committed
      This patch implements superblock duplication. Superblocks
      are stored at offsets 16K, 64M and 256G on every device.
      Space used by superblocks is reserved by the allocator,
      which uses a reverse mapping function to find the logical
      addresses that correspond to superblocks. Thank you,
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
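      The three offsets are each 4096 times the previous one, so they can be
      derived from a single base value and a shift. The sketch below shows
      that arithmetic; the macro and function names are hypothetical, and the
      shift-based formula is an assumption that reproduces the offsets named
      in the commit message rather than a quote from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SUPER_INFO_OFFSET  (16ULL * 1024) /* first copy at 16K */
#define SUPER_MIRROR_SHIFT 12             /* each copy is 4096x further out */
#define SUPER_MIRROR_MAX   3              /* 16K, 64M and 256G */

/* Byte offset of superblock copy number `mirror` (0, 1 or 2) on a device. */
uint64_t sb_offset(int mirror)
{
    assert(mirror >= 0 && mirror < SUPER_MIRROR_MAX);
    return SUPER_INFO_OFFSET << (SUPER_MIRROR_SHIFT * mirror);
}
```

      A device smaller than a given offset simply cannot hold that copy, so a
      caller would compare sb_offset() against the device size before writing.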
  16. 02 Dec, 2008 1 commit
  17. 21 Nov, 2008 1 commit
  18. 20 Nov, 2008 3 commits
    • C
      Btrfs: compat code fixes · 4b4e25f2
      Chris Mason committed
      The btrfs git kernel tree is used to build a standalone tree for
      compiling against older kernels.  This commit makes the standalone tree
      work with 2.6.27.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • C
      Btrfs: Fixes for 2.6.28-rc API changes · 15916de8
      Chris Mason committed
      * open/close_bdev_excl -> open/close_bdev_exclusive
      * blkdev_issue_discard takes a GFP mask now
      * Fix blkdev_issue_discard usage now that it is enabled
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • J
      Btrfs: fix free space accounting when unpinning extents · 07103a3c
      Josef Bacik committed
      This patch fixes what I hope is the last early ENOSPC bug left.  I did not know
      that pinned extents would merge into one big extent when inserted on to the
      pinned extent tree, so I was adding free space to a block group that could
      possibly span multiple block groups.
      
      This is a big issue because first that space doesn't exist in that block group,
      and second we won't actually use that space because there are a bunch of other
      checks to make sure we're allocating within the constraints of the block group.
      
      This patch fixes the problem by adding the btrfs_add_free_space to
      btrfs_update_pinned_extents which makes sure we are adding the appropriate
      amount of free space to the appropriate block group.  Thanks much to Lee Trager
      for running my myriad of debug patches to help me track this problem down.
      Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
  19. 19 Nov, 2008 1 commit
    • L
      Btrfs: Some fixes for batching extent insert. · b4eec2ca
      Liu Hui committed
      In insert_extents(), when ret==1 and last is not zero, it should
      check whether the item just inserted is the last item in this batch
      of inserts. If so, it should just break out of the loop. Otherwise, 'cur =
      insert_list->next' makes no sense because the list is now empty,
      and 'op' will point to an unexpected place.
      
      There are also some trivial fixes in this patch, including one comment
      typo and the deletion of two redundant lines.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  20. 18 Nov, 2008 3 commits
    • J
      Btrfs: Add some debugging around the ENOSPC bugs · 4ce4cb52
      Josef Bacik committed
      Some people are still reporting problems with early enospc.  This
      will help narrow down the cause.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • J
      Btrfs: fix free space leak · e3e469f8
      Josef Bacik committed
      In my batch delete/update/insert patch I introduced a free space leak.  The
      extent that we do the original search on in free_extents is never pinned, so we
      always update the block saying that it has free space, but the free space never
      actually gets added to the free space tree, since op->del will always be 0 and
      it's never actually added to the pinned extents tree.
      
      This patch fixes this problem by making sure we call pin_down_bytes on the
      pending extent op and set op->del to the return value of pin_down_bytes so
      update_block_group is called with the right value.  This seems to fix the case
      where we were getting ENOSPC when there was plenty of space available.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
    • Y
      Btrfs: Seed device support · 2b82032c
      Yan Zheng committed
      A seed device is a special btrfs with the SEEDING super flag
      set, and it can only be mounted in read-only mode. Seed
      devices allow people to create a new btrfs on top of them.
      
      The new FS contains the same contents as the seed device,
      but it can be mounted in read-write mode.
      
      This patch does the following:
      
      1) Split the code in btrfs_alloc_chunk into two parts. The first part makes
      the newly allocated chunk usable, but does not do any operation that modifies
      the chunk tree. The second part does the chunk tree modifications. This
      division is for the bootstrap step of adding storage to the seed device.
      
      2) Update the device management code to handle seed devices.
      The basic idea is: for an FS grown from seed devices, its
      seed devices are put into a list. Seed devices are
      opened on demand at mount time. If any seed device is
      missing or has been changed, the btrfs kernel module will
      refuse to mount the FS.
      
      3) Make btrfs_find_block_group not return NULL when all
      block groups are read-only.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  21. 13 Nov, 2008 3 commits
    • Y
      Btrfs: mount ro and remount support · c146afad
      Yan Zheng committed
      This patch adds mount ro and remount support. The main
      changes in this patch are: adding btrfs_remount and related
      helper functions; splitting the transaction related code
      out of close_ctree into btrfs_commit_super; updating the
      allocator to properly handle read-only block groups.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • J
      Btrfs: batch extent inserts/updates/deletions on the extent root · f3465ca4
      Josef Bacik committed
      While profiling the allocator I noticed a good amount of time was being spent in
      finish_current_insert and del_pending_extents, and as the filesystem filled up
      more and more time was being spent in those functions.  This patch aims to
      reduce that problem in two ways:
      
      1) track if we tried to delete an extent that we are going to update or insert.
      Once we get into finish_current_insert we discard any of the extents that were
      marked for deletion.  This saves us from doing unnecessary work almost every
      time finish_current_insert runs.
      
      2) Batch insertion/updates/deletions.  Instead of doing a btrfs_search_slot for
      each individual extent and doing the needed operation, we instead keep the leaf
      around and see if there is anything else we can do on that leaf.  On the insert
      case I introduced a btrfs_insert_some_items, which will take an array of keys
      with an array of data_sizes and try and squeeze in as many of those keys as
      possible, and then return how many keys it was able to insert.  In the update
      case we search for an extent ref, update the ref and then loop through the leaf
      to see if any of the other refs we are looking to update are on that leaf, and
      then once we are done we release the path and search for the next ref we need to
      update.  And finally for the deletion we try and delete the extent+ref in pairs,
      so we will try to find extent+ref pairs next to the extent we are trying to free
      and free them in bulk if possible.
      
      This, along with the other cluster fix that Chris pushed out a bit ago, helps make
      the allocator perform more uniformly as it fills up the disk.  There is still a
      slight drop as we fill up the disk since we start having to stick new blocks in
      odd places, which results in more COWs than on an empty fs, but the drop is not
      nearly as severe as it was before.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
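      The "squeeze in as many keys as possible" behavior described for
      btrfs_insert_some_items can be illustrated with a small counting
      routine. This is a sketch under stated assumptions, not the patch's
      code: the function name, the flat free-space model, and the per-item
      header size of 25 bytes are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed per-item overhead inside a leaf, in bytes. */
#define ITEM_HEADER_SIZE 25u

/*
 * Given the free bytes left in a leaf and the data sizes of the items to
 * insert (in key order), return how many leading items fit. Items are
 * taken strictly in order, so the first one that does not fit stops the
 * batch and the caller handles the rest with a new search.
 */
size_t count_items_that_fit(uint32_t free_space,
                            const uint32_t *data_sizes, size_t nr)
{
    uint64_t used = 0;
    size_t i;

    for (i = 0; i < nr; i++) {
        uint64_t need = (uint64_t)data_sizes[i] + ITEM_HEADER_SIZE;

        if (used + need > free_space)
            break;
        used += need;
    }
    return i;
}
```

      The returned count tells the caller how far to advance in its pending
      list before releasing the path and searching again.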
    • C
      Btrfs: Fix handling of space info full during allocations · 2ed6d664
      Chris Mason committed
      When we fail to allocate a new block group, we should still do the
      checks to make sure allocations try again with the minimum requested
      allocation size.
      
      This also fixes a deadlock that comes from a missed down_read in
      the chunk allocation failure handling.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>