1. 18 Sep, 2009 1 commit
    • Btrfs: improve async block group caching · 11833d66
      Yan Zheng committed
      This patch gets rid of two limitations of async block group caching.
      The old code delays handling pinned extents while a block group is
      being cached. To allocate logged file extents, the old code needs to
      wait until the block group is fully cached. To get rid of these
      limitations, this patch introduces a data structure to track the
      progress of caching. Based on the caching progress, we know which
      extents should be added to the free space cache when handling the
      pinned extents. The logged file extents are handled in a similar way.
      
      This patch also changes how pinned extents are tracked. The old
      code uses one tree to track pinned extents, and copies the pinned
      extents tree at transaction commit time. This patch makes it use
      two trees to track pinned extents. One tree for extents that are
      pinned in the running transaction, one tree for extents that can
      be unpinned. At transaction commit time, we swap the two trees.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
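
The two-tree scheme for pinned extents can be modelled in a few lines of Python. This is a userspace sketch, not the kernel code: the class and method names are hypothetical, plain sets stand in for the extent trees, and it assumes the "can be unpinned" set has been drained before the next commit swaps the two.

```python
class PinnedExtents:
    """Toy model of the two pinned-extent trees (all names are hypothetical)."""

    def __init__(self):
        self.pinned = set()      # extents pinned in the running transaction
        self.unpinnable = set()  # extents that can be unpinned

    def pin(self, start, length):
        self.pinned.add((start, length))

    def commit(self):
        # Swap the two sets instead of copying one into the other.
        # Assumes self.unpinnable was already drained by unpin_all().
        self.pinned, self.unpinnable = self.unpinnable, self.pinned

    def unpin_all(self):
        """Return the extents released since the last commit, in offset order."""
        freed = sorted(self.unpinnable)
        self.unpinnable.clear()
        return freed
```

The swap is constant time regardless of how many extents are pinned, which is the point of replacing the copy at commit time.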
  2. 12 Sep, 2009 1 commit
    • Btrfs: switch extent_map to a rw lock · 890871be
      Chris Mason committed
      There are two main users of the extent_map tree.  The
      first is regular file inodes, where it is evenly spread
      between readers and writers.
      
      The second is the chunk allocation tree, which maps blocks from
      logical addresses to physical ones, and it is 99.99% reads.
      
      The mapping tree is a point of lock contention during heavy IO
      workloads, so this commit switches things to a rw lock.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  3. 01 Aug, 2009 1 commit
    • Btrfs: make sure the async caching thread advances the key · 013f1b12
      Chris Mason committed
      The async caching thread can end up looping forever if a given
      search puts it at the last key in a leaf.  It will end up calling
      btrfs_next_leaf and then checking if it needs to politely drop
      the read semaphore.
      
      Most of the time this looping isn't noticed because it is able to
      make progress the next time around.  But, during log replay,
      we wait on the async caching thread to finish, and the async thread
      is waiting on the commit, and no progress is really made.
      
      The fix used here is to copy the key out of the next leaf,
      that way our search lands there properly.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
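
A simplified model may help show why copying the key out of the next leaf matters. The sketch below is an assumption-laden illustration, not the btrfs code: leaves are sorted lists of integer keys, `search` and `scan` are hypothetical helpers, and the scan re-searches at every leaf boundary to mimic dropping the lock. The `max_steps` guard exists only so the broken variant terminates.

```python
import bisect

def search(leaves, key):
    """Model of a tree search: pick the leaf whose key range covers `key`.

    Returns (leaf_index, slot). The slot can equal len(leaf) when `key`
    is larger than every item in that leaf.
    """
    firsts = [leaf[0] for leaf in leaves]
    i = max(bisect.bisect_right(firsts, key) - 1, 0)
    return i, bisect.bisect_left(leaves[i], key)

def scan(leaves, start_key, advance_key=True, max_steps=100):
    """Visit every key >= start_key, re-searching at each leaf boundary."""
    key, visited = start_key, []
    leaf, slot = search(leaves, key)
    for _ in range(max_steps):
        if slot < len(leaves[leaf]):
            visited.append(leaves[leaf][slot])
            key = leaves[leaf][slot] + 1
            slot += 1
            continue
        if leaf + 1 == len(leaves):
            return visited                # no next leaf: the scan is done
        if advance_key:
            key = leaves[leaf + 1][0]     # the fix: copy the next leaf's first key
        leaf, slot = search(leaves, key)  # lock was dropped, so search again
    return None                           # gave up: the scan was not progressing
```

With `advance_key=False` the re-search keeps landing past the last item of the same leaf, which is the looping behaviour the commit message describes.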
  4. 30 Jul, 2009 2 commits
    • Btrfs: be more polite in the async caching threads · f36f3042
      Chris Mason committed
      The semaphore used by the async caching threads can prevent a
      transaction commit, which can make the FS appear to stall.  This
      releases the semaphore more often when a transaction commit is
      in progress.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: preserve commit_root for async caching · 276e680d
      Yan Zheng committed
      The async block group caching code uses the commit_root pointer
      to get a stable version of the extent allocation tree for scanning.
      This copy of the tree root isn't going to change and it significantly
      reduces the complexity of the scanning code.
      
      During a commit, we have a loop where we update the extent allocation
      tree root.  We need to loop because updating the root pointer in
      the tree of tree roots may allocate blocks which may change the
      extent allocation tree.
      
      Right now the commit_root pointer is changed inside this loop.  It
      is more correct to change the commit_root pointer only after all the
      looping is done.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  5. 28 Jul, 2009 2 commits
    • Btrfs: Fix async caching interaction with unmount · f25784b3
      Yan Zheng committed
      - don't stop the caching thread until btrfs_commit_super returns.
      
      - if caching is interrupted by umount, set last to (u64)-1.
        Otherwise the un-scanned range of the block group will be treated
        as free extents.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: change how we unpin extents · 68b38550
      Josef Bacik committed
      We are racy with async block caching and unpinning extents.  This patch makes
      things much less complicated by only unpinning the extent if the block group is
      cached.  We check the block_group->cached var under the block_group->lock spin
      lock.  If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
      and unpin the extent and add the free space back.  If it is not set to this, we
      start the caching of the block group so the next time we unpin extents we can
      unpin the extent.  This keeps us from racing with the async caching threads,
      lets us kill the fs wide async thread counter, and keeps us from having to set
      DELALLOC bits for every extent we hit if there are caching kthreads going.
      
      One thing that needed to be changed was btrfs_free_super_mirror_extents.  Now
      instead of just looking for LOCKED extents, we also look for DIRTY extents,
      since we could have left some extents pinned in the previous transaction that
      will never get freed now that we are unmounting, which would cause us to leak
      memory.  So btrfs_free_super_mirror_extents has been changed to
      btrfs_free_pinned_extents, and it will clear the extents locked for the super
      mirror, and any remaining pinned extents that may be present.  Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
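
The unpin rule described in this commit can be sketched as a small state check under a lock. This is a userspace model with hypothetical names: only `BTRFS_CACHE_FINISHED` appears in the commit message, so the other two state names are assumptions, and setting the state to "started" stands in for kicking off the caching thread.

```python
import threading

# BTRFS_CACHE_FINISHED is named in the commit message; the other two
# state names are assumptions made for this sketch.
CACHE_NO, CACHE_STARTED, CACHE_FINISHED = range(3)

class BlockGroup:
    def __init__(self):
        self.lock = threading.Lock()  # plays the role of block_group->lock
        self.cached = CACHE_NO
        self.pinned = 0               # pinned byte counter
        self.free_space = []          # (start, length) extents given back

    def pin(self, length):
        with self.lock:
            self.pinned += length

    def try_unpin(self, start, length):
        """Unpin only when caching is finished; otherwise start caching."""
        with self.lock:
            if self.cached == CACHE_FINISHED:
                self.pinned -= length
                self.free_space.append((start, length))
                return True
            if self.cached == CACHE_NO:
                self.cached = CACHE_STARTED  # stand-in for starting the caching thread
            return False
```

An extent that could not be unpinned simply stays pinned until a later attempt finds the block group fully cached.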
  6. 25 Jul, 2009 1 commit
    • Btrfs: clear all space_info->full after removing a block group · 283bb197
      Chris Mason committed
      Btrfs allocates individual extents from block groups, and each
      block group has a specific type.  It may hold metadata, data
      mirrored or striped etc.
      
      When we balance space (btrfs-vol -b) or remove a drive (btrfs-vol -r)
      we free block groups.  Once a block group is freed, the space it was
      using on the device may be available for use by new block groups.
      
      btrfs_remove_block_group was clearing the flag that said
      'our devices are full, don't even try to allocate new block groups',
      but it was only clearing that flag for a specific type of block group.
      
      This commit clears the full flag for all of the types of block groups,
      making it much more likely that we'll be able to balance space when
      the drive is close to full.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  7. 24 Jul, 2009 2 commits
    • Btrfs: async block group caching · 817d52f8
      Josef Bacik committed
      This patch moves the caching of the block group off to a kthread in order to
      allow people to allocate sooner.  Instead of blocking up behind the caching
      mutex, we kick off the caching kthread, and then attempt to make an
      allocation.  If we cannot, we wait on the block group's caching waitqueue.  The
      caching kthread wakes the waiting threads up every time it finds 2 meg
      worth of space, and then again when it's finished caching.  This is how I tested
      the speedup from this:
      
      mkfs the disk
      mount the disk
      fill the disk up with fs_mark
      unmount the disk
      mount the disk
      time touch /mnt/foo
      
      Without my changes this took 11 seconds on my box, with these changes it now
      takes 1 second.
      
      Another change that's been put in place is that we lock the super mirrors in the
      pinned extent map in order to keep us from adding that stuff as free space when
      caching the block group.  This doesn't really change anything else as far as the
      pinned extent map is concerned, since for actual pinned extents we use
      EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
      those extents to keep from leaking memory.
      
      I've also added a check where when we are reading block groups from disk, if the
      amount of space used == the size of the block group, we go ahead and mark the
      block group as cached.  This drastically reduces the amount of time it takes to
      cache the block groups.  Using the same test as above, except doing a dd to a
      file and then unmounting, it used to take 33 seconds to umount, now it takes 3
      seconds.
      
      This version uses the commit_root in the caching kthread, and then keeps track
      of how many async caching threads are running at any given time so if one of the
      async threads is still running as we cross transactions we can wait until it's
      finished before handling the pinned extents.  Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
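
The wake-up pattern from this commit (wake waiters after every 2 MB of discovered free space, then once more when caching finishes) can be sketched with a condition variable. This is a userspace model, not kernel code: the class, its fields, and the `wakeups` counter are hypothetical, and `threading.Condition` stands in for the kernel waitqueue.

```python
import threading

WAKE_BYTES = 2 * 1024 * 1024  # wake waiters after this much newly found free space

class BlockGroup:
    def __init__(self, free_extents):
        self._todo = list(free_extents)  # (start, length) pairs still to be discovered
        self.free_bytes = 0
        self.cached = False
        self.wakeups = 0                 # instrumentation for this sketch only
        self.cond = threading.Condition()

    def caching_thread(self):
        since_wake = 0
        for _start, length in self._todo:
            with self.cond:
                self.free_bytes += length
                since_wake += length
                if since_wake >= WAKE_BYTES:
                    since_wake = 0
                    self.wakeups += 1
                    self.cond.notify_all()
        with self.cond:
            self.cached = True           # final wake-up once caching is complete
            self.wakeups += 1
            self.cond.notify_all()

    def wait_for_space(self, need):
        """Block until `need` bytes are known free or caching has finished."""
        with self.cond:
            self.cond.wait_for(lambda: self.free_bytes >= need or self.cached)
            return self.free_bytes >= need
```

An allocator in this model would start `caching_thread` in a `threading.Thread`, try to allocate, and call `wait_for_space` only if the attempt fails.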
    • Btrfs: use hybrid extents+bitmap rb tree for free space · 96303081
      Josef Bacik committed
      Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
      tracking free space.  As free space gets fragmented, we end up with thousands of
      entries on an rb-tree per block group, which usually spans 1 gig of area.  Since
      we currently don't ever flush free space cache back to disk this gets to be a
      bit unwieldy on large filesystems with lots of fragmentation.
      
      This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
      space cache.  Initially we calculate a threshold of extent entries we can
      handle, which is however many extent entries we can cram into 16k of ram.  The
      maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
      will be 32k of RAM, which scales much better than we did before.
      
      Once we pass the extent threshold, we start adding bitmaps and using those
      instead for tracking the free space.  This patch also makes it so that any free
      space that's less than 4 * sectorsize goes straight into a bitmap.  This is
      nice since we try and allocate out of the front of a block group, so if the
      front of a block group is heavily fragmented and then has a huge chunk of free
      space at the end, we go ahead and add the fragmented areas to bitmaps and use a
      normal extent entry to track the big chunk at the back of the block group.
      
      I've also taken the opportunity to revamp how we search for free space.
      Previously we indexed free space via an offset indexed rb tree and a bytes
      indexed rb tree.  I've dropped the bytes indexed rb tree and use only the offset
      indexed rb tree.  This cuts the number of tree operations we were doing
      previously down by half, and gives us a little bit of a better allocation
      pattern since we will always start from a specific offset and search forward
      from there, instead of searching for the size we need and try and get it as
      close as possible to the offset we want.
      
      I've given this a healthy amount of testing pre-new format stuff, as well as
      post-new format stuff.  I've booted up my fedora box which is installed on btrfs
      with this patch and ran with it for a few days without issues.  I've not seen
      any performance regressions in any of my tests.
      
      Since the last patch Yan Zheng fixed a problem where we could have overlapping
      entries, so updating their offset inline would cause problems.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
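
The choice between an extent entry and a bitmap, as described in this commit, can be sketched as a small decision function. The 16k budget and the `4 * sectorsize` cutoff come from the commit message; the page size, sector size, and per-entry size below are assumptions picked for illustration, and the function name is hypothetical.

```python
PAGE_SIZE = 4096            # assumed page size, in bytes
SECTORSIZE = 4096           # assumed sector size, in bytes
ENTRY_SIZE = 64             # hypothetical size of one extent entry, in bytes
EXTENT_BUDGET = 16 * 1024   # RAM budget for extent entries (from the commit message)

# How many extent entries fit in the budget before bitmaps take over.
EXTENT_THRESHOLD = EXTENT_BUDGET // ENTRY_SIZE

# One bit per sector, so one PAGE_SIZE bitmap covers this many bytes of disk.
BYTES_PER_BITMAP = PAGE_SIZE * 8 * SECTORSIZE

def use_bitmap(free_len, extent_entries):
    """Decide whether a free range of `free_len` bytes goes into a bitmap."""
    if free_len < 4 * SECTORSIZE:
        return True                            # small fragments always use bitmaps
    return extent_entries >= EXTENT_THRESHOLD  # past the threshold, use bitmaps
```

Under these assumed sizes a single bitmap covers 128 MiB, so a 1 gig block group needs at most eight of them.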
  8. 22 Jul, 2009 1 commit
    • Btrfs: make sure all dirty blocks are written at commit time · 4a8c9a62
      Yan Zheng committed
      Writing dirty block groups may allocate new blocks, and so may add new delayed
      back refs. btrfs_run_delayed_refs may in turn make some block groups dirty.
      
      commit_cowonly_roots does not handle the recursion properly, and some dirty
      blocks can be left unwritten at commit time. This patch moves
      btrfs_run_delayed_refs into the loop that writes dirty block groups, and makes
      the code not break out of the loop until there are no dirty block groups or
      delayed back refs.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
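
The loop structure this commit describes, where each kind of work can create more of the other, can be sketched as follows. This is a generic fixed-point loop in Python with hypothetical names; the two callbacks stand in for writing a dirty block group and running a delayed back ref, and the sketch assumes the work eventually stops generating more work.

```python
def write_until_clean(dirty_groups, delayed_refs, write_group, run_ref):
    """Keep going until there are no dirty block groups and no delayed refs.

    write_group(g) may append to delayed_refs, and run_ref(r) may append
    to dirty_groups, so a single pass over either list is not enough.
    Returns the number of outer passes that were needed.
    """
    passes = 0
    while dirty_groups or delayed_refs:
        passes += 1
        while delayed_refs:
            run_ref(delayed_refs.pop())
        while dirty_groups:
            write_group(dirty_groups.pop())
    return passes
```

Breaking out after the first pass would leave behind whatever the last `write_group` call queued, which is the kind of unwritten dirty block the commit message refers to.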
  9. 03 Jul, 2009 2 commits
  10. 11 Jun, 2009 1 commit
  11. 10 Jun, 2009 3 commits
    • Btrfs: remove crc32c.h and use libcrc32c directly. · 163e783e
      David Woodhouse committed
      There's no need to preserve this abstraction; it used to let us use
      hardware crc32c support directly, but libcrc32c is already doing that for us
      through the crypto API -- so we're already using the Intel crc32c
      acceleration where appropriate.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: add mount -o ssd_spread to spread allocations out · 451d7585
      Chris Mason committed
      Some SSDs perform best when reusing block numbers often, while
      others perform much better when clustering strictly allocates
      big chunks of unused space.
      
      The default mount -o ssd will find rough groupings of blocks
      where there are a bunch of free blocks that might have some
      allocated blocks mixed in.
      
      mount -o ssd_spread will make sure there are no allocated blocks
      mixed in.  It should perform better on lower end SSDs.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) · 5d4f98a2
      Yan Zheng committed
      This commit introduces a new kind of back reference for btrfs metadata.
      Once a filesystem has been mounted with this commit, IT WILL NO LONGER
      BE MOUNTABLE BY OLDER KERNELS.
      
      When a tree block in subvolume tree is cow'd, the reference counts of all
      extents it points to are increased by one.  At transaction commit time,
      the old root of the subvolume is recorded in a "dead root" data structure,
      and the btree it points to is later walked, dropping reference counts
      and freeing any blocks where the reference count goes to 0.
      
      The increments done during cow and decrements done after commit cancel out,
      and the walk is a very expensive way to go about freeing the blocks that
      are no longer referenced by the new btree root.  This commit reduces the
      transaction overhead by avoiding the need for dead root records.
      
      When a non-shared tree block is cow'd, we free the old block at once, and the
      new block inherits old block's references. When a tree block with reference
      count > 1 is cow'd, we increase the reference counts of all extents
      the new block points to by one, and decrease the old block's reference count by
      one.
      
      This dead tree avoidance code removes the need to modify the reference
      counts of lower level extents when a non-shared tree block is cow'd.
      But we still need to update back ref for all pointers in the block.
      This is because the location of the block is recorded in the back ref
      item.
      
      We can solve this by introducing a new type of back ref. The new
      back ref provides information about pointer's key, level and in which
      tree the pointer lives. This information allows us to find the pointer
      by searching the tree. The shortcoming of the new back ref is that it
      only works for pointers in tree blocks referenced by their owner trees.
      
      This is mostly a problem for snapshots, where resolving one of these
      fuzzy back references would be O(number_of_snapshots) and quite slow.
      The solution used here is to use the fuzzy back references in the common
      case where a given tree block is only referenced by one root,
      and use the full back references when multiple roots have a reference
      on a given block.
      
      This commit adds a per-subvolume red-black tree to keep track of cached
      inodes. The red-black tree helps the balancing code to find cached
      inodes whose inode numbers are within a given range.
      
      This commit improves the balancing code by introducing several data
      structures to keep the state of balancing. The most important one
      is the back ref cache. It caches how the upper level tree blocks are
      referenced. This greatly reduces the overhead of checking back refs.
      
      The improved balancing code scales significantly better with a large
      number of snapshots.
      
      This is a very large commit and was written in a number of
      pieces.  But, they depend heavily on the disk format change and were
      squashed together to make sure git bisect didn't end up in a
      bad state wrt space balancing or the format change.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  12. 05 Jun, 2009 1 commit
    • Btrfs: Fix oops and use after free during space balancing · 44fb5511
      Chris Mason committed
      The btrfs allocator uses list_for_each to walk the available block
      groups when searching for free blocks.  It starts off with a hint
      to help find the best block group for a given allocation.
      
      The hint is resolved into a block group, but we don't properly check
      to make sure the block group we find isn't in the middle of being
      freed due to filesystem shrinking or balancing.  If it is being
      freed, the list pointers in it are bogus and can't be trusted.  But,
      the code happily goes along and uses them in the list_for_each loop,
      leading to all kinds of fun.
      
      The fix used here is to check to make sure the block group we find really
      is on the list before we use it.  list_del_init is used when removing
      it from the list, so we can do a proper check.
      
      The allocation clustering code has a similar bug where it will trust
      the block group in the current free space cluster.  If our allocation
      flags have changed (going from single spindle dup to raid1 for example)
      because the drives in the FS have changed, we're not allowed to use
      the old block group any more.
      
      The fix used here is to check the current cluster against the
      current allocation flags.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
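
The "is it really on the list" check relies on a property of `list_del_init`: a node removed that way points back at itself, so an emptiness test on the node tells you whether it is still linked. The Python model below mirrors the kernel's circular list semantics for illustration only; `usable_hint` is a hypothetical helper, not a btrfs function.

```python
class ListHead:
    """Circular doubly linked list node, modelled on the kernel's list_head."""

    def __init__(self):
        self.next = self.prev = self

def list_add_tail(node, head):
    node.prev, node.next = head.prev, head
    head.prev.next = node
    head.prev = node

def list_del_init(node):
    node.prev.next = node.next
    node.next.prev = node.prev
    node.next = node.prev = node  # node now points at itself

def list_empty(node):
    return node.next is node

def usable_hint(block_group_node):
    """A hinted block group is only safe to walk from if it is still linked."""
    return not list_empty(block_group_node)
```

A plain delete that left stale pointers in the node would not allow this test, which is why the commit message calls out `list_del_init` specifically.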
  13. 15 May, 2009 1 commit
  14. 27 Apr, 2009 1 commit
  15. 25 Apr, 2009 1 commit
    • Btrfs: try to keep a healthy ratio of metadata vs data block groups · 97e728d4
      Josef Bacik committed
      This patch makes the chunk allocator keep a good ratio of metadata vs data
      block groups.  By default for every 8 data block groups, we'll allocate 1
      metadata chunk, or about 12% of the disk will be allocated for metadata.  This
      can be changed by specifying the metadata_ratio mount option.
      
      This is simply the number of data block groups that have to be allocated to
      force a metadata chunk allocation.  By making sure we allocate metadata chunks
      more often, we are less likely to get into situations where the whole disk
      has been allocated as data block groups.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
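
The ratio rule reads naturally as a counter: every N data block groups force one metadata chunk. The sketch below is a userspace illustration with hypothetical names; only the default of 8 and the `metadata_ratio` option name come from the commit message.

```python
class ChunkAllocator:
    """Toy model: force a metadata chunk after every `metadata_ratio` data chunks."""

    def __init__(self, metadata_ratio=8):
        self.metadata_ratio = metadata_ratio
        self.data_since_metadata = 0
        self.allocated = []  # allocation history, "data" or "metadata"

    def alloc_data_chunk(self):
        self.allocated.append("data")
        self.data_since_metadata += 1
        if self.data_since_metadata >= self.metadata_ratio:
            self.allocated.append("metadata")  # forced metadata allocation
            self.data_since_metadata = 0
```

A smaller `metadata_ratio` forces metadata chunks more often, which is the knob the mount option exposes.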
  16. 03 Apr, 2009 5 commits
    • Btrfs: rework allocation clustering · fa9c0d79
      Chris Mason committed
      Because btrfs is copy-on-write, we end up picking new locations for
      blocks very often.  This makes it fairly difficult to maintain perfect
      read patterns over time, but we can at least do some optimizations
      for writes.
      
      This is done today by remembering the last place we allocated and
      trying to find a free space hole big enough to hold more than just one
      allocation.  The end result is that we tend to write sequentially to
      the drive.
      
      This happens all the time for metadata and it happens for data
      when mounted -o ssd.  But, the way we record it is fairly racy
      and it tends to fragment the free space over time because we are trying
      to allocate fairly large areas at once.
      
      This commit gets rid of the races by adding a free space cluster object
      with dedicated locking to make sure that only one process at a time
      is out replacing the cluster.
      
      The free space fragmentation is somewhat solved by allowing a cluster
      to be comprised of smaller free space extents.  This part definitely
      adds some CPU time to the cluster allocations, but it allows the allocator
      to consume the small holes left behind by cow.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: kill the pinned_mutex · 04018de5
      Josef Bacik committed
      This patch removes the pinned_mutex.  The extent io map has an internal tree
      lock that protects the tree itself, and since we only copy the extent io map
      when we are committing the transaction we don't need it there.  We also don't
      need it when caching the block group since searching through the tree is also
      protected by the internal map spin lock.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
    • Btrfs: kill the block group alloc mutex · 6226cb0a
      Josef Bacik committed
      This patch removes the block group alloc mutex used to protect the free space
      tree for allocations and replaces it with a spin lock which is used only to
      protect the free space rb tree.  This means we only take the lock when we are
      directly manipulating the tree, which makes us a touch faster with
      multi-threaded workloads.
      
      This patch also gets rid of btrfs_find_free_space and replaces it with
      btrfs_find_space_for_alloc, which takes the number of bytes you want to
      allocate, and empty_size, which is used to indicate how much free space should
      be at the end of the allocation.
      
      It will return an offset for the allocator to use.  If we don't end up using it
      we _must_ call btrfs_add_free_space to put it back.  This is the tradeoff to
      kill the alloc_mutex, since we need to make sure nobody else comes along and
      takes our space.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
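
The contract described here (the search removes the space from the free space tree and returns an offset, so an unused result must be handed back) can be sketched in userspace. The method names echo the functions named in the commit message, but the signatures, the dict standing in for the rb tree, and the first-fit policy are all assumptions of this sketch.

```python
import threading

class FreeSpace:
    def __init__(self, extents):
        self.lock = threading.Lock()  # taken only while touching the tree
        self.extents = dict(extents)  # offset -> length, stand-in for the rb tree

    def find_space_for_alloc(self, nbytes, empty_size=0):
        """Reserve `nbytes` from the first extent that also has `empty_size` spare."""
        with self.lock:
            for offset in sorted(self.extents):
                length = self.extents[offset]
                if length >= nbytes + empty_size:
                    del self.extents[offset]
                    if length > nbytes:
                        self.extents[offset + nbytes] = length - nbytes
                    return offset
        return None

    def add_free_space(self, offset, nbytes):
        """Give space back; merging with neighbouring extents is omitted."""
        with self.lock:
            self.extents[offset] = nbytes
```

Because the reservation happens under the lock, a caller that decides not to use the returned offset has to call `add_free_space`, or the space is lost to everyone.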
    • Btrfs: clean up find_free_extent · 2552d17e
      Josef Bacik committed
      I've replaced the strange looping constructs with a list_for_each_entry on
      space_info->block_groups.  If we have a hint we just jump into the loop with
      the block group and start looking for space.  If we don't find anything we
      start at the beginning and start looking.  We never come out of the loop with a
      ref on the block_group _unless_ we found space to use, then we drop it after we
      set the trans block_group.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
    • Btrfs: free space cache cleanups · 70cb0743
      Josef Bacik committed
      This patch cleans up the free space cache code a bit.  It better documents the
      idiosyncrasies of tree_search_offset and makes the code make a bit more sense.
      I took out the info allocation at the start of __btrfs_add_free_space and put it
      where it makes more sense.  This was left over cruft from when alloc_mutex
      existed, as were all of the re-searches we did to make sure we inserted properly.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
  17. 01 Apr, 2009 1 commit
  18. 25 Mar, 2009 6 commits
    • Btrfs: tree logging unlink/rename fixes · 12fcfd22
      Chris Mason committed
      The tree logging code allows individual files or directories to be logged
      without including operations on other files and directories in the FS.
      It tries to commit the minimal set of changes to disk in order to
      fsync the single file or directory that was sent to fsync or O_SYNC.
      
      The tree logging code was allowing files and directories to be unlinked
      if they were part of a rename operation where only one directory
      in the rename was in the fsync log.  This patch adds a few new rules
      to the tree logging.
      
      1) on rename or unlink, if the inode being unlinked isn't in the fsync
      log, we must force a full commit before doing an fsync of the directory
      where the unlink was done.  The commit isn't done during the unlink,
      but it is forced the next time we try to log the parent directory.
      
      Solution: record transid of last unlink/rename per directory when the
      directory wasn't already logged.  For renames this is only done when
      renaming to a different directory.
      
      mkdir foo/some_dir
      normal commit
      rename foo/some_dir foo2/some_dir
      mkdir foo/some_dir
      fsync foo/some_dir/some_file
      
      The fsync above will unlink the original some_dir without recording
      it in its new location (foo2).  After a crash, some_dir will be gone
      unless the fsync of some_file forces a full commit
      
      2) we must log any new names for any file or dir that is in the fsync
      log.  This way we make sure not to lose files that are unlinked during
      the same transaction.
      
      2a) we must log any new names for any file or dir during rename
      when the directory they are being removed from was logged.
      
      2a is actually the more important variant.  Without the extra logging
      a crash might unlink the old name without recreating the new one
      
      3) after a crash, we must go through any directories with a link count
      of zero and redo the rm -rf
      
      mkdir f1/foo
      normal commit
      rm -rf f1/foo
      fsync(f1)
      
      The directory f1/foo was fully removed from the FS, but fsync was never
      called on f1/foo, only its parent dir.  After a crash the rm -rf must
      be replayed.  This must be able to recurse down the entire
      directory tree.  The inode link count fixup code takes care of the
      ugly details.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: leave btree locks spinning more often · b9473439
      Chris Mason committed
      btrfs_mark_buffer_dirty would set dirty bits in the extent_io tree
      for the buffers it was dirtying.  This may require a kmalloc and it
      was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
      set any btree locks they were holding to blocking first.
      
      This commit changes dirty tracking for extent buffers to just use a flag
      in the extent buffer.  Now that we have one and only one extent buffer
      per page, this can be safely done without losing dirty bits along the way.
      
      This also introduces a path->leave_spinning flag that callers of
      btrfs_search_slot can use to indicate they will properly deal with a
      path returned where all the locks are spinning instead of blocking.
      
      Many of the btree search callers now expect spinning paths,
      resulting in better btree concurrency overall.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: reduce stalls during transaction commit · b7ec40d7
      Chris Mason committed
      To avoid deadlocks and reduce latencies during some critical operations, some
      transaction writers are allowed to jump into the running transaction and make
      it run a little longer, while others sit around and wait for the commit to
      finish.
      
      This is a bit unfair, especially when the callers that jump in do a bunch
      of IO that makes all the other procs on the box wait.  This commit
      reduces the stalls this produces by pre-reading file extent pointers
      during btrfs_finish_ordered_io before the transaction is joined.
      
      It also tunes the drop_snapshot code to politely wait for transactions
      that have started writing out their delayed refs to finish.  This avoids
      new delayed refs being flooded into the queue while we're trying to
      close off the transaction.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: process the delayed reference queue in clusters · c3e69d58
      Chris Mason committed
      The delayed reference queue maintains pending operations that need to
      be done to the extent allocation tree.  These are processed by
      finding records in the tree that are not currently being processed one at
      a time.
      
      This is slow because it uses lots of time searching through the rbtree
      and because it creates lock contention on the extent allocation tree
      when lots of different procs are running delayed refs at the same time.
      
      This commit changes things to grab a cluster of refs for processing,
      using a cursor into the rbtree as the starting point of the next search.
      This way we walk smoothly through the rbtree.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
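
Grabbing a cluster of refs from a cursor position can be sketched over a sorted list of keys. This is an illustrative Python model, not the kernel code: a sorted list stands in for the rbtree, the function name and parameters are hypothetical, and wrapping the cursor back to 0 at the end is an assumption of the sketch.

```python
import bisect

def grab_cluster(sorted_keys, in_progress, cursor, cluster_size):
    """Collect up to `cluster_size` keys >= cursor that nobody is processing.

    Returns (cluster, new_cursor). The next call starts from new_cursor,
    so successive calls walk forward instead of restarting the search.
    """
    i = bisect.bisect_left(sorted_keys, cursor)
    cluster = []
    while i < len(sorted_keys) and len(cluster) < cluster_size:
        key = sorted_keys[i]
        if key not in in_progress:
            cluster.append(key)
        i += 1
    # Assumption of this sketch: wrap to the start once the end is reached.
    new_cursor = sorted_keys[i] if i < len(sorted_keys) else 0
    return cluster, new_cursor
```

Each caller leaves with its own batch of refs, so concurrent callers spend less time searching the same region of the tree.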
    • Btrfs: try to cleanup delayed refs while freeing extents · 1887be66
      Chris Mason committed
      When extents are freed, it is likely that we've removed the last
      delayed reference update for the extent.  This checks the delayed
      ref tree when things are freed, and if no ref updates are left it
      immediately processes the delayed ref.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: do extent allocation and reference count updates in the background · 56bec294
      Chris Mason committed
      The extent allocation tree maintains a reference count and full
      back reference information for every extent allocated in the
      filesystem.  For subvolume and snapshot trees, every time
      a block goes through COW, the new copy of the block adds a reference
      on every block it points to.
      
      If a btree node points to 150 leaves, then the COW code needs to go
      and add backrefs on 150 different extents, which might be spread all
      over the extent allocation tree.
      
      These updates currently happen during btrfs_cow_block, and most COWs
      happen during btrfs_search_slot.  btrfs_search_slot has locks held
      on both the parent and the node we are COWing, and so we really want
      to avoid IO during the COW if we can.
      
      This commit adds an rbtree of pending reference count updates and extent
      allocations.  The tree is ordered by byte number of the extent and byte number
      of the parent for the back reference.  The tree allows us to:
      
      1) Modify back references in something close to disk order, reducing seeks
      2) Significantly reduce the number of modifications made as block pointers
      are balanced around
      3) Do all of the extent insertion and back reference modifications outside
      of the performance critical btrfs_search_slot code.
      
      #3 has the added benefit of greatly reducing the btrfs stack footprint.
      The extent allocation tree modifications are done without the deep
      (and somewhat recursive) call chains used in the past.
      
      These delayed back reference updates must be done before the transaction
      commits, and so the rbtree is tied to the transaction.  Throttling is
      implemented to help keep the queue of backrefs at a reasonable size.
      
      Since there was a similar mechanism in place for the extent tree
      extents, that is removed and replaced by the delayed reference tree.
      
      Yan Zheng <yan.zheng@oracle.com> helped review and fix up this code.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      56bec294
  19. 11 Mar, 2009 1 commit
    • C
      Btrfs: Fix locking around adding new space_info · 4184ea7f
      Chris Mason committed on
      Storage allocated to different raid levels in btrfs is tracked by
      a btrfs_space_info structure, and all of the current space_infos are
      collected into a list_head.
      
      Most filesystems have 3 or 4 of these structs total, and the list is
      only changed when new raid levels are added or at unmount time.
      
      This commit adds rcu locking on the list head, and properly frees
      things at unmount time.  It also clears the space_info->full flag
      whenever new space is added to the FS.
      
      The locking for the space info list goes like this:
      
      reads: protected by rcu_read_lock()
      writes: protected by the chunk_mutex
      
      At unmount time we don't need special locking because all the readers
      are gone.
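
The reader/writer split above can be approximated in userspace C11. This is a sketch under assumptions, not the kernel code: release/acquire atomics stand in for the RCU publish and read primitives, an `atomic_flag` spinlock stands in for chunk_mutex, and `space_info`, `space_info_head`, `add_space_info` and `find_space_info` are hypothetical, trimmed-down names. As in the commit, entries are never freed while readers can still exist.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down space_info: one entry per raid level. */
struct space_info {
    uint64_t flags;            /* raid level / block group type */
    struct space_info *next;   /* only written before publication */
};

/* List head; readers load it without taking any lock. */
_Atomic(struct space_info *) space_info_head;

/* Stand-in for chunk_mutex: serializes writers only. */
atomic_flag writer_lock = ATOMIC_FLAG_INIT;

/* Writer side: fully initialize the entry, then publish it at the head. */
void add_space_info(struct space_info *info, uint64_t flags)
{
    while (atomic_flag_test_and_set_explicit(&writer_lock,
                                             memory_order_acquire))
        ;   /* spin; the kernel would sleep on a mutex instead */

    info->flags = flags;
    info->next = atomic_load_explicit(&space_info_head,
                                      memory_order_relaxed);
    /* Release ordering makes the fields above visible before the pointer. */
    atomic_store_explicit(&space_info_head, info, memory_order_release);

    atomic_flag_clear_explicit(&writer_lock, memory_order_release);
}

/* Reader side: lock-free walk over entries that are already published. */
struct space_info *find_space_info(uint64_t flags)
{
    struct space_info *cur =
        atomic_load_explicit(&space_info_head, memory_order_acquire);

    while (cur) {
        if (cur->flags == flags)
            return cur;
        cur = cur->next;
    }
    return NULL;
}
```

Because the list is only ever extended at the head, a reader that raced with a writer sees either the old list or the new one, never a half-initialized entry.
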
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      4184ea7f
  20. 09 Mar, 2009 1 commit
    • C
      Btrfs: fix spinlock assertions on UP systems · b9447ef8
      Chris Mason committed on
      btrfs_tree_locked was being used to make sure a given extent_buffer was
      properly locked in a few places.  But, it wasn't correct for UP compiled
      kernels.
      
      This switches it to using assert_spin_locked instead, and renames it to
      btrfs_assert_tree_locked to better reflect how it was really being used.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      b9447ef8
  21. 20 Feb, 2009 1 commit
    • J
      Btrfs: try committing transaction before returning ENOSPC · 4e06bdd6
      Josef Bacik committed on
      This fixes a problem where we could return -ENOSPC even though plenty of
      space exists and is merely pinned.  Instead of returning -ENOSPC
      immediately, commit the transaction first and then try the allocation
      again.
      
      This patch also does chunk allocation for metadata if we pass the 80%
      threshold for metadata space.  This will help with stack usage since the chunk
      allocation will happen early on, instead of when the allocation is happening.
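
The commit-then-retry idea can be sketched with a toy space model. Everything here is an assumption made for illustration: `space_pool`, `commit_transaction`, `try_alloc`, `alloc_with_commit_retry` and `past_metadata_threshold` are hypothetical names, the commit is reduced to releasing pinned bytes, and counting pinned bytes toward the 80% threshold is a guess rather than something the message states.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical toy model of one pool of space. */
struct space_pool {
    uint64_t total;    /* bytes in the pool */
    uint64_t used;     /* bytes allocated to live extents */
    uint64_t pinned;   /* freed in the running transaction, not yet reusable */
};

/* In this model, committing only makes pinned bytes reusable. */
void commit_transaction(struct space_pool *p)
{
    p->pinned = 0;
}

/* Plain allocation attempt: pinned bytes still count against the pool. */
int try_alloc(struct space_pool *p, uint64_t bytes)
{
    if (p->used + p->pinned + bytes > p->total)
        return -ENOSPC;
    p->used += bytes;
    return 0;
}

/* Allocate; on -ENOSPC, commit once to release pinned space and retry. */
int alloc_with_commit_retry(struct space_pool *p, uint64_t bytes)
{
    int ret = try_alloc(p, bytes);

    if (ret == -ENOSPC) {
        commit_transaction(p);
        ret = try_alloc(p, bytes);
    }
    return ret;
}

/* Non-zero once usage passes 80% of the pool (assumed to include pinned). */
int past_metadata_threshold(const struct space_pool *p)
{
    return (p->used + p->pinned) * 10 > p->total * 8;
}
```

A caller would check `past_metadata_threshold()` early and allocate a new chunk at that point, rather than deep inside the allocation path where stack usage is already high.
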
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      
      4e06bdd6
  22. 21 Feb, 2009 1 commit
    • J
      Btrfs: add better -ENOSPC handling · 6a63209f
      Josef Bacik committed on
      This is a step in the direction of better -ENOSPC handling.  Instead of
      checking the global bytes counter we check the space_info bytes counters to
      make sure we have enough space.
      
      If we don't, we go ahead and try to allocate a new chunk, and then if that fails
      we return -ENOSPC.  This patch adds two counters to btrfs_space_info,
      bytes_delalloc and bytes_may_use.
      
      bytes_delalloc accounts for extents we've actually set up for delalloc and will
      be allocated at some point down the line.
      
      bytes_may_use is to keep track of how many bytes we may use for delalloc at
      some point.  When we actually set the extent_bit for the delalloc bytes we
      subtract the reserved bytes from the bytes_may_use counter.  This ensures that
      space can actually be allocated for any delalloc bytes.
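
The interplay of the two counters can be sketched as follows. Only the counter names `bytes_delalloc` and `bytes_may_use` come from the message above; the struct layout, `total_bytes`, `bytes_used`, and the helpers `reserve_data_space` and `account_delalloc` are hypothetical, and the chunk allocation fallback is left out.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical subset of btrfs_space_info. */
struct space_info {
    uint64_t total_bytes;
    uint64_t bytes_used;
    uint64_t bytes_delalloc;   /* extents already set up for delalloc */
    uint64_t bytes_may_use;    /* reserved for delalloc that may happen */
};

/* Reserve space up front; fail if the pool cannot cover the request. */
int reserve_data_space(struct space_info *si, uint64_t bytes)
{
    uint64_t committed = si->bytes_used + si->bytes_delalloc +
                         si->bytes_may_use;

    if (committed + bytes > si->total_bytes)
        return -ENOSPC;   /* real code would first try a new chunk */
    si->bytes_may_use += bytes;
    return 0;
}

/* Delalloc extent is set up: move the bytes from "may use" to "delalloc". */
void account_delalloc(struct space_info *si, uint64_t bytes)
{
    si->bytes_may_use -= bytes;
    si->bytes_delalloc += bytes;
}
```

Moving bytes between the counters leaves their sum unchanged, so a reservation that succeeded earlier cannot be undercut when the delalloc extent is later set up.
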
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      
      
      
      6a63209f
  23. 13 Feb, 2009 2 commits
    • Y
      Btrfs: hold trans_mutex when using btrfs_record_root_in_trans · 24562425
      Yan Zheng committed on
      btrfs_record_root_in_trans needs the trans_mutex held to make sure two
      callers don't race to set up the root in a given transaction.  This adds
      it to all the places that were missing it.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      24562425
    • C
      Btrfs: make a lockdep class for the extent buffer locks · 4008c04a
      Chris Mason committed on
      Btrfs is currently using spin_lock_nested with a nested value based
      on the tree depth of the block.  But, this doesn't quite work because
      the max tree depth is bigger than what spin_lock_nested can deal with,
      and because locks are sometimes taken before the level field is filled in.
      
      The solution here is to use lockdep_set_class_and_name instead, and to
      set the class before unlocking the pages when the block is read from the
      disk and just after init of a freshly allocated tree block.
      
      btrfs_clear_path_blocking is also changed to take the locks in the proper
      order, and it also makes sure all the locks currently held are properly
      set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
      gets upset about bad lock ordering.
      
      The lockdep magic came from Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      4008c04a
  24. 12 Feb, 2009 1 commit
    • C
      Btrfs: use larger metadata clusters in ssd mode · 536ac8ae
      Chris Mason committed on
      Larger metadata clusters can significantly improve writeback performance
      on ssd drives with large erasure blocks.  The larger clusters make it
      more likely a given IO will completely overwrite the ssd block, so it
      doesn't have to do an internal rmw (read-modify-write) cycle.
      
      On spinning media, larger metadata clusters end up spreading out the
      metadata more over time, which makes fsck slower, so we don't want this
      to be the default.
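
The reasoning about erase blocks can be made concrete with a little arithmetic. This sketch assumes a simple model in which any erase block touched only partially by a write costs the drive one internal read-modify-write; the function name `partial_erase_blocks` and the sizes used in the note after it are hypothetical and not taken from the patch.

```c
#include <stdint.h>

/*
 * Count the erase blocks that a write of `len` bytes at offset `start`
 * touches only partially. erase_size must be non-zero.
 */
unsigned partial_erase_blocks(uint64_t start, uint64_t len,
                              uint64_t erase_size)
{
    uint64_t end = start + len;   /* exclusive end offset */
    uint64_t first, last;
    unsigned partial = 0;

    if (len == 0)
        return 0;

    first = start / erase_size;
    last = (end - 1) / erase_size;

    if (start % erase_size != 0)
        partial++;   /* head block is not covered from its start */
    if (end % erase_size != 0 && (last != first || partial == 0))
        partial++;   /* tail block is not covered to its end */
    return partial;
}
```

With an assumed 512 KiB erase block, an aligned 512 KiB write touches no partial blocks, while a 64 KiB write always leaves its one block partially written. Larger clusters push more writes into the first case.
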
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      536ac8ae