1. 24 7月, 2009 1 次提交
    • J
      Btrfs: use hybrid extents+bitmap rb tree for free space · 96303081
      Josef Bacik 提交于
      Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
      tracking free space.  As free space gets fragmented, we end up with thousands of
      entries on an rb-tree per block group, which usually spans 1 gig of area.  Since
      we currently don't ever flush free space cache back to disk this gets to be a
      bit unweildly on large fs's with lots of fragmentation.
      
      This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
      space cache.  Initially we calculate a threshold of extent entries we can
      handle, which is however many extent entries we can cram into 16k of ram.  The
      maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
      will be 32k of RAM, which scales much better than we did before.
      
      Once we pass the extent threshold, we start adding bitmaps and using those
      instead for tracking the free space.  This patch also makes it so that any free
      space thats less than 4 * sectorsize we go ahead and put into a bitmap.  This is
      nice since we try and allocate out of the front of a block group, so if the
      front of a block group is heavily fragmented and then has a huge chunk of free
      space at the end, we go ahead and add the fragmented areas to bitmaps and use a
      normal extent entry to track the big chunk at the back of the block group.
      
      I've also taken the opportunity to revamp how we search for free space.
      Previously we indexed free space via an offset indexed rb tree and a bytes
      indexed rb tree.  I've dropped the bytes indexed rb tree and use only the offset
      indexed rb tree.  This cuts the number of tree operations we were doing
      previously down by half, and gives us a little bit of a better allocation
      pattern since we will always start from a specific offset and search forward
      from there, instead of searching for the size we need and try and get it as
      close as possible to the offset we want.
      
      I've given this a healthy amount of testing pre-new format stuff, as well as
      post-new format stuff.  I've booted up my fedora box which is installed on btrfs
      with this patch and ran with it for a few days without issues.  I've not seen
      any performance regressions in any of my tests.
      
      Since the last patch Yan Zheng fixed a problem where we could have overlapping
      entries, so updating their offset inline would cause problems.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      96303081
  2. 10 6月, 2009 3 次提交
    • C
      Btrfs: add mount -o ssd_spread to spread allocations out · 451d7585
      Chris Mason 提交于
      Some SSDs perform best when reusing block numbers often, while
      others perform much better when clustering strictly allocates
      big chunks of unused space.
      
      The default mount -o ssd will find rough groupings of blocks
      where there are a bunch of free blocks that might have some
      allocated blocks mixed in.
      
      mount -o ssd_spread will make sure there are no allocated blocks
      mixed in.  It should perform better on lower end SSDs.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      451d7585
    • C
      Btrfs: avoid allocation clusters that are too spread out · c6044801
      Chris Mason 提交于
      In SSD mode for data, and all the time for metadata the allocator
      will try to find a cluster of nearby blocks for allocations.  This
      commit adds extra checks to make sure that each free block in the
      cluster is close to the last one.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c6044801
    • C
      Btrfs: reduce mount -o ssd CPU usage · 2c943de6
      Chris Mason 提交于
      The block allocator in SSD mode will try to find groups of free blocks
      that are close together.  This commit makes it loop less on a given
      group size before bumping it.
      
      The end result is that we are less likely to fill small holes in the
      available free space, but we don't waste as much CPU building the
      large cluster used by ssd mode.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      2c943de6
  3. 27 4月, 2009 1 次提交
  4. 03 4月, 2009 4 次提交
    • S
      Btrfs: BUG to BUG_ON changes · c293498b
      Stoyan Gaydarov 提交于
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c293498b
    • C
      Btrfs: rework allocation clustering · fa9c0d79
      Chris Mason 提交于
      Because btrfs is copy-on-write, we end up picking new locations for
      blocks very often.  This makes it fairly difficult to maintain perfect
      read patterns over time, but we can at least do some optimizations
      for writes.
      
      This is done today by remembering the last place we allocated and
      trying to find a free space hole big enough to hold more than just one
      allocation.  The end result is that we tend to write sequentially to
      the drive.
      
      This happens all the time for metadata and it happens for data
      when mounted -o ssd.  But, the way we record it is fairly racey
      and it tends to fragment the free space over time because we are trying
      to allocate fairly large areas at once.
      
      This commit gets rid of the races by adding a free space cluster object
      with dedicated locking to make sure that only one process at a time
      is out replacing the cluster.
      
      The free space fragmentation is somewhat solved by allowing a cluster
      to be comprised of smaller free space extents.  This part definitely
      adds some CPU time to the cluster allocations, but it allows the allocator
      to consume the small holes left behind by cow.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      fa9c0d79
    • J
      Btrfs: kill the block group alloc mutex · 6226cb0a
      Josef Bacik 提交于
      This patch removes the block group alloc mutex used to protect the free space
      tree for allocations and replaces it with a spin lock which is used only to
      protect the free space rb tree.  This means we only take the lock when we are
      directly manipulating the tree, which makes us a touch faster with
      multi-threaded workloads.
      
      This patch also gets rid of btrfs_find_free_space and replaces it with
      btrfs_find_space_for_alloc, which takes the number of bytes you want to
      allocate, and empty_size, which is used to indicate how much free space should
      be at the end of the allocation.
      
      It will return an offset for the allocator to use.  If we don't end up using it
      we _must_ call btrfs_add_free_space to put it back.  This is the tradeoff to
      kill the alloc_mutex, since we need to make sure nobody else comes along and
      takes our space.
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      6226cb0a
    • J
      Btrfs: free space cache cleanups · 70cb0743
      Josef Bacik 提交于
      This patch cleans up the free space cache code a bit.  It better documents the
      idiosyncrasies of tree_search_offset and makes the code make a bit more sense.
      I took out the info allocation at the start of __btrfs_add_free_space and put it
      where it makes more sense.  This was left over cruft from when alloc_mutex
      existed.  Also all of the re-searches we do to make sure we inserted properly.
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      70cb0743
  5. 06 1月, 2009 1 次提交
  6. 09 12月, 2008 1 次提交
    • Y
      Btrfs: superblock duplication · a512bbf8
      Yan Zheng 提交于
      This patch implements superblock duplication. Superblocks
      are stored at offset 16K, 64M and 256G on every devices.
      Spaces used by superblocks are preserved by the allocator,
      which uses a reverse mapping function to find the logical
      addresses that correspond to superblocks. Thank you,
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      a512bbf8
  7. 02 12月, 2008 1 次提交
  8. 30 10月, 2008 1 次提交
    • J
      Btrfs: nuke fs wide allocation mutex V2 · 25179201
      Josef Bacik 提交于
      This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
      of little locks.
      
      There is now a pinned_mutex, which is used when messing with the pinned_extents
      extent io tree, and the extent_ins_mutex which is used with the pending_del and
      extent_ins extent io trees.
      
      The locking for the extent tree stuff was inspired by a patch that Yan Zheng
      wrote to fix a race condition, I cleaned it up some and changed the locking
      around a little bit, but the idea remains the same.  Basically instead of
      holding the extent_ins_mutex throughout the processing of an extent on the
      extent_ins or pending_del trees, we just hold it while we're searching and when
      we clear the bits on those trees, and lock the extent for the duration of the
      operations on the extent.
      
      Also to keep from getting hung up waiting to lock an extent, I've added a
      try_lock_extent so if we cannot lock the extent, move on to the next one in the
      tree and we'll come back to that one.  I have tested this heavily and it does
      not appear to break anything.  This has to be applied on top of my
      find_free_extent redo patch.
      
      I tested this patch on top of Yan's space reblancing code and it worked fine.
      The only thing that has changed since the last version is I pulled out all my
      debugging stuff, apparently I forgot to run guilt refresh before I sent the
      last patch out.  Thank you,
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      
      25179201
  9. 10 10月, 2008 1 次提交
    • J
      Btrfs: make tree_search_offset more flexible in its searching · 37d3cddd
      Josef Bacik 提交于
      Sometimes we end up freeing a reserved extent because we don't need it, however
      this means that its possible for transaction->last_alloc to point to the middle
      of a free area.
      
      When we search for free space in find_free_space we do a tree_search_offset
      with contains set to 0, because we want it to find the next best free area if
      we do not have an offset starting on the given offset.
      
      Unfortunately that currently means that if the offset we were given as a hint
      points to the middle of a free area, we won't find anything.  This is especially
      bad if we happened to last allocate from the big huge chunk of a newly formed
      block group, since we won't find anything and have to go back and search the
      long way around.
      
      This fixes this problem by making it so that we return the free space area
      regardless of the contains variable.  This made cache missing happen _alot_
      less, and speeds things up considerably.
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      37d3cddd
  10. 26 9月, 2008 1 次提交
    • C
      Btrfs: Fix allocation completions in tree log replay · 9b49c9b9
      Chris Mason 提交于
      After a crash, the tree log code uses btrfs_alloc_logged_extent to
      record allocations of data extents that it finds in the log tree.  These
      come in basically random order, which does not fit how
      btrfs_remove_free_space() expects to be called.
      
      btrfs_remove_free_space was changed to support recording an extent
      allocation in the middle of a region of free space.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9b49c9b9
  11. 25 9月, 2008 1 次提交
    • J
      Btrfs: free space accounting redo · 0f9dd46c
      Josef Bacik 提交于
      1) replace the per fs_info extent_io_tree that tracked free space with two
      rb-trees per block group to track free space areas via offset and size.  The
      reason to do this is because most allocations come with a hint byte where to
      start, so we can usually find a chunk of free space at that hint byte to satisfy
      the allocation and get good space packing.  If we cannot find free space at or
      after the given offset we fall back on looking for a chunk of the given size as
      close to that given offset as possible.  When we fall back on the size search we
      also try to find a slot as close to the size we want as possible, to avoid
      breaking small chunks off of huge areas if possible.
      
      2) remove the extent_io_tree that tracked the block group cache from fs_info and
      replaced it with an rb-tree thats tracks block group cache via offset.  also
      added a per space_info list that tracks the block group cache for the particular
      space so we can lookup related block groups easily.
      
      3) cleaned up the allocation code to make it a little easier to read and a
      little less complicated.  Basically there are 3 steps, first look from our
      provided hint.  If we couldn't find from that given hint, start back at our
      original search start and look for space from there.  If that fails try to
      allocate space if we can and start looking again.  If not we're screwed and need
      to start over again.
      
      4) small fixes.  there were some issues in volumes.c where we wouldn't allocate
      the rest of the disk.  fixed cow_file_range to actually pass the alloc_hint,
      which has helped a good bit in making the fs_mark test I run have semi-normal
      results as we run out of space.  Generally with data allocations we don't track
      where we last allocated from, so everytime we did a data allocation we'd search
      through every block group that we have looking for free space.  Now searching a
      block group with no free space isn't terribly time consuming, it was causing a
      slight degradation as we got more data block groups.  The alloc_hint has fixed
      this slight degredation and made things semi-normal.
      
      There is still one nagging problem I'm working on where we will get ENOSPC when
      there is definitely plenty of space.  This only happens with metadata
      allocations, and only when we are almost full.  So you generally hit the 85%
      mark first, but sometimes you'll hit the BUG before you hit the 85% wall.  I'm
      still tracking it down, but until then this seems to be pretty stable and make a
      significant performance gain.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      0f9dd46c