1. 24 7月, 2009 1 次提交
    • J
      Btrfs: async block group caching · 817d52f8
      Josef Bacik 提交于
      This patch moves the caching of the block group off to a kthread in order to
      allow people to allocate sooner.  Instead of blocking up behind the caching
      mutex, we instead kick of the caching kthread, and then attempt to make an
      allocation.  If we cannot, we wait on the block groups caching waitqueue, which
      the caching kthread will wake the waiting threads up everytime it finds 2 meg
      worth of space, and then again when its finished caching.  This is how I tested
      the speedup from this
      
      mkfs the disk
      mount the disk
      fill the disk up with fs_mark
      unmount the disk
      mount the disk
      time touch /mnt/foo
      
      Without my changes this took 11 seconds on my box, with these changes it now
      takes 1 second.
      
      Another change thats been put in place is we lock the super mirror's in the
      pinned extent map in order to keep us from adding that stuff as free space when
      caching the block group.  This doesn't really change anything else as far as the
      pinned extent map is concerned, since for actual pinned extents we use
      EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
      those extents to keep from leaking memory.
      
      I've also added a check where when we are reading block groups from disk, if the
      amount of space used == the size of the block group, we go ahead and mark the
      block group as cached.  This drastically reduces the amount of time it takes to
      cache the block groups.  Using the same test as above, except doing a dd to a
      file and then unmounting, it used to take 33 seconds to umount, now it takes 3
      seconds.
      
      This version uses the commit_root in the caching kthread, and then keeps track
      of how many async caching threads are running at any given time so if one of the
      async threads is still running as we cross transactions we can wait until its
      finished before handling the pinned extents.  Thank you,
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      817d52f8
  2. 23 7月, 2009 1 次提交
  3. 11 6月, 2009 2 次提交
    • H
      Btrfs: pin buffers during write_dev_supers · 4eedeb75
      Hisashi Hifumi 提交于
      write_dev_supers is called in sequence.  First is it called with wait == 0,
      which starts IO on all of the super blocks for a given device.  Then it is
      called with wait == 1 to make sure they all reach the disk.
      
      It doesn't currently pin the buffers between the two calls, and it also
      assumes the buffers won't go away between the two calls, leading to
      an oops if the VM manages to free the buffers in the middle of the sync.
      
      This fixes that assumption and updates the code to return an error if things
      are not up to date when the wait == 1 run is done.
      Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4eedeb75
    • C
      Btrfs: avoid races between super writeout and device list updates · e5e9a520
      Chris Mason 提交于
      On multi-device filesystems, btrfs writes supers to all of the devices
      before considering a sync complete.  There wasn't any additional
      locking between super writeout and the device list management code
      because device management was done inside a transaction and
      super writeout only happened  with no transation writers running.
      
      With the btrfs fsync log and other async transaction updates, this
      has been racey for some time.  This adds a mutex to protect
      the device list.  The existing volume mutex could not be reused due to
      transaction lock ordering requirements.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      e5e9a520
  4. 10 6月, 2009 4 次提交
    • D
      Btrfs: remove crc32c.h and use libcrc32c directly. · 163e783e
      David Woodhouse 提交于
      There's no need to preserve this abstraction; it used to let us use
      hardware crc32c support directly, but libcrc32c is already doing that for us
      through the crypto API -- so we're already using the Intel crc32c
      acceleration where appropriate.
      Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      163e783e
    • C
      Btrfs: autodetect SSD devices · c289811c
      Chris Mason 提交于
      During mount, btrfs will check the queue nonrot flag
      for all the devices found in the FS.  If they are all
      non-rotating, SSD mode is enabled by default.
      
      If the FS was mounted with -o nossd, the non-rotating
      flag is ignored.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c289811c
    • C
      Btrfs: fix metadata dirty throttling limits · 585ad2c3
      Chris Mason 提交于
      Once a metadata block has been written, it must be recowed, so the
      btrfs dirty balancing call has a check to make sure a fair amount of metadata
      was actually dirty before it started writing it back to disk.
      
      A previous commit had changed the dirty tracking for metadata without
      updating the btrfs dirty balancing checks.  This commit switches it
      to use the correct counter.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      585ad2c3
    • Y
      Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) · 5d4f98a2
      Yan Zheng 提交于
      This commit introduces a new kind of back reference for btrfs metadata.
      Once a filesystem has been mounted with this commit, IT WILL NO LONGER
      BE MOUNTABLE BY OLDER KERNELS.
      
      When a tree block in subvolume tree is cow'd, the reference counts of all
      extents it points to are increased by one.  At transaction commit time,
      the old root of the subvolume is recorded in a "dead root" data structure,
      and the btree it points to is later walked, dropping reference counts
      and freeing any blocks where the reference count goes to 0.
      
      The increments done during cow and decrements done after commit cancel out,
      and the walk is a very expensive way to go about freeing the blocks that
      are no longer referenced by the new btree root.  This commit reduces the
      transaction overhead by avoiding the need for dead root records.
      
      When a non-shared tree block is cow'd, we free the old block at once, and the
      new block inherits old block's references. When a tree block with reference
      count > 1 is cow'd, we increase the reference counts of all extents
      the new block points to by one, and decrease the old block's reference count by
      one.
      
      This dead tree avoidance code removes the need to modify the reference
      counts of lower level extents when a non-shared tree block is cow'd.
      But we still need to update back ref for all pointers in the block.
      This is because the location of the block is recorded in the back ref
      item.
      
      We can solve this by introducing a new type of back ref. The new
      back ref provides information about pointer's key, level and in which
      tree the pointer lives. This information allow us to find the pointer
      by searching the tree. The shortcoming of the new back ref is that it
      only works for pointers in tree blocks referenced by their owner trees.
      
      This is mostly a problem for snapshots, where resolving one of these
      fuzzy back references would be O(number_of_snapshots) and quite slow.
      The solution used here is to use the fuzzy back references in the common
      case where a given tree block is only referenced by one root,
      and use the full back references when multiple roots have a reference
      on a given block.
      
      This commit adds per subvolume red-black tree to keep trace of cached
      inodes. The red-black tree helps the balancing code to find cached
      inodes whose inode numbers within a given range.
      
      This commit improves the balancing code by introducing several data
      structures to keep the state of balancing. The most important one
      is the back ref cache. It caches how the upper level tree blocks are
      referenced. This greatly reduce the overhead of checking back ref.
      
      The improved balancing code scales significantly better with a large
      number of snapshots.
      
      This is a very large commit and was written in a number of
      pieces.  But, they depend heavily on the disk format change and were
      squashed together to make sure git bisect didn't end up in a
      bad state wrt space balancing or the format change.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5d4f98a2
  5. 15 5月, 2009 1 次提交
  6. 27 4月, 2009 3 次提交
  7. 25 4月, 2009 1 次提交
    • J
      Btrfs: try to keep a healthy ratio of metadata vs data block groups · 97e728d4
      Josef Bacik 提交于
      This patch makes the chunk allocator keep a good ratio of metadata vs data
      block groups.  By default for every 8 data block groups, we'll allocate 1
      metadata chunk, or about 12% of the disk will be allocated for metadata.  This
      can be changed by specifying the metadata_ratio mount option.
      
      This is simply the number of data block groups that have to be allocated to
      force a metadata chunk allocation.  By making sure we allocate metadata chunks
      more often, we are less likely to get into situations where the whole disk
      has been allocated as data block groups.
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      97e728d4
  8. 21 4月, 2009 2 次提交
    • C
      Btrfs: add a priority queue to the async thread helpers · d313d7a3
      Chris Mason 提交于
      Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a
      higher priority.  But, the checksumming helper threads prevent it
      from being fully effective.
      
      There are two problems.  First, a big queue of pending checksumming
      will delay the synchronous IO behind other lower priority writes.  Second,
      the checksumming uses an ordered async work queue.  The ordering makes sure
      that IOs are sent to the block layer in the same order they are sent
      to the checksumming threads.  Usually this gives us less seeky IO.
      
      But, when we start mixing IO priorities, the lower priority IO can delay
      the higher priority IO.
      
      This patch solves both problems by adding a high priority list to the async
      helper threads, and a new btrfs_set_work_high_prio(), which is used
      to make put a new async work item onto the higher priority list.
      
      The ordering is still done on high priority IO, but all of the high
      priority bios are ordered separately from the low priority bios.  This
      ordering is purely an IO optimization, it is not involved in data
      or metadata integrity.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      d313d7a3
    • C
      Btrfs: use WRITE_SYNC for synchronous writes · ffbd517d
      Chris Mason 提交于
      Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for
      writes we plan on waiting on in the near future.  This patch
      mirrors recent changes in other filesystems and the generic code to
      use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for
      other latency critical writes.
      
      Btrfs uses async worker threads for checksumming before the write is done,
      and then again to actually submit the bios.  The bio submission code just
      runs a per-device list of bios that need to be sent down the pipe.
      
      This list is split into low priority and high priority lists so the
      WRITE_SYNC IO happens first.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      ffbd517d
  9. 03 4月, 2009 3 次提交
    • D
      Btrfs: remove dead code · 3e7ad38d
      Dan Carpenter 提交于
      Remove an unneeded return statement and conditional
      Signed-off-by: NDan Carpenter <error27@gmail.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      3e7ad38d
    • C
      Btrfs: rework allocation clustering · fa9c0d79
      Chris Mason 提交于
      Because btrfs is copy-on-write, we end up picking new locations for
      blocks very often.  This makes it fairly difficult to maintain perfect
      read patterns over time, but we can at least do some optimizations
      for writes.
      
      This is done today by remembering the last place we allocated and
      trying to find a free space hole big enough to hold more than just one
      allocation.  The end result is that we tend to write sequentially to
      the drive.
      
      This happens all the time for metadata and it happens for data
      when mounted -o ssd.  But, the way we record it is fairly racey
      and it tends to fragment the free space over time because we are trying
      to allocate fairly large areas at once.
      
      This commit gets rid of the races by adding a free space cluster object
      with dedicated locking to make sure that only one process at a time
      is out replacing the cluster.
      
      The free space fragmentation is somewhat solved by allowing a cluster
      to be comprised of smaller free space extents.  This part definitely
      adds some CPU time to the cluster allocations, but it allows the allocator
      to consume the small holes left behind by cow.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      fa9c0d79
    • J
      Btrfs: kill the pinned_mutex · 04018de5
      Josef Bacik 提交于
      This patch removes the pinned_mutex.  The extent io map has an internal tree
      lock that protects the tree itself, and since we only copy the extent io map
      when we are committing the transaction we don't need it there.  We also don't
      need it when caching the block group since searching through the tree is also
      protected by the internal map spin lock.
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      04018de5
  10. 01 4月, 2009 1 次提交
    • C
      Btrfs: add extra flushing for renames and truncates · 5a3f23d5
      Chris Mason 提交于
      Renames and truncates are both common ways to replace old data with new
      data.  The filesystem can make an effort to make sure the new data is
      on disk before actually replacing the old data.
      
      This is especially important for rename, which many application use as
      though it were atomic for both the data and the metadata involved.  The
      current btrfs code will happily replace a file that is fully on disk
      with one that was just created and still has pending IO.
      
      If we crash after transaction commit but before the IO is done, we'll end
      up replacing a good file with a zero length file.  The solution used
      here is to create a list of inodes that need special ordering and force
      them to disk before the commit is done.  This is similar to the
      ext3 style data=ordering, except it is only done on selected files.
      
      Btrfs is able to get away with this because it does not wait on commits
      very often, even for fsync (which use a sub-commit).
      
      For renames, we order the file when it wasn't already
      on disk and when it is replacing an existing file.  Larger files
      are sent to filemap_flush right away (before the transaction handle is
      opened).
      
      For truncates, we order if the file goes from non-zero size down to
      zero size.  This is a little different, because at the time of the
      truncate the file has no dirty bytes to order.  But, we flag the inode
      so that it is added to the ordered list on close (via release method).  We
      also immediately add it to the ordered list of the current transaction
      so that we can try to flush down any writes the application sneaks in
      before commit.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5a3f23d5
  11. 26 3月, 2009 1 次提交
  12. 25 3月, 2009 3 次提交
    • C
      Btrfs: leave btree locks spinning more often · b9473439
      Chris Mason 提交于
      btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
      for the buffers it was dirtying.  This may require a kmalloc and it
      was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
      set any btree locks they were holding to blocking first.
      
      This commit changes dirty tracking for extent buffers to just use a flag
      in the extent buffer.  Now that we have one and only one extent buffer
      per page, this can be safely done without losing dirty bits along the way.
      
      This also introduces a path->leave_spinning flag that callers of
      btrfs_search_slot can use to indicate they will properly deal with a
      path returned where all the locks are spinning instead of blocking.
      
      Many of the btree search callers now expect spinning paths,
      resulting in better btree concurrency overall.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b9473439
    • C
      Btrfs: process the delayed reference queue in clusters · c3e69d58
      Chris Mason 提交于
      The delayed reference queue maintains pending operations that need to
      be done to the extent allocation tree.  These are processed by
      finding records in the tree that are not currently being processed one at
      a time.
      
      This is slow because it uses lots of time searching through the rbtree
      and because it creates lock contention on the extent allocation tree
      when lots of different procs are running delayed refs at the same time.
      
      This commit changes things to grab a cluster of refs for processing,
      using a cursor into the rbtree as the starting point of the next search.
      This way we walk smoothly through the rbtree.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c3e69d58
    • C
      Btrfs: do extent allocation and reference count updates in the background · 56bec294
      Chris Mason 提交于
      The extent allocation tree maintains a reference count and full
      back reference information for every extent allocated in the
      filesystem.  For subvolume and snapshot trees, every time
      a block goes through COW, the new copy of the block adds a reference
      on every block it points to.
      
      If a btree node points to 150 leaves, then the COW code needs to go
      and add backrefs on 150 different extents, which might be spread all
      over the extent allocation tree.
      
      These updates currently happen during btrfs_cow_block, and most COWs
      happen during btrfs_search_slot.  btrfs_search_slot has locks held
      on both the parent and the node we are COWing, and so we really want
      to avoid IO during the COW if we can.
      
      This commit adds an rbtree of pending reference count updates and extent
      allocations.  The tree is ordered by byte number of the extent and byte number
      of the parent for the back reference.  The tree allows us to:
      
      1) Modify back references in something close to disk order, reducing seeks
      2) Significantly reduce the number of modifications made as block pointers
      are balanced around
      3) Do all of the extent insertion and back reference modifications outside
      of the performance critical btrfs_search_slot code.
      
      #3 has the added benefit of greatly reducing the btrfs stack footprint.
      The extent allocation tree modifications are done without the deep
      (and somewhat recursive) call chains used in the past.
      
      These delayed back reference updates must be done before the transaction
      commits, and so the rbtree is tied to the transaction.  Throttling is
      implemented to help keep the queue of backrefs at a reasonable size.
      
      Since there was a similar mechanism in place for the extent tree
      extents, that is removed and replaced by the delayed reference tree.
      
      Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      56bec294
  13. 09 3月, 2009 1 次提交
    • C
      Btrfs: fix spinlock assertions on UP systems · b9447ef8
      Chris Mason 提交于
      btrfs_tree_locked was being used to make sure a given extent_buffer was
      properly locked in a few places.  But, it wasn't correct for UP compiled
      kernels.
      
      This switches it to using assert_spin_locked instead, and renames it to
      btrfs_assert_tree_locked to better reflect how it was really being used.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b9447ef8
  14. 13 2月, 2009 1 次提交
    • C
      Btrfs: make a lockdep class for the extent buffer locks · 4008c04a
      Chris Mason 提交于
      Btrfs is currently using spin_lock_nested with a nested value based
      on the tree depth of the block.  But, this doesn't quite work because
      the max tree depth is bigger than what spin_lock_nested can deal with,
      and because locks are sometimes taken before the level field is filled in.
      
      The solution here is to use lockdep_set_class_and_name instead, and to
      set the class before unlocking the pages when the block is read from the
      disk and just after init of a freshly allocated tree block.
      
      btrfs_clear_path_blocking is also changed to take the locks in the proper
      order, and it also makes sure all the locks currently held are properly
      set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
      gets upset about bad lock orderin.
      
      The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4008c04a
  15. 04 2月, 2009 3 次提交
    • C
      Btrfs: Change btree locking to use explicit blocking points · b4ce94de
      Chris Mason 提交于
      Most of the btrfs metadata operations can be protected by a spinlock,
      but some operations still need to schedule.
      
      So far, btrfs has been using a mutex along with a trylock loop,
      most of the time it is able to avoid going for the full mutex, so
      the trylock loop is a big performance gain.
      
      This commit is step one for getting rid of the blocking locks entirely.
      btrfs_tree_lock takes a spinlock, and the code explicitly switches
      to a blocking lock when it starts an operation that can schedule.
      
      We'll be able get rid of the blocking locks in smaller pieces over time.
      Tracing allows us to find the most common cause of blocking, so we
      can start with the hot spots first.
      
      The basic idea is:
      
      btrfs_tree_lock() returns with the spin lock held
      
      btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
      the extent buffer flags, and then drops the spin lock.  The buffer is
      still considered locked by all of the btrfs code.
      
      If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
      the spin lock and waits on a wait queue for the blocking bit to go away.
      
      Much of the code that needs to set the blocking bit finishes without actually
      blocking a good percentage of the time.  So, an adaptive spin is still
      used against the blocking bit to avoid very high context switch rates.
      
      btrfs_clear_lock_blocking() clears the blocking bit and returns
      with the spinlock held again.
      
      btrfs_tree_unlock() can be called on either blocking or spinning locks,
      it does the right thing based on the blocking bit.
      
      ctree.c has a helper function to set/clear all the locked buffers in a
      path as blocking.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b4ce94de
    • C
      Btrfs: hash_lock is no longer needed · c487685d
      Chris Mason 提交于
      Before metadata is written to disk, it is updated to reflect that writeout
      has begun.  Once this update is done, the block must be cow'd before it
      can be modified again.
      
      This update was originally synchronized by using a per-fs spinlock.  Today
      the buffers for the metadata blocks are locked before writeout begins,
      and everyone that tests the flag has the buffer locked as well.
      
      So, the per-fs spinlock (called hash_lock for no good reason) is no
      longer required.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c487685d
    • C
      Btrfs: async threads should try harder to find work · b51912c9
      Chris Mason 提交于
      Tracing shows the delay between when an async thread goes to sleep
      and when more work is added is often very short.  This commit adds
      a little bit of delay and extra checking to the code right before
      we schedule out.
      
      It allows more work to be added to the worker
      without requiring notifications from other procs.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b51912c9
  16. 22 1月, 2009 1 次提交
    • Y
      Btrfs: fix tree logs parallel sync · 7237f183
      Yan Zheng 提交于
      To improve performance, btrfs_sync_log merges tree log sync
      requests. But it wrongly merges sync requests for different
      tree logs. If multiple tree logs are synced at the same time,
      only one of them actually gets synced.
      
      This patch has following changes to fix the bug:
      
      Move most tree log related fields in btrfs_fs_info to
      btrfs_root. This allows merging sync requests separately
      for each tree log.
      
      Don't insert root item into the log root tree immediately
      after log tree is allocated. Root item for log tree is
      inserted when log tree get synced for the first time. This
      allows syncing the log root tree without first syncing all
      log trees.
      
      At tree-log sync, btrfs_sync_log first sync the log tree;
      then updates corresponding root item in the log root tree;
      sync the log root tree; then update the super block.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      7237f183
  17. 21 1月, 2009 4 次提交
  18. 06 1月, 2009 2 次提交
  19. 20 12月, 2008 1 次提交
  20. 18 12月, 2008 1 次提交
    • C
      Btrfs: shift all end_io work to thread pools · cad321ad
      Chris Mason 提交于
      bio_end_io for reads without checksumming on and btree writes were
      happening without using async thread pools.  This means the extent_io.c
      code had to use spin_lock_irq and friends on the rb tree locks for
      extent state.
      
      There were some irq safe vs unsafe lock inversions between the delallock
      lock and the extent state locks.  This patch gets rid of them by moving
      all end_io code into the thread pools.
      
      To avoid contention and deadlocks between the data end_io processing and the
      metadata end_io processing yet another thread pool is added to finish
      off metadata writes.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      cad321ad
  21. 12 12月, 2008 1 次提交
  22. 11 12月, 2008 1 次提交
  23. 09 12月, 2008 1 次提交
    • Y
      Btrfs: superblock duplication · a512bbf8
      Yan Zheng 提交于
      This patch implements superblock duplication. Superblocks
      are stored at offset 16K, 64M and 256G on every devices.
      Spaces used by superblocks are preserved by the allocator,
      which uses a reverse mapping function to find the logical
      addresses that correspond to superblocks. Thank you,
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      a512bbf8