1. 15 3月, 2010 1 次提交
  2. 09 3月, 2010 1 次提交
  3. 18 12月, 2009 3 次提交
  4. 16 12月, 2009 1 次提交
  5. 12 11月, 2009 1 次提交
  6. 14 10月, 2009 1 次提交
    • C
      Btrfs: streamline tree-log btree block writeout · 690587d1
      Chris Mason 提交于
      Syncing the tree log is a 3 phase operation.
      
      1) write and wait for all the tree log blocks for a given root.
      
      2) write and wait for all the tree log blocks for the
      tree of tree log roots.
      
      3) write and wait for the super blocks (barriers here)
      
      This isn't as efficient as it could be because there is
      no requirement to wait for the blocks from step one to hit the disk
      before we start writing the blocks from step two.  This commit
      changes the sequence so that we don't start waiting until
      all the tree blocks from both steps one and two have been sent
      to disk.
      
      We do this by breaking up btrfs_write_wait_marked_extents into
      two functions, which is trivial because it was already broken
      up into two parts.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      690587d1
  7. 29 9月, 2009 1 次提交
    • J
      Btrfs: proper -ENOSPC handling · 9ed74f2d
      Josef Bacik 提交于
      At the start of a transaction we do a btrfs_reserve_metadata_space() and
      specify how many items we plan on modifying.  Then once we've done our
      modifications and such, just call btrfs_unreserve_metadata_space() for
      the same number of items we reserved.
      
      For keeping track of metadata needed for data I've had to add an extent_io op
      for when we merge extents.  This lets us track space properly when we are doing
      sequential writes, so we don't end up reserving way more metadata space than
      what we need.
      
      The only place where the metadata space accounting is not done is in the
      relocation code.  This is because Yan is going to be reworking that code in the
      near future, so running btrfs-vol -b could still possibly result in a ENOSPC
      related panic.  This patch also turns off the metadata_ratio stuff in order to
      allow users to more efficiently use their disk space.
      
      This patch makes it so we track how much metadata we need for an inode's
      delayed allocation extents by tracking how many extents are currently
      waiting for allocation.  It introduces two new callbacks for the
      extent_io tree's, merge_extent_hook and split_extent_hook.  These help
      us keep track of when we merge delalloc extents together and split them
      up.  Reservations are handled prior to any actually dirty'ing occurs,
      and then we unreserve after we dirty.
      
      btrfs_unreserve_metadata_for_delalloc() will make the appropriate
      unreservations as needed based on the number of reservations we
      currently have and the number of extents we currently have.  Doing the
      reservation outside of doing any of the actual dirty'ing lets us do
      things like filemap_flush() the inode to try and force delalloc to
      happen, or as a last resort actually start allocation on all delalloc
      inodes in the fs.  This has survived dbench, fs_mark and an fsx torture
      test.
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9ed74f2d
  8. 22 9月, 2009 3 次提交
    • Y
      Btrfs: add snapshot/subvolume destroy ioctl · 76dda93c
      Yan, Zheng 提交于
      This patch adds snapshot/subvolume destroy ioctl.  A subvolume that isn't being
      used and doesn't contains links to other subvolumes can be destroyed.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      76dda93c
    • Y
      Btrfs: change how subvolumes are organized · 4df27c4d
      Yan, Zheng 提交于
      btrfs allows subvolumes and snapshots anywhere in the directory tree.
      If we snapshot a subvolume that contains a link to other subvolume
      called subvolA, subvolA can be accessed through both the original
      subvolume and the snapshot. This is similar to creating hard link to
      directory, and has the very similar problems.
      
      The aim of this patch is enforcing there is only one access point to
      each subvolume. Only the first directory entry (the one added when
      the subvolume/snapshot was created) is treated as valid access point.
      The first directory entry is distinguished by checking root forward
      reference. If the corresponding root forward reference is missing,
      we know the entry is not the first one.
      
      This patch also adds snapshot/subvolume rename support, the code
      allows rename subvolume link across subvolumes.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4df27c4d
    • Y
      Btrfs: speed up snapshot dropping · 1c4850e2
      Yan, Zheng 提交于
      This patch contains two changes to avoid unnecessary tree block reads during
      snapshot dropping.
      
      First, check tree block's reference count and flags before reading the tree
      block. if reference count > 1 and there is no need to update backrefs, we can
      avoid reading the tree block.
      
      Second, save when snapshot was created in root_key.offset. we can compare block
      pointer's generation with snapshot's creation generation during updating
      backrefs. If a given block was created before snapshot was created, the
      snapshot can't be the tree block's owner. So we can avoid reading the block.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1c4850e2
  9. 18 9月, 2009 1 次提交
    • Y
      Btrfs: improve async block group caching · 11833d66
      Yan Zheng 提交于
      This patch gets rid of two limitations of async block group caching.
      The old code delays handling pinned extents when block group is in
      caching. To allocate logged file extents, the old code need wait
      until block group is fully cached. To get rid of the limitations,
      This patch introduces a data structure to track the progress of
      caching. Base on the caching progress, we know which extents should
      be added to the free space cache when handling the pinned extents.
      The logged file extents are also handled in a similar way.
      
      This patch also changes how pinned extents are tracked. The old
      code uses one tree to track pinned extents, and copy the pinned
      extents tree at transaction commit time. This patch makes it use
      two trees to track pinned extents. One tree for extents that are
      pinned in the running transaction, one tree for extents that can
      be unpinned. At transaction commit time, we swap the two trees.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      11833d66
  10. 30 7月, 2009 2 次提交
    • C
      Btrfs: be more polite in the async caching threads · f36f3042
      Chris Mason 提交于
      The semaphore used by the async caching threads can prevent a
      transaction commit, which can make the FS appear to stall.  This
      releases the semaphore more often when a transaction commit is
      in progress.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      f36f3042
    • Y
      Btrfs: preserve commit_root for async caching · 276e680d
      Yan Zheng 提交于
      The async block group caching code uses the commit_root pointer
      to get a stable version of the extent allocation tree for scanning.
      This copy of the tree root isn't going to change and it significantly
      reduces the complexity of the scanning code.
      
      During a commit, we have a loop where we update the extent allocation
      tree root.  We need to loop because updating the root pointer in
      the tree of tree roots may allocate blocks which may change the
      extent allocation tree.
      
      Right now the commit_root pointer is changed inside this loop.  It
      is more correct to change the commit_root pointer only after all the
      looping is done.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      276e680d
  11. 25 7月, 2009 1 次提交
  12. 24 7月, 2009 1 次提交
    • J
      Btrfs: async block group caching · 817d52f8
      Josef Bacik 提交于
      This patch moves the caching of the block group off to a kthread in order to
      allow people to allocate sooner.  Instead of blocking up behind the caching
      mutex, we instead kick of the caching kthread, and then attempt to make an
      allocation.  If we cannot, we wait on the block groups caching waitqueue, which
      the caching kthread will wake the waiting threads up everytime it finds 2 meg
      worth of space, and then again when its finished caching.  This is how I tested
      the speedup from this
      
      mkfs the disk
      mount the disk
      fill the disk up with fs_mark
      unmount the disk
      mount the disk
      time touch /mnt/foo
      
      Without my changes this took 11 seconds on my box, with these changes it now
      takes 1 second.
      
      Another change thats been put in place is we lock the super mirror's in the
      pinned extent map in order to keep us from adding that stuff as free space when
      caching the block group.  This doesn't really change anything else as far as the
      pinned extent map is concerned, since for actual pinned extents we use
      EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
      those extents to keep from leaking memory.
      
      I've also added a check where when we are reading block groups from disk, if the
      amount of space used == the size of the block group, we go ahead and mark the
      block group as cached.  This drastically reduces the amount of time it takes to
      cache the block groups.  Using the same test as above, except doing a dd to a
      file and then unmounting, it used to take 33 seconds to umount, now it takes 3
      seconds.
      
      This version uses the commit_root in the caching kthread, and then keeps track
      of how many async caching threads are running at any given time so if one of the
      async threads is still running as we cross transactions we can wait until its
      finished before handling the pinned extents.  Thank you,
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      817d52f8
  13. 22 7月, 2009 1 次提交
    • Y
      Btrfs: make sure all dirty blocks are written at commit time · 4a8c9a62
      Yan Zheng 提交于
      Write dirty block groups may allocate new block, and so may add new delayed
      back ref. btrfs_run_delayed_refs may make some block groups dirty.
      
      commit_cowonly_roots does not handle the recursion properly, and some dirty
      blocks can be left unwritten at commit time. This patch moves
      btrfs_run_delayed_refs into the loop that writes dirty block groups, and makes
      the code not break out of the loop until there are no dirty block groups or
      delayed back refs.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4a8c9a62
  14. 03 7月, 2009 1 次提交
    • Y
      Btrfs: update backrefs while dropping snapshot · 2c47e605
      Yan Zheng 提交于
      The new backref format has restriction on type of backref item.  If a tree
      block isn't referenced by its owner tree, full backrefs must be used for the
      pointers in it. When a tree block loses its owner tree's reference, backrefs
      for the pointers in it should be updated to full backrefs. Current
      btrfs_drop_snapshot misses the code that updates backrefs, so it's unsafe for
      general use.
      
      This patch adds backrefs update code to btrfs_drop_snapshot.  It isn't a
      problem in the restricted form btrfs_drop_snapshot is used today, but for
      general snapshot deletion this update is required.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      2c47e605
  15. 16 6月, 2009 1 次提交
  16. 10 6月, 2009 1 次提交
    • Y
      Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) · 5d4f98a2
      Yan Zheng 提交于
      This commit introduces a new kind of back reference for btrfs metadata.
      Once a filesystem has been mounted with this commit, IT WILL NO LONGER
      BE MOUNTABLE BY OLDER KERNELS.
      
      When a tree block in subvolume tree is cow'd, the reference counts of all
      extents it points to are increased by one.  At transaction commit time,
      the old root of the subvolume is recorded in a "dead root" data structure,
      and the btree it points to is later walked, dropping reference counts
      and freeing any blocks where the reference count goes to 0.
      
      The increments done during cow and decrements done after commit cancel out,
      and the walk is a very expensive way to go about freeing the blocks that
      are no longer referenced by the new btree root.  This commit reduces the
      transaction overhead by avoiding the need for dead root records.
      
      When a non-shared tree block is cow'd, we free the old block at once, and the
      new block inherits old block's references. When a tree block with reference
      count > 1 is cow'd, we increase the reference counts of all extents
      the new block points to by one, and decrease the old block's reference count by
      one.
      
      This dead tree avoidance code removes the need to modify the reference
      counts of lower level extents when a non-shared tree block is cow'd.
      But we still need to update back ref for all pointers in the block.
      This is because the location of the block is recorded in the back ref
      item.
      
      We can solve this by introducing a new type of back ref. The new
      back ref provides information about pointer's key, level and in which
      tree the pointer lives. This information allow us to find the pointer
      by searching the tree. The shortcoming of the new back ref is that it
      only works for pointers in tree blocks referenced by their owner trees.
      
      This is mostly a problem for snapshots, where resolving one of these
      fuzzy back references would be O(number_of_snapshots) and quite slow.
      The solution used here is to use the fuzzy back references in the common
      case where a given tree block is only referenced by one root,
      and use the full back references when multiple roots have a reference
      on a given block.
      
      This commit adds per subvolume red-black tree to keep trace of cached
      inodes. The red-black tree helps the balancing code to find cached
      inodes whose inode numbers within a given range.
      
      This commit improves the balancing code by introducing several data
      structures to keep the state of balancing. The most important one
      is the back ref cache. It caches how the upper level tree blocks are
      referenced. This greatly reduce the overhead of checking back ref.
      
      The improved balancing code scales significantly better with a large
      number of snapshots.
      
      This is a very large commit and was written in a number of
      pieces.  But, they depend heavily on the disk format change and were
      squashed together to make sure git bisect didn't end up in a
      bad state wrt space balancing or the format change.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5d4f98a2
  17. 25 4月, 2009 1 次提交
    • C
      Btrfs: fix deadlocks and stalls on dead root removal · 59bc5c75
      Chris Mason 提交于
      After a transaction commit, the old root of the subvol btrees are sent through
      snapshot removal.  This is what actually frees up any blocks replaced by
      COW, and anything the old blocks pointed to.
      
      Snapshot deletion will pause when a transaction commit has started, which
      helps to avoid a huge amount of delayed reference count updates piling up
      as the transaction is trying to close.
      
      But, this pause happens after the snapshot deletion process has asked other
      procs on the system to throttle back a bit so that it can make progress.
      
      We don't want to throttle everyone while we're waiting for the transaction
      commit, it leads to deadlocks in the user transaction ioctls used by Ceph
      and makes things slower in general.
      
      This patch changes things to avoid the throttling while we sleep.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      59bc5c75
  18. 03 4月, 2009 2 次提交
    • S
      Btrfs: add flushoncommit mount option · dccae999
      Sage Weil 提交于
      The 'flushoncommit' mount option forces any data dirtied by a write in a
      prior transaction to commit as part of the current commit.  This makes
      the committed state a fully consistent view of the file system from the
      application's perspective (i.e., it includes all completed file system
      operations).  This was previously the behavior only when a snapshot is
      created.
      
      This is used by Ceph to ensure that completed writes make it to the
      platter along with the metadata operations they are bound to (by
      BTRFS_IOC_TRANS_{START,END}).
      Signed-off-by: NSage Weil <sage@newdream.net>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      dccae999
    • C
      Btrfs: rework allocation clustering · fa9c0d79
      Chris Mason 提交于
      Because btrfs is copy-on-write, we end up picking new locations for
      blocks very often.  This makes it fairly difficult to maintain perfect
      read patterns over time, but we can at least do some optimizations
      for writes.
      
      This is done today by remembering the last place we allocated and
      trying to find a free space hole big enough to hold more than just one
      allocation.  The end result is that we tend to write sequentially to
      the drive.
      
      This happens all the time for metadata and it happens for data
      when mounted -o ssd.  But, the way we record it is fairly racey
      and it tends to fragment the free space over time because we are trying
      to allocate fairly large areas at once.
      
      This commit gets rid of the races by adding a free space cluster object
      with dedicated locking to make sure that only one process at a time
      is out replacing the cluster.
      
      The free space fragmentation is somewhat solved by allowing a cluster
      to be comprised of smaller free space extents.  This part definitely
      adds some CPU time to the cluster allocations, but it allows the allocator
      to consume the small holes left behind by cow.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      fa9c0d79
  19. 01 4月, 2009 1 次提交
    • C
      Btrfs: add extra flushing for renames and truncates · 5a3f23d5
      Chris Mason 提交于
      Renames and truncates are both common ways to replace old data with new
      data.  The filesystem can make an effort to make sure the new data is
      on disk before actually replacing the old data.
      
      This is especially important for rename, which many application use as
      though it were atomic for both the data and the metadata involved.  The
      current btrfs code will happily replace a file that is fully on disk
      with one that was just created and still has pending IO.
      
      If we crash after transaction commit but before the IO is done, we'll end
      up replacing a good file with a zero length file.  The solution used
      here is to create a list of inodes that need special ordering and force
      them to disk before the commit is done.  This is similar to the
      ext3 style data=ordering, except it is only done on selected files.
      
      Btrfs is able to get away with this because it does not wait on commits
      very often, even for fsync (which use a sub-commit).
      
      For renames, we order the file when it wasn't already
      on disk and when it is replacing an existing file.  Larger files
      are sent to filemap_flush right away (before the transaction handle is
      opened).
      
      For truncates, we order if the file goes from non-zero size down to
      zero size.  This is a little different, because at the time of the
      truncate the file has no dirty bytes to order.  But, we flag the inode
      so that it is added to the ordered list on close (via release method).  We
      also immediately add it to the ordered list of the current transaction
      so that we can try to flush down any writes the application sneaks in
      before commit.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5a3f23d5
  20. 25 3月, 2009 5 次提交
    • C
      Btrfs: Only let very young transactions grow during commit · 89573b9c
      Chris Mason 提交于
      Commits are fairly expensive, and so btrfs has code to sit around for a while
      during the commit and let new writers come in.
      
      But, while we're sitting there, new delayed refs might be added, and those
      can be expensive to process as well.  Unless the transaction is very very
      young, it makes sense to go ahead and let the commit finish without hanging
      around.
      
      The commit grow loop isn't as important as it used to be, the fsync logging
      code handles most performance critical syncs now.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      89573b9c
    • C
      Btrfs: reduce stalls during transaction commit · b7ec40d7
      Chris Mason 提交于
      To avoid deadlocks and reduce latencies during some critical operations, some
      transaction writers are allowed to jump into the running transaction and make
      it run a little longer, while others sit around and wait for the commit to
      finish.
      
      This is a bit unfair, especially when the callers that jump in do a bunch
      of IO that makes all the others procs on the box wait.  This commit
      reduces the stalls this produces by pre-reading file extent pointers
      during btrfs_finish_ordered_io before the transaction is joined.
      
      It also tunes the drop_snapshot code to politely wait for transactions
      that have started writing out their delayed refs to finish.  This avoids
      new delayed refs being flooded into the queue while we're trying to
      close off the transaction.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b7ec40d7
    • C
      Btrfs: process the delayed reference queue in clusters · c3e69d58
      Chris Mason 提交于
      The delayed reference queue maintains pending operations that need to
      be done to the extent allocation tree.  These are processed by
      finding records in the tree that are not currently being processed one at
      a time.
      
      This is slow because it uses lots of time searching through the rbtree
      and because it creates lock contention on the extent allocation tree
      when lots of different procs are running delayed refs at the same time.
      
      This commit changes things to grab a cluster of refs for processing,
      using a cursor into the rbtree as the starting point of the next search.
      This way we walk smoothly through the rbtree.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c3e69d58
    • C
      Btrfs: do extent allocation and reference count updates in the background · 56bec294
      Chris Mason 提交于
      The extent allocation tree maintains a reference count and full
      back reference information for every extent allocated in the
      filesystem.  For subvolume and snapshot trees, every time
      a block goes through COW, the new copy of the block adds a reference
      on every block it points to.
      
      If a btree node points to 150 leaves, then the COW code needs to go
      and add backrefs on 150 different extents, which might be spread all
      over the extent allocation tree.
      
      These updates currently happen during btrfs_cow_block, and most COWs
      happen during btrfs_search_slot.  btrfs_search_slot has locks held
      on both the parent and the node we are COWing, and so we really want
      to avoid IO during the COW if we can.
      
      This commit adds an rbtree of pending reference count updates and extent
      allocations.  The tree is ordered by byte number of the extent and byte number
      of the parent for the back reference.  The tree allows us to:
      
      1) Modify back references in something close to disk order, reducing seeks
      2) Significantly reduce the number of modifications made as block pointers
      are balanced around
      3) Do all of the extent insertion and back reference modifications outside
      of the performance critical btrfs_search_slot code.
      
      #3 has the added benefit of greatly reducing the btrfs stack footprint.
      The extent allocation tree modifications are done without the deep
      (and somewhat recursive) call chains used in the past.
      
      These delayed back reference updates must be done before the transaction
      commits, and so the rbtree is tied to the transaction.  Throttling is
      implemented to help keep the queue of backrefs at a reasonable size.
      
      Since there was a similar mechanism in place for the extent tree
      extents, that is removed and replaced by the delayed reference tree.
      
      Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      56bec294
    • C
      Btrfs: don't preallocate metadata blocks during btrfs_search_slot · 9fa8cfe7
      Chris Mason 提交于
      In order to avoid doing expensive extent management with tree locks held,
      btrfs_search_slot will preallocate tree blocks for use by COW without
      any tree locks held.
      
      A later commit moves all of the extent allocation work for COW into
      a delayed update mechanism, and this preallocation will no longer be
      required.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9fa8cfe7
  21. 13 2月, 2009 1 次提交
  22. 21 1月, 2009 1 次提交
  23. 06 1月, 2009 3 次提交
  24. 12 12月, 2008 1 次提交
    • Y
      Btrfs: fix leaking block group on balance · d2fb3437
      Yan Zheng 提交于
      The block group structs are referenced in many different
      places, and it's not safe to free while balancing.  So, those block
      group structs were simply leaked instead.
      
      This patch replaces the block group pointer in the inode with the starting byte
      offset of the block group and adds reference counting to the block group
      struct.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      d2fb3437
  25. 09 12月, 2008 1 次提交
    • Y
      Btrfs: superblock duplication · a512bbf8
      Yan Zheng 提交于
      This patch implements superblock duplication. Superblocks
      are stored at offset 16K, 64M and 256G on every devices.
      Spaces used by superblocks are preserved by the allocator,
      which uses a reverse mapping function to find the logical
      addresses that correspond to superblocks. Thank you,
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      a512bbf8
  26. 02 12月, 2008 1 次提交
  27. 19 11月, 2008 1 次提交
  28. 18 11月, 2008 1 次提交
    • C
      Btrfs: Add backrefs and forward refs for subvols and snapshots · 0660b5af
      Chris Mason 提交于
      Subvols and snapshots can now be referenced from any point in the directory
      tree.  We need to maintain back refs for them so we can find lost
      subvols.
      
      Forward refs are added so that we know all of the subvols and
      snapshots referenced anywhere in the directory tree of a single subvol.  This
      can be used to do recursive snapshotting (but they aren't yet) and it is
      also used to detect and prevent directory loops when creating new snapshots.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      0660b5af