1. 25 3月, 2009 3 次提交
    • C
      Btrfs: process the delayed reference queue in clusters · c3e69d58
      Chris Mason 提交于
      The delayed reference queue maintains pending operations that need to
      be done to the extent allocation tree.  These are processed by
      finding records in the tree that are not currently being processed one at
      a time.
      
      This is slow because it uses lots of time searching through the rbtree
      and because it creates lock contention on the extent allocation tree
      when lots of different procs are running delayed refs at the same time.
      
      This commit changes things to grab a cluster of refs for processing,
      using a cursor into the rbtree as the starting point of the next search.
      This way we walk smoothly through the rbtree.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c3e69d58
    • C
      Btrfs: do extent allocation and reference count updates in the background · 56bec294
      Chris Mason 提交于
      The extent allocation tree maintains a reference count and full
      back reference information for every extent allocated in the
      filesystem.  For subvolume and snapshot trees, every time
      a block goes through COW, the new copy of the block adds a reference
      on every block it points to.
      
      If a btree node points to 150 leaves, then the COW code needs to go
      and add backrefs on 150 different extents, which might be spread all
      over the extent allocation tree.
      
      These updates currently happen during btrfs_cow_block, and most COWs
      happen during btrfs_search_slot.  btrfs_search_slot has locks held
      on both the parent and the node we are COWing, and so we really want
      to avoid IO during the COW if we can.
      
      This commit adds an rbtree of pending reference count updates and extent
      allocations.  The tree is ordered by byte number of the extent and byte number
      of the parent for the back reference.  The tree allows us to:
      
      1) Modify back references in something close to disk order, reducing seeks
      2) Significantly reduce the number of modifications made as block pointers
      are balanced around
      3) Do all of the extent insertion and back reference modifications outside
      of the performance critical btrfs_search_slot code.
      
      #3 has the added benefit of greatly reducing the btrfs stack footprint.
      The extent allocation tree modifications are done without the deep
      (and somewhat recursive) call chains used in the past.
      
      These delayed back reference updates must be done before the transaction
      commits, and so the rbtree is tied to the transaction.  Throttling is
      implemented to help keep the queue of backrefs at a reasonable size.
      
      Since there was a similar mechanism in place for the extent tree
      extents, that is removed and replaced by the delayed reference tree.
      
      Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      56bec294
    • C
      Btrfs: don't preallocate metadata blocks during btrfs_search_slot · 9fa8cfe7
      Chris Mason 提交于
      In order to avoid doing expensive extent management with tree locks held,
      btrfs_search_slot will preallocate tree blocks for use by COW without
      any tree locks held.
      
      A later commit moves all of the extent allocation work for COW into
      a delayed update mechanism, and this preallocation will no longer be
      required.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9fa8cfe7
  2. 13 2月, 2009 1 次提交
  3. 21 1月, 2009 1 次提交
  4. 06 1月, 2009 3 次提交
  5. 12 12月, 2008 1 次提交
    • Y
      Btrfs: fix leaking block group on balance · d2fb3437
      Yan Zheng 提交于
      The block group structs are referenced in many different
      places, and it's not safe to free while balancing.  So, those block
      group structs were simply leaked instead.
      
      This patch replaces the block group pointer in the inode with the starting byte
      offset of the block group and adds reference counting to the block group
      struct.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      d2fb3437
  6. 09 12月, 2008 1 次提交
    • Y
      Btrfs: superblock duplication · a512bbf8
      Yan Zheng 提交于
      This patch implements superblock duplication. Superblocks
      are stored at offset 16K, 64M and 256G on every devices.
      Spaces used by superblocks are preserved by the allocator,
      which uses a reverse mapping function to find the logical
      addresses that correspond to superblocks. Thank you,
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      a512bbf8
  7. 02 12月, 2008 1 次提交
  8. 19 11月, 2008 1 次提交
  9. 18 11月, 2008 3 次提交
    • C
      Btrfs: Add backrefs and forward refs for subvols and snapshots · 0660b5af
      Chris Mason 提交于
      Subvols and snapshots can now be referenced from any point in the directory
      tree.  We need to maintain back refs for them so we can find lost
      subvols.
      
      Forward refs are added so that we know all of the subvols and
      snapshots referenced anywhere in the directory tree of a single subvol.  This
      can be used to do recursive snapshotting (but they aren't yet) and it is
      also used to detect and prevent directory loops when creating new snapshots.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      0660b5af
    • C
      Btrfs: Give each subvol and snapshot their own anonymous devid · 3394e160
      Chris Mason 提交于
      Each subvolume has its own private inode number space, and so we need
      to fill in different device numbers for each subvolume to avoid confusing
      applications.
      
      This commit puts a struct super_block into struct btrfs_root so it can
      call set_anon_super() and get a different device number generated for
      each root.
      
      btrfs_rename is changed to prevent renames across subvols.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      3394e160
    • C
      Btrfs: Allow subvolumes and snapshots anywhere in the directory tree · 3de4586c
      Chris Mason 提交于
      Before, all snapshots and subvolumes lived in a single flat directory.  This
      was awkward and confusing because the single flat directory was only writable
      with the ioctls.
      
      This commit changes the ioctls to create subvols and snapshots at any
      point in the directory tree.  This requires making separate ioctls for
      snapshot and subvol creation instead of a combining them into one.
      
      The subvol ioctl does:
      
      btrfsctl -S subvol_name parent_dir
      
      After the ioctl is done subvol_name lives inside parent_dir.
      
      The snapshot ioctl does:
      
      btrfsctl -s path_for_snapshot root_to_snapshot
      
      path_for_snapshot can be an absolute or relative path.  btrfsctl breaks it up
      into directory and basename components.
      
      root_to_snapshot can be any file or directory in the FS.  The snapshot
      is taken of the entire root where that file lives.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      3de4586c
  10. 08 11月, 2008 1 次提交
    • C
      Btrfs: Avoid unplug storms during commit · 5f2cc086
      Chris Mason 提交于
      While doing a commit, btrfs makes sure all the metadata blocks
      were properly written to disk, calling wait_on_page_writeback for
      each page.  This writeback happens after allowing another transaction
      to start, so it competes for the disk with other processes in the FS.
      
      If the page writeback bit is still set, each wait_on_page_writeback might
      trigger an unplug, even though the page might be waiting for checksumming
      to finish or might be waiting for the async work queue to submit the
      bio.
      
      This trades wait_on_page_writeback for waiting on the extent writeback
      bits.  It won't trigger any unplugs and substantially improves performance
      in a number of workloads.
      
      This also changes the async bio submission to avoid requeueing if there
      is only one device.  The requeue just wastes CPU time because there are
      no other devices to service.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5f2cc086
  11. 31 10月, 2008 1 次提交
  12. 30 10月, 2008 4 次提交
    • C
      Btrfs: prevent looping forever in finish_current_insert and del_pending_extents · 87ef2bb4
      Chris Mason 提交于
      finish_current_insert and del_pending_extents process extent tree modifications
      that build up while we are changing the extent tree.  It is a confusing
      bit of code that prevents recursion.
      
      Both functions run through a list of pending operations and both funcs
      add to the list of pending operations.  If you have two procs in either
      one of them, they can end up looping forever making more work for each other.
      
      This patch makes them walk forward through the list of pending changes instead
      of always trying to process the entire list.  At transaction commit
      time, we catch any changes that were left over.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      87ef2bb4
    • Y
      Btrfs: Add root tree pointer transaction ids · 84234f3a
      Yan Zheng 提交于
      This patch adds transaction IDs to root tree pointers.
      Transaction IDs in tree pointers are compared with the
      generation numbers in block headers when reading root
      blocks of trees. This can detect some types of IO errors.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      
      84234f3a
    • J
      Btrfs: nuke fs wide allocation mutex V2 · 25179201
      Josef Bacik 提交于
      This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
      of little locks.
      
      There is now a pinned_mutex, which is used when messing with the pinned_extents
      extent io tree, and the extent_ins_mutex which is used with the pending_del and
      extent_ins extent io trees.
      
      The locking for the extent tree stuff was inspired by a patch that Yan Zheng
      wrote to fix a race condition, I cleaned it up some and changed the locking
      around a little bit, but the idea remains the same.  Basically instead of
      holding the extent_ins_mutex throughout the processing of an extent on the
      extent_ins or pending_del trees, we just hold it while we're searching and when
      we clear the bits on those trees, and lock the extent for the duration of the
      operations on the extent.
      
      Also to keep from getting hung up waiting to lock an extent, I've added a
      try_lock_extent so if we cannot lock the extent, move on to the next one in the
      tree and we'll come back to that one.  I have tested this heavily and it does
      not appear to break anything.  This has to be applied on top of my
      find_free_extent redo patch.
      
      I tested this patch on top of Yan's space reblancing code and it worked fine.
      The only thing that has changed since the last version is I pulled out all my
      debugging stuff, apparently I forgot to run guilt refresh before I sent the
      last patch out.  Thank you,
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      
      25179201
    • Y
      Btrfs: Improve space balancing code · f82d02d9
      Yan Zheng 提交于
      This patch improves the space balancing code to keep more sharing
      of tree blocks. The only case that breaks sharing of tree blocks is
      data extents get fragmented during balancing. The main changes in
      this patch are:
      
      Add a 'drop sub-tree' function. This solves the problem in old code
      that BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree block.
      
      Remove relocation mapping tree. Relocation mappings are stored in
      struct btrfs_ref_path and updated dynamically during walking up/down
      the reference path. This reduces CPU usage and simplifies code.
      
      This patch also fixes a bug. Root items for reloc trees should be
      updated in btrfs_free_reloc_root.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      
      f82d02d9
  13. 04 10月, 2008 1 次提交
    • C
      Btrfs: remove last_log_alloc allocator optimization · 30c43e24
      Chris Mason 提交于
      The tree logging code was trying to separate tree log allocations
      from normal metadata allocations to improve writeback patterns during
      an fsync.
      
      But, the code was not effective and ended up just mixing tree log
      blocks with regular metadata.  That seems to be working fairly well,
      so the last_log_alloc code can be removed.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      30c43e24
  14. 30 9月, 2008 1 次提交
    • C
      Btrfs: add and improve comments · d352ac68
      Chris Mason 提交于
      This improves the comments at the top of many functions.  It didn't
      dive into the guts of functions because I was trying to
      avoid merging problems with the new allocator and back reference work.
      
      extent-tree.c and volumes.c were both skipped, and there is definitely
      more work todo in cleaning and commenting the code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      d352ac68
  15. 26 9月, 2008 3 次提交
    • Z
      Btrfs: update space balancing code · 1a40e23b
      Zheng Yan 提交于
      This patch updates the space balancing code to utilize the new
      backref format.  Before, btrfs-vol -b would break any COW links
      on data blocks or metadata.  This was slow and caused the amount
      of space used to explode if a large number of snapshots were present.
      
      The new code can keeps the sharing of all data extents and
      most of the tree blocks.
      
      To maintain the sharing of data extents, the space balance code uses
      a seperate inode hold data extent pointers, then updates the references
      to point to the new location.
      
      To maintain the sharing of tree blocks, the space balance code uses
      reloc trees to relocate tree blocks in reference counted roots.
      There is one reloc tree for each subvol, and all reloc trees share
      same root key objectid. Reloc trees are snapshots of the latest
      committed roots of subvols (root->commit_root).
      
      To relocate a tree block referenced by a subvol, there are two steps.
      COW the block through subvol's reloc tree, then update block pointer in
      the subvol to point to the new block. Since all reloc trees share
      same root key objectid, doing special handing for tree blocks
      owned by them is easy. Once a tree block has been COWed in one
      reloc tree, we can use the resulting new block directly when the
      same block is required to COW again through other reloc trees.
      In this way, relocated tree blocks are shared between reloc trees,
      so they are also shared between subvols.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1a40e23b
    • Z
      Btrfs: extent_map and data=ordered fixes for space balancing · 5b21f2ed
      Zheng Yan 提交于
      * Add an EXTENT_BOUNDARY state bit to keep the writepage code
      from merging data extents that are in the process of being
      relocated.  This allows us to do accounting for them properly.
      
      * The balancing code relocates data extents indepdent of the underlying
      inode.  The extent_map code was modified to properly account for
      things moving around (invalidating extent_map caches in the inode).
      
      * Don't take the drop_mutex in the create_subvol ioctl.  It isn't
      required.
      
      * Fix walking of the ordered extent list to avoid races with sys_unlink
      
      * Change the lock ordering rules.  Transaction start goes outside
      the drop_mutex.  This allows btrfs_commit_transaction to directly
      drop the relocation trees.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5b21f2ed
    • Z
      Btrfs: Add shared reference cache · e4657689
      Zheng Yan 提交于
      Btrfs has a cache of reference counts in leaves, allowing it to
      avoid reading tree leaves while deleting snapshots.  To reduce
      contention with multiple subvolumes, this cache is private to each
      subvolume.
      
      This patch adds shared reference cache support. The new space
      balancing code plays with multiple subvols at the same time, So
      the old per-subvol reference cache is not well suited.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      e4657689
  16. 25 9月, 2008 14 次提交
    • C
      Btrfs: Record dirty pages tree-log pages in an extent_io tree · d0c803c4
      Chris Mason 提交于
      This is the same way the transaction code makes sure that all the
      other tree blocks are safely on disk.  There's an extent_io tree
      for each root, and any blocks allocated to the tree logs are
      recorded in that tree.
      
      At tree-log sync, the extent_io tree is walked to flush down the
      dirty pages and wait for them.
      
      The main benefit is less time spent walking the tree log and skipping
      clean pages, and getting sequential IO down to the drive.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      d0c803c4
    • C
      Btrfs: Tree logging fixes · 4bef0848
      Chris Mason 提交于
      * Pin down data blocks to prevent them from being reallocated like so:
      
      trans 1: allocate file extent
      trans 2: free file extent
      trans 3: free file extent during old snapshot deletion
      trans 3: allocate file extent to new file
      trans 3: fsync new file
      
      Before the tree logging code, this was legal because the fsync
      would commit the transation that did the final data extent free
      and the transaction that allocated the extent to the new file
      at the same time.
      
      With the tree logging code, the tree log subtransaction can commit
      before the transaction that freed the extent.  If we crash,
      we're left with two different files using the extent.
      
      * Don't wait in start_transaction if log replay is going on.  This
      avoids deadlocks from iput while we're cleaning up link counts in the
      replay code.
      
      * Don't deadlock in replay_one_name by trying to read an inode off
      the disk while holding paths for the directory
      
      * Hold the buffer lock while we mark a buffer as written.  This
      closes a race where someone is changing a buffer while we write it.
      They are supposed to mark it dirty again after they change it, but
      this violates the cow rules.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4bef0848
    • C
      Btrfs: Add a write ahead tree log to optimize synchronous operations · e02119d5
      Chris Mason 提交于
      File syncs and directory syncs are optimized by copying their
      items into a special (copy-on-write) log tree.  There is one log tree per
      subvolume and the btrfs super block points to a tree of log tree roots.
      
      After a crash, items are copied out of the log tree and back into the
      subvolume.  See tree-log.c for all the details.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      e02119d5
    • C
      Btrfs: Wait for async bio submissions to make some progress at queue time · b64a2851
      Chris Mason 提交于
      Before, the btrfs bdi congestion function was used to test for too many
      async bios.  This keeps that check to throttle pdflush, but also
      adds a check while queuing bios.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b64a2851
    • C
      Btrfs: Transaction commit: don't use filemap_fdatawait · 777e6bd7
      Chris Mason 提交于
      After writing out all the remaining btree blocks in the transaction,
      the commit code would use filemap_fdatawait to make sure it was all
      on disk.  This means it would wait for blocks written by other procs
      as well.
      
      The new code walks the list of blocks for this transaction again
      and waits only for those required by this transaction.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      777e6bd7
    • Y
      Btrfs: Fix nodatacow for the new data=ordered mode · 7ea394f1
      Yan Zheng 提交于
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      7ea394f1
    • Y
      Btrfs: Various small fixes. · b48652c1
      Yan Zheng 提交于
      This trivial patch contains two locking fixes and a off by one fix.
      
      ---
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b48652c1
    • S
      Btrfs: fix ioctl-initiated transactions vs wait_current_trans() · 9ca9ee09
      Sage Weil 提交于
      Commit 597:466b27332893 (btrfs_start_transaction: wait for commits in
      progress) breaks the transaction start/stop ioctls by making
      btrfs_start_transaction conditionally wait for the next transaction to
      start.  If an application artificially is holding a transaction open,
      things deadlock.
      
      This workaround maintains a count of open ioctl-initiated transactions in
      fs_info, and avoids wait_current_trans() if any are currently open (in
      start_transaction() and btrfs_throttle()).  The start transaction ioctl
      uses a new btrfs_start_ioctl_transaction() that _does_ call
      wait_current_trans(), effectively pushing the join/wait decision to the
      outer ioctl-initiated transaction.
      
      This more or less neuters btrfs_throttle() when ioctl-initiated
      transactions are in use, but that seems like a pretty fundamental
      consequence of wrapping lots of write()'s in a transaction.  Btrfs has no
      way to tell if the application considers a given operation as part of it's
      transaction.
      
      Obviously, if the transaction start/stop ioctls aren't being used, there
      is no effect on current behavior.
      Signed-off-by: NSage Weil <sage@newdream.net>
      ---
       ctree.h       |    1 +
       ioctl.c       |   12 +++++++++++-
       transaction.c |   18 +++++++++++++-----
       transaction.h |    2 ++
       4 files changed, 27 insertions(+), 6 deletions(-)
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9ca9ee09
    • C
      Btrfs: More throttle tuning · 2dd3e67b
      Chris Mason 提交于
      * Make walk_down_tree wake up throttled tasks more often
      * Make walk_down_tree call cond_resched during long loops
      * As the size of the ref cache grows, wait longer in throttle
      * Get rid of the reada code in walk_down_tree, the leaves don't get
        read anymore, thanks to the ref cache.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      2dd3e67b
    • C
      btrfs_search_slot: reduce lock contention by cowing in two stages · 65b51a00
      Chris Mason 提交于
      A btree block cow has two parts, the first is to allocate a destination
      block and the second is to copy the old bock over.
      
      The first part needs locks in the extent allocation tree, and may need to
      do IO.  This changeset splits that into a separate function that can be
      called without any tree locks held.
      
      btrfs_search_slot is changed to drop its path and start over if it has
      to COW a contended block.  This often means that many writers will
      pre-alloc a new destination for a the same contended block, but they
      cache their prealloc for later use on lower levels in the tree.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      65b51a00
    • C
      18e35e0a
    • C
      Btrfs: Throttle tuning · 37d1aeee
      Chris Mason 提交于
      This avoids waiting for transactions with pages locked by breaking out
      the code to wait for the current transaction to close into a function
      called by btrfs_throttle.
      
      It also lowers the limits for where we start throttling.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      37d1aeee
    • Y
      Btrfs: implement memory reclaim for leaf reference cache · bcc63abb
      Yan 提交于
      The memory reclaiming issue happens when snapshot exists. In that
      case, some cache entries may not be used during old snapshot dropping,
      so they will remain in the cache until umount.
      
      The patch adds a field to struct btrfs_leaf_ref to record create time. Besides,
      the patch makes all dead roots of a given snapshot linked together in order of
      create time. After a old snapshot was completely dropped, we check the dead
      root list and remove all cache entries created before the oldest dead root in
      the list.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      bcc63abb
    • Y
      Btrfs: Update and fix mount -o nodatacow · f321e491
      Yan Zheng 提交于
      To check whether a given file extent is referenced by multiple snapshots, the
      checker walks down the fs tree through dead root and checks all tree blocks in
      the path.
      
      We can easily detect whether a given tree block is directly referenced by other
      snapshot. We can also detect any indirect reference from other snapshot by
      checking reference's generation. The checker can always detect multiple
      references, but can't reliably detect cases of single reference. So btrfs may
      do file data cow even there is only one reference.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      f321e491