1. 25 Sep, 2008 40 commits
    • C
      Btrfs: Record dirty pages tree-log pages in an extent_io tree · d0c803c4
      Chris Mason committed
      This is the same way the transaction code makes sure that all the
      other tree blocks are safely on disk.  There's an extent_io tree
      for each root, and any blocks allocated to the tree logs are
      recorded in that tree.
      
      At tree-log sync, the extent_io tree is walked to flush down the
      dirty pages and wait for them.
      
      The main benefit is less time spent walking the tree log and skipping
      clean pages, and getting sequential IO down to the drive.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      d0c803c4
    • C
      Btrfs: Tree logging fixes · 4bef0848
      Chris Mason committed
      * Pin down data blocks to prevent them from being reallocated like so:
      
      trans 1: allocate file extent
      trans 2: free file extent
      trans 3: free file extent during old snapshot deletion
      trans 3: allocate file extent to new file
      trans 3: fsync new file
      
      Before the tree logging code, this was legal because the fsync
      would commit the transaction that did the final data extent free
      and the transaction that allocated the extent to the new file
      at the same time.
      
      With the tree logging code, the tree log subtransaction can commit
      before the transaction that freed the extent.  If we crash,
      we're left with two different files using the extent.
      
      * Don't wait in start_transaction if log replay is going on.  This
      avoids deadlocks from iput while we're cleaning up link counts in the
      replay code.
      
      * Don't deadlock in replay_one_name by trying to read an inode off
      the disk while holding paths for the directory.
      
      * Hold the buffer lock while we mark a buffer as written.  This
      closes a race where someone is changing a buffer while we write it.
      They are supposed to mark it dirty again after they change it, but
      this violates the cow rules.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      4bef0848
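The reallocation hazard in the first bullet of 4bef0848 can be illustrated with a toy allocator. This is a minimal sketch, not the btrfs code: `toy_fs`, `toy_free`, `toy_alloc` and `toy_commit` are hypothetical names, and extents are reduced to slots in a small array. A freed extent stays pinned until the transaction that freed it commits, so an allocation made in between cannot hand it to a new file.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_EXTENTS 8

/* Hypothetical extent states; TOY_PINNED means "freed, but not yet durable". */
enum toy_state { TOY_FREE, TOY_USED, TOY_PINNED };

struct toy_fs {
    enum toy_state extent[TOY_EXTENTS];
};

/* Free an extent inside the running transaction: pin it instead of
 * making it immediately reusable. */
static void toy_free(struct toy_fs *fs, size_t idx)
{
    assert(idx < TOY_EXTENTS && fs->extent[idx] == TOY_USED);
    fs->extent[idx] = TOY_PINNED;
}

/* Allocate the first extent that is really free; pinned extents are
 * skipped. Returns the index, or -1 if nothing is available. */
static int toy_alloc(struct toy_fs *fs)
{
    for (size_t i = 0; i < TOY_EXTENTS; i++) {
        if (fs->extent[i] == TOY_FREE) {
            fs->extent[i] = TOY_USED;
            return (int)i;
        }
    }
    return -1;
}

/* Full transaction commit: the frees are durable now, so unpin them. */
static void toy_commit(struct toy_fs *fs)
{
    for (size_t i = 0; i < TOY_EXTENTS; i++) {
        if (fs->extent[i] == TOY_PINNED)
            fs->extent[i] = TOY_FREE;
    }
}
```

In this sketch, an extent freed with `toy_free` is not returned by `toy_alloc` until `toy_commit` has run, which mirrors the ordering the commit message asks for.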
    • C
      Btrfs: Add a write ahead tree log to optimize synchronous operations · e02119d5
      Chris Mason committed
      File syncs and directory syncs are optimized by copying their
      items into a special (copy-on-write) log tree.  There is one log tree per
      subvolume and the btrfs super block points to a tree of log tree roots.
      
      After a crash, items are copied out of the log tree and back into the
      subvolume.  See tree-log.c for all the details.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      e02119d5
    • C
      Btrfs: Wait for async bio submissions to make some progress at queue time · b64a2851
      Chris Mason committed
      Before, the btrfs bdi congestion function was used to test for too many
      async bios.  This keeps that check to throttle pdflush, but also
      adds a check while queuing bios.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      b64a2851
    • C
      Btrfs: Transaction commit: don't use filemap_fdatawait · 777e6bd7
      Chris Mason committed
      After writing out all the remaining btree blocks in the transaction,
      the commit code would use filemap_fdatawait to make sure it was all
      on disk.  This means it would wait for blocks written by other procs
      as well.
      
      The new code walks the list of blocks for this transaction again
      and waits only for those required by this transaction.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      777e6bd7
    • Y
      Btrfs: Fix nodatacow for the new data=ordered mode · 7ea394f1
      Yan Zheng committed
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      7ea394f1
    • Y
      Btrfs: Various small fixes. · b48652c1
      Yan Zheng committed
      This trivial patch contains two locking fixes and an off-by-one fix.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      b48652c1
    • S
      Btrfs: fix ioctl-initiated transactions vs wait_current_trans() · 9ca9ee09
      Sage Weil committed
      Commit 597:466b27332893 (btrfs_start_transaction: wait for commits in
      progress) breaks the transaction start/stop ioctls by making
      btrfs_start_transaction conditionally wait for the next transaction to
      start.  If an application is artificially holding a transaction open,
      things deadlock.
      
      This workaround maintains a count of open ioctl-initiated transactions in
      fs_info, and avoids wait_current_trans() if any are currently open (in
      start_transaction() and btrfs_throttle()).  The start transaction ioctl
      uses a new btrfs_start_ioctl_transaction() that _does_ call
      wait_current_trans(), effectively pushing the join/wait decision to the
      outer ioctl-initiated transaction.
      
      This more or less neuters btrfs_throttle() when ioctl-initiated
      transactions are in use, but that seems like a pretty fundamental
      consequence of wrapping lots of write()'s in a transaction.  Btrfs has no
      way to tell if the application considers a given operation as part of its
      transaction.
      
      Obviously, if the transaction start/stop ioctls aren't being used, there
      is no effect on current behavior.
      Signed-off-by: Sage Weil <sage@newdream.net>
      ---
       ctree.h       |    1 +
       ioctl.c       |   12 +++++++++++-
       transaction.c |   18 +++++++++++++-----
       transaction.h |    2 ++
       4 files changed, 27 insertions(+), 6 deletions(-)
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      9ca9ee09
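The counting workaround described in 9ca9ee09 can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `toy_fs_info`, its fields, and all `toy_*` functions are hypothetical, and "a commit is in progress" is reduced to a single flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of fs_info. */
struct toy_fs_info {
    int open_ioctl_trans;     /* ioctl-initiated transactions currently open */
    bool commit_in_progress;  /* assumed condition that would make callers wait */
};

/* Ordinary start_transaction()-style callers skip the wait whenever an
 * ioctl-initiated transaction is open, which avoids the deadlock. */
static bool toy_should_wait(const struct toy_fs_info *info)
{
    return info->commit_in_progress && info->open_ioctl_trans == 0;
}

/* The start-transaction ioctl itself still waits, so the join/wait
 * decision is made once, at the outer transaction. */
static bool toy_ioctl_should_wait(const struct toy_fs_info *info)
{
    return info->commit_in_progress;
}

static void toy_ioctl_trans_start(struct toy_fs_info *info)
{
    info->open_ioctl_trans++;
}

static void toy_ioctl_trans_end(struct toy_fs_info *info)
{
    assert(info->open_ioctl_trans > 0);
    info->open_ioctl_trans--;
}
```

With no ioctl transaction open, `toy_should_wait` behaves exactly as before, which matches the commit's note that current behavior is unaffected when the ioctls are not used.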
    • C
      Btrfs: More throttle tuning · 2dd3e67b
      Chris Mason committed
      * Make walk_down_tree wake up throttled tasks more often
      * Make walk_down_tree call cond_resched during long loops
      * As the size of the ref cache grows, wait longer in throttle
      * Get rid of the reada code in walk_down_tree, the leaves don't get
        read anymore, thanks to the ref cache.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      2dd3e67b
    • C
      btrfs_search_slot: reduce lock contention by cowing in two stages · 65b51a00
      Chris Mason committed
      A btree block cow has two parts, the first is to allocate a destination
      block and the second is to copy the old block over.
      
      The first part needs locks in the extent allocation tree, and may need to
      do IO.  This changeset splits that into a separate function that can be
      called without any tree locks held.
      
      btrfs_search_slot is changed to drop its path and start over if it has
      to COW a contended block.  This often means that many writers will
      pre-alloc a new destination for the same contended block, but they
      cache their prealloc for later use on lower levels in the tree.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      65b51a00
    • C
      18e35e0a
    • C
      Btrfs: Throttle tuning · 37d1aeee
      Chris Mason committed
      This avoids waiting for transactions with pages locked by breaking out
      the code to wait for the current transaction to close into a function
      called by btrfs_throttle.
      
      It also lowers the limits for where we start throttling.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      37d1aeee
    • Y
      Btrfs: implement memory reclaim for leaf reference cache · bcc63abb
      Yan committed
      The memory reclaim issue arises when a snapshot exists. In that
      case, some cache entries may not be used while an old snapshot is dropped,
      so they remain in the cache until umount.
      
      The patch adds a field to struct btrfs_leaf_ref to record its create time. It
      also links all dead roots of a given snapshot together in order of create
      time. After an old snapshot has been completely dropped, we check the dead
      root list and remove all cache entries created before the oldest dead root in
      the list.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      bcc63abb
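The reclaim rule in bcc63abb (drop every cache entry created before the oldest remaining dead root) can be sketched over a flat array. This is only an illustration: `toy_leaf_ref`, its `created` stamp, and `toy_reclaim` are hypothetical, and the real cache is not an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache entry carrying a creation stamp. */
struct toy_leaf_ref {
    uint64_t bytenr;   /* disk block the entry describes */
    uint64_t created;  /* creation stamp, assumed to increase over time */
};

/* Drop every entry created before oldest_dead_root. Survivors are
 * compacted to the front of the array, in their original order.
 * Returns the number of entries kept. */
static size_t toy_reclaim(struct toy_leaf_ref *refs, size_t n,
                          uint64_t oldest_dead_root)
{
    size_t kept = 0;

    assert(refs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (refs[i].created >= oldest_dead_root)
            refs[kept++] = refs[i];
    }
    return kept;
}
```

Entries stamped at or after the oldest dead root are kept because that root may still need them while it is being dropped.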
    • Y
      Btrfs: Update and fix mount -o nodatacow · f321e491
      Yan Zheng committed
      To check whether a given file extent is referenced by multiple snapshots, the
      checker walks down the fs tree through the dead root and checks all tree
      blocks in the path.
      
      We can easily detect whether a given tree block is directly referenced by
      another snapshot. We can also detect any indirect reference from another
      snapshot by checking the reference's generation. The checker can always detect
      multiple references, but can't reliably detect cases of a single reference. So
      btrfs may do file data cow even when there is only one reference.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      f321e491
    • C
      Btrfs: Throttle operations if the reference cache gets too large · ab78c84d
      Chris Mason committed
      A large reference cache is directly related to a lot of work pending
      for the cleaner thread.  This throttles back new operations based on
      the size of the reference cache so the cleaner thread will be able to keep
      up.
      
      Overall, this actually makes the FS faster because the cleaner thread will
      be more likely to find things in cache.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      ab78c84d
    • C
      Btrfs: Leaf reference cache update · 017e5369
      Chris Mason committed
      This changes the reference cache to make a single cache per root
      instead of one cache per transaction, and to key by the byte number
      of the disk block instead of the keys inside.
      
      This makes it much less likely to have cache misses if a snapshot
      or something has an extra reference on a higher node or a leaf while
      the first transaction that added the leaf into the cache is dropping.
      
      Some throttling is added to functions that free blocks heavily so they
      wait for old transactions to drop.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      017e5369
    • Y
      Btrfs: Add a leaf reference cache · 31153d81
      Yan Zheng committed
      Much of the IO done while dropping snapshots is done looking up
      leaves in the filesystem trees to see if they point to any extents and
      to drop the references on any extents found.
      
      This creates a cache so that IO isn't required.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      31153d81
    • J
      Btrfs: Implement new dir index format · aec7477b
      Josef Bacik committed
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      aec7477b
    • C
      ed98b56a
    • C
      Btrfs: Fix some data=ordered related data corruptions · f421950f
      Chris Mason committed
      Stress testing was showing data checksum errors, most of which were caused
      by a lookup bug in the extent_map tree.  The tree was caching the last
      pointer returned, and searches would check the last pointer first.
      
      But, search callers also expect the search to return the very first
      matching extent in the range, which wasn't always true with the last
      pointer usage.
      
      For now, the code to cache the last return value is just removed.  It is
      easy to fix, but I think lookups are rare enough that it isn't required anymore.
      
      This commit also replaces do_sync_mapping_range with a local copy of the
      related functions.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      f421950f
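The lookup bug described in f421950f can be reproduced in miniature. The sketch below is an assumption-laden toy, not the extent_map code: the map is a sorted array rather than a tree, and `toy_extent`, `toy_lookup_first` and `toy_lookup_cached` are hypothetical names. It shows how checking a cached "last" pointer first can return a matching extent that is not the first match in the range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent: covers bytes [start, start + len). */
struct toy_extent {
    uint64_t start;
    uint64_t len;
};

static bool toy_overlaps(const struct toy_extent *e, uint64_t start, uint64_t len)
{
    return e->start < start + len && start < e->start + e->len;
}

/* What callers expect: the lowest extent overlapping the range.
 * The array is assumed to be sorted by start offset. */
static const struct toy_extent *
toy_lookup_first(const struct toy_extent *map, size_t n,
                 uint64_t start, uint64_t len)
{
    assert(len > 0);
    for (size_t i = 0; i < n; i++) {
        if (toy_overlaps(&map[i], start, len))
            return &map[i];
    }
    return NULL;
}

/* The problematic shortcut: any overlapping cached entry is returned,
 * even when an earlier extent in the map also overlaps the range. */
static const struct toy_extent *
toy_lookup_cached(const struct toy_extent *map, size_t n,
                  const struct toy_extent *last,
                  uint64_t start, uint64_t len)
{
    if (last != NULL && toy_overlaps(last, start, len))
        return last;
    return toy_lookup_first(map, n, start, len);
}
```

For a range that spans several extents, `toy_lookup_cached` returns whichever extent happened to be cached, while `toy_lookup_first` always returns the lowest one, which is the contract the commit message says callers rely on.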
    • C
      btrfs_start_transaction: wait for commits in progress to finish · f9295749
      Chris Mason committed
      btrfs_commit_transaction has to loop waiting for any writers in the
      transaction to finish before it can proceed.  btrfs_start_transaction
      should be polite and not join a transaction that is in the process
      of being finished off.
      
      There are a few places that can't wait, basically the ones doing IO that
      might be needed to finish the transaction.  For them, btrfs_join_transaction
      is added.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      f9295749
    • C
      Btrfs: New data=ordered implementation · e6dcd2dc
      Chris Mason committed
      The old data=ordered code would force commit to wait until
      all the data extents from the transaction were fully on disk.  This
      introduced large latencies into the commit and stalled new writers
      in the transaction for a long time.
      
      The new code changes the way data allocations and extents work:
      
      * When delayed allocation is filled, data extents are reserved, and
        the extent bit EXTENT_ORDERED is set on the entire range of the extent.
        A struct btrfs_ordered_extent is allocated and inserted into a per-inode
        rbtree to track the pending extents.
      
      * As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
        to that page.
      
      * When all of the bytes corresponding to a single struct btrfs_ordered_extent
        are written, the previously reserved extent is inserted into the FS
        btree and into the extent allocation trees.  The checksums for the file
        data are also updated.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      e6dcd2dc
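The completion rule in e6dcd2dc (finish the extent only once every byte has been written) can be sketched with a byte counter. Note the substitution: the commit tracks progress by clearing EXTENT_ORDERED bits per page, whereas this toy uses a hypothetical `bytes_left` field. `toy_ordered_extent` and both functions are illustrative names only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracking record for one pending data extent. */
struct toy_ordered_extent {
    uint64_t file_offset;  /* where the extent starts in the file */
    uint64_t len;          /* total bytes reserved */
    uint64_t bytes_left;   /* bytes still waiting for writeback */
    bool finished;         /* set once the final step has run */
};

static void toy_ordered_init(struct toy_ordered_extent *oe,
                             uint64_t file_offset, uint64_t len)
{
    oe->file_offset = file_offset;
    oe->len = len;
    oe->bytes_left = len;
    oe->finished = false;
}

/* Called as each page write completes. Returns true exactly once, when
 * the last byte is written; that is the point at which the reserved
 * extent would be inserted into the btrees and checksums updated. */
static bool toy_ordered_page_written(struct toy_ordered_extent *oe,
                                     uint64_t bytes)
{
    assert(bytes <= oe->bytes_left);
    oe->bytes_left -= bytes;
    if (oe->bytes_left == 0 && !oe->finished) {
        oe->finished = true;
        return true;
    }
    return false;
}
```

The design point this illustrates is that metadata updates are deferred to the completion of the data IO, so the commit no longer has to wait for all data extents of the transaction.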
    • C
      Btrfs: Drop some verbose printks · 77a41afb
      Chris Mason committed
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      77a41afb
    • C
      Btrfs: Online btree defragmentation fixes · 3f157a2f
      Chris Mason committed
      The btree defragger wasn't making forward progress because the new key wasn't
      being saved by the btrfs_search_forward function.
      
      This also disables the automatic btree defrag, which wasn't scaling well to
      huge filesystems.  The auto-defrag needs to be done differently.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      3f157a2f
    • C
      Btrfs: Replace the transaction work queue with kthreads · a74a4b97
      Chris Mason committed
      This creates one kthread for commits and one kthread for
      deleting old snapshots.  All the work queues are removed.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      a74a4b97
    • C
      Add btrfs_end_transaction_throttle to force writers to wait for pending commits · 89ce8a63
      Chris Mason committed
      The existing throttle mechanism was often not sufficient to prevent
      new writers from coming in and making a given transaction run forever.
      This adds an explicit wait at the end of most operations so they will
      allow the current transaction to close.
      
      There is no wait inside file_write, inode updates, or cow filling, all of which
      have different deadlock possibilities.
      
      This is a temporary measure until better asynchronous commit support is
      added.  This code leads to stalls as it waits for data=ordered
      writeback, and it really needs to be fixed.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      89ce8a63
    • C
      Btrfs: Replace the big fs_mutex with a collection of other locks · a2135011
      Chris Mason committed
      Extent allocations are still protected by a large alloc_mutex.
      Objectid allocations are covered by an objectid mutex.
      Other btree operations are protected by a lock on individual btree nodes.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      a2135011
    • C
      Btrfs: Start btree concurrency work. · 925baedd
      Chris Mason committed
      The allocation trees and the chunk trees are serialized via their own
      dedicated mutexes.  This means allocation locking is still not very
      fine grained.
      
      The main FS btree is protected by locks on each block in the btree.  Locks
      are taken top / down, and as processing finishes on a given level of the
      tree, the lock is released after locking the lower level.
      
      The end result of a search is now a path where only the lowest level
      is locked.  Releasing or freeing the path drops any locks held.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      925baedd
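The top-down locking order described in 925baedd (lock the lower level, then release the level above) is commonly called hand-over-hand locking. Below is a minimal single-threaded sketch of that order only. It makes several assumptions: `toy_node` is hypothetical, each node has a single child, and a boolean flag stands in for the per-block lock so the ordering can be checked with assertions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tree block; a flag stands in for the real per-block lock. */
struct toy_node {
    bool locked;
    struct toy_node *child;  /* one child is enough to show the descent */
};

static void toy_lock(struct toy_node *n)
{
    assert(!n->locked);
    n->locked = true;
}

static void toy_unlock(struct toy_node *n)
{
    assert(n->locked);
    n->locked = false;
}

/* Descend from the root to the lowest level. On return only the lowest
 * node is still locked; the caller must unlock it. */
static struct toy_node *toy_search(struct toy_node *root)
{
    struct toy_node *cur = root;

    toy_lock(cur);
    while (cur->child != NULL) {
        struct toy_node *next = cur->child;

        toy_lock(next);   /* take the lower level first ... */
        toy_unlock(cur);  /* ... then release the level above */
        cur = next;
    }
    return cur;
}
```

Releasing the upper level only after the lower one is held means no other walker can slip between the two blocks during the descent, and the result matches the commit's description of a path where only the lowest level is locked.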
    • S
      Btrfs: Invalidate dcache entry after creating snapshot and · 3b96362c
      Sven Wegener committed
      We need to invalidate an existing dcache entry after creating a new
      snapshot or subvolume, because a negative dcache entry will stop us from
      accessing the new snapshot or subvolume.
      
      ---
        ctree.h       |   23 +++++++++++++++++++++++
        inode.c       |    4 ++++
        transaction.c |    4 ++++
        3 files changed, 31 insertions(+)
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      3b96362c
    • C
      Btrfs: Fix race in running_transaction checks · 48ec2cf8
      Chris Mason committed
      When a new transaction was started, the code would incorrectly
      set the pointer in fs_info before all the data structures were set up.
      fsync-heavy workloads hit races on the setup of the ordered inode spinlock.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      48ec2cf8
    • C
      Btrfs: Add support for online device removal · a061fc8d
      Chris Mason committed
      This required a few structural changes to the code that manages bdev pointers:
      
      The VFS super block now gets an anon-bdev instead of a pointer to the
      lowest bdev.  This allows us to avoid swapping the super block bdev pointer
      around at run time.
      
      The code to read in the super block no longer goes through the extent
      buffer interface.  Things got ugly keeping the mapping constant.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      a061fc8d
    • C
      Btrfs: Fixes for 2.6.18 enterprise kernels · d6bfde87
      Chris Mason committed
      2.6.18 seems to get caught in an infinite loop when
      cancel_rearming_delayed_workqueue is called more than once, so this switches
      to cancel_delayed_work, which is arguably more correct.
      
      Also, balance_dirty_pages can run into problems with 2.6.18 based kernels
      because it doesn't have the per-bdi dirty limits.  This avoids calling
      balance_dirty_pages on the btree inode unless there is actually something
      to balance, which is a good optimization in general.
      
      Finally, there's a compile fix for ordered-data.h.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      d6bfde87
    • C
      Btrfs: Do metadata checksums for reads via a workqueue · ce9adaa5
      Chris Mason committed
      Before, metadata checksumming was done by the callers of read_tree_block,
      which would set EXTENT_CSUM bits in the extent tree to show that a given
      range of pages was already checksummed and didn't need to be verified
      again.
      
      But, those bits could go away via try_to_releasepage, and the end
      result was bogus checksum failures on pages that never left the cache.
      
      The new code validates checksums when the page is read.  It is a little
      tricky because metadata blocks can span pages and a single read may
      end up going via multiple bios.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      ce9adaa5
    • C
      0b86a832
    • C
      Btrfs: Lower stack usage in transaction.c · 80b6794d
      Chris Mason committed
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      80b6794d
    • C
      Btrfs: Add data block hints to SSD mode too · 4529ba49
      Chris Mason committed
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      4529ba49
    • C
      Btrfs: Split the extent_map code into two parts · d1310b2e
      Chris Mason committed
      There is now extent_map for mapping offsets in the file to disk and
      extent_io for state tracking, IO submission and extent_buffers.
      
      The new extent_map code shifts from [start,end] pairs to [start,len], and
      pushes the locking out into the caller.  This allows a few performance
      optimizations and is easier to use.
      
      A number of extent_map usage bugs were fixed, mostly with failing
      to remove extent_map entries when changing the file.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      d1310b2e
    • C