1. 16 12月, 2009 2 次提交
  2. 14 10月, 2009 4 次提交
    • Y
      Btrfs: properly wait log writers during log sync · 86df7eb9
      Yan, Zheng 提交于
      A recently fsync optimization make btrfs_sync_log skip calling
      wait_for_writer in the single log writer case. This is incorrect
      since the writer count can also be increased by btrfs_pin_log.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      86df7eb9
    • C
      Btrfs: streamline tree-log btree block writeout · 690587d1
      Chris Mason 提交于
      Syncing the tree log is a 3 phase operation.
      
      1) write and wait for all the tree log blocks for a given root.
      
      2) write and wait for all the tree log blocks for the
      tree of tree log roots.
      
      3) write and wait for the super blocks (barriers here)
      
      This isn't as efficient as it could be because there is
      no requirement to wait for the blocks from step one to hit the disk
      before we start writing the blocks from step two.  This commit
      changes the sequence so that we don't start waiting until
      all the tree blocks from both steps one and two have been sent
      to disk.
      
      We do this by breaking up btrfs_write_wait_marked_extents into
      two functions, which is trivial because it was already broken
      up into two parts.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      690587d1
    • C
      Btrfs: avoid tree log commit when there are no changes · 257c62e1
      Chris Mason 提交于
      rpm has a habit of running fdatasync when the file hasn't
      changed.  We already detect if a file hasn't been changed
      in the current transaction but it might have been sent to
      the tree-log in this transaction and not changed since
      the last call to fsync.
      
      In this case, we want to avoid a tree log sync, which includes
      a number of synchronous writes and barriers.  This commit
      extends the existing tracking of the last transaction to change
      a file to also track the last sub-transaction.
      
      The end result is that rpm -ivh and -Uvh are roughly twice as fast,
      and on par with ext3.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      257c62e1
    • C
      Btrfs: only write one super copy during fsync · 4722607d
      Chris Mason 提交于
      During a tree-log commit for fsync, we've been writing at least
      two copies of the super block and forcing them to disk.
      
      The other filesystems write only one, and this change brings us on
      par with them.  A full transaction commit will write all the super
      copies, so we still have redundant info written on a regular
      basis.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4722607d
  3. 09 10月, 2009 1 次提交
  4. 22 9月, 2009 2 次提交
  5. 21 9月, 2009 1 次提交
  6. 18 9月, 2009 1 次提交
    • Y
      Btrfs: improve async block group caching · 11833d66
      Yan Zheng 提交于
      This patch gets rid of two limitations of async block group caching.
      The old code delays handling pinned extents when block group is in
      caching. To allocate logged file extents, the old code need wait
      until block group is fully cached. To get rid of the limitations,
      This patch introduces a data structure to track the progress of
      caching. Base on the caching progress, we know which extents should
      be added to the free space cache when handling the pinned extents.
      The logged file extents are also handled in a similar way.
      
      This patch also changes how pinned extents are tracked. The old
      code uses one tree to track pinned extents, and copy the pinned
      extents tree at transaction commit time. This patch makes it use
      two trees to track pinned extents. One tree for extents that are
      pinned in the running transaction, one tree for extents that can
      be unpinned. At transaction commit time, we swap the two trees.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      11833d66
  7. 12 9月, 2009 1 次提交
    • C
      Btrfs: Fix extent replacment race · a1ed835e
      Chris Mason 提交于
      Data COW means that whenever we write to a file, we replace any old
      extent pointers with new ones.  There was a window where a readpage
      might find the old extent pointers on disk and cache them in the
      extent_map tree in ram in the middle of a given write replacing them.
      
      Even though both the readpage and the write had their respective bytes
      in the file locked, the extent readpage inserts may cover more bytes than
      it had locked down.
      
      This commit closes the race by keeping the new extent pinned in the extent
      map tree until after the on-disk btree is properly setup with the new
      extent pointers.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a1ed835e
  8. 28 7月, 2009 2 次提交
    • J
      Btrfs: change how we unpin extents · 68b38550
      Josef Bacik 提交于
      We are racy with async block caching and unpinning extents.  This patch makes
      things much less complicated by only unpinning the extent if the block group is
      cached.  We check the block_group->cached var under the block_group->lock spin
      lock.  If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
      and unpin the extent and add the free space back.  If it is not set to this, we
      start the caching of the block group so the next time we unpin extents we can
      unpin the extent.  This keeps us from racing with the async caching threads,
      lets us kill the fs wide async thread counter, and keeps us from having to set
      DELALLOC bits for every extent we hit if there are caching kthreads going.
      
      One thing that needed to be changed was btrfs_free_super_mirror_extents.  Now
      instead of just looking for LOCKED extents, we also look for DIRTY extents,
      since we could have left some extents pinned in the previous transaction that
      will never get freed now that we are unmounting, which would cause us to leak
      memory.  So btrfs_free_super_mirror_extents has been changed to
      btrfs_free_pinned_extents, and it will clear the extents locked for the super
      mirror, and any remaining pinned extents that may be present.  Thank you,
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      68b38550
    • J
      Btrfs: Correct redundant test in add_inode_ref · 631c07c8
      Julia Lawall 提交于
      dir has already been tested.  It seems that this test should be on the
      recently returned value inode.
      
      A simplified version of the semantic match that finds this problem is as
      follows: (http://www.emn.fr/x-info/coccinelle/)
      Signed-off-by: NJulia Lawall <julia@diku.dk>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      631c07c8
  9. 24 7月, 2009 1 次提交
    • J
      Btrfs: async block group caching · 817d52f8
      Josef Bacik 提交于
      This patch moves the caching of the block group off to a kthread in order to
      allow people to allocate sooner.  Instead of blocking up behind the caching
      mutex, we instead kick of the caching kthread, and then attempt to make an
      allocation.  If we cannot, we wait on the block groups caching waitqueue, which
      the caching kthread will wake the waiting threads up everytime it finds 2 meg
      worth of space, and then again when its finished caching.  This is how I tested
      the speedup from this
      
      mkfs the disk
      mount the disk
      fill the disk up with fs_mark
      unmount the disk
      mount the disk
      time touch /mnt/foo
      
      Without my changes this took 11 seconds on my box, with these changes it now
      takes 1 second.
      
      Another change thats been put in place is we lock the super mirror's in the
      pinned extent map in order to keep us from adding that stuff as free space when
      caching the block group.  This doesn't really change anything else as far as the
      pinned extent map is concerned, since for actual pinned extents we use
      EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
      those extents to keep from leaking memory.
      
      I've also added a check where when we are reading block groups from disk, if the
      amount of space used == the size of the block group, we go ahead and mark the
      block group as cached.  This drastically reduces the amount of time it takes to
      cache the block groups.  Using the same test as above, except doing a dd to a
      file and then unmounting, it used to take 33 seconds to umount, now it takes 3
      seconds.
      
      This version uses the commit_root in the caching kthread, and then keeps track
      of how many async caching threads are running at any given time so if one of the
      async threads is still running as we cross transactions we can wait until its
      finished before handling the pinned extents.  Thank you,
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      817d52f8
  10. 11 6月, 2009 1 次提交
    • C
      Btrfs: fix extent_buffer leak during tree log replay · b263c2c8
      Chris Mason 提交于
      During tree log replay, we read in the tree log roots,
      process them and then free them.  A recent change
      takes an extra reference on the root node of the tree
      when the root is read in, and stores that reference
      in root->commit_root.
      
      This reference was not being freed, leaving us with
      one buffer pinned in ram for each subvol with
      a tree log root after a crash.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b263c2c8
  11. 10 6月, 2009 1 次提交
    • Y
      Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) · 5d4f98a2
      Yan Zheng 提交于
      This commit introduces a new kind of back reference for btrfs metadata.
      Once a filesystem has been mounted with this commit, IT WILL NO LONGER
      BE MOUNTABLE BY OLDER KERNELS.
      
      When a tree block in subvolume tree is cow'd, the reference counts of all
      extents it points to are increased by one.  At transaction commit time,
      the old root of the subvolume is recorded in a "dead root" data structure,
      and the btree it points to is later walked, dropping reference counts
      and freeing any blocks where the reference count goes to 0.
      
      The increments done during cow and decrements done after commit cancel out,
      and the walk is a very expensive way to go about freeing the blocks that
      are no longer referenced by the new btree root.  This commit reduces the
      transaction overhead by avoiding the need for dead root records.
      
      When a non-shared tree block is cow'd, we free the old block at once, and the
      new block inherits old block's references. When a tree block with reference
      count > 1 is cow'd, we increase the reference counts of all extents
      the new block points to by one, and decrease the old block's reference count by
      one.
      
      This dead tree avoidance code removes the need to modify the reference
      counts of lower level extents when a non-shared tree block is cow'd.
      But we still need to update back ref for all pointers in the block.
      This is because the location of the block is recorded in the back ref
      item.
      
      We can solve this by introducing a new type of back ref. The new
      back ref provides information about pointer's key, level and in which
      tree the pointer lives. This information allow us to find the pointer
      by searching the tree. The shortcoming of the new back ref is that it
      only works for pointers in tree blocks referenced by their owner trees.
      
      This is mostly a problem for snapshots, where resolving one of these
      fuzzy back references would be O(number_of_snapshots) and quite slow.
      The solution used here is to use the fuzzy back references in the common
      case where a given tree block is only referenced by one root,
      and use the full back references when multiple roots have a reference
      on a given block.
      
      This commit adds per subvolume red-black tree to keep trace of cached
      inodes. The red-black tree helps the balancing code to find cached
      inodes whose inode numbers within a given range.
      
      This commit improves the balancing code by introducing several data
      structures to keep the state of balancing. The most important one
      is the back ref cache. It caches how the upper level tree blocks are
      referenced. This greatly reduce the overhead of checking back ref.
      
      The improved balancing code scales significantly better with a large
      number of snapshots.
      
      This is a very large commit and was written in a number of
      pieces.  But, they depend heavily on the disk format change and were
      squashed together to make sure git bisect didn't end up in a
      bad state wrt space balancing or the format change.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5d4f98a2
  12. 25 4月, 2009 1 次提交
    • C
      Btrfs: fix fallocate deadlock on inode extent lock · e980b50c
      Chris Mason 提交于
      The btrfs fallocate call takes an extent lock on the entire range
      being fallocated, and then runs through insert_reserved_extent on each
      extent as they are allocated.
      
      The problem with this is that btrfs_drop_extents may decide to try
      and take the same extent lock fallocate was already holding.  The solution
      used here is to push down knowledge of the range that is already locked
      going into btrfs_drop_extents.
      
      It turns out that at least one other caller had the same bug.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      e980b50c
  13. 03 4月, 2009 3 次提交
  14. 25 3月, 2009 4 次提交
    • C
      Btrfs: optimize fsyncs on old files · af4176b4
      Chris Mason 提交于
      The fsync log has code to make sure all of the parents of a file are in the
      log along with the file.  It uses a minimal log of the parent directory
      inodes, just enough to get the parent directory on disk.
      
      If the transaction that originally created a file is fully on disk,
      and the file hasn't been renamed or linked into other directories, we
      can safely skip the parent directory walk.  We know the file is on disk
      somewhere and we can go ahead and just log that single file.
      
      This is more important now because unrelated unlinks in the parent directory
      might make us force a commit if we try to log the parent.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      af4176b4
    • C
      Btrfs: tree logging unlink/rename fixes · 12fcfd22
      Chris Mason 提交于
      The tree logging code allows individual files or directories to be logged
      without including operations on other files and directories in the FS.
      It tries to commit the minimal set of changes to disk in order to
      fsync the single file or directory that was sent to fsync or O_SYNC.
      
      The tree logging code was allowing files and directories to be unlinked
      if they were part of a rename operation where only one directory
      in the rename was in the fsync log.  This patch adds a few new rules
      to the tree logging.
      
      1) on rename or unlink, if the inode being unlinked isn't in the fsync
      log, we must force a full commit before doing an fsync of the directory
      where the unlink was done.  The commit isn't done during the unlink,
      but it is forced the next time we try to log the parent directory.
      
      Solution: record transid of last unlink/rename per directory when the
      directory wasn't already logged.  For renames this is only done when
      renaming to a different directory.
      
      mkdir foo/some_dir
      normal commit
      rename foo/some_dir foo2/some_dir
      mkdir foo/some_dir
      fsync foo/some_dir/some_file
      
      The fsync above will unlink the original some_dir without recording
      it in its new location (foo2).  After a crash, some_dir will be gone
      unless the fsync of some_file forces a full commit
      
      2) we must log any new names for any file or dir that is in the fsync
      log.  This way we make sure not to lose files that are unlinked during
      the same transaction.
      
      2a) we must log any new names for any file or dir during rename
      when the directory they are being removed from was logged.
      
      2a is actually the more important variant.  Without the extra logging
      a crash might unlink the old name without recreating the new one
      
      3) after a crash, we must go through any directories with a link count
      of zero and redo the rm -rf
      
      mkdir f1/foo
      normal commit
      rm -rf f1/foo
      fsync(f1)
      
      The directory f1 was fully removed from the FS, but fsync was never
      called on f1, only its parent dir.  After a crash the rm -rf must
      be replayed.  This must be able to recurse down the entire
      directory tree.  The inode link count fixup code takes care of the
      ugly details.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      12fcfd22
    • C
      Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay · a74ac322
      Chris Mason 提交于
      During log replay, inodes are copied from the log to the main filesystem
      btrees.  Sometimes they have a zero link count in the log but they actually
      gain links during the replay or have some in the main btree.
      
      This patch updates the link count to be at least one after copying the
      inode out of the log.  This makes sure the inode is deleted during an
      iput while the rest of the replay code is still working on it.
      
      The log replay has fixup code to make sure that link counts are correct
      at the end of the replay, so we could use any non-zero number here and
      it would work fine.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a74ac322
    • C
      Btrfs: leave btree locks spinning more often · b9473439
      Chris Mason 提交于
      btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
      for the buffers it was dirtying.  This may require a kmalloc and it
      was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
      set any btree locks they were holding to blocking first.
      
      This commit changes dirty tracking for extent buffers to just use a flag
      in the extent buffer.  Now that we have one and only one extent buffer
      per page, this can be safely done without losing dirty bits along the way.
      
      This also introduces a path->leave_spinning flag that callers of
      btrfs_search_slot can use to indicate they will properly deal with a
      path returned where all the locks are spinning instead of blocking.
      
      Many of the btree search callers now expect spinning paths,
      resulting in better btree concurrency overall.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b9473439
  15. 13 2月, 2009 1 次提交
  16. 04 2月, 2009 1 次提交
    • C
      Btrfs: Change btree locking to use explicit blocking points · b4ce94de
      Chris Mason 提交于
      Most of the btrfs metadata operations can be protected by a spinlock,
      but some operations still need to schedule.
      
      So far, btrfs has been using a mutex along with a trylock loop,
      most of the time it is able to avoid going for the full mutex, so
      the trylock loop is a big performance gain.
      
      This commit is step one for getting rid of the blocking locks entirely.
      btrfs_tree_lock takes a spinlock, and the code explicitly switches
      to a blocking lock when it starts an operation that can schedule.
      
      We'll be able get rid of the blocking locks in smaller pieces over time.
      Tracing allows us to find the most common cause of blocking, so we
      can start with the hot spots first.
      
      The basic idea is:
      
      btrfs_tree_lock() returns with the spin lock held
      
      btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
      the extent buffer flags, and then drops the spin lock.  The buffer is
      still considered locked by all of the btrfs code.
      
      If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
      the spin lock and waits on a wait queue for the blocking bit to go away.
      
      Much of the code that needs to set the blocking bit finishes without actually
      blocking a good percentage of the time.  So, an adaptive spin is still
      used against the blocking bit to avoid very high context switch rates.
      
      btrfs_clear_lock_blocking() clears the blocking bit and returns
      with the spinlock held again.
      
      btrfs_tree_unlock() can be called on either blocking or spinning locks,
      it does the right thing based on the blocking bit.
      
      ctree.c has a helper function to set/clear all the locked buffers in a
      path as blocking.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b4ce94de
  17. 22 1月, 2009 1 次提交
    • Y
      Btrfs: fix tree logs parallel sync · 7237f183
      Yan Zheng 提交于
      To improve performance, btrfs_sync_log merges tree log sync
      requests. But it wrongly merges sync requests for different
      tree logs. If multiple tree logs are synced at the same time,
      only one of them actually gets synced.
      
      This patch has following changes to fix the bug:
      
      Move most tree log related fields in btrfs_fs_info to
      btrfs_root. This allows merging sync requests separately
      for each tree log.
      
      Don't insert root item into the log root tree immediately
      after log tree is allocated. Root item for log tree is
      inserted when log tree get synced for the first time. This
      allows syncing the log root tree without first syncing all
      log trees.
      
      At tree-log sync, btrfs_sync_log first sync the log tree;
      then updates corresponding root item in the log root tree;
      sync the log root tree; then update the super block.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      7237f183
  18. 10 1月, 2009 1 次提交
    • C
      Btrfs: explicitly mark the tree log root for writeback · e293e97e
      Chris Mason 提交于
      Each subvolume has an extent_state_tree used to mark metadata
      that needs to be sent to disk while syncing the tree.  This is
      used in addition to the dirty bits on the pages themselves so that
      a single subvolume can be sent to disk efficiently in disk order.
      
      Normally this marking happens in btrfs_alloc_free_block, which also does
      special recording of dirty tree blocks for the tree log roots.
      
      Yan Zheng noticed that when the root of the log tree is allocated, it is added
      to the wrong writeback list.  The fix used here is to explicitly set
      it dirty as part of tree log creation.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      e293e97e
  19. 07 1月, 2009 1 次提交
    • Y
      Btrfs: tree logging checksum fixes · 07d400a6
      Yan Zheng 提交于
      This patch contains following things.
      
      1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE.  This
      struct is kmalloced so we want to keep it reasonable.
      
      2) Replace copy_extent_csums by btrfs_lookup_csums_range.  This was
      duplicated code in tree-log.c
      
      3) Remove replay_one_csum. csum items are replayed at the same time as
         replaying file extents. This guarantees we only replay useful csums.
      
      4) nbytes accounting fix.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      07d400a6
  20. 06 1月, 2009 2 次提交
  21. 17 12月, 2008 1 次提交
  22. 09 12月, 2008 3 次提交
    • C
      Btrfs: Fix compressed checksum fsync log copies · 580afd76
      Chris Mason 提交于
      The fsync logging code makes sure to onl copy the relevant checksum for each
      extent based on the file extent pointers it finds.
      
      But for compressed extents, it needs to copy the checksum for the
      entire extent.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      580afd76
    • Y
      Btrfs: superblock duplication · a512bbf8
      Yan Zheng 提交于
      This patch implements superblock duplication. Superblocks
      are stored at offset 16K, 64M and 256G on every devices.
      Spaces used by superblocks are preserved by the allocator,
      which uses a reverse mapping function to find the logical
      addresses that correspond to superblocks. Thank you,
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      a512bbf8
    • C
      Btrfs: move data checksumming into a dedicated tree · d20f7043
      Chris Mason 提交于
      Btrfs stores checksums for each data block.  Until now, they have
      been stored in the subvolume trees, indexed by the inode that is
      referencing the data block.  This means that when we read the inode,
      we've probably read in at least some checksums as well.
      
      But, this has a few problems:
      
      * The checksums are indexed by logical offset in the file.  When
      compression is on, this means we have to do the expensive checksumming
      on the uncompressed data.  It would be faster if we could checksum
      the compressed data instead.
      
      * If we implement encryption, we'll be checksumming the plain text and
      storing that on disk.  This is significantly less secure.
      
      * For either compression or encryption, we have to get the plain text
      back before we can verify the checksum as correct.  This makes the raid
      layer balancing and extent moving much more expensive.
      
      * It makes the front end caching code more complex, as we have touch
      the subvolume and inodes as we cache extents.
      
      * There is potentitally one copy of the checksum in each subvolume
      referencing an extent.
      
      The solution used here is to store the extent checksums in a dedicated
      tree.  This allows us to index the checksums by phyiscal extent
      start and length.  It means:
      
      * The checksum is against the data stored on disk, after any compression
      or encryption is done.
      
      * The checksum is stored in a central location, and can be verified without
      following back references, or reading inodes.
      
      This makes compression significantly faster by reducing the amount of
      data that needs to be checksummed.  It will also allow much faster
      raid management code in general.
      
      The checksums are indexed by a key with a fixed objectid (a magic value
      in ctree.h) and offset set to the starting byte of the extent.  This
      allows us to copy the checksum items into the fsync log tree directly (or
      any other tree), without having to invent a second format for them.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      d20f7043
  23. 02 12月, 2008 2 次提交
  24. 31 10月, 2008 1 次提交
    • Y
      Btrfs: Add fallocate support v2 · d899e052
      Yan Zheng 提交于
      This patch updates btrfs-progs for fallocate support.
      
      fallocate is a little different in Btrfs because we need to tell the
      COW system that a given preallocated extent doesn't need to be
      cow'd as long as there are no snapshots of it.  This leverages the
      -o nodatacow checks.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      d899e052
  25. 30 10月, 2008 1 次提交