1. 10 6月, 2009 2 次提交
    • Y
      Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) · 5d4f98a2
      Yan Zheng 提交于
      This commit introduces a new kind of back reference for btrfs metadata.
      Once a filesystem has been mounted with this commit, IT WILL NO LONGER
      BE MOUNTABLE BY OLDER KERNELS.
      
      When a tree block in subvolume tree is cow'd, the reference counts of all
      extents it points to are increased by one.  At transaction commit time,
      the old root of the subvolume is recorded in a "dead root" data structure,
      and the btree it points to is later walked, dropping reference counts
      and freeing any blocks where the reference count goes to 0.
      
      The increments done during cow and decrements done after commit cancel out,
      and the walk is a very expensive way to go about freeing the blocks that
      are no longer referenced by the new btree root.  This commit reduces the
      transaction overhead by avoiding the need for dead root records.
      
      When a non-shared tree block is cow'd, we free the old block at once, and the
      new block inherits old block's references. When a tree block with reference
      count > 1 is cow'd, we increase the reference counts of all extents
      the new block points to by one, and decrease the old block's reference count by
      one.
      
      This dead tree avoidance code removes the need to modify the reference
      counts of lower level extents when a non-shared tree block is cow'd.
      But we still need to update back ref for all pointers in the block.
      This is because the location of the block is recorded in the back ref
      item.
      
      We can solve this by introducing a new type of back ref. The new
      back ref provides information about pointer's key, level and in which
      tree the pointer lives. This information allow us to find the pointer
      by searching the tree. The shortcoming of the new back ref is that it
      only works for pointers in tree blocks referenced by their owner trees.
      
      This is mostly a problem for snapshots, where resolving one of these
      fuzzy back references would be O(number_of_snapshots) and quite slow.
      The solution used here is to use the fuzzy back references in the common
      case where a given tree block is only referenced by one root,
      and use the full back references when multiple roots have a reference
      on a given block.
      
      This commit adds per subvolume red-black tree to keep trace of cached
      inodes. The red-black tree helps the balancing code to find cached
      inodes whose inode numbers within a given range.
      
      This commit improves the balancing code by introducing several data
      structures to keep the state of balancing. The most important one
      is the back ref cache. It caches how the upper level tree blocks are
      referenced. This greatly reduce the overhead of checking back ref.
      
      The improved balancing code scales significantly better with a large
      number of snapshots.
      
      This is a very large commit and was written in a number of
      pieces.  But, they depend heavily on the disk format change and were
      squashed together to make sure git bisect didn't end up in a
      bad state wrt space balancing or the format change.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5d4f98a2
    • Y
      btrfs: Fix set/clear_extent_bit for 'end == (u64)-1' · 5c939df5
      Yan Zheng 提交于
      There are some 'start = state->end + 1;' like code in set_extent_bit
      and clear_extent_bit. They overflow when end == (u64)-1.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5c939df5
  2. 05 6月, 2009 1 次提交
    • C
      Btrfs: Fix oops and use after free during space balancing · 44fb5511
      Chris Mason 提交于
      The btrfs allocator uses list_for_each to walk the available block
      groups when searching for free blocks.  It starts off with a hint
      to help find the best block group for a given allocation.
      
      The hint is resolved into a block group, but we don't properly check
      to make sure the block group we find isn't in the middle of being
      freed due to filesystem shrinking or balancing.  If it is being
      freed, the list pointers in it are bogus and can't be trusted.  But,
      the code happily goes along and uses them in the list_for_each loop,
      leading to all kinds of fun.
      
      The fix used here is to check to make sure the block group we find really
      is on the list before we use it.  list_del_init is used when removing
      it from the list, so we can do a proper check.
      
      The allocation clustering code has a similar bug where it will trust
      the block group in the current free space cluster.  If our allocation
      flags have changed (going from single spindle dup to raid1 for example)
      because the drives in the FS have changed, we're not allowed to use
      the old block group any more.
      
      The fix used here is to check the current cluster against the
      current allocation flags.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      44fb5511
  3. 04 6月, 2009 1 次提交
  4. 15 5月, 2009 6 次提交
  5. 09 5月, 2009 1 次提交
  6. 28 4月, 2009 2 次提交
    • C
      Btrfs: look for acls during btrfs_read_locked_inode · 46a53cca
      Chris Mason 提交于
      This changes btrfs_read_locked_inode() to peek ahead in the btree for acl items.
      If it is certain a given inode has no acls, it will set the in memory acl
      fields to null to avoid acl lookups completely.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      46a53cca
    • C
      Btrfs: fix acl caching · 7b1a14bb
      Chris Mason 提交于
      Linus noticed the btrfs code to cache acls wasn't properly caching
      a NULL acl when the inode didn't have any acls.  This meant the common
      case of no acls resulted in expensive btree searches every time the
      kernel checked permissions (which is quite often).
      
      This is a modified version of Linus' original patch:
      
      Properly set initial acl fields to BTRFS_ACL_NOT_CACHED in the inode.
      This forces an acl lookup when permission checks are done.
      
      Fix btrfs_get_acl to avoid lookups and locking when the inode acls fields
      are set to null.
      
      Fix btrfs_get_acl to use the right return value from __btrfs_getxattr
      when deciding to cache a NULL acl.  It was storing a NULL acl when
      __btrfs_getxattr return -ENOENT, but __btrfs_getxattr was actually returning
      -ENODATA for this case.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      7b1a14bb
  7. 27 4月, 2009 6 次提交
  8. 25 4月, 2009 6 次提交
  9. 22 4月, 2009 1 次提交
    • C
      Btrfs: fix btrfs fallocate oops and deadlock · 546888da
      Chris Mason 提交于
      Btrfs fallocate was incorrectly starting a transaction with a lock held
      on the extent_io tree for the file, which could deadlock.  Strictly
      speaking it was using join_transaction which would be safe, but it is better
      to move the transaction outside of the lock.
      
      When preallocated extents are overwritten, btrfs_mark_buffer_dirty was
      being called on an unlocked buffer.  This was triggering an assertion and
      oops because the lock is supposed to be held.
      
      The bug was calling btrfs_mark_buffer_dirty on a leaf after btrfs_del_item had
      been run.  btrfs_del_item takes care of dirtying things, so the solution is a
      to skip the btrfs_mark_buffer_dirty call in this case.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      546888da
  10. 21 4月, 2009 5 次提交
    • L
      btrfs: use memdup_user() · dae7b665
      Li Zefan 提交于
      Remove open-coded memdup_user().
      
      Note this changes some GFP_NOFS to GFP_KERNEL, since copy_from_user() may
      cause pagefault, it's pointless to pass GFP_NOFS to kmalloc().
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      dae7b665
    • C
      Btrfs: use the right node in reada_for_balance · 8c594ea8
      Chris Mason 提交于
      reada_for_balance was using the wrong index into the path node array,
      so it wasn't reading the right blocks.  We never directly used the
      results of the read done by this function because the btree search is
      started over at the end.
      
      This fixes reada_for_balance to reada in the correct node and to
      avoid searching past the last slot in the node.  It also makes sure to
      hold the parent lock while we are finding the nodes to read.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      8c594ea8
    • C
      Btrfs: fix oops on page->mapping->host during writepage · 11c8349b
      Chris Mason 提交于
      The extent_io writepage call updates the writepage index in the inode
      as it makes progress.  But, it was doing the update after unlocking the page,
      which isn't legal because page->mapping can't be trusted once the page
      is unlocked.
      
      This lead to an oops, especially common with compression turned on.  The
      fix here is to update the writeback index before unlocking the page.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      11c8349b
    • C
      Btrfs: add a priority queue to the async thread helpers · d313d7a3
      Chris Mason 提交于
      Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a
      higher priority.  But, the checksumming helper threads prevent it
      from being fully effective.
      
      There are two problems.  First, a big queue of pending checksumming
      will delay the synchronous IO behind other lower priority writes.  Second,
      the checksumming uses an ordered async work queue.  The ordering makes sure
      that IOs are sent to the block layer in the same order they are sent
      to the checksumming threads.  Usually this gives us less seeky IO.
      
      But, when we start mixing IO priorities, the lower priority IO can delay
      the higher priority IO.
      
      This patch solves both problems by adding a high priority list to the async
      helper threads, and a new btrfs_set_work_high_prio(), which is used
      to make put a new async work item onto the higher priority list.
      
      The ordering is still done on high priority IO, but all of the high
      priority bios are ordered separately from the low priority bios.  This
      ordering is purely an IO optimization, it is not involved in data
      or metadata integrity.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      d313d7a3
    • C
      Btrfs: use WRITE_SYNC for synchronous writes · ffbd517d
      Chris Mason 提交于
      Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for
      writes we plan on waiting on in the near future.  This patch
      mirrors recent changes in other filesystems and the generic code to
      use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for
      other latency critical writes.
      
      Btrfs uses async worker threads for checksumming before the write is done,
      and then again to actually submit the bios.  The bio submission code just
      runs a per-device list of bios that need to be sent down the pipe.
      
      This list is split into low priority and high priority lists so the
      WRITE_SYNC IO happens first.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      ffbd517d
  11. 03 4月, 2009 9 次提交