1. 15 3月, 2010 2 次提交
    • J
      Btrfs: cache the extent state everywhere we possibly can V2 · 2ac55d41
      Josef Bacik 提交于
      This patch just goes through and fixes everybody that does
      
      lock_extent()
      blah
      unlock_extent()
      
      to use
      
      lock_extent_bits()
      blah
      unlock_extent_cached()
      
      and pass around a extent_state so we only have to do the searches once per
      function.  This gives me about a 3 mb/s boots on my random write test.  I have
      not converted some things, like the relocation and ioctl's, since they aren't
      heavily used and the relocation stuff is in the middle of being re-written.  I
      also changed the clear_extent_bit() to only unset the cached state if we are
      clearing EXTENT_LOCKED and related stuff, so we can do things like this
      
      lock_extent_bits()
      clear delalloc bits
      unlock_extent_cached()
      
      without losing our cached state.  I tested this thoroughly and turned on
      LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out
      fine.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      2ac55d41
    • J
      Btrfs: change how we mount subvolumes · 73f73415
      Josef Bacik 提交于
      This work is in preperation for being able to set a different root as the
      default mounting root.
      
      There is currently a problem with how we mount subvolumes.  We cannot currently
      mount a subvolume of a subvolume, you can only mount subvolumes/snapshots of the
      default subvolume.  So say you take a snapshot of the default subvolume and call
      it snap1, and then take a snapshot of snap1 and call it snap2, so now you have
      
      /
      /snap1
      /snap1/snap2
      
      as your available volumes.  Currently you can only mount / and /snap1,
      you cannot mount /snap1/snap2.  To fix this problem instead of passing
      subvolid=<name> you must pass in subvolid=<treeid>, where <treeid> is
      the tree id that gets spit out via the subvolume listing you get from
      the subvolume listing patches (btrfs filesystem list).  This allows us
      to mount /, /snap1 and /snap1/snap2 as the root volume.
      
      In addition to the above, we also now read the default dir item in the
      tree root to get the root key that it points to.  For now this just
      points at what has always been the default subvolme, but later on I plan
      to change it to point at whatever root you want to be the new default
      root, so you can just set the default mount and not have to mount with
      -o subvolid=<treeid>.  I tested this out with the above scenario and it
      worked perfectly.  Thanks,
      
      mount -o subvol operates inside the selected subvolid.  For example:
      
      mount -o subvol=snap1,subvolid=256 /dev/xxx /mnt
      
      /mnt will have the snap1 directory for the subvolume with id
      256.
      
      mount -o subvol=snap /dev/xxx /mnt
      
      /mnt will be the snap directory of whatever the default subvolume
      is.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      73f73415
  2. 09 3月, 2010 1 次提交
  3. 05 2月, 2010 1 次提交
    • M
      Btrfs: remove BUG_ON() due to mounting bad filesystem · d7ce5843
      Miao Xie 提交于
      Mounting a bad filesystem caused a BUG_ON(). The following is steps to
      reproduce it.
       # mkfs.btrfs /dev/sda2
       # mount /dev/sda2 /mnt
       # mkfs.btrfs /dev/sda1 /dev/sda2
       (the program says that /dev/sda2 was mounted, and then exits. )
       # umount /mnt
       # mount /dev/sda1 /mnt
      
      At the third step, mkfs.btrfs exited in the way of make filesystem. So the
      initialization of the filesystem didn't finish. So the filesystem was bad, and
      it caused BUG_ON() when mounting it. But BUG_ON() should be called by the wrong
      code, not user's operation, so I think it is a bug of btrfs.
      
      This patch fixes it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      d7ce5843
  4. 18 1月, 2010 1 次提交
  5. 18 12月, 2009 3 次提交
  6. 05 10月, 2009 1 次提交
    • C
      Btrfs: fix deadlock on async thread startup · 61d92c32
      Chris Mason 提交于
      The btrfs async worker threads are used for a wide variety of things,
      including processing bio end_io functions.  This means that when
      the endio threads aren't running, the rest of the FS isn't
      able to do the final processing required to clear PageWriteback.
      
      The endio threads also try to exit as they become idle and
      start more as the work piles up.  The problem is that starting more
      threads means kthreadd may need to allocate ram, and that allocation
      may wait until the global number of writeback pages on the system is
      below a certain limit.
      
      The result of that throttling is that end IO threads wait on
      kthreadd, who is waiting on IO to end, which will never happen.
      
      This commit fixes the deadlock by handing off thread startup to a
      dedicated thread.  It also fixes a bug where the on-demand thread
      creation was creating far too many threads because it didn't take into
      account threads being started by other procs.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      61d92c32
  7. 24 9月, 2009 1 次提交
  8. 22 9月, 2009 1 次提交
  9. 12 9月, 2009 2 次提交
    • C
      Btrfs: use a cached state for extent state operations during delalloc · 9655d298
      Chris Mason 提交于
      This changes the btrfs code to find delalloc ranges in the extent state
      tree to use the new state caching code from set/test bit.  It reduces
      one of the biggest causes of rbtree searches in the writeback path.
      
      test_range_bit is also modified to take the cached state as a starting
      point while searching.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9655d298
    • C
      Btrfs: switch extent_map to a rw lock · 890871be
      Chris Mason 提交于
      There are two main users of the extent_map tree.  The
      first is regular file inodes, where it is evenly spread
      between readers and writers.
      
      The second is the chunk allocation tree, which maps blocks from
      logical addresses to phyiscal ones, and it is 99.99% reads.
      
      The mapping tree is a point of lock contention during heavy IO
      workloads, so this commit switches things to a rw lock.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      890871be
  10. 08 8月, 2009 1 次提交
  11. 22 7月, 2009 1 次提交
    • Y
      Btrfs: fix locking issue in btrfs_find_next_key · 33c66f43
      Yan Zheng 提交于
      When walking up the tree, btrfs_find_next_key assumes the upper level tree
      block is properly locked. This isn't always true even path->keep_locks is 1.
      This is because btrfs_find_next_key may advance path->slots[] several times
      instead of only once.
      
      When 'path->slots[level] >= btrfs_header_nritems(path->nodes[level])' is found,
      we can't guarantee the original value of 'path->slots[level]' is
      'btrfs_header_nritems(path->nodes[level]) - 1'. If it's not, the tree block at
      'level + 1' isn't locked.
      
      This patch fixes the issue by explicitly checking the locking state,
      re-searching the tree if it's not locked.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      33c66f43
  12. 03 7月, 2009 1 次提交
    • Y
      Btrfs: update backrefs while dropping snapshot · 2c47e605
      Yan Zheng 提交于
      The new backref format has restriction on type of backref item.  If a tree
      block isn't referenced by its owner tree, full backrefs must be used for the
      pointers in it. When a tree block loses its owner tree's reference, backrefs
      for the pointers in it should be updated to full backrefs. Current
      btrfs_drop_snapshot misses the code that updates backrefs, so it's unsafe for
      general use.
      
      This patch adds backrefs update code to btrfs_drop_snapshot.  It isn't a
      problem in the restricted form btrfs_drop_snapshot is used today, but for
      general snapshot deletion this update is required.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      2c47e605
  13. 10 6月, 2009 1 次提交
    • Y
      Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) · 5d4f98a2
      Yan Zheng 提交于
      This commit introduces a new kind of back reference for btrfs metadata.
      Once a filesystem has been mounted with this commit, IT WILL NO LONGER
      BE MOUNTABLE BY OLDER KERNELS.
      
      When a tree block in subvolume tree is cow'd, the reference counts of all
      extents it points to are increased by one.  At transaction commit time,
      the old root of the subvolume is recorded in a "dead root" data structure,
      and the btree it points to is later walked, dropping reference counts
      and freeing any blocks where the reference count goes to 0.
      
      The increments done during cow and decrements done after commit cancel out,
      and the walk is a very expensive way to go about freeing the blocks that
      are no longer referenced by the new btree root.  This commit reduces the
      transaction overhead by avoiding the need for dead root records.
      
      When a non-shared tree block is cow'd, we free the old block at once, and the
      new block inherits old block's references. When a tree block with reference
      count > 1 is cow'd, we increase the reference counts of all extents
      the new block points to by one, and decrease the old block's reference count by
      one.
      
      This dead tree avoidance code removes the need to modify the reference
      counts of lower level extents when a non-shared tree block is cow'd.
      But we still need to update back ref for all pointers in the block.
      This is because the location of the block is recorded in the back ref
      item.
      
      We can solve this by introducing a new type of back ref. The new
      back ref provides information about pointer's key, level and in which
      tree the pointer lives. This information allow us to find the pointer
      by searching the tree. The shortcoming of the new back ref is that it
      only works for pointers in tree blocks referenced by their owner trees.
      
      This is mostly a problem for snapshots, where resolving one of these
      fuzzy back references would be O(number_of_snapshots) and quite slow.
      The solution used here is to use the fuzzy back references in the common
      case where a given tree block is only referenced by one root,
      and use the full back references when multiple roots have a reference
      on a given block.
      
      This commit adds per subvolume red-black tree to keep trace of cached
      inodes. The red-black tree helps the balancing code to find cached
      inodes whose inode numbers within a given range.
      
      This commit improves the balancing code by introducing several data
      structures to keep the state of balancing. The most important one
      is the back ref cache. It caches how the upper level tree blocks are
      referenced. This greatly reduce the overhead of checking back ref.
      
      The improved balancing code scales significantly better with a large
      number of snapshots.
      
      This is a very large commit and was written in a number of
      pieces.  But, they depend heavily on the disk format change and were
      squashed together to make sure git bisect didn't end up in a
      bad state wrt space balancing or the format change.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5d4f98a2