1. 04 Feb, 2009 9 commits
    • C
      Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks · bd56b302
      Chris Mason committed
      Every transaction in btrfs creates a new snapshot, and then schedules the
      snapshot from the last transaction for deletion.  Snapshot deletion
      works by walking down the btree and dropping the reference counts
      on each btree block during the walk.
      
      If a given leaf or node has a reference count greater than one,
      the reference count is decremented and the subtree pointed to by that
      node is ignored.
      
      If the reference count is one, walking continues down into that node
      or leaf, and the references of everything it points to are decremented.
      
      The old code would try to work in small pieces, walking down the tree
      until it found the lowest leaf or node to free and then returning.  This
      was very friendly to the rest of the FS because it didn't have a huge
      impact on other operations.
      
      But it wouldn't always keep up with the rate at which new commits added
      snapshots for deletion, and it was inefficient for the extent
      allocation tree because it did not find leaves that were close together
      on disk and process them at the same time.
      
      This changes things to walk down to a level 1 node and then process it
      in bulk.  All the leaf pointers are sorted and the leaves are dropped
      in order based on their extent number.
      
      The extent allocation tree and commit code are now fast enough for
      this kind of bulk processing to work without slowing the rest of the FS
      down.  Overall it does less IO and is better able to keep up with
      snapshot deletions under high load.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      bd56b302
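The reference-count rule described in this commit message can be sketched in a few lines of C. This is only an illustration on a toy in-memory tree: `struct toy_block`, `toy_drop_ref`, and `MAX_CHILDREN` are hypothetical names, not btrfs code, and the sketch ignores the bulk, sorted processing of level 1 nodes that the commit adds.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical in-memory stand-in for a btree block; not the btrfs layout. */
struct toy_block {
    int refs;                               /* reference count on this block */
    int freed;                              /* set once the block is released */
    size_t nr_children;                     /* 0 for a leaf */
    struct toy_block *child[MAX_CHILDREN];
};

/*
 * Drop one reference on blk and return the number of blocks freed.
 * A shared block (refs > 1) only loses one reference and its subtree
 * is skipped. A block with refs == 1 is walked: every block it points
 * to has a reference dropped, and then the block itself is freed.
 */
static int toy_drop_ref(struct toy_block *blk)
{
    int freed = 0;

    assert(blk != NULL && blk->refs > 0);

    if (blk->refs > 1) {
        blk->refs--;
        return 0;
    }

    for (size_t i = 0; i < blk->nr_children; i++)
        freed += toy_drop_ref(blk->child[i]);

    blk->refs = 0;
    blk->freed = 1;
    return freed + 1;
}
```

Calling `toy_drop_ref()` on a root whose children are one shared node and one private leaf frees the root and the private leaf, and leaves the shared subtree in place with one fewer reference.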
    • C
      Btrfs: Change btree locking to use explicit blocking points · b4ce94de
      Chris Mason committed
      Most of the btrfs metadata operations can be protected by a spinlock,
      but some operations still need to schedule.
      
      So far, btrfs has been using a mutex along with a trylock loop.
      Most of the time it is able to avoid going for the full mutex, so
      the trylock loop is a big performance gain.
      
      This commit is step one for getting rid of the blocking locks entirely.
      btrfs_tree_lock takes a spinlock, and the code explicitly switches
      to a blocking lock when it starts an operation that can schedule.
      
      We'll be able to get rid of the blocking locks in smaller pieces over time.
      Tracing allows us to find the most common cause of blocking, so we
      can start with the hot spots first.
      
      The basic idea is:
      
      btrfs_tree_lock() returns with the spin lock held
      
      btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
      the extent buffer flags, and then drops the spin lock.  The buffer is
      still considered locked by all of the btrfs code.
      
      If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
      the spin lock and waits on a wait queue for the blocking bit to go away.
      
      Much of the code that needs to set the blocking bit finishes without actually
      blocking a good percentage of the time.  So, an adaptive spin is still
      used against the blocking bit to avoid very high context switch rates.
      
      btrfs_clear_lock_blocking() clears the blocking bit and returns
      with the spinlock held again.
      
      btrfs_tree_unlock() can be called on either blocking or spinning locks;
      it does the right thing based on the blocking bit.
      
      ctree.c has a helper function to set/clear all the locked buffers in a
      path as blocking.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      b4ce94de
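The lock states described above can be modelled in userspace with C11 atomics. This is a rough sketch under stated assumptions, not the kernel implementation: all `toy_*` names are hypothetical, an `atomic_flag` stands in for the spinlock, an `atomic_bool` stands in for the EXTENT_BUFFER_BLOCKING bit, and the wait queue plus adaptive spin are replaced by a plain busy-wait.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lock state kept in an extent buffer. */
struct toy_eb {
    atomic_flag spin;      /* plays the role of the spinlock */
    atomic_bool blocking;  /* plays the role of EXTENT_BUFFER_BLOCKING */
};

static void toy_eb_init(struct toy_eb *eb)
{
    atomic_flag_clear(&eb->spin);
    atomic_init(&eb->blocking, false);
}

static void toy_spin_lock(struct toy_eb *eb)
{
    while (atomic_flag_test_and_set_explicit(&eb->spin, memory_order_acquire))
        ; /* busy-wait until the flag is released */
}

static void toy_spin_unlock(struct toy_eb *eb)
{
    atomic_flag_clear_explicit(&eb->spin, memory_order_release);
}

/* Returns with the spinlock held and the blocking bit clear. */
static void toy_tree_lock(struct toy_eb *eb)
{
    for (;;) {
        toy_spin_lock(eb);
        if (!atomic_load(&eb->blocking))
            return;
        /* Someone owns the buffer in blocking mode: back off and wait. */
        toy_spin_unlock(eb);
        while (atomic_load(&eb->blocking))
            ; /* assumption: the kernel sleeps on a wait queue here */
    }
}

/* Caller holds the spinlock; afterwards ownership is tracked by the bit. */
static void toy_set_lock_blocking(struct toy_eb *eb)
{
    atomic_store(&eb->blocking, true);
    toy_spin_unlock(eb);
}

/* Caller owns the buffer in blocking mode; returns with the spinlock held. */
static void toy_clear_lock_blocking(struct toy_eb *eb)
{
    toy_spin_lock(eb);
    atomic_store(&eb->blocking, false);
}

/* Valid in either mode: the blocking bit decides what to release. */
static void toy_tree_unlock(struct toy_eb *eb)
{
    if (atomic_load(&eb->blocking))
        atomic_store(&eb->blocking, false);
    else
        toy_spin_unlock(eb);
}
```

The point of the design is that the owner never holds the spinlock while it might sleep; other lockers see the bit, release the spinlock, and wait until the bit clears.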
    • C
      Btrfs: hash_lock is no longer needed · c487685d
      Chris Mason committed
      Before metadata is written to disk, it is updated to reflect that writeout
      has begun.  Once this update is done, the block must be cow'd before it
      can be modified again.
      
      This update was originally synchronized by using a per-fs spinlock.  Today
      the buffers for the metadata blocks are locked before writeout begins,
      and everyone that tests the flag has the buffer locked as well.
      
      So, the per-fs spinlock (called hash_lock for no good reason) is no
      longer required.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      c487685d
    • C
      Btrfs: disable leak debugging checks in extent_io.c · 3935127c
      Chris Mason committed
      extent_io.c has debugging code to report and free leaked extent_state
      and extent_buffer objects at rmmod time.  This helps track down
      leaks and it saves you from rebooting just to properly remove the
      kmem_cache object.
      
      But, the code runs under a fairly expensive spinlock and the checks to
      see if it is currently enabled are not entirely consistent.  Some use
      #ifdef and some #if.
      
      This changes everything to #if and disables the leak checking.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      3935127c
    • C
      Btrfs: sort references by byte number during btrfs_inc_ref · b7a9f29f
      Chris Mason committed
      When a block goes through cow, we update the reference counts of
      everything that block points to.  The internal pointers of the block
      can be in just about any order, and it is likely to have clusters of
      things that are close together and clusters of things that are not.
      
      To help reduce the seeks that come with updating all of these reference
      counts, sort them by byte number before actual updates are done.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      b7a9f29f
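Sorting pointers by byte number before touching them can be sketched with the standard `qsort`. The record layout and the names `struct toy_ref`, `toy_ref_cmp`, and `toy_sort_refs` are assumptions made for illustration; the commit itself does not show how the kernel stores or sorts these entries.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one pointer found in a block going through cow. */
struct toy_ref {
    uint64_t bytenr;  /* byte number the pointer refers to */
    int slot;         /* original position of the pointer in the block */
};

/* Order by byte number; explicit comparisons avoid overflow from subtraction. */
static int toy_ref_cmp(const void *a, const void *b)
{
    const struct toy_ref *ra = a;
    const struct toy_ref *rb = b;

    if (ra->bytenr < rb->bytenr)
        return -1;
    if (ra->bytenr > rb->bytenr)
        return 1;
    return 0;
}

/* Sort in place so reference updates can be issued in ascending disk order. */
static void toy_sort_refs(struct toy_ref *refs, size_t nr)
{
    qsort(refs, nr, sizeof(*refs), toy_ref_cmp);
}
```

After sorting, a caller would walk the array front to back and update each reference count, so neighbouring byte numbers are handled together.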
    • C
      Btrfs: async threads should try harder to find work · b51912c9
      Chris Mason committed
      Tracing shows the delay between when an async thread goes to sleep
      and when more work is added is often very short.  This commit adds
      a little bit of delay and extra checking to the code right before
      we schedule out.
      
      It allows more work to be added to the worker
      without requiring notifications from other procs.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      b51912c9
    • J
      Btrfs: selinux support · 0279b4cd
      Jim Owens committed
      Add call to LSM security initialization and save
      resulting security xattr for new inodes.
      
      Add xattr support to symlink inode ops.
      
      Set inode->i_op for existing special files.
      Signed-off-by: jim owens <jowens@hp.com>
      0279b4cd
    • C
      Btrfs: make btrfs acls selectable · bef62ef3
      Christian Hesse committed
      This patch adds a menu entry to kconfig to enable acls for btrfs.
      This allows you to enable FS_POSIX_ACL at kernel compile time.
      
      (updated by Jeff Mahoney to make the changes in fs/btrfs/Kconfig instead)
      Signed-off-by: Christian Hesse <mail@earthworm.de>
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      bef62ef3
    • C
      Btrfs: Catch missed bios in the async bio submission thread · a6837051
      Chris Mason committed
      The async bio submission thread was missing some bios that were
      added after it had decided there was no work left to do.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      a6837051
  2. 29 Jan, 2009 1 commit
    • C
      Btrfs: fix readdir on 32 bit machines · 89f135d8
      Chris Mason committed
      After btrfs_readdir has gone through all the directory items, it
      sets the directory f_pos to the largest possible int.  This way
      applications that mix readdir with creating new files don't
      end up in an endless loop finding the new directory items as they go.
      
      It was a workaround for a bug in git, but the assumption was that if git
      could make this looping mistake then it would be a common problem.
      
      The largest possible int chosen was INT_LIMIT(typeof(file->f_pos)),
      and it is possible for that to be a larger number than 32 bit glibc
      expects to come out of readdir.
      
      This patch switches that to INT_LIMIT(off_t), which should keep
      applications happy on 32 and 64 bit machines.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      89f135d8
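The size mismatch behind this fix can be shown with a small calculation. `file->f_pos` is a 64-bit `loff_t` in the kernel, so the largest value of that type does not fit in a 32-bit signed offset. The helper `toy_signed_max` below is a hypothetical userspace stand-in for the idea behind INT_LIMIT, not the kernel macro itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest value representable in a signed two's-complement integer of
 * the given width. The arithmetic is done on unsigned values so that no
 * shift ever reaches a sign bit.
 */
static int64_t toy_signed_max(unsigned bits)
{
    assert(bits >= 2 && bits <= 64);
    return (int64_t)((UINT64_C(1) << (bits - 1)) - 1);
}
```

With this helper, `toy_signed_max(64)` is the limit of a 64-bit offset type and `toy_signed_max(32)` is the limit of a 32-bit one; an end-of-directory marker built from the former cannot be returned intact through an interface that only holds the latter.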
  3. 22 Jan, 2009 5 commits
    • C
      Btrfs: do less aggressive btree readahead · a7175319
      Chris Mason committed
      Just before reading a leaf, btrfs scans the node for blocks that are
      close by and reads them too.  It tries to build up a large window
      of IO looking for blocks that are within a max distance from the top
      and bottom of the IO window.
      
      This patch changes things to just look for blocks within 64k of the
      target block.  It will trigger less IO and make for lower latencies on
      the read side.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      a7175319
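The simplified readahead rule amounts to a distance check against the target block. The sketch below assumes the window is measured in bytes on either side of the target and that the boundary is inclusive; `TOY_READAHEAD_WINDOW` and `toy_within_readahead` are hypothetical names, not identifiers from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_READAHEAD_WINDOW UINT64_C(65536)  /* 64k, as stated in the commit */

/*
 * Return nonzero when candidate lies within the readahead window of
 * target, in either direction. The branch picks the larger operand
 * first so the unsigned subtraction never wraps.
 */
static int toy_within_readahead(uint64_t target, uint64_t candidate)
{
    uint64_t dist = candidate > target ? candidate - target
                                       : target - candidate;

    return dist <= TOY_READAHEAD_WINDOW;
}
```

A caller scanning a node would run this check on each block pointer and queue reads only for the ones that pass, instead of growing an IO window outward from its top and bottom.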
    • A
      fs/Kconfig: move btrfs out · 335debee
      Alexey Dobriyan committed
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      335debee
    • Y
      Btrfs: fiemap support · 1506fcc8
      Yehuda Sadeh committed
      Now that bmap support is gone, this is the only way to get extent
      mappings for userland.  These are still not valid for IO, but they
      can tell us if a file has holes or how much fragmentation there is.
      Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      1506fcc8
    • C
      Btrfs: stop providing a bmap operation to avoid swapfile corruptions · 35054394
      Chris Mason committed
      Swapfiles use bmap to build a list of extents belonging to the file,
      and they assume these extents won't change over the life of the file.
      They also use the resulting list to do IO directly to the block device.
      
      This causes problems for btrfs in a few ways:
      
      btrfs returns logical block numbers through bmap, and these are not suitable
      for IO.  They might translate to different devices, raid etc.
      
      COW means that file block mappings are going to change frequently.
      
      Using swapfiles on btrfs will lead to corruption, so we're avoiding the
      problem for now by dropping bmap support entirely.  A later commit
      will add fiemap support for people that really want to know how
      a file is laid out.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      35054394
    • Y
      Btrfs: fix tree logs parallel sync · 7237f183
      Yan Zheng committed
      To improve performance, btrfs_sync_log merges tree log sync
      requests. But it wrongly merges sync requests for different
      tree logs. If multiple tree logs are synced at the same time,
      only one of them actually gets synced.
      
      This patch makes the following changes to fix the bug:
      
      Move most tree log related fields in btrfs_fs_info to
      btrfs_root. This allows merging sync requests separately
      for each tree log.
      
      Don't insert the root item into the log root tree immediately
      after the log tree is allocated. The root item for a log tree is
      inserted when the log tree gets synced for the first time. This
      allows syncing the log root tree without first syncing all
      log trees.
      
      At tree-log sync, btrfs_sync_log first syncs the log tree,
      then updates the corresponding root item in the log root tree,
      syncs the log root tree, and finally updates the super block.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      7237f183
  4. 21 Jan, 2009 12 commits
  5. 17 Jan, 2009 2 commits
    • C
      Btrfs: fix ioctl arg size (userland incompatible change!) · c071fcfd
      Chris Mason committed
      The structure used to send device in btrfs ioctl calls was not
      properly aligned, and so 32 bit ioctls would not work properly on
      64 bit kernels.
      
      We could fix this with compat ioctls, but we're just one byte away
      and it doesn't make sense to carry the compat ioctls forever
      at this stage in the project.
      
      This patch brings the ioctl arg up to an evenly aligned 4k.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      c071fcfd
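One way to see why an evenly sized argument avoids compat ioctls is to check a candidate layout at compile time. The struct below is an assumption made for illustration (a 64-bit descriptor followed by a name buffer); the names `toy_ioctl_vol_args` and `TOY_PATH_NAME_MAX` and the value 4087 are not taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PATH_NAME_MAX 4087  /* hypothetical: leaves room for the NUL byte */

/*
 * Hypothetical ioctl argument. 8 bytes of fd plus 4088 bytes of name is
 * exactly 4096, which is a multiple of both 4 and 8, so no tail padding
 * is added whether the ABI aligns int64_t to 4 or to 8 bytes.
 */
struct toy_ioctl_vol_args {
    int64_t fd;
    char name[TOY_PATH_NAME_MAX + 1];
};

/* Fail the build if the layout ever drifts away from 4k. */
_Static_assert(sizeof(struct toy_ioctl_vol_args) == 4096,
               "ioctl argument must be exactly 4096 bytes");
```

Because the ioctl number encodes the argument size, a struct whose size is identical under 32 bit and 64 bit ABIs yields the same ioctl number for both kinds of caller.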
    • C
      Btrfs: Clear the device->running_pending flag before bailing on congestion · 1d9e2ae9
      Chris Mason committed
      Btrfs maintains a queue of async bio submissions so the checksumming
      threads don't have to wait on get_request_wait.  In order to avoid
      extra wakeups, this code has a running_pending flag that is used
      to tell new submissions they don't need to wake the thread.
      
      When the threads notice congestion on a single device, they
      may decide to requeue the job and move on to other devices.  This
      makes sure the running_pending flag is cleared before the
      job is requeued.
      
      It should help avoid IO stalls by making sure the task is woken up
      when new submissions come in.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      1d9e2ae9
  6. 16 Jan, 2009 1 commit
  7. 10 Jan, 2009 2 commits
    • L
      btrfs: fix for write_super_lockfs/unlockfs error handling · 0176260f
      Linus Torvalds committed
      Commit c4be0c1d added the ability for
      write_super_lockfs to return errors, and renamed them to match.  But
      btrfs didn't get converted.
      
      Do the minimal conversion to make it compile again.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0176260f
    • C
      Btrfs: explicitly mark the tree log root for writeback · e293e97e
      Chris Mason committed
      Each subvolume has an extent_state_tree used to mark metadata
      that needs to be sent to disk while syncing the tree.  This is
      used in addition to the dirty bits on the pages themselves so that
      a single subvolume can be sent to disk efficiently in disk order.
      
      Normally this marking happens in btrfs_alloc_free_block, which also does
      special recording of dirty tree blocks for the tree log roots.
      
      Yan Zheng noticed that when the root of the log tree is allocated, it is added
      to the wrong writeback list.  The fix used here is to explicitly set
      it dirty as part of tree log creation.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      e293e97e
  8. 08 Jan, 2009 1 commit
  9. 07 Jan, 2009 4 commits
  10. 06 Jan, 2009 3 commits