1. 21 2月, 2009 1 次提交
    • J
      Btrfs: add better -ENOSPC handling · 6a63209f
      Josef Bacik 提交于
      This is a step in the direction of better -ENOSPC handling.  Instead of
      checking the global bytes counter we check the space_info bytes counters to
      make sure we have enough space.
      
      If we don't we go ahead and try to allocate a new chunk, and then if that fails
      we return -ENOSPC.  This patch adds two counters to btrfs_space_info,
      bytes_delalloc and bytes_may_use.
      
      bytes_delalloc account for extents we've actually setup for delalloc and will
      be allocated at some point down the line. 
      
      bytes_may_use is to keep track of how many bytes we may use for delalloc at
      some point.  When we actually set the extent_bit for the delalloc bytes we
      subtract the reserved bytes from the bytes_may_use counter.  This keeps us from
      not actually being able to allocate space for any delalloc bytes.
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      
      
      
      6a63209f
  2. 20 2月, 2009 1 次提交
  3. 13 2月, 2009 2 次提交
    • Y
      Btrfs: hold trans_mutex when using btrfs_record_root_in_trans · 24562425
      Yan Zheng 提交于
      btrfs_record_root_in_trans needs the trans_mutex held to make sure two
      callers don't race to setup the root in a given transaction.  This adds
      it to all the places that were missing it.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      24562425
    • C
      Btrfs: make a lockdep class for the extent buffer locks · 4008c04a
      Chris Mason 提交于
      Btrfs is currently using spin_lock_nested with a nested value based
      on the tree depth of the block.  But, this doesn't quite work because
      the max tree depth is bigger than what spin_lock_nested can deal with,
      and because locks are sometimes taken before the level field is filled in.
      
      The solution here is to use lockdep_set_class_and_name instead, and to
      set the class before unlocking the pages when the block is read from the
      disk and just after init of a freshly allocated tree block.
      
      btrfs_clear_path_blocking is also changed to take the locks in the proper
      order, and it also makes sure all the locks currently held are properly
      set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
      gets upset about bad lock orderin.
      
      The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4008c04a
  4. 12 2月, 2009 1 次提交
    • J
      Btrfs: fs/btrfs/volumes.c: remove useless kzalloc · 3f3420df
      Julia Lawall 提交于
      The call to kzalloc is followed by a kmalloc whose result is stored in the
      same variable.
      
      The semantic match that finds the problem is as follows:
      (http://www.emn.fr/x-info/coccinelle/)
      
      // <smpl>
      @r exists@
      local idexpression x;
      statement S;
      expression E;
      identifier f,l;
      position p1,p2;
      expression *ptr != NULL;
      @@
      
      (
      if ((x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...)) == NULL) S
      |
      x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
      ...
      if (x == NULL) S
      )
      <... when != x
           when != if (...) { <+...x...+> }
      x->f = E
      ...>
      (
       return \(0\|<+...x...+>\|ptr\);
      |
       return@p2 ...;
      )
      
      @script:python@
      p1 << r.p1;
      p2 << r.p2;
      @@
      
      print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line)
      // </smpl>
      Signed-off-by: NJulia Lawall <julia@diku.dk>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      3f3420df
  5. 13 2月, 2009 2 次提交
  6. 12 2月, 2009 6 次提交
  7. 10 2月, 2009 1 次提交
    • C
      Btrfs: don't use spin_is_contended · 284b066a
      Chris Mason 提交于
      Btrfs was using spin_is_contended to see if it should drop locks before
      doing extent allocations during btrfs_search_slot.  The idea was to avoid
      expensive searches in the tree unless the lock was actually contended.
      
      But, spin_is_contended is specific to the ticket spinlocks on x86, so this
      is causing compile errors everywhere else.
      
      In practice, the contention could easily appear some time after we started
      doing the extent allocation, and it makes more sense to always drop the lock
      instead.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      284b066a
  8. 07 2月, 2009 1 次提交
  9. 05 2月, 2009 1 次提交
  10. 04 2月, 2009 17 次提交
    • C
      Btrfs: don't return congestion in write_cache_pages as often · 9b0d3ace
      Chris Mason 提交于
      On fast devices that go from congested to uncongested very quickly, pdflush
      is waiting too often in congestion_wait, and the FS is backing off to
      easily in write_cache_pages.
      
      For now, fix this on the btrfs side by only checking congestion after
      some bios have already gone down.  Longer term a real fix is needed
      for pdflush, but that is a larger project.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9b0d3ace
    • C
      Btrfs: Only prep for btree deletion balances when nodes are mostly empty · 7b78c170
      Chris Mason 提交于
      Whenever an item deletion is done, we need to balance all the nodes
      in the tree to make sure we don't end up with an empty node if a pointer
      is deleted.  This balance prep happens from the root of the tree down
      so we can drop our locks as we go.
      
      reada_for_balance was triggering read-ahead on neighboring nodes even
      when no balancing was required.  This adds an extra check to avoid
      calling balance_level() and avoid reada_for_balance() when a balance
      won't be required.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      7b78c170
    • C
      Btrfs: fix btrfs_unlock_up_safe to walk the entire path · 12f4dacc
      Chris Mason 提交于
      btrfs_unlock_up_safe would break out at the first NULL node entry or
      unlocked node it found in the path.
      
      Some of the callers have missing nodes at the lower levels of the path, so this
      commit fixes things to check all the nodes in the path before returning.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      12f4dacc
    • C
      Btrfs: change btrfs_del_leaf to drop locks earlier · 4d081c41
      Chris Mason 提交于
      btrfs_del_leaf does two things.  First it removes the pointer in the
      parent, and then it frees the block that has the leaf.  It has the
      parent node locked for both operations.
      
      But, it only needs the parent locked while it is deleting the pointer.
      After that it can safely free the block without the parent locked.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4d081c41
    • C
      Btrfs: Change btrfs_truncate_inode_items to stop when it hits the inode · 06d9a8d7
      Chris Mason 提交于
      btrfs_truncate_inode_items is setup to stop doing btree searches when
      it has finished removing the items for the inode.  It used to detect the
      end of the inode by looking for an objectid that didn't match the
      one we were searching for.
      
      But, this would result in an extra search through the btree, which
      adds extra balancing and cow costs to the operation.
      
      This commit adds a check to see if we found the inode item, which means
      we can stop searching early.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      06d9a8d7
    • C
      Btrfs: Don't try to compress pages past i_size · f03d9301
      Chris Mason 提交于
      The compression code had some checks to make sure we were only
      compressing bytes inside of i_size, but it wasn't catching every
      case.  To make things worse, some incorrect math about the number
      of bytes remaining would make it try to compress more pages than the
      file really had.
      
      The fix used here is to fall back to the non-compression code in this
      case, which does all the proper cleanup of delalloc and other accounting.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      f03d9301
    • J
      Btrfs: join the transaction in __btrfs_setxattr · 81144949
      Josef Bacik 提交于
      With selinux on we end up calling __btrfs_setxattr when we create an inode,
      which calls btrfs_start_transaction().  The problem is we've already called
      that in btrfs_new_inode, and in btrfs_start_transaction we end up doing a
      wait_current_trans().  If btrfs-transaction has started committing it will wait
      for all handles to finish, while the other process is waiting for the
      transaction to commit.  This is fixed by using btrfs_join_transaction, which
      won't wait for the transaction to commit.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      
      81144949
    • C
      Btrfs: Handle SGID bit when creating inodes · 8c087b51
      Chris Ball 提交于
      Before this patch, new files/dirs would ignore the SGID bit on their
      parent directory and always be owned by the creating user's uid/gid.
      Signed-off-by: NChris Ball <cjb@laptop.org>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      
      8c087b51
    • C
      Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks · bd56b302
      Chris Mason 提交于
      Every transaction in btrfs creates a new snapshot, and then schedules the
      snapshot from the last transaction for deletion.  Snapshot deletion
      works by walking down the btree and dropping the reference counts
      on each btree block during the walk.
      
      If if a given leaf or node has a reference count greater than one,
      the reference count is decremented and the subtree pointed to by that
      node is ignored.
      
      If the reference count is one, walking continues down into that node
      or leaf, and the references of everything it points to are decremented.
      
      The old code would try to work in small pieces, walking down the tree
      until it found the lowest leaf or node to free and then returning.  This
      was very friendly to the rest of the FS because it didn't have a huge
      impact on other operations.
      
      But it wouldn't always keep up with the rate that new commits added new
      snapshots for deletion, and it wasn't very optimal for the extent
      allocation tree because it wasn't finding leaves that were close together
      on disk and processing them at the same time.
      
      This changes things to walk down to a level 1 node and then process it
      in bulk.  All the leaf pointers are sorted and the leaves are dropped
      in order based on their extent number.
      
      The extent allocation tree and commit code are now fast enough for
      this kind of bulk processing to work without slowing the rest of the FS
      down.  Overall it does less IO and is better able to keep up with
      snapshot deletions under high load.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      bd56b302
    • C
      Btrfs: Change btree locking to use explicit blocking points · b4ce94de
      Chris Mason 提交于
      Most of the btrfs metadata operations can be protected by a spinlock,
      but some operations still need to schedule.
      
      So far, btrfs has been using a mutex along with a trylock loop,
      most of the time it is able to avoid going for the full mutex, so
      the trylock loop is a big performance gain.
      
      This commit is step one for getting rid of the blocking locks entirely.
      btrfs_tree_lock takes a spinlock, and the code explicitly switches
      to a blocking lock when it starts an operation that can schedule.
      
      We'll be able get rid of the blocking locks in smaller pieces over time.
      Tracing allows us to find the most common cause of blocking, so we
      can start with the hot spots first.
      
      The basic idea is:
      
      btrfs_tree_lock() returns with the spin lock held
      
      btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
      the extent buffer flags, and then drops the spin lock.  The buffer is
      still considered locked by all of the btrfs code.
      
      If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
      the spin lock and waits on a wait queue for the blocking bit to go away.
      
      Much of the code that needs to set the blocking bit finishes without actually
      blocking a good percentage of the time.  So, an adaptive spin is still
      used against the blocking bit to avoid very high context switch rates.
      
      btrfs_clear_lock_blocking() clears the blocking bit and returns
      with the spinlock held again.
      
      btrfs_tree_unlock() can be called on either blocking or spinning locks,
      it does the right thing based on the blocking bit.
      
      ctree.c has a helper function to set/clear all the locked buffers in a
      path as blocking.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b4ce94de
    • C
      Btrfs: hash_lock is no longer needed · c487685d
      Chris Mason 提交于
      Before metadata is written to disk, it is updated to reflect that writeout
      has begun.  Once this update is done, the block must be cow'd before it
      can be modified again.
      
      This update was originally synchronized by using a per-fs spinlock.  Today
      the buffers for the metadata blocks are locked before writeout begins,
      and everyone that tests the flag has the buffer locked as well.
      
      So, the per-fs spinlock (called hash_lock for no good reason) is no
      longer required.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c487685d
    • C
      Btrfs: disable leak debugging checks in extent_io.c · 3935127c
      Chris Mason 提交于
      extent_io.c has debugging code to report and free leaked extent_state
      and extent_buffer objects at rmmod time.  This helps track down
      leaks and it saves you from rebooting just to properly remove the
      kmem_cache object.
      
      But, the code runs under a fairly expensive spinlock and the checks to
      see if it is currently enabled are not entirely consistent.  Some use
      #ifdef and some #if.
      
      This changes everything to #if and disables the leak checking.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      3935127c
    • C
      Btrfs: sort references by byte number during btrfs_inc_ref · b7a9f29f
      Chris Mason 提交于
      When a block goes through cow, we update the reference counts of
      everything that block points to.  The internal pointers of the block
      can be in just about any order, and it is likely to have clusters of
      things that are close together and clusters of things that are not.
      
      To help reduce the seeks that come with updating all of these reference
      counts, sort them by byte number before actual updates are done.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b7a9f29f
    • C
      Btrfs: async threads should try harder to find work · b51912c9
      Chris Mason 提交于
      Tracing shows the delay between when an async thread goes to sleep
      and when more work is added is often very short.  This commit adds
      a little bit of delay and extra checking to the code right before
      we schedule out.
      
      It allows more work to be added to the worker
      without requiring notifications from other procs.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b51912c9
    • J
      Btrfs: selinux support · 0279b4cd
      Jim Owens 提交于
      Add call to LSM security initialization and save
      resulting security xattr for new inodes.
      
      Add xattr support to symlink inode ops.
      
      Set inode->i_op for existing special files.
      Signed-off-by: Njim owens <jowens@hp.com>
      0279b4cd
    • C
      Btrfs: make btrfs acls selectable · bef62ef3
      Christian Hesse 提交于
      This patch adds a menu entry to kconfig to enable acls for btrfs.
      This allows you to enable FS_POSIX_ACL at kernel compile time.
      
      (updated by Jeff Mahoney to make the changes in fs/btrfs/Kconfig instead)
      Signed-off-by: NChristian Hesse <mail@earthworm.de>
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      bef62ef3
    • C
      Btrfs: Catch missed bios in the async bio submission thread · a6837051
      Chris Mason 提交于
      The async bio submission thread was missing some bios that were
      added after it had decided there was no work left to do.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a6837051
  11. 29 1月, 2009 1 次提交
    • C
      Btrfs: fix readdir on 32 bit machines · 89f135d8
      Chris Mason 提交于
      After btrfs_readdir has gone through all the directory items, it
      sets the directory f_pos to the largest possible int.  This way
      applications that mix readdir with creating new files don't
      end up in an endless loop finding the new directory items as they go.
      
      It was a workaround for a bug in git, but the assumption was that if git
      could make this looping mistake than it would be a common problem.
      
      The largest possible int chosen was INT_LIMIT(typeof(file->f_pos),
      and it is possible for that to be a larger number than 32 bit glibc
      expects to come out of readdir.
      
      This patches switches that to INT_LIMIT(off_t), which should keep
      applications happy on 32 and 64 bit machines.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      89f135d8
  12. 28 1月, 2009 3 次提交
  13. 27 1月, 2009 1 次提交
    • V
      inotify: clean up inotify_read and fix locking problems · 3632dee2
      Vegard Nossum 提交于
      If userspace supplies an invalid pointer to a read() of an inotify
      instance, the inotify device's event list mutex is unlocked twice.
      This causes an unbalance which effectively leaves the data structure
      unprotected, and we can trigger oopses by accessing the inotify
      instance from different tasks concurrently.
      
      The best fix (contributed largely by Linus) is a total rewrite
      of the function in question:
      
      On Thu, Jan 22, 2009 at 7:05 AM, Linus Torvalds wrote:
      > The thing to notice is that:
      >
      >  - locking is done in just one place, and there is no question about it
      >   not having an unlock.
      >
      >  - that whole double-while(1)-loop thing is gone.
      >
      >  - use multiple functions to make nesting and error handling sane
      >
      >  - do error testing after doing the things you always need to do, ie do
      >   this:
      >
      >        mutex_lock(..)
      >        ret = function_call();
      >        mutex_unlock(..)
      >
      >        .. test ret here ..
      >
      >   instead of doing conditional exits with unlocking or freeing.
      >
      > So if the code is written in this way, it may still be buggy, but at least
      > it's not buggy because of subtle "forgot to unlock" or "forgot to free"
      > issues.
      >
      > This _always_ unlocks if it locked, and it always frees if it got a
      > non-error kevent.
      
      Cc: John McCutchan <ttb@tentacle.dhs.org>
      Cc: Robert Love <rlove@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NVegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3632dee2
  14. 26 1月, 2009 2 次提交
    • M
      fuse: fix poll notify · f6d47a17
      Miklos Szeredi 提交于
      Move fuse_copy_finish() to before calling fuse_notify_poll_wakeup().
      This is not a big issue because fuse_notify_poll_wakeup() should be
      atomic, but it's cleaner this way, and later uses of notification will
      need to be able to finish the copying before performing some actions.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      f6d47a17
    • M
      fuse: destroy bdi on umount · 26c36791
      Miklos Szeredi 提交于
      If a fuse filesystem is unmounted but the device file descriptor
      remains open and a new mount reuses the old device number, then the
      mount fails with EEXIST and the following warning is printed in the
      kernel log:
      
        WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d()
        sysfs: duplicate filename '0:15' can not be created
      
      The cause is that the bdi belonging to the fuse filesystem was
      destoryed only after the device file was released.  Fix this by
      calling bdi_destroy() from fuse_put_super() instead.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      CC: stable@kernel.org
      26c36791