1. 09 Feb 2009 (1 commit)
  2. 07 Feb 2009 (5 commits)
    • eCryptfs: Regression in unencrypted filename symlinks · fd9fc842
      Committed by Tyler Hicks
      The addition of filename encryption caused a regression in unencrypted
      filename symlink support.  ecryptfs_copy_filename() is used when dealing
      with unencrypted filenames and it reported that the new, copied filename
      was a character longer than it should have been.
      
      This caused the return value of readlink() to count the NULL byte of the
      symlink target.  Most applications don't care about the extra NULL byte,
      but a version control system (bzr) helped in discovering the bug.
      Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fd9fc842
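
      A minimal, standalone sketch of the off-by-one described above. The
      copy_filename() helper is a hypothetical stand-in for
      ecryptfs_copy_filename(), not the eCryptfs code itself:

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      /* Hypothetical stand-in: copy the name and report its length.  Reporting
       * len + 1 (the buggy behaviour) makes callers such as readlink() count
       * the terminating NUL byte in the symlink target length. */
      static int copy_filename(char **copied, size_t *copied_len,
                               const char *name, size_t len)
      {
              *copied = malloc(len + 1);      /* +1 for the NUL is correct */
              if (!*copied)
                      return -1;
              memcpy(*copied, name, len);
              (*copied)[len] = '\0';
              *copied_len = len;              /* fix: report len, not len + 1 */
              return 0;
      }

      int main(void)
      {
              char *copy;
              size_t n;

              if (copy_filename(&copy, &n, "target", strlen("target")))
                      return 1;
              printf("reported length: %zu (expected 6)\n", n);
              free(copy);
              return 0;
      }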
    • elf core dump: fix get_user use · 92dc07b1
      Committed by Roland McGrath
      The elf_core_dump() code does its work with set_fs(KERNEL_DS) in force,
      so vma_dump_size() needs to switch back with set_fs(USER_DS) to safely
      use get_user() for a normal user-space address.
      
      Checking for VM_READ optimizes out the case where get_user() would fail
      anyway.  The vm_file check here was already superfluous given the control
      flow earlier in the function, so that is a cleanup/optimization unrelated
      to other changes but an obvious and trivial one.
      Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Roland McGrath <roland@redhat.com>
      92dc07b1
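
      A sketch of the pattern described above, assuming the 2009-era
      set_fs()/get_user() interface; this is an illustrative fragment, not the
      actual vma_dump_size() from fs/binfmt_elf.c, and the helper name and
      'magic' parameter are made up:

      /* The core dump path runs with set_fs(KERNEL_DS) in force, so switch to
       * USER_DS around get_user() when peeking at a user-space address. */
      static int first_word_matches(struct vm_area_struct *vma, unsigned long magic)
      {
              unsigned long word = 0;
              mm_segment_t old_fs;
              int match = 0;

              if (!(vma->vm_flags & VM_READ))   /* get_user() would fail anyway */
                      return 0;

              old_fs = get_fs();
              set_fs(USER_DS);                  /* let get_user() validate a user address */
              if (get_user(word, (unsigned long __user *)vma->vm_start) == 0)
                      match = (word == magic);
              set_fs(old_fs);                   /* back to KERNEL_DS for the rest of the dump */
              return match;
      }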
    • CRED: Fix SUID exec regression · 0bf2f3ae
      Committed by David Howells
      The patch:
      
      	commit a6f76f23
      	CRED: Make execve() take advantage of copy-on-write credentials
      
      moved the point at which the 'safeness' of a SUID/SGID exec is determined to
      before de_thread() is called.  This means that LSM_UNSAFE_SHARE is now
      calculated incorrectly.  That flag is set if any of the usage counts for
      fs_struct, files_struct and sighand_struct are greater than 1 at the time the
      determination is made, and all of them are greater than 1 for threads created
      by the pthread library.
      
      However, we want to make the security calculation before irrevocably damaging
      the process, so that we can still return an error code if we decide to reject
      the exec request on this basis.  That means the determination has to be made
      before calling de_thread().
      
      So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
      our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
      (CLONE_SIGHAND/CLONE_THREAD) with us.  These will be killed by de_thread() and
      so can be discounted by check_unsafe_exec().
      
      We do have to be careful because CLONE_THREAD does not imply FS or FILES.
      
      We _assume_ that there will be no extra references to these structs held by the
      threads we're going to kill.
      
      This can be tested with the attached pair of programs.  Build the two programs
      using the Makefile supplied, and run ./test1 as a non-root user.  If
      successful, you should see something like:
      
      	[dhowells@andromeda tmp]$ ./test1
      	--TEST1--
      	uid=4043, euid=4043 suid=4043
      	exec ./test2
      	--TEST2--
      	uid=4043, euid=0 suid=0
      	SUCCESS - Correct effective user ID
      
      and if unsuccessful, something like:
      
      	[dhowells@andromeda tmp]$ ./test1
      	--TEST1--
      	uid=4043, euid=4043 suid=4043
      	exec ./test2
      	--TEST2--
      	uid=4043, euid=4043 suid=4043
      	ERROR - Incorrect effective user ID!
      
      The non-root user ID you see will depend on the user you run as.
      
      [test1.c]
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <pthread.h>
      
      static void *thread_func(void *arg)
      {
      	while (1) {}
      }
      
      int main(int argc, char **argv)
      {
      	pthread_t tid;
      	uid_t uid, euid, suid;
      
      	printf("--TEST1--\n");
      	getresuid(&uid, &euid, &suid);
      	printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
      
      	/* pthread_create() returns an error number on failure, not -1 */
      	if (pthread_create(&tid, NULL, thread_func, NULL) != 0) {
      		perror("pthread_create");
      		exit(1);
      	}
      
      	printf("exec ./test2\n");
      	execlp("./test2", "test2", NULL);
      	perror("./test2");
      	_exit(1);
      }
      
      [test2.c]
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      
      int main(int argc, char **argv)
      {
      	uid_t uid, euid, suid;
      
      	getresuid(&uid, &euid, &suid);
      	printf("--TEST2--\n");
      	printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
      
      	if (euid != 0) {
      		fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
      		exit(1);
      	}
      	printf("SUCCESS - Correct effective user ID\n");
      	exit(0);
      }
      
      [Makefile]
      CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
      all: test1 test2
      
      test1: test1.c
      	gcc $(CFLAGS) -o test1 test1.c -lpthread
      
      test2: test2.c
      	gcc $(CFLAGS) -o test2 test2.c
      	sudo chown root.root test2
      	sudo chmod +s test2
      Reported-by: David Smith <dsmith@redhat.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: David Smith <dsmith@redhat.com>
      Signed-off-by: James Morris <jmorris@namei.org>
      0bf2f3ae
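
      A heavily simplified sketch of the counting described above; the helper
      name, field accesses and omitted locking are assumptions, so this is
      illustrative rather than the actual check_unsafe_exec() in fs/exec.c:

      /* Count how many threads in our (about to be killed) thread group share
       * our fs_struct and files_struct.  If the structs have more users than
       * that, something outside the group shares them and the exec must be
       * treated as sharing.  Proper locking of the thread list is omitted. */
      static int exec_share_is_unsafe(struct task_struct *p)
      {
              struct task_struct *t;
              unsigned int n_fs = 1, n_files = 1, n_sighand = 1;

              for (t = next_thread(p); t != p; t = next_thread(t)) {
                      if (t->fs == p->fs)
                              n_fs++;
                      if (t->files == p->files)
                              n_files++;
                      n_sighand++;    /* CLONE_THREAD implies a shared sighand */
              }

              return atomic_read(&p->fs->count) > n_fs ||
                     atomic_read(&p->files->count) > n_files ||
                     atomic_read(&p->sighand->count) > n_sighand;
      }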
    • vfs: Don't call attach_nobh_buffers() with an empty list · d4cf109f
      Committed by Dave Kleikamp
      This is a modification of a patch by Bill Pemberton <wfp5p@virginia.edu>
      
      nobh_write_end() could call attach_nobh_buffers() with head == NULL.
      This would result in a trap when attach_nobh_buffers() attempted to
      access bh->b_this_page.
      
      This can be illustrated by running the writev01 testcase from LTP on jfs.
      
      This error was introduced by commit 5b41e74a "vfs: fix data leak in
      nobh_write_end()".  That patch did not take into account that if
      PageMappedToDisk() is true upon entry to nobh_write_begin(), then no
      buffers will be allocated for the page.  In that case, we won't have to
      worry about a failed write leaving uninitialized data in the page.
      
      Of course, head != NULL implies !page_has_buffers(page), so no need to
      test both.
      Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Bill Pemberton <wfp5p@virginia.edu>
      Cc: Dmitri Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d4cf109f
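
      A minimal sketch of the guard this fix adds; the condition is paraphrased
      rather than copied from fs/buffer.c:

      /* head may be NULL here: when the page was already mapped to disk,
       * nobh_write_begin() allocated no buffers, so there is nothing to attach
       * and attach_nobh_buffers() must not walk an empty list. */
      if (head)
              attach_nobh_buffers(page, head);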
    • Btrfs: Make sure dir is non-null before doing S_ISGID checks · 42f15d77
      Committed by Chris Mason
      The S_ISGID check in btrfs_new_inode caused an oops during subvol creation
      because sometimes the dir is null.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      42f15d77
  3. 06 Feb 2009 (3 commits)
  4. 05 Feb 2009 (2 commits)
  5. 04 Feb 2009 (19 commits)
    • Btrfs: don't return congestion in write_cache_pages as often · 9b0d3ace
      Committed by Chris Mason
      On fast devices that go from congested to uncongested very quickly, pdflush
      is waiting too often in congestion_wait, and the FS is backing off too
      easily in write_cache_pages.
      
      For now, fix this on the btrfs side by only checking congestion after
      some bios have already gone down.  Longer term a real fix is needed
      for pdflush, but that is a larger project.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      9b0d3ace
    • Btrfs: Only prep for btree deletion balances when nodes are mostly empty · 7b78c170
      Committed by Chris Mason
      Whenever an item deletion is done, we need to balance all the nodes
      in the tree to make sure we don't end up with an empty node if a pointer
      is deleted.  This balance prep happens from the root of the tree down
      so we can drop our locks as we go.
      
      reada_for_balance was triggering read-ahead on neighboring nodes even
      when no balancing was required.  This adds an extra check so that both
      balance_level() and reada_for_balance() are skipped when a balance
      won't be required.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      7b78c170
    • Btrfs: fix btrfs_unlock_up_safe to walk the entire path · 12f4dacc
      Committed by Chris Mason
      btrfs_unlock_up_safe would break out at the first NULL node entry or
      unlocked node it found in the path.
      
      Some of the callers have missing nodes at the lower levels of the path, so this
      commit fixes things to check all the nodes in the path before returning.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      12f4dacc
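
      A sketch of the loop shape after this fix, assuming the usual btrfs_path
      layout (nodes[] and locks[] indexed by level); simplified from the real
      ctree.c code:

      /* Keep scanning all the way to the top instead of breaking on the first
       * NULL or unlocked slot, since lower levels of the path may legitimately
       * be missing. */
      void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
      {
              int i;

              for (i = level; i < BTRFS_MAX_LEVEL; i++) {
                      if (!path->nodes[i])
                              continue;       /* was: break */
                      if (!path->locks[i])
                              continue;       /* was: break */
                      btrfs_tree_unlock(path->nodes[i]);
                      path->locks[i] = 0;
              }
      }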
    • Btrfs: change btrfs_del_leaf to drop locks earlier · 4d081c41
      Committed by Chris Mason
      btrfs_del_leaf does two things.  First it removes the pointer in the
      parent, and then it frees the block that has the leaf.  It has the
      parent node locked for both operations.
      
      But, it only needs the parent locked while it is deleting the pointer.
      After that it can safely free the block without the parent locked.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      4d081c41
    • Btrfs: Change btrfs_truncate_inode_items to stop when it hits the inode · 06d9a8d7
      Committed by Chris Mason
      btrfs_truncate_inode_items is setup to stop doing btree searches when
      it has finished removing the items for the inode.  It used to detect the
      end of the inode by looking for an objectid that didn't match the
      one we were searching for.
      
      But, this would result in an extra search through the btree, which
      adds extra balancing and cow costs to the operation.
      
      This commit adds a check to see if we found the inode item, which means
      we can stop searching early.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      06d9a8d7
    • Btrfs: Don't try to compress pages past i_size · f03d9301
      Committed by Chris Mason
      The compression code had some checks to make sure we were only
      compressing bytes inside of i_size, but it wasn't catching every
      case.  To make things worse, some incorrect math about the number
      of bytes remaining would make it try to compress more pages than the
      file really had.
      
      The fix used here is to fall back to the non-compression code in this
      case, which does all the proper cleanup of delalloc and other accounting.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      f03d9301
    • Btrfs: join the transaction in __btrfs_setxattr · 81144949
      Committed by Josef Bacik
      With selinux on we end up calling __btrfs_setxattr when we create an inode,
      which calls btrfs_start_transaction().  The problem is we've already called
      that in btrfs_new_inode, and in btrfs_start_transaction we end up doing a
      wait_current_trans().  If btrfs-transaction has started committing it will wait
      for all handles to finish, while the other process is waiting for the
      transaction to commit.  This is fixed by using btrfs_join_transaction, which
      won't wait for the transaction to commit.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      
      81144949
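
      A minimal sketch of the change, assuming the 2009-era signatures
      btrfs_join_transaction(root, nblocks) and btrfs_end_transaction(trans,
      root); the xattr insertion itself is elided:

      struct btrfs_trans_handle *trans;

      /* was: trans = btrfs_start_transaction(root, 1);
       * joining the running transaction skips wait_current_trans(), so the
       * xattr write cannot deadlock against a commit that is waiting for the
       * handle already held by btrfs_new_inode(). */
      trans = btrfs_join_transaction(root, 1);

      /* ... insert the xattr item under 'trans' ... */

      btrfs_end_transaction(trans, root);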
    • Btrfs: Handle SGID bit when creating inodes · 8c087b51
      Committed by Chris Ball
      Before this patch, new files/dirs would ignore the SGID bit on their
      parent directory and always be owned by the creating user's uid/gid.
      Signed-off-by: Chris Ball <cjb@laptop.org>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      
      8c087b51
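
      A sketch of the standard SGID inheritance rule being applied, including
      the NULL dir guard from the related fix above; paraphrased, not the exact
      btrfs_new_inode():

      /* New inodes inherit the group of an S_ISGID parent directory, and new
       * directories keep S_ISGID themselves.  dir can be NULL during subvolume
       * creation, hence the guard. */
      inode->i_uid = current_fsuid();
      if (dir && (dir->i_mode & S_ISGID)) {
              inode->i_gid = dir->i_gid;
              if (S_ISDIR(mode))
                      mode |= S_ISGID;
      } else {
              inode->i_gid = current_fsgid();
      }
      inode->i_mode = mode;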
    • Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks · bd56b302
      Committed by Chris Mason
      Every transaction in btrfs creates a new snapshot, and then schedules the
      snapshot from the last transaction for deletion.  Snapshot deletion
      works by walking down the btree and dropping the reference counts
      on each btree block during the walk.
      
      If a given leaf or node has a reference count greater than one,
      the reference count is decremented and the subtree pointed to by that
      node is ignored.
      
      If the reference count is one, walking continues down into that node
      or leaf, and the references of everything it points to are decremented.
      
      The old code would try to work in small pieces, walking down the tree
      until it found the lowest leaf or node to free and then returning.  This
      was very friendly to the rest of the FS because it didn't have a huge
      impact on other operations.
      
      But it wouldn't always keep up with the rate that new commits added new
      snapshots for deletion, and it wasn't very optimal for the extent
      allocation tree because it wasn't finding leaves that were close together
      on disk and processing them at the same time.
      
      This changes things to walk down to a level 1 node and then process it
      in bulk.  All the leaf pointers are sorted and the leaves are dropped
      in order based on their extent number.
      
      The extent allocation tree and commit code are now fast enough for
      this kind of bulk processing to work without slowing the rest of the FS
      down.  Overall it does less IO and is better able to keep up with
      snapshot deletions under high load.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      bd56b302
    • Btrfs: Change btree locking to use explicit blocking points · b4ce94de
      Committed by Chris Mason
      Most of the btrfs metadata operations can be protected by a spinlock,
      but some operations still need to schedule.
      
      So far, btrfs has been using a mutex along with a trylock loop;
      most of the time it is able to avoid going for the full mutex, so
      the trylock loop is a big performance gain.
      
      This commit is step one for getting rid of the blocking locks entirely.
      btrfs_tree_lock takes a spinlock, and the code explicitly switches
      to a blocking lock when it starts an operation that can schedule.
      
      We'll be able to get rid of the blocking locks in smaller pieces over time.
      Tracing allows us to find the most common cause of blocking, so we
      can start with the hot spots first.
      
      The basic idea is:
      
      btrfs_tree_lock() returns with the spin lock held
      
      btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
      the extent buffer flags, and then drops the spin lock.  The buffer is
      still considered locked by all of the btrfs code.
      
      If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
      the spin lock and waits on a wait queue for the blocking bit to go away.
      
      Much of the code that needs to set the blocking bit finishes without actually
      blocking a good percentage of the time.  So, an adaptive spin is still
      used against the blocking bit to avoid very high context switch rates.
      
      btrfs_clear_lock_blocking() clears the blocking bit and returns
      with the spinlock held again.
      
      btrfs_tree_unlock() can be called on either blocking or spinning locks,
      it does the right thing based on the blocking bit.
      
      ctree.c has a helper function to set/clear all the locked buffers in a
      path as blocking.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      b4ce94de
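
      A usage sketch of the locking pattern described above, using only the
      helpers named in this commit; eb stands for the locked extent buffer and
      the work done while blocking is a placeholder:

      btrfs_tree_lock(eb);              /* returns with the spinlock held */

      btrfs_set_lock_blocking(eb);      /* set EXTENT_BUFFER_BLOCKING, drop the spinlock */
      read_other_blocks_from_disk();    /* placeholder for work that may schedule */
      btrfs_clear_lock_blocking(eb);    /* reacquire the spinlock, clear the bit */

      btrfs_tree_unlock(eb);            /* handles either spinning or blocking state */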
    • Btrfs: hash_lock is no longer needed · c487685d
      Committed by Chris Mason
      Before metadata is written to disk, it is updated to reflect that writeout
      has begun.  Once this update is done, the block must be cow'd before it
      can be modified again.
      
      This update was originally synchronized by using a per-fs spinlock.  Today
      the buffers for the metadata blocks are locked before writeout begins,
      and everyone that tests the flag has the buffer locked as well.
      
      So, the per-fs spinlock (called hash_lock for no good reason) is no
      longer required.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      c487685d
    • Btrfs: disable leak debugging checks in extent_io.c · 3935127c
      Committed by Chris Mason
      extent_io.c has debugging code to report and free leaked extent_state
      and extent_buffer objects at rmmod time.  This helps track down
      leaks and it saves you from rebooting just to properly remove the
      kmem_cache object.
      
      But, the code runs under a fairly expensive spinlock and the checks to
      see if it is currently enabled are not entirely consistent.  Some use
      #ifdef and some #if.
      
      This changes everything to #if and disables the leak checking.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      3935127c
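
      A small illustration of the #if-versus-#ifdef distinction this relies on;
      the macro name follows extent_io.c's LEAK_DEBUG, but the body here is
      illustrative:

      #define LEAK_DEBUG 0    /* with "#if", defining it as 0 disables the code;
                               * with "#ifdef", merely being defined would enable it */

      #if LEAK_DEBUG
      /* expensive bookkeeping: track every extent_state/extent_buffer on a
       * list under a spinlock so leaks can be reported at rmmod time */
      #endif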
    • Btrfs: sort references by byte number during btrfs_inc_ref · b7a9f29f
      Committed by Chris Mason
      When a block goes through cow, we update the reference counts of
      everything that block points to.  The internal pointers of the block
      can be in just about any order, and it is likely to have clusters of
      things that are close together and clusters of things that are not.
      
      To help reduce the seeks that come with updating all of these reference
      counts, sort them by byte number before actual updates are done.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      b7a9f29f
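
      A sketch of the sort described above, assuming a small helper struct (the
      name refsort is an assumption) and the kernel's sort() from linux/sort.h;
      the reference-update loop itself is elided:

      struct refsort {
              u64 bytenr;     /* disk byte number the pointer refers to */
              u32 slot;       /* slot in the node, so updates can find the pointer */
      };

      /* comparison function for sort(): order the to-be-updated references by
       * byte number so the reference count updates walk the disk in order */
      static int refsort_cmp(const void *a, const void *b)
      {
              const struct refsort *ra = a, *rb = b;

              if (ra->bytenr < rb->bytenr)
                      return -1;
              if (ra->bytenr > rb->bytenr)
                      return 1;
              return 0;
      }

      /* ... fill sorted[] from the node, then:
       * sort(sorted, nritems, sizeof(struct refsort), refsort_cmp, NULL);
       * and update the reference counts in sorted order ... */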
    • Btrfs: async threads should try harder to find work · b51912c9
      Committed by Chris Mason
      Tracing shows the delay between when an async thread goes to sleep
      and when more work is added is often very short.  This commit adds
      a little bit of delay and extra checking to the code right before
      we schedule out.
      
      It allows more work to be added to the worker
      without requiring notifications from other procs.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      b51912c9
    • Btrfs: selinux support · 0279b4cd
      Committed by Jim Owens
      Add call to LSM security initialization and save
      resulting security xattr for new inodes.
      
      Add xattr support to symlink inode ops.
      
      Set inode->i_op for existing special files.
      Signed-off-by: jim owens <jowens@hp.com>
      0279b4cd
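
      A sketch of the LSM call described above, assuming the 2009-era
      security_inode_init_security(inode, dir, &name, &value, &len) signature;
      the helper that actually stores the "security." xattr is hypothetical:

      /* Ask the active LSM (e.g. SELinux) for the security label of a freshly
       * created inode and store it as an xattr.  -EOPNOTSUPP means the LSM
       * does not label this inode, which is not an error. */
      static int sketch_init_security(struct inode *inode, struct inode *dir)
      {
              char *name;
              void *value;
              size_t len;
              int err;

              err = security_inode_init_security(inode, dir, &name, &value, &len);
              if (err)
                      return err == -EOPNOTSUPP ? 0 : err;

              /* hypothetical helper: prefix the name with "security." and store
               * it through the filesystem's setxattr path */
              err = store_security_xattr(inode, name, value, len);

              kfree(name);
              kfree(value);
              return err;
      }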
    • Btrfs: make btrfs acls selectable · bef62ef3
      Committed by Christian Hesse
      This patch adds a Kconfig menu entry to enable ACLs for btrfs,
      allowing FS_POSIX_ACL support to be selected at kernel compile time.
      
      (updated by Jeff Mahoney to make the changes in fs/btrfs/Kconfig instead)
      Signed-off-by: Christian Hesse <mail@earthworm.de>
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      bef62ef3
    • Btrfs: Catch missed bios in the async bio submission thread · a6837051
      Committed by Chris Mason
      The async bio submission thread was missing some bios that were
      added after it had decided there was no work left to do.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      a6837051
    • [XFS] Warn on transaction in flight on read-only remount · 43f3f057
      Committed by Felix Blyakher
      Until the VFS can correctly support read-only remount without racing,
      use WARN_ON instead of BUG_ON when detecting a transaction in flight
      after quiescing the filesystem.
      Signed-off-by: Felix Blyakher <felixb@sgi.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      43f3f057
    • xfs: Check buffer lengths in log recovery · 6139a236
      Committed by Dave Chinner
      Before trying to obtain, read or write a buffer,
      check that the buffer length is actually valid. If
      it is not valid, then something read in the recovery
      process has been corrupted and we should abort
      recovery.
      Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
      Tested-by: Eric Sesterhenn <snakebyte@gmx.de>
      Reviewed-by: Christoph Hellwig <hch@infradead.org>
      Reviewed-by: Felix Blyakher <felixb@sgi.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Felix Blyakher <felixb@sgi.com>
      6139a236
  6. 03 Feb 2009 (6 commits)
    • ocfs2: add quota call to ocfs2_remove_btree_range() · fd4ef231
      Committed by Mark Fasheh
      We weren't reclaiming the clusters which get freed from this function,
      so any user punching holes in a file would still have those bytes accounted
      against him/her. Add the call to vfs_dq_free_space_nodirty() to fix this.
      Interestingly enough, the journal credits calculation already took this into
      account.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Acked-by: Jan Kara <jack@suse.cz>
      fd4ef231
    • ocfs2: Wakeup the downconvert thread after a successful cancel convert · a4b91965
      Committed by Sunil Mushran
      When two nodes holding PR locks on a resource concurrently attempt to
      upconvert the locks to EX, the master sends a BAST to one of the nodes. This
      message tells that node to first cancel-convert its upconvert request and then
      downconvert the lock to NL. Only once that lock is downconverted to NL can the
      master upconvert the first node's lock to EX.
      
      While the fs was doing the cancel convert, it was forgetting to wake up the
      downconvert thread after a successful cancel, leading to a deadlock.
      Reported-and-Tested-by: David Teigland <teigland@redhat.com>
      Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      a4b91965
    • ocfs2: Access the xattr bucket only before modifying it. · 554e7f9e
      Committed by Tao Ma
      In ocfs2_xattr_value_truncate, we may call b-tree code which will
      extend the journal transaction. This has a potential problem: it
      may cause the already-accessed-but-not-dirtied buffers to be lost. So we'd
      better access the bucket after we call ocfs2_xattr_value_truncate.
      As for the root buffer for the xattr value, the b-tree code will
      access and dirty it, so we don't need to worry about it.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      554e7f9e
    • configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item() · 0e033342
      Committed by Joel Becker
      When attaching default groups (subdirs) of a new group (in mkdir() or
      in configfs_register()), configfs recursively takes inode's mutexes
      along the path from the parent of the new group to the default
      subdirs. This is needed to ensure that the VFS will not race with
      operations on these sub-dirs. This is safe for the following reasons:
      
      - the VFS allows one to lock first an inode and second one of its
        children (The lock subclasses for this pattern are respectively
        I_MUTEX_PARENT and I_MUTEX_CHILD);
      - from this rule any inode path can be recursively locked in
        descending order as long as it stays under a single mountpoint and
        does not follow symlinks.
      
      Unfortunately lockdep does not know (yet?) how to handle such
      recursion.
      
      I've tried to use Peter Zijlstra's lock_set_subclass() helper to
      upgrade i_mutexes from I_MUTEX_CHILD to I_MUTEX_PARENT when we know
      that we might recursively lock some of their descendants, but this
      usage does not seem to fit the purpose of lock_set_subclass() because
      it leads to several i_mutex locked with subclass I_MUTEX_PARENT by
      the same task.
      
      From inside configfs it is not possible to serialize those recursive
      locking with a top-level one, because mkdir() and rmdir() are already
      called with inodes locked by the VFS. So using some
      mutex_lock_nest_lock() is not an option.
      
      I am proposing two solutions:
      1) one that wraps recursive mutex_lock()s with
         lockdep_off()/lockdep_on().
      2) (as suggested earlier by Peter Zijlstra) one that puts the
         i_mutexes recursively locked in different classes based on their
         depth from the top-level config_group created. This
         induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the
         nesting of configfs default groups whenever lockdep is activated
         but this limit looks reasonably high. Unfortunately, this also
         isolates VFS operations on configfs default groups from the others
         and thus lowers the chances to detect locking issues.
      
      This patch implements solution 1).
      
      Solution 2) looks better from lockdep's point of view, but fails with
      configfs_depend_item(). That would require reworking the locking
      scheme of configfs_depend_item() to remove the variable lock recursion
      depth, which I think is doable thanks to configfs_dirent_lock.
      For now, let's stick to solution 1).
      Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
      Acked-by: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      0e033342
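
      A minimal sketch of solution 1), wrapping the recursive i_mutex
      acquisition with lockdep_off()/lockdep_on(); the surrounding configfs
      group walk is omitted and the variable name is illustrative:

      /* Hide the recursive child-inode locking from lockdep, which cannot
       * model an arbitrarily deep PARENT -> CHILD -> ... chain taken by one
       * task.  The mutex itself is still taken and released normally. */
      lockdep_off();
      mutex_lock(&child_inode->i_mutex);
      lockdep_on();

      /* ... attach the default group below this inode ... */

      mutex_unlock(&child_inode->i_mutex);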
    • ocfs2: Fix possible deadlock in ocfs2_write_dquot() · f8afead7
      Committed by Jan Kara
      It could happen that some limit has been set via quotactl() and in parallel
      ->mark_dirty() is called from another thread doing e.g. dquot_alloc_space(). In
      such a case ocfs2_write_dquot() must not try to sync the dquot, because doing so
      needs the global quota lock, which ranks above transaction start.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      f8afead7
    • ocfs2: Push out dropping of dentry lock to ocfs2_wq · ea455f8a
      Committed by Jan Kara
      Dropping the last reference to a dentry lock is a complicated operation that
      also involves dropping a reference to the inode. The quota code in particular
      needs to obtain some quota locks on that path, which leads to a potential
      deadlock. Thus we defer dropping of the inode reference to ocfs2_wq.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      ea455f8a
  7. 30 Jan 2009 (4 commits)