1. 01 4月, 2014 1 次提交
  2. 23 3月, 2014 1 次提交
    • A
      rcuwalk: recheck mount_lock after mountpoint crossing attempts · b37199e6
      Al Viro 提交于
      We can get false negative from __lookup_mnt() if an unrelated vfsmount
      gets moved.  In that case legitimize_mnt() is guaranteed to fail,
      and we will fall back to non-RCU walk... unless we end up running
      into a hard error on a filesystem object we wouldn't have reached
      if not for that false negative.  IOW, delaying that check until
      the end of pathname resolution is wrong - we should recheck right
      after we attempt to cross the mountpoint.  We don't need to recheck
      unless we see d_mountpoint() being true - in that case even if
      we have just raced with mount/umount, we can simply go on as if
      we'd come at the moment when the sucker wasn't a mountpoint; if we
      run into a hard error as the result, it was a legitimate outcome.
      __lookup_mnt() returning NULL is different in that respect, since
      it might've happened due to operation on completely unrelated
      mountpoint.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b37199e6
  3. 10 3月, 2014 1 次提交
    • L
      vfs: atomic f_pos accesses as per POSIX · 9c225f26
      Linus Torvalds 提交于
      Our write() system call has always been atomic in the sense that you get
      the expected thread-safe contiguous write, but we haven't actually
      guaranteed that concurrent writes are serialized wrt f_pos accesses, so
      threads (or processes) that share a file descriptor and use "write()"
      concurrently would quite likely overwrite each others data.
      
      This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:
      
       "2.9.7 Thread Interactions with Regular File Operations
      
        All of the following functions shall be atomic with respect to each
        other in the effects specified in POSIX.1-2008 when they operate on
        regular files or symbolic links: [...]"
      
      and one of the effects is the file position update.
      
      This unprotected file position behavior is not new behavior, and nobody
      has ever cared.  Until now.  Yongzhi Pan reported unexpected behavior to
      Michael Kerrisk that was due to this.
      
      This resolves the issue with a f_pos-specific lock that is taken by
      read/write/lseek on file descriptors that may be shared across threads
      or processes.
      Reported-by: NYongzhi Pan <panyongzhi@gmail.com>
      Reported-by: NMichael Kerrisk <mtk.manpages@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9c225f26
  4. 06 2月, 2014 1 次提交
    • L
      execve: use 'struct filename *' for executable name passing · c4ad8f98
      Linus Torvalds 提交于
      This changes 'do_execve()' to get the executable name as a 'struct
      filename', and to free it when it is done.  This is what the normal
      users want, and it simplifies and streamlines their error handling.
      
      The controlled lifetime of the executable name also fixes a
      use-after-free problem with the trace_sched_process_exec tracepoint: the
      lifetime of the passed-in string for kernel users was not at all
      obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize
      the pathname allocation lifetime with the execve() having finished,
      which in turn meant that the trace point that happened after
      mm_release() of the old process VM ended up using already free'd memory.
      
      To solve the kernel string lifetime issue, this simply introduces
      "getname_kernel()" that works like the normal user-space getname()
      function, except with the source coming from kernel memory.
      
      As Oleg points out, this also means that we could drop the tcomm[] array
      from 'struct linux_binprm', since the pathname lifetime now covers
      setup_new_exec().  That would be a separate cleanup.
      Reported-by: NIgor Zhbanov <i.zhbanov@samsung.com>
      Tested-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4ad8f98
  5. 01 2月, 2014 2 次提交
  6. 26 1月, 2014 1 次提交
  7. 13 12月, 2013 1 次提交
  8. 29 11月, 2013 1 次提交
  9. 09 11月, 2013 10 次提交
    • J
      locks: break delegations on link · 146a8595
      J. Bruce Fields 提交于
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Dustin Kirkland <dustin.kirkland@gazzang.com>
      Acked-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      146a8595
    • J
      locks: break delegations on rename · 8e6d782c
      J. Bruce Fields 提交于
      Cc: David Howells <dhowells@redhat.com>
      Acked-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8e6d782c
    • J
      locks: helper functions for delegation breaking · 5a14696c
      J. Bruce Fields 提交于
      We'll need the same logic for rename and link.
      Acked-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5a14696c
    • J
      locks: break delegations on unlink · b21996e3
      J. Bruce Fields 提交于
      We need to break delegations on any operation that changes the set of
      links pointing to an inode.  Start with unlink.
      
      Such operations also hold the i_mutex on a parent directory.  Breaking a
      delegation may require waiting for a timeout (by default 90 seconds) in
      the case of a unresponsive NFS client.  To avoid blocking all directory
      operations, we therefore drop locks before waiting for the delegation.
      The logic then looks like:
      
      	acquire locks
      	...
      	test for delegation; if found:
      		take reference on inode
      		release locks
      		wait for delegation break
      		drop reference on inode
      		retry
      
      It is possible this could never terminate.  (Even if we take precautions
      to prevent another delegation being acquired on the same inode, we could
      get a different inode on each retry.)  But this seems very unlikely.
      
      The initial test for a delegation happens after the lock on the target
      inode is acquired, but the directory inode may have been acquired
      further up the call stack.  We therefore add a "struct inode **"
      argument to any intervening functions, which we use to pass the inode
      back up to the caller in the case it needs a delegation synchronously
      broken.
      
      Cc: David Howells <dhowells@redhat.com>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Dustin Kirkland <dustin.kirkland@gazzang.com>
      Acked-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b21996e3
    • J
      namei: minor vfs_unlink cleanup · 9accbb97
      J. Bruce Fields 提交于
      We'll be using dentry->d_inode in one more place.
      Acked-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9accbb97
    • J
      vfs: take i_mutex on renamed file · 6cedba89
      J. Bruce Fields 提交于
      A read delegation is used by NFSv4 as a guarantee that a client can
      perform local read opens without informing the server.
      
      The open operation takes the last component of the pathname as an
      argument, thus is also a lookup operation, and giving the client the
      above guarantee means informing the client before we allow anything that
      would change the set of names pointing to the inode.
      
      Therefore, we need to break delegations on rename, link, and unlink.
      
      We also need to prevent new delegations from being acquired while one of
      these operations is in progress.
      
      We could add some completely new locking for that purpose, but it's
      simpler to use the i_mutex, since that's already taken by all the
      operations we care about.
      
      The single exception is rename.  So, modify rename to take the i_mutex
      on the file that is being renamed.
      
      Also fix up lockdep and Documentation/filesystems/directory-locking to
      reflect the change.
      Acked-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6cedba89
    • J
      dcache: fix outdated DCACHE_NEED_LOOKUP comment · 13a2c3be
      J. Bruce Fields 提交于
      The DCACHE_NEED_LOOKUP case referred to here was removed with
      39e3c955 "vfs: remove
      DCACHE_NEED_LOOKUP".
      
      There are only four real_lookup() callers and all of them pass in an
      unhashed dentry just returned from d_alloc.
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      13a2c3be
    • D
      VFS: Put a small type field into struct dentry::d_flags · b18825a7
      David Howells 提交于
      Put a type field into struct dentry::d_flags to indicate if the dentry is one
      of the following types that relate particularly to pathwalk:
      
      	Miss (negative dentry)
      	Directory
      	"Automount" directory (defective - no i_op->lookup())
      	Symlink
      	Other (regular, socket, fifo, device)
      
      The type field is set to one of the first five types on a dentry by calls to
      __d_instantiate() and d_obtain_alias() from information in the inode (if one is
      given).
      
      The type is cleared by dentry_unlink_inode() when it reconstitutes an existing
      dentry as a negative dentry.
      
      Accessors provided are:
      
      	d_set_type(dentry, type)
      	d_is_directory(dentry)
      	d_is_autodir(dentry)
      	d_is_symlink(dentry)
      	d_is_file(dentry)
      	d_is_negative(dentry)
      	d_is_positive(dentry)
      
      A bunch of checks in pathname resolution switched to those.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b18825a7
    • A
      get rid of {lock,unlock}_rcu_walk() · 8b61e74f
      Al Viro 提交于
      those have become aliases for rcu_read_{lock,unlock}()
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8b61e74f
    • A
      RCU'd vfsmounts · 48a066e7
      Al Viro 提交于
      * RCU-delayed freeing of vfsmounts
      * vfsmount_lock replaced with a seqlock (mount_lock)
      * sequence number from mount_lock is stored in nameidata->m_seq and
      used when we exit RCU mode
      * new vfsmount flag - MNT_SYNC_UMOUNT.  Set by umount_tree() when its
      caller knows that vfsmount will have no surviving references.
      * synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
      and doing pending mntput().
      * new helper: legitimize_mnt(mnt, seq).  Checks the mount_lock sequence
      number against seq, then grabs reference to mnt.  Then it rechecks mount_lock
      again to close the race and either returns success or drops the reference it
      has acquired.  The subtle point is that in case of MNT_SYNC_UMOUNT we can
      simply decrement the refcount and sod off - aforementioned synchronize_rcu()
      makes sure that final mntput() won't come until we leave RCU mode.  We need
      that, since we don't want to end up with some lazy pathwalk racing with
      umount() and stealing the final mntput() from it - caller of umount() may
      expect it to return only once the fs is shut down and we don't want to break
      that.  In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
      full-blown mntput() in case of mount_lock sequence number mismatch happening
      just as we'd grabbed the reference, but in those cases we won't be stealing
      the final mntput() from anything that would care.
      * mntput_no_expire() doesn't lock anything on the fast path now.  Incidentally,
      SMP and UP cases are handled the same way - no ifdefs there.
      * normal pathname resolution does *not* do any writes to mount_lock.  It does,
      of course, bump the refcounts of vfsmount and dentry in the very end, but that's
      it.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      48a066e7
  10. 06 11月, 2013 1 次提交
    • J
      audit: add child record before the create to handle case where create fails · 14e972b4
      Jeff Layton 提交于
      Historically, when a syscall that creates a dentry fails, you get an audit
      record that looks something like this (when trying to create a file named
      "new" in "/tmp/tmp.SxiLnCcv63"):
      
          type=PATH msg=audit(1366128956.279:965): item=0 name="/tmp/tmp.SxiLnCcv63/new" inode=2138308 dev=fd:02 mode=040700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmp_t:s15:c0.c1023
      
      This record makes no sense since it's associating the inode information for
      "/tmp/tmp.SxiLnCcv63" with the path "/tmp/tmp.SxiLnCcv63/new". The recent
      patch I posted to fix the audit_inode call in do_last fixes this, by making it
      look more like this:
      
          type=PATH msg=audit(1366128765.989:13875): item=0 name="/tmp/tmp.DJ1O8V3e4f/" inode=141 dev=fd:02 mode=040700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmp_t:s15:c0.c1023
      
      While this is more correct, if the creation of the file fails, then we
      have no record of the filename that the user tried to create.
      
      This patch adds a call to audit_inode_child to may_create. This creates
      an AUDIT_TYPE_CHILD_CREATE record that will sit in place until the
      create succeeds. When and if the create does succeed, then this record
      will be updated with the correct inode info from the create.
      
      This fixes what was broken in commit bfcec708.
      Commit 79f6530c should also be backported to stable v3.7+.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Signed-off-by: NRichard Guy Briggs <rgb@redhat.com>
      Signed-off-by: NEric Paris <eparis@redhat.com>
      14e972b4
  11. 25 10月, 2013 1 次提交
  12. 22 10月, 2013 1 次提交
  13. 18 9月, 2013 1 次提交
  14. 17 9月, 2013 1 次提交
    • M
      vfs: don't set FILE_CREATED before calling ->atomic_open() · 116cc022
      Miklos Szeredi 提交于
      If O_CREAT|O_EXCL are passed to open, then we know that either
      
       - the file is successfully created, or
       - the operation fails in some way.
      
      So previously we set FILE_CREATED before calling ->atomic_open() so the
      filesystem doesn't have to.  This, however, led to bugs in the
      implementation that went unnoticed when the filesystem didn't check for
      existence, yet returned success.  To prevent this kind of bug, require
      filesystems to always explicitly set FILE_CREATED on O_CREAT|O_EXCL and
      verify this in the VFS.
      
      Also added a couple more verifications for the result of atomic_open():
      
       - Warn if filesystem set FILE_CREATED despite the lack of O_CREAT.
       - Warn if filesystem set FILE_CREATED but gave a negative dentry.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      116cc022
  15. 11 9月, 2013 7 次提交
  16. 09 9月, 2013 5 次提交
    • L
      vfs: fix dentry RCU to refcounting possibly sleeping dput() · e5c832d5
      Linus Torvalds 提交于
      This is the fix that the last two commits indirectly led up to - making
      sure that we don't call dput() in a bad context on the dentries we've
      looked up in RCU mode after the sequence count validation fails.
      
      This basically expands d_rcu_to_refcount() into the callers, and then
      fixes the callers to delay the dput() in the failure case until _after_
      we've dropped all locks and are no longer in an RCU-locked region.
      
      The case of 'complete_walk()' was trivial, since its failure case did
      the unlock_rcu_walk() directly after the call to d_rcu_to_refcount(),
      and as such that is just a pure expansion of the function with a trivial
      movement of the resulting dput() to after 'unlock_rcu_walk()'.
      
      In contrast, the unlazy_walk() case was much more complicated, because
      not only does convert two different dentries from RCU to be reference
      counted, but it used to not call unlock_rcu_walk() at all, and instead
      just returned an error and let the caller clean everything up in
      "terminate_walk()".
      
      Happily, one of the dentries in question (called "parent" inside
      unlazy_walk()) is the dentry of "nd->path", which terminate_walk() wants
      a refcount to anyway for the non-RCU case.
      
      So what the new and improved unlazy_walk() does is to first turn that
      dentry into a refcounted one, and once that is set up, the error cases
      can continue to use the terminate_walk() helper for cleanup, but for the
      non-RCU case.  Which makes it possible to drop out of RCU mode if we
      actually hit the sequence number failure case.
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e5c832d5
    • A
      introduce kern_path_mountpoint() · 2d864651
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2d864651
    • A
      rename user_path_umountat() to user_path_mountpoint_at() · 197df04c
      Al Viro 提交于
      ... and move the extern from linux/namei.h to fs/internal.h,
      along with that of vfs_path_lookup().
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      197df04c
    • A
      take unlazy_walk() into umount_lookup_last() · 35759521
      Al Viro 提交于
      ... and massage it a bit to reduce nesting
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      35759521
    • L
      vfs: use lockred "dead" flag to mark unrecoverably dead dentries · 0d98439e
      Linus Torvalds 提交于
      This simplifies the RCU to refcounting code in particular.
      
      I was originally intending to leave this for later, but walking through
      all the dput() logic (see previous commit), I realized that the dput()
      "might_sleep()" check was misleadingly weak.  And I removed it as
      misleading, both for performance profiling and for debugging.
      
      However, the might_sleep() debugging case is actually true: the final
      dput() can indeed sleep, if the inode of the dentry that you are
      releasing ends up sleeping at iput time (see dentry_iput()).  So the
      problem with the might_sleep() in dput() wasn't that it wasn't true, it
      was that it wasn't actually testing and triggering on the interesting
      case.
      
      In particular, just about *any* dput() can indeed sleep, if you happen
      to race with another thread deleting the file in question, and you then
      lose the race to the be the last dput() for that file.  But because it's
      a very rare race, the debugging code would never trigger it in practice.
      
      Why is this problematic? The new d_rcu_to_refcount() (see commit
      15570086: "vfs: reimplement d_rcu_to_refcount() using
      lockref_get_or_lock()") does a dput() for the failure case, and it does
      it under the RCU lock.  So potentially sleeping really is a bug.
      
      But there's no way I'm going to fix this with the previous complicated
      "lockref_get_or_lock()" interface.  And rather than revert to the old
      and crufty nested dentry locking code (which did get this right by
      delaying the reference count updates until they were verified to be
      safe), let's make forward progress.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0d98439e
  17. 04 9月, 2013 1 次提交
    • J
      vfs: allow umount to handle mountpoints without revalidating them · 8033426e
      Jeff Layton 提交于
      Christopher reported a regression where he was unable to unmount a NFS
      filesystem where the root had gone stale. The problem is that
      d_revalidate handles the root of the filesystem differently from other
      dentries, but d_weak_revalidate does not. We could simply fix this by
      making d_weak_revalidate return success on IS_ROOT dentries, but there
      are cases where we do want to revalidate the root of the fs.
      
      A umount is really a special case. We generally aren't interested in
      anything but the dentry and vfsmount that's attached at that point. If
      the inode turns out to be stale we just don't care since the intent is
      to stop using it anyway.
      
      Try to handle this situation better by treating umount as a special
      case in the lookup code. Have it resolve the parent using normal
      means, and then do a lookup of the final dentry without revalidating
      it. In most cases, the final lookup will come out of the dcache, but
      the case where there's a trailing symlink or !LAST_NORM entry on the
      end complicates things a bit.
      
      Cc: Neil Brown <neilb@suse.de>
      Reported-by: NChristopher T Vogan <cvogan@us.ibm.com>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8033426e
  18. 03 9月, 2013 1 次提交
    • L
      vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock() · 15570086
      Linus Torvalds 提交于
      This moves __d_rcu_to_refcount() from <linux/dcache.h> into fs/namei.c
      and re-implements it using the lockref infrastructure instead.  It also
      adds a lot of comments about what is actually going on, because turning
      a dentry that was looked up using RCU into a long-lived reference
      counted entry is one of the more subtle parts of the rcu walk.
      
      We also used to be _particularly_ subtle in unlazy_walk() where we
      re-validate both the dentry and its parent using the same sequence
      count.  We used to do it by nesting the locks and then verifying the
      sequence count just once.
      
      That was silly, because nested locking is expensive, but the sequence
      count check is not.  So this just re-validates the dentry and the parent
      separately, avoiding the nested locking, and making the lockref lookup
      possible.
      Acked-by: NWaiman Long <waiman.long@hp.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15570086
  19. 29 8月, 2013 2 次提交
    • W
      vfs: make the dentry cache use the lockref infrastructure · 98474236
      Waiman Long 提交于
      This just replaces the dentry count/lock combination with the lockref
      structure that contains both a count and a spinlock, and does the
      mechanical conversion to use the lockref infrastructure.
      
      There are no semantic changes here, it's purely syntactic.  The
      reference lockref implementation uses the spinlock exactly the same way
      that the old dcache code did, and the bulk of this patch is just
      expanding the internal "d_count" use in the dcache code to use
      "d_lockref.count" instead.
      
      This is purely preparation for the real change to make the reference
      count updates be lockless during the 3.12 merge window.
      
      [ As with the previous commit, this is a rewritten version of a concept
        originally from Waiman, so credit goes to him, blame for any errors
        goes to me.
      
        Waiman's patch had some semantic differences for taking advantage of
        the lockless update in dget_parent(), while this patch is
        intentionally a pure search-and-replace change with no semantic
        changes.     - Linus ]
      Signed-off-by: NWaiman Long <Waiman.Long@hp.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98474236
    • L
      Revert "fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink" · f0cc6ffb
      Linus Torvalds 提交于
      This reverts commit bb2314b4.
      
      It wasn't necessarily wrong per se, but we're still busily discussing
      the exact details of this all, so I'm going to revert it for now.
      
      It's true that you can already do flink() through /proc and that flink()
      isn't new.  But as Brad Spengler points out, some secure environments do
      not mount proc, and flink adds a new interface that can avoid path
      lookup of the source for those kinds of environments.
      
      We may re-do this (and even mark it for stable backporting back in 3.11
      and possibly earlier) once the whole discussion about the interface is done.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0cc6ffb