1. 11 9月, 2013 1 次提交
    • L
      vfs: make sure we don't have a stale root path if unlazy_walk() fails · d0d27277
      Linus Torvalds 提交于
      When I moved the RCU walk termination into unlazy_walk(), I didn't copy
      quite all of it: for the successful RCU termination we properly add the
      necessary reference counts to our temporary copy of the root path, but
      for the failure case we need to make sure that any temporary root path
      information is cleared out (since it does _not_ have the proper
      reference counts from the RCU lookup).
      
      We could clean up this mess by just always dropping the temporary root
      information, but Al points out that that would mean that a single lookup
      through symlinks could see multiple different root entries if it races
      with another thread doing chroot.  Not that I think we should really
      care (we had that before too, back before we had a copy of the root path
      in the nameidata).
      
      Al says he has a cunning plan.  In the meantime, this is the minimal fix
      for the problem, even if it's not all that pretty.
      Reported-by: NMace Moneta <moneta.mace@gmail.com>
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0d27277
  2. 09 9月, 2013 2 次提交
    • L
      vfs: fix dentry RCU to refcounting possibly sleeping dput() · e5c832d5
      Linus Torvalds 提交于
      This is the fix that the last two commits indirectly led up to - making
      sure that we don't call dput() in a bad context on the dentries we've
      looked up in RCU mode after the sequence count validation fails.
      
      This basically expands d_rcu_to_refcount() into the callers, and then
      fixes the callers to delay the dput() in the failure case until _after_
      we've dropped all locks and are no longer in an RCU-locked region.
      
      The case of 'complete_walk()' was trivial, since its failure case did
      the unlock_rcu_walk() directly after the call to d_rcu_to_refcount(),
      and as such that is just a pure expansion of the function with a trivial
      movement of the resulting dput() to after 'unlock_rcu_walk()'.
      
      In contrast, the unlazy_walk() case was much more complicated, because
      not only does convert two different dentries from RCU to be reference
      counted, but it used to not call unlock_rcu_walk() at all, and instead
      just returned an error and let the caller clean everything up in
      "terminate_walk()".
      
      Happily, one of the dentries in question (called "parent" inside
      unlazy_walk()) is the dentry of "nd->path", which terminate_walk() wants
      a refcount to anyway for the non-RCU case.
      
      So what the new and improved unlazy_walk() does is to first turn that
      dentry into a refcounted one, and once that is set up, the error cases
      can continue to use the terminate_walk() helper for cleanup, but for the
      non-RCU case.  Which makes it possible to drop out of RCU mode if we
      actually hit the sequence number failure case.
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e5c832d5
    • L
      vfs: use lockred "dead" flag to mark unrecoverably dead dentries · 0d98439e
      Linus Torvalds 提交于
      This simplifies the RCU to refcounting code in particular.
      
      I was originally intending to leave this for later, but walking through
      all the dput() logic (see previous commit), I realized that the dput()
      "might_sleep()" check was misleadingly weak.  And I removed it as
      misleading, both for performance profiling and for debugging.
      
      However, the might_sleep() debugging case is actually true: the final
      dput() can indeed sleep, if the inode of the dentry that you are
      releasing ends up sleeping at iput time (see dentry_iput()).  So the
      problem with the might_sleep() in dput() wasn't that it wasn't true, it
      was that it wasn't actually testing and triggering on the interesting
      case.
      
      In particular, just about *any* dput() can indeed sleep, if you happen
      to race with another thread deleting the file in question, and you then
      lose the race to the be the last dput() for that file.  But because it's
      a very rare race, the debugging code would never trigger it in practice.
      
      Why is this problematic? The new d_rcu_to_refcount() (see commit
      15570086: "vfs: reimplement d_rcu_to_refcount() using
      lockref_get_or_lock()") does a dput() for the failure case, and it does
      it under the RCU lock.  So potentially sleeping really is a bug.
      
      But there's no way I'm going to fix this with the previous complicated
      "lockref_get_or_lock()" interface.  And rather than revert to the old
      and crufty nested dentry locking code (which did get this right by
      delaying the reference count updates until they were verified to be
      safe), let's make forward progress.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0d98439e
  3. 04 9月, 2013 1 次提交
    • J
      vfs: allow umount to handle mountpoints without revalidating them · 8033426e
      Jeff Layton 提交于
      Christopher reported a regression where he was unable to unmount a NFS
      filesystem where the root had gone stale. The problem is that
      d_revalidate handles the root of the filesystem differently from other
      dentries, but d_weak_revalidate does not. We could simply fix this by
      making d_weak_revalidate return success on IS_ROOT dentries, but there
      are cases where we do want to revalidate the root of the fs.
      
      A umount is really a special case. We generally aren't interested in
      anything but the dentry and vfsmount that's attached at that point. If
      the inode turns out to be stale we just don't care since the intent is
      to stop using it anyway.
      
      Try to handle this situation better by treating umount as a special
      case in the lookup code. Have it resolve the parent using normal
      means, and then do a lookup of the final dentry without revalidating
      it. In most cases, the final lookup will come out of the dcache, but
      the case where there's a trailing symlink or !LAST_NORM entry on the
      end complicates things a bit.
      
      Cc: Neil Brown <neilb@suse.de>
      Reported-by: NChristopher T Vogan <cvogan@us.ibm.com>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8033426e
  4. 03 9月, 2013 1 次提交
    • L
      vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock() · 15570086
      Linus Torvalds 提交于
      This moves __d_rcu_to_refcount() from <linux/dcache.h> into fs/namei.c
      and re-implements it using the lockref infrastructure instead.  It also
      adds a lot of comments about what is actually going on, because turning
      a dentry that was looked up using RCU into a long-lived reference
      counted entry is one of the more subtle parts of the rcu walk.
      
      We also used to be _particularly_ subtle in unlazy_walk() where we
      re-validate both the dentry and its parent using the same sequence
      count.  We used to do it by nesting the locks and then verifying the
      sequence count just once.
      
      That was silly, because nested locking is expensive, but the sequence
      count check is not.  So this just re-validates the dentry and the parent
      separately, avoiding the nested locking, and making the lockref lookup
      possible.
      Acked-by: NWaiman Long <waiman.long@hp.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15570086
  5. 29 8月, 2013 2 次提交
    • W
      vfs: make the dentry cache use the lockref infrastructure · 98474236
      Waiman Long 提交于
      This just replaces the dentry count/lock combination with the lockref
      structure that contains both a count and a spinlock, and does the
      mechanical conversion to use the lockref infrastructure.
      
      There are no semantic changes here, it's purely syntactic.  The
      reference lockref implementation uses the spinlock exactly the same way
      that the old dcache code did, and the bulk of this patch is just
      expanding the internal "d_count" use in the dcache code to use
      "d_lockref.count" instead.
      
      This is purely preparation for the real change to make the reference
      count updates be lockless during the 3.12 merge window.
      
      [ As with the previous commit, this is a rewritten version of a concept
        originally from Waiman, so credit goes to him, blame for any errors
        goes to me.
      
        Waiman's patch had some semantic differences for taking advantage of
        the lockless update in dget_parent(), while this patch is
        intentionally a pure search-and-replace change with no semantic
        changes.     - Linus ]
      Signed-off-by: NWaiman Long <Waiman.Long@hp.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98474236
    • L
      Revert "fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink" · f0cc6ffb
      Linus Torvalds 提交于
      This reverts commit bb2314b4.
      
      It wasn't necessarily wrong per se, but we're still busily discussing
      the exact details of this all, so I'm going to revert it for now.
      
      It's true that you can already do flink() through /proc and that flink()
      isn't new.  But as Brad Spengler points out, some secure environments do
      not mount proc, and flink adds a new interface that can avoid path
      lookup of the source for those kinds of environments.
      
      We may re-do this (and even mark it for stable backporting back in 3.11
      and possibly earlier) once the whole discussion about the interface is done.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0cc6ffb
  6. 05 8月, 2013 1 次提交
    • A
      fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink · bb2314b4
      Andy Lutomirski 提交于
      Every now and then someone proposes a new flink syscall, and this spawns
      a long discussion of whether it would be a security problem.  I think
      that this is missing the point: flink is *already* allowed without
      privilege as long as /proc is mounted -- it's called AT_SYMLINK_FOLLOW.
      
      Now that O_TMPFILE is here, the ability to create a file with O_TMPFILE,
      write it, and link it in is very convenient.  The only problem is that
      it requires that /proc be mounted so that you can do:
      
      linkat(AT_FDCWD, "/proc/self/fd/<tmpfd>", dfd, path, AT_SYMLINK_NOFOLLOW)
      
      This sucks -- it's much nicer to do:
      
      linkat(tmpfd, "", dfd, path, AT_EMPTY_PATH)
      
      Let's allow it.
      
      If this turns out to be excessively scary, it we could instead require
      that the inode in question be I_LINKABLE, but this seems pointless given
      the /proc situation
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      bb2314b4
  7. 13 7月, 2013 1 次提交
    • A
      Safer ABI for O_TMPFILE · bb458c64
      Al Viro 提交于
      [suggested by Rasmus Villemoes] make O_DIRECTORY | O_RDWR part of O_TMPFILE;
      that will fail on old kernels in a lot more cases than what I came up with.
      And make sure O_CREAT doesn't get there...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      bb458c64
  8. 29 6月, 2013 5 次提交
  9. 15 6月, 2013 1 次提交
  10. 08 5月, 2013 1 次提交
    • J
      audit: vfs: fix audit_inode call in O_CREAT case of do_last · 33e2208a
      Jeff Layton 提交于
      Jiri reported a regression in auditing of open(..., O_CREAT) syscalls.
      In older kernels, creating a file with open(..., O_CREAT) created
      audit_name records that looked like this:
      
      type=PATH msg=audit(1360255720.628:64): item=1 name="/abc/foo" inode=138810 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      type=PATH msg=audit(1360255720.628:64): item=0 name="/abc/" inode=138635 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      
      ...in recent kernels though, they look like this:
      
      type=PATH msg=audit(1360255402.886:12574): item=2 name=(null) inode=264599 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      type=PATH msg=audit(1360255402.886:12574): item=1 name=(null) inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      type=PATH msg=audit(1360255402.886:12574): item=0 name="/abc/foo" inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      
      Richard bisected to determine that the problems started with commit
      bfcec708, but the log messages have changed with some later
      audit-related patches.
      
      The problem is that this audit_inode call is passing in the parent of
      the dentry being opened, but audit_inode is being called with the parent
      flag false. This causes later audit_inode and audit_inode_child calls to
      match the wrong entry in the audit_names list.
      
      This patch simply sets the flag to properly indicate that this inode
      represents the parent. With this, the audit_names entries are back to
      looking like they did before.
      
      Cc: <stable@vger.kernel.org> # v3.7+
      Reported-by: NJiri Jaburek <jjaburek@redhat.com>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Test By: Richard Guy Briggs <rbriggs@redhat.com>
      Signed-off-by: NEric Paris <eparis@redhat.com>
      33e2208a
  11. 09 3月, 2013 1 次提交
  12. 02 3月, 2013 1 次提交
  13. 26 2月, 2013 1 次提交
    • J
      vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op · ecf3d1f1
      Jeff Layton 提交于
      The following set of operations on a NFS client and server will cause
      
          server# mkdir a
          client# cd a
          server# mv a a.bak
          client# sleep 30  # (or whatever the dir attrcache timeout is)
          client# stat .
          stat: cannot stat `.': Stale NFS file handle
      
      Obviously, we should not be getting an ESTALE error back there since the
      inode still exists on the server. The problem is that the lookup code
      will call d_revalidate on the dentry that "." refers to, because NFS has
      FS_REVAL_DOT set.
      
      nfs_lookup_revalidate will see that the parent directory has changed and
      will try to reverify the dentry by redoing a LOOKUP. That of course
      fails, so the lookup code returns ESTALE.
      
      The problem here is that d_revalidate is really a bad fit for this case.
      What we really want to know at this point is whether the inode is still
      good or not, but we don't really care what name it goes by or whether
      the dcache is still valid.
      
      Add a new d_op->d_weak_revalidate operation and have complete_walk call
      that instead of d_revalidate. The intent there is to allow for a
      "weaker" d_revalidate that just checks to see whether the inode is still
      good. This is also gives us an opportunity to kill off the FS_REVAL_DOT
      special casing.
      
      [AV: changed method name, added note in porting, fixed confusion re
      having it possibly called from RCU mode (it won't be)]
      
      Cc: NeilBrown <neilb@suse.de>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ecf3d1f1
  14. 23 2月, 2013 6 次提交
  15. 21 12月, 2012 12 次提交
  16. 30 11月, 2012 1 次提交
  17. 27 10月, 2012 1 次提交
    • L
      VFS: don't do protected {sym,hard}links by default · 561ec64a
      Linus Torvalds 提交于
      In commit 800179c9 ("This adds symlink and hardlink restrictions to
      the Linux VFS"), the new link protections were enabled by default, in
      the hope that no actual application would care, despite it being
      technically against legacy UNIX (and documented POSIX) behavior.
      
      However, it does turn out to break some applications.  It's rare, and
      it's unfortunate, but it's unacceptable to break existing systems, so
      we'll have to default to legacy behavior.
      
      In particular, it has broken the way AFD distributes files, see
      
        http://www.dwd.de/AFD/
      
      along with some legacy scripts.
      
      Distributions can end up setting this at initrd time or in system
      scripts: if you have security problems due to link attacks during your
      early boot sequence, you have bigger problems than some kernel sysctl
      setting. Do:
      
      	echo 1 > /proc/sys/fs/protected_symlinks
      	echo 1 > /proc/sys/fs/protected_hardlinks
      
      to re-enable the link protections.
      
      Alternatively, we may at some point introduce a kernel config option
      that sets these kinds of "more secure but not traditional" behavioural
      options automatically.
      Reported-by: NNick Bowler <nbowler@elliptictech.com>
      Reported-by: NHolger Kiehl <Holger.Kiehl@dwd.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # v3.6
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      561ec64a
  18. 13 10月, 2012 1 次提交