1. 23 1月, 2015 6 次提交
  2. 14 12月, 2014 1 次提交
    • D
      syscalls: implement execveat() system call · 51f39a1f
      David Drysdale 提交于
      This patchset adds execveat(2) for x86, and is derived from Meredydd
      Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
      
      The primary aim of adding an execveat syscall is to allow an
      implementation of fexecve(3) that does not rely on the /proc filesystem,
      at least for executables (rather than scripts).  The current glibc version
      of fexecve(3) is implemented via /proc, which causes problems in sandboxed
      or otherwise restricted environments.
      
      Given the desire for a /proc-free fexecve() implementation, HPA suggested
      (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
      an appropriate generalization.
      
      Also, having a new syscall means that it can take a flags argument without
      back-compatibility concerns.  The current implementation just defines the
      AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
      added in future -- for example, flags for new namespaces (as suggested at
      https://lkml.org/lkml/2006/7/11/474).
      
      Related history:
       - https://lkml.org/lkml/2006/12/27/123 is an example of someone
         realizing that fexecve() is likely to fail in a chroot environment.
       - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
         documenting the /proc requirement of fexecve(3) in its manpage, to
         "prevent other people from wasting their time".
       - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
         problem where a process that did setuid() could not fexecve()
         because it no longer had access to /proc/self/fd; this has since
         been fixed.
      
      This patch (of 4):
      
      Add a new execveat(2) system call.  execveat() is to execve() as openat()
      is to open(): it takes a file descriptor that refers to a directory, and
      resolves the filename relative to that.
      
      In addition, if the filename is empty and AT_EMPTY_PATH is specified,
      execveat() executes the file to which the file descriptor refers.  This
      replicates the functionality of fexecve(), which is a system call in other
      UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
      so relies on /proc being mounted).
      
      The filename fed to the executed program as argv[0] (or the name of the
      script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
      (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
      reflecting how the executable was found.  This does however mean that
      execution of a script in a /proc-less environment won't work; also, script
      execution via an O_CLOEXEC file descriptor fails (as the file will not be
      accessible after exec).
      
      Based on patches by Meredydd Luff.
      Signed-off-by: NDavid Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51f39a1f
  3. 12 12月, 2014 4 次提交
  4. 11 12月, 2014 1 次提交
  5. 31 10月, 2014 1 次提交
    • E
      fs: allow open(dir, O_TMPFILE|..., 0) with mode 0 · 69a91c23
      Eric Rannaud 提交于
      The man page for open(2) indicates that when O_CREAT is specified, the
      'mode' argument applies only to future accesses to the file:
      
      	Note that this mode applies only to future accesses of the newly
      	created file; the open() call that creates a read-only file
      	may well return a read/write file descriptor.
      
      The man page for open(2) implies that 'mode' is treated identically by
      O_CREAT and O_TMPFILE.
      
      O_TMPFILE, however, behaves differently:
      
      	int fd = open("/tmp", O_TMPFILE | O_RDWR, 0);
      	assert(fd == -1);
      	assert(errno == EACCES);
      
      	int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
      	assert(fd > 0);
      
      For O_CREAT, do_last() sets acc_mode to MAY_OPEN only:
      
      	if (*opened & FILE_CREATED) {
      		/* Don't check for write permission, don't truncate */
      		open_flag &= ~O_TRUNC;
      		will_truncate = false;
      		acc_mode = MAY_OPEN;
      		path_to_nameidata(path, nd);
      		goto finish_open_created;
      	}
      
      But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to
      may_open().
      
      This patch lines up the behavior of O_TMPFILE with O_CREAT. After the
      inode is created, may_open() is called with acc_mode = MAY_OPEN, in
      do_tmpfile().
      
      A different, but related glibc bug revealed the discrepancy:
      https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      
      The glibc lazily loads the 'mode' argument of open() and openat() using
      va_arg() only if O_CREAT is present in 'flags' (to support both the 2
      argument and the 3 argument forms of open; same idea for openat()).
      However, the glibc ignores the 'mode' argument if O_TMPFILE is in
      'flags'.
      
      On x86_64, for open(), it magically works anyway, as 'mode' is in
      RDX when entering open(), and is still in RDX on SYSCALL, which is where
      the kernel looks for the 3rd argument of a syscall.
      
      But openat() is not quite so lucky: 'mode' is in RCX when entering the
      glibc wrapper for openat(), while the kernel looks for the 4th argument
      of a syscall in R10. Indeed, the syscall calling convention differs from
      the regular calling convention in this respect on x86_64. So the kernel
      sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and
      fails with EACCES.
      Signed-off-by: NEric Rannaud <e@nanocritical.com>
      Acked-by: NAndy Lutomirski <luto@amacapital.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69a91c23
  6. 29 10月, 2014 1 次提交
  7. 24 10月, 2014 5 次提交
  8. 13 10月, 2014 1 次提交
  9. 09 10月, 2014 3 次提交
    • E
      vfs: Make d_invalidate return void · 5542aa2f
      Eric W. Biederman 提交于
      Now that d_invalidate can no longer fail, stop returning a useless
      return code.  For the few callers that checked the return code update
      remove the handling of d_invalidate failure.
      Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5542aa2f
    • E
      vfs: Lazily remove mounts on unlinked files and directories. · 8ed936b5
      Eric W. Biederman 提交于
      With the introduction of mount namespaces and bind mounts it became
      possible to access files and directories that on some paths are mount
      points but are not mount points on other paths.  It is very confusing
      when rm -rf somedir returns -EBUSY simply because somedir is mounted
      somewhere else.  With the addition of user namespaces allowing
      unprivileged mounts this condition has gone from annoying to allowing
      a DOS attack on other users in the system.
      
      The possibility for mischief is removed by updating the vfs to support
      rename, unlink and rmdir on a dentry that is a mountpoint and by
      lazily unmounting mountpoints on deleted dentries.
      
      In particular this change allows rename, unlink and rmdir system calls
      on a dentry without a mountpoint in the current mount namespace to
      succeed, and it allows rename, unlink, and rmdir performed on a
      distributed filesystem to update the vfs cache even if when there is a
      mount in some namespace on the original dentry.
      
      There are two common patterns of maintaining mounts: Mounts on trusted
      paths with the parent directory of the mount point and all ancestory
      directories up to / owned by root and modifiable only by root
      (i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
      cpuacct, ...}, /usr, /usr/local).  Mounts on unprivileged directories
      maintained by fusermount.
      
      In the case of mounts in trusted directories owned by root and
      modifiable only by root the current parent directory permissions are
      sufficient to ensure a mount point on a trusted path is not removed
      or renamed by anyone other than root, even if there is a context
      where the there are no mount points to prevent this.
      
      In the case of mounts in directories owned by less privileged users
      races with users modifying the path of a mount point are already a
      danger.  fusermount already uses a combination of chdir,
      /proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races.  The
      removable of global rename, unlink, and rmdir protection really adds
      nothing new to consider only a widening of the attack window, and
      fusermount is already safe against unprivileged users modifying the
      directory simultaneously.
      
      In principle for perfect userspace programs returning -EBUSY for
      unlink, rmdir, and rename of dentires that have mounts in the local
      namespace is actually unnecessary.  Unfortunately not all userspace
      programs are perfect so retaining -EBUSY for unlink, rmdir and rename
      of dentries that have mounts in the current mount namespace plays an
      important role of maintaining consistency with historical behavior and
      making imperfect userspace applications hard to exploit.
      
      v2: Remove spurious old_dentry.
      v3: Optimized shrink_submounts_and_drop
          Removed unsued afs label
      v4: Simplified the changes to check_submounts_and_drop
          Do not rename check_submounts_and_drop shrink_submounts_and_drop
          Document what why we need atomicity in check_submounts_and_drop
          Rely on the parent inode mutex to make d_revalidate and d_invalidate
          an atomic unit.
      v5: Refcount the mountpoint to detach in case of simultaneous
          renames.
      Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8ed936b5
    • E
      vfs: Don't allow overwriting mounts in the current mount namespace · 7af1364f
      Eric W. Biederman 提交于
      In preparation for allowing mountpoints to be renamed and unlinked
      in remote filesystems and in other mount namespaces test if on a dentry
      there is a mount in the local mount namespace before allowing it to
      be renamed or unlinked.
      
      The primary motivation here are old versions of fusermount unmount
      which is not safe if the a path can be renamed or unlinked while it is
      verifying the mount is safe to unmount.  More recent versions are simpler
      and safer by simply using UMOUNT_NOFOLLOW when unmounting a mount
      in a directory owned by an arbitrary user.
      
      Miklos Szeredi <miklos@szeredi.hu> reports this is approach is good
      enough to remove concerns about new kernels mixed with old versions
      of fusermount.
      
      A secondary motivation for restrictions here is that it removing empty
      directories that have non-empty mount points on them appears to
      violate the rule that rmdir can not remove empty directories.  As
      Linus Torvalds pointed out this is useful for programs (like git) that
      test if a directory is empty with rmdir.
      
      Therefore this patch arranges to enforce the existing mount point
      semantics for local mount namespace.
      
      v2: Rewrote the test to be a drop in replacement for d_mountpoint
      v3: Use bool instead of int as the return type of is_local_mountpoint
      Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7af1364f
  10. 16 9月, 2014 2 次提交
    • J
      vfs: workaround gcc <4.6 build error in link_path_walk() · a060dc50
      James Hogan 提交于
      Commit d6bb3e90 ("vfs: simplify and shrink stack frame of
      link_path_walk()") introduced build problems with GCC versions older
      than 4.6 due to the initialisation of a member of an anonymous union in
      struct qstr without enclosing braces.
      
      This hits GCC bug 10676 [1] (which was fixed in GCC 4.6 by [2]), and
      causes the following build error:
      
        fs/namei.c: In function 'link_path_walk':
        fs/namei.c:1778: error: unknown field 'hash_len' specified in initializer
      
      This is worked around by adding explicit braces.
      
      [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676
      [2] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=159206
      
      Fixes: d6bb3e90 (vfs: simplify and shrink stack frame of link_path_walk())
      Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-metag@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a060dc50
    • L
      vfs: simplify and shrink stack frame of link_path_walk() · d6bb3e90
      Linus Torvalds 提交于
      Commit 9226b5b4 ("vfs: avoid non-forwarding large load after small
      store in path lookup") made link_path_walk() always access the
      "hash_len" field as a single 64-bit entity, in order to avoid mixed size
      accesses to the members.
      
      However, what I didn't notice was that that effectively means that the
      whole "struct qstr this" is now basically redundant.  We already
      explicitly track the "const char *name", and if we just use "u64
      hash_len" instead of "long len", there is nothing else left of the
      "struct qstr".
      
      We do end up wanting the "struct qstr" if we have a filesystem with a
      "d_hash()" function, but that's a rare case, and we might as well then
      just squirrell away the name and hash_len at that point.
      
      End result: fewer live variables in the loop, a smaller stack frame, and
      better code generation.  And we don't need to pass in pointers variables
      to helper functions any more, because the return value contains all the
      relevant information.  So this removes more lines than it adds, and the
      source code is clearer too.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6bb3e90
  11. 15 9月, 2014 3 次提交
    • L
      vfs: avoid non-forwarding large load after small store in path lookup · 9226b5b4
      Linus Torvalds 提交于
      The performance regression that Josef Bacik reported in the pathname
      lookup (see commit 99d263d4 "vfs: fix bad hashing of dentries") made
      me look at performance stability of the dcache code, just to verify that
      the problem was actually fixed.  That turned up a few other problems in
      this area.
      
      There are a few cases where we exit RCU lookup mode and go to the slow
      serializing case when we shouldn't, Al has fixed those and they'll come
      in with the next VFS pull.
      
      But my performance verification also shows that link_path_walk() turns
      out to have a very unfortunate 32-bit store of the length and hash of
      the name we look up, followed by a 64-bit read of the combined hash_len
      field.  That screws up the processor store to load forwarding, causing
      an unnecessary hickup in this critical routine.
      
      It's caused by the ugly calling convention for the "hash_name()"
      function, and easily fixed by just making hash_name() fill in the whole
      'struct qstr' rather than passing it a pointer to just the hash value.
      
      With that, the profile for this function looks much smoother.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9226b5b4
    • A
      be careful with nd->inode in path_init() and follow_dotdot_rcu() · 4023bfc9
      Al Viro 提交于
      in the former we simply check if dentry is still valid after picking
      its ->d_inode; in the latter we fetch ->d_inode in the same places
      where we fetch dentry and its ->d_seq, under the same checks.
      
      Cc: stable@vger.kernel.org # 2.6.38+
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4023bfc9
    • A
      don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu() · 7bd88377
      Al Viro 提交于
      return the value instead, and have path_init() do the assignment.  Broken by
      "vfs: Fix absolute RCU path walk failures due to uninitialized seq number",
      which was Cc-stable with 2.6.38+ as destination.  This one should go where
      it went.
      
      To avoid dummy value returned in case when root is already set (it would do
      no harm, actually, since the only caller that doesn't ignore the return value
      is guaranteed to have nd->root *not* set, but it's more obvious that way),
      lift the check into callers.  And do the same to set_root(), to keep them
      in sync.
      
      Cc: stable@vger.kernel.org # 2.6.38+
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7bd88377
  12. 14 9月, 2014 2 次提交
    • A
      fix bogus read_seqretry() checks introduced in b37199e6 · f5be3e29
      Al Viro 提交于
      read_seqretry() returns true on mismatch, not on match...
      
      Cc: stable@vger.kernel.org # 3.15+
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f5be3e29
    • L
      vfs: fix bad hashing of dentries · 99d263d4
      Linus Torvalds 提交于
      Josef Bacik found a performance regression between 3.2 and 3.10 and
      narrowed it down to commit bfcfaa77 ("vfs: use 'unsigned long'
      accesses for dcache name comparison and hashing"). He reports:
      
       "The test case is essentially
      
            for (i = 0; i < 1000000; i++)
                    mkdir("a$i");
      
        On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
        dir/sec with 3.10.  This is because we spend waaaaay more time in
        __d_lookup on 3.10 than in 3.2.
      
        The new hashing function for strings is suboptimal for <
        sizeof(unsigned long) string names (and hell even > sizeof(unsigned
        long) string names that I've tested).  I broke out the old hashing
        function and the new one into a userspace helper to get real numbers
        and this is what I'm getting:
      
            Old hash table had 1000000 entries, 0 dupes, 0 max dupes
            New hash table had 12628 entries, 987372 dupes, 900 max dupes
            We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash
      
        My test does the hash, and then does the d_hash into a integer pointer
        array the same size as the dentry hash table on my system, and then
        just increments the value at the address we got to see how many
        entries we overlap with.
      
        As you can see the old hash function ended up with all 1 million
        entries in their own bucket, whereas the new one they are only
        distributed among ~12.5k buckets, which is why we're using so much
        more CPU in __d_lookup".
      
      The reason for this hash regression is two-fold:
      
       - On 64-bit architectures the down-mixing of the original 64-bit
         word-at-a-time hash into the final 32-bit hash value is very
         simplistic and suboptimal, and just adds the two 32-bit parts
         together.
      
         In particular, because there is no bit shuffling and the mixing
         boundary is also a byte boundary, similar character patterns in the
         low and high word easily end up just canceling each other out.
      
       - the old byte-at-a-time hash mixed each byte into the final hash as it
         hashed the path component name, resulting in the low bits of the hash
         generally being a good source of hash data.  That is not true for the
         word-at-a-time case, and the hash data is distributed among all the
         bits.
      
      The fix is the same in both cases: do a better job of mixing the bits up
      and using as much of the hash data as possible.  We already have the
      "hash_32|64()" functions to do that.
      Reported-by: NJosef Bacik <jbacik@fb.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99d263d4
  13. 09 9月, 2014 1 次提交
  14. 08 8月, 2014 3 次提交
  15. 24 7月, 2014 1 次提交
  16. 11 6月, 2014 1 次提交
    • A
      fs,userns: Change inode_capable to capable_wrt_inode_uidgid · 23adbe12
      Andy Lutomirski 提交于
      The kernel has no concept of capabilities with respect to inodes; inodes
      exist independently of namespaces.  For example, inode_capable(inode,
      CAP_LINUX_IMMUTABLE) would be nonsense.
      
      This patch changes inode_capable to check for uid and gid mappings and
      renames it to capable_wrt_inode_uidgid, which should make it more
      obvious what it does.
      
      Fixes CVE-2014-4014.
      
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23adbe12
  17. 20 4月, 2014 1 次提交
    • A
      fix races between __d_instantiate() and checks of dentry flags · 22213318
      Al Viro 提交于
      in non-lazy walk we need to be careful about dentry switching from
      negative to positive - both ->d_flags and ->d_inode are updated,
      and in some places we might see only one store.  The cases where
      dentry has been obtained by dcache lookup with ->i_mutex held on
      parent are safe - ->d_lock and ->i_mutex provide all the barriers
      we need.  However, there are several places where we run into
      trouble:
      	* do_last() fetches ->d_inode, then checks ->d_flags and
      assumes that inode won't be NULL unless d_is_negative() is true.
      Race with e.g. creat() - we might have fetched the old value of
      ->d_inode (still NULL) and new value of ->d_flags (already not
      DCACHE_MISS_TYPE).  Lin Ming has observed and reported the resulting
      oops.
      	* a bunch of places checks ->d_inode for being non-NULL,
      then checks ->d_flags for "is it a symlink".  Race with symlink(2)
      in case if our CPU sees ->d_inode update first - we see non-NULL
      there, but ->d_flags still contains DCACHE_MISS_TYPE instead of
      DCACHE_SYMLINK_TYPE.  Result: false negative on "should we follow
      link here?", with subsequent unpleasantness.
      
      Cc: stable@vger.kernel.org # 3.13 and 3.14 need that one
      Reported-and-tested-by: NLin Ming <minggr@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      22213318
  18. 02 4月, 2014 3 次提交