1. 26 Oct, 2012 1 commit
  2. 13 Oct, 2012 2 commits
    • vfs: make path_openat take a struct filename pointer · 669abf4e
      Committed by Jeff Layton
      ...and fix up the callers. For do_file_open_root, just declare a
      struct filename on the stack and fill out the .name field. For
      do_filp_open, make it also take a struct filename pointer, and fix up its
      callers to call it appropriately.
      
      For filp_open, add a variant that takes a struct filename pointer and turn
      filp_open into a wrapper around it.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: define struct filename and have getname() return it · 91a27b2a
      Committed by Jeff Layton
      getname() is intended to copy pathname strings from userspace into a
      kernel buffer. The result is just a string in kernel space. It would
      however be quite helpful to be able to attach some ancillary info to
      the string.
      
      For instance, we could attach some audit-related info to reduce the
      amount of audit-related processing needed. When auditing is enabled,
      we could also call getname() on the string more than once and not
      need to recopy it from userspace.
      
      This patchset converts the getname()/putname() interfaces to return
      a struct instead of a string. For now, the struct just tracks the
      string in kernel space and the original userland pointer for it.
      
      Later, we'll add other information to the struct as it becomes
      convenient.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
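The idea in the getname() commit above, returning a small struct that carries both the kernel-side copy of the path and the original userland pointer, can be illustrated with a userspace analog. This is only a sketch under stated assumptions: `struct filename_sketch`, `getname_sketch()` and `putname_sketch()` are hypothetical names, and `malloc()`/`memcpy()` stand in for the kernel's allocation and copy-from-userspace steps. It is not the kernel implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace analog of a "string plus ancillary info" struct. */
struct filename_sketch {
    const char *name;  /* private copy of the path string */
    const char *uptr;  /* the caller's original pointer, kept for reference */
};

/* Copy the path once and remember where it came from.
 * Returns NULL if memory cannot be allocated. */
struct filename_sketch *getname_sketch(const char *user_path)
{
    struct filename_sketch *fn;
    char *copy;
    size_t len;

    assert(user_path != NULL);  /* precondition of this sketch */

    len = strlen(user_path);
    fn = malloc(sizeof(*fn));
    if (fn == NULL)
        return NULL;

    copy = malloc(len + 1);
    if (copy == NULL) {
        free(fn);
        return NULL;
    }
    memcpy(copy, user_path, len + 1);  /* includes the trailing NUL */

    fn->name = copy;
    fn->uptr = user_path;
    return fn;
}

/* Release both the copied string and the wrapper struct. */
void putname_sketch(struct filename_sketch *fn)
{
    if (fn == NULL)
        return;
    free((void *)fn->name);
    free(fn);
}
```

Because the struct keeps the original pointer next to the copy, a later caller that sees the same pointer could in principle reuse the existing copy instead of copying again, which is the kind of saving the commit message describes for the audit case.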
  3. 09 Oct, 2012 2 commits
    • mm: avoid taking rmap locks in move_ptes() · 38a76013
      Committed by Michel Lespinasse
      During mremap(), the destination VMA is generally placed after the
      original vma in rmap traversal order: in move_vma(), we always have
      new_pgoff >= vma->vm_pgoff, and as a result new_vma->vm_pgoff >=
      vma->vm_pgoff unless vma_merge() merged the new vma with an adjacent one.
      
      When the destination VMA is placed after the original in rmap traversal
      order, we can avoid taking the rmap locks in move_ptes().
      
      Essentially, this reintroduces the optimization that had been disabled in
      "mm anon rmap: remove anon_vma_moveto_tail".  The difference is that we
      don't try to impose the rmap traversal order; instead we just rely on
      things being in the desired order in the common case and fall back to
      taking locks in the uncommon case.  Also we skip the i_mmap_mutex in
      addition to the anon_vma lock: in both cases, the vmas are traversed in
      increasing vm_pgoff order with ties resolved in tree insertion order.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exec: make de_thread() killable · d5bbd43d
      Committed by Oleg Nesterov
      Change de_thread() to use KILLABLE rather than UNINTERRUPTIBLE while
      waiting for other threads.  The only complication is that we should
      clear ->group_exit_task and ->notify_count before we return, and we
      should do this under tasklist_lock.  -EAGAIN is used to match the
      initial signal_group_exit() check/return; the exact value doesn't really
      matter.
      
      This fixes the (unlikely) race with coredump.  de_thread() checks
      signal_group_exit() before it starts to kill the subthreads, but this
      can't help if another CLONE_VM (but non CLONE_THREAD) task starts the
      coredumping after de_thread() unlocks ->siglock.  In this case the
      killed sub-thread can block in exit_mm() waiting for coredump_finish(),
      the execing thread waits for that sub-thread, and the coredumping thread
      waits for the execing thread.  Deadlock.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 06 Oct, 2012 2 commits
  5. 03 Oct, 2012 1 commit
  6. 01 Oct, 2012 2 commits
    • generic sys_execve() · 38b983b3
      Committed by Al Viro
      Selected by __ARCH_WANT_SYS_EXECVE in unistd.h.  Requires
      	* working current_pt_regs()
      	* *NOT* doing a syscall-in-kernel kind of kernel_execve()
      implementation.  Using generic kernel_execve() is fine.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • generic kernel_execve() · 282124d1
      Committed by Al Viro
      Based mostly on the arm and alpha versions.  Architectures can define
      __ARCH_WANT_KERNEL_EXECVE and use it, provided that
      	* they have working current_pt_regs(), even for kernel threads.
      	* kernel_thread-spawned threads do have space for pt_regs
      in the normal location.  Normally that's as simple as switching to
      generic kernel_thread() and making sure that kernel threads do *not*
      go through return from syscall path; call the payload from equivalent
      of ret_from_fork if we are in a kernel thread (or just have separate
      ret_from_kernel_thread and make copy_thread() use it instead of
      ret_from_fork in kernel thread case).
      	* they have ret_from_kernel_execve(); it is called after
      successful do_execve() done by kernel_execve() and gets normal
      pt_regs location passed to it as argument.  It's essentially
      a longjmp() analog - it should set sp, etc. to the situation
      expected at the return for syscall and go there.  Eventually
      the need for that sucker will disappear, but that'll take some
      surgery on kernel_thread() payloads.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  7. 27 Sep, 2012 3 commits
  8. 20 Sep, 2012 1 commit
  9. 31 Jul, 2012 3 commits
    • coredump: fix wrong comments on core limits of pipe coredump case · 108ceeb0
      Committed by Jovi Zhang
      In commit 898b374a ("exec: replace call_usermodehelper_pipe with use
      of umh init function and resolve limit"), the core limits recursive
      check value was changed from 0 to 1, but the corresponding comments were
      not updated.
      Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • coredump: warn about unsafe suid_dumpable / core_pattern combo · 54b50199
      Committed by Kees Cook
      When suid_dumpable=2, detect unsafe core_pattern settings and warn when
      they are seen.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: make dumpable=2 require fully qualified path · 9520628e
      Committed by Kees Cook
      When the suid_dumpable sysctl is set to "2", and there is no core dump
      pipe defined in the core_pattern sysctl, a local user can cause core files
      to be written to root-writable directories, potentially with
      user-controlled content.
      
      This means an admin can unknowingly reintroduce a variation of
      CVE-2006-2451, allowing local users to gain root privileges.
      
        $ cat /proc/sys/fs/suid_dumpable
        2
        $ cat /proc/sys/kernel/core_pattern
        core
        $ ulimit -c unlimited
        $ cd /
        $ ls -l core
        ls: cannot access core: No such file or directory
        $ touch core
        touch: cannot touch `core': Permission denied
        $ OHAI="evil-string-here" ping localhost >/dev/null 2>&1 &
        $ pid=$!
        $ sleep 1
        $ kill -SEGV $pid
        $ ls -l core
        -rw------- 1 root kees 458752 Jun 21 11:35 core
        $ sudo strings core | grep evil
        OHAI=evil-string-here
      
      While cron has been fixed to abort reading a file when there is any
      parse error, there are still other sensitive directories that will read
      any file present and skip unparsable lines.
      
      Instead of introducing a suid_dumpable=3 mode and breaking all users of
      mode 2, this only disables the unsafe portion of mode 2 (writing to disk
      via relative path).  Most users of mode 2 (e.g.  Chrome OS) already use
      a core dump pipe handler, so this change will not break them.  For the
      situations where a pipe handler is not defined but mode 2 is still
      active, crash dumps will only be written to fully qualified paths.  If a
      relative path is defined (e.g.  the default "core" pattern), dump
      attempts will trigger a printk yelling about the lack of a fully
      qualified path.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 30 Jul, 2012 1 commit
  11. 27 Jul, 2012 1 commit
    • posix_types.h: Cleanup stale __NFDBITS and related definitions · 8ded2bbc
      Committed by Josh Boyer
      Recently, glibc made a change to suppress sign-conversion warnings in
      FD_SET (glibc commit ceb9e56b3d1).  This uncovered an issue with the
      kernel's definition of __NFDBITS if applications #include
      <linux/types.h> after including <sys/select.h>.  A build failure would
      be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2
      flags to gcc.
      
      It was suggested that the kernel should either match the glibc
      definition of __NFDBITS or remove that entirely.  The current in-kernel
      uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no
      uses of the related __FDELT and __FDMASK defines.  Given that, we'll
      continue the cleanup that was started with commit 8b3d1cda
      ("posix_types: Remove fd_set macros") and drop the remaining unused
      macros.
      
      Additionally, linux/time.h has similar macros defined that expand to
      nothing so we'll remove those at the same time.
      Reported-by: Jeff Law <law@redhat.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      CC: <stable@vger.kernel.org>
      Signed-off-by: Josh Boyer <jwboyer@redhat.com>
      [ .. and fix up whitespace as per akpm ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 21 Jun, 2012 1 commit
  13. 08 Jun, 2012 2 commits
    • Revert "mm: correctly synchronize rss-counters at exit/exec" · 48d212a2
      Committed by Linus Torvalds
      This reverts commit 40af1bbd.
      
      It's horribly and utterly broken for at least the following reasons:
      
       - calling sync_mm_rss() from mmput() is fundamentally wrong, because
         there's absolutely no reason to believe that the task that does the
         mmput() always does it on its own VM.  Example: fork, ptrace, /proc -
         you name it.
      
       - calling it *after* having done mmdrop() on it is doubly insane, since
         the mm struct may well be gone now.
      
       - testing mm against NULL before you call it is insane too, since a
         NULL mm there would have caused oopses long before.
      
      .. and those are just the three bugs I found before I decided to give up
      looking for more and revert it asap.  I should have caught it before I
      even took it, but I trusted Andrew too much.
      
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: correctly synchronize rss-counters at exit/exec · 40af1bbd
      Committed by Konstantin Khlebnikov
      mm->rss_stat counters have per-task delta: task->rss_stat.  Before
      changing task->mm pointer the kernel must flush this delta with
      sync_mm_rss().
      
      do_exit() already calls sync_mm_rss() to flush the rss-counters before
      committing the rss statistics into task->signal->maxrss, taskstats,
      audit and other stuff.  Unfortunately the kernel does this before
      calling mm_release(), which can call put_user() for processing
      task->clear_child_tid.  So at this point we can trigger page-faults and
      task->rss_stat becomes non-zero again.  As a result mm->rss_stat becomes
      inconsistent and check_mm() will print something like this:
      
      | BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
      | BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1
      
      This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
      out of do_exit() and calls it earlier.  After mm_release() there should
      be no pagefaults.
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org>		[3.4.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 01 Jun, 2012 1 commit
  15. 17 May, 2012 1 commit
    • coredump: ensure the fpu state is flushed for proper multi-threaded core dump · 11aeca0b
      Committed by Suresh Siddha
      Nalluru reported hitting the BUG_ON(__thread_has_fpu(tsk)) in
      arch/x86/kernel/xsave.c:__sanitize_i387_state() during the coredump
      of a multi-threaded application.
      
      A look at the exit sequence shows that other threads can still be on the
      runqueue at the exit_mm() code snippet shown below:
      
      		if (atomic_dec_and_test(&core_state->nr_threads))
      			complete(&core_state->startup);
      
      ===> other threads can still be active here, but we notify the thread
      ===> dumping core to wake up from coredump_wait() after the last thread
      ===> joins this point. The core-dumping thread will continue dumping
      ===> all the threads' state to the core file.
      
      		for (;;) {
      			set_task_state(tsk, TASK_UNINTERRUPTIBLE);
      			if (!self.task) /* see coredump_finish() */
      				break;
      			schedule();
      		}
      
      As some of those threads are on the runqueue and haven't called schedule()
      yet, their FPU state is still active in the live registers, and the thread
      proceeding with the coredump will hit the above-mentioned BUG_ON while
      trying to dump the other threads' FPU state to the coredump file.
      
      The BUG_ON() in arch/x86/kernel/xsave.c:__sanitize_i387_state() is
      in the code paths for processors supporting xsaveopt. With or without
      xsaveopt, multi-threaded coredump is broken and may not contain
      the correct FPU state at the time of exit.
      
      In coredump_wait(), wait for all the threads to become inactive, so
      that we are sure all the extended register state is flushed to
      memory and can be reliably copied to the core file.
      Reported-by: Suresh Nalluru <suresh@aristanetworks.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Link: http://lkml.kernel.org/r/1336692811-30576-2-git-send-email-suresh.b.siddha@intel.com
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
  16. 16 May, 2012 1 commit
  17. 03 May, 2012 1 commit
  18. 14 Apr, 2012 1 commit
    • Add PR_{GET,SET}_NO_NEW_PRIVS to prevent execve from granting privs · 259e5e6c
      Committed by Andy Lutomirski
      With this change, calling
        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
      disables privilege granting operations at execve-time.  For example, a
      process will not be able to execute a setuid binary to change their uid
      or gid if this bit is set.  The same is true for file capabilities.
      
      Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
      LSMs respect the requested behavior.
      
      To determine if the NO_NEW_PRIVS bit is set, a task may call
        prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
      It returns 1 if set and 0 if it is not set. If any of the arguments are
      non-zero, it will return -1 and set errno to EINVAL.
      (PR_SET_NO_NEW_PRIVS behaves similarly.)
      
      This functionality is desired for the proposed seccomp filter patch
      series.  By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
      system call behavior for itself and its child tasks without being
      able to impact the behavior of a more privileged task.
      
      Another potential use is making certain privileged operations
      unprivileged.  For example, chroot may be considered "safe" if it cannot
      affect privileged tasks.
      
      Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
      set and AppArmor is in use.  It is fixed in a subsequent patch.
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Will Drewry <wad@chromium.org>
      Acked-by: Eric Paris <eparis@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>

      v18: updated change desc
      v17: using new define values as per 3.4
      Signed-off-by: James Morris <james.l.morris@oracle.com>
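The two prctl() calls described in the commit above can be exercised from userspace roughly as follows. This is a minimal sketch for Linux only: `lock_no_new_privs()` is a hypothetical helper name, and the fallback `#define` values (38 and 39) are the ones used by the upstream `linux/prctl.h` header, supplied here only in case an older libc header lacks them. Note that once the bit is set it stays set for the life of the task.

```c
#include <assert.h>
#include <sys/prctl.h>

/* Fallbacks for older libc headers; values match linux/prctl.h. */
#ifndef PR_SET_NO_NEW_PRIVS
#define PR_SET_NO_NEW_PRIVS 38
#endif
#ifndef PR_GET_NO_NEW_PRIVS
#define PR_GET_NO_NEW_PRIVS 39
#endif

/* Hypothetical helper: set the no_new_privs bit for the calling task.
 * Returns 0 on success and -1 on failure (errno is set by prctl). */
int lock_no_new_privs(void)
{
    /* All unused arguments must be zero, as the commit message notes. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1UL, 0UL, 0UL, 0UL) != 0)
        return -1;

    /* Sanity check: reading the bit back should now report 1. */
    assert(prctl(PR_GET_NO_NEW_PRIVS, 0UL, 0UL, 0UL, 0UL) == 1);
    return 0;
}
```

A sandboxing program would typically call such a helper once, before installing any filters and before calling execve(), so that nothing it executes afterwards can gain privileges through setuid bits or file capabilities.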
  19. 31 Mar, 2012 1 commit
  20. 29 Mar, 2012 1 commit
    • Add #includes needed to permit the removal of asm/system.h · 96f951ed
      Committed by David Howells
      asm/system.h is a cause of circular dependency problems because it contains
      commonly used primitive stuff like barrier definitions and uncommonly used
      stuff like switch_to() that might require MMU definitions.
      
      asm/system.h has been disintegrated by this point on all arches into the
      following common segments:
      
       (1) asm/barrier.h
      
           Moved memory barrier definitions here.
      
       (2) asm/cmpxchg.h
      
           Moved xchg() and cmpxchg() here.  #included in asm/atomic.h.
      
       (3) asm/bug.h
      
           Moved die() and similar here.
      
       (4) asm/exec.h
      
           Moved arch_align_stack() here.
      
       (5) asm/elf.h
      
           Moved AT_VECTOR_SIZE_ARCH here.
      
       (6) asm/switch_to.h
      
           Moved switch_to() here.
      Signed-off-by: David Howells <dhowells@redhat.com>
  21. 22 Mar, 2012 1 commit
  22. 21 Mar, 2012 4 commits
  23. 20 Mar, 2012 1 commit
  24. 06 Mar, 2012 2 commits
  25. 23 Feb, 2012 1 commit
    • tracepoint, vfs, sched: Add exec() tracepoint · 4ff16c25
      Committed by David Smith
      Added a minimal exec tracepoint. Exec is an important major event
      in the life of a task, like fork(), clone() or exit(), all of
      which we already trace.
      
      [ We also do scheduling re-balancing during exec() - so it's useful
        from a scheduler instrumentation POV as well. ]
      
      If you want to watch a task start up, when it gets exec'ed is a good place
      to start.  With the addition of this tracepoint, execs can be monitored
      and a better picture of general system activity can be obtained. This
      tracepoint will also enable better process life tracking, allowing you to
      answer questions like "what process keeps starting up binary X?".
      
      This tracepoint can also be useful in ftrace filtering and trigger
      conditions: i.e. starting or stopping filtering when exec is called.
      Signed-off-by: David Smith <dsmith@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/4F314D19.7030504@redhat.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  26. 20 Feb, 2012 2 commits
    • Replace the fd_sets in struct fdtable with an array of unsigned longs · 1fd36adc
      Committed by David Howells
      Replace the fd_sets in struct fdtable with an array of unsigned longs and then
      use the standard non-atomic bit operations rather than the FD_* macros.
      
      This:
      
       (1) Removes the abuses of struct fd_set:
      
           (a) Since we don't want to allocate a full fd_set the vast majority of the
           	 time, we actually, in effect, just allocate a just-big-enough array of
           	 unsigned longs and cast it to an fd_set type - so why bother with the
           	 fd_set at all?
      
           (b) Some places outside of the core fdtable handling code (such as
           	 SELinux) want to look inside the array of unsigned longs hidden inside
           	 the fd_set struct for more efficient iteration over the entire set.
      
       (2) Eliminates the use of FD_*() macros in the kernel completely.
      
       (3) Permits the __FD_*() macros to be deleted entirely where not exposed to
           userspace.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Link: http://lkml.kernel.org/r/20120216174954.23314.48147.stgit@warthog.procyon.org.uk
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
    • Wrap accesses to the fd_sets in struct fdtable · 1dce27c5
      Committed by David Howells
      Wrap accesses to the fd_sets in struct fdtable (for recording open files and
      close-on-exec flags) so that we can move away from using fd_sets since we
      abuse the fd_set structs by not allocating the full-sized structure under
      normal circumstances and by non-core code looking at the internals of the
      fd_sets.
      
      The first abuse means that use of FD_ZERO() on these fd_sets is not permitted,
      since that cannot be told about their abnormal lengths.
      
      This introduces six wrapper functions for setting, clearing and testing
      close-on-exec flags and fd-is-open flags:
      
      	void __set_close_on_exec(int fd, struct fdtable *fdt);
      	void __clear_close_on_exec(int fd, struct fdtable *fdt);
      	bool close_on_exec(int fd, const struct fdtable *fdt);
      	void __set_open_fd(int fd, struct fdtable *fdt);
      	void __clear_open_fd(int fd, struct fdtable *fdt);
      	bool fd_is_open(int fd, const struct fdtable *fdt);
      
      Note that I've prepended '__' to the names of the set/clear functions because
      they require the caller to hold a lock to use them.
      
      Note also that I haven't added wrappers for looking behind the scenes at
      the array.  Possibly that should exist too.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Link: http://lkml.kernel.org/r/20120216174942.23314.1364.stgit@warthog.procyon.org.uk
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
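The two fdtable commits above describe wrapping the open-file and close-on-exec flags behind six accessors and then backing them with plain arrays of unsigned longs and non-atomic bit operations. A userspace sketch of that shape might look like the following. Everything here is illustrative: the `_sketch` names, the struct layout and the bit helpers are assumptions for this example, not the kernel's definitions, and no locking is modelled even though the commit message says the set/clear variants require the caller to hold a lock.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG_SKETCH ((unsigned int)(CHAR_BIT * sizeof(unsigned long)))

/* Hypothetical stand-in for the table: two just-big-enough bitmaps. */
struct fdtable_sketch {
    unsigned int max_fds;          /* number of valid bit positions */
    unsigned long *close_on_exec;  /* one bit per fd */
    unsigned long *open_fds;       /* one bit per fd */
};

/* Non-atomic bit helpers over an array of unsigned longs. */
static void set_bit_sketch(unsigned int nr, unsigned long *map)
{
    map[nr / BITS_PER_LONG_SKETCH] |= 1UL << (nr % BITS_PER_LONG_SKETCH);
}

static void clear_bit_sketch(unsigned int nr, unsigned long *map)
{
    map[nr / BITS_PER_LONG_SKETCH] &= ~(1UL << (nr % BITS_PER_LONG_SKETCH));
}

static bool test_bit_sketch(unsigned int nr, const unsigned long *map)
{
    return (map[nr / BITS_PER_LONG_SKETCH] >> (nr % BITS_PER_LONG_SKETCH)) & 1UL;
}

/* The six wrappers, mirroring the list in the commit message. */
void set_close_on_exec_sketch(unsigned int fd, struct fdtable_sketch *fdt)
{
    assert(fd < fdt->max_fds);
    set_bit_sketch(fd, fdt->close_on_exec);
}

void clear_close_on_exec_sketch(unsigned int fd, struct fdtable_sketch *fdt)
{
    assert(fd < fdt->max_fds);
    clear_bit_sketch(fd, fdt->close_on_exec);
}

bool close_on_exec_sketch(unsigned int fd, const struct fdtable_sketch *fdt)
{
    assert(fd < fdt->max_fds);
    return test_bit_sketch(fd, fdt->close_on_exec);
}

void set_open_fd_sketch(unsigned int fd, struct fdtable_sketch *fdt)
{
    assert(fd < fdt->max_fds);
    set_bit_sketch(fd, fdt->open_fds);
}

void clear_open_fd_sketch(unsigned int fd, struct fdtable_sketch *fdt)
{
    assert(fd < fdt->max_fds);
    clear_bit_sketch(fd, fdt->open_fds);
}

bool fd_is_open_sketch(unsigned int fd, const struct fdtable_sketch *fdt)
{
    assert(fd < fdt->max_fds);
    return test_bit_sketch(fd, fdt->open_fds);
}
```

Keeping callers on the wrappers is what lets the backing storage change from an fd_set to a bare array without touching every user, and code that wants to scan the whole set can iterate over the unsigned long words directly.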