1. 30 May, 2020 2 commits
    • exec: Compute file based creds only once · 56305aa9
      Eric W. Biederman committed
      Move the computation of creds from prepare_binfmt into begin_new_exec
      so that the creds need only be computed once.  This is just code
      reorganization; no semantic changes of any kind are made.
      
      Moving the computation is safe.  I have looked through the kernel and
      verified none of the binfmts look at bprm->cred directly, and that
      there are no helpers that look at bprm->cred indirectly.  Which means
      that it is not a problem to compute the bprm->cred later in the
      execution flow as it is not used until it becomes current->cred.
      
      A new function bprm_creds_from_file is added to contain the work that
      needs to be done.  bprm_creds_from_file first determines which file,
      bprm->executable or, most likely, bprm->file, the bprm->cred will be
      computed from.
      
      The function bprm_fill_uid is updated to receive the file instead of
      accessing bprm->file.  The now unnecessary work needed to reset the
      bprm->cred->euid, and bprm->cred->egid is removed from bprm_fill_uid.
      A small comment is added to document that bprm_fill_uid now only deals
      with the work to handle suid and sgid files.  The default case is
      already handled by prepare_exec_creds.
      
      The function security_bprm_repopulate_creds is renamed
      security_bprm_creds_from_file and now is explicitly passed the file
      from which to compute the creds.  The documentation of the
      bprm_creds_from_file security hook is updated to explain when the hook
      is called and what it needs to do.  The file is passed from
      cap_bprm_creds_from_file into get_file_caps so that the caps are
      computed for the appropriate file.  The now unnecessary work in
      cap_bprm_creds_from_file to reset the ambient capabilities has been
      removed.  A small comment has been added to document that the work of
      cap_bprm_creds_from_file is to read capabilities from the file's
      security attribute and to derive capabilities from the fact that the
      user had uid 0.

      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • exec: Add a per bprm->file version of per_clear · a7868323
      Eric W. Biederman committed
      There is a small bug in the code that recomputes parts of bprm->cred
      for every bprm->file.  The code never recomputes the part of
      clear_dangerous_personality_flags it is responsible for.
      
      Which means that in practice if someone creates a sgid script
      the interpreter will not be able to use any of:
      	READ_IMPLIES_EXEC
      	ADDR_NO_RANDOMIZE
      	ADDR_COMPAT_LAYOUT
      	MMAP_PAGE_ZERO.
      
      This accidental clearing of personality flags probably does
      not matter in practice because no one has complained
      but it does make the code more difficult to understand.
      
      Further, remaining bug compatible prevents the recomputation from being
      removed and replaced by simply computing bprm->cred once from the
      final bprm->file.
      
      Making this change removes the last behavior difference between
      computing bprm->cred from the final file and recomputing
      bprm->cred several times.  Which allows this behavior change
      to be justified for its own reasons, and for any bug hunts
      looking into why the behavior changed to wind up here instead
      of in the code that will follow that computes bprm->cred
      from the final bprm->file.
      
      This small logic bug appears to have existed since the code
      started clearing dangerous personality bits.
      
      History Tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      Fixes: 1bb0fa189c6a ("[PATCH] NX: clean up legacy binary support")
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  2. 21 May, 2020 6 commits
  3. 17 May, 2020 1 commit
    • exec: Move would_dump into flush_old_exec · f87d1c95
      Eric W. Biederman committed
      I goofed when I added mm->user_ns support to would_dump.  I missed the
      fact that in the case of binfmt_loader, binfmt_em86, binfmt_misc, and
      binfmt_script bprm->file is reassigned.  Which made the move of
      would_dump from setup_new_exec to __do_execve_file before exec_binprm
      incorrect as it can result in would_dump running on the script instead
      of the interpreter of the script.
      
      The net result is that the code stopped making unreadable interpreters
      undumpable.  Which allows them to be ptraced and written to disk
      without special permissions.  Oops.
      
      The move was necessary because the call in setup_new_exec was after
      bprm->mm was no longer valid.
      
      To correct this mistake move the misplaced would_dump from
      __do_execve_file into flush_old_exec, before exec_mmap is called.
      
      I tested and confirmed that without this fix I can attach with gdb to
      a script with an unreadable interpreter, and with this fix I can not.
      
      Cc: stable@vger.kernel.org
      Fixes: f84df2a6 ("exec: Ensure mm->user_ns contains the execed files")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  4. 12 May, 2020 3 commits
  5. 09 May, 2020 2 commits
  6. 08 May, 2020 6 commits
  7. 29 April, 2020 2 commits
    • exec: Remove BUG_ON(has_group_leader_pid) · 610b8188
      Eric W. Biederman committed
      With the introduction of exchange_tids thread_group_leader and
      has_group_leader_pid have become equivalent.  Further at this point in the
      code a thread group has exactly two threads, the previous thread_group_leader
      that is waiting to be reaped and tsk.  So we know it is impossible for tsk to
      be the thread_group_leader.
      
      This is also the last user of has_group_leader_pid so removing this check
      will allow has_group_leader_pid to be removed.
      
      So remove the "BUG_ON(has_group_leader_pid)" that will never fire.

      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • proc: Ensure we see the exit of each process tid exactly once · 6b03d130
      Eric W. Biederman committed
      When the thread group leader changes during exec and the old leader's
      thread is reaped proc_flush_pid will flush the dentries for the entire
      process because the leader still has its original pid.
      
      Fix this by exchanging the pids in an rcu safe manner,
      and wrapping the code to do that up in a helper exchange_tids.
      
      When I removed switch_exec_pids and introduced this behavior
      in d73d6529 ("[PATCH] pidhash: kill switch_exec_pids") there
      really was nothing that cared as flushing happened with
      the cached dentry and de_thread flushed both of them on exec.
      
      This lack of fully exchanging pids became a problem a few months later
      when I introduced 48e6484d ("[PATCH] proc: Rewrite the proc dentry
      flush on exit optimization").  Which overlooked the de_thread case
      was no longer swapping pids, and I was looking up proc dentries
      by task->pid.
      
      The current behavior isn't properly a bug as everything in proc will
      continue to work correctly just a little bit less efficiently.  Fix
      this just so there are no little surprise corner cases waiting to bite
      people.
      
      -- Oleg points out this could be an issue in next_tgid in proc where
         has_group_leader_pid is called, and reordering some of the assignments
         should fix that.
      
      -- Oleg points out this will break the 10 year old hack in __exit_signal.c
      >	/*
      >	 * This can only happen if the caller is de_thread().
      >	 * FIXME: this is the temporary hack, we should teach
      >	 * posix-cpu-timers to handle this case correctly.
      >	 */
      >	if (unlikely(has_group_leader_pid(tsk)))
      >		posix_cpu_timers_exit_group(tsk);
      
      The code in next_tgid has been changed to use PIDTYPE_TGID,
      and the posix cpu timers code has been fixed so it does not
      need the 10 year old hack, so this should be safe to merge
      now.
      
      Link: https://lore.kernel.org/lkml/87h7x3ajll.fsf_-_@x220.int.ebiederm.org/
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Fixes: 48e6484d ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  8. 02 April, 2020 1 commit
    • signal: Extend exec_id to 64bits · d1e7fd64
      Eric W. Biederman committed
      Replace the 32bit exec_id with a 64bit exec_id to make it impossible
      to wrap the exec_id counter.  With care an attacker can cause exec_id
      wrap and send arbitrary signals to a newly exec'd parent.  This
      bypasses the signal sending checks if the parent changes their
      credentials during exec.
      
      The severity of this problem can be seen in that, in my limited testing
      of a 32bit exec_id, it can take as little as 19s to exec 65536 times.
      Which means that it can take as little as 14 days to wrap a 32bit
      exec_id.  Adam Zabrocki has succeeded wrapping the self_exec_id in 7
      days.  Even my slower timing is in the uptime of a typical server.
      Which means self_exec_id is simply a speed bump today, and if exec
      gets noticeably faster self_exec_id won't even be a speed bump.
      
      Extending self_exec_id to 64bits introduces a problem on 32bit
      architectures where reading self_exec_id is no longer atomic and can
      take two read instructions.  Which means that it is possible to hit
      a window where the read value of exec_id does not match the written
      value.  So with very lucky timing after this change this still
      remains exploitable.
      
      I have updated the update of exec_id on exec to use WRITE_ONCE
      and the read of exec_id in do_notify_parent to use READ_ONCE
      to make it clear that there is no locking between these two
      locations.
      
      Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
      Fixes: 2.3.23pre2
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  9. 25 March, 2020 5 commits
  10. 11 February, 2020 1 commit
    • firmware_loader: load files from the mount namespace of init · 901cff7c
      Topi Miettinen committed
      I have an experimental setup where almost every possible system
      service (even early startup ones) runs in separate namespace, using a
      dedicated, minimal file system. In process of minimizing the contents
      of the file systems with regards to modules and firmware files, I
      noticed that in my system, the firmware files are loaded from three
      different mount namespaces, those of systemd-udevd, init and
      systemd-networkd. The logic of the source namespace is not very clear,
      it seems to depend on the driver, but the namespace of the current
      process is used.
      
      So, this patch tries to make things a bit clearer and loads
      firmware files only from the mount namespace of init. This
      may also improve security, though I think that using firmware files as
      attack vector could be too impractical anyway.
      
      Later, it might make sense to make the mount namespace configurable,
      for example with a new file in /proc/sys/kernel/firmware_config/. That
      would allow a dedicated file system only for firmware files and those
      need not be present anywhere else. This configurability would make
      more sense if made also for kernel modules and /sbin/modprobe. Modules
      are already loaded from init namespace (usermodehelper uses kthreadd
      namespace) except when directly loaded by systemd-udevd.
      
      Instead of using the mount namespace of the current process to load
      firmware files, use the mount namespace of init process.
      
      Link: https://lore.kernel.org/lkml/bb46ebae-4746-90d9-ec5b-fce4c9328c86@gmail.com/
      Link: https://lore.kernel.org/lkml/0e3f7653-c59d-9341-9db2-c88f5b988c68@gmail.com/
      Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
      Link: https://lore.kernel.org/r/20200123125839.37168-1-toiwoton@gmail.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  11. 01 February, 2020 1 commit
    • execve: warn if process starts with executable stack · 47a2ebb7
      Alexey Dobriyan committed
      There were a few episodes of silent downgrade to an executable stack
      over the years:
      
      1) linking an innocent looking assembly file will silently add an
         executable stack if proper linker options are not given as well:
      
      	$ cat f.S
      	.intel_syntax noprefix
      	.text
      	.globl f
      	f:
      	        ret
      
      	$ cat main.c
      	void f(void);
      	int main(void)
      	{
      	        f();
      	        return 0;
      	}
      
      	$ gcc main.c f.S
      	$ readelf -l ./a.out
      	  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                               0x0000000000000000 0x0000000000000000  RWE    0x10
      			 					 ^^^
      
      2) converting C99 nested function into a closure
         https://nullprogram.com/blog/2019/11/15/
      
      	void intsort2(int *base, size_t nmemb, _Bool invert)
      	{
      	    int cmp(const void *a, const void *b)
      	    {
      	        int r = *(int *)a - *(int *)b;
      	        return invert ? -r : r;
      	    }
      	    qsort(base, nmemb, sizeof(*base), cmp);
      	}
      
      will silently require stack trampolines while non-closure version will
      not.
      
      Without doubt this behaviour is documented somewhere; add a warning so
      that developers and users can at least notice.  After so many years of
      x86_64 having proper executable stack support it should not cause too
      many problems.
      
      Link: http://lkml.kernel.org/r/20191208171918.GC19716@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 24 January, 2020 1 commit
    • mm: remove arch_bprm_mm_init() hook · 42222eae
      Dave Hansen committed
      From: Dave Hansen <dave.hansen@linux.intel.com>
      
      MPX is being removed from the kernel due to a lack of support
      in the toolchain going forward (gcc).
      
      arch_bprm_mm_init() is used at execve() time.  The only non-stub
      implementation is on x86 for MPX.  Remove the hook entirely from
      all architectures and generic code.
      
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: x86@kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-arch@vger.kernel.org
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
  13. 20 November, 2019 1 commit
  14. 13 November, 2019 1 commit
  15. 24 October, 2019 1 commit
  16. 25 September, 2019 1 commit
    • sched/membarrier: Fix p->mm->membarrier_state racy load · 227a4aad
      Mathieu Desnoyers committed
      The membarrier_state field is located within the mm_struct, which
      is not guaranteed to exist when used from runqueue-lock-free iteration
      on runqueues by the membarrier system call.
      
      Copy the membarrier_state from the mm_struct into the scheduler runqueue
      when the scheduler switches between mm.
      
      When registering membarrier for mm, after setting the registration bit
      in the mm membarrier state, issue a synchronize_rcu() to ensure the
      scheduler observes the change. In order to take care of the case
      where a runqueue keeps executing the target mm without swapping to
      other mm, iterate over each runqueue and issue an IPI to copy the
      membarrier_state from the mm_struct into each runqueue that has the
      same mm whose state has just been modified.
      
      Move the mm membarrier_state field closer to pgd in mm_struct to use
      a cache line already touched by the scheduler switch_mm.
      
      The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
      clear the runqueue's membarrier state in addition to clearing the mm
      membarrier state, so move its implementation into the scheduler
      membarrier code so it can access the runqueue structure.
      
      Add memory barrier in membarrier_exec_mmap() prior to clearing
      the membarrier state, ensuring memory accesses executed prior to exec
      are not reordered with the stores clearing the membarrier state.
      
      As suggested by Linus, move all membarrier.c RCU read-side locks outside
      of the for each cpu loops.

      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  17. 25 July, 2019 1 commit
    • sched/fair: Don't free p->numa_faults with concurrent readers · 16d51a59
      Jann Horn committed
      When going through execve(), zero out the NUMA fault statistics instead of
      freeing them.
      
      During execve, the task is reachable through procfs and the scheduler. A
      concurrent /proc/*/sched reader can read data from a freed ->numa_faults
      allocation (confirmed by KASAN) and write it back to userspace.
      I believe that it would also be possible for a use-after-free read to occur
      through a race between a NUMA fault and execve(): task_numa_fault() can
      lead to task_numa_compare(), which invokes task_weight() on the currently
      running task of a different CPU.
      
      Another way to fix this would be to make ->numa_faults RCU-managed or add
      extra locking, but it seems easier to wipe the NUMA fault statistics on
      execve.

      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Fixes: 82727018 ("sched/numa: Call task_numa_free() from do_execve()")
      Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  18. 27 May, 2019 1 commit
  19. 21 May, 2019 1 commit
  20. 15 May, 2019 1 commit
  21. 08 March, 2019 1 commit