1. 14 Dec 2021 (1 commit)
    • kthread: Ensure struct kthread is present for all kthreads · 40966e31
      Committed by Eric W. Biederman
      Today the rules are a bit iffy and arbitrary about which kernel
      threads have struct kthread present.  Both idle threads and threads
      started with create_kthread want struct kthread present, so that is
      effectively all kernel threads.  Make the rule that if PF_KTHREAD is
      set and the task is running, then struct kthread is present.
      
      This will allow the kernel thread code to use tsk->exit_code
      with different semantics from ordinary processes.
      
      To ensure that struct kthread is present for all
      kernel threads, move its allocation into copy_process.
      
      Add a deallocation of struct kthread in exec for processes
      that were kernel threads.
      
      Move the allocation of struct kthread for the initial thread
      earlier so that it is not repeated for each additional idle
      thread.
      
      Move the initialization of struct kthread into set_kthread_struct
      so that the structure is always and reliably initialized.
      
      Clear set_child_tid in free_kthread_struct to ensure the kthread
      struct is reliably freed during exec.  The function
      free_kthread_struct does not need to clear vfork_done during exec as
      exec_mm_release called from exec_mmap has already cleared vfork_done.
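      
      As a rough sketch of the exec-time teardown described above (based on
      this description, not the exact diff; it assumes struct kthread is
      still stashed behind task->set_child_tid at this point in history):
      
        /* Free the struct kthread of a task that used to be a kernel
         * thread and is now exec'ing a userspace program.  Clearing
         * set_child_tid keeps the bookkeeping consistent so the structure
         * is freed exactly once across exec. */
        void free_kthread_struct(struct task_struct *k)
        {
                struct kthread *kthread = to_kthread(k);
      
                k->set_child_tid = (__force void __user *)NULL;
                kfree(kthread);
        }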
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  2. 30 Oct 2021 (1 commit)
  3. 07 Oct 2021 (1 commit)
  4. 04 Sep 2021 (3 commits)
  5. 24 Aug 2021 (1 commit)
  6. 02 Jul 2021 (1 commit)
  7. 01 May 2021 (2 commits)
    • Reimplement RLIMIT_NPROC on top of ucounts · 21d1c5e3
      Committed by Alexey Gladkov
      The rlimit counter is tied to the uid in the user_namespace. This
      allows rlimit values to be specified in a userns even if they are
      already globally exceeded by the user. However, the values of the
      previous user_namespaces cannot be exceeded.
      
      To illustrate the impact of rlimits, let's say there is a program that
      does not fork. Some service-A wants to run this program as user X in
      multiple containers. Since the program never forks, the service wants
      to set RLIMIT_NPROC=1.
      
      service-A
       \- program (uid=1000, container1, rlimit_nproc=1)
       \- program (uid=1000, container2, rlimit_nproc=1)
      
      Service-A sets RLIMIT_NPROC=1 and runs the program in container1.
      When service-A tries to run the program with RLIMIT_NPROC=1 in
      container2, it fails since user X already has one running process.
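      
      For reference, a userspace sketch of the failing scenario (hedged: the
      uid, path and error handling are illustrative, and whether the failure
      surfaces at setuid() or at the following execve() depends on when the
      kernel accounts the process against the limit):
      
        #include <sys/resource.h>
        #include <sys/types.h>
        #include <unistd.h>
        #include <errno.h>
      
        /* What service-A does in each container: drop RLIMIT_NPROC to 1,
         * switch to user X (uid 1000) and exec the program.  With global
         * per-user accounting the second container fails with EAGAIN,
         * because uid 1000 already has one running process. */
        static int launch_program(const char *path)
        {
                struct rlimit rl = { .rlim_cur = 1, .rlim_max = 1 };
                pid_t pid = fork();             /* still privileged here */
      
                if (pid < 0)
                        return -errno;
                if (pid == 0) {
                        setrlimit(RLIMIT_NPROC, &rl);
                        setuid(1000);           /* now counted against user X */
                        execl(path, path, (char *)NULL);
                        _exit(127);             /* exec failed, e.g. EAGAIN */
                }
                return 0;
        }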
      
      We cannot use the existing inc_ucounts / dec_ucounts because they do
      not allow us to exceed the maximum for the counter. Some rlimits can
      be exceeded by root or by a user with the appropriate capability.
      
      Changelog
      
      v11:
      * Change inc_rlimit_ucounts(), which now returns the top value of ucounts.
      * Drop inc_rlimit_ucounts_and_test() because the return code of
        inc_rlimit_ucounts() can be checked.
      Signed-off-by: Alexey Gladkov <legion@kernel.org>
      Link: https://lkml.kernel.org/r/c5286a8aa16d2d698c222f7532f3d735c82bc6bc.1619094428.git.legion@kernel.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • Add a reference to ucounts for each cred · 905ae01c
      Committed by Alexey Gladkov
      For RLIMIT_NPROC and some other rlimits the user_struct that holds the
      global limit is kept alive for the lifetime of a process by keeping it
      in struct cred. Adding a pointer to ucounts in struct cred will allow
      us to track RLIMIT_NPROC not only per user in the system, but per
      user in each user_namespace.
      
      Updating ucounts may require a memory allocation, which may fail. So
      we cannot change cred.ucounts in commit_creds(), because that function
      cannot fail and should always return 0. For this reason, we modify
      cred.ucounts before calling commit_creds().
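      
      A sketch of the resulting ordering (illustrative; set_cred_ucounts() is
      the helper introduced by this patch, and the error handling is
      abbreviated):
      
        struct cred *new = prepare_creds();
        if (!new)
                return -ENOMEM;
      
        /* May allocate a ucounts reference and therefore may fail, so it
         * must run while we can still back out ... */
        retval = set_cred_ucounts(new);
        if (retval < 0) {
                abort_creds(new);
                return retval;
        }
      
        /* ... because commit_creds() cannot fail and always returns 0. */
        return commit_creds(new);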
      
      Changelog
      
      v6:
      * Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
        error was caused by the fact that cred_alloc_blank() left the ucounts
        pointer empty.
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Alexey Gladkov <legion@kernel.org>
      Link: https://lkml.kernel.org/r/b37aaef28d8b9b0d757e07ba6dd27281bbe39259.1619094428.git.legion@kernel.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  8. 25 Feb 2021 (1 commit)
  9. 30 Jan 2021 (2 commits)
  10. 24 Jan 2021 (4 commits)
  11. 11 Dec 2020 (5 commits)
  12. 02 Dec 2020 (1 commit)
    • kernel: Implement selective syscall userspace redirection · 1446e1df
      Committed by Gabriel Krisman Bertazi
      Introduce a mechanism to quickly disable/enable syscall handling for a
      specific process and redirect to userspace via SIGSYS.  This is useful
      for processes with parts that require syscall redirection and parts that
      don't, but which need to perform this boundary crossing really fast,
      without paying the cost of a system call to reconfigure syscall handling
      on each boundary transition.  This is particularly important for Windows
      games running over Wine.
      
      The proposed interface looks like this:
      
        prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])
      
      The range [<off>, <off>+<length>) is the part of the process memory
      map that is allowed to bypass the redirection code and dispatch
      syscalls directly, such that on fast paths a process doesn't need to
      disable the trap and the kernel doesn't have to check the selector.
      This is essential to return from SIGSYS to a blocked area without
      triggering another SIGSYS from rt_sigreturn.
      
      selector is an optional pointer to a char-sized userspace memory region
      that acts as a key switch for the mechanism. This key switch is set to
      either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF to enable or disable
      the redirection without calling into the kernel.
      
      The feature is meant to be set per-thread and it is disabled on
      fork/clone/execv.
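      
      A minimal usage sketch of the interface as described here (hedged: the
      trusted code window and the helper names are placeholders, and a real
      caller also installs a SIGSYS handler before flipping the selector):
      
        #include <sys/prctl.h>   /* needs headers that define
                                    PR_SET_SYSCALL_USER_DISPATCH */
      
        static volatile char selector = PR_SYS_DISPATCH_OFF;
      
        /* Syscalls issued from [start, start+len) bypass the redirection;
         * everything else raises SIGSYS while the selector says "on". */
        static int enable_dispatch(void *start, unsigned long len)
        {
                if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                          (unsigned long)start, len, &selector))
                        return -1;
                selector = PR_SYS_DISPATCH_ON;   /* flip without a syscall */
                return 0;
        }
      
        static void disable_dispatch(void)
        {
                selector = PR_SYS_DISPATCH_OFF;  /* also no syscall needed */
        }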
      
      Internally, this doesn't add overhead to the syscall hot path, and it
      requires very little per-architecture support.  I avoided using seccomp,
      even though it duplicates some functionality, due to previous feedback
      that maybe it shouldn't mix with seccomp since it is not a security
      mechanism.  And obviously, this should never be considered a security
      mechanism, since any part of the program can bypass it by using the
      syscall dispatcher.
      
      For the sysinfo benchmark, which measures the overhead added to
      executing a native syscall that doesn't require interception, the
      overhead using only the direct dispatcher region to issue syscalls is
      pretty much negligible.  The overhead of using the selector is around
      40ns for a native (unredirected) syscall on my system, and it is (as
      expected) dominated by the supervisor-mode user-address access.  In
      fact, with SMAP off, the overhead is consistently less than 5ns on my
      test box.
      Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
  13. 11 Nov 2020 (1 commit)
    • parisc: Make user stack size configurable · 22ee3ea5
      Committed by Helge Deller
      On parisc we need to initialize the memory layout for the user stack at
      process start time to a fixed size, which up until now was limited to
      the size given by CONFIG_MAX_STACK_SIZE_MB at compile time.
      
      This hard limit was too small and showed problems when compiling
      ruby2.7, qmlcachegen and some Qt packages.
      
      This patch changes two things:
      a) It increases the default maximum stack size to 100MB.
      b) Users can modify the stack hard limit size with ulimit, and newly
         forked processes will then use the given stack size, which can even
         be bigger than the default 100MB (see the sketch below).
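      
      For example, a launcher can do the equivalent of ulimit in C before
      exec'ing a stack-hungry build (a sketch; the 256MB figure is
      illustrative, and raising the hard limit above its current value
      requires the appropriate privilege):
      
        #include <sys/resource.h>
        #include <unistd.h>
      
        /* Raise the stack limit, then exec; on parisc the new image is
         * laid out with this larger initial stack size. */
        static int exec_with_big_stack(char *const argv[])
        {
                struct rlimit rl = {
                        .rlim_cur = 256UL << 20,
                        .rlim_max = 256UL << 20,
                };
      
                if (setrlimit(RLIMIT_STACK, &rl) < 0)
                        return -1;
                return execv(argv[0], argv);
        }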
      Reported-by: John David Anglin <dave.anglin@bell.net>
      Signed-off-by: Helge Deller <deller@gmx.de>
  14. 05 Oct 2020 (3 commits)
  15. 01 Oct 2020 (1 commit)
    • io_uring: don't rely on weak ->files references · 0f212204
      Committed by Jens Axboe
      Grab actual references to the files_struct. To avoid circular reference
      issues due to this, we add a per-task note that keeps track of what
      io_uring contexts a task has used. When the task execs or exits its
      assigned files, we cancel requests based on this tracking.
      
      With that, we can grab proper references to the files table, and no
      longer need to rely on stashing away ring_fd and ring_file to check
      if the ring_fd may have been closed.
      
      Cc: stable@vger.kernel.org # v5.5+
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  16. 16 Sep 2020 (1 commit)
    • mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race · d53c3dfb
      Committed by Nicholas Piggin
      Reading and modifying current->mm and current->active_mm and switching
      mm should be done with irqs off, to prevent races seeing an intermediate
      state.
      
      This is similar to commit 38cf307c ("mm: fix kthread_use_mm() vs TLB
      invalidate"). At exec-time when the new mm is activated, the old one
      should usually be single-threaded and no longer used, unless something
      else is holding an mm_users reference (which may be possible).
      
      Absent other mm_users, there is also a race with preemption and lazy tlb
      switching. Consider the kernel_execve case where the current thread is
      using a lazy tlb active mm:
      
        call_usermodehelper()
          kernel_execve()
            old_mm = current->mm;
            active_mm = current->active_mm;
            *** preempt *** -------------------->  schedule()
                                                     prev->active_mm = NULL;
                                                     mmdrop(prev active_mm);
                                                   ...
                            <--------------------  schedule()
            current->mm = mm;
            current->active_mm = mm;
            if (!old_mm)
                mmdrop(active_mm);
      
      If we switch back to the kernel thread from a different mm, there is a
      double free of the old active_mm, and a missing free of the new one.
      
      Closing this race only requires interrupts to be disabled while ->mm
      and ->active_mm are being switched, but the TLB problem also requires
      holding interrupts off over activate_mm. Unfortunately not all archs
      can do that yet, e.g., arm defers the switch if irqs are disabled and
      expects finish_arch_post_lock_switch() to be called to complete the
      flush; um takes a blocking lock in activate_mm().
      
      So as a first step, disable interrupts across the mm/active_mm updates
      to close the lazy tlb preempt race, and provide an arch option to
      extend that to activate_mm which allows architectures doing IPI based
      TLB shootdowns to close the second race.
      
      This is a bit ugly, but in the interest of fixing the bug and backporting
      before all architectures are converted this is a compromise.
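      
      Roughly, the exec-side switch then has this shape (a sketch of the fix,
      not the verbatim diff; the arch opt-in is referred to here as
      CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM):
      
        local_irq_disable();
        active_mm = tsk->active_mm;
        tsk->active_mm = mm;
        tsk->mm = mm;
        /* Keeping irqs off over the two assignments above closes the lazy
         * tlb preempt race; archs that can also run activate_mm() with
         * irqs off opt in to close the TLB shootdown race as well. */
        if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
                local_irq_enable();
        activate_mm(active_mm, mm);
        if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
                local_irq_enable();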
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com
  17. 13 Aug 2020 (5 commits)
  18. 21 Jul 2020 (6 commits)
    • exec: Implement kernel_execve · be619f7f
      Committed by Eric W. Biederman
      To allow the kernel to stop playing games with set_fs in order to call
      exec, implement kernel_execve.  The function kernel_execve takes
      pointers into kernel memory and copies the values they point to onto
      the new userspace stack.
      
      The calls to do_execve with arguments from kernel space are replaced
      with calls to kernel_execve.
      
      The functions do_execve and do_execveat are made static as there are
      now no callers outside of exec.
      
      The comments that mention do_execve are updated to refer to
      kernel_execve or execve depending on the circumstances.  In addition
      to correcting the comments, this makes it easy to grep for do_execve
      and verify it is not used.
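      
      For instance, launching init from kernel context becomes a plain call
      with kernel pointers (a sketch; the path and environment strings are
      illustrative):
      
        static const char *const init_argv[] = { "/sbin/init", NULL };
        static const char *const init_envp[] = { "HOME=/", "TERM=linux", NULL };
      
        /* argv/envp live in kernel memory; kernel_execve() itself copies
         * the strings onto the new userspace stack, no set_fs needed. */
        static int run_init(void)
        {
                return kernel_execve(init_argv[0], init_argv, init_envp);
        }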
      
      Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Link: https://lkml.kernel.org/r/87wo365ikj.fsf@x220.int.ebiederm.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • exec: Factor bprm_stack_limits out of prepare_arg_pages · d8b9cd54
      Committed by Eric W. Biederman
      In preparation for implementing kernel_execve (which will take kernel
      pointers, not userspace pointers), factor bprm_stack_limits out of
      prepare_arg_pages.  This separates the counting, which depends upon
      getting data from userspace, from the calculation of the stack limits,
      which is usable in kernel_execve.
      
      Then remove prepare_arg_pages and compute bprm->argc and bprm->envc
      directly in do_execveat_common, before bprm_stack_limits is called.
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lkml.kernel.org/r/87365u6x60.fsf@x220.int.ebiederm.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • exec: Factor bprm_execve out of do_execve_common · 0c9cdff0
      Committed by Eric W. Biederman
      Currently it is necessary for the usermode helper code and the code
      that launches init to use set_fs so that pages coming from the kernel
      look like they are coming from userspace.
      
      To allow that usage of set_fs to be removed cleanly the argument
      copying from userspace needs to happen earlier.  Factor bprm_execve
      out of do_execve_common to separate the copying of arguments
      onto the new stack from the rest of exec.
      
      In separating bprm_execve from do_execve_common the copying
      of the arguments onto the new stack happens earlier.
      
      As the copying of the arguments does not depend on any security hooks,
      files, the file table, current->in_execve, current->fs->in_exec,
      bprm->unsafe, or creds, this is safe.
      
      Likewise the security hook security_creds_for_exec does not depend upon
      preventing the argument copying from happening.
      
      In addition to making it possible to implement kernel_execve that
      performs the copying differently, this separation of bprm_execve from
      do_execve_common makes for a nice separation of responsibilities,
      making the exec code easier to navigate.
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lkml.kernel.org/r/878sfm6x6x.fsf@x220.int.ebiederm.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • exec: Move bprm_mm_init into alloc_bprm · f18ac551
      Committed by Eric W. Biederman
      Currently it is necessary for the usermode helper code and the code that
      launches init to use set_fs so that pages coming from the kernel look like
      they are coming from userspace.
      
      To allow that usage of set_fs to be removed cleanly the argument copying
      from userspace needs to happen earlier.  Move the allocation and
      initialization of bprm->mm into alloc_bprm so that the bprm->mm is
      available early to store the new user stack into.  This is a prerequisite
      for copying argv and envp into the new user stack early, before the rest
      of exec.
      
      To keep things consistent the cleanup of bprm->mm is moved into
      free_bprm, so that bprm->mm is cleaned up whenever it has been
      allocated and free_bprm is called.
      
      Moving bprm_mm_init earlier is safe as it does not depend on any files,
      current->in_execve, current->fs->in_exec, bprm->unsafe, or whether the
      file table is shared.  (AKA bprm_mm_init does not depend on any of the
      code that happens between alloc_bprm and where it was previously called.)
      
      This moves bprm->mm cleanup after current->fs->in_exec is set to 0.  This
      is safe because current->fs->in_exec is only used to prevent taking an
      additional reference on the fs_struct.
      
      This moves bprm->mm cleanup after current->in_execve is set to 0.  This is
      safe because current->in_execve is only used by the LSMs (AppArmor and
      TOMOYO), and always for LSM-specific functions, never for anything to do
      with the mm.
      
      This adds bprm->mm cleanup into the successful return path.  This is safe
      because being on the successful return path implies that begin_new_exec
      succeeded and set bprm->mm to NULL.  As bprm->mm is NULL, the cleanup I am
      moving into free_bprm will do nothing.
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lkml.kernel.org/r/87eepe6x7p.fsf@x220.int.ebiederm.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • exec: Move initialization of bprm->filename into alloc_bprm · 60d9ad1d
      Committed by Eric W. Biederman
      Currently it is necessary for the usermode helper code and the code
      that launches init to use set_fs so that pages coming from the kernel
      look like they are coming from userspace.
      
      To allow that usage of set_fs to be removed cleanly the argument
      copying from userspace needs to happen earlier.  Move the computation
      of bprm->filename and possible allocation of a name in the case
      of execveat into alloc_bprm to make that possible.
      
      The executable name, the arguments, and the environment are
      copied into the new usermode stack which is stored in bprm
      until exec passes the point of no return.
      
      As the executable name is copied onto the usermode stack first,
      it needs to be known early.  As there are no dependencies for computing
      the executable name, compute it early in alloc_bprm.
      
      As an implementation detail, if the filename needs to be generated
      because it embeds a file descriptor, store that filename in a new field,
      bprm->fdpath, and free it in free_bprm.  Previously this was done in
      an independent variable, pathbuf.  I have renamed pathbuf to fdpath
      because fdpath is more suggestive of what kind of path is in the
      variable.  I moved fdpath into struct linux_binprm because it is
      tightly tied to the other variables in struct linux_binprm, and as
      such is needed to allow the call to alloc_bprm to move.
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lkml.kernel.org/r/87k0z66x8f.fsf@x220.int.ebiederm.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • exec: Factor out alloc_bprm · 0a8f36eb
      Committed by Eric W. Biederman
      Currently it is necessary for the usermode helper code and the code
      that launches init to use set_fs so that pages coming from the kernel
      look like they are coming from userspace.
      
      To allow that usage of set_fs to be removed cleanly the argument
      copying from userspace needs to happen earlier.  Move the allocation
      of the bprm into its own function (alloc_bprm) and move the call of
      alloc_bprm before unshare_files so that bprm can ultimately be
      allocated, the arguments can be placed on the new stack, and then the
      bprm can be passed into the core of exec.
      
      Neither the allocation of the bprm nor the unsharing depends upon the
      other, so swapping the order in which they are called is trivially safe.
      
      To keep things consistent, the order of cleanup at the end of
      do_execve_common is swapped to match the order of initialization.
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Link: https://lkml.kernel.org/r/87pn8y6x9a.fsf@x220.int.ebiederm.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>