1. 24 Jan, 2021 4 commits
  2. 11 Dec, 2020 5 commits
  3. 02 Dec, 2020 1 commit
    • G
      kernel: Implement selective syscall userspace redirection · 1446e1df
      Gabriel Krisman Bertazi committed
      Introduce a mechanism to quickly disable/enable syscall handling for a
      specific process and redirect to userspace via SIGSYS.  This is useful
      for processes with parts that require syscall redirection and parts that
      don't, but who need to perform this boundary crossing really fast,
      without paying the cost of a system call to reconfigure syscall handling
      on each boundary transition.  This is particularly important for Windows
      games running over Wine.
      
      The proposed interface looks like this:
      
        prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
      
      The range [<offset>,<offset>+<length>) is a part of the process memory
      map that is allowed to bypass the redirection code and dispatch
      syscalls directly, so that in fast paths a process doesn't need to
      disable the trap, nor does the kernel have to check the selector.  This
      is essential to return from SIGSYS to a blocked area without triggering
      another SIGSYS from rt_sigreturn.
      
      selector is an optional pointer to a char-sized userspace memory region
      that holds a key switch for the mechanism.  This key switch is set to
      either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF to enable or disable
      the redirection without calling the kernel.
      
      The feature is meant to be set per-thread and it is disabled on
      fork/clone/execv.
      
      Internally, this doesn't add overhead to the syscall hot path, and it
      requires very little per-architecture support.  I avoided using seccomp,
      even though it duplicates some functionality, due to previous feedback
      that maybe it shouldn't mix with seccomp since it is not a security
      mechanism.  And obviously, this should never be considered a security
      mechanism, since any part of the program can by-pass it by using the
      syscall dispatcher.
      
      For the sysinfo benchmark, which measures the overhead added to
      executing a native syscall that doesn't require interception, the
      overhead using only the direct dispatcher region to issue syscalls is
      negligible.  The overhead of using the selector is around
      40ns for a native (unredirected) syscall on my system, and it is (as
      expected) dominated by the supervisor-mode user-address access.  In
      fact, with SMAP off, the overhead is consistently less than 5ns on my
      test box.
      Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
      1446e1df
  4. 11 Nov, 2020 1 commit
    • H
      parisc: Make user stack size configurable · 22ee3ea5
      Helge Deller committed
      On parisc we need to initialize the memory layout for the user stack at
      process start time to a fixed size, which up until now was limited to
      the size as given by CONFIG_MAX_STACK_SIZE_MB at compile time.
      
      This hard limit was too small and showed problems when compiling
      ruby2.7, qmlcachegen and some Qt packages.
      
      This patch changes two things:
      a) It increases the default maximum stack size to 100MB.
      b) Users can modify the stack hard limit size with ulimit and then newly
         forked processes will use the given stack size which can even be bigger
         than the default 100MB.
      Reported-by: John David Anglin <dave.anglin@bell.net>
      Signed-off-by: Helge Deller <deller@gmx.de>
      22ee3ea5
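Point (b) above refers to the stack hard limit, which a shell sets with `ulimit -H -s` and a program sets with `setrlimit(RLIMIT_STACK, ...)`. A minimal sketch of the programmatic form follows; the helper name `cap_stack_hard_limit` is hypothetical, and the sketch only lowers the limit because an unprivileged process cannot raise a hard limit.

```c
#include <sys/resource.h>

/*
 * Lower the hard RLIMIT_STACK to at most `bytes` and clamp the soft limit
 * so that it still fits. Children created afterwards inherit both values.
 * Returns 0 on success, -1 on failure.
 */
int cap_stack_hard_limit(rlim_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    if (rl.rlim_max > bytes)        /* RLIM_INFINITY compares as the largest value */
        rl.rlim_max = bytes;
    if (rl.rlim_cur > rl.rlim_max)
        rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_STACK, &rl);
}
```

Raising the hard limit above the default, as the message describes, would need the same call made with sufficient privileges before the affected processes are started.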
  5. 05 Oct, 2020 3 commits
  6. 01 Oct, 2020 1 commit
    • J
      io_uring: don't rely on weak ->files references · 0f212204
      Jens Axboe committed
      Grab actual references to the files_struct. To avoid circular reference
      issues due to this, we add a per-task note that keeps track of what
      io_uring contexts a task has used. When the task execs or exits its
      assigned files, we cancel requests based on this tracking.
      
      With that, we can grab proper references to the files table, and no
      longer need to rely on stashing away ring_fd and ring_file to check
      if the ring_fd may have been closed.
      
      Cc: stable@vger.kernel.org # v5.5+
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0f212204
  7. 16 Sep, 2020 1 commit
    • N
      mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race · d53c3dfb
      Nicholas Piggin committed
      Reading and modifying current->mm and current->active_mm and switching
      mm should be done with irqs off, to prevent races seeing an intermediate
      state.
      
      This is similar to commit 38cf307c ("mm: fix kthread_use_mm() vs TLB
      invalidate"). At exec-time when the new mm is activated, the old one
      should usually be single-threaded and no longer used, unless something
      else is holding an mm_users reference (which may be possible).
      
      Absent other mm_users, there is also a race with preemption and lazy tlb
      switching. Consider the kernel_execve case where the current thread is
      using a lazy tlb active mm:
      
        call_usermodehelper()
          kernel_execve()
            old_mm = current->mm;
            active_mm = current->active_mm;
            *** preempt *** -------------------->  schedule()
                                                     prev->active_mm = NULL;
                                                     mmdrop(prev active_mm);
                                                   ...
                            <--------------------  schedule()
            current->mm = mm;
            current->active_mm = mm;
            if (!old_mm)
                mmdrop(active_mm);
      
      If we switch back to the kernel thread from a different mm, there is a
      double free of the old active_mm, and a missing free of the new one.
      
      Closing this race only requires interrupts to be disabled while ->mm
      and ->active_mm are being switched, but the TLB problem requires also
      holding interrupts off over activate_mm. Unfortunately not all archs
      can do that yet, e.g., arm defers the switch if irqs are disabled and
      expects finish_arch_post_lock_switch() to be called to complete the
      flush; um takes a blocking lock in activate_mm().
      
      So as a first step, disable interrupts across the mm/active_mm updates
      to close the lazy tlb preempt race, and provide an arch option to
      extend that to activate_mm which allows architectures doing IPI based
      TLB shootdowns to close the second race.
      
      This is a bit ugly, but in the interest of fixing the bug and backporting
      before all architectures are converted this is a compromise.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com
      d53c3dfb
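The interleaving in the diagram above can be replayed with a toy reference-count model. This is not kernel code: `toy_mm`, `toy_task` and the helper names are stand-ins invented for the sketch, and the context switch is reduced to the two steps the diagram shows (drop the lazy reference on the way out, borrow another mm on the way back in).

```c
#include <stddef.h>

/* Toy stand-ins for mm_struct and task_struct; only a reference count is modeled. */
struct toy_mm { int refs; };
struct toy_task { struct toy_mm *mm; struct toy_mm *active_mm; };

static void toy_mmgrab(struct toy_mm *mm) { mm->refs++; }
static void toy_mmdrop(struct toy_mm *mm) { mm->refs--; }

/* The kernel thread is switched out, then switched back in after a task using `other`. */
static void toy_preempt_roundtrip(struct toy_task *t, struct toy_mm *other)
{
    struct toy_mm *lazy = t->active_mm;

    t->active_mm = NULL;    /* switched out: give up the lazy reference */
    toy_mmdrop(lazy);
    toy_mmgrab(other);      /* switched back in: borrow the previous task's mm */
    t->active_mm = other;
}

/* Mirrors the mm switch in the diagram; `preempt_in_window` injects the race. */
void toy_exec_switch_mm(struct toy_task *t, struct toy_mm *new_mm,
                        struct toy_mm *other, int preempt_in_window)
{
    struct toy_mm *old_mm = t->mm;
    struct toy_mm *active_mm = t->active_mm;    /* snapshot taken too early */

    if (preempt_in_window)
        toy_preempt_roundtrip(t, other);

    t->mm = new_mm;
    t->active_mm = new_mm;
    if (!old_mm)
        toy_mmdrop(active_mm);  /* drops the snapshot, which may be stale */
}
```

Starting each mm at one reference, the run without preemption leaves the old active mm at zero, while the run with preemption leaves it at minus one (the double free) and leaves the borrowed mm one reference too high (the missing free).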
  8. 13 Aug, 2020 5 commits
  9. 21 Jul, 2020 6 commits
    • E
      exec: Implement kernel_execve · be619f7f
      Eric W. Biederman committed
      To allow the kernel to call exec without playing games with set_fs,
      implement kernel_execve.  The function kernel_execve takes pointers
      into kernel memory and copies the values pointed to onto the new
      userspace stack.
      
      The calls with arguments from kernel space of do_execve are replaced
      with calls to kernel_execve.
      
      The calls do_execve and do_execveat are made static as there are now
      no callers outside of exec.
      
      The comments that mention do_execve are updated to refer to
      kernel_execve or execve depending on the circumstances.  In addition
      to correcting the comments, this makes it easy to grep for do_execve
      and verify it is not used.
      
      Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Link: https://lkml.kernel.org/r/87wo365ikj.fsf@x220.int.ebiederm.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      be619f7f
    • E
      exec: Factor bprm_stack_limits out of prepare_arg_pages · d8b9cd54
      Eric W. Biederman committed
      In preparation for implementing kernel_execve (which will take kernel
      pointers not userspace pointers) factor bprm_stack_limits out of
      prepare_arg_pages.  This separates the counting, which depends upon
      getting data from userspace, from the calculation of the stack limits,
      which is usable in kernel_execve.
      
      Then remove prepare_arg_pages and compute bprm->argc and bprm->envc
      directly in do_execveat_common, before bprm_stack_limits is called.
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lkml.kernel.org/r/87365u6x60.fsf@x220.int.ebiederm.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      d8b9cd54
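The split described in this message can be pictured as a pure function of two counts and an rlimit, with the counting done elsewhere. The sketch below is a userspace approximation, not the kernel function: the constants `TOY_STK_LIM` and `TOY_ARG_MAX`, the three-quarters and one-quarter factors, and the name `toy_stack_limit` are assumptions made for illustration.

```c
#include <stddef.h>

#define TOY_STK_LIM (8UL * 1024 * 1024)   /* assumed default stack limit */
#define TOY_ARG_MAX 131072UL              /* assumed minimum space for strings */

/*
 * Compute how many bytes of argument and environment strings may be placed
 * on the new stack, given counts that were obtained earlier.
 * Returns 0 and stores the result, or -1 if the pointers alone do not fit.
 */
int toy_stack_limit(unsigned long rlim_cur, int argc, int envc,
                    unsigned long *limit_out)
{
    unsigned long limit = TOY_STK_LIM / 4 * 3;
    unsigned long quarter = rlim_cur / 4;
    unsigned long ptr_size;

    if (quarter < limit)
        limit = quarter;          /* at most a quarter of the stack rlimit */
    if (limit < TOY_ARG_MAX)
        limit = TOY_ARG_MAX;      /* but never below the guaranteed minimum */

    /* The argv[] and envp[] pointer arrays share the same budget. */
    ptr_size = ((unsigned long)argc + (unsigned long)envc) * sizeof(void *);
    if (limit <= ptr_size)
        return -1;

    *limit_out = limit - ptr_size;
    return 0;
}
```

Because the function never touches the strings themselves, the same calculation can serve a caller that counted userspace pointers and one that counted kernel pointers.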
    • E
      exec: Factor bprm_execve out of do_execve_common · 0c9cdff0
      Eric W. Biederman committed
      Currently it is necessary for the usermode helper code and the code
      that launches init to use set_fs so that pages coming from the kernel
      look like they are coming from userspace.
      
      To allow that usage of set_fs to be removed cleanly the argument
      copying from userspace needs to happen earlier.  Factor bprm_execve
      out of do_execve_common to separate out the copying of arguments
      to the new stack, and the rest of exec.
      
      In separating bprm_execve from do_execve_common the copying
      of the arguments onto the new stack happens earlier.
      
      As the copying of the arguments does not depend on any security hooks,
      files, the file table, current->in_execve, current->fs->in_exec,
      bprm->unsafe, or creds this is safe.
      
      Likewise the security hook security_creds_for_exec does not depend upon
      preventing the argument copying from happening.
      
      In addition to making it possible to implement kernel_execve that
      performs the copying differently, this separation of bprm_execve from
      do_execve_common makes for a nice separation of responsibilities making
      the exec code easier to navigate.
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lkml.kernel.org/r/878sfm6x6x.fsf@x220.int.ebiederm.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      0c9cdff0
    • E
      exec: Move bprm_mm_init into alloc_bprm · f18ac551
      Eric W. Biederman committed
      Currently it is necessary for the usermode helper code and the code that
      launches init to use set_fs so that pages coming from the kernel look like
      they are coming from userspace.
      
      To allow that usage of set_fs to be removed cleanly the argument copying
      from userspace needs to happen earlier.  Move the allocation and
      initialization of bprm->mm into alloc_bprm so that the bprm->mm is
      available early to store the new user stack into.  This is a prerequisite
      for copying argv and envp into the new user stack early before the rest of
      exec.
      
      To keep things consistent, the cleanup of bprm->mm is moved into
      free_bprm, so that bprm->mm will be cleaned up whenever bprm->mm is
      allocated and free_bprm is called.
      
      Moving bprm_mm_init earlier is safe as it does not depend on any files,
      current->in_execve, current->fs->in_exec, bprm->unsafe, or whether the file
      table is shared. (AKA bprm_mm_init does not depend on any of the code that
      happens between alloc_bprm and where it was previously called.)
      
      This moves bprm->mm cleanup after current->fs->in_exec is set to 0.  This
      is safe because current->fs->in_exec is only used to prevent taking an
      additional reference on the fs_struct.
      
      This moves bprm->mm cleanup after current->in_execve is set to 0.  This is
      safe because current->in_execve is only used by the LSMs (apparmor and
      tomoyo) and always for LSM specific functions, never for anything to do
      with the mm.
      
      This adds bprm->mm cleanup into the successful return path.  This is safe
      because being on the successful return path implies that begin_new_exec
      succeeded and set bprm->mm to NULL.  As bprm->mm is NULL, the bprm->mm
      cleanup I am moving into free_bprm will do nothing.
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lkml.kernel.org/r/87eepe6x7p.fsf@x220.int.ebiederm.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      f18ac551
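The argument about the successful return path is an ownership-transfer pattern: the cleanup function is safe to call unconditionally because the success path has already set the pointer to NULL. A small userspace sketch, with all type and function names (`toy_bprm`, `toy_new_mm`, `toy_alloc_bprm`, `toy_begin_new_exec`, `toy_free_bprm`) invented for illustration:

```c
#include <stdlib.h>

struct toy_new_mm { int unused; };
struct toy_bprm { struct toy_new_mm *mm; };

int toy_mm_cleanups;    /* counts how often free_bprm had an mm to clean up */

/* Allocation and mm setup live together, so the mm is available early. */
struct toy_bprm *toy_alloc_bprm(void)
{
    struct toy_bprm *bprm = calloc(1, sizeof(*bprm));

    if (!bprm)
        return NULL;
    bprm->mm = calloc(1, sizeof(*bprm->mm));
    if (!bprm->mm) {
        free(bprm);
        return NULL;
    }
    return bprm;
}

/* Point of no return: the task takes over the mm and bprm forgets it. */
struct toy_new_mm *toy_begin_new_exec(struct toy_bprm *bprm)
{
    struct toy_new_mm *mm = bprm->mm;

    bprm->mm = NULL;
    return mm;
}

/* Cleanup mirrors allocation and is a no-op for the mm after a successful exec. */
void toy_free_bprm(struct toy_bprm *bprm)
{
    if (bprm->mm) {
        free(bprm->mm);
        toy_mm_cleanups++;
    }
    free(bprm);
}
```

On a failure path `toy_free_bprm` releases the mm; after `toy_begin_new_exec` it finds NULL and only frees the bprm itself.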
    • E
      exec: Move initialization of bprm->filename into alloc_bprm · 60d9ad1d
      Eric W. Biederman committed
      Currently it is necessary for the usermode helper code and the code
      that launches init to use set_fs so that pages coming from the kernel
      look like they are coming from userspace.
      
      To allow that usage of set_fs to be removed cleanly the argument
      copying from userspace needs to happen earlier.  Move the computation
      of bprm->filename and possible allocation of a name in the case
      of execveat into alloc_bprm to make that possible.
      
      The executable name, the arguments, and the environment are
      copied into the new usermode stack which is stored in bprm
      until exec passes the point of no return.
      
      As the executable name is copied first onto the usermode stack
      it needs to be known.  As there are no dependencies to computing
      the executable name, compute it early in alloc_bprm.
      
      As an implementation detail if the filename needs to be generated
      because it embeds a file descriptor store that filename in a new field
      bprm->fdpath, and free it in free_bprm.  Previously this was done in
      an independent variable pathbuf.  I have renamed pathbuf to fdpath
      because fdpath is more suggestive of what kind of path is in the
      variable.  I moved fdpath into struct linux_binprm because it is
      tightly tied to the other variables in struct linux_binprm, and as
      such is needed to allow the call alloc_binprm to move.
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lkml.kernel.org/r/87k0z66x8f.fsf@x220.int.ebiederm.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      60d9ad1d
    • E
      exec: Factor out alloc_bprm · 0a8f36eb
      Eric W. Biederman committed
      Currently it is necessary for the usermode helper code and the code
      that launches init to use set_fs so that pages coming from the kernel
      look like they are coming from userspace.
      
      To allow that usage of set_fs to be removed cleanly the argument
      copying from userspace needs to happen earlier.  Move the allocation
      of the bprm into its own function (alloc_bprm) and move the call of
      alloc_bprm before unshare_files so that bprm can ultimately be
      allocated, the arguments can be placed on the new stack, and then the
      bprm can be passed into the core of exec.
      
      Neither the allocation of struct binprm nor the unsharing depends upon the
      other, so swapping the order in which they are called is trivially safe.
      
      To keep things consistent, the order of cleanup at the end of
      do_execve_common is swapped to match the order of initialization.
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Link: https://lkml.kernel.org/r/87pn8y6x9a.fsf@x220.int.ebiederm.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      0a8f36eb
  10. 04 Jul, 2020 1 commit
  11. 10 Jun, 2020 2 commits
  12. 09 Jun, 2020 2 commits
  13. 05 Jun, 2020 2 commits
  14. 30 May, 2020 2 commits
    • E
      exec: Compute file based creds only once · 56305aa9
      Eric W. Biederman committed
      Move the computation of creds from prepare_binfmt into begin_new_exec
      so that the creds need only be computed once.  This is just code
      reorganization; no semantic changes of any kind are made.
      
      Moving the computation is safe.  I have looked through the kernel and
      verified none of the binfmts look at bprm->cred directly, and that
      there are no helpers that look at bprm->cred indirectly.  Which means
      that it is not a problem to compute the bprm->cred later in the
      execution flow as it is not used until it becomes current->cred.
      
      A new function bprm_creds_from_file is added to contain the work that
      needs to be done.  bprm_creds_from_file first determines which file
      (bprm->executable or, most likely, bprm->file) the bprm->creds
      will be computed from.
      
      The function bprm_fill_uid is updated to receive the file instead of
      accessing bprm->file.  The now unnecessary work needed to reset the
      bprm->cred->euid, and bprm->cred->egid is removed from bprm_fill_uid.
      A small comment is added to document that bprm_fill_uid now only deals
      with the work to handle suid and sgid files.  The default case is
      already handled by prepare_exec_creds.
      
      The function security_bprm_repopulate_creds is renamed
      security_bprm_creds_from_file and now is explicitly passed the file
      from which to compute the creds.  The documentation of the
      bprm_creds_from_file security hook is updated to explain when the hook
      is called and what it needs to do.  The file is passed from
      cap_bprm_creds_from_file into get_file_caps so that the caps are
      computed for the appropriate file.  The now unnecessary work in
      cap_bprm_creds_from_file to reset the ambient capabilities has been
      removed.  A small comment has been added to document that the work of
      cap_bprm_creds_from_file is to read capabilities from the file's
      security attribute and derive capabilities from the fact that the
      user had uid 0.
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      56305aa9
    • E
      exec: Add a per bprm->file version of per_clear · a7868323
      Eric W. Biederman committed
      There is a small bug in the code that recomputes parts of bprm->cred
      for every bprm->file.  The code never recomputes the part of
      clear_dangerous_personality_flags it is responsible for.
      
      Which means that in practice if someone creates a sgid script
      the interpreter will not be able to use any of:
      	READ_IMPLIES_EXEC
      	ADDR_NO_RANDOMIZE
      	ADDR_COMPAT_LAYOUT
      	MMAP_PAGE_ZERO.
      
      This accidental clearing of personality flags probably does
      not matter in practice because no one has complained,
      but it does make the code more difficult to understand.
      
      Further, remaining bug compatible prevents the recomputation from being
      removed and replaced by simply computing bprm->cred once from the
      final bprm->file.
      
      Making this change removes the last behavior difference between
      computing bprm->creds from the final file and recomputing
      bprm->cred several times.  This allows the behavior change
      to be justified for its own reasons, and lets any bug hunts
      looking into why the behavior changed wind up here instead
      of in the code that will follow that computes bprm->cred
      from the final bprm->file.
      
      This small logic bug appears to have existed since the code
      started clearing dangerous personality bits.
      
      History Tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      Fixes: 1bb0fa189c6a ("[PATCH] NX: clean up legacy binary support")
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      a7868323
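The bug and its fix can be modeled as the difference between a clear-mask that accumulates across every file examined during exec and one that is recomputed for each file. The following is a toy model under stated assumptions: `toy_file`, `toy_exec_personality` and the `per_file` switch are invented for the sketch, and only the four flags listed in the message are taken from `<sys/personality.h>`.

```c
#include <sys/personality.h>

/* The flags the message lists as being cleared for setid executables. */
#define TOY_PER_CLEAR_ON_SETID \
    (READ_IMPLIES_EXEC | ADDR_NO_RANDOMIZE | ADDR_COMPAT_LAYOUT | MMAP_PAGE_ZERO)

/* One entry per file examined during exec, e.g. a script and then its interpreter. */
struct toy_file { int is_setid; };

/*
 * Return the personality left after exec walks `chain`.
 * per_file == 0 models the old behavior: a setid file anywhere in the chain
 * leaves its mask behind. per_file != 0 models the per-file version: the
 * mask is reset for each file, so only the final file decides.
 */
unsigned int toy_exec_personality(unsigned int personality,
                                  const struct toy_file *chain, int n,
                                  int per_file)
{
    unsigned int sticky_mask = 0;
    unsigned int file_mask = 0;
    int i;

    for (i = 0; i < n; i++) {
        file_mask = 0;                  /* recomputed for every file */
        if (chain[i].is_setid) {
            if (per_file)
                file_mask |= TOY_PER_CLEAR_ON_SETID;
            else
                sticky_mask |= TOY_PER_CLEAR_ON_SETID;
        }
    }
    return personality & ~(sticky_mask | file_mask);
}
```

With a setid script followed by an ordinary interpreter, the old model strips `ADDR_NO_RANDOMIZE` from the interpreter while the per-file model keeps it; when the final file is itself setid, both models clear the flags.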
  15. 21 May, 2020 4 commits