1. 29 Oct 2022, 1 commit
  2. 26 Oct 2022, 1 commit
  3. 27 Sep 2022, 4 commits
  4. 14 Sep 2022, 1 commit
    • Revert "fs/exec: allow to unshare a time namespace on vfork+exec" · 33a2d6bc
      Authored by Andrei Vagin
      This reverts commit 133e2d3e.
      
      Alexey pointed out a few undesirable side effects of the reverted change.
      First, it doesn't take into account that CLONE_VFORK can be used with
      CLONE_THREAD. Second, a child process doesn't enter the target time
      namespace if its parent dies before the child calls exec; this happens
      because the dying parent clears vfork_done.
      
      Eric W. Biederman suggests installing a time namespace when a task gets a
      new mm. That covers all new processes cloned without CLONE_VM and all
      tasks that call exec(). This is a user API change, but we think there
      aren't users that depend on the old behavior.
      
      It is too late to make such changes in this release, so let's roll back
      this patch and introduce the right one in the next release.
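      
      For reference, the behavior under discussion targets roughly this
      userspace pattern (an illustrative sketch, not code from the patch; it
      assumes a "sleep" binary on PATH and a glibc recent enough to define
      CLONE_NEWTIME):
      
      	#define _GNU_SOURCE
      	#include <sched.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/wait.h>
      
      	int main(void)
      	{
      		/* Create a new time namespace; the calling process stays
      		 * in its old one, and children enter the new one. */
      		if (unshare(CLONE_NEWTIME) == -1) {
      			perror("unshare(CLONE_NEWTIME)");
      			exit(1);
      		}
      
      		pid_t pid = vfork();
      		if (pid == 0) {
      			/* With the reverted patch, this exec would have
      			 * moved the vfork child into the new namespace. */
      			execlp("sleep", "sleep", "100", (char *)NULL);
      			_exit(127);	/* exec failed */
      		}
      		waitpid(pid, NULL, 0);
      		return 0;
      	}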
      
      Cc: Alexey Izbyshev <izbyshev@ispras.ru>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20220913102551.1121611-3-avagin@google.com
  5. 02 Sep 2022, 1 commit
  6. 17 Aug 2022, 1 commit
    • exec: Replace kmap{,_atomic}() with kmap_local_page() · 3a608cfe
      Authored by Fabio M. De Francesco
      The use of kmap() and kmap_atomic() are being deprecated in favor of
      kmap_local_page().
      
      There are two main problems with kmap(): (1) it comes with an overhead,
      as the mapping space is restricted and protected by a global lock for
      synchronization, and (2) it requires a global TLB invalidation when the
      kmap pool wraps around, and it might block until a slot becomes
      available when the mapping space is fully utilized.
      
      With kmap_local_page() the mappings are per thread, CPU local, can take
      page faults, and can be called from any context (including interrupts).
      It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
      tasks using it can be preempted and, when they are scheduled to run
      again, the kernel virtual addresses are restored and remain valid.
      
      Since the use of kmap_local_page() in exec.c is safe, it should be
      preferred everywhere in exec.c.
      
      As noted above, since kmap_local_page() can also be called from atomic context,
      and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
      implicit preempt_disable(), this function can also safely replace
      kmap_atomic().
      
      Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
      fs/exec.c.
      
      Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
      with HIGHMEM64GB enabled.
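      
      The conversion pattern is essentially the following (a sketch of a
      copy_string_kernel()-style code path, not the exact hunk from the
      patch):
      
      	/* before: disables pagefaults/preemption as a side effect */
      	kaddr = kmap_atomic(page);
      	memcpy(kaddr + offset_in_page(pos), kernel, bytes_to_copy);
      	kunmap_atomic(kaddr);
      
      	/* after: the mapping is thread-local, may fault, and allows
      	 * preemption while it is held */
      	kaddr = kmap_local_page(page);
      	memcpy(kaddr + offset_in_page(pos), kernel, bytes_to_copy);
      	kunmap_local(kaddr);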
      
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Suggested-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
  7. 10 Aug 2022, 1 commit
  8. 28 Jul 2022, 1 commit
  9. 12 Jul 2022, 1 commit
  10. 02 Jul 2022, 1 commit
  11. 15 Jun 2022, 1 commit
  12. 19 May 2022, 1 commit
  13. 13 May 2022, 1 commit
    • mm/mprotect: use mmu_gather · 4a18419f
      Authored by Nadav Amit
      Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6.
      
      This patchset is intended to remove unnecessary TLB flushes during
      mprotect() syscalls.  Once this patchset makes it through, similar and
      further optimizations for MADV_COLD and userfaultfd would be possible.
      
      Basically, there are 3 optimizations in this patch-set:
      
      1. Use TLB batching infrastructure to batch flushes across VMAs and do
         better/fewer flushes.  This would also be handy for later userfaultfd
         enhancements.
      
      2. Avoid unnecessary TLB flushes.  This optimization is the one that
         provides most of the performance benefits.  Unlike previous versions,
         we now only avoid flushes that would not result in spurious
         page-faults.
      
      3. Avoid TLB flushes on change_huge_pmd() that are only needed to
         prevent the A/D bits from changing.
      
      Andrew asked for some benchmark numbers.  I do not have a deterministic
      macrobenchmark in which the benefit is easy to show, so I ran a
      microbenchmark instead: a loop that performs the following on anonymous
      memory, as a sanity check that time is saved by avoiding TLB flushes.
      The loop goes:
      
      	mprotect(p, PAGE_SIZE, PROT_READ)
      	mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
      	*p = 0; // make the page writable
      
      The test was run in a KVM guest with 1 or 2 threads (the second thread
      was busy-looping).  I measured the time (cycles) of each operation:
      
      		1 thread		2 threads
      		mmots	+patch		mmots	+patch
      PROT_READ	3494	2725 (-22%)	8630	7788 (-10%)
      PROT_READ|WRITE	3952	2724 (-31%)	9075	2865 (-68%)
      
      [ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]
      
      The exact numbers are really meaningless, but the benefit is clear.  There
      are two interesting results though.
      
      (1) PROT_READ is cheaper, even though one would expect it to be
      unaffected.  This is presumably due to the TLB miss that is avoided.
      
      (2) Without the memory access (*p = 0), the speedup from the patch is even
      greater.  In that scenario mprotect(PROT_READ) also avoids the TLB flush.
      As a result, both operations on the patched kernel take roughly ~1500
      cycles (with either 1 or 2 threads), whereas on mmotm their cost is as
      high as presented in the table.
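      
      For reference, a minimal userspace rendition of that loop (an
      illustrative sketch, x86-only because of the rdtsc timing; not the
      exact harness used for the numbers above):
      
      	#include <stdio.h>
      	#include <sys/mman.h>
      	#include <x86intrin.h>		/* __rdtsc(), x86 only */
      
      	int main(void)
      	{
      		unsigned long long ro = 0, rw = 0;
      		const int iters = 100000;
      		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
      			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      		if (p == MAP_FAILED)
      			return 1;
      
      		for (int i = 0; i < iters; i++) {
      			unsigned long long t0 = __rdtsc();
      			mprotect(p, 4096, PROT_READ);
      			unsigned long long t1 = __rdtsc();
      			mprotect(p, 4096, PROT_READ | PROT_WRITE);
      			unsigned long long t2 = __rdtsc();
      			*p = 0;	/* write so the page must be made writable */
      			ro += t1 - t0;
      			rw += t2 - t1;
      		}
      		printf("PROT_READ: %llu cycles, PROT_READ|WRITE: %llu cycles\n",
      		       ro / iters, rw / iters);
      		return 0;
      	}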
      
      
      This patch (of 3):
      
      change_pXX_range() currently does not use mmu_gather, but instead
      implements its own deferred TLB flush scheme.  This both complicates the
      code, as developers need to be aware of different invalidation schemes,
      and prevents opportunities to avoid TLB flushes or perform them in finer
      granularity.
      
      The use of mmu_gather for modified PTEs has benefits in various scenarios
      even if pages are not released.  For instance, if only a single page needs
      to be flushed out of a range of many pages, only that page would be
      flushed.  If a THP page is flushed, on x86 a single TLB invlpg instruction
      can be used instead of 512 instructions (or a full TLB flush, which
      Linux would actually use by default).  mprotect() over multiple VMAs
      requires a single flush.
      
      Use mmu_gather in change_pXX_range().  As the pages are not released, only
      record the flushed range using tlb_flush_pXX_range().
      
      Handle THP similarly and get rid of flush_cache_range() which becomes
      redundant since tlb_start_vma() calls it when needed.
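      
      The resulting structure in change_protection() looks roughly like this
      (a sketch of the flow described above, with details elided):
      
      	struct mmu_gather tlb;
      
      	tlb_gather_mmu(&tlb, vma->vm_mm);
      	tlb_start_vma(&tlb, vma);	/* does flush_cache_range() when needed */
      	/* in change_pXX_range(): record instead of flushing immediately */
      	tlb_flush_pte_range(&tlb, addr, PAGE_SIZE);	/* per modified PTE */
      	tlb_flush_pmd_range(&tlb, addr, HPAGE_PMD_SIZE);	/* per modified PMD */
      	tlb_end_vma(&tlb, vma);
      	tlb_finish_mmu(&tlb);		/* one batched TLB flush at the end */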
      
      Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com
      Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.com
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  14. 07 May 2022, 2 commits
    • fork: Stop allowing kthreads to call execve · 1b2552cb
      Authored by Eric W. Biederman
      Now that kernel_execve is no longer called from kernel threads, stop
      supporting kernel threads calling kernel_execve.
      
      Remove the code for converting a kthread to a normal thread in execve.
      
      Document the restriction that kthreads may not call kernel_execve by
      having kernel_execve fail if called by a kthread.
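      
      The restriction amounts to a guard of roughly this shape at the top of
      kernel_execve() (a sketch based on the description above; the -EINVAL
      choice is an assumption):
      
      	/* kernel threads have no user-space context to exec into */
      	if (WARN_ON_ONCE(current->flags & PF_KTHREAD))
      		return -EINVAL;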
      
      Link: https://lkml.kernel.org/r/20220506141512.516114-7-ebiederm@xmission.com
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • kthread: Don't allocate kthread_struct for init and umh · 343f4c49
      Authored by Eric W. Biederman
      If kthread_is_per_cpu runs concurrently with free_kthread_struct, the
      kthread_struct that was just freed may be read from.
      
      This bug was introduced by commit 40966e31 ("kthread: Ensure
      struct kthread is present for all kthreads"), when kthread_struct
      started to be allocated for all tasks that have PF_KTHREAD set.  This
      in turn required the kthread_struct to be freed in kernel_execve and
      violated the assumption that kthread_struct has the same lifetime as
      the task.
      
      Looking a bit deeper, this only applies to callers of kernel_execve,
      which are just the init process and the user mode helper processes.
      These processes really don't want to be kernel threads, but are for
      historical reasons: mostly, copy_thread does not know how to hand a
      kernel mode function to a process unless PF_KTHREAD or PF_IO_WORKER
      is set.
      
      Solve this by not allocating kthread_struct for the init process and
      the user mode helper processes.
      
      This is done by adding a kthread member to struct kernel_clone_args,
      setting kthread in fork_idle and kernel_thread, and adding
      user_mode_thread, which works like kernel_thread except that it does
      not set kthread.  In fork, the kthread_struct is then only allocated
      if .kthread is set.
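      
      Sketched, the new helper looks roughly like this (shape follows the
      description above; the details are an approximation, not the exact
      patch):
      
      	/* like kernel_thread(), but does not mark the child a kthread */
      	pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags)
      	{
      		struct kernel_clone_args args = {
      			.flags		= ((lower_32_bits(flags) | CLONE_VM |
      					    CLONE_UNTRACED) & ~CSIGNAL),
      			.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
      			.stack		= (unsigned long)fn,
      			.stack_size	= (unsigned long)arg,
      			/* .kthread deliberately left 0 */
      		};
      
      		return kernel_clone(&args);
      	}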
      
      I have looked at kernel/kthread.c and since commit 40966e31
      ("kthread: Ensure struct kthread is present for all kthreads") there
      have been no assumptions added that to_kthread or __to_kthread will
      not return NULL.
      
      There are a few callers of to_kthread or __to_kthread that assume a
      non-NULL struct kthread pointer will be returned.  These functions are
      kthread_data(), kthread_parkme(), kthread_exit(), kthread(),
      kthread_park(), kthread_unpark(), and kthread_stop().  All of those
      functions can reasonably be expected to be called only when it is known
      that a task is a kthread, so that assumption seems reasonable.
      
      Cc: stable@vger.kernel.org
      Fixes: 40966e31 ("kthread: Ensure struct kthread is present for all kthreads")
      Reported-by: Максим Кутявин <maximkabox13@gmail.com>
      Link: https://lkml.kernel.org/r/20220506141512.516114-1-ebiederm@xmission.com
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  15. 11 Mar 2022, 1 commit
  16. 02 Mar 2022, 2 commits
  17. 25 Feb 2022, 1 commit
    • uaccess: remove CONFIG_SET_FS · 967747bb
      Authored by Arnd Bergmann
      There are no remaining callers of set_fs(), so CONFIG_SET_FS
      can be removed globally, along with the thread_info field and
      any references to it.
      
      This turns access_ok() into a cheaper check against TASK_SIZE_MAX.
      
      As CONFIG_SET_FS is now gone, drop all remaining references to
      set_fs()/get_fs(), mm_segment_t, user_addr_max() and uaccess_kernel().
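      
      With set_fs() gone, the generic check reduces to roughly the following
      (a sketch of the asm-generic form; architectures may differ):
      
      	static inline int __access_ok(const void __user *ptr, unsigned long size)
      	{
      		unsigned long limit = TASK_SIZE_MAX;	/* compile-time constant */
      		unsigned long addr = (unsigned long)ptr;
      
      		/* no per-task addr_limit to consult anymore */
      		return (size <= limit) && (addr <= (limit - size));
      	}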
      
      Acked-by: Sam Ravnborg <sam@ravnborg.org> # for sparc32 changes
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Tested-by: Sergey Matyukevich <sergey.matyukevich@synopsys.com> # for arc changes
      Acked-by: Stafford Horne <shorne@gmail.com> # [openrisc, asm-generic]
      Acked-by: Dinh Nguyen <dinguyen@kernel.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  18. 22 Jan 2022, 2 commits
  19. 20 Jan 2022, 2 commits
  20. 09 Jan 2022, 2 commits
    • signal: Remove the helper signal_group_exit · 49697335
      Authored by Eric W. Biederman
      This helper is misleading.  It tests for an ongoing exec as well as
      the process having received a fatal signal.
      
      Sometimes it is appropriate to treat an ongoing exec differently from
      a process that is shutting down due to a fatal signal.  In particular,
      taking the fast path out of exit_signals instead of retargeting signals
      is not appropriate during exec, and neither is changing the exit code
      in do_group_exit during exec.
      
      Removing the helper makes it more obvious what is going on as both
      cases must be coded for explicitly.
      
      While removing the helper, fix the two cases where I have observed that
      using signal_group_exit produced the wrong result.
      
      In exit_signals, only test for SIGNAL_GROUP_EXIT so that signals are
      retargeted during an exec.
      
      In do_group_exit, use 0 as the exit code during an exec, as de_thread
      does not set group_exit_code.  As best as I can determine,
      group_exit_code is set to 0 most of the time during de_thread: during a
      thread group stop, group_exit_code is set to the stop signal, and when
      the thread group receives SIGCONT it is reset to 0.
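      
      Open-coded, the two fixed call sites distinguish the cases roughly like
      this (a sketch of the logic described above, assuming group_exec_task
      marks an ongoing exec, per the next commit below):
      
      	/* exit_signals(): only a fatal signal may take the fast path,
      	 * so signals are still retargeted during an exec */
      	if (signal->flags & SIGNAL_GROUP_EXIT)
      		goto fast_path;
      
      	/* do_group_exit(): de_thread() does not set group_exit_code,
      	 * so an ongoing exec reports an exit code of 0 */
      	if (signal->flags & SIGNAL_GROUP_EXIT)
      		exit_code = signal->group_exit_code;
      	else if (signal->group_exec_task)
      		exit_code = 0;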
      
      Link: https://lkml.kernel.org/r/20211213225350.27481-8-ebiederm@xmission.com
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • signal: Rename group_exit_task group_exec_task · 60700e38
      Authored by Eric W. Biederman
      The only remaining user of group_exit_task is exec.  Rename the field
      so that it is clear which part of the code uses it.
      
      Update the comment above the definition of group_exec_task to document
      how it is currently used.
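      
      After the rename the field reads roughly as follows (the comment
      wording paraphrases the described usage; it is not the exact patch
      text):
      
      	struct signal_struct {
      		/* ... */
      		/* Used by exec only: the thread running de_thread(), to
      		 * be woken once the other threads have exited. */
      		struct task_struct	*group_exec_task;
      		/* ... */
      	};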
      
      Link: https://lkml.kernel.org/r/20211213225350.27481-7-ebiederm@xmission.com
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  21. 14 Dec 2021, 1 commit
    • kthread: Ensure struct kthread is present for all kthreads · 40966e31
      Authored by Eric W. Biederman
      Today the rules about which kernel threads have struct kthread present
      are a bit iffy and arbitrary.  Both idle threads and threads started
      with create_kthread want struct kthread present, so that is effectively
      all kernel threads.  Make the rule: if PF_KTHREAD is set and the task
      is running, then struct kthread is present.
      
      This will allow the kernel thread code to use tsk->exit_code
      with different semantics from ordinary processes.
      
      To ensure that struct kthread is present for all
      kernel threads, move its allocation into copy_process.
      
      Add a deallocation of struct kthread in exec for processes
      that were kernel threads.
      
      Move the allocation of struct kthread for the initial thread
      earlier so that it is not repeated for each additional idle
      thread.
      
      Move the initialization of struct kthread into set_kthread_struct
      so that the structure is always and reliably initialized.
      
      Clear set_child_tid in free_kthread_struct to ensure the kthread
      struct is reliably freed during exec.  The function
      free_kthread_struct does not need to clear vfork_done during exec as
      exec_mm_release called from exec_mmap has already cleared vfork_done.
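      
      A sketch of the allocation helper whose call moves into copy_process
      (the shape follows the description above; the details are an
      approximation, not the exact patch):
      
      	bool set_kthread_struct(struct task_struct *p)
      	{
      		struct kthread *kthread;
      
      		if (WARN_ON_ONCE(to_kthread(p)))
      			return false;
      
      		kthread = kzalloc(sizeof(*kthread), GFP_KERNEL);
      		if (!kthread)
      			return false;
      
      		init_completion(&kthread->exited);
      		init_completion(&kthread->parked);
      		/* struct kthread was located via set_child_tid at the time */
      		p->set_child_tid = (__force void __user *)kthread;
      		return true;
      	}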
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  22. 30 Oct 2021, 1 commit
  23. 07 Oct 2021, 1 commit
  24. 04 Sep 2021, 3 commits
  25. 24 Aug 2021, 1 commit
  26. 02 Jul 2021, 1 commit
  27. 01 May 2021, 2 commits
    • Reimplement RLIMIT_NPROC on top of ucounts · 21d1c5e3
      Authored by Alexey Gladkov
      The rlimit counter is tied to uid in the user_namespace. This allows
      rlimit values to be specified in userns even if they are already
      globally exceeded by the user. However, the value of the previous
      user_namespaces cannot be exceeded.
      
      To illustrate the impact of rlimits, let's say there is a program that
      does not fork. Some service-A wants to run this program as user X in
      multiple containers. Since the program never forks, the service wants
      to set RLIMIT_NPROC=1.
      
      service-A
       \- program (uid=1000, container1, rlimit_nproc=1)
       \- program (uid=1000, container2, rlimit_nproc=1)
      
      Service-A sets RLIMIT_NPROC=1 and runs the program in container1. When
      service-A then tries to run the program with RLIMIT_NPROC=1 in
      container2, it fails since user X already has one running process.
      
      We cannot use the existing inc_ucounts / dec_ucounts because they do
      not allow the maximum for the counter to be exceeded. Some rlimits may
      be exceeded by root or by a user with the appropriate capability.
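      
      Per the changelog below, the new inc_rlimit_ucounts() charges every
      level of the namespace hierarchy and returns the top value, roughly as
      follows (a sketch assuming the ucounts/ucount_max layout described):
      
      	long inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
      	{
      		struct ucounts *iter;
      		long ret = 0;
      
      		/* charge this userns and every ancestor userns */
      		for (iter = ucounts; iter; iter = iter->ns->ucounts) {
      			long max = READ_ONCE(iter->ns->ucount_max[type]);
      			long new = atomic_long_add_return(v, &iter->ucount[type]);
      
      			if (new < 0 || new > max)
      				ret = LONG_MAX;	/* over a limit somewhere */
      			else if (iter == ucounts)
      				ret = new;	/* top value for the caller */
      		}
      		return ret;
      	}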
      
      Changelog
      
      v11:
      * Change inc_rlimit_ucounts(), which now returns the top value of ucounts.
      * Drop inc_rlimit_ucounts_and_test() because the return code of
        inc_rlimit_ucounts() can be checked.
      Signed-off-by: Alexey Gladkov <legion@kernel.org>
      Link: https://lkml.kernel.org/r/c5286a8aa16d2d698c222f7532f3d735c82bc6bc.1619094428.git.legion@kernel.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • Add a reference to ucounts for each cred · 905ae01c
      Authored by Alexey Gladkov
      For RLIMIT_NPROC and some other rlimits, the user_struct that holds the
      global limit is kept alive for the lifetime of a process by keeping it
      in struct cred. Adding a pointer to ucounts in struct cred will allow
      RLIMIT_NPROC to be tracked not only per user in the system, but per
      user in each user_namespace.
      
      Updating ucounts may require a memory allocation, which may fail. We
      therefore cannot change cred.ucounts in commit_creds(), because that
      function cannot fail and should always return 0. For this reason, we
      modify cred.ucounts before calling commit_creds().
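      
      The resulting calling pattern looks roughly like this (a sketch;
      set_cred_ucounts is the helper added by this change, and the error
      handling shown is an assumption):
      
      	struct cred *new = prepare_creds();
      	if (!new)
      		return -ENOMEM;
      
      	/* ... adjust uids/gids on the new cred ... */
      
      	/* may allocate, so do it before the cannot-fail commit_creds() */
      	if (set_cred_ucounts(new) < 0) {
      		abort_creds(new);
      		return -ENOMEM;
      	}
      	return commit_creds(new);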
      
      Changelog
      
      v6:
      * Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
        error was caused by the fact that cred_alloc_blank() left the ucounts
        pointer empty.
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Alexey Gladkov <legion@kernel.org>
      Link: https://lkml.kernel.org/r/b37aaef28d8b9b0d757e07ba6dd27281bbe39259.1619094428.git.legion@kernel.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  28. 25 Feb 2021, 1 commit
  29. 30 Jan 2021, 1 commit