1. 15 November 2022, 2 commits
    • stack: Introduce CONFIG_RANDOMIZE_KSTACK_OFFSET · bcc7cf3a
      Marco Elver authored
      mainline inclusion
      from mainline-v5.18-rc1
      commit 8cb37a59
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I5YQ6Z
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8cb37a5974a48569aab8a1736d21399fddbdbdb2
      
      --------------------------------
      
      The randomize_kstack_offset feature is unconditionally compiled in when
      the architecture supports it.
      
      To add constraints on compiler versions, we require a dedicated Kconfig
      variable. Therefore, introduce RANDOMIZE_KSTACK_OFFSET.
      
      Furthermore, this option is now also configurable by EXPERT kernels:
      while the feature is supposed to have zero performance overhead when
      disabled, due to its use of static branches, there are a few cases where
      giving a distribution the option to disable the feature entirely makes
      sense. For example, very resource-constrained environments would never
      enable the feature to begin with, and for them the additional kernel
      code size increase would be redundant.
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20220131090521.1947110-1-elver@google.com
      Signed-off-by: Yi Yang <yiyang13@huawei.com>
      Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Reviewed-by: GONG Ruiqi <gongruiqi1@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      bcc7cf3a
    • stack: Optionally randomize kernel stack offset each syscall · 18ca505d
      Kees Cook authored
      mainline inclusion
      from mainline-v5.13-rc1
      commit 39218ff4
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I5YQ6Z
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=39218ff4c625dbf2e68224024fe0acaa60bcd51a
      
      --------------------------------
      
      This provides the ability for architectures to enable kernel stack base
      address offset randomization. This feature is controlled by the boot
      param "randomize_kstack_offset=on/off", with its default value set by
      CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.
      
      This feature is based on the original idea from the last public release
      of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt
      All the credit for the original idea goes to the PaX team. Note that
      the design and implementation of this upstream randomize_kstack_offset
      feature differs greatly from the RANDKSTACK feature (see below).
      
      Reasoning for the feature:
      
      This feature aims to make the various stack-based attacks that rely on
      a deterministic stack structure harder. We have had many such attacks in
      the past (just to name a few):
      
      https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf
      https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
      https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
      
      As Linux kernel stack protections have been constantly improving
      (vmap-based stack allocation with guard pages, removal of thread_info,
      STACKLEAK), attackers have had to find new ways for their exploits
      to work. They have done so, continuing to rely on the kernel's stack
      determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT
      were not relevant. For example, the following recent attacks would have
      been hampered if the stack offset was non-deterministic between syscalls:
      
      https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf
      (page 70: targeting the pt_regs copy with linear stack overflow)
      
      https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html
      (leaked stack address from one syscall as a target during next syscall)
      
      The main idea is that since the stack offset is randomized on each system
      call, it is harder for an attack to reliably land in any particular place
      on the thread stack, even with address exposures, as the stack base will
      change on the next syscall. Also, since randomization is performed after
      placing pt_regs, the ptrace-based approach[1] to discover the randomized
      offset during a long-running syscall should not be possible.
      
      Design description:
      
      During most of the kernel's execution, it runs on the "thread stack",
      which is pretty deterministic in its structure: it is fixed in size,
      and on every entry from userspace to kernel on a syscall the thread
      stack starts construction from an address fetched from the per-cpu
      cpu_current_top_of_stack variable. The first element to be pushed to the
      thread stack is the pt_regs struct that stores all required CPU registers
      and syscall parameters. Finally the specific syscall function is called,
      with the stack being used as the kernel executes the resulting request.
      
      The goal of the randomize_kstack_offset feature is to add a random offset
      after the pt_regs has been pushed to the stack and before the rest of the
      thread stack is used during the syscall processing, and to change it every
      time a process issues a syscall. The source of randomness is currently
      architecture-defined (but x86 is using the low byte of rdtsc()). Future
      improvements for different entropy sources are possible, but out of scope
      for this patch. Furthermore, to add more unpredictability, new offsets
      are chosen at the end of syscalls (the timing of which should be less
      easy to measure from userspace than at syscall entry time), and stored
      in a per-CPU variable, so that the life of the value does not stay
      explicitly tied to a single task.
      
      As suggested by Andy Lutomirski, the offset is added using alloca()
      and an empty asm() statement with an output constraint, since it avoids
      changes to assembly syscall entry code, to the unwinder, and provides
      correct stack alignment as defined by the compiler.
      
      In order to make this available by default with zero performance impact
      for those that don't want it, it is boot-time selectable with static
      branches. This way, if the overhead is not wanted, it can just be
      left turned off with no performance impact.
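      To make the mechanism concrete, here is a small userspace sketch of the
      alloca() + empty asm() approach described above. It is illustrative only,
      not the kernel code: kstack_offset, the 0x3ff mask (taken from the
      assembly shown below) and the function names merely mimic the per-CPU
      variable and macro the patch describes.
      
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>
      
      static uint32_t kstack_offset;	/* stands in for the per-CPU variable */
      
      /* Plays the role of the syscall body; its frame sits below the gap. */
      static void __attribute__((noinline)) do_syscall_work(void)
      {
      	int probe;
      	printf("stack local at %p\n", (void *)&probe);
      }
      
      static void handle_syscall(void)
      {
      	/* Move the stack pointer down by a bounded, pseudo-random amount... */
      	uint8_t *ptr = __builtin_alloca(kstack_offset & 0x3ff);
      	/* ...and keep the allocation alive without touching assembly entry
      	 * code or the unwinder. */
      	__asm__ __volatile__("" :: "r"(ptr) : "memory");
      
      	do_syscall_work();
      
      	/* Choose the next offset at "syscall exit", as the patch does. */
      	kstack_offset = (uint32_t)rand();
      }
      
      int main(void)
      {
      	for (int i = 0; i < 4; i++)
      		handle_syscall();	/* the printed address varies per call */
      	return 0;
      }
      
      Across calls the printed address jitters within a bounded window, which is
      the per-syscall variation the feature adds to the real thread stack.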
      
      The generated assembly for x86_64 with GCC looks like this:
      
      ...
      ffffffff81003977: 65 8b 05 02 ea 00 7f  mov %gs:0x7f00ea02(%rip),%eax
      					    # 12380 <kstack_offset>
      ffffffff8100397e: 25 ff 03 00 00        and $0x3ff,%eax
      ffffffff81003983: 48 83 c0 0f           add $0xf,%rax
      ffffffff81003987: 25 f8 07 00 00        and $0x7f8,%eax
      ffffffff8100398c: 48 29 c4              sub %rax,%rsp
      ffffffff8100398f: 48 8d 44 24 0f        lea 0xf(%rsp),%rax
      ffffffff81003994: 48 83 e0 f0           and $0xfffffffffffffff0,%rax
      ...
      
      As a result of the above stack alignment, this patch introduces about
      5 bits of randomness after pt_regs is spilled to the thread stack on
      x86_64, and 6 bits on x86_32 (since it has 1 fewer bit required for
      stack alignment). The amount of entropy could be adjusted based on how
      much of the stack space we wish to trade for security.
      
      My measure of syscall performance overhead (on x86_64):
      
      lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null
          randomize_kstack_offset=y	Simple syscall: 0.7082 microseconds
          randomize_kstack_offset=n	Simple syscall: 0.7016 microseconds
      
      So, roughly 0.9% overhead growth for a no-op syscall, which is very
      manageable. And for people that don't want this, it's off by default.
      
      There are two gotchas with using the alloca() trick. First,
      compilers that have Stack Clash protection (-fstack-clash-protection)
      enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to
      any dynamic stack allocations. While the randomization offset is
      always less than a page, the resulting assembly would still contain
      (unreachable!) probing routines, bloating the resulting assembly. To
      avoid this, -fno-stack-clash-protection is unconditionally added to
      the kernel Makefile since this is the only dynamic stack allocation in
      the kernel (now that VLAs have been removed) and it is provably safe
      from Stack Clash style attacks.
      
      The second gotcha with alloca() is a negative interaction with
      -fstack-protector*, in that it sees the alloca() as an array allocation,
      which triggers the unconditional addition of the stack canary function
      pre/post-amble which slows down syscalls regardless of the static
      branch. In order to avoid adding this unneeded check and its associated
      performance impact, architectures need to carefully remove uses of
      -fstack-protector-strong (or -fstack-protector) in the compilation units
      that use the add_random_kstack() macro and to audit the resulting stack
      mitigation coverage (to make sure no desired coverage disappears). No
      change is visible for this on x86 because the stack protector is already
      unconditionally disabled for the compilation unit, but the change is
      required on arm64. There is, unfortunately, no attribute that can be
      used to disable stack protector for specific functions.
      
      Comparison to PaX RANDKSTACK feature:
      
      The RANDKSTACK feature randomizes the location of the stack start
      (cpu_current_top_of_stack), i.e. including the location of pt_regs
      structure itself on the stack. Initially this patch followed the same
      approach, but during the recent discussions[2], it has been determined
      to be of little value since, if ptrace functionality is available for
      an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write
      different offsets in the pt_regs struct, observe the cache behavior of
      the pt_regs accesses, and figure out the random stack offset. Another
      difference is that the random offset is stored in a per-cpu variable,
      rather than having it be per-thread. As a result, these implementations
      differ a fair bit in their implementation details and results, though
      obviously the intent is similar.
      
      [1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4BC57C1@IRSMSX102.ger.corp.intel.com/
      [2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/
      [3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.html
      Co-developed-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org
      
      conflict:
      	Documentation/admin-guide/kernel-parameters.txt
      	arch/Kconfig
      Signed-off-by: Yi Yang <yiyang13@huawei.com>
      Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Reviewed-by: GONG Ruiqi <gongruiqi1@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      18ca505d
  2. 10 October 2022, 1 commit
    • x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node · 7417bfe2
      Jarkko Sakkinen authored
      mainline inclusion
      from mainline-5.17-rc1
      commit 50468e43
      category: feature
      bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I5USAM
      CVE: NA
      
      Intel-SIG: commit 50468e43 x86/sgx: Add an attribute for the amount
      of SGX memory in a NUMA node.
      Backport for SGX EDMM support.
      
      This patch adds a new element to the node_dev_groups[] array;
      however, in the 5.10 code the array is defined by the macro
      ATTRIBUTE_GROUPS(node_dev).
      
      To resolve the conflict, just expand the macro without any functional
      change.
      
      --------------------------------
      
      == Problem ==
      
      The amount of SGX memory on a system is determined by the BIOS and it
      varies wildly between systems.  It can be as small as dozens of MB's
      and as large as many GB's on servers.  Just like how applications need
      to know how much regular RAM is available, enclave builders need to
      know how much SGX memory an enclave can consume.
      
      == Solution ==
      
      Introduce a new sysfs file:
      
      	/sys/devices/system/node/nodeX/x86/sgx_total_bytes
      
      to enumerate the amount of SGX memory available in each NUMA node.
      This serves the same function for SGX as /proc/meminfo or
      /sys/devices/system/node/nodeX/meminfo does for normal RAM.
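      For illustration, a minimal userspace reader of the new attribute could
      look like the sketch below (node0 is assumed to exist and the kernel is
      assumed to expose SGX; otherwise the open simply fails):
      
      #include <stdio.h>
      
      int main(void)
      {
      	const char *path =
      		"/sys/devices/system/node/node0/x86/sgx_total_bytes";
      	unsigned long long bytes;
      	FILE *f = fopen(path, "r");
      
      	if (!f) {
      		perror(path);	/* no SGX, or kernel without this patch */
      		return 1;
      	}
      	if (fscanf(f, "%llu", &bytes) == 1)
      		printf("node0 SGX memory: %llu bytes (%.1f MiB)\n",
      		       bytes, bytes / (1024.0 * 1024.0));
      	fclose(f);
      	return 0;
      }
      
      This mirrors how the selftests mentioned below can size enclaves to the
      SGX memory actually present on a node.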
      
      'sgx_total_bytes' is needed today to help drive the SGX selftests.
      SGX-specific swap code is exercised by creating overcommitted enclaves
      which are larger than the physical SGX memory on the system.  They
      currently use a CPUID-based approach which can diverge from the actual
      amount of SGX memory available.  'sgx_total_bytes' ensures that the
      selftests can work efficiently and do not attempt stupid things like
      creating a 100,000 MB enclave on a system with 128 MB of SGX memory.
      
      == Implementation Details ==
      
      Introduce the CONFIG_HAVE_ARCH_NODE_DEV_GROUP opt-in flag to expose an
      arch-specific attribute group, and add an attribute for the amount of
      SGX memory in bytes to each NUMA node.
      
      == ABI Design Discussion ==
      
      As opposed to the per-node ABI, a single, global ABI was considered.
      However, this would prevent enclaves from being able to size
      themselves so that they fit on a single NUMA node.  Essentially, a
      single value would rule out NUMA optimizations for enclaves.
      
      Create a new "x86/" directory inside each "nodeX/" sysfs directory.
      'sgx_total_bytes' is expected to be the first of at least a few
      sgx-specific files to be placed in the new directory.  Just scanning
      /proc/meminfo, these are the no-brainers that we have for RAM, but we
      need for SGX:
      
      	MemTotal:       xxxx kB // sgx_total_bytes (implemented here)
      	MemFree:        yyyy kB // sgx_free_bytes
      	SwapTotal:      zzzz kB // sgx_swapped_bytes
      
      So, at *least* three.  I think we will eventually end up needing
      something more along the lines of a dozen.  A new directory (as
      opposed to the nodeX/ "root" directory) avoids cluttering the
      root with several "sgx_*" files.
      
      Place the new file in a new "nodeX/x86/" directory because SGX is
      highly x86-specific.  It is very unlikely that any other architecture
      (or even non-Intel x86 vendor) will ever implement SGX.  Using "sgx/"
      as opposed to "x86/" was also considered.  But, there is a real chance
      this can get used for other arch-specific purposes.
      
      [ dhansen: rewrite changelog ]
      Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20211116162116.93081-2-jarkko@kernel.org
      Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
      7417bfe2
  3. 06 December 2021, 1 commit
  4. 31 August 2021, 1 commit
    • kexec: Add quick kexec support for kernel · 742670c6
      Sang Yan authored
      hulk inclusion
      category: feature
      bugzilla: 48159
      CVE: N/A
      
      ------------------------------
      
      In normal kexec, relocating the kernel may take 5 to 10 seconds to
      copy all segments from vmalloc'ed memory to kernel boot memory,
      because the MMU is disabled.
      
      We introduce quick kexec to save this memory-copying time, in the
      same way as kdump (kexec on crash), by using a reserved memory
      region "Quick Kexec".
      
      To enable it, reserve memory and set up quick_kexec_res.
      
      The quick kimage is constructed in the same way as the crash kernel
      one; then all segments of the kimage are simply copied to the
      reserved memory.
      
      We also add this support to the kexec_load syscall via the
      KEXEC_QUICK flag.
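      As a rough sketch of how a loader would request this (hypothetical
      usage: KEXEC_QUICK comes from the patched uapi header and its value is
      not shown in this log; real segment setup for the kernel image and
      initrd is elided):
      
      #include <linux/kexec.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      #include <stdio.h>
      
      int main(void)
      {
      	/* A real loader fills this with the kernel/initrd segments. */
      	struct kexec_segment segments[1] = { 0 };
      	unsigned long entry = 0;	/* entry point of the loaded image */
      	unsigned long flags = KEXEC_QUICK | KEXEC_ARCH_DEFAULT;
      
      	if (syscall(SYS_kexec_load, entry, 1UL, segments, flags) < 0)
      		perror("kexec_load(KEXEC_QUICK)");
      	return 0;
      }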
      Signed-off-by: Sang Yan <sangyan@huawei.com>
      Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      742670c6
  5. 28 July 2021, 1 commit
  6. 14 July 2021, 2 commits
    • mm: speedup mremap on 1GB or larger regions · 68e04270
      Kalesh Singh authored
      mainline inclusion
      from mainline-v5.11-rc1
      commit c49dd340
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI
      CVE: NA
      
      -------------------------------------------------
      
      Android needs to move large memory regions for garbage collection.  The GC
      requires moving physical pages of multi-gigabyte heap using mremap.
      During this move, the application threads have to be paused for
      correctness.  It is critical to keep this pause as short as possible to
      avoid jitters during user interaction.
      
      Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD level if
      the source and destination addresses are PUD-aligned.  For
      CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves PGD
      entries, since the PUD entry is “folded back” onto the PGD entry.  Add
      HAVE_MOVE_PUD so that architectures where moving at the PUD level isn't
      supported/tested can turn this off by not selecting the config.
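      From userspace, the optimized path is exercised simply by moving a large
      mapping with mremap(). A minimal sketch follows; the addresses are
      illustrative hints only, and whether the PUD-level fast path is actually
      taken depends on 1GB alignment of source and destination and on
      HAVE_MOVE_PUD being selected.
      
      #define _GNU_SOURCE
      #include <sys/mman.h>
      #include <stdio.h>
      
      #define GiB (1UL << 30)
      
      int main(void)
      {
      	/* Ask for a 1GB anonymous region near a 1GB-aligned address. */
      	void *old = mmap((void *)(4 * GiB), GiB, PROT_READ | PROT_WRITE,
      			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
      	if (old == MAP_FAILED)
      		return 1;
      
      	/* Move it to another 1GB-aligned address; when both ends are
      	 * PUD-aligned, the kernel can relocate whole PUD entries instead
      	 * of walking lower-level page tables. */
      	void *new = mremap(old, GiB, GiB, MREMAP_MAYMOVE | MREMAP_FIXED,
      			   (void *)(8 * GiB));
      	if (new == MAP_FAILED)
      		return 1;
      
      	printf("moved %p -> %p\n", old, new);
      	return 0;
      }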
      
      Link: https://lkml.kernel.org/r/20201014005320.2233162-4-kaleshsingh@google.com
      Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Hassan Naveed <hnaveed@wavecomp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Reviewed-by: Chen Wandun <chenwandun@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      68e04270
    • mm/vmalloc: hugepage vmalloc mappings · 7954687a
      Nicholas Piggin authored
      mainline inclusion
      from mainline-5.13-rc1
      commit 121e6f32
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZGKZ
      CVE: NA
      
      -------------------------------------------------
      
      Support huge page vmalloc mappings.  Config option HAVE_ARCH_HUGE_VMALLOC
      enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
      supports PMD sized vmap mappings.
      
      vmalloc will attempt to allocate PMD-sized pages if allocating PMD size or
      larger, and fall back to small pages if that was unsuccessful.
      
      Architectures must ensure that any arch specific vmalloc allocations that
      require PAGE_SIZE mappings (e.g., module allocations vs strict module rwx)
      use the VM_NOHUGE flag to inhibit larger mappings.
      
      This can result in more internal fragmentation and memory overhead for a
      given allocation, so an option, nohugevmalloc, is added to disable it at boot.
      
      [colin.king@canonical.com: fix read of uninitialized pointer area]
        Link: https://lkml.kernel.org/r/20210318155955.18220-1-colin.king@canonical.com
      
      Link: https://lkml.kernel.org/r/20210317062402.533919-14-npiggin@gmail.com
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ding Tianhong <dingtianhong@huawei.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      Signed-off-by: Chen Wandun <chenwandun@huawei.com>
      Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      7954687a
  7. 06 July 2021, 1 commit
  8. 08 February 2021, 1 commit
  9. 27 January 2021, 1 commit
  10. 12 January 2021, 1 commit
  11. 01 December 2020, 1 commit
  12. 09 October 2020, 1 commit
  13. 16 September 2020, 1 commit
    • mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race · d53c3dfb
      Nicholas Piggin authored
      Reading and modifying current->mm and current->active_mm and switching
      mm should be done with irqs off, to prevent races seeing an intermediate
      state.
      
      This is similar to commit 38cf307c ("mm: fix kthread_use_mm() vs TLB
      invalidate"). At exec-time when the new mm is activated, the old one
      should usually be single-threaded and no longer used, unless something
      else is holding an mm_users reference (which may be possible).
      
      Absent other mm_users, there is also a race with preemption and lazy tlb
      switching. Consider the kernel_execve case where the current thread is
      using a lazy tlb active mm:
      
        call_usermodehelper()
          kernel_execve()
            old_mm = current->mm;
            active_mm = current->active_mm;
            *** preempt *** -------------------->  schedule()
                                                     prev->active_mm = NULL;
                                                     mmdrop(prev active_mm);
                                                   ...
                            <--------------------  schedule()
            current->mm = mm;
            current->active_mm = mm;
            if (!old_mm)
                mmdrop(active_mm);
      
      If we switch back to the kernel thread from a different mm, there is a
      double free of the old active_mm, and a missing free of the new one.
      
      Closing this race only requires interrupts to be disabled while ->mm
      and ->active_mm are being switched, but the TLB problem requires also
      holding interrupts off over activate_mm. Unfortunately not all archs
      can do that yet, e.g., arm defers the switch if irqs are disabled and
      expects finish_arch_post_lock_switch() to be called to complete the
      flush; um takes a blocking lock in activate_mm().
      
      So as a first step, disable interrupts across the mm/active_mm updates
      to close the lazy tlb preempt race, and provide an arch option to
      extend that to activate_mm which allows architectures doing IPI based
      TLB shootdowns to close the second race.
      
      This is a bit ugly, but in the interest of fixing the bug and backporting
      before all architectures are converted this is a compromise.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com
      d53c3dfb
  14. 09 September 2020, 1 commit
  15. 01 September 2020, 3 commits
  16. 06 August 2020, 1 commit
  17. 24 July 2020, 1 commit
    • entry: Provide generic syscall entry functionality · 142781e1
      Thomas Gleixner authored
      On syscall entry certain work needs to be done:
      
         - Establish state (lockdep, context tracking, tracing)
         - Conditional work (ptrace, seccomp, audit...)
      
      This code is needlessly duplicated and different in all
      architectures.
      
      Provide a generic version based on the x86 implementation which has all the
      RCU and instrumentation bits right.
      
      As interrupt/exception entry from user space needs parts of the same
      functionality, provide a function for this as well.
      
      syscall_enter_from_user_mode() and irqentry_enter_from_user_mode() must be
      called right after the low level ASM entry. The calling code must be
      non-instrumentable. After the functions return, the state is correct and
      the subsequent functions can be instrumented.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Kees Cook <keescook@chromium.org>
      Link: https://lkml.kernel.org/r/20200722220519.513463269@linutronix.de
      142781e1
  18. 07 July 2020, 1 commit
  19. 05 July 2020, 1 commit
  20. 27 June 2020, 1 commit
  21. 14 June 2020, 1 commit
    • treewide: replace '---help---' in Kconfig files with 'help' · a7f7f624
      Masahiro Yamada authored
      Since commit 84af7a61 ("checkpatch: kconfig: prefer 'help' over
      '---help---'"), the number of '---help---' has been gradually
      decreasing, but there are still more than 2400 instances.
      
      This commit finishes the conversion. While I touched the lines,
      I also fixed the indentation.
      
      There are a variety of indentation styles found.
      
        a) 4 spaces + '---help---'
        b) 7 spaces + '---help---'
        c) 8 spaces + '---help---'
        d) 1 space + 1 tab + '---help---'
        e) 1 tab + '---help---'    (correct indentation)
        f) 1 tab + 1 space + '---help---'
        g) 1 tab + 2 spaces + '---help---'
      
      In order to convert all of them to 1 tab + 'help', I ran the
      following command:
      
        $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      a7f7f624
  22. 19 May 2020, 1 commit
  23. 15 May 2020, 2 commits
    • scs: Disable when function graph tracing is enabled · ddc9863e
      Sami Tolvanen authored
      The graph tracer hooks returns by modifying frame records on the
      (regular) stack, but with SCS the return address is taken from the
      shadow stack, and the value in the frame record has no effect. As we
      don't currently have a mechanism to determine the corresponding slot
      on the shadow stack (and to pass this through the ftrace
      infrastructure), for now let's disable SCS when the graph tracer is
      enabled.
      
      With SCS the return address is taken from the shadow stack and the
      value in the frame record has no effect. The mcount based graph tracer
      hooks returns by modifying frame records on the (regular) stack, and
      thus is not compatible. The patchable-function-entry graph tracer
      used for DYNAMIC_FTRACE_WITH_REGS modifies the LR before it is saved
      to the shadow stack, and is compatible.
      
      Modifying the mcount based graph tracer to work with SCS would require
      a mechanism to determine the corresponding slot on the shadow stack
      (and to pass this through the ftrace infrastructure), and we expect
      that everyone will eventually move to the patchable-function-entry
      based graph tracer anyway, so for now let's disable SCS when the
      mcount-based graph tracer is enabled.
      
      SCS and patchable-function-entry are both supported from LLVM 10.x.
      Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
      ddc9863e
    • scs: Add support for Clang's Shadow Call Stack (SCS) · d08b9f0c
      Sami Tolvanen authored
      This change adds generic support for Clang's Shadow Call Stack,
      which uses a shadow stack to protect return addresses from being
      overwritten by an attacker. Details are available here:
      
        https://clang.llvm.org/docs/ShadowCallStack.html
      
      Note that security guarantees in the kernel differ from the ones
      documented for user space. The kernel must store addresses of
      shadow stacks in memory, which means an attacker capable of reading
      and writing arbitrary memory may be able to locate them and hijack
      control flow by modifying the stacks.
      Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      [will: Numerous cosmetic changes]
      Signed-off-by: Will Deacon <will@kernel.org>
      d08b9f0c
  24. 13 May 2020, 1 commit
  25. 16 March 2020, 3 commits
  26. 06 March 2020, 1 commit
  27. 14 February 2020, 1 commit
    • context-tracking: Introduce CONFIG_HAVE_TIF_NOHZ · 490f561b
      Frederic Weisbecker authored
      A few archs (x86, arm, arm64) no longer rely on TIF_NOHZ to call
      into context tracking on user entry/exit but instead use static keys
      (or not) to optimize those calls. Ideally every arch should migrate to
      that behaviour in the long run.
      
      Settle a config option to let those archs remove their TIF_NOHZ
      definitions.
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: David S. Miller <davem@davemloft.net>
      490f561b
  28. 04 February 2020, 6 commits