1. 08 Apr 2021, 1 commit
    • stack: Optionally randomize kernel stack offset each syscall · 39218ff4
      Authored by Kees Cook
      This provides the ability for architectures to enable kernel stack base
      address offset randomization. This feature is controlled by the boot
      param "randomize_kstack_offset=on/off", with its default value set by
      CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.
      
      This feature is based on the original idea from the last public release
      of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt
      All the credit for the original idea goes to the PaX team. Note that
      the design and implementation of this upstream randomize_kstack_offset
      feature differs greatly from the RANDKSTACK feature (see below).
      
      Reasoning for the feature:
      
      This feature aims to make the various stack-based attacks that rely
      on a deterministic stack structure harder to carry out. We have had
      many such attacks in the past (just to name a few):
      
      https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf
      https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
      https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
      
      As Linux kernel stack protections have been constantly improving
      (vmap-based stack allocation with guard pages, removal of thread_info,
      STACKLEAK), attackers have had to find new ways for their exploits
      to work. They have done so, continuing to rely on the kernel's stack
      determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT
      were not relevant. For example, the following recent attacks would have
      been hampered if the stack offset was non-deterministic between syscalls:
      
      https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf
      (page 70: targeting the pt_regs copy with linear stack overflow)
      
      https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html
      (leaked stack address from one syscall as a target during next syscall)
      
      The main idea is that since the stack offset is randomized on each system
      call, it is harder for an attack to reliably land in any particular place
      on the thread stack, even with address exposures, as the stack base will
      change on the next syscall. Also, since randomization is performed after
      placing pt_regs, the ptrace-based approach[1] to discover the randomized
      offset during a long-running syscall should not be possible.
      
      Design description:
      
      During most of the kernel's execution, it runs on the "thread stack",
      which is pretty deterministic in its structure: it is fixed in size,
      and on every syscall entry from userspace, construction of the thread
      stack starts from an address fetched from the per-cpu
      cpu_current_top_of_stack variable. The first element to be pushed to the
      thread stack is the pt_regs struct that stores all required CPU registers
      and syscall parameters. Finally the specific syscall function is called,
      with the stack being used as the kernel executes the resulting request.
      
      The goal of the randomize_kstack_offset feature is to add a random offset
      after the pt_regs has been pushed to the stack and before the rest of the
      thread stack is used during the syscall processing, and to change it every
      time a process issues a syscall. The source of randomness is currently
      architecture-defined (but x86 is using the low byte of rdtsc()). Future
      improvements for different entropy sources are possible, but out of scope
      for this patch. Furthermore, to add more unpredictability, new offsets
      are chosen at the end of syscalls (the timing of which should be less
      easy to measure from userspace than at syscall entry time), and stored
      in a per-CPU variable, so that the life of the value does not stay
      explicitly tied to a single task.
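
      A minimal userspace C model of the offset lifecycle described above (a
      sketch only, not kernel code; names such as pseudo_kstack_offset are
      illustrative): the stored offset is consumed at "syscall entry", and a
      fresh value derived from the low byte of the TSC is mixed in at "syscall
      exit", with a plain global standing in for the per-CPU variable.

          #include <stdint.h>
          #include <stdio.h>
          #include <x86intrin.h>                  /* __rdtsc() */

          /* Stand-in for the kernel's per-CPU offset variable. */
          static uint32_t pseudo_kstack_offset;

          /* "Syscall entry": consume the stored offset, masked to 10 bits
           * (mirroring the "and $0x3ff" visible in the assembly below). */
          static uint32_t offset_for_this_syscall(void)
          {
                  return pseudo_kstack_offset & 0x3FF;
          }

          /* "Syscall exit": mix in fresh entropy (low byte of the TSC), so
           * the value is not refreshed at the easier-to-observe entry time. */
          static void refresh_offset(void)
          {
                  pseudo_kstack_offset ^= (uint32_t)(__rdtsc() & 0xFF);
          }

          int main(void)
          {
                  for (int i = 0; i < 4; i++) {
                          printf("syscall %d would shift the stack by %u bytes\n",
                                 i, offset_for_this_syscall());
                          refresh_offset();
                  }
                  return 0;
          }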
      
      As suggested by Andy Lutomirski, the offset is added using alloca()
      and an empty asm() statement with an output constraint: this avoids
      changes to the assembly syscall entry code and to the unwinder, and it
      provides correct stack alignment as defined by the compiler.
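
      A minimal userspace sketch of that trick (illustrative only, not the
      upstream macro): alloca() moves the stack pointer down by a runtime
      amount, and the empty asm() with an output constraint on the allocation
      keeps the compiler from discarding it; the simulated_syscall() and
      syscall_body() helpers below are hypothetical.

          #include <alloca.h>
          #include <stdint.h>
          #include <stdio.h>

          static void syscall_body(int nr)
          {
                  int local;      /* its address shows where the stack ended up */
                  printf("syscall %d: stack local at %p\n", nr, (void *)&local);
          }

          static void simulated_syscall(int nr, uint32_t offset)
          {
                  /* Consume a runtime-chosen amount of stack before the work. */
                  char *ptr = alloca(offset & 0x3FF);
                  /* Empty asm() with an output constraint: the compiler must
                   * assume the allocation is written, so it stays in place. */
                  asm volatile("" : "=m" (*ptr));
                  syscall_body(nr);
          }

          int main(void)
          {
                  simulated_syscall(0, 0x040);
                  simulated_syscall(1, 0x3a8);
                  return 0;
          }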
      
      In order to make this available by default with zero performance impact
      for those that don't want it, it is boot-time selectable with static
      branches. This way, if the overhead is not wanted, it can just be
      left turned off with no performance impact.
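
      A kernel-style sketch of that boot-time switch (hedged: it mirrors the
      shape of such a static-key setup rather than the exact upstream code,
      and it will not build outside a kernel tree):

          #include <linux/init.h>
          #include <linux/jump_label.h>
          #include <linux/kernel.h>

          /* Off by default; flipped once at boot, so the runtime check is a
           * patched jump/no-op instead of a load-and-branch on the hot path. */
          DEFINE_STATIC_KEY_FALSE(randomize_kstack_offset);

          static int __init early_randomize_kstack_offset(char *buf)
          {
                  bool enable;
                  int err = kstrtobool(buf, &enable);

                  if (err)
                          return err;

                  if (enable)
                          static_branch_enable(&randomize_kstack_offset);
                  else
                          static_branch_disable(&randomize_kstack_offset);
                  return 0;
          }
          early_param("randomize_kstack_offset", early_randomize_kstack_offset);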
      
      The generated assembly for x86_64 with GCC looks like this:
      
      ...
      ffffffff81003977: 65 8b 05 02 ea 00 7f  mov %gs:0x7f00ea02(%rip),%eax
      					    # 12380 <kstack_offset>
      ffffffff8100397e: 25 ff 03 00 00        and $0x3ff,%eax
      ffffffff81003983: 48 83 c0 0f           add $0xf,%rax
      ffffffff81003987: 25 f8 07 00 00        and $0x7f8,%eax
      ffffffff8100398c: 48 29 c4              sub %rax,%rsp
      ffffffff8100398f: 48 8d 44 24 0f        lea 0xf(%rsp),%rax
      ffffffff81003994: 48 83 e0 f0           and $0xfffffffffffffff0,%rax
      ...
      
      As a result of the above stack alignment, this patch introduces about
      5 bits of randomness after pt_regs is spilled to the thread stack on
      x86_64, and 6 bits on x86_32 (since it has 1 fewer bit required for
      stack alignment). The amount of entropy could be adjusted based on how
      much of the stack space we wish to trade for security.
      
      My measurement of syscall performance overhead (on x86_64):
      
      lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null
          randomize_kstack_offset=y	Simple syscall: 0.7082 microseconds
          randomize_kstack_offset=n	Simple syscall: 0.7016 microseconds
      
      So, roughly 0.9% overhead growth for a no-op syscall, which is very
      manageable. And for people who don't want this, it's off by default.
      
      There are two gotchas with using the alloca() trick. First,
      compilers that have Stack Clash protection (-fstack-clash-protection)
      enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to
      any dynamic stack allocations. While the randomization offset is
      always less than a page, the resulting assembly would still contain
      (unreachable!) probing routines, bloating the generated code. To
      avoid this, -fno-stack-clash-protection is unconditionally added to
      the kernel Makefile since this is the only dynamic stack allocation in
      the kernel (now that VLAs have been removed) and it is provably safe
      from Stack Clash style attacks.
      
      The second gotcha with alloca() is a negative interaction with
      -fstack-protector*, in that it sees the alloca() as an array allocation,
      which triggers the unconditional addition of the stack canary function
      pre/post-amble which slows down syscalls regardless of the static
      branch. In order to avoid adding this unneeded check and its associated
      performance impact, architectures need to carefully remove uses of
      -fstack-protector-strong (or -fstack-protector) in the compilation units
      that use the add_random_kstack() macro and to audit the resulting stack
      mitigation coverage (to make sure no desired coverage disappears). No
      change is visible for this on x86 because the stack protector is already
      unconditionally disabled for the compilation unit, but the change is
      required on arm64. There is, unfortunately, no attribute that can be
      used to disable stack protector for specific functions.
      
      Comparison to PaX RANDKSTACK feature:
      
      The RANDKSTACK feature randomizes the location of the stack start
      (cpu_current_top_of_stack), i.e. including the location of pt_regs
      structure itself on the stack. Initially this patch followed the same
      approach, but during the recent discussions[2], it has been determined
      to be of little value since, if ptrace functionality is available for
      an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write
      different offsets in the pt_regs struct, observe the cache behavior of
      the pt_regs accesses, and figure out the random stack offset. Another
      difference is that the random offset is stored in a per-cpu variable,
      rather than having it be per-thread. As a result, these implementations
      differ a fair bit in their details and results, though
      obviously the intent is similar.
      
      [1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4BC57C1@IRSMSX102.ger.corp.intel.com/
      [2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/
      [3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.html
      Co-developed-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org
  2. 27 Feb 2021, 1 commit
  3. 25 Feb 2021, 1 commit
  4. 17 Feb 2021, 1 commit
  5. 15 Feb 2021, 1 commit
  6. 12 Feb 2021, 1 commit
  7. 09 Feb 2021, 5 commits
  8. 29 Jan 2021, 2 commits
  9. 28 Jan 2021, 1 commit
    • perf/intel: Remove Perfmon-v4 counter_freezing support · 3daa96d6
      Authored by Peter Zijlstra
      Perfmon-v4 counter freezing is fundamentally broken; remove this
      default-disabled code to make sure nobody uses it.
      
      The feature is called Freeze-on-PMI in the SDM, and if it actually did
      that, there wouldn't be a problem; *however*, it does something subtly
      different: it globally disables the whole PMU when it raises the PMI,
      not when the PMI hits.
      
      This means there's a window between the PMI getting raised and the PMI
      actually getting served where we lose events, and this violates
      perf counter independence. That is, a counting event should not result
      in a different event count when there is a sampling event co-scheduled.
      
      This is known to break existing software (RR).
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  10. 27 Jan 2021, 1 commit
  11. 15 Jan 2021, 1 commit
  12. 14 Jan 2021, 1 commit
  13. 13 Jan 2021, 1 commit
  14. 08 Jan 2021, 1 commit
  15. 07 Jan 2021, 3 commits
  16. 05 Jan 2021, 3 commits
    • rcu: Enable rcu_normal_after_boot unconditionally for RT · 36221e10
      Authored by Julia Cartwright
      Expedited RCU grace periods send IPIs to all non-idle CPUs, and thus can
      disrupt time-critical code in real-time applications.  However, there
      is a portion of boot-time processing (presumably before any real-time
      applications have started) where expedited RCU grace periods are the only
      option.  And so it is that experience with the -rt patchset indicates that
      PREEMPT_RT systems should always set the rcupdate.rcu_normal_after_boot
      kernel boot parameter.
      
      This commit therefore makes the post-boot application environment safe
      for real-time applications by making PREEMPT_RT systems disable the
      rcupdate.rcu_normal_after_boot kernel boot parameter and act as
      if this parameter had been set.  This means that post-boot calls to
      synchronize_rcu_expedited() will be treated as if they were instead
      calls to synchronize_rcu(), thus preventing the IPIs, and thus avoiding
      disrupting real-time applications.
      Suggested-by: Luiz Capitulino <lcapitulino@redhat.com>
      Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Julia Cartwright <julia@ni.com>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      [ paulmck: Update kernel-parameters.txt accordingly. ]
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
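
      A hedged sketch, in kernel C, of what this behavior change can look like
      (not necessarily the exact upstream hunk): the existing parameter
      defaults to enabled when CONFIG_PREEMPT_RT is set, and its module_param()
      registration is compiled out so it cannot be overridden at boot.

          /* Sketch: default on for PREEMPT_RT, and hide the parameter. */
          static int rcu_normal_after_boot = IS_ENABLED(CONFIG_PREEMPT_RT);
          #ifndef CONFIG_PREEMPT_RT
          module_param(rcu_normal_after_boot, int, 0444);
          #endif
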
    • rcu: Unconditionally use rcuc threads on PREEMPT_RT · 8b9a0ecc
      Authored by Scott Wood
      PREEMPT_RT systems have long used the rcutree.use_softirq kernel
      boot parameter to avoid use of RCU_SOFTIRQ handlers, which can disrupt
      real-time applications by invoking callbacks during return from interrupts
      that arrived while executing time-critical code.  This kernel boot
      parameter instead runs RCU core processing in an 'rcuc' kthread, thus
      allowing the scheduler to do its job of avoiding disrupting time-critical
      code.
      
      This commit therefore disables the rcutree.use_softirq kernel boot
      parameter on PREEMPT_RT systems, thus forcing such systems to do RCU
      core processing in 'rcuc' kthreads.  This approach has long been in
      use by users of the -rt patchset, and there have been no complaints.
      There is therefore no way for the system administrator to override this
      choice, at least without modifying and rebuilding the kernel.
      Signed-off-by: Scott Wood <swood@redhat.com>
      [bigeasy: Reword commit message]
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      [ paulmck: Update kernel-parameters.txt accordingly. ]
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
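
      The analogous hedged sketch for this commit (again not necessarily the
      exact upstream hunk): rcutree.use_softirq defaults to false on
      PREEMPT_RT, and the parameter is compiled out so it cannot be flipped
      back at boot.

          /* Sketch: force rcuc kthreads on PREEMPT_RT, and hide the parameter. */
          static bool use_softirq = !IS_ENABLED(CONFIG_PREEMPT_RT);
          #ifndef CONFIG_PREEMPT_RT
          module_param(use_softirq, bool, 0444);
          #endif
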
    • doc: Remove obsolete rcutree.rcu_idle_lazy_gp_delay boot parameter · 2252ec14
      Authored by Paul E. McKenney
      This commit removes documentation for the rcutree.rcu_idle_lazy_gp_delay
      kernel boot parameter given that this parameter no longer exists.
      
      Fixes: 77a40f97 ("rcu: Remove kfree_rcu() special casing and lazy-callback handling")
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  17. 10 Dec 2020, 1 commit
  18. 04 Dec 2020, 1 commit
  19. 01 Dec 2020, 1 commit
  20. 25 Nov 2020, 1 commit
  21. 19 Nov 2020, 2 commits
    • powerpc/64s: flush L1D after user accesses · 9a32a7e7
      Authored by Nicholas Piggin
      IBM Power9 processors can speculatively operate on data in the L1 cache
      before it has been completely validated, via a way-prediction mechanism. It
      is not possible for an attacker to determine the contents of impermissible
      memory using this method, since these systems implement a combination of
      hardware and software security measures to prevent scenarios where
      protected data could be leaked.
      
      However these measures don't address the scenario where an attacker induces
      the operating system to speculatively execute instructions using data that
      the attacker controls. This can be used for example to speculatively bypass
      "kernel user access prevention" techniques, as discovered by Anthony
      Steinhauser of Google's Safeside Project. This is not an attack by itself,
      but there is a possibility it could be used in conjunction with
      side-channels or other weaknesses in the privileged code to construct an
      attack.
      
      This issue can be mitigated by flushing the L1 cache between privilege
      boundaries of concern. This patch flushes the L1 cache after user accesses.
      
      This is part of the fix for CVE-2020-4788.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: flush L1D on kernel entry · f7964378
      Authored by Nicholas Piggin
      IBM Power9 processors can speculatively operate on data in the L1 cache
      before it has been completely validated, via a way-prediction mechanism. It
      is not possible for an attacker to determine the contents of impermissible
      memory using this method, since these systems implement a combination of
      hardware and software security measures to prevent scenarios where
      protected data could be leaked.
      
      However these measures don't address the scenario where an attacker induces
      the operating system to speculatively execute instructions using data that
      the attacker controls. This can be used for example to speculatively bypass
      "kernel user access prevention" techniques, as discovered by Anthony
      Steinhauser of Google's Safeside Project. This is not an attack by itself,
      but there is a possibility it could be used in conjunction with
      side-channels or other weaknesses in the privileged code to construct an
      attack.
      
      This issue can be mitigated by flushing the L1 cache between privilege
      boundaries of concern. This patch flushes the L1 cache on kernel entry.
      
      This is part of the fix for CVE-2020-4788.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  22. 17 Nov 2020, 1 commit
  23. 31 Oct 2020, 1 commit
  24. 23 Oct 2020, 1 commit
  25. 20 Oct 2020, 1 commit
    • xen/events: defer eoi in case of excessive number of events · e99502f7
      Authored by Juergen Gross
      If rogue guests send events at high frequency, it might happen that
      xen_evtchn_do_upcall() never stops processing events in dom0. As this
      is done in IRQ handling, a crash might be the result.
      
      In order to avoid that, delay further inter-domain events after some
      time in xen_evtchn_do_upcall() by forcing eoi processing into a
      worker on the same cpu, thus inhibiting new events coming in.
      
      The time after which eoi processing is to be delayed is configurable
      via a new module parameter "event_loop_timeout" which specifies the
      maximum event loop time in jiffies (default: 2, the value was chosen
      after some tests showing that a value of 2 was the lowest with only a
      slight drop of dom0 network throughput while multiple guests
      performed an event storm).
      
      How long eoi processing will be delayed can be specified via another
      parameter "event_eoi_delay" (again in jiffies, default 10, again the
      value was chosen after testing with different delay values).
      
      This is part of XSA-332.
      
      Cc: stable@vger.kernel.org
      Reported-by: Julien Grall <julien@xen.org>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
      Reviewed-by: Wei Liu <wl@xen.org>
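
      A hedged sketch of how the two knobs described above can be exposed as
      module parameters (jiffies-valued, with the defaults given in the text;
      the exact upstream declarations and permissions may differ).

          #include <linux/moduleparam.h>

          /* Sketch: tunables described above, both in jiffies. */
          static uint event_loop_timeout = 2;
          module_param(event_loop_timeout, uint, 0644);

          static uint event_eoi_delay = 10;
          module_param(event_eoi_delay, uint, 0644);
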
  26. 17 Oct 2020, 1 commit
  27. 06 Oct 2020, 1 commit
  28. 25 Sep 2020, 3 commits