1. 28 2月, 2018 1 次提交
  2. 21 2月, 2018 6 次提交
  3. 20 2月, 2018 1 次提交
  4. 17 2月, 2018 1 次提交
  5. 15 2月, 2018 1 次提交
    • I
      x86/entry/64: Fix CR3 restore in paranoid_exit() · e4865757
      Ingo Molnar 提交于
      Josh Poimboeuf noticed the following bug:
      
       "The paranoid exit code only restores the saved CR3 when it switches back
        to the user GS.  However, even in the kernel GS case, it's possible that
        it needs to restore a user CR3, if for example, the paranoid exception
        occurred in the syscall exit path between SWITCH_TO_USER_CR3_STACK and
        SWAPGS."
      
      Josh also confirmed via targeted testing that it's possible to hit this bug.
      
      Fix the bug by also restoring CR3 in the paranoid_exit_no_swapgs branch.
      
      The reason we haven't seen this bug reported by users yet is probably because
      "paranoid" entry points are limited to the following cases:
      
       idtentry double_fault       do_double_fault  has_error_code=1  paranoid=2
       idtentry debug              do_debug         has_error_code=0  paranoid=1 shift_ist=DEBUG_STACK
       idtentry int3               do_int3          has_error_code=0  paranoid=1 shift_ist=DEBUG_STACK
       idtentry machine_check      do_mce           has_error_code=0  paranoid=1
      
      Amongst those entry points only machine_check is one that will interrupt an
      IRQS-off critical section asynchronously - and machine check events are rare.
      
      The other main asynchronous entries are NMI entries, which can be very high-freq
      with perf profiling, but they are special: they don't use the 'idtentry' macro but
      are open coded and restore user CR3 unconditionally so don't have this bug.
      Reported-and-tested-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Reviewed-by: NAndy Lutomirski <luto@kernel.org>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20180214073910.boevmg65upbk3vqb@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e4865757
  6. 13 2月, 2018 7 次提交
  7. 10 2月, 2018 1 次提交
    • N
      x86/mm/pti: Fix PTI comment in entry_SYSCALL_64() · 14b1fcc6
      Nadav Amit 提交于
      The comment is confusing since the path is taken when
      CONFIG_PAGE_TABLE_ISOLATION=y is disabled (while the comment says it is not
      taken).
      Signed-off-by: NNadav Amit <namit@vmware.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: nadav.amit@gmail.com
      Link: http://lkml.kernel.org/r/20180209170638.15161-1-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      14b1fcc6
  8. 06 2月, 2018 3 次提交
    • D
      x86/entry/64: Clear registers for exceptions/interrupts, to reduce speculation attack surface · 3ac6d8c7
      Dan Williams 提交于
      Clear the 'extra' registers on entering the 64-bit kernel for exceptions
      and interrupts. The common registers are not cleared since they are
      likely clobbered well before they can be exploited in a speculative
      execution attack.
      
      Originally-From: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/151787989146.7847.15749181712358213254.stgit@dwillia2-desk3.amr.corp.intel.com
      [ Made small improvements to the changelog and the code comments. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      3ac6d8c7
    • D
      x86/entry/64: Clear extra registers beyond syscall arguments, to reduce speculation attack surface · 8e1eb3fa
      Dan Williams 提交于
      At entry userspace may have (maliciously) populated the extra registers
      outside the syscall calling convention with arbitrary values that could
      be useful in a speculative execution (Spectre style) attack.
      
      Clear these registers to minimize the kernel's attack surface.
      
      Note, this only clears the extra registers and not the unused
      registers for syscalls less than 6 arguments, since those registers are
      likely to be clobbered well before their values could be put to use
      under speculation.
      
      Note, Linus found that the XOR instructions can be executed with
      minimized cost if interleaved with the PUSH instructions, and Ingo's
      analysis found that R10 and R11 should be included in the register
      clearing beyond the typical 'extra' syscall calling convention
      registers.
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reported-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/151787988577.7847.16733592218894189003.stgit@dwillia2-desk3.amr.corp.intel.com
      [ Made small improvements to the changelog and the code comments. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      8e1eb3fa
    • M
      membarrier/x86: Provide core serializing command · 10bcc80e
      Mathieu Desnoyers 提交于
      There are two places where core serialization is needed by membarrier:
      
      1) When returning from the membarrier IPI,
      2) After scheduler updates curr to a thread with a different mm, before
         going back to user-space, since the curr->mm is used by membarrier to
         check whether it needs to send an IPI to that CPU.
      
      x86-32 uses IRET as return from interrupt, and both IRET and SYSEXIT to go
      back to user-space. The IRET instruction is core serializing, but not
      SYSEXIT.
      
      x86-64 uses IRET as return from interrupt, which takes care of the IPI.
      However, it can return to user-space through either SYSRETL (compat
      code), SYSRETQ, or IRET. Given that SYSRET{L,Q} is not core serializing,
      we rely instead on write_cr3() performed by switch_mm() to provide core
      serialization after changing the current mm, and deal with the special
      case of kthread -> uthread (temporarily keeping current mm into
      active_mm) by adding a sync_core() in that specific case.
      
      Use the new sync_core_before_usermode() to guarantee this.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-10-mathieu.desnoyers@efficios.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      10bcc80e
  9. 31 1月, 2018 1 次提交
    • V
      x86/hyperv: Reenlightenment notifications support · 93286261
      Vitaly Kuznetsov 提交于
      Hyper-V supports Live Migration notification. This is supposed to be used
      in conjunction with TSC emulation: when a VM is migrated to a host with
      different TSC frequency for some short period the host emulates the
      accesses to TSC and sends an interrupt to notify about the event. When the
      guest is done updating everything it can disable TSC emulation and
      everything will start working fast again.
      
      These notifications weren't required until now as Hyper-V guests are not
      supposed to use TSC as a clocksource: in Linux the TSC is even marked as
      unstable on boot. Guests normally use 'tsc page' clocksource and host
      updates its values on migrations automatically.
      
      Things change when with nested virtualization: even when the PV
      clocksources (kvm-clock or tsc page) are passed through to the nested
      guests the TSC frequency and frequency changes need to be know..
      
      Hyper-V Top Level Functional Specification (as of v5.0b) wrongly specifies
      EAX:BIT(12) of CPUID:0x40000009 as the feature identification bit. The
      right one to check is EAX:BIT(13) of CPUID:0x40000003. I was assured that
      the fix in on the way.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: kvm@vger.kernel.org
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
      Cc: Roman Kagan <rkagan@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: devel@linuxdriverproject.org
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Cathy Avery <cavery@redhat.com>
      Cc: Mohammed Gamal <mmorsy@redhat.com>
      Link: https://lkml.kernel.org/r/20180124132337.30138-4-vkuznets@redhat.com
      93286261
  10. 30 1月, 2018 2 次提交
  11. 28 1月, 2018 1 次提交
  12. 19 1月, 2018 1 次提交
  13. 15 1月, 2018 1 次提交
    • D
      x86/retpoline: Fill RSB on context switch for affected CPUs · c995efd5
      David Woodhouse 提交于
      On context switch from a shallow call stack to a deeper one, as the CPU
      does 'ret' up the deeper side it may encounter RSB entries (predictions for
      where the 'ret' goes to) which were populated in userspace.
      
      This is problematic if neither SMEP nor KPTI (the latter of which marks
      userspace pages as NX for the kernel) are active, as malicious code in
      userspace may then be executed speculatively.
      
      Overwrite the CPU's return prediction stack with calls which are predicted
      to return to an infinite loop, to "capture" speculation if this
      happens. This is required both for retpoline, and also in conjunction with
      IBRS for !SMEP && !KPTI.
      
      On Skylake+ the problem is slightly different, and an *underflow* of the
      RSB may cause errant branch predictions to occur. So there it's not so much
      overwrite, as *filling* the RSB to attempt to prevent it getting
      empty. This is only a partial solution for Skylake+ since there are many
      other conditions which may result in the RSB becoming empty. The full
      solution on Skylake+ is to use IBRS, which will prevent the problem even
      when the RSB becomes empty. With IBRS, the RSB-stuffing will not be
      required on context switch.
      
      [ tglx: Added missing vendor check and slighty massaged comments and
        	changelog ]
      Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NArjan van de Ven <arjan@linux.intel.com>
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: thomas.lendacky@amd.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Link: https://lkml.kernel.org/r/1515779365-9032-1-git-send-email-dwmw@amazon.co.uk
      c995efd5
  14. 12 1月, 2018 1 次提交
    • D
      x86/retpoline/entry: Convert entry assembler indirect jumps · 2641f08b
      David Woodhouse 提交于
      Convert indirect jumps in core 32/64bit entry assembler code to use
      non-speculative sequences when CONFIG_RETPOLINE is enabled.
      
      Don't use CALL_NOSPEC in entry_SYSCALL_64_fastpath because the return
      address after the 'call' instruction must be *precisely* at the
      .Lentry_SYSCALL_64_after_fastpath label for stub_ptregs_64 to work,
      and the use of alternatives will mess that up unless we play horrid
      games to prepend with NOPs and make the variants the same length. It's
      not worth it; in the case where we ALTERNATIVE out the retpoline, the
      first instruction at __x86.indirect_thunk.rax is going to be a bare
      jmp *%rax anyway.
      Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Acked-by: NArjan van de Ven <arjan@linux.intel.com>
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: thomas.lendacky@amd.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Link: https://lkml.kernel.org/r/1515707194-20531-7-git-send-email-dwmw@amazon.co.uk
      2641f08b
  15. 24 12月, 2017 3 次提交
    • P
      x86/mm: Optimize RESTORE_CR3 · 21e94459
      Peter Zijlstra 提交于
      Most NMI/paranoid exceptions will not in fact change pagetables and would
      thus not require TLB flushing, however RESTORE_CR3 uses flushing CR3
      writes.
      
      Restores to kernel PCIDs can be NOFLUSH, because we explicitly flush the
      kernel mappings and now that we track which user PCIDs need flushing we can
      avoid those too when possible.
      
      This does mean RESTORE_CR3 needs an additional scratch_reg, luckily both
      sites have plenty available.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      21e94459
    • P
      x86/mm: Use/Fix PCID to optimize user/kernel switches · 6fd166aa
      Peter Zijlstra 提交于
      We can use PCID to retain the TLBs across CR3 switches; including those now
      part of the user/kernel switch. This increases performance of kernel
      entry/exit at the cost of more expensive/complicated TLB flushing.
      
      Now that we have two address spaces, one for kernel and one for user space,
      we need two PCIDs per mm. We use the top PCID bit to indicate a user PCID
      (just like we use the PFN LSB for the PGD). Since we do TLB invalidation
      from kernel space, the existing code will only invalidate the kernel PCID,
      we augment that by marking the corresponding user PCID invalid, and upon
      switching back to userspace, use a flushing CR3 write for the switch.
      
      In order to access the user_pcid_flush_mask we use PER_CPU storage, which
      means the previously established SWAPGS vs CR3 ordering is now mandatory
      and required.
      
      Having to do this memory access does require additional registers, most
      sites have a functioning stack and we can spill one (RAX), sites without
      functional stack need to otherwise provide the second scratch register.
      
      Note: PCID is generally available on Intel Sandybridge and later CPUs.
      Note: Up until this point TLB flushing was broken in this series.
      
      Based-on-code-from: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6fd166aa
    • D
      x86/mm/pti: Prepare the x86/entry assembly code for entry/exit CR3 switching · 8a09317b
      Dave Hansen 提交于
      PAGE_TABLE_ISOLATION needs to switch to a different CR3 value when it
      enters the kernel and switch back when it exits.  This essentially needs to
      be done before leaving assembly code.
      
      This is extra challenging because the switching context is tricky: the
      registers that can be clobbered can vary.  It is also hard to store things
      on the stack because there is an established ABI (ptregs) or the stack is
      entirely unsafe to use.
      
      Establish a set of macros that allow changing to the user and kernel CR3
      values.
      
      Interactions with SWAPGS:
      
        Previous versions of the PAGE_TABLE_ISOLATION code relied on having
        per-CPU scratch space to save/restore a register that can be used for the
        CR3 MOV.  The %GS register is used to index into our per-CPU space, so
        SWAPGS *had* to be done before the CR3 switch.  That scratch space is gone
        now, but the semantic that SWAPGS must be done before the CR3 MOV is
        retained.  This is good to keep because it is not that hard to do and it
        allows to do things like add per-CPU debugging information.
      
      What this does in the NMI code is worth pointing out.  NMIs can interrupt
      *any* context and they can also be nested with NMIs interrupting other
      NMIs.  The comments below ".Lnmi_from_kernel" explain the format of the
      stack during this situation.  Changing the format of this stack is hard.
      Instead of storing the old CR3 value on the stack, this depends on the
      *regular* register save/restore mechanism and then uses %r14 to keep CR3
      during the NMI.  It is callee-saved and will not be clobbered by the C NMI
      handlers that get called.
      
      [ PeterZ: ESPFIX optimization ]
      
      Based-on-code-from: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NBorislav Petkov <bp@suse.de>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: linux-mm@kvack.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      8a09317b
  16. 23 12月, 2017 1 次提交
    • D
      x86/entry: Rename SYSENTER_stack to CPU_ENTRY_AREA_entry_stack · 4fe2d8b1
      Dave Hansen 提交于
      If the kernel oopses while on the trampoline stack, it will print
      "<SYSENTER>" even if SYSENTER is not involved.  That is rather confusing.
      
      The "SYSENTER" stack is used for a lot more than SYSENTER now.  Give it a
      better string to display in stack dumps, and rename the kernel code to
      match.
      
      Also move the 32-bit code over to the new naming even though it still uses
      the entry stack only for SYSENTER.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      4fe2d8b1
  17. 17 12月, 2017 6 次提交
    • A
      x86/entry/64: Make cpu_entry_area.tss read-only · c482feef
      Andy Lutomirski 提交于
      The TSS is a fairly juicy target for exploits, and, now that the TSS
      is in the cpu_entry_area, it's no longer protected by kASLR.  Make it
      read-only on x86_64.
      
      On x86_32, it can't be RO because it's written by the CPU during task
      switches, and we use a task gate for double faults.  I'd also be
      nervous about errata if we tried to make it RO even on configurations
      without double fault handling.
      
      [ tglx: AMD confirmed that there is no problem on 64-bit with TSS RO.  So
        	it's probably safe to assume that it's a non issue, though Intel
        	might have been creative in that area. Still waiting for
        	confirmation. ]
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NBorislav Petkov <bpetkov@suse.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Link: https://lkml.kernel.org/r/20171204150606.733700132@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c482feef
    • A
      x86/entry: Clean up the SYSENTER_stack code · 0f9a4810
      Andy Lutomirski 提交于
      The existing code was a mess, mainly because C arrays are nasty.  Turn
      SYSENTER_stack into a struct, add a helper to find it, and do all the
      obvious cleanups this enables.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NBorislav Petkov <bpetkov@suse.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Link: https://lkml.kernel.org/r/20171204150606.653244723@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0f9a4810
    • A
      x86/entry/64: Create a per-CPU SYSCALL entry trampoline · 3386bc8a
      Andy Lutomirski 提交于
      Handling SYSCALL is tricky: the SYSCALL handler is entered with every
      single register (except FLAGS), including RSP, live.  It somehow needs
      to set RSP to point to a valid stack, which means it needs to save the
      user RSP somewhere and find its own stack pointer.  The canonical way
      to do this is with SWAPGS, which lets us access percpu data using the
      %gs prefix.
      
      With PAGE_TABLE_ISOLATION-like pagetable switching, this is
      problematic.  Without a scratch register, switching CR3 is impossible, so
      %gs-based percpu memory would need to be mapped in the user pagetables.
      Doing that without information leaks is difficult or impossible.
      
      Instead, use a different sneaky trick.  Map a copy of the first part
      of the SYSCALL asm at a different address for each CPU.  Now RIP
      varies depending on the CPU, so we can use RIP-relative memory access
      to access percpu memory.  By putting the relevant information (one
      scratch slot and the stack address) at a constant offset relative to
      RIP, we can make SYSCALL work without relying on %gs.
      
      A nice thing about this approach is that we can easily switch it on
      and off if we want pagetable switching to be configurable.
      
      The compat variant of SYSCALL doesn't have this problem in the first
      place -- there are plenty of scratch registers, since we don't care
      about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
      at all.
      
      This patch actually seems to be a small speedup.  With this patch,
      SYSCALL touches an extra cache line and an extra virtual page, but
      the pipeline no longer stalls waiting for SWAPGS.  It seems that, at
      least in a tight loop, the latter outweights the former.
      
      Thanks to David Laight for an optimization tip.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NBorislav Petkov <bpetkov@suse.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Link: https://lkml.kernel.org/r/20171204150606.403607157@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3386bc8a
    • A
      x86/entry/64: Return to userspace from the trampoline stack · 3e3b9293
      Andy Lutomirski 提交于
      By itself, this is useless.  It gives us the ability to run some final code
      before exit that cannnot run on the kernel stack.  This could include a CR3
      switch a la PAGE_TABLE_ISOLATION or some kernel stack erasing, for
      example.  (Or even weird things like *changing* which kernel stack gets
      used as an ASLR-strengthening mechanism.)
      
      The SYSRET32 path is not covered yet.  It could be in the future or
      we could just ignore it and force the slow path if needed.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NBorislav Petkov <bp@suse.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Link: https://lkml.kernel.org/r/20171204150606.306546484@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3e3b9293
    • A
      x86/entry/64: Use a per-CPU trampoline stack for IDT entries · 7f2590a1
      Andy Lutomirski 提交于
      Historically, IDT entries from usermode have always gone directly
      to the running task's kernel stack.  Rearrange it so that we enter on
      a per-CPU trampoline stack and then manually switch to the task's stack.
      This touches a couple of extra cachelines, but it gives us a chance
      to run some code before we touch the kernel stack.
      
      The asm isn't exactly beautiful, but I think that fully refactoring
      it can wait.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NBorislav Petkov <bp@suse.de>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Link: https://lkml.kernel.org/r/20171204150606.225330557@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      7f2590a1
    • B
      x86/entry/64/paravirt: Use paravirt-safe macro to access eflags · e17f8234
      Boris Ostrovsky 提交于
      Commit 1d3e53e8 ("x86/entry/64: Refactor IRQ stacks and make them
      NMI-safe") added DEBUG_ENTRY_ASSERT_IRQS_OFF macro that acceses eflags
      using 'pushfq' instruction when testing for IF bit. On PV Xen guests
      looking at IF flag directly will always see it set, resulting in 'ud2'.
      
      Introduce SAVE_FLAGS() macro that will use appropriate save_fl pv op when
      running paravirt.
      Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: xen-devel@lists.xenproject.org
      Link: https://lkml.kernel.org/r/20171204150604.899457242@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e17f8234
  18. 23 11月, 2017 1 次提交
    • A
      x86/entry/64: Add missing irqflags tracing to native_load_gs_index() · ca37e57b
      Andy Lutomirski 提交于
      Running this code with IRQs enabled (where dummy_lock is a spinlock):
      
      static void check_load_gs_index(void)
      {
      	/* This will fail. */
      	load_gs_index(0xffff);
      
      	spin_lock(&dummy_lock);
      	spin_unlock(&dummy_lock);
      }
      
      Will generate a lockdep warning.  The issue is that the actual write
      to %gs would cause an exception with IRQs disabled, and the exception
      handler would, as an inadvertent side effect, update irqflag tracing
      to reflect the IRQs-off status.  native_load_gs_index() would then
      turn IRQs back on and return with irqflag tracing still thinking that
      IRQs were off.  The dummy lock-and-unlock causes lockdep to notice the
      error and warn.
      
      Fix it by adding the missing tracing.
      
      Apparently nothing did this in a context where it mattered.  I haven't
      tried to find a code path that would actually exhibit the warning if
      appropriately nasty user code were running.
      
      I suspect that the security impact of this bug is very, very low --
      production systems don't run with lockdep enabled, and the warning is
      mostly harmless anyway.
      
      Found during a quick audit of the entry code to try to track down an
      unrelated bug that Ingo found in some still-in-development code.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/e1aeb0e6ba8dd430ec36c8a35e63b429698b4132.1511411918.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ca37e57b
  19. 22 11月, 2017 1 次提交
    • A
      x86/entry/64: Fix entry_SYSCALL_64_after_hwframe() IRQ tracing · 548c3050
      Andy Lutomirski 提交于
      When I added entry_SYSCALL_64_after_hwframe(), I left TRACE_IRQS_OFF
      before it.  This means that users of entry_SYSCALL_64_after_hwframe()
      were responsible for invoking TRACE_IRQS_OFF, and the one and only
      user (Xen, added in the same commit) got it wrong.
      
      I think this would manifest as a warning if a Xen PV guest with
      CONFIG_DEBUG_LOCKDEP=y were used with context tracking.  (The
      context tracking bit is to cause lockdep to get invoked before we
      turn IRQs back on.)  I haven't tested that for real yet because I
      can't get a kernel configured like that to boot at all on Xen PV.
      
      Move TRACE_IRQS_OFF below the label.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: 8a9949bc ("x86/xen/64: Rearrange the SYSCALL entries")
      Link: http://lkml.kernel.org/r/9150aac013b7b95d62c2336751d5b6e91d2722aa.1511325444.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      548c3050