1. 18 October 2019 (3 commits)
    • x86/asm: Use SYM_INNER_LABEL instead of GLOBAL · 26ba4e57
      Committed by Jiri Slaby
      The GLOBAL macro had several meanings and is going away. Convert all the
      inner function labels marked with GLOBAL to use SYM_INNER_LABEL instead.
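      A representative conversion, shown as a diff-style sketch (the label
      picked here is meant to illustrate the class of change rather than quote
      a specific hunk):
      
        -GLOBAL(entry_SYSCALL_64_after_hwframe)
        +SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
      
      SYM_INNER_LABEL takes the label name plus its linkage (SYM_L_GLOBAL or
      SYM_L_LOCAL), so the visibility decision stays explicit at the
      annotation site.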
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: linux-arch@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20191011115108.12392-18-jslaby@suse.cz
      26ba4e57
    • x86/asm/entry: Annotate interrupt symbols properly · cc66936e
      Committed by Jiri Slaby
      * annotate functions properly with SYM_CODE_START, SYM_CODE_START_LOCAL*
        and SYM_CODE_END -- these are not C-like functions, so they have to
        be annotated using CODE.
      * use SYM_INNER_LABEL* for labels in the middle of other functions.
        This prevents nested label annotations (see the sketch below).
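      A hedged sketch of the two annotation styles side by side (the symbol
      names here are stand-ins, not necessarily ones touched by this patch):
      
        SYM_CODE_START(interrupt_entry)
        	/* not C-callable, hence the CODE (not FUNC) annotations */
        	...
        SYM_INNER_LABEL(interrupt_entry_resume, SYM_L_GLOBAL)
        	/* hypothetical inner label: no nested START/END pair needed */
        	...
        SYM_CODE_END(interrupt_entry)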
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: linux-arch@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20191011115108.12392-11-jslaby@suse.cz
      cc66936e
    • x86/asm: Annotate local pseudo-functions · ef77e688
      Committed by Jiri Slaby
      Use the newly added SYM_CODE_START_LOCAL* to annotate beginnings of
      all pseudo-functions (those ending with END until now) which do not
      have ".globl" annotation. This is needed to balance END for tools that
      generate debuginfo. Note that ENDs are switched to SYM_CODE_END too so
      that everybody can see the pairing.
      
      C-like functions (which handle frame ptr etc.) are not annotated here,
      hence SYM_CODE_* macros are used here, not SYM_FUNC_*. Note that the
      32bit version of early_idt_handler_common already had ENDPROC -- switch
      that to SYM_CODE_END for the same reason as above (and to be the same as
      64bit).
      
      While early_idt_handler_common is LOCAL, its name is not prepended with
      ".L" as it happens to appear in call traces.
      
      bad_get_user* and bad_put_user are now aligned, as they are separate
      functions. They do not mind being aligned -- there is no need to be
      compact there.
      
      early_idt_handler_common is aligned now too, as it comes after
      early_idt_handler_array, so again there is no need to be compact there.
      
      verify_cpu is self-standing and included in other .S files, so align it
      too.
      
      The others have alignment preserved to what it used to be (using the
      _NOALIGN variant of macros).
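      A hedged before/after sketch of the kind of change, using bad_get_user
      as the example (see arch/x86/lib/getuser.S; the exact hunk in the patch
      may differ slightly):
      
        -bad_get_user:
        +SYM_CODE_START_LOCAL(bad_get_user)
         	xor %edx,%edx
         	mov $(-EFAULT),%_ASM_AX
         	ASM_CLAC
         	ret
        -END(bad_get_user)
        +SYM_CODE_END(bad_get_user)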
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Alexios Zavras <alexios.zavras@intel.com>
      Cc: Allison Randal <allison@lohutok.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Cao jin <caoj.fnst@cn.fujitsu.com>
      Cc: Enrico Weigelt <info@metux.net>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: linux-arch@vger.kernel.org
      Cc: Maran Wilson <maran.wilson@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20191011115108.12392-6-jslaby@suse.cz
      ef77e688
  2. 11 October 2019 (1 commit)
    • x86/asm: Make more symbols local · 30a2441c
      Committed by Jiri Slaby
      During the assembly cleanup patchset review, I found more symbols which
      are used only locally. So make them really local by prepending ".L" to
      them. Namely:
      
       - wakeup_idt is used only in realmode/rm/wakeup_asm.S.
       - in_pm32 is used only in boot/pmjump.S.
       - retint_user is used only in entry/entry_64.S, perhaps since commit
         2ec67971 ("x86/entry/64/compat: Remove most of the fast system
         call machinery"), where entry_64_compat's caller was removed.
      
      Drop GLOBAL from all of them too. I do not see more candidates in the
      series.
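      A minimal sketch of one such rename, assuming the usual pattern of
      dropping GLOBAL and fixing up the now purely local references (the
      referencing instruction shown here is illustrative):
      
        -GLOBAL(wakeup_idt)
        +.Lwakeup_idt:
         	...
        -	lidtl	wakeup_idt
        +	lidtl	.Lwakeup_idt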
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Acked-by: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bp@alien8.de
      Cc: hpa@zytor.com
      Link: https://lkml.kernel.org/r/20191011092213.31470-1-jslaby@suse.cz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      30a2441c
  3. 06 September 2019 (1 commit)
    • x86/asm: Make some functions local labels · 98ededb6
      Committed by Jiri Slaby
      Boris suggests making these functions local labels (prepending ".L") to
      eliminate them from the symbol table. These are functions with very
      local names and really should not be visible anywhere.
      
      Note that objtool won't see these functions anymore (to generate ORC
      debug info). But none of these functions are annotated with ENDPROC, so
      they don't get objtool's attention anyway.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Cao jin <caoj.fnst@cn.fujitsu.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steve Winslow <swinslow@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wei Huang <wei@redhat.com>
      Cc: x86-ml <x86@kernel.org>
      Cc: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Link: https://lkml.kernel.org/r/20190906075550.23435-2-jslaby@suse.cz
      98ededb6
  4. 01 August 2019 (1 commit)
  5. 20 July 2019 (1 commit)
  6. 18 July 2019 (3 commits)
  7. 17 July 2019 (2 commits)
    • x86/entry/64: Use JMP instead of JMPQ · 64dbc122
      Committed by Josh Poimboeuf
      Somehow the swapgs mitigation entry code patch ended up with a JMPQ
      instruction instead of JMP, where only the short jump is needed.  Some
      assembler versions apparently fail to optimize JMPQ into a two-byte JMP
      when possible, instead always using a 7-byte JMP with relocation.  For
      some reason that makes the entry code explode with a #GP during boot.
      
      Change it back to "JMP" as originally intended.
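      The change itself is a single mnemonic, roughly of this shape (the label
      name here is illustrative):
      
        -	jmpq	.Lafter_fence
        +	jmp	.Lafter_fence
      
      With plain JMP the assembler is free to emit the two-byte short form
      when the target is in range.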
      
      Fixes: 18ec54fd ("x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations")
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      64dbc122
    • xen/pv: Fix a boot up hang revealed by int3 self test · b23e5844
      Committed by Zhenzhong Duan
      Commit 7457c0da ("x86/alternatives: Add int3_emulate_call()
      selftest") ensures there is a gap set up in the int3 exception stack
      which can be used for inserting a call return address.
      
      This gap is missing in the Xen PV int3 exception entry path, so the
      panic below is triggered:
      
      [    0.772876] general protection fault: 0000 [#1] SMP NOPTI
      [    0.772886] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.2.0+ #11
      [    0.772893] RIP: e030:int3_magic+0x0/0x7
      [    0.772905] RSP: 3507:ffffffff82203e98 EFLAGS: 00000246
      [    0.773334] Call Trace:
      [    0.773334]  alternative_instructions+0x3d/0x12e
      [    0.773334]  check_bugs+0x7c9/0x887
      [    0.773334]  ? __get_locked_pte+0x178/0x1f0
      [    0.773334]  start_kernel+0x4ff/0x535
      [    0.773334]  ? set_init_arg+0x55/0x55
      [    0.773334]  xen_start_kernel+0x571/0x57a
      
      For 64-bit PV guests, Xen's ABI enters the kernel using SYSRET, with
      %rcx/%r11 on the stack. To convert back to "normal" looking exceptions,
      the xen thunks do 'xen_*: pop %rcx; pop %r11; jmp *'.
      
      E.g., expanding 'xen_pv_trap xenint3' we get:
      xen_xenint3:
       pop %rcx;
       pop %r11;
       jmp xenint3
      
      As the xenint3 and int3 entry code are identical except that xenint3
      doesn't generate a gap, fix this by using int3 and dropping the useless
      xenint3, as sketched below.
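      Sketched on the asm side (a simplified view of the thunk list in
      arch/x86/xen/xen-asm_64.S; treat the exact line placement as an
      assumption to be checked against the patch):
      
         xen_pv_trap int3
        -xen_pv_trap xenint3
         xen_pv_trap xennmi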
      Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      b23e5844
  8. 09 July 2019 (1 commit)
    • x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations · 18ec54fd
      Committed by Josh Poimboeuf
      
      Spectre v1 isn't only about array bounds checks.  It can affect any
      conditional checks.  The kernel entry code interrupt, exception, and NMI
      handlers all have conditional swapgs checks.  Those may be problematic in
      the context of Spectre v1, as kernel code can speculatively run with a user
      GS.
      
      For example:
      
      	if (coming from user space)
      		swapgs
      	mov %gs:<percpu_offset>, %reg
      	mov (%reg), %reg1
      
      When coming from user space, the CPU can speculatively skip the swapgs, and
      then do a speculative percpu load using the user GS value.  So the user can
      speculatively force a read of any kernel value.  If a gadget exists which
      uses the percpu value as an address in another load/store, then the
      contents of the kernel value may become visible via an L1 side channel
      attack.
      
      A similar attack exists when coming from kernel space.  The CPU can
      speculatively do the swapgs, causing the user GS to get used for the rest
      of the speculative window.
      
      The mitigation is similar to a traditional Spectre v1 mitigation, except:
      
        a) index masking isn't possible because the index (percpu offset)
           isn't user-controlled; and
      
        b) an lfence is needed in both the "from user" swapgs path and the
           "from kernel" non-swapgs path (because of the two attacks described
           above).
      
      The user entry swapgs paths already have SWITCH_TO_KERNEL_CR3, which has a
      CR3 write when PTI is enabled.  Since CR3 writes are serializing, the
      lfences can be skipped in those cases.
      
      On the other hand, the kernel entry swapgs paths don't depend on PTI.
      
      To avoid unnecessary lfences for the user entry case, create two separate
      features for alternative patching:
      
        X86_FEATURE_FENCE_SWAPGS_USER
        X86_FEATURE_FENCE_SWAPGS_KERNEL
      
      Use these features in entry code to patch in lfences where needed.
      
      The features aren't enabled yet, so there's no functional change.
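      A minimal sketch of how such alternative-patched fences can be expressed
      as entry macros (modeled on arch/x86/entry/calling.h; exact names and
      placement are best checked against the patch itself):
      
        .macro FENCE_SWAPGS_USER_ENTRY
        	/* becomes an LFENCE only once X86_FEATURE_FENCE_SWAPGS_USER is set */
        	ALTERNATIVE "", "lfence", X86_FEATURE_FENCE_SWAPGS_USER
        .endm
        
        .macro FENCE_SWAPGS_KERNEL_ENTRY
        	/* becomes an LFENCE only once X86_FEATURE_FENCE_SWAPGS_KERNEL is set */
        	ALTERNATIVE "", "lfence", X86_FEATURE_FENCE_SWAPGS_KERNEL
        .endm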
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      18ec54fd
  9. 03 July 2019 (2 commits)
    • x86/fsgsbase: Revert FSGSBASE support · 049331f2
      Committed by Thomas Gleixner
      The FSGSBASE series turned out to have serious bugs and there is still an
      open issue which is not fully understood yet.
      
      The confidence in those changes has become close to zero especially as the
      test cases which have been shipped with that series were obviously never
      run before sending the final series out to LKML.
      
        ./fsgsbase_64 >/dev/null
        Segmentation fault
      
      As the merge window is close, the only sane decision is to revert FSGSBASE
      support. The revert is necessary as this branch has been merged into
      perf/core already and rebasing all of that a few days before the merge
      window is not the most brilliant idea.
      
      I could definitely slap myself for not noticing the test case fail when
      merging that series, but TBH my expectations weren't that low back
      then. Won't happen again.
      
      Revert the following commits:
      539bca53 ("x86/entry/64: Fix and clean up paranoid_exit")
      2c7b5ac5 ("Documentation/x86/64: Add documentation for GS/FS addressing mode")
      f987c955 ("x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2")
      2032f1f9 ("x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit")
      5bf0cab6 ("x86/entry/64: Document GSBASE handling in the paranoid path")
      708078f6 ("x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit")
      79e1932f ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
      1d07316b ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry")
      f60a83df ("x86/process/64: Use FSGSBASE instructions on thread copy and ptrace")
      1ab5f3f7 ("x86/process/64: Use FSBSBASE in switch_to() if available")
      a86b4625 ("x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions")
      8b71340d ("x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions")
      b64ed19b ("x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Chang S. Bae <chang.seok.bae@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Ravi Shankar <ravi.v.shankar@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      049331f2
    • x86/irq: Seperate unused system vectors from spurious entry again · f8a8fe61
      Committed by Thomas Gleixner
      Quite some time ago the interrupt entry stubs for unused vectors in the
      system vector range got removed and directly mapped to the spurious
      interrupt vector entry point.
      
      Sounds reasonable, but it's subtly broken. The spurious interrupt vector
      entry point pushes vector number 0xFF on the stack which makes the whole
      logic in __smp_spurious_interrupt() pointless.
      
      As a consequence any spurious interrupt which comes from a vector != 0xFF
      is treated as a real spurious interrupt (vector 0xFF) and not
      acknowledged. That subsequently stalls all interrupt vectors of equal and
      lower priority, which brings the system to a grinding halt.
      
      This can happen because even on 64-bit the system vector space is not
      guaranteed to be fully populated. A full compile-time handling of the
      unused vectors is not possible because quite a few of them are
      conditionally populated at runtime.
      
      Bring the entry stubs back, which wastes 160 bytes if all stubs are unused,
      but gains the proper handling back. There is no point in selectively
      sparing some of the stubs which are known unused at compile time, as the
      required code in the IDT management would be way larger and more
      convoluted.
      
      Do not route the spurious entries through common_interrupt and do_IRQ() as
      the original code did. Route it to smp_spurious_interrupt() which evaluates
      the vector number and acts accordingly now that the real vector numbers are
      handed in.
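      A hedged sketch of the restored stubs, modeled on the existing
      irq_entries_start pattern (the vector encoding and symbol name are
      assumptions; the actual stubs are in the patch):
      
        	.align 8
        ENTRY(spurious_entries_start)
            vector=FIRST_SYSTEM_VECTOR
            .rept (NR_VECTORS - FIRST_SYSTEM_VECTOR)
        	UNWIND_HINT_IRET_REGS
        	pushq	$(~vector+0x80)		/* real vector number, in signed byte range */
        	jmp	common_spurious
        	.align	8
            vector=vector+1
            .endr
        END(spurious_entries_start)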
      
      Fix up the pr_warn so the actual spurious vector (0xff) is clearly
      distinguished from the other vectors, and also note for the vectored case
      whether it was pending in the ISR or not.
      
       "Spurious APIC interrupt (vector 0xFF) on CPU#0, should never happen."
       "Spurious interrupt vector 0xed on CPU#1. Acked."
       "Spurious interrupt vector 0xee on CPU#1. Not pending!."
      
      Fixes: 2414e021 ("x86: Avoid building unused IRQ entry stubs")
      Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Link: https://lkml.kernel.org/r/20190628111440.550568228@linutronix.de
      f8a8fe61
  10. 02 July 2019 (2 commits)
  11. 22 June 2019 (2 commits)
  12. 12 June 2019 (1 commit)
  13. 09 June 2019 (1 commit)
  14. 09 May 2019 (1 commit)
  15. 17 April 2019 (5 commits)
    • x86/irq/64: Split the IRQ stack into its own pages · e6401c13
      Committed by Andy Lutomirski
      Currently, the IRQ stack is hardcoded as the first page of the percpu
      area, and the stack canary lives on the IRQ stack. The former gets in
      the way of adding an IRQ stack guard page, and the latter is a potential
      weakness in the stack canary mechanism.
      
      Split the IRQ stack into its own private percpu pages.
      
      [ tglx: Make 64 and 32 bit share struct irq_stack ]
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jordan Borgner <mail@jordan-borgner.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Maran Wilson <maran.wilson@oracle.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: "Rafael Ávila de Espíndola" <rafael@espindo.la>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: x86-ml <x86@kernel.org>
      Cc: xen-devel@lists.xenproject.org
      Link: https://lkml.kernel.org/r/20190414160146.267376656@linutronix.de
      e6401c13
    • x86/irq/64: Rename irq_stack_ptr to hardirq_stack_ptr · 758a2e31
      Committed by Thomas Gleixner
      Preparatory patch to share code with 32bit.
      
      No functional changes.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190414160145.912584074@linutronix.de
      758a2e31
    • x86/exceptions: Split debug IST stack · 2a594d4c
      Committed by Thomas Gleixner
      The debug IST stack is actually two separate debug stacks to handle #DB
      recursion. This is required because the CPU always starts at the top of the
      stack on exception entry, which means on #DB recursion the second #DB would
      overwrite the stack of the first.
      
      The low level entry code therefore adjusts the top of stack on entry so a
      secondary #DB starts from a different stack page. But the stack pages are
      adjacent without a guard page between them.
      
      Split the debug stack into 3 stacks which are separated by guard pages. The
      3rd stack is never mapped into the cpu_entry_area and is only there to
      catch triple #DB nesting:
      
            --- top of DB_stack	<- Initial stack
            --- end of DB_stack
            	  guard page
      
            --- top of DB1_stack	<- Top of stack after entering first #DB
            --- end of DB1_stack
            	  guard page
      
            --- top of DB2_stack	<- Top of stack after entering second #DB
            --- end of DB2_stack
            	  guard page
      
      If DB2 did not act as the final guard hole, a second #DB would point the
      top of the #DB stack to the stack below DB1, which would be valid and
      would not catch the undesired triple nesting.
      
      The backing store does not allocate any memory for DB2 and its guard page
      as it is not going to be mapped into the cpu_entry_area.
      
       - Adjust the low level entry code so it adjusts top of #DB with the offset
         between the stacks instead of exception stack size.
      
       - Make the dumpstack code aware of the new stacks.
      
       - Adjust the in_debug_stack() implementation and move it into the NMI code
         where it belongs. As this is NMI hotpath code, it just checks the full
         area between the top of DB_stack and the bottom of DB1_stack without
         checking for the guard page. That's correct because the NMI cannot hit
         a stack pointer pointing to the guard page between the DB and DB1
         stacks. Even if it did, the NMI operation would still be unaffected,
         but the resume of the debug exception on the topmost DB stack would
         crash by touching the guard page.
      
        [ bp: Make exception_stack_names static const char * const ]
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: linux-doc@vger.kernel.org
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190414160145.439944544@linutronix.de
      2a594d4c
    • x86/exceptions: Disconnect IST index and stack order · 32074269
      Committed by Thomas Gleixner
      The entry order of the TSS.IST array and the order of the stack
      storage/mapping are not required to be the same.
      
      With the upcoming split of the debug stack this is going to fall apart as
      the number of TSS.IST array entries stays the same while the actual stacks
      are increasing.
      
      Make them separate so that code like dumpstack can just utilize the mapping
      order. The IST index is solely required for the actual TSS.IST array
      initialization.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190414160145.241588113@linutronix.de
      32074269
    • x86/exceptions: Make IST index zero based · 8f34c5b5
      Committed by Thomas Gleixner
      The defines for the exception stack (IST) array in the TSS are using the
      SDM convention IST1 - IST7. That causes all sorts of code to subtract 1 for
      array indices related to IST. That's confusing at best and does not provide
      any value.
      
      Make the indices zero based and fixup the usage sites. The only code which
      needs to adjust the 0 based index is the interrupt descriptor setup which
      needs to add 1 now.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: linux-doc@vger.kernel.org
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190414160144.331772825@linutronix.de
      8f34c5b5
  16. 05 April 2019 (1 commit)
  17. 03 April 2019 (2 commits)
    • sched/x86_64: Don't save flags on context switch · 64604d54
      Committed by Peter Zijlstra
      Now that we have objtool validating AC=1 state for all x86_64 code,
      we can once again guarantee clean flags on schedule.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      64604d54
    • sched/x86: Save [ER]FLAGS on context switch · 6690e86b
      Committed by Peter Zijlstra
      Effectively reverts commit:
      
        2c7577a7 ("sched/x86_64: Don't save flags on context switch")
      
      Specifically because SMAP uses FLAGS.AC which invalidates the claim
      that the kernel has clean flags.
      
      In particular, while preemption from interrupt return is fine (the
      IRET frame on the exception stack contains FLAGS), it breaks any code
      that does synchronous scheduling, including preempt_enable().
      
      This has become a significant issue ever since commit:
      
        5b24a7a2 ("Add 'unsafe' user access functions for batched accesses")
      
      provided for means of having 'normal' C code between STAC / CLAC,
      exposing the FLAGS.AC state. So far this hasn't led to trouble,
      however fix it before it comes apart.
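      Conceptually, the fix pairs a PUSHF on the outgoing task's stack with a
      POPF from the incoming task's stack across the stack switch in
      __switch_to_asm. A simplified sketch (the exact placement relative to
      the callee-saved register pushes follows struct inactive_task_frame in
      the real patch and is omitted here):
      
        	/* save EFLAGS (including AC) of the task being switched out */
        	pushfq
        	/* switch stacks */
        	movq	%rsp, TASK_threadsp(%rdi)
        	movq	TASK_threadsp(%rsi), %rsp
        	/* restore EFLAGS the incoming task saved when it was switched out */
        	popfq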
      Reported-by: Julien Thierry <julien.thierry@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@kernel.org
      Fixes: 5b24a7a2 ("Add 'unsafe' user access functions for batched accesses")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6690e86b
  18. 06 December 2018 (1 commit)
    • kprobes/x86: Blacklist non-attachable interrupt functions · a50480cb
      Committed by Andrea Righi
      These interrupt functions are already non-attachable by kprobes.
      Blacklist them explicitly so that they can show up in
      /sys/kernel/debug/kprobes/blacklist and tools like BCC can use this
      additional information.
      Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yonghong Song <yhs@fb.com>
      Link: http://lkml.kernel.org/r/20181206095648.GA8249@Dell
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a50480cb
  19. 17 October 2018 (1 commit)
  20. 14 October 2018 (1 commit)
  21. 13 September 2018 (1 commit)
    • x86/pti/64: Remove the SYSCALL64 entry trampoline · bf904d27
      Committed by Andy Lutomirski
      The SYSCALL64 trampoline has a couple of nice properties:
      
       - The usual sequence of SWAPGS followed by two GS-relative accesses to
         set up RSP is somewhat slow because the GS-relative accesses need
         to wait for SWAPGS to finish.  The trampoline approach allows
         RIP-relative accesses to set up RSP, which avoids the stall.
      
       - The trampoline avoids any percpu access before CR3 is set up,
         which means that no percpu memory needs to be mapped in the user
         page tables.  This prevents using Meltdown to read any percpu memory
         outside the cpu_entry_area and prevents using timing leaks
         to directly locate the percpu areas.
      
      The downsides of using a trampoline may outweigh the upsides, however.
      It adds an extra non-contiguous I$ cache line to system calls, and it
      forces an indirect jump to transfer control back to the normal kernel
      text after CR3 is set up.  The latter is because x86 lacks a 64-bit
      direct jump instruction that could jump from the trampoline to the entry
      text.  With retpolines enabled, the indirect jump is extremely slow.
      
      Change the code to map the percpu TSS into the user page tables to allow
      the non-trampoline SYSCALL64 path to work under PTI.  This does not add a
      new direct information leak, since the TSS is readable by Meltdown from the
      cpu_entry_area alias regardless.  It does allow a timing attack to locate
      the percpu area, but KASLR is more or less a lost cause against local
      attack on CPUs vulnerable to Meltdown regardless.  As far as I'm concerned,
      on current hardware, KASLR is only useful to mitigate remote attacks that
      try to attack the kernel without first gaining RCE against a vulnerable
      user process.
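      Roughly, the non-trampoline entry then looks like this (a simplified
      sketch; the scratch slot and percpu symbol names below are assumptions
      to be checked against the actual entry_64.S changes):
      
        ENTRY(entry_SYSCALL_64)
        	swapgs
        	/* tss.sp2 is used as scratch space for the user RSP */
        	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
        	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
        	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp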
      
      On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces syscall
      overhead from ~237ns to ~228ns.
      
      There is a possible alternative approach: Move the trampoline within 2G of
      the entry text and make a separate copy for each CPU.  This would allow a
      direct jump to rejoin the normal entry path. There are pro's and con's for
      this approach:
      
       + It avoids a pipeline stall
      
       - It executes from an extra page and read from another extra page during
         the syscall. The latter is because it needs to use a relative
         addressing mode to find sp1 -- it's the same *cacheline*, but accessed
         using an alias, so it's an extra TLB entry.
      
       - Slightly more memory. This would be one page per CPU for a simple
         implementation and 64-ish bytes per CPU or one page per node for a more
         complex implementation.
      
       - More code complexity.
      
      The current approach is chosen for simplicity and because the alternative
      does not provide a benefit significant enough to make it worthwhile.
      
      [ tglx: Added the alternative discussion to the changelog ]
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/8c7c6e483612c3e4e10ca89495dc160b1aa66878.1536015544.git.luto@kernel.org
      bf904d27
  22. 08 September 2018 (2 commits)
  23. 05 September 2018 (1 commit)
    • x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls · afaef01c
      Committed by Alexander Popov
      The STACKLEAK feature (initially developed by PaX Team) has the following
      benefits:
      
      1. Reduces the information that can be revealed through kernel stack leak
         bugs. The idea of erasing the thread stack at the end of syscalls is
         similar to CONFIG_PAGE_POISONING and memzero_explicit() in kernel
         crypto, which all comply with FDP_RIP.2 (Full Residual Information
         Protection) of the Common Criteria standard.
      
      2. Blocks some uninitialized stack variable attacks (e.g. CVE-2017-17712,
         CVE-2010-2963). Those kinds of bugs should be killed by improving C
         compilers in the future, which might take a long time.
      
      This commit introduces the code filling the used part of the kernel
      stack with a poison value before returning to userspace. Full
      STACKLEAK feature also contains the gcc plugin which comes in a
      separate commit.
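      On x86 the erasing is hooked into the return-to-user paths; a minimal
      sketch, assuming a calling.h-style wrapper around the C eraser (the
      original commit may simply call stackleak_erase directly from the entry
      code):
      
        .macro STACKLEAK_ERASE
        #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
        	/* poison the used part of the task stack before returning to user */
        	call	stackleak_erase
        #endif
        .endm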
      
      The STACKLEAK feature is ported from grsecurity/PaX. More information at:
        https://grsecurity.net/
        https://pax.grsecurity.net/
      
      This code is modified from Brad Spengler/PaX Team's code in the last
      public patch of grsecurity/PaX based on our understanding of the code.
      Changes or omissions from the original code are ours and don't reflect
      the original grsecurity/PaX code.
      
      Performance impact:
      
      Hardware: Intel Core i7-4770, 16 GB RAM
      
      Test #1: building the Linux kernel on a single core
              0.91% slowdown
      
      Test #2: hackbench -s 4096 -l 2000 -g 15 -f 25 -P
              4.2% slowdown
      
      So the STACKLEAK description in Kconfig includes: "The tradeoff is the
      performance impact: on a single CPU system kernel compilation sees a 1%
      slowdown, other systems and workloads may vary and you are advised to
      test this feature on your expected workload before deploying it".
      Signed-off-by: Alexander Popov <alex.popov@linux.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      afaef01c
  24. 03 September 2018 (1 commit)
  25. 24 July 2018 (1 commit)
    • x86/entry/64: Remove %ebx handling from error_entry/exit · b3681dd5
      Committed by Andy Lutomirski
      error_entry and error_exit communicate the user vs. kernel status of
      the frame using %ebx.  This is unnecessary -- the information is in
      regs->cs.  Just use regs->cs.
      
      This makes error_entry simpler and makes error_exit more robust.
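      A hedged sketch of what the %ebx-free exit check looks like once the
      user vs. kernel decision is taken from the saved CS on the stack (label
      names as in the entry_64.S of that era; treat the exact code as
      illustrative):
      
        error_exit:
        	DISABLE_INTERRUPTS(CLBR_ANY)
        	TRACE_IRQS_OFF
        	testb	$3, CS(%rsp)		/* were we in user mode? */
        	jz	retint_kernel
        	jmp	retint_user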
      
      It also fixes a nasty bug.  Before all the Spectre nonsense, the
      xen_failsafe_callback entry point returned like this:
      
              ALLOC_PT_GPREGS_ON_STACK
              SAVE_C_REGS
              SAVE_EXTRA_REGS
              ENCODE_FRAME_POINTER
              jmp     error_exit
      
      And it did not go through error_entry.  This was bogus: RBX
      contained garbage, and error_exit expected a flag in RBX.
      
      Fortunately, it generally contained *nonzero* garbage, so the
      correct code path was used.  As part of the Spectre fixes, code was
      added to clear RBX to mitigate certain speculation attacks.  Now,
      depending on kernel configuration, RBX got zeroed and, when running
      some Wine workloads, the kernel crashes.  This was introduced by:
      
          commit 3ac6d8c7 ("x86/entry/64: Clear registers for exceptions/interrupts, to reduce speculation attack surface")
      
      With this patch applied, RBX is no longer needed as a flag, and the
      problem goes away.
      
      I suspect that malicious userspace could use this bug to crash the
      kernel even without the offending patch applied, though.
      
      [ Historical note: I wrote this patch as a cleanup before I was aware
        of the bug it fixed. ]
      
      [ Note to stable maintainers: this should probably get applied to all
        kernels.  If you're nervous about that, a more conservative fix to
        add xorl %ebx,%ebx; incl %ebx before the jump to error_exit should
        also fix the problem. ]
      Reported-and-tested-by: M. Vefa Bicakci <m.v.b@runbox.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Cc: xen-devel@lists.xenproject.org
      Fixes: 3ac6d8c7 ("x86/entry/64: Clear registers for exceptions/interrupts, to reduce speculation attack surface")
      Link: http://lkml.kernel.org/r/b5010a090d3586b2d6e06c7ad3ec5542d1241c45.1532282627.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b3681dd5
  26. 03 July 2018 (1 commit)
    • x86/entry/64: Add two more instruction suffixes · 6709812f
      Committed by Jan Beulich
      Sadly, contrary to what was claimed in:
      
        a368d7fd ("x86/entry/64: Add instruction suffix")
      
      ... there are two more instances which want to be adjusted.
      
      As said there, omitting suffixes from instructions in AT&T mode is bad
      practice when operand size cannot be determined by the assembler from
      register operands, and is likely going to be warned about by upstream
      gas in the future (mine does already).
      
      Add the other missing suffixes here as well.
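      An example of the class of change (a memory operand plus an immediate
      gives the assembler no way to infer the operand size; the particular
      instruction shown here is illustrative rather than quoted from the
      patch):
      
        -	bt	$9, EFLAGS(%rsp)	/* operand size ambiguous */
        +	btl	$9, EFLAGS(%rsp)	/* explicit 32-bit operand */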
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/5B3A02DD02000078001CFB78@prv1-mh.provo.novell.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6709812f