1. 18 Jul, 2019 (2 commits)
  2. 03 Jul, 2019 (2 commits)
    • x86/fsgsbase: Revert FSGSBASE support · 049331f2
      Committed by Thomas Gleixner
      The FSGSBASE series turned out to have serious bugs and there is still an
      open issue which is not fully understood yet.
      
      The confidence in those changes has become close to zero, especially as
      the test cases which were shipped with that series were obviously never
      run before sending the final series out to LKML.
      
        ./fsgsbase_64 >/dev/null
        Segmentation fault
      
      As the merge window is close, the only sane decision is to revert FSGSBASE
      support. The revert is necessary as this branch has been merged into
      perf/core already and rebasing all of that a few days before the merge
      window is not the most brilliant idea.
      
      I could definitely slap myself for not noticing the test case fail when
      merging that series, but TBH my expectations weren't that low back
      then. Won't happen again.
      
      Revert the following commits:
      539bca53 ("x86/entry/64: Fix and clean up paranoid_exit")
      2c7b5ac5 ("Documentation/x86/64: Add documentation for GS/FS addressing mode")
      f987c955 ("x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2")
      2032f1f9 ("x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit")
      5bf0cab6 ("x86/entry/64: Document GSBASE handling in the paranoid path")
      708078f6 ("x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit")
      79e1932f ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
      1d07316b ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry")
      f60a83df ("x86/process/64: Use FSGSBASE instructions on thread copy and ptrace")
      1ab5f3f7 ("x86/process/64: Use FSBSBASE in switch_to() if available")
      a86b4625 ("x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions")
      8b71340d ("x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions")
      b64ed19b ("x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE")
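      
      For context, 8b71340d in the list above added intrinsics for the FSGSBASE
      instructions. A hedged sketch of what such helpers look like (illustrative
      only; the exact reverted code may differ):
      
        static inline unsigned long rdfsbase(void)
        {
                unsigned long fsbase;
      
                /* RDFSBASE reads the FS base directly; needs CR4.FSGSBASE set */
                asm volatile("rdfsbase %0" : "=r" (fsbase) : : "memory");
                return fsbase;
        }
      
        static inline void wrfsbase(unsigned long fsbase)
        {
                /* WRFSBASE writes the FS base directly from a register */
                asm volatile("wrfsbase %0" : : "r" (fsbase) : "memory");
        }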
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Chang S. Bae <chang.seok.bae@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Ravi Shankar <ravi.v.shankar@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
    • x86/irq: Seperate unused system vectors from spurious entry again · f8a8fe61
      Committed by Thomas Gleixner
      Quite some time ago the interrupt entry stubs for unused vectors in the
      system vector range got removed and directly mapped to the spurious
      interrupt vector entry point.
      
      Sounds reasonable, but it's subtly broken. The spurious interrupt vector
      entry point pushes vector number 0xFF on the stack which makes the whole
      logic in __smp_spurious_interrupt() pointless.
      
      As a consequence any spurious interrupt which comes from a vector != 0xFF
      is treated as a real spurious interrupt (vector 0xFF) and not
      acknowledged. That subsequently stalls all interrupt vectors of equal and
      lower priority, which brings the system to a grinding halt.
      
      This can happen because even on 64-bit the system vector space is not
      guaranteed to be fully populated. A full compile time handling of the
      unused vectors is not possible because quite a few of them are conditionally
      populated at runtime.
      
      Bring the entry stubs back, which wastes 160 bytes if all stubs are unused,
      but restores proper handling. There is no point in selectively sparing some
      of the stubs which are known at compile time, as the required code in the
      IDT management would be far larger and more convoluted.
      
      Do not route the spurious entries through common_interrupt and do_IRQ() as
      the original code did. Route it to smp_spurious_interrupt() which evaluates
      the vector number and acts accordingly now that the real vector numbers are
      handed in.
      
      Fix up the pr_warn() so the actual spurious vector (0xff) is clearly
      distinguished from the other vectors, and for the vectored case also note
      whether it was pending in the ISR or not.
      
       "Spurious APIC interrupt (vector 0xFF) on CPU#0, should never happen."
       "Spurious interrupt vector 0xed on CPU#1. Acked."
       "Spurious interrupt vector 0xee on CPU#1. Not pending!."
      
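      A hedged sketch of the distinction being made (illustrative only, not the
      exact patch; apic_read(), ack_APIC_irq(), pr_warn() and smp_processor_id()
      are standard kernel helpers):
      
        static void handle_spurious_vector(u8 vector)
        {
                u32 isr;
      
                if (vector == SPURIOUS_APIC_VECTOR) {   /* 0xFF: never ACK this one */
                        pr_warn("Spurious APIC interrupt (vector 0xFF) on CPU#%d, should never happen.\n",
                                smp_processor_id());
                        return;
                }
      
                /* Is the vector actually set in the APIC In-Service Register? */
                isr = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
                if (isr & (1U << (vector & 0x1f))) {
                        ack_APIC_irq();         /* unblock equal/lower priority vectors */
                        pr_warn("Spurious interrupt vector 0x%02x on CPU#%d. Acked.\n",
                                vector, smp_processor_id());
                } else {
                        pr_warn("Spurious interrupt vector 0x%02x on CPU#%d. Not pending!\n",
                                vector, smp_processor_id());
                }
        }
      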
      Fixes: 2414e021 ("x86: Avoid building unused IRQ entry stubs")
      Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Link: https://lkml.kernel.org/r/20190628111440.550568228@linutronix.de
  3. 02 Jul, 2019 (2 commits)
  4. 22 Jun, 2019 (2 commits)
  5. 12 Jun, 2019 (1 commit)
  6. 09 Jun, 2019 (1 commit)
  7. 09 May, 2019 (1 commit)
  8. 17 Apr, 2019 (5 commits)
    • x86/irq/64: Split the IRQ stack into its own pages · e6401c13
      Committed by Andy Lutomirski
      Currently, the IRQ stack is hardcoded as the first page of the percpu
      area, and the stack canary lives on the IRQ stack. The former gets in
      the way of adding an IRQ stack guard page, and the latter is a potential
      weakness in the stack canary mechanism.
      
      Split the IRQ stack into its own private percpu pages.
      
      [ tglx: Make 64 and 32 bit share struct irq_stack ]
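      
      A hedged sketch of the direction (declarations only; the names follow the
      series but are illustrative, while IRQ_STACK_SIZE and the percpu macros are
      standard kernel facilities):
      
        /* A dedicated per-CPU backing store for the hard IRQ stack instead of
         * the first page of the percpu area; a per-CPU pointer caches its top. */
        struct irq_stack {
                char stack[IRQ_STACK_SIZE];
        } __aligned(IRQ_STACK_SIZE);
      
        DECLARE_PER_CPU_PAGE_ALIGNED(struct irq_stack, irq_stack_backing_store);
        DECLARE_PER_CPU(struct irq_stack *, hardirq_stack_ptr);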
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jordan Borgner <mail@jordan-borgner.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Maran Wilson <maran.wilson@oracle.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: "Rafael Ávila de Espíndola" <rafael@espindo.la>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: x86-ml <x86@kernel.org>
      Cc: xen-devel@lists.xenproject.org
      Link: https://lkml.kernel.org/r/20190414160146.267376656@linutronix.de
    • x86/irq/64: Rename irq_stack_ptr to hardirq_stack_ptr · 758a2e31
      Committed by Thomas Gleixner
      Preparatory patch to share code with 32bit.
      
      No functional changes.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190414160145.912584074@linutronix.de
    • x86/exceptions: Split debug IST stack · 2a594d4c
      Committed by Thomas Gleixner
      The debug IST stack is actually two separate debug stacks to handle #DB
      recursion. This is required because the CPU always starts at the top of the
      stack on exception entry, which means that on #DB recursion the second #DB
      would overwrite the stack of the first.
      
      The low level entry code therefore adjusts the top of stack on entry so a
      secondary #DB starts from a different stack page. But the stack pages are
      adjacent without a guard page between them.
      
      Split the debug stack into 3 stacks which are separated by guard pages. The
      3rd stack is never mapped into the cpu_entry_area and is only there to
      catch triple #DB nesting:
      
            --- top of DB_stack	<- Initial stack
            --- end of DB_stack
            	  guard page
      
            --- top of DB1_stack	<- Top of stack after entering first #DB
            --- end of DB1_stack
            	  guard page
      
            --- top of DB2_stack	<- Top of stack after entering second #DB
            --- end of DB2_stack
            	  guard page
      
      If DB2 did not act as the final guard hole, a second #DB would point the
      top of the #DB stack to the stack below DB1, which would be valid and would
      not catch the undesired triple nesting.
      
      The backing store does not allocate any memory for DB2 and its guard page
      as it is not going to be mapped into the cpu_entry_area.
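      
      A hedged sketch of the resulting layout as a plain C struct (purely
      illustrative; the kernel expresses this through its cpu_entry_area setup,
      and PAGE_SIZE / EXCEPTION_STKSZ are the usual kernel constants):
      
        struct exception_stacks_sketch {
                /* members from lower to higher addresses, each stack preceded
                 * by a guard page; DB2 only reserves address space */
                char DF_guard[PAGE_SIZE];
                char DF_stack[EXCEPTION_STKSZ];
                char NMI_guard[PAGE_SIZE];
                char NMI_stack[EXCEPTION_STKSZ];
                char DB2_guard[PAGE_SIZE];
                char DB2_stack[EXCEPTION_STKSZ];        /* never mapped */
                char DB1_guard[PAGE_SIZE];
                char DB1_stack[EXCEPTION_STKSZ];
                char DB_guard[PAGE_SIZE];
                char DB_stack[EXCEPTION_STKSZ];
                char MCE_guard[PAGE_SIZE];
                char MCE_stack[EXCEPTION_STKSZ];
        };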
      
       - Adjust the low level entry code so it adjusts top of #DB with the offset
         between the stacks instead of exception stack size.
      
       - Make the dumpstack code aware of the new stacks.
      
       - Adjust the in_debug_stack() implementation and move it into the NMI code
         where it belongs. As this is NMI hotpath code, it just checks the full
         area between the top of DB_stack and the bottom of DB1_stack without
         checking for the guard page. That's correct because the NMI cannot hit a
         stack pointer pointing to the guard page between the DB and DB1 stacks.
         Even if it did, the NMI operation itself would still be unaffected, but
         the resume of the debug exception on the topmost DB stack would crash by
         touching the guard page.
      
        [ bp: Make exception_stack_names static const char * const ]
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: linux-doc@vger.kernel.org
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190414160145.439944544@linutronix.de
    • x86/exceptions: Disconnect IST index and stack order · 32074269
      Committed by Thomas Gleixner
      The entry order of the TSS.IST array and the order of the stack
      storage/mapping are not required to be the same.
      
      With the upcoming split of the debug stack this is going to fall apart, as
      the number of TSS.IST array entries stays the same while the number of
      actual stacks increases.
      
      Make them separate so that code like dumpstack can just utilize the mapping
      order. The IST index is solely required for the actual TSS.IST array
      initialization.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190414160145.241588113@linutronix.de
    • x86/exceptions: Make IST index zero based · 8f34c5b5
      Committed by Thomas Gleixner
      The defines for the exception stack (IST) array in the TSS are using the
      SDM convention IST1 - IST7. That causes all sorts of code to subtract 1 for
      array indices related to IST. That's confusing at best and does not provide
      any value.
      
      Make the indices zero based and fixup the usage sites. The only code which
      needs to adjust the 0 based index is the interrupt descriptor setup which
      needs to add 1 now.
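      
      A hedged sketch of the convention (names are illustrative, not necessarily
      the ones used in the patch; the old 1-based values are from the pre-series
      defines):
      
        /* Zero-based indices for the kernel's own bookkeeping; the IST field in
         * an IDT gate stays 1-based because 0 there means "no stack switch". */
        enum exception_stack_index_sketch {
                ISTACK_DF,              /* was DOUBLEFAULT_STACK == 1 */
                ISTACK_NMI,             /* was NMI_STACK         == 2 */
                ISTACK_DB,              /* was DEBUG_STACK       == 3 */
                ISTACK_MCE,             /* was MCE_STACK         == 4 */
                N_EXCEPTION_STACKS_SKETCH
        };
      
        /* What the interrupt descriptor setup actually programs into the gate. */
        #define ISTACK_TO_IST(x)        ((x) + 1)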
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: linux-doc@vger.kernel.org
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190414160144.331772825@linutronix.de
  9. 05 Apr, 2019 (1 commit)
  10. 03 Apr, 2019 (2 commits)
    • sched/x86_64: Don't save flags on context switch · 64604d54
      Committed by Peter Zijlstra
      Now that we have objtool validating AC=1 state for all x86_64 code,
      we can once again guarantee clean flags on schedule.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/x86: Save [ER]FLAGS on context switch · 6690e86b
      Committed by Peter Zijlstra
      Effectively reverts commit:
      
        2c7577a7 ("sched/x86_64: Don't save flags on context switch")
      
      Specifically because SMAP uses FLAGS.AC which invalidates the claim
      that the kernel has clean flags.
      
      In particular, while preemption from interrupt return is fine (the
      IRET frame on the exception stack contains FLAGS), it breaks any code
      that does synchronous scheduling, including preempt_enable().
      
      This has become a significant issue ever since commit:
      
        5b24a7a2 ("Add 'unsafe' user access functions for batched accesses")
      
      provided a means of having 'normal' C code between STAC / CLAC,
      exposing the FLAGS.AC state. So far this hasn't led to trouble, but
      fix it before it comes apart.
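      
      A hedged illustration of the hazard (not code from this patch): anything
      that might schedule inside the STAC/CLAC window depends on FLAGS, including
      AC, being preserved across switch_to():
      
        #include <linux/uaccess.h>
      
        /* EFLAGS.AC is set for the whole user_access_begin()/user_access_end()
         * window. If code in this window ends up in preempt_enable()/schedule(),
         * the context switch has to save and restore FLAGS or the AC state is
         * lost or leaked across tasks. */
        static int put_one_value(int __user *uptr, int val)
        {
                if (!user_access_begin(uptr, sizeof(*uptr)))    /* STAC */
                        return -EFAULT;
                unsafe_put_user(val, uptr, efault);     /* 'normal' C runs here */
                user_access_end();                      /* CLAC */
                return 0;
        efault:
                user_access_end();
                return -EFAULT;
        }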
      Reported-by: Julien Thierry <julien.thierry@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@kernel.org
      Fixes: 5b24a7a2 ("Add 'unsafe' user access functions for batched accesses")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 06 Dec, 2018 (1 commit)
    • kprobes/x86: Blacklist non-attachable interrupt functions · a50480cb
      Committed by Andrea Righi
      These interrupt functions are already non-attachable by kprobes.
      Blacklist them explicitly so that they can show up in
      /sys/kernel/debug/kprobes/blacklist and tools like BCC can use this
      additional information.
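      
      Explicit blacklisting of a function is typically done with the
      NOKPROBE_SYMBOL() annotation; a hedged sketch with an illustrative
      function name:
      
        #include <linux/kprobes.h>
      
        /* NOKPROBE_SYMBOL() records a function in the kprobe blacklist, which
         * is what /sys/kernel/debug/kprobes/blacklist exports to userspace. */
        static void example_interrupt_body(void)
        {
                /* interrupt handling work that must never be probed */
        }
        NOKPROBE_SYMBOL(example_interrupt_body);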
      Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yonghong Song <yhs@fb.com>
      Link: http://lkml.kernel.org/r/20181206095648.GA8249@Dell
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 17 Oct, 2018 (1 commit)
  13. 14 Oct, 2018 (1 commit)
  14. 13 Sep, 2018 (1 commit)
    • x86/pti/64: Remove the SYSCALL64 entry trampoline · bf904d27
      Committed by Andy Lutomirski
      The SYSCALL64 trampoline has a couple of nice properties:
      
       - The usual sequence of SWAPGS followed by two GS-relative accesses to
         set up RSP is somewhat slow because the GS-relative accesses need
         to wait for SWAPGS to finish.  The trampoline approach allows
         RIP-relative accesses to set up RSP, which avoids the stall.
      
       - The trampoline avoids any percpu access before CR3 is set up,
         which means that no percpu memory needs to be mapped in the user
         page tables.  This prevents using Meltdown to read any percpu memory
         outside the cpu_entry_area and prevents using timing leaks
         to directly locate the percpu areas.
      
      The downsides of using a trampoline may outweigh the upsides, however.
      It adds an extra non-contiguous I$ cache line to system calls, and it
      forces an indirect jump to transfer control back to the normal kernel
      text after CR3 is set up.  The latter is because x86 lacks a 64-bit
      direct jump instruction that could jump from the trampoline to the entry
      text.  With retpolines enabled, the indirect jump is extremely slow.
      
      Change the code to map the percpu TSS into the user page tables to allow
      the non-trampoline SYSCALL64 path to work under PTI.  This does not add a
      new direct information leak, since the TSS is readable by Meltdown from the
      cpu_entry_area alias regardless.  It does allow a timing attack to locate
      the percpu area, but KASLR is more or less a lost cause against local
      attack on CPUs vulnerable to Meltdown regardless.  As far as I'm concerned,
      on current hardware, KASLR is only useful to mitigate remote attacks that
      try to attack the kernel without first gaining RCE against a vulnerable
      user process.
      
      On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces syscall
      overhead from ~237ns to ~228ns.
      
      There is a possible alternative approach: Move the trampoline within 2G of
      the entry text and make a separate copy for each CPU.  This would allow a
      direct jump to rejoin the normal entry path. There are pros and cons to
      this approach:
      
       + It avoids a pipeline stall
      
       - It executes from an extra page and reads from another extra page during
         the syscall. The latter is because it needs to use a relative
         addressing mode to find sp1 -- it's the same *cacheline*, but accessed
         using an alias, so it's an extra TLB entry.
      
       - Slightly more memory. This would be one page per CPU for a simple
         implementation and 64-ish bytes per CPU or one page per node for a more
         complex implementation.
      
       - More code complexity.
      
      The current approach is chosen for simplicity and because the alternative
      does not provide enough benefit to justify the extra complexity.
      
      [ tglx: Added the alternative discussion to the changelog ]
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/8c7c6e483612c3e4e10ca89495dc160b1aa66878.1536015544.git.luto@kernel.org
  15. 08 Sep, 2018 (2 commits)
  16. 05 Sep, 2018 (1 commit)
    • x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls · afaef01c
      Committed by Alexander Popov
      The STACKLEAK feature (initially developed by PaX Team) has the following
      benefits:
      
      1. Reduces the information that can be revealed through kernel stack leak
         bugs. The idea of erasing the thread stack at the end of syscalls is
         similar to CONFIG_PAGE_POISONING and memzero_explicit() in kernel
         crypto, which all comply with FDP_RIP.2 (Full Residual Information
         Protection) of the Common Criteria standard.
      
      2. Blocks some uninitialized stack variable attacks (e.g. CVE-2017-17712,
         CVE-2010-2963). Those kinds of bugs should eventually be eliminated by
         improving C compilers, which might take a long time.
      
      This commit introduces the code that fills the used part of the kernel
      stack with a poison value before returning to userspace. The full
      STACKLEAK feature also contains the gcc plugin, which comes in a
      separate commit.
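      
      A hedged sketch of the erasing idea (illustrative only; the poison value
      and the real stackleak implementation differ):
      
        #define STACK_POISON_SKETCH    0xCCCCCCCCCCCCCCCCUL    /* illustrative */
      
        /* Overwrite the used portion of the thread stack, from the lowest
         * address touched during the syscall up to the current stack pointer,
         * before returning to userspace. */
        static void erase_used_stack(unsigned long lowest_used,
                                     unsigned long current_sp)
        {
                unsigned long *poison = (unsigned long *)lowest_used;
      
                while ((unsigned long)poison < current_sp)
                        *poison++ = STACK_POISON_SKETCH;
        }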
      
      The STACKLEAK feature is ported from grsecurity/PaX. More information at:
        https://grsecurity.net/
        https://pax.grsecurity.net/
      
      This code is modified from Brad Spengler/PaX Team's code in the last
      public patch of grsecurity/PaX based on our understanding of the code.
      Changes or omissions from the original code are ours and don't reflect
      the original grsecurity/PaX code.
      
      Performance impact:
      
      Hardware: Intel Core i7-4770, 16 GB RAM
      
      Test #1: building the Linux kernel on a single core
              0.91% slowdown
      
      Test #2: hackbench -s 4096 -l 2000 -g 15 -f 25 -P
              4.2% slowdown
      
      So the STACKLEAK description in Kconfig includes: "The tradeoff is the
      performance impact: on a single CPU system kernel compilation sees a 1%
      slowdown, other systems and workloads may vary and you are advised to
      test this feature on your expected workload before deploying it".
      Signed-off-by: Alexander Popov <alex.popov@linux.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
  17. 03 Sep, 2018 (1 commit)
  18. 24 Jul, 2018 (1 commit)
    • x86/entry/64: Remove %ebx handling from error_entry/exit · b3681dd5
      Committed by Andy Lutomirski
      error_entry and error_exit communicate the user vs. kernel status of
      the frame using %ebx.  This is unnecessary -- the information is in
      regs->cs.  Just use regs->cs.
      
      This makes error_entry simpler and makes error_exit more robust.
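      
      A hedged sketch of the equivalent check in C (illustrative; the actual
      change is in the entry assembly):
      
        #include <linux/types.h>
        #include <asm/ptrace.h>
      
        /* Whether the interrupted frame came from userspace can be read
         * directly from the saved CS selector (the CPL is in its low two
         * bits), so no separate flag in %ebx is needed. */
        static inline bool frame_from_user(const struct pt_regs *regs)
        {
                return (regs->cs & 3) == 3;
        }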
      
      It also fixes a nasty bug.  Before all the Spectre nonsense, the
      xen_failsafe_callback entry point returned like this:
      
              ALLOC_PT_GPREGS_ON_STACK
              SAVE_C_REGS
              SAVE_EXTRA_REGS
              ENCODE_FRAME_POINTER
              jmp     error_exit
      
      And it did not go through error_entry.  This was bogus: RBX
      contained garbage, and error_exit expected a flag in RBX.
      
      Fortunately, it generally contained *nonzero* garbage, so the
      correct code path was used.  As part of the Spectre fixes, code was
      added to clear RBX to mitigate certain speculation attacks.  Now,
      depending on kernel configuration, RBX got zeroed and, when running
      some Wine workloads, the kernel crashes.  This was introduced by:
      
          commit 3ac6d8c7 ("x86/entry/64: Clear registers for exceptions/interrupts, to reduce speculation attack surface")
      
      With this patch applied, RBX is no longer needed as a flag, and the
      problem goes away.
      
      I suspect that malicious userspace could use this bug to crash the
      kernel even without the offending patch applied, though.
      
      [ Historical note: I wrote this patch as a cleanup before I was aware
        of the bug it fixed. ]
      
      [ Note to stable maintainers: this should probably get applied to all
        kernels.  If you're nervous about that, a more conservative fix to
        add xorl %ebx,%ebx; incl %ebx before the jump to error_exit should
        also fix the problem. ]
      Reported-and-tested-by: M. Vefa Bicakci <m.v.b@runbox.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Cc: xen-devel@lists.xenproject.org
      Fixes: 3ac6d8c7 ("x86/entry/64: Clear registers for exceptions/interrupts, to reduce speculation attack surface")
      Link: http://lkml.kernel.org/r/b5010a090d3586b2d6e06c7ad3ec5542d1241c45.1532282627.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  19. 03 Jul, 2018 (1 commit)
    • x86/entry/64: Add two more instruction suffixes · 6709812f
      Committed by Jan Beulich
      Sadly, contrary to what was claimed in:
      
        a368d7fd ("x86/entry/64: Add instruction suffix")
      
      ... there are two more instances which need to be adjusted.
      
      As said there, omitting suffixes from instructions in AT&T mode is bad
      practice when operand size cannot be determined by the assembler from
      register operands, and is likely going to be warned about by upstream
      gas in the future (mine does already).
      
      Add the other missing suffixes here as well.
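      
      A hedged illustration of the ambiguity (not the exact instructions touched
      by this patch):
      
        /* With only an immediate and a memory operand, the assembler cannot
         * infer the operand size, so an explicit size suffix is required. */
        static inline void set_bit9(unsigned long *word)
        {
                /* 'btsq' rather than plain 'bts': the memory operand alone does
                 * not say whether this is a 16-, 32- or 64-bit operation. */
                asm volatile("btsq $9, %0" : "+m" (*word));
        }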
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/5B3A02DD02000078001CFB78@prv1-mh.provo.novell.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  20. 21 Jun, 2018 (1 commit)
  21. 14 Jun, 2018 (1 commit)
    • Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variables · 050e9baa
      Committed by Linus Torvalds
      The changes to automatically test for working stack protector compiler
      support in the Kconfig files removed the special STACKPROTECTOR_AUTO
      option that picked the strongest stack protector that the compiler
      supported.
      
      That was all a nice cleanup - it makes no sense to have the AUTO case
      now that the Kconfig phase can just determine the compiler support
      directly.
      
      HOWEVER.
      
      It also meant that doing "make oldconfig" would now _disable_ the strong
      stackprotector if you had AUTO enabled, because in a legacy config file,
      the sane stack protector configuration would look like
      
        CONFIG_HAVE_CC_STACKPROTECTOR=y
        # CONFIG_CC_STACKPROTECTOR_NONE is not set
        # CONFIG_CC_STACKPROTECTOR_REGULAR is not set
        # CONFIG_CC_STACKPROTECTOR_STRONG is not set
        CONFIG_CC_STACKPROTECTOR_AUTO=y
      
      and when you ran this through "make oldconfig" with the Kbuild changes,
      it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had
      been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just
      CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version
      used to be disabled (because it was really enabled by AUTO), and would
      disable it in the new config, resulting in:
      
        CONFIG_HAVE_CC_STACKPROTECTOR=y
        CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
        CONFIG_CC_STACKPROTECTOR=y
        # CONFIG_CC_STACKPROTECTOR_STRONG is not set
        CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
      
      That's dangerously subtle - people could suddenly find themselves with
      the weaker stack protector setup without even realizing.
      
      The solution here is to rename not just the old REGULAR stack
      protector option, but also the strong one.  This is done by
      removing the CC_ prefix entirely for the user choices, because it really
      is not about the compiler support (the compiler support now instead
      automatically impacts _visibility_ of the options to users).
      
      This results in "make oldconfig" actually asking the user for their
      choice, so that we don't have any silent subtle security model changes.
      The end result would generally look like this:
      
        CONFIG_HAVE_CC_STACKPROTECTOR=y
        CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
        CONFIG_STACKPROTECTOR=y
        CONFIG_STACKPROTECTOR_STRONG=y
        CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
      
      where the "CC_" versions really are about internal compiler
      infrastructure, not the user selections.
      Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 10 Apr, 2018 (1 commit)
  23. 05 Apr, 2018 (1 commit)
  24. 24 Mar, 2018 (1 commit)
  25. 07 Mar, 2018 (1 commit)
    • Drivers: hv: vmbus: Implement Direct Mode for stimer0 · 248e742a
      Committed by Michael Kelley
      The 2016 version of Hyper-V offers the option to operate the guest VM's
      per-vcpu stimers in Direct Mode, which means the timer interrupts on its
      own vector rather than queueing a VMbus message. Direct Mode reduces
      timer processing overhead in both the hypervisor and the guest, and
      avoids having timer interrupts pollute the VMbus interrupt stream for
      the synthetic NIC and storage.  This patch enables Direct Mode by
      default on stimer0 when running on a version of Hyper-V that supports
      it.
      
      In preparation for upcoming support of Hyper-V on ARM64, the arch-independent
      portion of the code contains calls to routines that will be populated
      on ARM64 but are not needed and do nothing on x86.
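      
      A hedged sketch of that stub pattern (the names and the weak-symbol
      mechanism here are illustrative assumptions, not necessarily what the
      patch uses):
      
        /* Arch hooks called from the arch-independent stimer code: real per-cpu
         * IRQ plumbing on ARM64, empty on x86 where stimer0 arrives on its own
         * interrupt vector. */
        int __weak hv_setup_stimer0_irq_sketch(void)
        {
                return 0;       /* nothing to do on x86 */
        }
      
        void __weak hv_remove_stimer0_irq_sketch(void)
        {
                /* nothing to do on x86 */
        }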
      Signed-off-by: Michael Kelley <mikelley@microsoft.com>
      Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  26. 28 Feb, 2018 (1 commit)
  27. 21 Feb, 2018 (4 commits)