1. 21 Aug 2018, 1 commit
  2. 18 Aug 2018, 1 commit
    • x86/speculation/l1tf: Exempt zeroed PTEs from inversion · f19f5c49
      Committed by Sean Christopherson
      It turns out that we should *not* invert all not-present mappings,
      because the all zeroes case is obviously special.
      
      clear_page() does not undergo the XOR logic to invert the address bits,
      i.e. PTE, PMD and PUD entries that have not been individually written
      will have val=0 and so will trigger __pte_needs_invert(). As a result,
      {pte,pmd,pud}_pfn() will return the wrong PFN value, i.e. all ones
      (adjusted by the max PFN mask) instead of zero. A zeroed entry is ok
      because the page at physical address 0 is reserved early in boot
      specifically to mitigate L1TF, so explicitly exempt them from the
      inversion when reading the PFN.
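
      [ Illustration: a minimal userspace sketch of the exempt-zero check;
        PTE_PRESENT and the test values are illustrative stand-ins for the
        kernel's real pte bits. ]

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define PTE_PRESENT 0x1ULL /* illustrative present bit */

      /* Invert only non-zero, not-present values: a zeroed entry must stay
       * zero so {pte,pmd,pud}_pfn() keeps returning PFN 0 for it. */
      static bool pte_needs_invert(uint64_t val)
      {
              return val && !(val & PTE_PRESENT);
      }

      int main(void)
      {
              printf("zeroed: %d\n", pte_needs_invert(0));           /* 0 */
              printf("not present: %d\n", pte_needs_invert(0x1000)); /* 1 */
              return 0;
      }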
      
      Manifested as an unexpected mprotect(..., PROT_NONE) failure when called
      on a VMA that has VM_PFNMAP and was mmap'd to as something other than
      PROT_NONE but never used. mprotect() sends the PROT_NONE request down
      prot_none_walk(), which walks the PTEs to check the PFNs.
      prot_none_pte_entry() gets the bogus PFN from pte_pfn() and returns
      -EACCES because it thinks mprotect() is trying to adjust a high MMIO
      address.
      
      [ This is a very modified version of Sean's original patch, but all
        credit goes to Sean for doing this and also pointing out that
        sometimes the __pte_needs_invert() function only gets the protection
        bits, not the full eventual pte.  But zero remains special even in
        just protection bits, so that's ok.   - Linus ]
      
      Fixes: f22cc87f ("x86/speculation/l1tf: Invert all not present mappings")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f19f5c49
  3. 17 Aug 2018, 1 commit
  4. 16 Aug 2018, 2 commits
  5. 15 Aug 2018, 2 commits
  6. 11 Aug 2018, 1 commit
    • x86/mm/pti: Move user W+X check into pti_finalize() · d878efce
      Committed by Joerg Roedel
      The user page-table gets the updated kernel mappings in pti_finalize(),
      which runs after the RO+X permissions got applied to the kernel page-table
      in mark_readonly().
      
      But with CONFIG_DEBUG_WX enabled, the user page-table is already checked in
      mark_readonly() for insecure mappings.  This causes false-positive
      warnings, because the user page-table did not get the updated mappings yet.
      
      Move the W+X check for the user page-table into pti_finalize() after it
      updated all required mappings.
      
      [ tglx: Folded !NX supported fix ]
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Waiman Long <llong@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
      Cc: joro@8bytes.org
      Link: https://lkml.kernel.org/r/1533727000-9172-1-git-send-email-joro@8bytes.org
      d878efce
  7. 10 Aug 2018, 2 commits
  8. 09 Aug 2018, 1 commit
    • x86/mm/kmmio: Make the tracer robust against L1TF · 1063711b
      Committed by Andi Kleen
      The mmio tracer sets io mapping PTEs and PMDs to non present when enabled
      without inverting the address bits, which makes the PTE entry vulnerable
      for L1TF.
      
      Make it use the right low level macros to actually invert the address bits
      to protect against L1TF.
      
      In principle this could be avoided because MMIO tracing is not likely to be
      enabled on production machines, but the fix is straightforward and for
      consistency's sake it's better to get rid of the open coded PTE manipulation.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      1063711b
  9. 08 Aug 2018, 7 commits
    • x86/mm/pat: Make set_memory_np() L1TF safe · 958f79b9
      Committed by Andi Kleen
      set_memory_np() is used to mark kernel mappings not present, but it has
      its own open coded mechanism which does not have the L1TF protection of
      inverting the address bits.
      
      Replace the open coded PTE manipulation with the L1TF protecting low level
      PTE routines.
      
      Passes the CPA self test.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      958f79b9
    • x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert · 0768f915
      Committed by Andi Kleen
      Some cases in THP like:
        - MADV_FREE
        - mprotect
        - split
      
      temporarily mark the PMD not present to prevent races. The window for
      an L1TF attack in these contexts is very small, but it should be fixed
      for correctness' sake.
      
      Use the proper low level functions for pmd/pud_mknotpresent() to address
      this.
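
      [ Illustration: a userspace sketch of an L1TF-safe "make not present"
        transform; PAGE_PRESENT and PHYS_ADDR_MASK are illustrative, not the
        kernel's actual bit layout. ]

      #include <stdint.h>
      #include <stdio.h>

      #define PAGE_PRESENT   0x1ULL
      #define PHYS_ADDR_MASK 0x000ffffffffff000ULL /* PFN-holding bits */

      static uint64_t pte_mknotpresent(uint64_t val)
      {
              val &= ~PAGE_PRESENT;        /* clear the present bit ...   */
              return val ^ PHYS_ADDR_MASK; /* ... and invert address bits */
      }

      int main(void)
      {
              uint64_t pte = 0x1234000ULL | PAGE_PRESENT;
              printf("%#llx\n", (unsigned long long)pte_mknotpresent(pte));
              return 0;
      }
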
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      0768f915
    • x86/speculation/l1tf: Invert all not present mappings · f22cc87f
      Committed by Andi Kleen
      For kernel mappings PAGE_PROTNONE is not necessarily set for a non present
      mapping, but the inversion logic explicitly checks for !PRESENT and
      PROT_NONE.
      
      Remove the PROT_NONE check and make the inversion unconditional for all not
      present mappings.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      f22cc87f
    • x86/mm/pti: Clone kernel-image on PTE level for 32 bit · 16a3fe63
      Committed by Joerg Roedel
      On 32 bit the kernel sections are not huge-page aligned.  When we clone
      them on PMD-level we inevitably map some areas that are normal kernel
      memory and may contain secrets to user-space. To prevent that we need to
      clone the kernel-image on PTE-level for 32 bit.
      
      Also make the page-table cloning code more general so that it can handle
      PMD and PTE level cloning. This can be generalized further in the future to
      also handle clones on the P4D-level.
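
      [ Illustration: sketch of the level selection; the enum mirrors the
        generalized cloning code described above, the helper is made up for
        the example. ]

      #include <stdio.h>

      enum pti_clone_level { PTI_CLONE_PMD, PTI_CLONE_PTE };

      /* 32 bit: sections are not huge-page aligned, so clone per-PTE. */
      static enum pti_clone_level kernel_image_clone_level(int x86_32)
      {
              return x86_32 ? PTI_CLONE_PTE : PTI_CLONE_PMD;
      }

      int main(void)
      {
              printf("32-bit clone level: %d (PTE)\n",
                     kernel_image_clone_level(1));
              return 0;
      }
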
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Waiman Long <llong@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
      Cc: joro@8bytes.org
      Link: https://lkml.kernel.org/r/1533637471-30953-4-git-send-email-joro@8bytes.org
      16a3fe63
    • x86/mm/pti: Don't clear permissions in pti_clone_pmd() · 30514eff
      Committed by Joerg Roedel
      The function sets the global-bit on cloned PMD entries, which only makes
      sense when the permissions are identical between the user and the kernel
      page-table. Further, only write-permissions are cleared for entry-text and
      kernel-text sections, which are not writeable at the end of the boot
      process.
      
      The reason why this RW clearing exists is that in the early PTI
      implementations the cloned kernel areas were set up during early boot,
      before the kernel text was set to read-only, and were not touched
      afterwards.
      
      This is no longer true. The cloned areas are still set up early to get the
      entry code working for interrupts and other things, but after the kernel
      text has been set RO the clone is repeated which copies the RO PMD/PTEs
      over to the user visible clone. That means the initial clearing of the
      writable bit can be avoided.
      
      [ tglx: Amended changelog ]
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Waiman Long <llong@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
      Cc: joro@8bytes.org
      Link: https://lkml.kernel.org/r/1533637471-30953-3-git-send-email-joro@8bytes.org
      30514eff
    • x86/paravirt: Fix spectre-v2 mitigations for paravirt guests · 5800dc5c
      Committed by Peter Zijlstra
      Nadav reported that on guests we're failing to rewrite the indirect
      calls to CALLEE_SAVE paravirt functions. In particular the
      pv_queued_spin_unlock() call is left unpatched and that is all over the
      place. This obviously wrecks Spectre-v2 mitigation (for paravirt
      guests) which relies on not actually having indirect calls around.
      
      The reason is an incorrect clobber test in paravirt_patch_call(); this
      function rewrites an indirect call with a direct call to the _SAME_
      function, so there is no possible way the clobbers can be different.
      
      Therefore remove this clobber check. Also put WARNs on the other patch
      failure case (not enough room for the instruction) which I've not seen
      trigger in my (limited) testing.
      
      Three live kernel image disassemblies for lock_sock_nested (as a small
      function that illustrates the problem nicely). PRE is the current
      situation for guests, POST is with this patch applied and NATIVE is with
      or without the patch for !guests.
      
      PRE:
      
      (gdb) disassemble lock_sock_nested
      Dump of assembler code for function lock_sock_nested:
         0xffffffff817be970 <+0>:     push   %rbp
         0xffffffff817be971 <+1>:     mov    %rdi,%rbp
         0xffffffff817be974 <+4>:     push   %rbx
         0xffffffff817be975 <+5>:     lea    0x88(%rbp),%rbx
         0xffffffff817be97c <+12>:    callq  0xffffffff819f7160 <_cond_resched>
         0xffffffff817be981 <+17>:    mov    %rbx,%rdi
         0xffffffff817be984 <+20>:    callq  0xffffffff819fbb00 <_raw_spin_lock_bh>
         0xffffffff817be989 <+25>:    mov    0x8c(%rbp),%eax
         0xffffffff817be98f <+31>:    test   %eax,%eax
         0xffffffff817be991 <+33>:    jne    0xffffffff817be9ba <lock_sock_nested+74>
         0xffffffff817be993 <+35>:    movl   $0x1,0x8c(%rbp)
         0xffffffff817be99d <+45>:    mov    %rbx,%rdi
         0xffffffff817be9a0 <+48>:    callq  *0xffffffff822299e8
         0xffffffff817be9a7 <+55>:    pop    %rbx
         0xffffffff817be9a8 <+56>:    pop    %rbp
         0xffffffff817be9a9 <+57>:    mov    $0x200,%esi
         0xffffffff817be9ae <+62>:    mov    $0xffffffff817be993,%rdi
         0xffffffff817be9b5 <+69>:    jmpq   0xffffffff81063ae0 <__local_bh_enable_ip>
         0xffffffff817be9ba <+74>:    mov    %rbp,%rdi
         0xffffffff817be9bd <+77>:    callq  0xffffffff817be8c0 <__lock_sock>
         0xffffffff817be9c2 <+82>:    jmp    0xffffffff817be993 <lock_sock_nested+35>
      End of assembler dump.
      
      POST:
      
      (gdb) disassemble lock_sock_nested
      Dump of assembler code for function lock_sock_nested:
         0xffffffff817be970 <+0>:     push   %rbp
         0xffffffff817be971 <+1>:     mov    %rdi,%rbp
         0xffffffff817be974 <+4>:     push   %rbx
         0xffffffff817be975 <+5>:     lea    0x88(%rbp),%rbx
         0xffffffff817be97c <+12>:    callq  0xffffffff819f7160 <_cond_resched>
         0xffffffff817be981 <+17>:    mov    %rbx,%rdi
         0xffffffff817be984 <+20>:    callq  0xffffffff819fbb00 <_raw_spin_lock_bh>
         0xffffffff817be989 <+25>:    mov    0x8c(%rbp),%eax
         0xffffffff817be98f <+31>:    test   %eax,%eax
         0xffffffff817be991 <+33>:    jne    0xffffffff817be9ba <lock_sock_nested+74>
         0xffffffff817be993 <+35>:    movl   $0x1,0x8c(%rbp)
         0xffffffff817be99d <+45>:    mov    %rbx,%rdi
         0xffffffff817be9a0 <+48>:    callq  0xffffffff810a0c20 <__raw_callee_save___pv_queued_spin_unlock>
         0xffffffff817be9a5 <+53>:    xchg   %ax,%ax
         0xffffffff817be9a7 <+55>:    pop    %rbx
         0xffffffff817be9a8 <+56>:    pop    %rbp
         0xffffffff817be9a9 <+57>:    mov    $0x200,%esi
         0xffffffff817be9ae <+62>:    mov    $0xffffffff817be993,%rdi
         0xffffffff817be9b5 <+69>:    jmpq   0xffffffff81063aa0 <__local_bh_enable_ip>
         0xffffffff817be9ba <+74>:    mov    %rbp,%rdi
         0xffffffff817be9bd <+77>:    callq  0xffffffff817be8c0 <__lock_sock>
         0xffffffff817be9c2 <+82>:    jmp    0xffffffff817be993 <lock_sock_nested+35>
      End of assembler dump.
      
      NATIVE:
      
      (gdb) disassemble lock_sock_nested
      Dump of assembler code for function lock_sock_nested:
         0xffffffff817be970 <+0>:     push   %rbp
         0xffffffff817be971 <+1>:     mov    %rdi,%rbp
         0xffffffff817be974 <+4>:     push   %rbx
         0xffffffff817be975 <+5>:     lea    0x88(%rbp),%rbx
         0xffffffff817be97c <+12>:    callq  0xffffffff819f7160 <_cond_resched>
         0xffffffff817be981 <+17>:    mov    %rbx,%rdi
         0xffffffff817be984 <+20>:    callq  0xffffffff819fbb00 <_raw_spin_lock_bh>
         0xffffffff817be989 <+25>:    mov    0x8c(%rbp),%eax
         0xffffffff817be98f <+31>:    test   %eax,%eax
         0xffffffff817be991 <+33>:    jne    0xffffffff817be9ba <lock_sock_nested+74>
         0xffffffff817be993 <+35>:    movl   $0x1,0x8c(%rbp)
         0xffffffff817be99d <+45>:    mov    %rbx,%rdi
         0xffffffff817be9a0 <+48>:    movb   $0x0,(%rdi)
         0xffffffff817be9a3 <+51>:    nopl   0x0(%rax)
         0xffffffff817be9a7 <+55>:    pop    %rbx
         0xffffffff817be9a8 <+56>:    pop    %rbp
         0xffffffff817be9a9 <+57>:    mov    $0x200,%esi
         0xffffffff817be9ae <+62>:    mov    $0xffffffff817be993,%rdi
         0xffffffff817be9b5 <+69>:    jmpq   0xffffffff81063ae0 <__local_bh_enable_ip>
         0xffffffff817be9ba <+74>:    mov    %rbp,%rdi
         0xffffffff817be9bd <+77>:    callq  0xffffffff817be8c0 <__lock_sock>
         0xffffffff817be9c2 <+82>:    jmp    0xffffffff817be993 <lock_sock_nested+35>
      End of assembler dump.
      
      
      Fixes: 63f70270 ("[PATCH] i386: PARAVIRT: add common patching machinery")
      Fixes: 3010a066 ("x86/paravirt, objtool: Annotate indirect calls")
      Reported-by: Nadav Amit <namit@vmware.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: stable@vger.kernel.org
      5800dc5c
    • x86/mm/pti: Fix 32 bit PCID check · 88c6f8a3
      Committed by Joerg Roedel
      The check uses the wrong operator and causes false positive
      warnings in the kernel log on some systems.
      
      Fixes: 5e810595 ("x86/mm/pti: Add Warning when booting on a PCID capable CPU")
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Waiman Long <llong@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
      Cc: joro@8bytes.org
      Link: https://lkml.kernel.org/r/1533637471-30953-2-git-send-email-joro@8bytes.org
      88c6f8a3
  10. 07 Aug 2018, 5 commits
    • xen: don't use privcmd_call() from xen_mc_flush() · cd913922
      Committed by Juergen Gross
      Using privcmd_call() for a singleton multicall seems to be wrong, as
      privcmd_call() is using stac()/clac() to enable hypervisor access to
      Linux user space.
      
      Even if currently not a problem (pv domains can't use SMAP while HVM
      and PVH domains can't use multicalls) things might change when
      PVH dom0 support is added to the kernel.
      Reported-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      cd913922
    • cpu/hotplug: Fix SMT supported evaluation · bc2d8d26
      Committed by Thomas Gleixner
      Josh reported that the late SMT evaluation in cpu_smt_state_init() sets
      cpu_smt_control to CPU_SMT_NOT_SUPPORTED in case that 'nosmt' was supplied
      on the kernel command line as it cannot differentiate between SMT disabled
      by BIOS and SMT soft disable via 'nosmt'. That wrecks the state and
      makes the sysfs interface unusable.
      
      Rework this so that during bringup of the non boot CPUs the availability of
      SMT is determined in cpu_smt_allowed(). If a newly booted CPU is not a
      'primary' thread then set the local cpu_smt_available marker and evaluate
      this explicitly right after the initial SMP bringup has finished.
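
      [ Illustration: a toy single-image model of the reworked detection;
        topology_is_primary_thread() is faked here with even/odd CPU numbers,
        the kernel uses real topology data. ]

      #include <stdbool.h>
      #include <stdio.h>

      static bool cpu_smt_available;

      static bool topology_is_primary_thread(unsigned int cpu)
      {
              return (cpu & 1) == 0; /* fake: even CPUs are primary */
      }

      static void cpu_bringup(unsigned int cpu)
      {
              if (!topology_is_primary_thread(cpu))
                      cpu_smt_available = true; /* a sibling came online */
      }

      int main(void)
      {
              for (unsigned int cpu = 0; cpu < 4; cpu++)
                      cpu_bringup(cpu);
              /* Evaluated once, right after initial SMP bringup. */
              printf("SMT %ssupported\n", cpu_smt_available ? "" : "not ");
              return 0;
      }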
      
      SMT evaluation on x86 is a trainwreck as the firmware has all the
      information _before_ booting the kernel, but there is no interface to query
      it.
      
      Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      Reported-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      bc2d8d26
    • crypto: x86/aegis,morus - Fix and simplify CPUID checks · 877ccce7
      Committed by Ondrej Mosnacek
      It turns out I had misunderstood how the x86_match_cpu() function works.
      It evaluates a logical OR of the matching conditions, not logical AND.
      This caused the CPU feature checks for AEGIS to pass even if only SSE2
      (but not AES-NI) was supported (or vice versa), leading to potential
      crashes if something tried to use the registered algs.
      
      This patch switches the checks to a simpler method that is used e.g. in
      the Camellia x86 code.
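
      [ Illustration: the OR-vs-AND difference in a standalone sketch; the
        feature flags and helpers are made up for the example. ]

      #include <stdbool.h>
      #include <stdio.h>

      enum { FEAT_SSE2, FEAT_AESNI, NR_FEATS };

      static bool cpu_has[NR_FEATS] = { [FEAT_SSE2] = true }; /* no AES-NI */

      /* Buggy shape: an OR across the match table passes with only one of
       * the two required features present. */
      static bool init_check_or(void)
      {
              return cpu_has[FEAT_SSE2] || cpu_has[FEAT_AESNI];
      }

      /* Fixed shape: require every feature explicitly. */
      static bool init_check_and(void)
      {
              return cpu_has[FEAT_SSE2] && cpu_has[FEAT_AESNI];
      }

      int main(void)
      {
              printf("OR: %d (wrongly passes), AND: %d\n",
                     init_check_or(), init_check_and());
              return 0;
      }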
      
      The patch also removes the MODULE_DEVICE_TABLE declarations which
      actually seem to cause the modules to be auto-loaded at boot, which is
      not desired. The crypto API on-demand module loading is sufficient.
      
      Fixes: 1d373d4e ("crypto: x86 - Add optimized AEGIS implementations")
      Fixes: 6ecc9d9f ("crypto: x86 - Add optimized MORUS implementations")
      Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
      Tested-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      877ccce7
    • xen/pv: Call get_cpu_address_sizes to set x86_virt/phys_bits · 405c018a
      Committed by M. Vefa Bicakci
      Commit d94a155c ("x86/cpu: Prevent cpuinfo_x86::x86_phys_bits
      adjustment corruption") has moved the query and calculation of the
      x86_virt_bits and x86_phys_bits fields of the cpuinfo_x86 struct
      from the get_cpu_cap function to a new function named
      get_cpu_address_sizes.
      
      One of the call sites related to Xen PV VMs was unfortunately missed
      in the aforementioned commit. This prevents successful boot-up of
      kernel versions 4.17 and up in Xen PV VMs if CONFIG_DEBUG_VIRTUAL
      is enabled, due to the following code path:
      
        enlighten_pv.c::xen_start_kernel
          mmu_pv.c::xen_reserve_special_pages
            page.h::__pa
              physaddr.c::__phys_addr
                physaddr.h::phys_addr_valid
      
      phys_addr_valid uses boot_cpu_data.x86_phys_bits to validate physical
      addresses. boot_cpu_data.x86_phys_bits is no longer populated before
      the call to xen_reserve_special_pages due to the aforementioned commit
      though, so the validation performed by phys_addr_valid fails, which
      causes __phys_addr to trigger a BUG, preventing boot-up.
      Signed-off-by: M. Vefa Bicakci <m.v.b@runbox.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: xen-devel@lists.xenproject.org
      Cc: x86@kernel.org
      Cc: stable@vger.kernel.org # for v4.17 and up
      Fixes: d94a155c ("x86/cpu: Prevent cpuinfo_x86::x86_phys_bits adjustment corruption")
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      405c018a
    • x86/mm/init: Remove freed kernel image areas from alias mapping · c40a56a7
      Committed by Dave Hansen
      The kernel image is mapped into two places in the virtual address space
      (addresses without KASLR, of course):
      
      	1. The kernel direct map (0xffff880000000000)
      	2. The "high kernel map" (0xffffffff81000000)
      
      We actually execute out of #2.  If we get the address of a kernel symbol,
      it points to #2, but almost all physical-to-virtual translations point to
      #1.
      Parts of the "high kernel map" alias are mapped in the userspace page
      tables with the Global bit for performance reasons.  The parts that we map
      to userspace do not (er, should not) have secrets. When PTI is enabled then
      the global bit is usually not set in the high mapping and just used to
      compensate for poor performance on systems which lack PCID.
      
      This is fine, except that some areas in the kernel image that are adjacent
      to the non-secret-containing areas are unused holes.  We free these holes
      back into the normal page allocator and reuse them as normal kernel memory.
      The memory will, of course, get *used* via the normal map, but the alias
      mapping is kept.
      
      This otherwise unused alias mapping of the holes will, by default, keep the
      Global bit, be mapped out to userspace, and be vulnerable to Meltdown.
      
      Remove the alias mapping of these pages entirely.  This is likely to
      fracture the 2M page mapping the kernel image near these areas, but this
      should affect a minority of the area.
      
      The pageattr code changes *all* aliases mapping the physical pages that it
      operates on (by default).  We only want to modify a single alias, so we
      need to tweak its behavior.
      
      This unmapping behavior is currently dependent on PTI being in place.
      Going forward, we should at least consider doing this for all
      configurations.  Having an extra read-write alias for memory is not exactly
      ideal for debugging things like random memory corruption and this does
      undercut features like DEBUG_PAGEALLOC or future work like eXclusive Page
      Frame Ownership (XPFO).
      
      Before this patch:
      
      current_kernel:---[ High Kernel Mapping ]---
      current_kernel-0xffffffff80000000-0xffffffff81000000          16M                               pmd
      current_kernel-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
      current_kernel-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
      current_kernel-0xffffffff81e11000-0xffffffff82000000        1980K     RW                     NX pte
      current_kernel-0xffffffff82000000-0xffffffff82600000           6M     ro         PSE     GLB NX pmd
      current_kernel-0xffffffff82600000-0xffffffff82c00000           6M     RW         PSE         NX pmd
      current_kernel-0xffffffff82c00000-0xffffffff82e00000           2M     RW                     NX pte
      current_kernel-0xffffffff82e00000-0xffffffff83200000           4M     RW         PSE         NX pmd
      current_kernel-0xffffffff83200000-0xffffffffa0000000         462M                               pmd
      
        current_user:---[ High Kernel Mapping ]---
        current_user-0xffffffff80000000-0xffffffff81000000          16M                               pmd
        current_user-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
        current_user-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
        current_user-0xffffffff81e11000-0xffffffff82000000        1980K     RW                     NX pte
        current_user-0xffffffff82000000-0xffffffff82600000           6M     ro         PSE     GLB NX pmd
        current_user-0xffffffff82600000-0xffffffffa0000000         474M                               pmd
      
      After this patch:
      
      current_kernel:---[ High Kernel Mapping ]---
      current_kernel-0xffffffff80000000-0xffffffff81000000          16M                               pmd
      current_kernel-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
      current_kernel-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
      current_kernel-0xffffffff81e11000-0xffffffff82000000        1980K                               pte
      current_kernel-0xffffffff82000000-0xffffffff82400000           4M     ro         PSE     GLB NX pmd
      current_kernel-0xffffffff82400000-0xffffffff82488000         544K     ro                     NX pte
      current_kernel-0xffffffff82488000-0xffffffff82600000        1504K                               pte
      current_kernel-0xffffffff82600000-0xffffffff82c00000           6M     RW         PSE         NX pmd
      current_kernel-0xffffffff82c00000-0xffffffff82c0d000          52K     RW                     NX pte
      current_kernel-0xffffffff82c0d000-0xffffffff82dc0000        1740K                               pte
      
        current_user:---[ High Kernel Mapping ]---
        current_user-0xffffffff80000000-0xffffffff81000000          16M                               pmd
        current_user-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
        current_user-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
        current_user-0xffffffff81e11000-0xffffffff82000000        1980K                               pte
        current_user-0xffffffff82000000-0xffffffff82400000           4M     ro         PSE     GLB NX pmd
        current_user-0xffffffff82400000-0xffffffff82488000         544K     ro                     NX pte
        current_user-0xffffffff82488000-0xffffffff82600000        1504K                               pte
        current_user-0xffffffff82600000-0xffffffffa0000000         474M                               pmd
      
      [ tglx: Do not unmap on 32bit as there is only one mapping ]
      
      Fixes: 0f561fce ("x86/pti: Enable global pages for shared areas")
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Link: https://lkml.kernel.org/r/20180802225831.5F6A2BFC@viggo.jf.intel.com
      c40a56a7
  11. 06 Aug 2018, 5 commits
    • x86: vdso: Use $LD instead of $CC to link · 379d98dd
      Committed by Alistair Strachan
      The vdso{32,64}.so can fail to link with CC=clang when clang tries to find
      a suitable GCC toolchain to link these libraries with.
      
      /usr/bin/ld: arch/x86/entry/vdso/vclock_gettime.o:
        access beyond end of merged section (782)
      
      This happens because the host environment leaked into the cross compiler
      environment due to the way clang searches for suitable GCC toolchains.
      
      Clang is a retargetable compiler, and each invocation of it must provide
      --target=<something> --gcc-toolchain=<something> to allow it to find the
      correct binutils for cross compilation. These flags had been added to
      KBUILD_CFLAGS, but the vdso code uses CC and not KBUILD_CFLAGS (for various
      reasons) which breaks clang's ability to find the correct linker when cross
      compiling.
      
      Most of the time this goes unnoticed because the host linker is new enough
      to work anyway, or is incompatible and skipped, but this cannot be reliably
      assumed.
      
      This change alters the vdso makefile to just use LD directly, which
      bypasses clang and thus the searching problem. The makefile will just use
      ${CROSS_COMPILE}ld instead, which is always what we want. This matches the
      method used to link vmlinux.
      
      This drops references to DISABLE_LTO; this option doesn't seem to be set
      anywhere, and not knowing what its possible values are, it's not clear how
      to convert it from CC to LD flag.
      Signed-off-by: Alistair Strachan <astrachan@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: kernel-team@android.com
      Cc: joel@joelfernandes.org
      Cc: Andi Kleen <andi.kleen@intel.com>
      Link: https://lkml.kernel.org/r/20180803173931.117515-1-astrachan@google.com
      379d98dd
    • x86/irqflags: Provide a declaration for native_save_fl · 208cbb32
      Committed by Nick Desaulniers
      It was reported that commit d0a8d937 causes users of gcc < 4.9
      to observe -Werror=missing-prototypes errors.
      
      Indeed, it seems that:
      extern inline unsigned long native_save_fl(void) { return 0; }
      
      compiled with -Werror=missing-prototypes produces this warning in gcc <
      4.9, but not gcc >= 4.9.
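
      [ Illustration: the shape of the fix; a preceding declaration keeps
        gcc < 4.9 from warning about the extern inline definition (stub body
        as in the report above). ]

      extern unsigned long native_save_fl(void);

      extern inline unsigned long native_save_fl(void)
      {
              return 0;
      }

      int main(void)
      {
              return (int)native_save_fl();
      }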
      
      Fixes: d0a8d937 ("x86/paravirt: Make native_save_fl() extern inline")
      Reported-by: David Laight <david.laight@aculab.com>
      Reported-by: Jean Delvare <jdelvare@suse.de>
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Cc: jgross@suse.com
      Cc: kstewart@linuxfoundation.org
      Cc: gregkh@linuxfoundation.org
      Cc: boris.ostrovsky@oracle.com
      Cc: astrachan@google.com
      Cc: mka@chromium.org
      Cc: arnd@arndb.de
      Cc: tstellar@redhat.com
      Cc: sedat.dilek@gmail.com
      Cc: David.Laight@aculab.com
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180803170550.164688-1-ndesaulniers@google.com
      208cbb32
    • x86/mm/init: Add helper for freeing kernel image pages · 6ea2738e
      Committed by Dave Hansen
      When chunks of the kernel image are freed, free_init_pages() is used
      directly.  Consolidate the three sites that do this.  Also update the
      string to give an incrementally better description of that memory versus
      what was there before.
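
      [ Illustration: the consolidation in miniature; the helper name follows
        the commit, the bodies are stand-ins. ]

      #include <stdio.h>

      static void free_init_pages(const char *what, unsigned long begin,
                                  unsigned long end)
      {
              printf("freeing %s memory: %#lx-%#lx\n", what, begin, end);
      }

      /* One wrapper for the three kernel-image call sites. */
      static void free_kernel_image_pages(void *begin, void *end)
      {
              free_init_pages("unused kernel image",
                              (unsigned long)begin, (unsigned long)end);
      }

      int main(void)
      {
              static char hole[4096];
              free_kernel_image_pages(hole, hole + sizeof(hole));
              return 0;
      }
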
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@google.com
      Cc: aarcange@redhat.com
      Cc: jgross@suse.com
      Cc: jpoimboe@redhat.com
      Cc: gregkh@linuxfoundation.org
      Cc: peterz@infradead.org
      Cc: hughd@google.com
      Cc: torvalds@linux-foundation.org
      Cc: bp@alien8.de
      Cc: luto@kernel.org
      Cc: ak@linux.intel.com
      Cc: Kees Cook <keescook@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180802225829.FE0E32EA@viggo.jf.intel.com
      6ea2738e
    • x86/mm/init: Pass unconverted symbol addresses to free_init_pages() · 9f515cdb
      Committed by Dave Hansen
      The x86 code has several places where it frees parts of kernel image:
      
       1. Unused SMP alternative
       2. __init code
       3. The hole between text and rodata
       4. The hole between rodata and data
      
      We call free_init_pages() to do this.  Strangely, we convert the symbol
      addresses to kernel direct map addresses in some cases (#3, #4) but not
      others (#1, #2).
      
      The virt_to_page() and the other code in free_reserved_area() now works
      fine for symbol addresses on x86, so don't bother converting the
      addresses to direct map addresses before freeing them.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@google.com
      Cc: aarcange@redhat.com
      Cc: jgross@suse.com
      Cc: jpoimboe@redhat.com
      Cc: gregkh@linuxfoundation.org
      Cc: peterz@infradead.org
      Cc: hughd@google.com
      Cc: torvalds@linux-foundation.org
      Cc: bp@alien8.de
      Cc: luto@kernel.org
      Cc: ak@linux.intel.com
      Cc: Kees Cook <keescook@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180802225828.89B2D0E2@viggo.jf.intel.com
      9f515cdb
    • x86/mm/pti: Clear Global bit more aggressively · eac7073a
      Committed by Dave Hansen
      The kernel image starts out with the Global bit set across the entire
      kernel image.  The bit is cleared with set_memory_nonglobal() in the
      configurations with PCIDs where the performance benefits of the Global bit
      are not needed.
      
      However, this is fragile.  It means that we are stuck opting *out* of the
      less-secure (Global bit set) configuration, which seems backwards.  Let's
      start more secure (Global bit clear) and then let things opt back in if
      they want performance, or are truly mapping common data between kernel and
      userspace.
      
      This fixes a bug.  Before this patch, there are areas that are unmapped
      from the user page tables (like everything above 0xffffffff82600000 in
      the example below).  These have the hallmark of being a wrong Global area:
      they are not identical in the 'current_kernel' and 'current_user' page
      table dumps.  They are also read-write, which means they're much more
      likely to contain secrets.
      
      Before this patch:
      
      current_kernel:---[ High Kernel Mapping ]---
      current_kernel-0xffffffff80000000-0xffffffff81000000          16M                               pmd
      current_kernel-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
      current_kernel-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
      current_kernel-0xffffffff81e11000-0xffffffff82000000        1980K     RW                 GLB NX pte
      current_kernel-0xffffffff82000000-0xffffffff82600000           6M     ro         PSE     GLB NX pmd
      current_kernel-0xffffffff82600000-0xffffffff82c00000           6M     RW         PSE     GLB NX pmd
      current_kernel-0xffffffff82c00000-0xffffffff82e00000           2M     RW                 GLB NX pte
      current_kernel-0xffffffff82e00000-0xffffffff83200000           4M     RW         PSE     GLB NX pmd
      current_kernel-0xffffffff83200000-0xffffffffa0000000         462M                               pmd
      
       current_user:---[ High Kernel Mapping ]---
       current_user-0xffffffff80000000-0xffffffff81000000          16M                               pmd
       current_user-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
       current_user-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
       current_user-0xffffffff81e11000-0xffffffff82000000        1980K     RW                 GLB NX pte
       current_user-0xffffffff82000000-0xffffffff82600000           6M     ro         PSE     GLB NX pmd
       current_user-0xffffffff82600000-0xffffffffa0000000         474M                               pmd
      
      After this patch:
      
      current_kernel:---[ High Kernel Mapping ]---
      current_kernel-0xffffffff80000000-0xffffffff81000000          16M                               pmd
      current_kernel-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
      current_kernel-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
      current_kernel-0xffffffff81e11000-0xffffffff82000000        1980K     RW                     NX pte
      current_kernel-0xffffffff82000000-0xffffffff82600000           6M     ro         PSE     GLB NX pmd
      current_kernel-0xffffffff82600000-0xffffffff82c00000           6M     RW         PSE         NX pmd
      current_kernel-0xffffffff82c00000-0xffffffff82e00000           2M     RW                     NX pte
      current_kernel-0xffffffff82e00000-0xffffffff83200000           4M     RW         PSE         NX pmd
      current_kernel-0xffffffff83200000-0xffffffffa0000000         462M                               pmd
      
        current_user:---[ High Kernel Mapping ]---
        current_user-0xffffffff80000000-0xffffffff81000000          16M                               pmd
        current_user-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
        current_user-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
        current_user-0xffffffff81e11000-0xffffffff82000000        1980K     RW                     NX pte
        current_user-0xffffffff82000000-0xffffffff82600000           6M     ro         PSE     GLB NX pmd
        current_user-0xffffffff82600000-0xffffffffa0000000         474M                               pmd
      
      Fixes: 0f561fce ("x86/pti: Enable global pages for shared areas")
      Reported-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@google.com
      Cc: aarcange@redhat.com
      Cc: jgross@suse.com
      Cc: jpoimboe@redhat.com
      Cc: gregkh@linuxfoundation.org
      Cc: peterz@infradead.org
      Cc: torvalds@linux-foundation.org
      Cc: bp@alien8.de
      Cc: luto@kernel.org
      Cc: ak@linux.intel.com
      Cc: Kees Cook <keescook@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180802225825.A100C071@viggo.jf.intel.com
      eac7073a
  12. 05 Aug 2018, 11 commits
    • KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry · 5b76a3cf
      Committed by Paolo Bonzini
      When nested virtualization is in use, VMENTER operations from the nested
      hypervisor into the nested guest will always be processed by the bare metal
      hypervisor, and KVM's "conditional cache flushes" mode in particular does a
      flush on nested vmentry.  Therefore, include the "skip L1D flush on
      vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting.
      
      Add the relevant Documentation.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      5b76a3cf
    • x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry · 8e0b2b91
      Committed by Paolo Bonzini
      Bit 3 of ARCH_CAPABILITIES tells a hypervisor that L1D flush on vmentry is
      not needed.  Add a new value to enum vmx_l1d_flush_state, which is used
      either if there is no L1TF bug at all, or if bit 3 is set in ARCH_CAPABILITIES.
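
      [ Illustration: testing bit 3 of the MSR value; the macro name is
        illustrative, the MSR read itself is assumed to happen elsewhere. ]

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define ARCH_CAP_SKIP_L1DFL_VMENTRY (1ULL << 3)

      static bool vmentry_l1d_flush_not_required(uint64_t arch_cap_msr)
      {
              return arch_cap_msr & ARCH_CAP_SKIP_L1DFL_VMENTRY;
      }

      int main(void)
      {
              printf("%d\n", vmentry_l1d_flush_not_required(1ULL << 3));
              return 0;
      }
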
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      8e0b2b91
    • x86/speculation: Simplify sysfs report of VMX L1TF vulnerability · ea156d19
      Committed by Paolo Bonzini
      Three changes to the content of the sysfs file:
      
       - If EPT is disabled, L1TF cannot be exploited even across threads on the
         same core, and SMT is irrelevant.
      
       - If mitigation is completely disabled, and SMT is enabled, print "vulnerable"
         instead of "vulnerable, SMT vulnerable"
      
       - Reorder the two parts so that the main vulnerability state comes first
         and the detail on SMT is second.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      ea156d19
    • x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr() · 18b57ce2
      Committed by Nicolai Stange
      For VMEXITs caused by external interrupts, vmx_handle_external_intr()
      indirectly calls into the interrupt handlers through the host's IDT.
      
      It follows that these interrupts get accounted for in the
      kvm_cpu_l1tf_flush_l1d per-cpu flag.
      
      The subsequently executed vmx_l1d_flush() will thus be aware that some
      interrupts have happened and conduct an L1d flush anyway.
      
      Setting l1tf_flush_l1d from vmx_handle_external_intr() isn't needed
      anymore. Drop it.
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      18b57ce2
    • x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d · ffcba43f
      Committed by Nicolai Stange
      The last missing piece to having vmx_l1d_flush() take interrupts after
      VMEXIT into account is to set the kvm_cpu_l1tf_flush_l1d per-cpu flag on
      irq entry.
      
      Issue calls to kvm_set_cpu_l1tf_flush_l1d() from entering_irq(),
      ipi_entering_ack_irq(), smp_reschedule_interrupt() and
      uv_bau_message_interrupt().
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      ffcba43f
    • x86: Don't include linux/irq.h from asm/hardirq.h · 447ae316
      Committed by Nicolai Stange
      The next patch in this series will have to make the definition of
      irq_cpustat_t available to entering_irq().
      
      Inclusion of asm/hardirq.h into asm/apic.h would cause circular header
      dependencies like
      
        asm/smp.h
          asm/apic.h
            asm/hardirq.h
              linux/irq.h
                linux/topology.h
                  linux/smp.h
                    asm/smp.h
      
      or
      
        linux/gfp.h
          linux/mmzone.h
            asm/mmzone.h
              asm/mmzone_64.h
                asm/smp.h
                  asm/apic.h
                    asm/hardirq.h
                      linux/irq.h
                        linux/irqdesc.h
                          linux/kobject.h
                            linux/sysfs.h
                              linux/kernfs.h
                                linux/idr.h
                                  linux/gfp.h
      
      and others.
      
      This causes compilation errors because of the header guards becoming
      effective in the second inclusion: symbols/macros that had been defined
      before wouldn't be available to intermediate headers in the #include chain
      anymore.
      
      A possible workaround would be to move the definition of irq_cpustat_t
      into its own header and include that from both, asm/hardirq.h and
      asm/apic.h.
      
      However, this wouldn't solve the real problem, namely asm/hardirq.h
      unnecessarily pulling in all the linux/irq.h cruft: nothing in
      asm/hardirq.h itself requires it. Also, note that there are some other
      archs, like e.g. arm64, which don't have that #include in their
      asm/hardirq.h.
      
      Remove the linux/irq.h #include from x86' asm/hardirq.h.
      
      Fix resulting compilation errors by adding appropriate #includes to *.c
      files as needed.
      
      Note that some of these *.c files could be cleaned up a bit wrt. to their
      set of #includes, but that should better be done from separate patches, if
      at all.
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      447ae316
    • x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d · 45b575c0
      Committed by Nicolai Stange
      Part of the L1TF mitigation for vmx includes flushing the L1D cache upon
      VMENTRY.
      
      L1D flushes are costly and two modes of operations are provided to users:
      "always" and the more selective "conditional" mode.
      
      If operating in the latter, the cache would get flushed only if a host side
      code path considered unconfined had been traversed. "Unconfined" in this
      context means that it might have pulled in sensitive data like user data
      or kernel crypto keys.
      
      The need for L1D flushes is tracked by means of the per-vcpu flag
      l1tf_flush_l1d. KVM exit handlers considered unconfined set it. A
      vmx_l1d_flush() subsequently invoked before the next VMENTER will conduct a
      L1d flush based on its value and reset that flag again.
      
      Currently, interrupts delivered "normally" while in root operation between
      VMEXIT and VMENTER are not taken into account. Part of the reason is that
      these don't leave any traces and thus, the vmx code is unable to tell if
      any such interrupt has happened.
      
      As proposed by Paolo Bonzini, prepare for tracking all interrupts by
      introducing a new per-cpu flag, "kvm_cpu_l1tf_flush_l1d". It will be in
      strong analogy to the per-vcpu ->l1tf_flush_l1d.
      
      A later patch will make interrupt handlers set it.
      
      For the sake of cache locality, group kvm_cpu_l1tf_flush_l1d into x86'
      per-cpu irq_cpustat_t as suggested by Peter Zijlstra.
      
      Provide the helpers kvm_set_cpu_l1tf_flush_l1d(),
      kvm_clear_cpu_l1tf_flush_l1d() and kvm_get_cpu_l1tf_flush_l1d(). Make them
      trivial resp. non-existent for !CONFIG_KVM_INTEL as appropriate.
      
      Let vmx_l1d_flush() handle kvm_cpu_l1tf_flush_l1d in the same way as
      l1tf_flush_l1d.
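
      [ Illustration: a single-CPU toy model of the flag and its helpers; the
        kernel keeps the flag per-cpu inside irq_cpustat_t. ]

      #include <stdbool.h>
      #include <stdio.h>

      static struct {
              unsigned int __softirq_pending;
              bool kvm_cpu_l1tf_flush_l1d;
      } irq_stat;

      static void kvm_set_cpu_l1tf_flush_l1d(void)
      {
              irq_stat.kvm_cpu_l1tf_flush_l1d = true;
      }

      static bool kvm_get_cpu_l1tf_flush_l1d(void)
      {
              return irq_stat.kvm_cpu_l1tf_flush_l1d;
      }

      static void kvm_clear_cpu_l1tf_flush_l1d(void)
      {
              irq_stat.kvm_cpu_l1tf_flush_l1d = false;
      }

      int main(void)
      {
              kvm_set_cpu_l1tf_flush_l1d();     /* e.g. from irq entry */
              if (kvm_get_cpu_l1tf_flush_l1d()) /* before next VMENTER */
                      puts("conditional mode: flush L1D");
              kvm_clear_cpu_l1tf_flush_l1d();
              return 0;
      }
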
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      45b575c0
    • x86/irq: Demote irq_cpustat_t::__softirq_pending to u16 · 9aee5f8a
      Committed by Nicolai Stange
      An upcoming patch will extend KVM's L1TF mitigation in conditional mode
      to also cover interrupts after VMEXITs. For tracking those, stores to a
      new per-cpu flag from interrupt handlers will become necessary.
      
      In order to improve cache locality, this new flag will be added to x86's
      irq_cpustat_t.
      
      Make some space available there by shrinking the ->softirq_pending bitfield
      from 32 to 16 bits: the number of bits actually used is only NR_SOFTIRQS,
      i.e. 10.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      9aee5f8a
    • x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush() · 5b6ccc6c
      Committed by Nicolai Stange
      Currently, vmx_vcpu_run() checks if l1tf_flush_l1d is set and invokes
      vmx_l1d_flush() if so.
      
      This test is unnecessary for the "always flush L1D" mode.
      
      Move the check to vmx_l1d_flush()'s conditional mode code path.
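
      [ Illustration: where the check lands after the move; flags and control
        flow are simplified stand-ins for the kernel's. ]

      #include <stdbool.h>
      #include <stdio.h>

      static bool vmx_l1d_flush_always; /* "always" mode selected   */
      static bool l1tf_flush_l1d;       /* simplified per-vcpu flag */

      static void vmx_l1d_flush(void)
      {
              if (!vmx_l1d_flush_always) {
                      if (!l1tf_flush_l1d) /* the moved check */
                              return;
                      l1tf_flush_l1d = false;
              }
              puts("flushing L1D");
      }

      int main(void)
      {
              l1tf_flush_l1d = true;
              vmx_l1d_flush(); /* flushes, clears the flag    */
              vmx_l1d_flush(); /* skipped in conditional mode */
              return 0;
      }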
      
      Notes:
      - vmx_l1d_flush() is likely to get inlined anyway and thus, there's no
        extra function call.
        
      - This inverts the (static) branch prediction, but there hadn't been any
        explicit likely()/unlikely() annotations before and so it stays as is.
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      5b6ccc6c
    • x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond' · 427362a1
      Committed by Nicolai Stange
      The vmx_l1d_flush_always static key is only ever evaluated if
      vmx_l1d_should_flush is enabled. In that case however, there are only two
      L1d flushing modes possible: "always" and "conditional".
      
      The "conditional" mode's implementation tends to require more sophisticated
      logic than the "always" mode.
      
      Avoid inverted logic by replacing the 'vmx_l1d_flush_always' static key
      with a 'vmx_l1d_flush_cond' one.
      
      There is no change in functionality.
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      427362a1
    • x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush() · 379fd0c7
      Committed by Nicolai Stange
      vmx_l1d_flush() gets invoked only if l1tf_flush_l1d is true. There's no
      point in setting l1tf_flush_l1d to true from there again.
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      379fd0c7
  13. 03 Aug 2018, 1 commit
    • x86/intel_rdt: Disable PMU access · 4a7a54a5
      Committed by Thomas Gleixner
      Peter is objecting to the direct PMU access in RDT. Right now the PMU usage
      is broken anyway as it is not coordinated with perf.
      
      Until this discussion is settled, disable the PMU mechanics by simply
      rejecting the type '2' measurement in the resctrl file.
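
      [ Illustration: the shape of the rejection; the function name and
        selector handling are made up around the '2' named above. ]

      #include <errno.h>
      #include <stdio.h>

      static int measure_trigger(int sel)
      {
              if (sel == 2)
                      return -EINVAL; /* PMU-based measurement disabled */
              return 0;
      }

      int main(void)
      {
              printf("sel=1: %d, sel=2: %d\n",
                     measure_trigger(1), measure_trigger(2));
              return 0;
      }
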
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: fenghua.yu@intel.com
      Cc: tony.luck@intel.com
      Cc: vikas.shivappa@linux.intel.com
      Cc: gavin.hindman@intel.com
      Cc: jithu.joseph@intel.com
      Cc: hpa@zytor.com
      4a7a54a5