1. 21 6月, 2018 3 次提交
    • L
      x86/speculation/l1tf: Protect swap entries against L1TF · 2f22b4cd
      Linus Torvalds 提交于
      With L1 terminal fault the CPU speculates into unmapped PTEs, and resulting
      side effects allow to read the memory the PTE is pointing too, if its
      values are still in the L1 cache.
      
      For swapped out pages Linux uses unmapped PTEs and stores a swap entry into
      them.
      
      To protect against L1TF it must be ensured that the swap entry is not
      pointing to valid memory, which requires setting higher bits (between bit
      36 and bit 45) that are inside the CPUs physical address space, but outside
      any real memory.
      
      To do this invert the offset to make sure the higher bits are always set,
      as long as the swap file is not too big.
      
      Note there is no workaround for 32bit !PAE, or on systems which have more
      than MAX_PA/2 worth of memory. The later case is very unlikely to happen on
      real systems.
      
      [AK: updated description and minor tweaks by. Split out from the original
           patch ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: NAndi Kleen <ak@linux.intel.com>
      Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDave Hansen <dave.hansen@intel.com>
      
      2f22b4cd
    • L
      x86/speculation/l1tf: Change order of offset/type in swap entry · bcd11afa
      Linus Torvalds 提交于
      If pages are swapped out, the swap entry is stored in the corresponding
      PTE, which has the Present bit cleared. CPUs vulnerable to L1TF speculate
      on PTE entries which have the present bit set and would treat the swap
      entry as phsyical address (PFN). To mitigate that the upper bits of the PTE
      must be set so the PTE points to non existent memory.
      
      The swap entry stores the type and the offset of a swapped out page in the
      PTE. type is stored in bit 9-13 and offset in bit 14-63. The hardware
      ignores the bits beyond the phsyical address space limit, so to make the
      mitigation effective its required to start 'offset' at the lowest possible
      bit so that even large swap offsets do not reach into the physical address
      space limit bits.
      
      Move offset to bit 9-58 and type to bit 59-63 which are the bits that
      hardware generally doesn't care about.
      
      That, in turn, means that if you on desktop chip with only 40 bits of
      physical addressing, now that the offset starts at bit 9, there needs to be
      30 bits of offset actually *in use* until bit 39 ends up being set, which
      means when inverted it will again point into existing memory.
      
      So that's 4 terabyte of swap space (because the offset is counted in pages,
      so 30 bits of offset is 42 bits of actual coverage). With bigger physical
      addressing, that obviously grows further, until the limit of the offset is
      hit (at 50 bits of offset - 62 bits of actual swap file coverage).
      
      This is a preparatory change for the actual swap entry inversion to protect
      against L1TF.
      
      [ AK: Updated description and minor tweaks. Split into two parts ]
      [ tglx: Massaged changelog ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: NAndi Kleen <ak@linux.intel.com>
      Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDave Hansen <dave.hansen@intel.com>
      
      bcd11afa
    • A
      x86/speculation/l1tf: Increase 32bit PAE __PHYSICAL_PAGE_SHIFT · 50896e18
      Andi Kleen 提交于
      L1 Terminal Fault (L1TF) is a speculation related vulnerability. The CPU
      speculates on PTE entries which do not have the PRESENT bit set, if the
      content of the resulting physical address is available in the L1D cache.
      
      The OS side mitigation makes sure that a !PRESENT PTE entry points to a
      physical address outside the actually existing and cachable memory
      space. This is achieved by inverting the upper bits of the PTE. Due to the
      address space limitations this only works for 64bit and 32bit PAE kernels,
      but not for 32bit non PAE.
      
      This mitigation applies to both host and guest kernels, but in case of a
      64bit host (hypervisor) and a 32bit PAE guest, inverting the upper bits of
      the PAE address space (44bit) is not enough if the host has more than 43
      bits of populated memory address space, because the speculation treats the
      PTE content as a physical host address bypassing EPT.
      
      The host (hypervisor) protects itself against the guest by flushing L1D as
      needed, but pages inside the guest are not protected against attacks from
      other processes inside the same guest.
      
      For the guest the inverted PTE mask has to match the host to provide the
      full protection for all pages the host could possibly map into the
      guest. The hosts populated address space is not known to the guest, so the
      mask must cover the possible maximal host address space, i.e. 52 bit.
      
      On 32bit PAE the maximum PTE mask is currently set to 44 bit because that
      is the limit imposed by 32bit unsigned long PFNs in the VMs. This limits
      the mask to be below what the host could possible use for physical pages.
      
      The L1TF PROT_NONE protection code uses the PTE masks to determine which
      bits to invert to make sure the higher bits are set for unmapped entries to
      prevent L1TF speculation attacks against EPT inside guests.
      
      In order to invert all bits that could be used by the host, increase
      __PHYSICAL_PAGE_SHIFT to 52 to match 64bit.
      
      The real limit for a 32bit PAE kernel is still 44 bits because all Linux
      PTEs are created from unsigned long PFNs, so they cannot be higher than 44
      bits on a 32bit kernel. So these extra PFN bits should be never set. The
      only users of this macro are using it to look at PTEs, so it's safe.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NDave Hansen <dave.hansen@intel.com>
      50896e18
  2. 14 6月, 2018 1 次提交
    • L
      Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variables · 050e9baa
      Linus Torvalds 提交于
      The changes to automatically test for working stack protector compiler
      support in the Kconfig files removed the special STACKPROTECTOR_AUTO
      option that picked the strongest stack protector that the compiler
      supported.
      
      That was all a nice cleanup - it makes no sense to have the AUTO case
      now that the Kconfig phase can just determine the compiler support
      directly.
      
      HOWEVER.
      
      It also meant that doing "make oldconfig" would now _disable_ the strong
      stackprotector if you had AUTO enabled, because in a legacy config file,
      the sane stack protector configuration would look like
      
        CONFIG_HAVE_CC_STACKPROTECTOR=y
        # CONFIG_CC_STACKPROTECTOR_NONE is not set
        # CONFIG_CC_STACKPROTECTOR_REGULAR is not set
        # CONFIG_CC_STACKPROTECTOR_STRONG is not set
        CONFIG_CC_STACKPROTECTOR_AUTO=y
      
      and when you ran this through "make oldconfig" with the Kbuild changes,
      it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had
      been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just
      CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version
      used to be disabled (because it was really enabled by AUTO), and would
      disable it in the new config, resulting in:
      
        CONFIG_HAVE_CC_STACKPROTECTOR=y
        CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
        CONFIG_CC_STACKPROTECTOR=y
        # CONFIG_CC_STACKPROTECTOR_STRONG is not set
        CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
      
      That's dangerously subtle - people could suddenly find themselves with
      the weaker stack protector setup without even realizing.
      
      The solution here is to just rename not just the old RECULAR stack
      protector option, but also the strong one.  This does that by just
      removing the CC_ prefix entirely for the user choices, because it really
      is not about the compiler support (the compiler support now instead
      automatially impacts _visibility_ of the options to users).
      
      This results in "make oldconfig" actually asking the user for their
      choice, so that we don't have any silent subtle security model changes.
      The end result would generally look like this:
      
        CONFIG_HAVE_CC_STACKPROTECTOR=y
        CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
        CONFIG_STACKPROTECTOR=y
        CONFIG_STACKPROTECTOR_STRONG=y
        CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
      
      where the "CC_" versions really are about internal compiler
      infrastructure, not the user selections.
      Acked-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      050e9baa
  3. 12 6月, 2018 1 次提交
    • P
      kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor access · 3c9fa24c
      Paolo Bonzini 提交于
      The functions that were used in the emulation of fxrstor, fxsave, sgdt and
      sidt were originally meant for task switching, and as such they did not
      check privilege levels.  This is very bad when the same functions are used
      in the emulation of unprivileged instructions.  This is CVE-2018-10853.
      
      The obvious fix is to add a new argument to ops->read_std and ops->write_std,
      which decides whether the access is a "system" access or should use the
      processor's CPL.
      
      Fixes: 129a72a0 ("KVM: x86: Introduce segmented_write_std", 2017-01-12)
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3c9fa24c
  4. 08 6月, 2018 1 次提交
    • L
      mm: introduce ARCH_HAS_PTE_SPECIAL · 3010a5ea
      Laurent Dufour 提交于
      Currently the PTE special supports is turned on in per architecture
      header files.  Most of the time, it is defined in
      arch/*/include/asm/pgtable.h depending or not on some other per
      architecture static definition.
      
      This patch introduce a new configuration variable to manage this
      directly in the Kconfig files.  It would later replace
      __HAVE_ARCH_PTE_SPECIAL.
      
      Here notes for some architecture where the definition of
      __HAVE_ARCH_PTE_SPECIAL is not obvious:
      
      arm
       __HAVE_ARCH_PTE_SPECIAL which is currently defined in
      arch/arm/include/asm/pgtable-3level.h which is included by
      arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
      So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.
      
      powerpc
      __HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
       - arch/powerpc/include/asm/book3s/64/pgtable.h
       - arch/powerpc/include/asm/pte-common.h
      The first one is included if (PPC_BOOK3S & PPC64) while the second is
      included in all the other cases.
      So select ARCH_HAS_PTE_SPECIAL all the time.
      
      sparc:
      __HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
      defined(__arch64__) which are defined through the compiler in
      sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
      So select ARCH_HAS_PTE_SPECIAL if SPARC64
      
      There is no functional change introduced by this patch.
      
      Link: http://lkml.kernel.org/r/1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.comSigned-off-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Suggested-by: NJerome Glisse <jglisse@redhat.com>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Albert Ou <albert@sifive.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Christophe LEROY <christophe.leroy@c-s.fr>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3010a5ea
  5. 06 6月, 2018 6 次提交
  6. 02 6月, 2018 1 次提交
  7. 28 5月, 2018 1 次提交
  8. 26 5月, 2018 2 次提交
  9. 25 5月, 2018 1 次提交
  10. 23 5月, 2018 2 次提交
    • J
      KVM: nVMX: Restore the VMCS12 offsets for v4.0 fields · b348e793
      Jim Mattson 提交于
      Changing the VMCS12 layout will break save/restore compatibility with
      older kvm releases once the KVM_{GET,SET}_NESTED_STATE ioctls are
      accepted upstream. Google has already been using these ioctls for some
      time, and we implore the community not to disturb the existing layout.
      
      Move the four most recently added fields to preserve the offsets of
      the previously defined fields and reserve locations for the vmread and
      vmwrite bitmaps, which will be used in the virtualization of VMCS
      shadowing (to improve the performance of double-nesting).
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      [Kept the SDM order in vmcs_field_to_offset_table. - Radim]
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      b348e793
    • D
      x86, nfit_test: Add unit test for memcpy_mcsafe() · 5d8beee2
      Dan Williams 提交于
      Given the fact that the ACPI "EINJ" (error injection) facility is not
      universally available, implement software infrastructure to validate the
      memcpy_mcsafe() exception handling implementation.
      
      For each potential read exception point in memcpy_mcsafe(), inject a
      emulated exception point at the address identified by 'mcsafe_inject'
      variable. With this infrastructure implement a test to validate that the
      'bytes remaining' calculation is correct for a range of various source
      buffer alignments.
      
      This code is compiled out by default. The CONFIG_MCSAFE_DEBUG
      configuration symbol needs to be manually enabled by editing
      Kconfig.debug. I.e. this functionality can not be accidentally enabled
      by a user / distro, it's only for development.
      
      Cc: <x86@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Reported-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      5d8beee2
  11. 20 5月, 2018 1 次提交
    • T
      x86/Hyper-V/hv_apic: Build the Hyper-V APIC conditionally · 2d2ccf24
      Thomas Gleixner 提交于
      The Hyper-V APIC code is built when CONFIG_HYPERV is enabled but the actual
      code in that file is guarded with CONFIG_X86_64. There is no point in doing
      this. Neither is there a point in having the CONFIG_HYPERV guard in there
      because the containing directory is not built when CONFIG_HYPERV=n.
      
      Further for the hv_init_apic() function a stub is provided only for
      CONFIG_HYPERV=n, which is pointless as the callsite is not compiled at
      all. But for X86_32 the stub is missing and the build fails.
      
      Clean that up:
      
        - Compile hv_apic.c only when CONFIG_X86_64=y
        - Make the stub for hv_init_apic() available when CONFG_X86_64=n
      
      Fixes: 6b48cb5f ("X86/Hyper-V: Enlighten APIC access")
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      2d2ccf24
  12. 19 5月, 2018 8 次提交
  13. 18 5月, 2018 3 次提交
    • K
      x86/bugs: Rename SSBD_NO to SSB_NO · 240da953
      Konrad Rzeszutek Wilk 提交于
      The "336996 Speculative Execution Side Channel Mitigations" from
      May defines this as SSB_NO, hence lets sync-up.
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      240da953
    • A
      x86/io: Define readq()/writeq() to use 64-bit type · 6469a0ee
      Andy Shevchenko 提交于
      Since non atomic readq() and writeq() were added some of the drivers
      would like to use it in a manner of:
      
       #include <io-64-nonatomic-lo-hi.h>
       ...
       pr_debug("Debug value of some register: %016llx\n", readq(addr));
      
      However, lo_hi_readq() always returns __u64 data, while readq()
      on x86_64 defines it as unsigned long. and thus compiler warns
      about type mismatch, although they are both 64-bit on x86_64.
      
      Convert readq() and writeq() on x86 to operate on deterministic
      64-bit type. The most of architectures in the kernel already are using
      either unsigned long long, or u64 type for readq() / writeq().
      This change propagates consistency in that sense.
      
      While this is not an issue per se, though if someone wants to address it,
      the anchor could be the commit:
      
        797a796a ("asm-generic: architecture independent readq/writeq for 32bit environment")
      
      where non-atomic variants had been introduced.
      
      Note, there are only few users of above pattern and they will not be
      affected because they do cast returned value. The actual warning has
      been issued on not-yet-upstreamed code.
      
      Potentially we might get a new warnings if some 64-bit only code
      assigns returned value to unsigned long type of variable. This is
      assumed to be addressed on case-by-case basis.
      Reported-by: Nlkp <lkp@intel.com>
      Tested-by: NSohil Mehta <sohil.mehta@intel.com>
      Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180515115211.55050-1-andriy.shevchenko@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6469a0ee
    • M
      kvm: rename KVM_HINTS_DEDICATED to KVM_HINTS_REALTIME · 633711e8
      Michael S. Tsirkin 提交于
      KVM_HINTS_DEDICATED seems to be somewhat confusing:
      
      Guest doesn't really care whether it's the only task running on a host
      CPU as long as it's not preempted.
      
      And there are more reasons for Guest to be preempted than host CPU
      sharing, for example, with memory overcommit it can get preempted on a
      memory access, post copy migration can cause preemption, etc.
      
      Let's call it KVM_HINTS_REALTIME which seems to better
      match what guests expect.
      
      Also, the flag most be set on all vCPUs - current guests assume this.
      Note so in the documentation.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      633711e8
  14. 17 5月, 2018 9 次提交