1. 21 September 2018, 1 commit
  2. 20 September 2018, 4 commits
    • KVM: x86: Control guest reads of MSR_PLATFORM_INFO · 6fbbde9a
      Committed by Drew Schmitt
      Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
      to reads of MSR_PLATFORM_INFO.
      
      Disabling access to reads of this MSR gives userspace the control to "expose"
      this platform-dependent information to guests in a clear way. As it exists
      today, guests that read this MSR would get unpopulated information if userspace
      hadn't already set it (and prior to this patch series, only the CPUID faulting
      information could have been populated). This existing interface could be
      confusing if guests don't handle the potential for incorrect/incomplete
      information gracefully (e.g. zero reported for base frequency).
      Signed-off-by: Drew Schmitt <dasch@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6fbbde9a
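      A minimal userspace sketch of consuming the capability above (not taken from
      the patch itself; it assumes the standard KVM_ENABLE_CAP VM ioctl and that
      args[0] = 0 disables guest reads of the MSR):

      /* Hedged sketch: forbid guest reads of MSR_PLATFORM_INFO on a VM fd.
       * Assumption: KVM_CAP_MSR_PLATFORM_INFO interprets args[0] as allow (1)
       * or forbid (0); check the KVM API documentation before relying on it.
       */
      #include <linux/kvm.h>
      #include <sys/ioctl.h>

      static int disable_platform_info_reads(int vm_fd)
      {
              struct kvm_enable_cap cap = {
                      .cap  = KVM_CAP_MSR_PLATFORM_INFO,
                      .args = { 0 },  /* 0: guest reads of the MSR should now fault */
              };

              return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
      }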
    • KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv · e6c67d8c
      Committed by Liran Alon
      In case L1 does not intercept L2 HLT or enters L2 in HLT activity-state,
      it is possible for a vCPU to be blocked while it is in guest-mode.
      
      According to Intel SDM 26.6.5 Interrupt-Window Exiting and
      Virtual-Interrupt Delivery: "These events wake the logical processor
      if it just entered the HLT state because of a VM entry".
      Therefore, if L1 enters L2 in HLT activity-state and L2 has a pending
      deliverable interrupt in vmcs12->guest_intr_status.RVI, then the vCPU
      should be woken from the HLT state and injected with the interrupt.
      
      In addition, if while the vCPU is blocked (while it is in guest-mode),
      it receives a nested posted-interrupt, then the vCPU should also be
      woken and injected with the posted interrupt.
      
      To handle these cases, this patch enhances kvm_vcpu_has_events() to also
      check if there is a pending interrupt in L2 virtual APICv provided by
      L1. That is, it evaluates if there is a pending virtual interrupt for L2
      by checking RVI[7:4] > VPPR[7:4] as specified in Intel SDM 29.2.1
      Evaluation of Pending Interrupts.
      
      Note that this also handles the case of nested posted-interrupt by the
      fact RVI is updated in vmx_complete_nested_posted_interrupt() which is
      called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() ->
      kvm_vcpu_running() -> vmx_check_nested_events() ->
      vmx_complete_nested_posted_interrupt().
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e6c67d8c
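      For reference, the pending-virtual-interrupt test described above boils down
      to comparing two priority nibbles; a minimal sketch (field names illustrative,
      not the exact vmcs12 accessors):

      /* Sketch: true if the L2 virtual APIC state provided by L1 has a
       * deliverable pending interrupt, per SDM 29.2.1 (RVI[7:4] > VPPR[7:4]).
       * The low byte of the guest interrupt status VMCS field is RVI.
       */
      static bool l2_has_pending_virt_interrupt(u16 guest_intr_status, u8 vppr)
      {
              u8 rvi = guest_intr_status & 0xff;

              return (rvi & 0xf0) > (vppr & 0xf0);
      }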
    • x86/hyper-v: rename ipi_arg_{ex,non_ex} structures · a1efa9b7
      Committed by Vitaly Kuznetsov
      These structures are going to be used from KVM code so let's make
      their names reflect their Hyper-V origin.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Acked-by: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a1efa9b7
    • KVM: VMX: use preemption timer to force immediate VMExit · d264ee0c
      Committed by Sean Christopherson
      A VMX preemption timer value of '0' is guaranteed to cause a VMExit
      prior to the CPU executing any instructions in the guest.  Use the
      preemption timer (if it's supported) to trigger immediate VMExit
      in place of the current method of sending a self-IPI.  This ensures
      that pending VMExit injection to L1 occurs prior to executing any
      instructions in the guest (regardless of nesting level).
      
      When deferring VMExit injection, KVM generates an immediate VMExit
      from the (possibly nested) guest by sending itself an IPI.  Because
      hardware interrupts are blocked prior to VMEnter and are unblocked
      (in hardware) after VMEnter, this results in taking a VMExit(INTR)
      before any guest instruction is executed.  But, as this approach
      relies on the IPI being received before VMEnter executes, it only
      works as intended when KVM is running as L0.  Because there are no
      architectural guarantees regarding when IPIs are delivered, when
      running nested the INTR may "arrive" long after L2 is running, e.g. because
      L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
      
      For the most part, this unintended delay is not an issue since the
      events being injected to L1 also do not have architectural guarantees
      regarding their timing.  The notable exception is the VMX preemption
      timer[1], which is architecturally guaranteed to cause a VMExit prior
      to executing any instructions in the guest if the timer value is '0'
      at VMEnter.  Specifically, the delay in injecting the VMExit causes
      the preemption timer KVM unit test to fail when run in a nested guest.
      
      Note: this approach is viable even on CPUs with a broken preemption
      timer, as broken in this context only means the timer counts at the
      wrong rate.  There are no known errata affecting timer value of '0'.
      
      [1] I/O SMIs also have guarantees on when they arrive, but I have
          no idea if/how those are emulated in KVM.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Use a hook for SVM instead of leaving the default in x86.c - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d264ee0c
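      Conceptually, forcing the immediate exit amounts to arming the timer with
      zero right before VMEnter; a hedged sketch using KVM-style VMCS accessors
      (the helper itself is hypothetical):

      /* Sketch: request a VMExit before the guest executes a single instruction
       * by enabling the VMX preemption timer and letting it expire immediately.
       */
      static void request_immediate_exit(void)
      {
              vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
                            PIN_BASED_VMX_PREEMPTION_TIMER);
              vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);  /* fires at VMEnter */
      }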
  3. 16 September 2018, 1 commit
    • x86/mm: Add .bss..decrypted section to hold shared variables · b3f0907c
      Committed by Brijesh Singh
      kvmclock defines a few static variables which are shared with the
      hypervisor during the kvmclock initialization.
      
      When SEV is active, memory is encrypted with a guest-specific key, and
      if the guest OS wants to share the memory region with the hypervisor
      then it must clear the C-bit before sharing it.
      
      Currently, we use kernel_physical_mapping_init() to split large pages
      before clearing the C-bit on shared pages. But it fails when called from
      the kvmclock initialization (mainly because the memblock allocator is
      not ready that early during boot).
      
      Add a __bss_decrypted section attribute which can be used when defining
      such shared variables. Variables defined this way will be placed in the
      .bss..decrypted section. This section will be mapped with C=0 early
      during boot.
      
      The .bss..decrypted section occupies a sizable chunk of memory that is
      unused when memory encryption is not active, so free that memory in the
      non-encrypted case.
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: kvm@vger.kernel.org
      Link: https://lkml.kernel.org/r/1536932759-12905-2-git-send-email-brijesh.singh@amd.com
      b3f0907c
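      For context, the attribute is essentially a section annotation; a sketch of
      what it and a user might look like (the pvclock variable is modeled on
      kvmclock and is illustrative only):

      /* Sketch: place a hypervisor-shared variable into .bss..decrypted so the
       * early boot code can map it with C=0 (unencrypted) when SEV is active.
       */
      #define __bss_decrypted __attribute__((__section__(".bss..decrypted")))

      /* Example user, modeled on the kvmclock boot-time clock pages. */
      static struct pvclock_vsyscall_time_info
              hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __bss_decrypted __aligned(PAGE_SIZE);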
  4. 14 September 2018, 1 commit
    • Revert "x86/mm/legacy: Populate the user page-table with user pgd's" · 61a6bd83
      Committed by Joerg Roedel
      This reverts commit 1f40a46c.
      
      It turned out that this patch is not sufficient to enable PTI on 32-bit
      systems with legacy 2-level page-tables. In this paging mode the huge-page
      PTEs are in the top-level page-table directory, which is also where the
      mirroring to the user-space page-table happens. So every huge PTE exists
      twice, in the kernel and in the user page-table.
      
      That means that accessed/dirty bits need to be fetched from two PTEs in
      this mode to be safe, but this is not trivial to implement because it needs
      changes to generic code just for the sake of enabling PTI with 32-bit
      legacy paging. As all systems that need PTI should support PAE anyway,
      remove support for PTI when 32-bit legacy paging is used.
      
      Fixes: 7757d607 ('x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32')
      Reported-by: Meelis Roos <mroos@linux.ee>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Link: https://lkml.kernel.org/r/1536922754-31379-1-git-send-email-joro@8bytes.org
      61a6bd83
  5. 08 September 2018, 2 commits
    • x86/mm: Use WRITE_ONCE() when setting PTEs · 9bc4f28a
      Committed by Nadav Amit
      When page-table entries are set, the compiler might optimize their
      assignment by using multiple instructions to set the PTE. This might
      turn into a security hazard if the user somehow manages to use the
      interim PTE. L1TF does not make our lives easier, making even an interim
      non-present PTE a security hazard.
      
      Using WRITE_ONCE() to set PTEs and friends should prevent this potential
      security hazard.
      
      I skimmed the differences in the binary with and without this patch. The
      differences are (obviously) greater when CONFIG_PARAVIRT=n as more
      code optimizations are possible. For better and worse, the impact on the
      binary with this patch is pretty small. Skimming the code did not cause
      anything to jump out as a security hazard, but it seems that at least
      move_soft_dirty_pte() caused set_pte_at() to use multiple writes.
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180902181451.80520-1-namit@vmware.com
      9bc4f28a
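      The change itself is mechanical; a representative before/after sketch of a
      native PTE setter (the "unsafe" variant is named here only for contrast):

      /* Before: a plain store that the compiler may legally split into several
       * narrower writes, briefly exposing an interim PTE value.
       */
      static inline void native_set_pte_unsafe(pte_t *ptep, pte_t pte)
      {
              *ptep = pte;
      }

      /* After: WRITE_ONCE() forces a single, non-torn store, so a concurrent
       * page-table walker can never observe a half-written entry.
       */
      static inline void native_set_pte(pte_t *ptep, pte_t pte)
      {
              WRITE_ONCE(*ptep, pte);
      }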
    • KVM: LAPIC: Fix pv ipis out-of-bounds access · bdf7ffc8
      Committed by Wanpeng Li
      Dan Carpenter reported that the untrusted data returns from kvm_register_read()
      results in the following static checker warning:
        arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi()
        error: buffer underflow 'map->phys_map' 's32min-s32max'
      
      A KVM guest can easily trigger this by executing the following assembly
      sequence in Ring 0:
      
      mov $10, %rax
      mov $0xFFFFFFFF, %rbx
      mov $0xFFFFFFFF, %rdx
      mov $0, %rsi
      vmcall
      
      This causes KVM to execute the following code path:
      vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() -> kvm_pv_send_ipi()
      which reaches the out-of-bounds access.
      
      This patch fixes it by adding a check in kvm_pv_send_ipi() against
      map->max_apic_id, ignoring destinations that are not present and delivering
      the rest. We also check whether map->phys_map[min + i] is NULL, since
      max_apic_id is set to the maximum APIC ID and some phys_map entries may be
      NULL when APIC IDs are sparse; in particular, KVM unconditionally sets
      max_apic_id to 255 to reserve enough space for any xAPIC ID.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      [Add second "if (min > map->max_apic_id)" to complete the fix. -Radim]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      bdf7ffc8
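      The essence of the fix is a bounds check before indexing map->phys_map; a
      condensed sketch of the hardened delivery loop (simplified from the real
      function; min_id stands for the untrusted 'min' hypercall argument):

      if (min_id > map->max_apic_id)
              goto out;                               /* whole range out of bounds: ignore */

      for_each_set_bit(i, &ipi_bitmap_low,
                       min((u32)BITS_PER_LONG, (map->max_apic_id - min_id + 1))) {
              if (map->phys_map[min_id + i]) {        /* APIC IDs may be sparse */
                      vcpu = map->phys_map[min_id + i]->vcpu;
                      kvm_apic_set_irq(vcpu, &irq, NULL);
              }
      }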
  6. 07 September 2018, 1 commit
  7. 06 September 2018, 1 commit
  8. 03 September 2018, 1 commit
    • x86: Fix kernel-doc atomic.h warnings · 4331f4d5
      Committed by Randy Dunlap
      Fix kernel-doc warnings in arch/x86/include/asm/atomic.h that are caused by
      having a #define macro between the kernel-doc notation and the function
      name.  Fixed by moving the #define macro to after the function
      implementation.
      
      Make the same change for atomic64_{32,64}.h for consistency, even though no
      kernel-doc warnings were found in these header files; there would be if
      they were used to generate documentation.
      
      Fixes these kernel-doc warnings:
      
      ../arch/x86/include/asm/atomic.h:84: warning: Excess function parameter 'i' description in 'arch_atomic_sub_and_test'
      ../arch/x86/include/asm/atomic.h:84: warning: Excess function parameter 'v' description in 'arch_atomic_sub_and_test'
      ../arch/x86/include/asm/atomic.h:96: warning: Excess function parameter 'v' description in 'arch_atomic_inc'
      ../arch/x86/include/asm/atomic.h:109: warning: Excess function parameter 'v' description in 'arch_atomic_dec'
      ../arch/x86/include/asm/atomic.h:124: warning: Excess function parameter 'v' description in 'arch_atomic_dec_and_test'
      ../arch/x86/include/asm/atomic.h:138: warning: Excess function parameter 'v' description in 'arch_atomic_inc_and_test'
      ../arch/x86/include/asm/atomic.h:153: warning: Excess function parameter 'i' description in 'arch_atomic_add_negative'
      ../arch/x86/include/asm/atomic.h:153: warning: Excess function parameter 'v' description in 'arch_atomic_add_negative'
      
      Fixes: 18cc1814 ("atomics/treewide: Make test ops optional")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Link: https://lkml.kernel.org/r/0a1e678d-c8c5-b32c-2640-ed4e94d399d2@infradead.org
      
      4331f4d5
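      The resulting pattern, shown for one of the affected helpers (the body is
      approximate and reproduced only to illustrate the comment/#define ordering):

      /**
       * arch_atomic_inc - increment atomic variable
       * @v: pointer of type atomic_t
       *
       * Atomically increments @v by 1.
       */
      static __always_inline void arch_atomic_inc(atomic_t *v)
      {
              asm volatile(LOCK_PREFIX "incl %0" : "+m" (v->counter));
      }
      /* The alias #define now follows the implementation, so kernel-doc pairs the
       * comment above with the function rather than with the macro.
       */
      #define arch_atomic_inc arch_atomic_inc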
  9. 02 September 2018, 1 commit
  10. 31 August 2018, 2 commits
  11. 30 August 2018, 7 commits
  12. 28 August 2018, 1 commit
  13. 27 August 2018, 1 commit
    • x86/speculation/l1tf: Increase l1tf memory limit for Nehalem+ · cc51e542
      Committed by Andi Kleen
      On Nehalem and newer core CPUs the CPU cache internally uses 44 bits of
      physical address space. The L1TF workaround is limited by this internal
      cache address width, and needs to have one bit free there for the
      mitigation to work.
      
      Older client systems report only a 36-bit physical address space, so the
      range check decides that L1TF is not mitigated for a 36-bit phys/32GB system with
      some memory holes.
      
      But since these CPUs actually have the larger internal cache width, this
      warning is bogus: it would only really be needed if the system had more
      than 43 bits of memory.
      
      Add a new internal x86_cache_bits field. Normally it is the same as the
      physical bits field reported by CPUID, but for Nehalem and newer it is
      forced to be at least 44 bits.
      
      Change the L1TF memory size warning to use the new cache_bits field to
      avoid bogus warnings and remove the bogus comment about memory size.
      
      Fixes: 17dbca11 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
      Reported-by: George Anchev <studio@anchev.net>
      Reported-by: Christopher Snowhill <kode54@gmail.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86@kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: Michael Hocko <mhocko@suse.com>
      Cc: vbabka@suse.cz
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180824170351.34874-1-andi@firstfloor.org
      cc51e542
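      In pseudo-form the added logic is a simple clamp on the new field (the CPU
      model test is condensed into a hypothetical helper; the real patch keys off
      a model list):

      /* Sketch: default the cache address width to the CPUID-reported physical
       * width, then raise it to 44 bits on Nehalem and newer. Only the L1TF
       * memory-limit check consumes x86_cache_bits.
       */
      c->x86_cache_bits = c->x86_phys_bits;

      if (cpu_is_nehalem_or_newer(c) && c->x86_cache_bits < 44)
              c->x86_cache_bits = 44;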
  14. 24 August 2018, 2 commits
  15. 23 August 2018, 2 commits
    • x86/mm/tlb: Revert the recent lazy TLB patches · 52a288c7
      Committed by Peter Zijlstra
      Revert commits:
      
        95b0e635 x86/mm/tlb: Always use lazy TLB mode
        64482aaf x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
        ac031589 x86/mm/tlb: Make lazy TLB mode lazier
        61d0beb5 x86/mm/tlb: Restructure switch_mm_irqs_off()
        2ff6ddf1 x86/mm/tlb: Leave lazy TLB mode at page table free time
      
      This is done in order to simplify the TLB invalidate fixes for x86 and to
      unify the parts that need backporting.  We'll try again later.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52a288c7
    • module: use relative references for __ksymtab entries · 7290d580
      Committed by Ard Biesheuvel
      An ordinary arm64 defconfig build has ~64 KB worth of __ksymtab entries,
      each consisting of two 64-bit fields containing absolute references, to
      the symbol itself and to a char array containing its name, respectively.
      
      When we build the same configuration with KASLR enabled, we end up with an
      additional ~192 KB of relocations in the .init section, i.e., one 24 byte
      entry for each absolute reference, which all need to be processed at boot
      time.
      
      Given how the struct kernel_symbol that describes each entry is completely
      local to module.c (except for the references emitted by EXPORT_SYMBOL()
      itself), we can easily modify it to contain two 32-bit relative references
      instead.  This reduces the size of the __ksymtab section by 50% for all
      64-bit architectures, and gets rid of the runtime relocations entirely for
      architectures implementing KASLR, either via standard PIE linking (arm64)
      or using custom host tools (x86).
      
      Note that the binary search involving __ksymtab contents relies on each
      section being sorted by symbol name.  This is implemented based on the
      input section names, not the names in the ksymtab entries, so this patch
      does not interfere with that.
      
      Given that the use of place-relative relocations requires support both in
      the toolchain and in the module loader, we cannot enable this feature for
      all architectures.  So make it dependent on whether
      CONFIG_HAVE_ARCH_PREL32_RELOCATIONS is defined.
      
      Link: http://lkml.kernel.org/r/20180704083651.24360-4-ard.biesheuvel@linaro.org
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Jessica Yu <jeyu@kernel.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morris <james.morris@microsoft.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Nicolas Pitre <nico@linaro.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7290d580
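      The core idea is to store two 32-bit self-relative offsets instead of two
      absolute pointers; a hedged sketch of the data structure and accessor
      (field and macro names are illustrative):

      /* Each offset is relative to the address of the field itself, so the
       * entries need no load-time relocation even under KASLR.
       */
      struct kernel_symbol {
              int value_offset;       /* symbol address - &value_offset */
              int name_offset;        /* name string    - &name_offset  */
      };

      static inline void *offset_to_ptr(const int *off)
      {
              return (void *)((unsigned long)off + *off);
      }

      /* Usage: recover the exported symbol's address and name. */
      #define ksym_value(sym) offset_to_ptr(&(sym)->value_offset)
      #define ksym_name(sym)  ((const char *)offset_to_ptr(&(sym)->name_offset))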
  16. 22 August 2018, 1 commit
    • KVM: vmx: Add defines for SGX ENCLS exiting · 802ec461
      Committed by Sean Christopherson
      Hardware support for basic SGX virtualization adds a new execution
      control (ENCLS_EXITING), VMCS field (ENCLS_EXITING_BITMAP) and exit
      reason (ENCLS), which together enable a VMM to intercept specific ENCLS leaf
      functions, e.g. to inject faults when the VMM isn't exposing SGX to
      a VM.  When ENCLS_EXITING is enabled, the VMM can set/clear bits in
      the bitmap to intercept/allow ENCLS leaf functions in non-root, e.g.
      setting bit 2 in the ENCLS_EXITING_BITMAP will cause ENCLS[EINIT]
      to VMExit(ENCLS).
      
      Note: EXIT_REASON_ENCLS was previously added by commit 1f519992
      ("KVM: VMX: add missing exit reasons").
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20180814163334.25724-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      802ec461
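      For orientation, the new constants and the EINIT example from the text look
      roughly like this (numeric values quoted from memory of the SDM, so treat
      them as unverified; the setup helper is hypothetical):

      #define SECONDARY_EXEC_ENCLS_EXITING    0x00008000      /* secondary control bit 15 */
      #define ENCLS_EXITING_BITMAP            0x0000202E      /* 64-bit VMCS control field */
      #define EXIT_REASON_ENCLS               60

      /* Intercept only ENCLS[EINIT], which is leaf function 2. */
      static void enable_einit_interception(void)
      {
              vmcs_write64(ENCLS_EXITING_BITMAP, 1ULL << 2);
      }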
  17. 21 August 2018, 4 commits
  18. 18 August 2018, 1 commit
    • x86/speculation/l1tf: Exempt zeroed PTEs from inversion · f19f5c49
      Committed by Sean Christopherson
      It turns out that we should *not* invert all not-present mappings,
      because the all zeroes case is obviously special.
      
      clear_page() does not undergo the XOR logic to invert the address bits,
      i.e. PTE, PMD and PUD entries that have not been individually written
      will have val=0 and so will trigger __pte_needs_invert(). As a result,
      {pte,pmd,pud}_pfn() will return the wrong PFN value, i.e. all ones
      (adjusted by the max PFN mask) instead of zero. A zeroed entry is ok
      because the page at physical address 0 is reserved early in boot
      specifically to mitigate L1TF, so explicitly exempt them from the
      inversion when reading the PFN.
      
      Manifested as an unexpected mprotect(..., PROT_NONE) failure when called
      on a VMA that has VM_PFNMAP and was mmap()'d as something other than
      PROT_NONE but never used. mprotect() sends the PROT_NONE request down
      prot_none_walk(), which walks the PTEs to check the PFNs.
      prot_none_pte_entry() gets the bogus PFN from pte_pfn() and returns
      -EACCES because it thinks mprotect() is trying to adjust a high MMIO
      address.
      
      [ This is a very modified version of Sean's original patch, but all
        credit goes to Sean for doing this and also pointing out that
        sometimes the __pte_needs_invert() function only gets the protection
        bits, not the full eventual pte.  But zero remains special even in
        just protection bits, so that's ok.   - Linus ]
      
      Fixes: f22cc87f ("x86/speculation/l1tf: Invert all not present mappings")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f19f5c49
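      The resulting predicate is tiny; a sketch of the helper after the fix
      (zeroed entries are exempt because PFN 0 is reserved at boot for L1TF):

      /* Only invert entries that are non-zero and not present; an all-zero
       * PTE/PMD/PUD maps PFN 0 and is safe to leave alone.
       */
      static inline bool __pte_needs_invert(u64 val)
      {
              return val && !(val & _PAGE_PRESENT);
      }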
  19. 16 August 2018, 1 commit
  20. 11 August 2018, 1 commit
    • x86/mm/pti: Move user W+X check into pti_finalize() · d878efce
      Committed by Joerg Roedel
      The user page-table gets the updated kernel mappings in pti_finalize(),
      which runs after the RO+X permissions got applied to the kernel page-table
      in mark_readonly().
      
      But with CONFIG_DEBUG_WX enabled, the user page-table is already checked in
      mark_readonly() for insecure mappings.  This causes false-positive
      warnings, because the user page-table has not received the updated mappings yet.
      
      Move the W+X check for the user page-table into pti_finalize() after it
      updated all required mappings.
      
      [ tglx: Folded !NX supported fix ]
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Waiman Long <llong@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
      Cc: joro@8bytes.org
      Link: https://lkml.kernel.org/r/1533727000-9172-1-git-send-email-joro@8bytes.org
      d878efce
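      Schematically, the resulting ordering looks like the sketch below (function
      names from the PTI code, bodies heavily condensed):

      /* Sketch: the user-visible W+X scan runs only after the user page-table
       * has received its final kernel mappings.
       */
      void pti_finalize(void)
      {
              pti_clone_entry_text();         /* copy the updated kernel mappings  */
              pti_clone_kernel_text();        /* ... into the user page-table      */

              debug_checkwx_user();           /* W+X check, moved here from mark_readonly() */
      }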
  21. 08 August 2018, 2 commits
  22. 07 August 2018, 2 commits
    • xen: don't use privcmd_call() from xen_mc_flush() · cd913922
      Committed by Juergen Gross
      Using privcmd_call() for a singleton multicall seems to be wrong, as
      privcmd_call() is using stac()/clac() to enable hypervisor access to
      Linux user space.
      
      Even if this is currently not a problem (PV domains can't use SMAP, while
      HVM and PVH domains can't use multicalls), things might change when PVH
      dom0 support is added to the kernel.
      Reported-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      cd913922
    • x86/mm/init: Remove freed kernel image areas from alias mapping · c40a56a7
      Committed by Dave Hansen
      The kernel image is mapped into two places in the virtual address space
      (addresses without KASLR, of course):
      
      	1. The kernel direct map (0xffff880000000000)
      	2. The "high kernel map" (0xffffffff81000000)
      
      We actually execute out of #2.  If we get the address of a kernel symbol,
      it points to #2, but almost all physical-to-virtual translations point to
      #1, the kernel direct map.

      Parts of the "high kernel map" alias are mapped in the userspace page
      tables with the Global bit for performance reasons.  The parts that we map
      to userspace do not (er, should not) have secrets. When PTI is enabled then
      the global bit is usually not set in the high mapping and just used to
      compensate for poor performance on systems which lack PCID.
      
      This is fine, except that some areas in the kernel image that are adjacent
      to the non-secret-containing areas are unused holes.  We free these holes
      back into the normal page allocator and reuse them as normal kernel memory.
      The memory will, of course, get *used* via the normal map, but the alias
      mapping is kept.
      
      This otherwise unused alias mapping of the holes will, by default, keep the
      Global bit, be mapped out to userspace, and be vulnerable to Meltdown.
      
      Remove the alias mapping of these pages entirely.  This is likely to
      fracture the 2M page mapping the kernel image near these areas, but this
      should affect a minority of the area.
      
      The pageattr code changes *all* aliases mapping the physical pages that it
      operates on (by default).  We only want to modify a single alias, so we
      need to tweak its behavior.
      
      This unmapping behavior is currently dependent on PTI being in place.
      Going forward, we should at least consider doing this for all
      configurations.  Having an extra read-write alias for memory is not exactly
      ideal for debugging things like random memory corruption and this does
      undercut features like DEBUG_PAGEALLOC or future work like eXclusive Page
      Frame Ownership (XPFO).
      
      Before this patch:
      
      current_kernel:---[ High Kernel Mapping ]---
      current_kernel-0xffffffff80000000-0xffffffff81000000          16M                               pmd
      current_kernel-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
      current_kernel-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
      current_kernel-0xffffffff81e11000-0xffffffff82000000        1980K     RW                     NX pte
      current_kernel-0xffffffff82000000-0xffffffff82600000           6M     ro         PSE     GLB NX pmd
      current_kernel-0xffffffff82600000-0xffffffff82c00000           6M     RW         PSE         NX pmd
      current_kernel-0xffffffff82c00000-0xffffffff82e00000           2M     RW                     NX pte
      current_kernel-0xffffffff82e00000-0xffffffff83200000           4M     RW         PSE         NX pmd
      current_kernel-0xffffffff83200000-0xffffffffa0000000         462M                               pmd
      
        current_user:---[ High Kernel Mapping ]---
        current_user-0xffffffff80000000-0xffffffff81000000          16M                               pmd
        current_user-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
        current_user-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
        current_user-0xffffffff81e11000-0xffffffff82000000        1980K     RW                     NX pte
        current_user-0xffffffff82000000-0xffffffff82600000           6M     ro         PSE     GLB NX pmd
        current_user-0xffffffff82600000-0xffffffffa0000000         474M                               pmd
      
      After this patch:
      
      current_kernel:---[ High Kernel Mapping ]---
      current_kernel-0xffffffff80000000-0xffffffff81000000          16M                               pmd
      current_kernel-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
      current_kernel-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
      current_kernel-0xffffffff81e11000-0xffffffff82000000        1980K                               pte
      current_kernel-0xffffffff82000000-0xffffffff82400000           4M     ro         PSE     GLB NX pmd
      current_kernel-0xffffffff82400000-0xffffffff82488000         544K     ro                     NX pte
      current_kernel-0xffffffff82488000-0xffffffff82600000        1504K                               pte
      current_kernel-0xffffffff82600000-0xffffffff82c00000           6M     RW         PSE         NX pmd
      current_kernel-0xffffffff82c00000-0xffffffff82c0d000          52K     RW                     NX pte
      current_kernel-0xffffffff82c0d000-0xffffffff82dc0000        1740K                               pte
      
        current_user:---[ High Kernel Mapping ]---
        current_user-0xffffffff80000000-0xffffffff81000000          16M                               pmd
        current_user-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
        current_user-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
        current_user-0xffffffff81e11000-0xffffffff82000000        1980K                               pte
        current_user-0xffffffff82000000-0xffffffff82400000           4M     ro         PSE     GLB NX pmd
        current_user-0xffffffff82400000-0xffffffff82488000         544K     ro                     NX pte
        current_user-0xffffffff82488000-0xffffffff82600000        1504K                               pte
        current_user-0xffffffff82600000-0xffffffffa0000000         474M                               pmd
      
      [ tglx: Do not unmap on 32bit as there is only one mapping ]
      
      Fixes: 0f561fce ("x86/pti: Enable global pages for shared areas")
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Link: https://lkml.kernel.org/r/20180802225831.5F6A2BFC@viggo.jf.intel.com
      c40a56a7
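      Conceptually, the freed holes are handled by dropping their high-map alias
      in addition to freeing the pages; a rough sketch of the resulting helper
      (names and the PTI condition follow the patch as best recalled, so treat
      the details as approximate):

      /* Sketch: release an unused hole of the kernel image and also mark its
       * high-kernel-map alias not-present, so the hole cannot stay reachable
       * through a leftover Global mapping. The "noalias" variant leaves the
       * direct-map alias of the same physical pages untouched.
       */
      static void free_kernel_image_pages(void *begin, void *end)
      {
              unsigned long begin_ul = (unsigned long)begin;
              unsigned long end_ul   = (unsigned long)end;
              unsigned long npages   = (end_ul - begin_ul) >> PAGE_SHIFT;

              free_init_pages("unused kernel image", begin_ul, end_ul);

              if (cpu_feature_enabled(X86_FEATURE_PTI))
                      set_memory_np_noalias(begin_ul, npages);
      }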