1. 28 Sep 2018, 1 commit
    • x86/boot: Fix kexec booting failure in the SEV bit detection code · bdec8d7f
      Authored by Kairui Song
      Commit
      
        1958b5fc ("x86/boot: Add early boot support when running with SEV active")
      
      can occasionally cause system resets when kexec-ing a second kernel even
      if SEV is not active.
      
      That's because get_sev_encryption_bit() uses 32-bit rIP-relative
      addressing to read the value of enc_bit - a variable which caches a
      previously detected encryption bit position - but kexec may allocate
      the early boot code to a higher location, beyond the 32-bit addressing
      limit.
      
      In this case, garbage will be read and get_sev_encryption_bit() will
      return the wrong value, leading to accessing memory with the wrong
      encryption setting.
      
      Therefore, remove enc_bit, and thus get rid of the need to do 32-bit
      rIP-relative addressing in the first place.
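
      For illustration, a minimal C rendering of the post-fix logic (the real
      code is assembly in arch/x86/boot/compressed/mem_encrypt.S; the rdmsr()
      and cpuid() helpers below are assumed wrappers, not the kernel's actual
      interfaces):

        #include <stdint.h>

        #define MSR_AMD64_SEV           0xc0010131
        #define MSR_AMD64_SEV_ENABLED   (1ULL << 0)

        /* Assumed helpers wrapping the raw RDMSR/CPUID instructions. */
        extern uint64_t rdmsr(uint32_t msr);
        extern void cpuid(uint32_t leaf, uint32_t *eax, uint32_t *ebx,
                          uint32_t *ecx, uint32_t *edx);

        /*
         * Re-derive the C-bit position on every call instead of caching it
         * in a variable that 32-bit rIP-relative addressing may fail to
         * reach once kexec has placed this code at a high location.
         */
        static unsigned int get_sev_encryption_bit(void)
        {
                uint32_t eax, ebx, ecx, edx;

                if (!(rdmsr(MSR_AMD64_SEV) & MSR_AMD64_SEV_ENABLED))
                        return 0;               /* SEV not active */

                /* CPUID Fn8000_001F: EBX[5:0] = C-bit position. */
                cpuid(0x8000001f, &eax, &ebx, &ecx, &edx);
                return ebx & 0x3f;
        }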
      
       [ bp: massage commit message heavily. ]
      
      Fixes: 1958b5fc ("x86/boot: Add early boot support when running with SEV active")
      Suggested-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Kairui Song <kasong@redhat.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: tglx@linutronix.de
      Cc: mingo@redhat.com
      Cc: hpa@zytor.com
      Cc: brijesh.singh@amd.com
      Cc: kexec@lists.infradead.org
      Cc: dyoung@redhat.com
      Cc: bhe@redhat.com
      Cc: ghook@redhat.com
      Link: https://lkml.kernel.org/r/20180927123845.32052-1-kasong@redhat.com
  2. 21 Sep 2018, 1 commit
  3. 19 Sep 2018, 10 commits
  4. 16 Sep 2018, 2 commits
    • x86/kvm: Use __bss_decrypted attribute in shared variables · 6a1cac56
      Authored by Brijesh Singh
      The recent removal of the memblock dependency from kvmclock caused a SEV
      guest regression because the wall_clock and hv_clock_boot variables are
      no longer mapped decrypted when SEV is active.
      
      Use the __bss_decrypted attribute to put the static wall_clock and
      hv_clock_boot in the .bss..decrypted section so that they are mapped
      decrypted during boot.
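
      A sketch of the resulting declarations, based on the description above
      (HVC_BOOT_ARRAY_SIZE sizes the static array to one page):

        #define HVC_BOOT_ARRAY_SIZE \
                (PAGE_SIZE / sizeof(struct pvclock_vsyscall_time_info))

        /* Both variables land in .bss..decrypted and are therefore
         * mapped with C=0 early during boot. */
        static struct pvclock_vsyscall_time_info
                hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __bss_decrypted __aligned(PAGE_SIZE);
        static struct pvclock_wall_clock wall_clock __bss_decrypted;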
      
      In the preparatory stage of CPU hotplug, the per-CPU pvclock data
      pointer is assigned either an element of the static array or
      dynamically allocated memory. The static array is now mapped
      decrypted, but the dynamically allocated memory is not, even though
      this memory range must be mapped decrypted when SEV is active.
      
      Add a function which is called after the page allocator is up,
      allocate memory for the pvclock data pointers for all possible CPUs,
      and map this memory range as decrypted when SEV is active.
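
      A sketch of that late-init path, assuming the standard helpers
      (num_possible_cpus(), alloc_pages(), sev_active(),
      set_memory_decrypted()); exact sizing and error handling may differ:

        static void *hvclock_mem;

        static void __init kvmclock_init_mem(void)
        {
                unsigned long ncpus;
                unsigned int order;
                struct page *p;

                if (HVC_BOOT_ARRAY_SIZE >= num_possible_cpus())
                        return;   /* static array already covers all CPUs */

                ncpus = num_possible_cpus() - HVC_BOOT_ARRAY_SIZE;
                order = get_order(ncpus * sizeof(struct pvclock_vsyscall_time_info));
                p = alloc_pages(GFP_KERNEL, order);
                if (!p)
                        return;

                hvclock_mem = page_address(p);

                /* With SEV active, clear the C-bit so the hypervisor
                 * can read the pvclock data. */
                if (sev_active())
                        set_memory_decrypted((unsigned long)hvclock_mem,
                                             1UL << order);

                memset(hvclock_mem, 0, PAGE_SIZE << order);
        }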
      
      Fixes: 368a540e ("x86/kvmclock: Remove memblock dependency")
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: kvm@vger.kernel.org
      Link: https://lkml.kernel.org/r/1536932759-12905-3-git-send-email-brijesh.singh@amd.com
    • x86/mm: Add .bss..decrypted section to hold shared variables · b3f0907c
      Authored by Brijesh Singh
      kvmclock defines a few static variables which are shared with the
      hypervisor during the kvmclock initialization.
      
      When SEV is active, memory is encrypted with a guest-specific key, and
      if the guest OS wants to share the memory region with the hypervisor
      then it must clear the C-bit before sharing it.
      
      Currently, we use kernel_physical_mapping_init() to split large pages
      before clearing the C-bit on shared pages. But it fails when called from
      the kvmclock initialization (mainly because the memblock allocator is
      not ready that early during boot).
      
      Add a __bss_decrypted section attribute which can be used when defining
      such shared variables. The so-defined variables will be placed in the
      .bss..decrypted section. This section will be mapped with C=0 early
      during boot.
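
      The attribute itself amounts to a one-line section annotation; a sketch
      consistent with the description above (the example variable name is
      illustrative):

        /* Place a variable in .bss..decrypted, which early boot code
         * maps with the encryption bit (C-bit) cleared. */
        #define __bss_decrypted __attribute__((__section__(".bss..decrypted")))

        /* Usage: */
        static unsigned long shared_with_hypervisor __bss_decrypted;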
      
      When memory encryption is not active, the .bss..decrypted section
      occupies a big chunk of otherwise unused memory, so free that memory
      back to the page allocator in that case.
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: kvm@vger.kernel.org
      Link: https://lkml.kernel.org/r/1536932759-12905-2-git-send-email-brijesh.singh@amd.com
  5. 15 Sep 2018, 1 commit
  6. 14 Sep 2018, 1 commit
    • Revert "x86/mm/legacy: Populate the user page-table with user pgd's" · 61a6bd83
      Authored by Joerg Roedel
      This reverts commit 1f40a46c.
      
      It turned out that this patch is not sufficient to enable PTI on 32-bit
      systems with legacy 2-level page-tables. In this paging mode the huge-page
      PTEs are in the top-level page-table directory, which is also where the
      mirroring to the user-space page-table happens. So every huge PTE exists
      twice, in the kernel and in the user page-table.
      
      That means that accessed/dirty bits need to be fetched from two PTEs in
      this mode to be safe, but this is not trivial to implement because it needs
      changes to generic code just for the sake of enabling PTI with 32-bit
      legacy paging. As all systems that need PTI should support PAE anyway,
      remove support for PTI when 32-bit legacy paging is used.
      
      Fixes: 7757d607 ('x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32')
      Reported-by: Meelis Roos <mroos@linux.ee>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Link: https://lkml.kernel.org/r/1536922754-31379-1-git-send-email-joro@8bytes.org
  7. 13 Sep 2018, 2 commits
  8. 12 Sep 2018, 1 commit
  9. 10 Sep 2018, 1 commit
    • perf/x86/intel: Add support/quirk for the MISPREDICT bit on Knights Landing CPUs · 16160c19
      Authored by Jacek Tomaka
      Problem: perf did not show the branch predicted/mispredicted bit in brstack.
      
      Output of perf -F brstack for a collected profile:
      
      Before:
      
       0x4fdbcd/0x4fdc03/-/-/-/0
       0x45f4c1/0x4fdba0/-/-/-/0
       0x45f544/0x45f4bb/-/-/-/0
       0x45f555/0x45f53c/-/-/-/0
       0x7f66901cc24b/0x45f555/-/-/-/0
       0x7f66901cc22e/0x7f66901cc23d/-/-/-/0
       0x7f66901cc1ff/0x7f66901cc20f/-/-/-/0
       0x7f66901cc1e8/0x7f66901cc1fc/-/-/-/0
      
      After:
      
       0x4fdbcd/0x4fdc03/P/-/-/0
       0x45f4c1/0x4fdba0/P/-/-/0
       0x45f544/0x45f4bb/P/-/-/0
       0x45f555/0x45f53c/P/-/-/0
       0x7f66901cc24b/0x45f555/P/-/-/0
       0x7f66901cc22e/0x7f66901cc23d/P/-/-/0
       0x7f66901cc1ff/0x7f66901cc20f/P/-/-/0
       0x7f66901cc1e8/0x7f66901cc1fc/P/-/-/0
      
      Cause:
      
      As described in the Intel Software Developer's Manual vol. 3, section
      17.4.8.1, IA32_PERF_CAPABILITIES[5:0] indicates the format of the
      address that is stored in the LBR stack. Knights Landing reports 1
      (LBR_FORMAT_LIP) as its format. Despite that, the registers containing
      the FROM address of the branch do have the MISPREDICT bit, but because
      of the format indicated in IA32_PERF_CAPABILITIES[5:0], the LBR code
      did not read the MISPREDICT bit.
      
      Solution:
      
      Teach the LBR code about the above Knights Landing quirk and make it read the MISPREDICT bit.
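
      The quirk boils down to overriding the reported format during Knights
      Landing LBR setup; a sketch of the relevant lines (constant names
      follow the kernel's LBR_FORMAT_* convention):

        /* In intel_pmu_lbr_init_knl(): Knights Landing reports
         * LBR_FORMAT_LIP, but its FROM_IP registers do carry the
         * MISPREDICT bit, so treat them as the EIP+flags format. */
        if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_LIP)
                x86_pmu.intel_cap.lbr_format = LBR_FORMAT_EIP_FLAGS;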
      Signed-off-by: Jacek Tomaka <jacek.tomaka@poczta.fm>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180802013830.10600-1-jacekt@dugeo.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 08 Sep 2018, 4 commits
    • x86/mm: Use WRITE_ONCE() when setting PTEs · 9bc4f28a
      Authored by Nadav Amit
      When page-table entries are set, the compiler might optimize their
      assignment by using multiple instructions to set the PTE. This might
      turn into a security hazard if the user somehow manages to use the
      interim PTE. L1TF does not make our lives easier, making even an interim
      non-present PTE a security hazard.
      
      Using WRITE_ONCE() to set PTEs and friends should prevent this potential
      security hazard.
      
      I skimmed the differences in the binary with and without this patch. The
      differences are (obviously) greater when CONFIG_PARAVIRT=n, as more
      code optimizations are possible. For better or worse, the impact of this
      patch on the binary is pretty small. Skimming the code did not cause
      anything to jump out as a security hazard, but it seems that at least
      move_soft_dirty_pte() caused set_pte_at() to use multiple writes.
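
      The change itself is mechanical; a representative before/after sketch
      of one of the affected helpers:

        /* Before: a plain store, which the compiler may legally split
         * into multiple narrower writes. */
        static inline void native_set_pte(pte_t *ptep, pte_t pte)
        {
                *ptep = pte;
        }

        /* After: WRITE_ONCE() forces a single, non-torn store. */
        static inline void native_set_pte(pte_t *ptep, pte_t pte)
        {
                WRITE_ONCE(*ptep, pte);
        }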
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180902181451.80520-1-namit@vmware.com
    • x86/apic/vector: Make error return value negative · 47b7360c
      Authored by Thomas Gleixner
      activate_managed() returns EINVAL instead of -EINVAL in case of
      error. While this is unlikely to happen, the positive return value would
      cause further malfunction at the call site.
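
      In C terms, the one-character bug and its fix (an excerpt-style sketch
      of the affected check; the surrounding function logic is elided):

        if (WARN_ON_ONCE(cpumask_empty(vector_searchmask))) {
                pr_err("Managed startup for irq %u, but no CPU\n", irqd->irq);
                /* Kernel error returns are negative errno values; a bare
                 * EINVAL is a positive number that callers would treat as
                 * a valid result. */
                return -EINVAL;         /* was: return EINVAL; */
        }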
      
      Fixes: 2db1f959 ("x86/vector: Handle managed interrupts proper")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
    • KVM: LAPIC: Fix pv ipis out-of-bounds access · bdf7ffc8
      Authored by Wanpeng Li
      Dan Carpenter reported that the untrusted data returned from
      kvm_register_read() results in the following static checker warning:
        arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi()
        error: buffer underflow 'map->phys_map' 's32min-s32max'
      
      A KVM guest can easily trigger this by executing the following assembly
      sequence in ring 0:
      
      mov $10, %rax
      mov $0xFFFFFFFF, %rbx
      mov $0xFFFFFFFF, %rdx
      mov $0, %rsi
      vmcall
      
      This causes KVM to execute the following code path:
      vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() -> kvm_pv_send_ipi()
      which reaches the out-of-bounds access.
      
      Fix it by adding a check in kvm_pv_send_ipi() against map->max_apic_id,
      ignoring destinations that are not present and delivering the rest. Also
      check whether map->phys_map[min + i] is NULL before dereferencing it:
      max_apic_id only bounds the array, so entries may be NULL when APIC IDs
      are sparse, especially since KVM unconditionally sets max_apic_id to 255
      to reserve enough space for any xAPIC ID.
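
      A sketch of the added checks in kvm_pv_send_ipi()'s delivery loop
      (locking and the second, high-bitmap pass are elided; the shape follows
      the description above):

        if (min > map->max_apic_id)
                goto out;               /* destination base beyond the map */

        for_each_set_bit(i, &ipi_bitmap_low,
                         min((u64)BITS_PER_LONG, (map->max_apic_id - min + 1))) {
                /* APIC IDs may be sparse, so entries can be NULL. */
                if (map->phys_map[min + i]) {
                        vcpu = map->phys_map[min + i]->vcpu;
                        count += kvm_apic_set_irq(vcpu, &irq, NULL);
                }
        }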
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      [Add second "if (min > map->max_apic_id)" to complete the fix. -Radim]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: Fix loss of pending IRQ/NMI before entering L2 · b5861e5c
      Authored by Liran Alon
      Consider the case where L1 has a pending IRQ/NMI event that was not
      delivered before it executed VMLAUNCH/VMRESUME because delivery was
      disallowed (e.g. interrupts disabled). When L1 executes
      VMLAUNCH/VMRESUME, L0 needs to evaluate whether this pending event
      should cause an exit from L2 to L1 or be delivered directly to L2
      (e.g. in case L1 doesn't intercept EXTERNAL_INTERRUPT).
      
      Usually this would be handled by L0 requesting an IRQ/NMI window
      by setting the VMCS accordingly. However, this setting was done on
      VMCS01 and now VMCS02 is active instead. Thus, when L1 executes
      VMLAUNCH/VMRESUME we force L0 to perform pending event evaluation by
      requesting a KVM_REQ_EVENT.
      
      Note that the above scenario occurs when L1 KVM is about to enter L2 but
      requests an "immediate exit", as in this case L1 disables interrupts and
      then sends a self-IPI before entering L2.
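
      The fix amounts to a single request made while emulating
      VMLAUNCH/VMRESUME (a sketch of the line added to nested_vmx_run()):

        /* Any IRQ/NMI-window request was armed on vmcs01, which is no
         * longer active, so force L0 to re-evaluate pending events
         * before running L2. */
        kvm_make_request(KVM_REQ_EVENT, vcpu);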
      Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
      Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  11. 07 Sep 2018, 1 commit
  12. 06 Sep 2018, 2 commits
  13. 03 Sep 2018, 1 commit
    • x86: Fix kernel-doc atomic.h warnings · 4331f4d5
      Authored by Randy Dunlap
      Fix kernel-doc warnings in arch/x86/include/asm/atomic.h that are caused by
      having a #define macro between the kernel-doc notation and the function
      name.  Fixed by moving the #define macro to after the function
      implementation.
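
      The pattern after the fix, using one of the affected ops (a sketch;
      kernel-doc needs the comment immediately adjacent to the function it
      documents):

        /**
         * arch_atomic_inc - increment atomic variable
         * @v: pointer of type atomic_t
         *
         * Atomically increments @v by 1.
         */
        static __always_inline void arch_atomic_inc(atomic_t *v)
        {
                asm volatile(LOCK_PREFIX "incl %0" : "+m" (v->counter));
        }
        /* The #define now follows the implementation instead of sitting
         * between the kernel-doc comment and the function: */
        #define arch_atomic_inc arch_atomic_inc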
      
      Make the same change in atomic64_{32,64}.h for consistency, even though
      no kernel-doc warnings were found in those header files; there would be
      if they were used in generating documentation.
      
      Fixes these kernel-doc warnings:
      
      ../arch/x86/include/asm/atomic.h:84: warning: Excess function parameter 'i' description in 'arch_atomic_sub_and_test'
      ../arch/x86/include/asm/atomic.h:84: warning: Excess function parameter 'v' description in 'arch_atomic_sub_and_test'
      ../arch/x86/include/asm/atomic.h:96: warning: Excess function parameter 'v' description in 'arch_atomic_inc'
      ../arch/x86/include/asm/atomic.h:109: warning: Excess function parameter 'v' description in 'arch_atomic_dec'
      ../arch/x86/include/asm/atomic.h:124: warning: Excess function parameter 'v' description in 'arch_atomic_dec_and_test'
      ../arch/x86/include/asm/atomic.h:138: warning: Excess function parameter 'v' description in 'arch_atomic_inc_and_test'
      ../arch/x86/include/asm/atomic.h:153: warning: Excess function parameter 'i' description in 'arch_atomic_add_negative'
      ../arch/x86/include/asm/atomic.h:153: warning: Excess function parameter 'v' description in 'arch_atomic_add_negative'
      
      Fixes: 18cc1814 ("atomics/treewide: Make test ops optional")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Link: https://lkml.kernel.org/r/0a1e678d-c8c5-b32c-2640-ed4e94d399d2@infradead.org
  14. 02 Sep 2018, 4 commits
  15. 01 Sep 2018, 1 commit
  16. 31 Aug 2018, 4 commits
  17. 30 Aug 2018, 3 commits
    • KVM: x86: Unexport x86_emulate_instruction() · c60658d1
      Authored by Sean Christopherson
      Allowing x86_emulate_instruction() to be called directly has led to
      subtle bugs being introduced, e.g. not setting EMULTYPE_NO_REEXECUTE
      in the emulation type.  While most of the blame lies on re-execute
      being opt-out, exporting x86_emulate_instruction() also exposes its
      cr2 parameter, which may have contributed to commit d391f120
      ("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO
      when running nested") using x86_emulate_instruction() instead of
      emulate_instruction() because "hey, I have a cr2!", which in turn
      introduced its EMULTYPE_NO_REEXECUTE bug.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: x86: Rename emulate_instruction() to kvm_emulate_instruction() · 0ce97a2b
      Authored by Sean Christopherson
      Lack of the kvm_ prefix gives the impression that it's a VMX or SVM
      specific function, and there's no conflict that prevents adding the
      kvm_ prefix.
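
      After the rename, the helper is a thin wrapper (a sketch; the zero and
      NULL arguments stand in for the cr2/insn parameters that callers no
      longer supply directly):

        int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type)
        {
                return x86_emulate_instruction(vcpu, 0, emulation_type,
                                               NULL, 0);
        }
        EXPORT_SYMBOL_GPL(kvm_emulate_instruction);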
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: x86: Do not re-{try,execute} after failed emulation in L2 · 6c3dfeb6
      Authored by Sean Christopherson
      Commit a6f177ef ("KVM: Reenter guest after emulation failure if
      due to access to non-mmio address") added reexecute_instruction() to
      handle the scenario where two (or more) vCPUS race to write a shadowed
      page, i.e. reexecute_instruction() is intended to return true if and
      only if the instruction being emulated was accessing a shadowed page.
      As L0 is only explicitly shadowing L1 tables, an emulation failure of
      a nested VM instruction cannot be due to a race to write a shadowed
      page and so should never be re-executed.
      
      This fixes an issue where an "MMIO" emulation failure[1] in L2 is all
      but guaranteed to result in an infinite loop when TDP is enabled.
      Because "cr2" is actually an L2 GPA when TDP is enabled, calling
      kvm_mmu_gva_to_gpa_write() to translate cr2 in the non-direct mapped
      case (L2 is never direct mapped) will almost always yield UNMAPPED_GVA
      and cause reexecute_instruction() to immediately return true.  The
      !mmio_info_in_cache() check in kvm_mmu_page_fault() doesn't catch this
      case because mmio_info_in_cache() returns false for a nested MMU (the
      MMIO caching currently handles L1 only, e.g. to cache nested guests'
      GPAs we'd have to manually flush the cache when switching between
      VMs and when L1 updated its page tables controlling the nested guest).
      
      Way back when, commit 68be0803 ("KVM: x86: never re-execute
      instruction with enabled tdp") changed reexecute_instruction() to
      always return false when using TDP under the assumption that KVM would
      only get into the emulator for MMIO.  Commit 95b3cf69 ("KVM: x86:
      let reexecute_instruction work for tdp") effectively reverted that
      behavior in order to handle the scenario where emulation failed due to
      an access from L1 to the shadow page tables for L2, but it didn't
      account for the case where emulation failed in L2 with TDP enabled.
      
      All of the above logic also applies to retry_instruction(), added by
      commit 1cb3f3ae ("KVM: x86: retry non-page-table writing
      instructions").  An indefinite loop in retry_instruction() should be
      impossible as it protects against retrying the same instruction over
      and over, but it's still correct to not retry an L2 instruction in
      the first place.
      
      Fix the immediate issue by adding a check for a nested guest when
      determining whether or not to allow retry in kvm_mmu_page_fault().
      In addition to fixing the immediate bug, add WARN_ON_ONCE in the
      retry functions since they are not designed to handle nested cases,
      i.e. they need to be modified even if there is some scenario in the
      future where we want to allow retrying a nested guest.
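
      A sketch of the gating described above (the EMULTYPE_ALLOW_RETRY flag
      and the helper names follow the commit's description; exact placement
      may differ):

        /* In kvm_mmu_page_fault(): only allow retry/re-execute for faults
         * that did not originate in a nested guest. */
        emulation_type = 0;
        if (!mmio_info_in_cache(vcpu, cr2, direct) && !is_guest_mode(vcpu))
                emulation_type = EMULTYPE_ALLOW_RETRY;

        /* In reexecute_instruction()/retry_instruction(): these paths are
         * not designed for nested guests, so warn if one ever gets here. */
        if (WARN_ON_ONCE(is_guest_mode(vcpu)))
                return false;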
      
      [1] This issue was encountered after commit 3a2936de ("kvm: mmu:
          Don't expose private memslots to L2") changed the page fault path
          to return KVM_PFN_NOSLOT when translating an L2 access to a
          private memslot.  Returning KVM_PFN_NOSLOT is semantically correct
          when we want to hide a memslot from L2, i.e. there effectively is
          no defined memory region for L2, but it has the unfortunate side
          effect of making KVM think the GFN is an MMIO page, thus triggering
          emulation.  The failure occurred with in-development code that
          deliberately exposed a private memslot to L2, which L2 accessed
          with an instruction that is not emulated by KVM.
      
      Fixes: 95b3cf69 ("KVM: x86: let reexecute_instruction work for tdp")
      Fixes: 1cb3f3ae ("KVM: x86: retry non-page-table writing instructions")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>