1. 17 Mar, 2020 · 2 commits
  2. 23 Feb, 2020 · 3 commits
  3. 22 Feb, 2020 · 1 commit
    • KVM: nVMX: clear PIN_BASED_POSTED_INTR from nested pinbased_ctls only when... · a4443267
      Committed by Vitaly Kuznetsov
      KVM: nVMX: clear PIN_BASED_POSTED_INTR from nested pinbased_ctls only when apicv is globally disabled
      
      When apicv is disabled on a vCPU (e.g. by enabling KVM_CAP_HYPERV_SYNIC*),
      nothing happens to VMX MSRs on the already existing vCPUs, however, all new
      ones are created with PIN_BASED_POSTED_INTR filtered out. This is very
      confusing and results in the following picture inside the guest:
      
      $ rdmsr -ax 0x48d
      ff00000016
      7f00000016
      7f00000016
      7f00000016
      
      This is observed with QEMU and 4-vCPU guest: QEMU creates vCPU0, does
      KVM_CAP_HYPERV_SYNIC2 and then creates the remaining three.
      
      L1 hypervisor may only check CPU0's controls to find out what features
      are available and it will be very confused later. Switch to setting
      PIN_BASED_POSTED_INTR control based on global 'enable_apicv' setting.

      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
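The rdmsr output quoted in the commit above can be decoded with a few lines of C. This is a minimal sketch, not code from the patch: it assumes the usual layout of VMX capability MSRs (allowed-1 settings in the upper 32 bits) and that "process posted interrupts" is bit 7 of the pin-based controls. The helper names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 7 of the pin-based VM-execution controls: "process posted interrupts". */
#define PIN_BASED_POSTED_INTR (UINT32_C(1) << 7)

/* Hypothetical helper: the upper 32 bits of a VMX capability MSR report
 * which control bits are allowed to be 1. */
static inline uint32_t vmx_ctls_allowed1(uint64_t msr)
{
    return (uint32_t)(msr >> 32);
}

/* Hypothetical helper: true if the MSR value says posted interrupts
 * may be enabled on this vCPU. */
bool posted_intr_allowed(uint64_t pinbased_ctls_msr)
{
    return (vmx_ctls_allowed1(pinbased_ctls_msr) & PIN_BASED_POSTED_INTR) != 0;
}
```

Applied to the values in the commit message, 0xff00000016 (vCPU0) reports the control as available while 0x7f00000016 (the other vCPUs) does not, which is the inconsistency the patch removes.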
  4. 17 Feb, 2020 · 1 commit
  5. 13 Feb, 2020 · 1 commit
  6. 12 Feb, 2020 · 1 commit
  7. 05 Feb, 2020 · 3 commits
  8. 28 Jan, 2020 · 3 commits
    • KVM: nVMX: Check GUEST_DR7 on vmentry of nested guests · b91991bf
      Committed by Krish Sadhukhan
      According to section "Checks on Guest Control Registers, Debug Registers, and
      MSRs" in Intel SDM vol 3C, the following checks are performed on vmentry
      of nested guests:
      
          If the "load debug controls" VM-entry control is 1, bits 63:32 in the DR7
          field must be 0.
      
      In KVM, GUEST_DR7 is set by kvm_set_dr() before the vmcs02 VM-entry, and
      kvm_set_dr() synthesizes a #GP if any bit in the high dword of GUEST_DR7
      is set. Hence this field needs to be checked in software.

      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
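The DR7 rule quoted in the commit above is small enough to state as a predicate. The following is an illustrative sketch only; the function name and parameters are hypothetical and do not come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate for the quoted SDM rule: if the "load debug
 * controls" VM-entry control is 1, bits 63:32 of the guest DR7 field
 * must be 0. When the control is 0 the field is not checked here. */
bool guest_dr7_valid(bool load_debug_controls, uint64_t dr7)
{
    if (!load_debug_controls)
        return true;

    return (dr7 >> 32) == 0;
}
```

A software check of this kind lets the nested VM-entry fail as the SDM describes, rather than relying on the #GP that kvm_set_dr() would otherwise synthesize.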
    • KVM: x86: Perform non-canonical checks in 32-bit KVM · de761ea7
      Committed by Sean Christopherson
      Remove the CONFIG_X86_64 condition from the low level non-canonical
      helpers to effectively enable non-canonical checks on 32-bit KVM.
      Non-canonical checks are performed by hardware if the CPU *supports*
      64-bit mode; whether or not the CPU is actually in 64-bit mode is
      irrelevant.
      
      For the most part, skipping non-canonical checks on 32-bit KVM is ok-ish
      because 32-bit KVM always (hopefully) drops bits 63:32 of whatever value
      it's checking before propagating it to hardware, and architecturally,
      the expected behavior for the guest is a bit of a grey area since the
      vCPU itself doesn't support 64-bit mode.  I.e. a 32-bit KVM guest can
      observe the missed checks in several paths, e.g. INVVPID and VM-Enter,
      but it's debatable whether or not the missed checks constitute a bug
      because technically the vCPU doesn't support 64-bit mode.
      
      The primary motivation for enabling the non-canonical checks is defense
      in depth.  As mentioned above, a guest can trigger a missed check via
      INVVPID or VM-Enter.  INVVPID is straightforward as it takes a 64-bit
      virtual address as part of its 128-bit INVVPID descriptor and fails if
      the address is non-canonical, even if INVVPID is executed in 32-bit PM.
      Nested VM-Enter is a bit more convoluted as it requires the guest to
      write natural width VMCS fields via memory accesses and then VMPTRLD the
      VMCS, but it's still possible.  In both cases, KVM is saved from a true
      bug only because its flows that propagate values to hardware (correctly)
      take "unsigned long" parameters and so drop bits 63:32 of the bad value.
      
      Explicitly performing the non-canonical checks makes it less likely that
      a bad value will be propagated to hardware, e.g. in the INVVPID case,
      if __invvpid() didn't implicitly drop bits 63:32 then KVM would BUG() on
      the resulting unexpected INVVPID failure due to hardware rejecting the
      non-canonical address.
      
      The only downside to enabling the non-canonical checks is that it adds a
      relatively small amount of overhead, but the affected flows are not hot
      paths, i.e. the overhead is negligible.
      
      Note, KVM technically could gate the non-canonical checks on 32-bit KVM
      with static_cpu_has(X86_FEATURE_LM), but on bare metal that's an even
      bigger waste of code for everyone except the 0.00000000000001% of the
      population running on Yonah, and nested 32-bit on 64-bit already fudges
      things with respect to 64-bit CPU behavior.

      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Also do so in nested_vmx_check_host_state as reported by Krish. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
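For readers unfamiliar with the term, an address is canonical when all bits above the implemented virtual-address width are copies of the top implemented bit. The sketch below shows one portable way to test that; it is not the kernel's helper, the function name is hypothetical, and the 48-bit width used in the usage note is an assumption (the real width depends on the CPU).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if 'la' is non-canonical for a CPU
 * that implements 'vaddr_bits' bits of virtual address (1..64).
 * The address is canonical when bits 63:(vaddr_bits - 1) are all zero
 * or all one. */
bool is_noncanonical(uint64_t la, unsigned int vaddr_bits)
{
    unsigned int shift = vaddr_bits - 1;
    uint64_t upper = la >> shift;            /* sign bit and everything above */
    uint64_t all_ones = UINT64_MAX >> shift; /* same bit span, fully set */

    return upper != 0 && upper != all_ones;
}
```

With 48 implemented bits, 0x00007fffffffffff and 0xffff800000000000 are canonical, while 0x0000800000000000 is not. Note that the check operates on the full 64-bit value, which is why it still matters on a 32-bit host that would otherwise silently drop bits 63:32.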
    • KVM: nVMX: WARN on failure to set IA32_PERF_GLOBAL_CTRL · d1968421
      Committed by Oliver Upton
      Writes to MSR_CORE_PERF_GLOBAL_CTRL should never fail if the VM-exit
      and VM-entry controls are exposed to L1. Promote the checks to perform a
      full WARN if kvm_set_msr() fails and remove the now unused macro
      SET_MSR_OR_WARN().

      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Oliver Upton <oupton@google.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 21 Jan, 2020 · 3 commits
  10. 14 Jan, 2020 · 1 commit
    • x86/msr-index: Clean up bit defines for IA32_FEATURE_CONTROL MSR · 32ad73db
      Committed by Sean Christopherson
      As pointed out by Boris, the defines for bits in IA32_FEATURE_CONTROL
      are quite a mouthful, especially the VMX bits which must differentiate
      between enabling VMX inside and outside SMX (TXT) operation.  Rename the
      MSR and its bit defines to abbreviate FEATURE_CONTROL as FEAT_CTL to
      make them a little friendlier on the eyes.
      
      Arguably, the MSR itself should keep the full IA32_FEATURE_CONTROL name
      to match Intel's SDM, but a future patch will add a dedicated Kconfig,
      file and functions for the MSR. Using the full name for those assets is
      rather unwieldy, so bite the bullet and use IA32_FEAT_CTL so that its
      nomenclature is consistent throughout the kernel.
      
      Opportunistically, fix a few other annoyances with the defines:
      
        - Relocate the bit defines so that they immediately follow the MSR
          define, e.g. aren't mistaken as belonging to MISC_FEATURE_CONTROL.
        - Add whitespace around the block of feature control defines to make
          it clear they're all related.
        - Use BIT() instead of manually encoding the bit shift.
        - Use "VMX" instead of "VMXON" to match the SDM.
        - Append "_ENABLED" to the LMCE (Local Machine Check Exception) bit to
          be consistent with the kernel's verbiage used for all other feature
          control bits.  Note, the SDM refers to the LMCE bit as LMCE_ON,
          likely to differentiate it from IA32_MCG_EXT_CTL.LMCE_EN.  Ignore
          the (literal) one-off usage of _ON, the SDM is simply "wrong".

      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20191221044513.21680-2-sean.j.christopherson@intel.com
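To make the renaming concrete, the defines after this cleanup look roughly like the sketch below. The names and bit positions are reproduced from memory of the kernel's msr-index.h and should be treated as an assumption rather than a quotation of the patch; the BIT() macro is redefined locally so the fragment compiles on its own, and vmx_usable_outside_smx() is a hypothetical helper added only to show the bits in use.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for the kernel's BIT() macro. */
#define BIT(n) (UINT64_C(1) << (n))

/* Assumed post-rename names; verify against msr-index.h before relying on them. */
#define MSR_IA32_FEAT_CTL                0x0000003a
#define FEAT_CTL_LOCKED                  BIT(0)
#define FEAT_CTL_VMX_ENABLED_INSIDE_SMX  BIT(1)
#define FEAT_CTL_VMX_ENABLED_OUTSIDE_SMX BIT(2)
#define FEAT_CTL_LMCE_ENABLED            BIT(20)

/* Hypothetical helper: VMX can be used outside SMX only when the MSR is
 * locked and the "outside SMX" enable bit is set. */
bool vmx_usable_outside_smx(uint64_t feat_ctl)
{
    uint64_t required = FEAT_CTL_LOCKED | FEAT_CTL_VMX_ENABLED_OUTSIDE_SMX;

    return (feat_ctl & required) == required;
}
```

Grouping the bit defines directly under the MSR define, as the commit describes, keeps a reader from attributing them to the neighbouring MISC_FEATURE_CONTROL MSR.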
  11. 09 Jan, 2020 · 6 commits
  12. 21 Nov, 2019 · 2 commits
    • KVM: nVMX: Remove unnecessary TLB flushes on L1<->L2 switches when L1 use apic-access-page · 0155b2b9
      Committed by Liran Alon
      According to the Intel SDM, sections 28.3.3.3/28.3.3.4 (Guidelines for
      Use of the INVVPID/INVEPT Instruction), the hypervisor needs to execute
      INVVPID/INVEPT X when the CPU executes a VMEntry with VPID/EPTP X and
      either the "Virtualize APIC accesses" VM-execution control was changed
      from 0 to 1 or the value of apic_access_page was changed.
      
      In the nested case, the burden falls on L1, unless L0 enables EPT in
      vmcs02 but L1 enables neither EPT nor VPID in vmcs12.  For this reason
      prepare_vmcs02() and load_vmcs12_host_state() have special code to
      request a TLB flush in case L1 does not use EPT but it uses
      "virtualize APIC accesses".
      
      This special case however is not necessary. On a nested vmentry the
      physical TLB will already be flushed except if all the following apply:
      
      * L0 uses VPID
      
      * L1 uses VPID
      
      * L0 can guarantee TLB entries populated while running L1 are tagged
      differently than TLB entries populated while running L2.
      
      If the first condition is false, the processor will flush the TLB
      on vmentry to L2.  If the second or third condition is false,
      prepare_vmcs02() will request KVM_REQ_TLB_FLUSH.  However, even
      if both are true, no extra TLB flush is needed to handle the APIC
      access page:
      
      * if L1 doesn't use VPID, the second condition doesn't hold and the
      TLB will be flushed anyway.
      
      * if L1 uses VPID, it has to flush the TLB itself with INVVPID and
      section 28.3.3.3 doesn't apply to L0.
      
      * even INVEPT is not needed because, if L0 uses EPT, it uses different
      EPTP when running L2 than L1 (because guest_mode is part of mmu-role).
      In this case SDM section 28.3.3.4 doesn't apply.
      
      Similarly, examining nested_vmx_vmexit()->load_vmcs12_host_state(),
      one could note that L0 won't flush TLB only in cases where SDM sections
      28.3.3.3 and 28.3.3.4 don't apply.  In particular, if L0 uses different
      VPIDs for L1 and L2 (i.e. vmx->vpid != vmx->nested.vpid02), section
      28.3.3.3 doesn't apply.
      
      Thus, remove this flush from prepare_vmcs02() and nested_vmx_vmexit().
      
      Side-note: This patch can be viewed as removing the parts of commit
      fb6c8198 ("kvm: vmx: Flush TLB when the APIC-access address changes")
      that are no longer relevant since commit
      1313cc2b ("kvm: mmu: Add guest_mode to kvm_mmu_page_role").
      I.e., the first commit assumes that if L0 uses EPT and L1 doesn't use EPT,
      then L0 will use the same EPTP for both L1 and L2, which indeed required
      L0 to execute INVEPT before entering the L2 guest. This assumption has
      not held since guest_mode was added to mmu-role.

      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
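The three conditions listed in the commit above amount to a small decision table. The sketch below restates that reasoning in C; it is a model of the argument, not KVM code, and every type and function name in it is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical model of the three conditions from the commit message. */
struct nested_tlb_state {
    bool l0_uses_vpid;
    bool l1_uses_vpid;
    bool l1_l2_tagged_differently; /* L0 can guarantee distinct TLB tags */
};

enum flush_source {
    FLUSH_BY_HARDWARE,    /* processor flushes on vmentry to L2 */
    FLUSH_BY_KVM_REQUEST, /* prepare_vmcs02() requests KVM_REQ_TLB_FLUSH */
    NO_FLUSH              /* all three conditions hold */
};

/* Who, if anyone, flushes the physical TLB on a nested vmentry. */
enum flush_source nested_vmentry_flush(const struct nested_tlb_state *s)
{
    if (!s->l0_uses_vpid)
        return FLUSH_BY_HARDWARE;

    if (!s->l1_uses_vpid || !s->l1_l2_tagged_differently)
        return FLUSH_BY_KVM_REQUEST;

    return NO_FLUSH;
}
```

In the NO_FLUSH case L1 uses VPID, so per the commit message it is L1's job to issue INVVPID when it changes the APIC-access setup, and L0 needs no special-case flush.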
    • KVM: nVMX: Do not mark vmcs02->apic_access_page as dirty when unpinning · b11494bc
      Committed by Liran Alon
      vmcs->apic_access_page is simply a token that the hypervisor puts into
      the PFN of a 4KB EPTE (or PTE if using shadow paging) that triggers an
      APIC-access VMExit or APIC virtualization logic whenever a CPU running
      in VMX non-root mode reads from or writes to this PFN.
      
      As every write either triggers an APIC-access VMExit or is performed
      on vmcs->virtual_apic_page, the PFN pointed to by
      vmcs->apic_access_page should never actually be touched by the CPU.
      
      Therefore, there is no need to mark vmcs02->apic_access_page as dirty
      after unpinning it on an L2->L1 emulated VMExit or when L1 exits VMX
      operation.

      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  13. 20 Nov, 2019 · 2 commits
  14. 15 Nov, 2019 · 9 commits
  15. 23 Oct, 2019 · 1 commit
    • KVM: nVMX: Don't leak L1 MMIO regions to L2 · 671ddc70
      Committed by Jim Mattson
      If the "virtualize APIC accesses" VM-execution control is set in the
      VMCS, the APIC virtualization hardware is triggered when a page walk
      in VMX non-root mode terminates at a PTE wherein the address of the 4k
      page frame matches the APIC-access address specified in the VMCS. On
      hardware, the APIC-access address may be any valid 4k-aligned physical
      address.
      
      KVM's nVMX implementation enforces the additional constraint that the
      APIC-access address specified in the vmcs12 must be backed by
      a "struct page" in L1. If not, L0 will simply clear the "virtualize
      APIC accesses" VM-execution control in the vmcs02.
      
      The problem with this approach is that the L1 guest has arranged the
      vmcs12 EPT tables--or shadow page tables, if the "enable EPT"
      VM-execution control is clear in the vmcs12--so that the L2 guest
      physical address(es)--or L2 guest linear address(es)--that reference
      the L2 APIC map to the APIC-access address specified in the
      vmcs12. Without the "virtualize APIC accesses" VM-execution control in
      the vmcs02, the APIC accesses in the L2 guest will directly access the
      APIC-access page in L1.
      
      When there is no mapping whatsoever for the APIC-access address in L1,
      the L2 VM just loses the intended APIC virtualization. However, when
      the APIC-access address is mapped to an MMIO region in L1, the L2
      guest gets direct access to the L1 MMIO device. For example, if the
      APIC-access address specified in the vmcs12 is 0xfee00000, then L2
      gets direct access to L1's APIC.
      
      Since this vmcs12 configuration is something that KVM cannot
      faithfully emulate, the appropriate response is to exit to userspace
      with KVM_INTERNAL_ERROR_EMULATION.
      
      Fixes: fe3ef05c ("KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12")
      Reported-by: Dan Cross <dcross@google.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
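The behaviour change described in the commit above can be summarised as a three-way outcome. This is a deliberately simplified sketch with hypothetical names; it only models the decision described in the message and says nothing about how KVM actually looks up or pins the page.

```c
#include <stdbool.h>

/* Hypothetical outcomes when preparing vmcs02 from vmcs12. */
enum nested_apic_outcome {
    NESTED_APIC_NOT_REQUESTED,    /* control is clear in vmcs12 */
    NESTED_APIC_VIRTUALIZED,      /* control is kept in vmcs02 */
    NESTED_APIC_EXIT_TO_USERSPACE /* report KVM_INTERNAL_ERROR_EMULATION */
};

/* Before the fix, the unbacked case silently cleared the control in
 * vmcs02, letting L2 reach whatever L1 had mapped at that address.
 * After the fix, the unbacked case is reported to userspace instead. */
enum nested_apic_outcome
resolve_apic_access(bool virtualize_apic_accesses, bool backed_by_struct_page)
{
    if (!virtualize_apic_accesses)
        return NESTED_APIC_NOT_REQUESTED;

    if (backed_by_struct_page)
        return NESTED_APIC_VIRTUALIZED;

    return NESTED_APIC_EXIT_TO_USERSPACE;
}
```

Failing loudly in the unbacked case is the safer choice here, since the alternative gives L2 direct access to an L1 MMIO region such as L1's APIC at 0xfee00000.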
  16. 22 Oct, 2019 · 1 commit