1. 09 Jan 2020, 6 commits
  2. 21 Nov 2019, 2 commits
    • KVM: nVMX: Remove unnecessary TLB flushes on L1<->L2 switches when L1 uses apic-access-page · 0155b2b9
      Authored by Liran Alon
      According to Intel SDM sections 28.3.3.3/28.3.3.4, "Guidelines for Use
      of the INVVPID/INVEPT Instruction", the hypervisor needs to execute
      INVVPID/INVEPT X when the CPU executes a VMEntry with VPID/EPTP X and
      either the "virtualize APIC accesses" VM-execution control was changed
      from 0 to 1, or the value of apic_access_page was changed.
      
      In the nested case, the burden falls on L1, unless L0 enables EPT in
      vmcs02 but L1 enables neither EPT nor VPID in vmcs12.  For this reason
      prepare_vmcs02() and load_vmcs12_host_state() have special code to
      request a TLB flush in case L1 does not use EPT but does use
      "virtualize APIC accesses".
      
      This special case, however, is not necessary.  On a nested vmentry the
      physical TLB will already be flushed except if all the following apply:
      
      * L0 uses VPID
      
      * L1 uses VPID
      
      * L0 can guarantee TLB entries populated while running L1 are tagged
      differently than TLB entries populated while running L2.
      
      If the first condition is false, the processor will flush the TLB
      on vmentry to L2.  If the second or third condition is false,
      prepare_vmcs02() will request KVM_REQ_TLB_FLUSH.  However, even
      if both are true, no extra TLB flush is needed to handle the APIC
      access page:
      
      * if L1 doesn't use VPID, the second condition doesn't hold and the
      TLB will be flushed anyway.
      
      * if L1 uses VPID, it has to flush the TLB itself with INVVPID and
      section 28.3.3.3 doesn't apply to L0.
      
      * even INVEPT is not needed because, if L0 uses EPT, it uses a
      different EPTP when running L2 than when running L1 (because guest_mode
      is part of the mmu role).  In this case SDM section 28.3.3.4 doesn't apply.
      
      Similarly, examining nested_vmx_vmexit()->load_vmcs12_host_state(),
      one could note that L0 won't flush TLB only in cases where SDM sections
      28.3.3.3 and 28.3.3.4 don't apply.  In particular, if L0 uses different
      VPIDs for L1 and L2 (i.e. vmx->vpid != vmx->nested.vpid02), section
      28.3.3.3 doesn't apply.
      
      Thus, remove this flush from prepare_vmcs02() and nested_vmx_vmexit().
      
      Side-note: This patch can be viewed as removing parts of commit
      fb6c8198 ("kvm: vmx: Flush TLB when the APIC-access address changes")
      that are no longer relevant since commit
      1313cc2b ("kvm: mmu: Add guest_mode to kvm_mmu_page_role").
      That is, the first commit assumed that if L0 uses EPT and L1 doesn't,
      then L0 will use the same EPTP for running both L1 and L2, which indeed
      required L0 to execute INVEPT before entering the L2 guest.  This
      assumption no longer holds since guest_mode was added to the mmu role.
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
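      
      The flush reasoning above can be condensed into a short sketch.  This is
      not the actual KVM code; the three boolean inputs are hypothetical
      stand-ins for the vmcs01/vmcs12 VPID checks the commit describes:
      
              static bool tlb_flushed_on_nested_vmentry(bool l0_uses_vpid,
                                                        bool l1_uses_vpid,
                                                        bool l1_l2_tags_differ)
              {
                      if (!l0_uses_vpid)
                              return true;    /* hardware flushes on VM-entry */
                      if (!l1_uses_vpid || !l1_l2_tags_differ)
                              return true;    /* prepare_vmcs02() requests KVM_REQ_TLB_FLUSH */
                      /*
                       * All three conditions hold: no flush occurs, but none is
                       * needed for the APIC-access page either.  L1 must execute
                       * INVVPID itself, and L0 runs L2 with a different EPTP
                       * (guest_mode is part of the mmu role), so SDM sections
                       * 28.3.3.3/28.3.3.4 do not apply to L0.
                       */
                      return false;
              }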
    • KVM: nVMX: Do not mark vmcs02->apic_access_page as dirty when unpinning · b11494bc
      Authored by Liran Alon
      vmcs->apic_access_page is simply a token that the hypervisor puts into
      the PFN of a 4KB EPTE (or PTE if using shadow-paging) that triggers an
      APIC-access VMExit or APIC virtualization logic whenever a CPU running
      in VMX non-root mode reads from or writes to this PFN.
      
      As every write either triggers an APIC-access VMExit or is performed
      on vmcs->virtual_apic_page, the PFN pointed to by
      vmcs->apic_access_page should never actually be touched by the CPU.
      
      Therefore, there is no need to mark vmcs02->apic_access_page as dirty
      after unpinning it on an emulated L2->L1 VMExit or when L1 exits VMX
      operation.
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
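      
      A minimal sketch of the resulting unpin path, assuming the field name
      from the commit message (the surrounding code is abbreviated):
      
              /* The CPU never writes through this PFN, so release it clean. */
              if (vmx->nested.apic_access_page) {
                      kvm_release_page_clean(vmx->nested.apic_access_page);
                      vmx->nested.apic_access_page = NULL;
              }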
  3. 20 Nov 2019, 2 commits
  4. 15 Nov 2019, 9 commits
  5. 23 Oct 2019, 1 commit
    • KVM: nVMX: Don't leak L1 MMIO regions to L2 · 671ddc70
      Authored by Jim Mattson
      If the "virtualize APIC accesses" VM-execution control is set in the
      VMCS, the APIC virtualization hardware is triggered when a page walk
      in VMX non-root mode terminates at a PTE wherein the address of the 4k
      page frame matches the APIC-access address specified in the VMCS. On
      hardware, the APIC-access address may be any valid 4k-aligned physical
      address.
      
      KVM's nVMX implementation enforces the additional constraint that the
      APIC-access address specified in the vmcs12 must be backed by
      a "struct page" in L1. If not, L0 will simply clear the "virtualize
      APIC accesses" VM-execution control in the vmcs02.
      
      The problem with this approach is that the L1 guest has arranged the
      vmcs12 EPT tables--or shadow page tables, if the "enable EPT"
      VM-execution control is clear in the vmcs12--so that the L2 guest
      physical address(es)--or L2 guest linear address(es)--that reference
      the L2 APIC map to the APIC-access address specified in the
      vmcs12. Without the "virtualize APIC accesses" VM-execution control in
      the vmcs02, the APIC accesses in the L2 guest will directly access the
      APIC-access page in L1.
      
      When there is no mapping whatsoever for the APIC-access address in L1,
      the L2 VM just loses the intended APIC virtualization. However, when
      the APIC-access address is mapped to an MMIO region in L1, the L2
      guest gets direct access to the L1 MMIO device. For example, if the
      APIC-access address specified in the vmcs12 is 0xfee00000, then L2
      gets direct access to L1's APIC.
      
      Since this vmcs12 configuration is something that KVM cannot
      faithfully emulate, the appropriate response is to exit to userspace
      with KVM_INTERNAL_ERROR_EMULATION.
      
      Fixes: fe3ef05c ("KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12")
      Reported-by: Dan Cross <dcross@google.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
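      
      A sketch of the failure path; apic_access_addr_backed_by_page() is a
      hypothetical stand-in for the gfn-to-page lookup, while the
      KVM_EXIT_INTERNAL_ERROR fields are KVM's standard userspace-exit
      mechanism:
      
              if (!apic_access_addr_backed_by_page(vcpu, vmcs12->apic_access_addr)) {
                      /* KVM cannot faithfully emulate this vmcs12 configuration. */
                      vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
                      vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
                      vcpu->run->internal.ndata = 0;
                      return false;   /* propagated up so KVM_RUN returns to userspace */
              }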
  6. 22 Oct 2019, 3 commits
  7. 03 Oct 2019, 1 commit
  8. 26 Sep 2019, 1 commit
  9. 24 Sep 2019, 4 commits
    • kvm: nvmx: limit atomic switch MSRs · f0b5105a
      Authored by Marc Orr
      Allowing an unlimited number of MSRs to be specified via the VMX
      load/store MSR lists (e.g., vm-entry MSR load list) is bad for two
      reasons. First, a guest can specify an unreasonable number of MSRs,
      forcing KVM to process all of them in software. Second, the SDM bounds
      the number of MSRs allowed to be packed into the atomic switch MSR lists.
      Quoting the "Miscellaneous Data" section in the "VMX Capability
      Reporting Facility" appendix:
      
      "Bits 27:25 is used to compute the recommended maximum number of MSRs
      that should appear in the VM-exit MSR-store list, the VM-exit MSR-load
      list, or the VM-entry MSR-load list. Specifically, if the value bits
      27:25 of IA32_VMX_MISC is N, then 512 * (N + 1) is the recommended
      maximum number of MSRs to be included in each list. If the limit is
      exceeded, undefined processor behavior may result (including a machine
      check during the VMX transition)."
      
      Because KVM needs to protect itself and can't model "undefined processor
      behavior", arbitrarily force a VM-entry to fail due to MSR loading when
      the MSR load list is too large. Similarly, trigger an abort during a VM
      exit that encounters an MSR load list or MSR store list that is too large.
      
      The MSR list size is intentionally not pre-checked so as to maintain
      compatibility with hardware as much as possible.
      
      Test these new checks with the kvm-unit-test "x86: nvmx: test max atomic
      switch MSRs".
      Suggested-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Signed-off-by: Marc Orr <marcorr@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
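      
      The quoted formula translates directly into a small helper; a sketch
      (the helper name is hypothetical, the multiplier is the SDM's 512):
      
              #define VMX_MISC_MSR_LIST_MULTIPLIER   512
      
              static u32 max_atomic_switch_msrs(u64 vmx_misc)
              {
                      u32 n = (vmx_misc >> 25) & 0x7; /* IA32_VMX_MISC bits 27:25 */
      
                      return (n + 1) * VMX_MISC_MSR_LIST_MULTIPLIER; /* 512 * (N + 1) */
              }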
    • KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexit · bf653b78
      Authored by Tao Xu
      According to the latest Intel 64 and IA-32 Architectures Software
      Developer's Manual, the UMWAIT and TPAUSE instructions cause a VM exit
      if the "RDTSC exiting" and "enable user wait and pause" VM-execution
      controls are both 1.
      
      Because KVM never enables RDTSC exiting, the VM exits for UMWAIT and
      TPAUSE should never happen.  EXIT_REASON_XSAVES and EXIT_REASON_XRSTORS
      are likewise unexpected VM exits for KVM.  Introduce a common exit
      helper, handle_unexpected_vmexit(), to handle these unexpected VM exits.
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Tao Xu <tao3.xu@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
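      
      A sketch of what such a "should never happen" handler can look like
      (the real handler's exact reporting may differ;
      KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON is the standard suberror for
      this situation):
      
              static int handle_unexpected_vmexit(struct kvm_vcpu *vcpu)
              {
                      WARN_ONCE(1, "Unexpected VM-Exit, reason = 0x%x",
                                vmcs_read32(VM_EXIT_REASON));
                      vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
                      vcpu->run->internal.suberror =
                              KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
                      vcpu->run->internal.ndata = 0;
                      return 0;       /* 0 = exit to userspace */
              }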
    • KVM: x86: Add support for user wait instructions · e69e72fa
      Authored by Tao Xu
      UMONITOR, UMWAIT and TPAUSE are a set of user wait instructions.
      This patch adds support for user wait instructions in KVM.  Availability
      of the user wait instructions is indicated by the presence of the CPUID
      feature flag WAITPKG, CPUID.0x07.0x0:ECX[5].  User wait instructions may
      be executed at any privilege level, and use the 32-bit IA32_UMWAIT_CONTROL
      MSR to set the maximum wait time.
      
      The behavior of user wait instructions in VMX non-root operation is
      determined first by the setting of the "enable user wait and pause"
      secondary processor-based VM-execution control bit 26.
      	If the VM-execution control is 0, UMONITOR/UMWAIT/TPAUSE cause
      an invalid-opcode exception (#UD).
      	If the VM-execution control is 1, treatment is based on the
      setting of the "RDTSC exiting" VM-execution control. Because KVM never
      enables RDTSC exiting, if the instruction causes a delay, the amount of
      time delayed is called here the physical delay. The physical delay is
      first computed by determining the virtual delay. If
      IA32_UMWAIT_CONTROL[31:2] is zero, the virtual delay is the value in
      EDX:EAX minus the value that RDTSC would return; if
      IA32_UMWAIT_CONTROL[31:2] is not zero, the virtual delay is the minimum
      of that difference and AND(IA32_UMWAIT_CONTROL,FFFFFFFCH).
      
      Because UMWAIT and TPAUSE can put a (physical) CPU into a power-saving
      state, by default we don't expose them to the guest and enable them
      only when the guest CPUID has the feature.
      
      Detailed information about user wait instructions can be found in the
      latest Intel 64 and IA-32 Architectures Software Developer's Manual.
      Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Tao Xu <tao3.xu@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
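      
      The virtual-delay computation described above, written out as a sketch
      (this models the SDM's definition; it is not KVM code, since the delay
      itself is left to hardware):
      
              static u64 umwait_virtual_delay(u64 edx_eax, u64 tsc_now,
                                              u32 umwait_control)
              {
                      u64 delay = edx_eax - tsc_now;  /* deadline minus RDTSC */
                      /* AND(IA32_UMWAIT_CONTROL, FFFFFFFCH), i.e. bits 31:2 */
                      u32 max = umwait_control & 0xfffffffcU;
      
                      if (!max)
                              return delay;
                      return min(delay, (u64)max);
              }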
    • KVM: nVMX: Check Host Address Space Size on vmentry of nested guests · 5845038c
      Authored by Krish Sadhukhan
      According to section "Checks Related to Address-Space Size" in Intel SDM
      vol 3C, the following checks are performed on vmentry of nested guests:
      
          If the logical processor is outside IA-32e mode (if IA32_EFER.LMA = 0)
          at the time of VM entry, the following must hold:
      	- The "IA-32e mode guest" VM-entry control is 0.
      	- The "host address-space size" VM-exit control is 0.
      
          If the logical processor is in IA-32e mode (if IA32_EFER.LMA = 1) at the
          time of VM entry, the "host address-space size" VM-exit control must be 1.
      
          If the "host address-space size" VM-exit control is 0, the following must
          hold:
      	- The "IA-32e mode guest" VM-entry control is 0.
      	- Bit 17 of the CR4 field (corresponding to CR4.PCIDE) is 0.
      	- Bits 63:32 in the RIP field are 0.
      
          If the "host address-space size" VM-exit control is 1, the following must
          hold:
      	- Bit 5 of the CR4 field (corresponding to CR4.PAE) is 1.
      	- The RIP field contains a canonical address.
      
          On processors that do not support Intel 64 architecture, checks are
          performed to ensure that the "IA-32e mode guest" VM-entry control and the
          "host address-space size" VM-exit control are both 0.
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
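      
      Condensed into code, the checks read roughly as follows (a sketch
      against the vmcs12 fields; the IA32_EFER.LMA-dependent checks are
      omitted for brevity):
      
              bool ia32e = vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE;
      
              if (!ia32e) {
                      if ((vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) ||
                          (vmcs12->host_cr4 & X86_CR4_PCIDE) ||
                          (vmcs12->host_rip >> 32))
                              return -EINVAL;
              } else {
                      if (!(vmcs12->host_cr4 & X86_CR4_PAE) ||
                          is_noncanonical_address(vmcs12->host_rip, vcpu))
                              return -EINVAL;
              }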
  10. 14 Sep 2019, 1 commit
  11. 12 Sep 2019, 1 commit
    • KVM: x86: Fix INIT signal handling in various CPU states · 4b9852f4
      Authored by Liran Alon
      Commit cd7764fe ("KVM: x86: latch INITs while in system management mode")
      changed the code to latch INIT while the vCPU is in SMM and process the
      latched INIT when leaving SMM.  It left a subtle remark in the commit
      message that similar treatment should also be done while the vCPU is in
      VMX non-root mode.
      
      However, INIT signals should actually be latched in various vCPU states:
      (*) For both Intel and AMD, INIT signals should be latched while vCPU
      is in SMM.
      (*) For Intel, INIT should also be latched while vCPU is in VMX
      operation and later processed when vCPU leaves VMX operation by
      executing VMXOFF.
      (*) For AMD, INIT should also be latched while vCPU runs with GIF=0
      or in guest-mode with intercept defined on INIT signal.
      
      To fix this:
      1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
      can define the various CPU states in which INIT signals should be
      blocked and modify kvm_apic_accept_events() to use it.
      2) Modify vmx_check_nested_events() to check for a pending INIT signal
      while the vCPU is in guest mode.  If so, emulate a vmexit on
      EXIT_REASON_INIT_SIGNAL.  Note that nSVM should have similar behaviour,
      but this is currently left as a TODO comment to implement in the future
      because nSVM doesn't yet implement svm_check_nested_events().
      
      Note: Currently the KVM nVMX implementation doesn't support the VMX
      wait-for-SIPI activity state as specified in MSR_IA32_VMX_MISC bits 6:8
      exposed to the guest (see nested_vmx_setup_ctls_msrs()).
      If and when support for this activity state is implemented,
      kvm_check_nested_events() would need to avoid emulating a vmexit on an
      INIT signal when the activity state is wait-for-SIPI.  In addition,
      kvm_apic_accept_events() would need to be modified to avoid discarding
      a SIPI when the VMX activity state is wait-for-SIPI, and instead delay
      SIPI processing to vmx_check_nested_events(), which would clear
      pending APIC events and emulate a vmexit on SIPI.
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
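      
      A sketch of the VMX side of the vendor hook from step (1): INIT is
      blocked whenever the vCPU is in VMX operation, i.e. post-VMXON (the
      SMM latching stays in the common lapic code):
      
              static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
              {
                      /* INIT is latched while the vCPU is in VMX operation. */
                      return to_vmx(vcpu)->nested.vmxon;
              }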
  12. 11 Sep 2019, 3 commits
    • KVM: nVMX: trace nested VM-Enter failures detected by H/W · 380e0055
      Authored by Sean Christopherson
      Use the recently added tracepoint for logging nested VM-Enter failures
      instead of spamming the kernel log when hardware detects a consistency
      check failure.  Take the opportunity to print the name of the error code
      instead of dumping the raw hex number, but limit the symbol table to
      error codes that can reasonably be encountered by KVM.
      
      Add an equivalent tracepoint in nested_vmx_check_vmentry_hw(), e.g. so
      that tracing of "invalid control field" errors isn't suppressed when
      nested early checks are enabled.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: add tracepoint for failed nested VM-Enter · 5497b955
      Authored by Sean Christopherson
      Debugging a failed VM-Enter is often like searching for a needle in a
      haystack, e.g. there are over 80 consistency checks that funnel into
      the "invalid control field" error code.  One way to expedite debug is
      to run the buggy code as an L1 guest under KVM (and pray that the
      failing check is detected by KVM).  However, extracting useful debug
      information out of L0 KVM requires attaching a debugger to KVM and/or
      modifying the source, e.g. to log which check is failing.
      
      Make life a little less painful for VMM developers and add a tracepoint
      for failed VM-Enter consistency checks.  Ideally the tracepoint would
      capture both what check failed and precisely why it failed, but logging
      why a check failed is difficult to do in a generic tracepoint without
      resorting to invasive techniques, e.g. generating a custom string on
      failure.  That being said, for the vast majority of VM-Enter failures
      the most difficult step is figuring out exactly what to look at, e.g.
      figuring out which bit was incorrectly set in a control field is usually
      not too painful once the guilty field has been identified.
      
      To reach a happy medium between precision and ease of use, simply log
      the code that detected a failed check, using a macro to execute the
      check and log the trace event on failure.  This approach enables tracing
      arbitrary code, e.g. it's not limited to function calls or specific
      formats of checks, and the changes to the existing code are minimally
      invasive.  A macro with a two-character name is desirable as usage of
      the macro doesn't result in overly long lines or confusing alignment,
      while still retaining some amount of readability.  I.e. a one-character
      name is a little too terse, and a three-character name results in the
      contents being passed to the macro aligning with an indented line when
      the macro is used in an if-statement, e.g.:
      
              if (VCC(nested_vmx_check_long_line_one(...) &&
                      nested_vmx_check_long_line_two(...)))
                      return -EINVAL;
      
      And that is the story of how the CC(), a.k.a. Consistency Check, macro
      got its name.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
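      
      A sketch of the macro along the lines the commit describes: it
      evaluates the check, traces the stringified expression on failure, and
      yields the result so it composes inside if-statements:
      
              #define CC(consistency_check)                                   \
              ({                                                              \
                      bool failed = (consistency_check);                      \
                                                                              \
                      if (failed)                                             \
                              trace_kvm_nested_vmenter_failed(                \
                                      #consistency_check, 0);                 \
                      failed;                                                 \
              })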
    • KVM: x86: Refactor up kvm_{g,s}et_msr() to simplify callers · f20935d8
      Authored by Sean Christopherson
      Refactor the top-level MSR accessors to take/return the index and value
      directly instead of requiring the caller to dump them into a msr_data
      struct.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
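      
      A sketch of the shape of the change (the exact call sites vary; before,
      callers had to populate a struct msr_data by hand):
      
              /* Before: build a struct msr_data just to read one MSR. */
              struct msr_data msr;
      
              msr.index = index;
              msr.host_initiated = false;
              r = kvm_get_msr(vcpu, &msr);
              data = msr.data;
      
              /* After: index in, value out, no temporary struct. */
              r = kvm_get_msr(vcpu, index, &data);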
  13. 22 Jul 2019, 2 commits
  14. 20 Jul 2019, 1 commit
  15. 16 Jul 2019, 1 commit
  16. 05 Jul 2019, 2 commits
    • KVM nVMX: Check Host Segment Registers and Descriptor Tables on vmentry of nested guests · 1ef23e1f
      Authored by Krish Sadhukhan
      According to section "Checks on Host Segment and Descriptor-Table
      Registers" in Intel SDM vol 3C, the following checks are performed on
      vmentry of nested guests:
      
         - In the selector field for each of CS, SS, DS, ES, FS, GS and TR, the
           RPL (bits 1:0) and the TI flag (bit 2) must be 0.
         - The selector fields for CS and TR cannot be 0000H.
         - The selector field for SS cannot be 0000H if the "host address-space
           size" VM-exit control is 0.
         - On processors that support Intel 64 architecture, the base-address
           fields for FS, GS and TR must contain canonical addresses.
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
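      
      As code, the checks look roughly like this sketch (SEGMENT_RPL_MASK
      covers bits 1:0 and SEGMENT_TI_MASK bit 2; "ia32e" refers to the "host
      address-space size" VM-exit control):
      
              if ((vmcs12->host_cs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) ||
                  (vmcs12->host_ss_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) ||
                  (vmcs12->host_ds_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) ||
                  (vmcs12->host_es_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) ||
                  (vmcs12->host_fs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) ||
                  (vmcs12->host_gs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) ||
                  (vmcs12->host_tr_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) ||
                  vmcs12->host_cs_selector == 0 ||
                  vmcs12->host_tr_selector == 0 ||
                  (vmcs12->host_ss_selector == 0 && !ia32e))
                      return -EINVAL;
      
              /* On 64-bit processors, FS/GS/TR bases must be canonical. */
              if (is_noncanonical_address(vmcs12->host_fs_base, vcpu) ||
                  is_noncanonical_address(vmcs12->host_gs_base, vcpu) ||
                  is_noncanonical_address(vmcs12->host_tr_base, vcpu))
                      return -EINVAL;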
    • KVM: nVMX: Stash L1's CR3 in vmcs01.GUEST_CR3 on nested entry w/o EPT · f087a029
      Authored by Sean Christopherson
      KVM does not have 100% coverage of VMX consistency checks, i.e. some
      checks that cause VM-Fail may only be detected by hardware during a
      nested VM-Entry.  In such a case, KVM must restore L1's state to the
      pre-VM-Enter state as L2's state has already been loaded into KVM's
      software model.
      
      L1's CR3 and PDPTRs in particular are loaded from vmcs01.GUEST_*.  But
      when EPT is disabled, the associated fields hold KVM's shadow values,
      not L1's "real" values.  Fortunately, when EPT is disabled the PDPTRs
      come from memory, i.e. are not cached in the VMCS.  Which leaves CR3
      as the sole anomaly.
      
      A previously applied workaround to handle CR3 was to force nested early
      checks if EPT is disabled:
      
        commit 2b27924b ("KVM: nVMX: always use early vmcs check when EPT
                               is disabled")
      
      Forcing nested early checks is undesirable as doing so adds hundreds of
      cycles to every nested VM-Entry.  Rather than take this performance hit,
      handle CR3 by overwriting vmcs01.GUEST_CR3 with L1's CR3 during nested
      VM-Entry when EPT is disabled *and* nested early checks are disabled.
      By stuffing vmcs01.GUEST_CR3, nested_vmx_restore_host_state() will
      naturally restore the correct vcpu->arch.cr3 from vmcs01.GUEST_CR3.
      
      These shenanigans work because nested_vmx_restore_host_state() does a
      full kvm_mmu_reset_context(), i.e. unloads the current MMU, which
      guarantees vmcs01.GUEST_CR3 will be rewritten with a new shadow CR3
      prior to re-entering L1.
      
      vcpu->arch.root_mmu.root_hpa is set to INVALID_PAGE via:
      
          nested_vmx_restore_host_state() ->
              kvm_mmu_reset_context() ->
                  kvm_mmu_unload() ->
                      kvm_mmu_free_roots()
      
      kvm_mmu_unload() has WARN_ON(root_hpa != INVALID_PAGE), i.e. we can bank
      on 'root_hpa == INVALID_PAGE' unless the implementation of
      kvm_mmu_reset_context() is changed.
      
      On the way into L1, VMCS.GUEST_CR3 is guaranteed to be written (on a
      successful entry) via:
      
          vcpu_enter_guest() ->
              kvm_mmu_reload() ->
                  kvm_mmu_load() ->
                      kvm_mmu_load_cr3() ->
                          vmx_set_cr3()
      
      Stuff vmcs01.GUEST_CR3 if and only if nested early checks are disabled,
      as a "late" VM-Fail should never happen in that case (KVM WARNs), and
      the conditional write avoids the need to restore the correct GUEST_CR3
      when nested_vmx_check_vmentry_hw() fails.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20190607185534.24368-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
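      
      The conditional write itself is small; a sketch of the core of the
      change, performed while vmcs01 is still the current VMCS (enable_ept
      and nested_early_check are the existing vmx module parameters):
      
              /*
               * Overwrite vmcs01.GUEST_CR3 with L1's CR3 so that a late
               * VM-Fail restores the correct vcpu->arch.cr3: without EPT,
               * GUEST_CR3 otherwise holds KVM's shadow value.  Skip this when
               * nested early checks are enabled, since a late VM-Fail should
               * never happen in that case.
               */
              if (!enable_ept && !nested_early_check)
                      vmcs_writel(GUEST_CR3, vcpu->arch.cr3);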