1. 15 Nov, 2019 (2 commits)
    • KVM: nVMX: Update vmcs01 TPR_THRESHOLD if L2 changed L1 TPR · 02d496cf
      Liran Alon committed
      When L1 doesn't use TPR-Shadow to run L2, L0 configures vmcs02 without
      TPR-Shadow and installs intercepts on CR8 accesses (load and store).
      
      If L1 does not intercept L2's CR8 accesses, L0's intercepts on those
      accesses emulate the load/store on L1's LAPIC TPR. If in this case L2
      lowers the TPR such that there is now an injectable interrupt to L1,
      apic_update_ppr() requests KVM_REQ_EVENT, which triggers a call
      to update_cr8_intercept() to update the TPR-Threshold to the highest
      pending IRR priority.
      
      However, this update to the TPR-Threshold is done while the active vmcs
      is vmcs02 instead of vmcs01. Thus, when L0 later emulates an exit from
      L2 to L1, L1 will still run with a high TPR-Threshold. This results in
      every VMEntry to L1 immediately exiting on TPR_BELOW_THRESHOLD, and it
      continues to do so indefinitely until some condition causes
      KVM_REQ_EVENT to be set.
      (Note that the TPR_BELOW_THRESHOLD exit handler does not set
      KVM_REQ_EVENT until apic_update_ppr() notices a new injectable
      interrupt for the PPR.)
      
      To fix this issue, change update_cr8_intercept() such that if L2 lowers
      L1's TPR in a way that requires lowering L1's TPR-Threshold, the update
      to the TPR-Threshold is saved and applied to vmcs01 when L0 emulates an
      exit from L2 to L1.
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      02d496cf
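      A minimal standalone sketch of the deferral mechanism described above,
      with illustrative stubs (not the kernel code; the real patch keeps the
      deferred value in vmx->nested):

          #include <stdint.h>

          #define TPR_THRESHOLD 0x401c                 /* VMCS field encoding */
          static void vmcs_write32(uint32_t field, uint32_t val) { (void)field; (void)val; }

          struct vmx_sketch {
              int guest_mode;        /* non-zero while vmcs02 is the active VMCS */
              int l1_tpr_threshold;  /* deferred vmcs01 update, -1 = none pending */
          };

          /* While L2 runs, stash the update instead of writing it into vmcs02. */
          static void update_cr8_intercept(struct vmx_sketch *vmx, int irr)
          {
              if (vmx->guest_mode)
                  vmx->l1_tpr_threshold = irr;
              else
                  vmcs_write32(TPR_THRESHOLD, irr);    /* vmcs01 is active */
          }

          /* On the emulated exit from L2 to L1, vmcs01 is active again: apply it. */
          static void nested_vmx_vmexit(struct vmx_sketch *vmx)
          {
              vmx->guest_mode = 0;
              if (vmx->l1_tpr_threshold != -1)
                  vmcs_write32(TPR_THRESHOLD, vmx->l1_tpr_threshold);
              vmx->l1_tpr_threshold = -1;
          }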
    • KVM: VMX: Consume pending LAPIC INIT event when exit on INIT_SIGNAL · e64a8508
      Liran Alon committed
      Intel SDM section 25.2 OTHER CAUSES OF VM EXITS specifies the following
      on INIT signals: "Such exits do not modify register state or clear pending
      events as they would outside of VMX operation."
      
      When commit 4b9852f4 ("KVM: x86: Fix INIT signal handling in various CPU states")
      was applied, I interpreted the above Intel SDM statement to mean that an
      INIT_SIGNAL exit doesn't consume the LAPIC INIT pending event.
      
      However, when Nadav Amit ran the matching kvm-unit-test on a bare-metal
      machine, it turned out my interpretation was wrong; i.e. an INIT_SIGNAL
      exit does consume the LAPIC INIT pending event.
      (See: https://www.spinics.net/lists/kvm/msg196757.html)
      
      Therefore, fix KVM code to behave as observed on bare-metal.
      
      Fixes: 4b9852f4 ("KVM: x86: Fix INIT signal handling in various CPU states")
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e64a8508
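      The behavioral change, sketched (constants match the KVM/SDM definitions;
      the surrounding structure is an illustrative stand-in):

          #include <stdint.h>

          #define KVM_APIC_INIT           0   /* bit index of the latched INIT */
          #define EXIT_REASON_INIT_SIGNAL 3   /* basic exit reason 3, per the SDM */

          struct apic_sketch { uint64_t pending_events; };

          /* Emulating the INIT_SIGNAL vmexit now also consumes the latched
           * INIT, matching the behavior observed on bare metal. */
          static void emulate_init_signal_vmexit(struct apic_sketch *apic)
          {
              apic->pending_events &= ~(1ull << KVM_APIC_INIT);
              /* ...then emulate the vmexit to L1 with EXIT_REASON_INIT_SIGNAL... */
          }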
  2. 22 Oct, 2019 (3 commits)
  3. 03 Oct, 2019 (1 commit)
  4. 26 Sep, 2019 (1 commit)
  5. 24 Sep, 2019 (4 commits)
    • kvm: nvmx: limit atomic switch MSRs · f0b5105a
      Marc Orr committed
      Allowing an unlimited number of MSRs to be specified via the VMX
      load/store MSR lists (e.g., vm-entry MSR load list) is bad for two
      reasons. First, a guest can specify an unreasonable number of MSRs,
      forcing KVM to process all of them in software. Second, the SDM bounds
      the number of MSRs allowed to be packed into the atomic switch MSR lists.
      Quoting the "Miscellaneous Data" section in the "VMX Capability
      Reporting Facility" appendix:
      
      "Bits 27:25 is used to compute the recommended maximum number of MSRs
      that should appear in the VM-exit MSR-store list, the VM-exit MSR-load
      list, or the VM-entry MSR-load list. Specifically, if the value bits
      27:25 of IA32_VMX_MISC is N, then 512 * (N + 1) is the recommended
      maximum number of MSRs to be included in each list. If the limit is
      exceeded, undefined processor behavior may result (including a machine
      check during the VMX transition)."
      
      Because KVM needs to protect itself and can't model "undefined processor
      behavior", arbitrarily force a VM-entry to fail due to MSR loading when
      the MSR load list is too large. Similarly, trigger an abort during a VM
      exit that encounters an MSR load list or MSR store list that is too large.
      
      The MSR list size is intentionally not pre-checked so as to maintain
      compatibility with hardware inasmuch as possible.
      
      Test these new checks with the kvm-unit-test "x86: nvmx: test max atomic
      switch MSRs".
      Suggested-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Signed-off-by: Marc Orr <marcorr@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f0b5105a
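      The SDM formula quoted above, as a small helper (illustrative; KVM reads
      the value from the vCPU's virtualized IA32_VMX_MISC rather than taking
      it as a parameter):

          #include <stdint.h>

          static uint32_t max_atomic_switch_msrs(uint64_t ia32_vmx_misc)
          {
              uint32_t n = (uint32_t)((ia32_vmx_misc >> 25) & 0x7);  /* bits 27:25 */
              return 512 * (n + 1);   /* N = 0 gives the 512-entry minimum */
          }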
    • KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexit · bf653b78
      Tao Xu committed
      According to the latest Intel 64 and IA-32 Architectures Software
      Developer's Manual, the UMWAIT and TPAUSE instructions cause a VM exit
      if the "RDTSC exiting" and "enable user wait and pause" VM-execution
      controls are both 1.
      
      Because KVM never enables RDTSC exiting, VM exits for UMWAIT and TPAUSE
      should never happen. EXIT_REASON_XSAVES and EXIT_REASON_XRSTORS are
      likewise unexpected VM exits for KVM. Introduce a common exit helper,
      handle_unexpected_vmexit(), to handle these unexpected VM exits.
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Tao Xu <tao3.xu@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bf653b78
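      The shape of such a handler, sketched with KVM UAPI constants (the
      struct and logging here are illustrative stand-ins for KVM internals):

          #include <stdint.h>
          #include <stdio.h>

          #define KVM_EXIT_INTERNAL_ERROR                   17  /* KVM UAPI */
          #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON  4  /* KVM UAPI */

          struct kvm_run_sketch { uint32_t exit_reason; uint32_t suberror; };

          /* Warn, then punt the unexpected exit to userspace as an internal
           * error instead of pretending to handle it. */
          static int handle_unexpected_vmexit(struct kvm_run_sketch *run,
                                              uint32_t exit_reason)
          {
              fprintf(stderr, "Unexpected vm-exit reason 0x%x\n", exit_reason);
              run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
              run->suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
              return 0;   /* 0 = exit to userspace */
          }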
    • KVM: x86: Add support for user wait instructions · e69e72fa
      Tao Xu committed
      UMONITOR, UMWAIT and TPAUSE are a set of user wait instructions.
      This patch adds support for user wait instructions in KVM. Availability
      of the user wait instructions is indicated by the presence of the CPUID
      feature flag WAITPKG, CPUID.0x07.0x0:ECX[5]. User wait instructions may
      be executed at any privilege level, and use the 32-bit
      IA32_UMWAIT_CONTROL MSR to set the maximum wait time.
      
      The behavior of user wait instructions in VMX non-root operation is
      determined first by the setting of the "enable user wait and pause"
      secondary processor-based VM-execution control (bit 26).
      	If the VM-execution control is 0, UMONITOR/UMWAIT/TPAUSE cause
      an invalid-opcode exception (#UD).
      	If the VM-execution control is 1, treatment is based on the
      setting of the "RDTSC exiting" VM-execution control. Because KVM never
      enables RDTSC exiting, if the instruction causes a delay, the amount of
      time delayed is called here the physical delay. The physical delay is
      first computed by determining the virtual delay: if
      IA32_UMWAIT_CONTROL[31:2] is zero, the virtual delay is the value in
      EDX:EAX minus the value that RDTSC would return; if
      IA32_UMWAIT_CONTROL[31:2] is not zero, the virtual delay is the minimum
      of that difference and AND(IA32_UMWAIT_CONTROL, FFFFFFFCH).
      
      Because UMWAIT and TPAUSE can put a (physical) CPU into a power-saving
      state, by default we don't expose them to the guest, and enable them
      only when guest CPUID has the feature.
      
      Detailed information about user wait instructions can be found in the
      latest Intel 64 and IA-32 Architectures Software Developer's Manual.
      Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Tao Xu <tao3.xu@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e69e72fa
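      The virtual-delay rule above, as a worked helper (illustrative; the
      actual physical delay may additionally be cut short by external events):

          #include <stdint.h>

          static uint64_t umwait_virtual_delay(uint64_t edx_eax, uint64_t tsc_now,
                                               uint32_t ia32_umwait_control)
          {
              uint64_t delay = edx_eax > tsc_now ? edx_eax - tsc_now : 0;
              uint64_t cap = ia32_umwait_control & 0xFFFFFFFCu; /* AND(ctrl, FFFFFFFCH) */

              if (cap)                       /* IA32_UMWAIT_CONTROL[31:2] != 0 */
                  return delay < cap ? delay : cap;
              return delay;
          }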
    • KVM: nVMX: Check Host Address Space Size on vmentry of nested guests · 5845038c
      Krish Sadhukhan committed
      According to section "Checks Related to Address-Space Size" in Intel SDM
      vol 3C, the following checks are performed on vmentry of nested guests:
      
          If the logical processor is outside IA-32e mode (if IA32_EFER.LMA = 0)
          at the time of VM entry, the following must hold:
      	- The "IA-32e mode guest" VM-entry control is 0.
      	- The "host address-space size" VM-exit control is 0.
      
          If the logical processor is in IA-32e mode (if IA32_EFER.LMA = 1) at the
          time of VM entry, the "host address-space size" VM-exit control must be 1.
      
          If the "host address-space size" VM-exit control is 0, the following must
          hold:
      	- The "IA-32e mode guest" VM-entry control is 0.
      	- Bit 17 of the CR4 field (corresponding to CR4.PCIDE) is 0.
      	- Bits 63:32 in the RIP field are 0.
      
          If the "host address-space size" VM-exit control is 1, the following must
          hold:
      	- Bit 5 of the CR4 field (corresponding to CR4.PAE) is 1.
      	- The RIP field contains a canonical address.
      
          On processors that do not support Intel 64 architecture, checks are
          performed to ensure that the "IA-32e mode guest" VM-entry control and the
          "host address-space size" VM-exit control are both 0.
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5845038c
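      A condensed sketch of the checks listed above (illustrative; KVM's real
      checks live in nested_vmx_check_host_state()):

          #include <stdbool.h>
          #include <stdint.h>

          /* 48-bit canonical-address test. */
          static bool is_canonical(uint64_t va)
          {
              return ((int64_t)(va << 16) >> 16) == (int64_t)va;
          }

          static bool host_addr_space_checks_ok(bool efer_lma, bool ia32e_mode_guest,
                                                bool host_addr_space_size,
                                                uint64_t host_cr4, uint64_t host_rip)
          {
              if (!efer_lma && (ia32e_mode_guest || host_addr_space_size))
                  return false;
              if (efer_lma && !host_addr_space_size)
                  return false;

              if (!host_addr_space_size)
                  return !ia32e_mode_guest &&
                         !(host_cr4 & (1ull << 17)) &&  /* CR4.PCIDE must be 0 */
                         !(host_rip >> 32);             /* RIP[63:32] must be 0 */

              return (host_cr4 & (1ull << 5)) &&        /* CR4.PAE must be 1 */
                     is_canonical(host_rip);
          }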
  6. 14 Sep, 2019 (1 commit)
  7. 12 Sep, 2019 (1 commit)
    • KVM: x86: Fix INIT signal handling in various CPU states · 4b9852f4
      Liran Alon committed
      Commit cd7764fe ("KVM: x86: latch INITs while in system management mode")
      changed the code to latch INIT while the vCPU is in SMM and process the
      latched INIT when leaving SMM. It left a subtle remark in the commit
      message that similar treatment should also be done while the vCPU is in
      VMX non-root mode.
      
      However, INIT signals should actually be latched in various vCPU states:
      (*) For both Intel and AMD, INIT signals should be latched while the
      vCPU is in SMM.
      (*) For Intel, INIT should also be latched while the vCPU is in VMX
      operation, and later processed when the vCPU leaves VMX operation by
      executing VMXOFF.
      (*) For AMD, INIT should also be latched while the vCPU runs with GIF=0,
      or in guest mode with an intercept defined on the INIT signal.
      
      To fix this:
      1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
      can define the various CPU states in which INIT signals should be
      blocked, and modify kvm_apic_accept_events() to use it.
      2) Modify vmx_check_nested_events() to check for a pending INIT signal
      while the vCPU is in guest mode. If so, emulate a vmexit on
      EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour,
      but this is currently left as a TODO comment to implement in the future
      because nSVM doesn't yet implement svm_check_nested_events().
      
      Note: Currently the KVM nVMX implementation doesn't support the VMX
      wait-for-SIPI activity state as specified in MSR_IA32_VMX_MISC bits 6:8
      exposed to the guest (see nested_vmx_setup_ctls_msrs()).
      If and when support for this activity state is implemented,
      kvm_check_nested_events() would need to avoid emulating a vmexit on an
      INIT signal in case the activity state is wait-for-SIPI. In addition,
      kvm_apic_accept_events() would need to be modified to avoid discarding
      a SIPI in case the VMX activity state is wait-for-SIPI, and instead
      delay SIPI processing to vmx_check_nested_events(), which would clear
      pending APIC events and emulate a vmexit on SIPI.
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4b9852f4
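      The Intel side of item (1), sketched with illustrative stand-ins for the
      vCPU state (the in-guest-mode case is handled by the vmexit from
      item (2)):

          #include <stdbool.h>

          struct vcpu_sketch {
              bool in_smm;
              bool vmxon;          /* vCPU is in VMX operation */
              bool pending_init;
          };

          static bool vmx_apic_init_signal_blocked(struct vcpu_sketch *v)
          {
              return v->vmxon;     /* block INIT for the whole VMX-operation span */
          }

          /* Caller side, mirroring kvm_apic_accept_events(): when blocked,
           * leave the INIT latched instead of delivering it. */
          static void apic_accept_events(struct vcpu_sketch *v)
          {
              if (v->pending_init &&
                  (v->in_smm || vmx_apic_init_signal_blocked(v)))
                  return;          /* latched; processed on SMM exit / VMXOFF */
              /* ...otherwise deliver the INIT (reset the vCPU)... */
          }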
  8. 11 Sep, 2019 (3 commits)
    • KVM: nVMX: trace nested VM-Enter failures detected by H/W · 380e0055
      Sean Christopherson committed
      Use the recently added tracepoint for logging nested VM-Enter failures
      instead of spamming the kernel log when hardware detects a consistency
      check failure.  Take the opportunity to print the name of the error code
      instead of dumping the raw hex number, but limit the symbol table to
      error codes that can reasonably be encountered by KVM.
      
      Add an equivalent tracepoint in nested_vmx_check_vmentry_hw(), e.g. so
      that tracing of "invalid control field" errors isn't suppressed when
      nested early checks are enabled.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      380e0055
    • KVM: nVMX: add tracepoint for failed nested VM-Enter · 5497b955
      Sean Christopherson committed
      Debugging a failed VM-Enter is often like searching for a needle in a
      haystack, e.g. there are over 80 consistency checks that funnel into
      the "invalid control field" error code.  One way to expedite debug is
      to run the buggy code as an L1 guest under KVM (and pray that the
      failing check is detected by KVM).  However, extracting useful debug
      information out of L0 KVM requires attaching a debugger to KVM and/or
      modifying the source, e.g. to log which check is failing.
      
      Make life a little less painful for VMM developers and add a tracepoint
      for failed VM-Enter consistency checks.  Ideally the tracepoint would
      capture both what check failed and precisely why it failed, but logging
      why a check failed is difficult to do in a generic tracepoint without
      resorting to invasive techniques, e.g. generating a custom string on
      failure.  That being said, for the vast majority of VM-Enter failures
      the most difficult step is figuring out exactly what to look at, e.g.
      figuring out which bit was incorrectly set in a control field is usually
      not too painful once the guilty field has been identified.
      
      To reach a happy medium between precision and ease of use, simply log
      the code that detected a failed check, using a macro to execute the
      check and log the trace event on failure.  This approach enables tracing
      arbitrary code, e.g. it's not limited to function calls or specific
      formats of checks, and the changes to the existing code are minimally
      invasive.  A macro with a two-character name is desirable as usage of
      the macro doesn't result in overly long lines or confusing alignment,
      while still retaining some amount of readability.  I.e. a one-character
      name is a little too terse, and a three-character name results in the
      contents being passed to the macro aligning with an indented line when
      the macro is used in an if-statement, e.g.:
      
              if (VCC(nested_vmx_check_long_line_one(...) &&
                      nested_vmx_check_long_line_two(...)))
                      return -EINVAL;
      
      And that is the story of how the CC(), a.k.a. Consistency Check, macro
      got its name.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5497b955
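      The pattern, sketched (the in-tree macro reports through the new
      trace_kvm_nested_vmenter_failed() tracepoint rather than printing;
      statement expressions as used here are a GCC/Clang extension):

          #include <stdbool.h>
          #include <stdio.h>

          #define CC(consistency_check)                                          \
          ({                                                                     \
              bool failed = (consistency_check);                                 \
              if (failed)                                                        \
                  fprintf(stderr, "check failed: %s\n", #consistency_check);     \
              failed;                                                            \
          })

          /* Usage: a failing check logs its own stringified expression. */
          static int check_cr4(unsigned long cr4, unsigned long valid_bits)
          {
              if (CC(cr4 & ~valid_bits))
                  return -1;
              return 0;
          }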
    • KVM: x86: Refactor up kvm_{g,s}et_msr() to simplify callers · f20935d8
      Sean Christopherson committed
      Refactor the top-level MSR accessors to take/return the index and value
      directly instead of requiring the caller to dump them into a msr_data
      struct.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f20935d8
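      The resulting accessor shape, sketched (these declarations illustrate
      the refactor's direction; the msr_data struct becomes an internal detail
      of the common path):

          #include <stdint.h>

          struct kvm_vcpu;   /* opaque here */

          int kvm_get_msr(struct kvm_vcpu *vcpu, uint32_t index, uint64_t *data);
          int kvm_set_msr(struct kvm_vcpu *vcpu, uint32_t index, uint64_t data);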
  9. 22 Jul, 2019 (2 commits)
  10. 20 Jul, 2019 (1 commit)
  11. 16 Jul, 2019 (1 commit)
  12. 05 Jul, 2019 (2 commits)
    • KVM nVMX: Check Host Segment Registers and Descriptor Tables on vmentry of nested guests · 1ef23e1f
      Krish Sadhukhan committed
      According to section "Checks on Host Segment and Descriptor-Table
      Registers" in Intel SDM vol 3C, the following checks are performed on
      vmentry of nested guests:
      
         - In the selector field for each of CS, SS, DS, ES, FS, GS and TR, the
           RPL (bits 1:0) and the TI flag (bit 2) must be 0.
         - The selector fields for CS and TR cannot be 0000H.
         - The selector field for SS cannot be 0000H if the "host address-space
           size" VM-exit control is 0.
         - On processors that support Intel 64 architecture, the base-address
           fields for FS, GS and TR must contain canonical addresses.
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1ef23e1f
    • KVM: nVMX: Stash L1's CR3 in vmcs01.GUEST_CR3 on nested entry w/o EPT · f087a029
      Sean Christopherson committed
      KVM does not have 100% coverage of VMX consistency checks, i.e. some
      checks that cause VM-Fail may only be detected by hardware during a
      nested VM-Entry.  In such a case, KVM must restore L1's state to the
      pre-VM-Enter state as L2's state has already been loaded into KVM's
      software model.
      
      L1's CR3 and PDPTRs in particular are loaded from vmcs01.GUEST_*.  But
      when EPT is disabled, the associated fields hold KVM's shadow values,
      not L1's "real" values.  Fortunately, when EPT is disabled the PDPTRs
      come from memory, i.e. are not cached in the VMCS.  Which leaves CR3
      as the sole anomaly.
      
      A previously applied workaround to handle CR3 was to force nested early
      checks if EPT is disabled:
      
        commit 2b27924b ("KVM: nVMX: always use early vmcs check when EPT
                               is disabled")
      
      Forcing nested early checks is undesirable as doing so adds hundreds of
      cycles to every nested VM-Entry.  Rather than take this performance hit,
      handle CR3 by overwriting vmcs01.GUEST_CR3 with L1's CR3 during nested
      VM-Entry when EPT is disabled *and* nested early checks are disabled.
      By stuffing vmcs01.GUEST_CR3, nested_vmx_restore_host_state() will
      naturally restore the correct vcpu->arch.cr3 from vmcs01.GUEST_CR3.
      
      These shenanigans work because nested_vmx_restore_host_state() does a
      full kvm_mmu_reset_context(), i.e. unloads the current MMU, which
      guarantees vmcs01.GUEST_CR3 will be rewritten with a new shadow CR3
      prior to re-entering L1.
      
      vcpu->arch.root_mmu.root_hpa is set to INVALID_PAGE via:
      
          nested_vmx_restore_host_state() ->
              kvm_mmu_reset_context() ->
                  kvm_mmu_unload() ->
                      kvm_mmu_free_roots()
      
      kvm_mmu_unload() has WARN_ON(root_hpa != INVALID_PAGE), i.e. we can bank
      on 'root_hpa == INVALID_PAGE' unless the implementation of
      kvm_mmu_reset_context() is changed.
      
      On the way into L1, VMCS.GUEST_CR3 is guaranteed to be written (on a
      successful entry) via:
      
          vcpu_enter_guest() ->
              kvm_mmu_reload() ->
                  kvm_mmu_load() ->
                      kvm_mmu_load_cr3() ->
                          vmx_set_cr3()
      
      Stuff vmcs01.GUEST_CR3 if and only if nested early checks are disabled,
      as a "late" VM-Fail should never happen in that case (KVM WARNs), and
      the conditional write avoids the need to restore the correct GUEST_CR3
      when nested_vmx_check_vmentry_hw() fails.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20190607185534.24368-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f087a029
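      The heart of the change, sketched (field encoding from the SDM; the
      stub and wrapper are illustrative):

          #include <stdbool.h>
          #include <stdint.h>

          #define GUEST_CR3 0x6802   /* VMCS field encoding */
          static void vmcs_writel(uint32_t field, uint64_t val) { (void)field; (void)val; }

          /* Only when EPT is off *and* nested early checks are off does
           * vmcs01.GUEST_CR3 need L1's real CR3, so that a late VM-Fail
           * naturally restores the correct value. */
          static void stash_l1_cr3(bool enable_ept, bool nested_early_check,
                                   uint64_t l1_cr3)
          {
              if (!enable_ept && !nested_early_check)
                  vmcs_writel(GUEST_CR3, l1_cr3);
          }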
  13. 03 Jul, 2019 (5 commits)
    • KVM: nVMX: Change KVM_STATE_NESTED_EVMCS to signal vmcs12 is copied from eVMCS · 323d73a8
      Liran Alon committed
      Currently KVM_STATE_NESTED_EVMCS is used to signal that the eVMCS
      capability is enabled on the vCPU, as indicated by
      vmx->nested.enlightened_vmcs_enabled.
      
      This is quite bizarre, as the userspace VMM should make sure to expose
      the same vCPU with the same CPUID values in both source and destination.
      In case the vCPU is exposed with eVMCS support in CPUID, it is also
      expected to enable the KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability.
      Therefore, KVM_STATE_NESTED_EVMCS is redundant.
      
      KVM_STATE_NESTED_EVMCS is currently used on the restore path
      (vmx_set_nested_state()) only to enable the eVMCS capability in KVM
      and to signal need_vmcs12_sync, such that on the next VMEntry to the
      guest nested_sync_from_vmcs12() will be called to sync vmcs12 content
      into the eVMCS in guest memory.
      However, because restoring nested state is rare enough, we could have
      just modified vmx_set_nested_state() to always signal need_vmcs12_sync.
      
      From all of the above, it seems that we could have just removed the
      usage of KVM_STATE_NESTED_EVMCS. However, in order to preserve backwards
      migration compatibility, we cannot do that (vmx_get_nested_state()
      needs to signal the flag when migrating from a new kernel to an old
      kernel).
      
      Returning KVM_STATE_NESTED_EVMCS when the vCPU merely has eVMCS enabled
      has the bad side effect of forcing the userspace VMM to send nested
      state from source to destination as part of the migration stream, even
      if the guest has never used eVMCS, e.g. because it doesn't even run a
      nested hypervisor workload. This requires the destination userspace VMM
      and KVM to support setting nested state, which makes it more difficult
      to migrate from a new host to an older host.
      To avoid this, change KVM_STATE_NESTED_EVMCS to signal that eVMCS is
      not only enabled but also active, i.e. the guest has made some eVMCS
      active via an enlightened VMEntry, meaning vmcs12 is copied from the
      eVMCS and therefore should be restored into the eVMCS resident in
      memory (by copy_vmcs12_to_enlightened()).
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Maran Wilson <maran.wilson@oracle.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      323d73a8
    • KVM: nVMX: Allow restore nested-state to enable eVMCS when vCPU in SMM · 65b712f1
      Liran Alon committed
      As the comment in the code specifies, SMM temporarily disables VMX, so
      we cannot be in guest mode, nor can VMLAUNCH/VMRESUME be pending.
      
      However, the code currently assumes that these are the only flags that
      can be set on kvm_state->flags. This is not true, as
      KVM_STATE_NESTED_EVMCS can also be set on this field to signal that
      eVMCS should be enabled.
      
      Therefore, fix the code to check for guest mode and pending
      VMLAUNCH/VMRESUME explicitly.
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      65b712f1
    • kvm: nVMX: Remove unnecessary sync_roots from handle_invept · b1190198
      Jim Mattson committed
      When L0 is executing handle_invept(), the TDP MMU is active. Emulating
      an L1 INVEPT does require synchronizing the appropriate shadow EPT
      root(s), but a call to kvm_mmu_sync_roots in this context won't do
      that. Similarly, the hardware TLB and paging-structure-cache entries
      associated with the appropriate shadow EPT root(s) must be flushed,
      but requesting a TLB_FLUSH from this context won't do that either.
      
      How did this ever work? KVM always does a sync_roots and TLB flush (in
      the correct context) when transitioning from L1 to L2. That isn't the
      best choice for nested VM performance, but it effectively papers over
      the mistakes here.
      
      Remove the unnecessary operations and leave a comment to try to do
      better in the future.
      Reported-by: Junaid Shahid <junaids@google.com>
      Fixes: bfd0a56b ("nEPT: Nested INVEPT")
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Nadav Har'El <nyh@il.ibm.com>
      Cc: Jun Nakajima <jun.nakajima@intel.com>
      Cc: Xinhao Xu <xinhao.xu@intel.com>
      Cc: Yang Zhang <yang.z.zhang@Intel.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Reviewed-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b1190198
    • x86/kvm/nVMX: fix VMCLEAR when Enlightened VMCS is in use · 11e34914
      Vitaly Kuznetsov committed
      When Enlightened VMCS is in use, it is valid to do VMCLEAR and,
      according to the TLFS, this should "transition an enlightened VMCS from
      the active to the non-active state". It is, however, wrong to assume
      that it is only valid to do VMCLEAR for the eVMCS which is currently
      active on the vCPU performing the VMCLEAR.
      
      Currently, the logic in handle_vmclear() is broken: in case there is no
      active eVMCS on the vCPU doing VMCLEAR, we treat the argument as a
      'normal' VMCS, and the kvm_vcpu_write_guest() to the 'launch_state'
      field irreversibly corrupts the memory area.
      
      So, in case the VMCLEAR argument is not the currently active eVMCS on
      the vCPU, how can we know whether the area it is pointing to is a normal
      or an enlightened VMCS?
      Thanks to the bug in Hyper-V (see commit 72aeb60c ("KVM: nVMX: Verify
      eVMCS revision id match supported eVMCS version on eVMCS VMPTRLD")) we
      cannot: the revision can't be used to distinguish between them. So let's
      assume it is always enlightened in case enlightened vmentry is enabled
      in the assist page. Also, check vmx->nested.enlightened_vmcs_enabled to
      minimize the impact on 'unenlightened' workloads.
      
      Fixes: b8bbab92 ("KVM: nVMX: implement enlightened VMPTRLD and VMCLEAR")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      11e34914
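      The resulting decision, sketched as a predicate (names here are
      illustrative, not the in-tree ones):

          #include <stdbool.h>

          /* Returns true if handle_vmclear() may safely clear the operand's
           * launch_state field, i.e. the area is assumed to be a normal VMCS. */
          static bool vmclear_may_write_launch_state(bool enlightened_vmcs_enabled,
                                                     bool evmcs_enabled_in_assist_page)
          {
              if (enlightened_vmcs_enabled && evmcs_enabled_in_assist_page)
                  return false;   /* assume enlightened: the write would corrupt it */
              return true;
          }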
    • x86/KVM/nVMX: don't use clean fields data on enlightened VMLAUNCH · a21a39c2
      Vitaly Kuznetsov committed
      Apparently, Windows doesn't maintain clean-fields data after it does
      VMCLEAR for an enlightened VMCS, so we can only use it on VMRESUME.
      The issue went unnoticed because currently we do nested_release_evmcs()
      in handle_vmclear(), and the consecutive enlightened VMPTRLD invalidates
      the clean fields when a new eVMCS is mapped; but we're going to change
      that logic.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a21a39c2
  14. 02 Jul, 2019 (2 commits)
  15. 21 Jun, 2019 (1 commit)
    • KVM: nVMX: reorganize initial steps of vmx_set_nested_state · 9fd58877
      Paolo Bonzini committed
      Commit 332d0797 ("KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS
      state before setting new state", 2019-05-02) broke evmcs_test because the
      eVMCS setup must be performed even if there is no VMXON region defined,
      as long as the eVMCS bit is set in the assist page.
      
      While the simplest possible fix would be to add a check on
      kvm_state->flags & KVM_STATE_NESTED_EVMCS in the initial "if" that
      covers kvm_state->hdr.vmx.vmxon_pa == -1ull, that is quite ugly.
      
      Instead, this patch moves checks earlier in the function and
      conditionalizes them on kvm_state->hdr.vmx.vmxon_pa, so that
      vmx_set_nested_state always goes through vmx_leave_nested
      and nested_enable_evmcs.
      
      Fixes: 332d0797 ("KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state")
      Cc: Aaron Lewis <aaronlewis@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9fd58877
  16. 19 Jun, 2019 (1 commit)
    • KVM: x86: Modify struct kvm_nested_state to have explicit fields for data · 6ca00dfa
      Liran Alon committed
      Improve the KVM_{GET,SET}_NESTED_STATE structs by detailing the format
      of VMX nested state data in a struct.
      
      In order to avoid changing the ioctl values of
      KVM_{GET,SET}_NESTED_STATE, there is a need to preserve
      sizeof(struct kvm_nested_state). This is done by defining the data
      struct as "data.vmx[0]". It was the most elegant way I found to
      preserve the struct size while still keeping the struct readable and
      easy to maintain. It does have the unfortunate side effect that the data
      now has to be accessed as "data.vmx[0]" rather than just "data.vmx".
      
      Because we are already modifying these structs, I also modified the
      following:
      * Define the "format" field values as macros.
      * Rename vmcs_pa to vmcs12_pa for better readability.
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      [Remove SVM stubs, add KVM_STATE_NESTED_VMX_VMCS12_SIZE. - Paolo]
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6ca00dfa
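      The layout trick, condensed from the UAPI (the real struct carries more
      header fields; this sketch keeps only what the commit message discusses):

          #include <stdint.h>

          #define KVM_STATE_NESTED_VMX_VMCS12_SIZE 0x1000

          struct kvm_vmx_nested_state_data {
              uint8_t vmcs12[KVM_STATE_NESTED_VMX_VMCS12_SIZE];
              uint8_t shadow_vmcs12[KVM_STATE_NESTED_VMX_VMCS12_SIZE];
          };

          struct kvm_nested_state_sketch {
              uint16_t flags;
              uint16_t format;   /* format values are defined as macros */
              uint32_t size;
              /* ... fixed-size header union ... */
              union {
                  /* zero-length tail: adds no bytes to sizeof(), hence the
                   * "data.vmx[0]" access the message describes */
                  struct kvm_vmx_nested_state_data vmx[0];
              } data;
          };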
  17. 18 Jun, 2019 (9 commits)
    • KVM: VMX: Leave preemption timer running when it's disabled · 804939ea
      Sean Christopherson committed
      VMWRITEs to the major VMCS controls, pin controls included, are
      deceptively expensive.  CPUs with VMCS caching (Westmere and later) also
      optimize away consistency checks on VM-Entry, i.e. skip consistency
      checks if the relevant fields have not changed since the last successful
      VM-Entry (of the cached VMCS).  Because uops are a precious commodity,
      uCode's dirty VMCS field tracking isn't as precise as software would
      prefer.  Notably, writing any of the major VMCS fields effectively marks
      the entire VMCS dirty, i.e. causes the next VM-Entry to perform all
      consistency checks, which consumes several hundred cycles.
      
      As it pertains to KVM, toggling PIN_BASED_VMX_PREEMPTION_TIMER more than
      doubles the latency of the next VM-Entry (and again when/if the flag is
      toggled back).  In a non-nested scenario, running a "standard" guest
      with the preemption timer enabled, toggling the timer flag is uncommon
      but not rare, e.g. roughly 1 in 10 entries.  Disabling the preemption
      timer can change these numbers due to its use for "immediate exits",
      even when explicitly disabled by userspace.
      
      Nested virtualization in particular is painful, as the timer flag is set
      for the majority of VM-Enters, but prepare_vmcs02() initializes vmcs02's
      pin controls to *clear* the flag, since the timer's final state isn't
      known until vmx_vcpu_run().  I.e. the majority of nested VM-Enters end
      up unnecessarily writing pin controls *twice*.
      
      Rather than toggle the timer flag in pin controls, set the timer value
      itself to the largest allowed value to put it into a "soft disabled"
      state, and ignore any spurious preemption timer exits.
      
      Sadly, the timer is a 32-bit value and so theoretically it can fire
      before the heat death of the universe, i.e. spurious exits are possible.
      But because KVM does *not* save the timer value on VM-Exit and because
      the timer runs at a slower rate than the TSC, the maximum timer value
      is still sufficiently large for KVM's purposes.  E.g. on a modern CPU
      with a timer that runs at 1/32 the frequency of a 2.4GHz constant-rate
      TSC, the timer will fire after ~55 seconds of *uninterrupted* guest
      execution.
      execution.  In other words, spurious VM-Exits are effectively only
      possible if the host is completely tickless on the logical CPU, the
      guest is not using the preemption timer, and the guest is not generating
      VM-Exits for any other reason.
      
      To be safe from bad/weird hardware, disable the preemption timer if its
      maximum delay is less than ten seconds.  Ten seconds is mostly arbitrary
      and was selected in no small part because it's a nice round number.
      For simplicity and paranoia, fall back to __kvm_request_immediate_exit()
      if the preemption timer is disabled by KVM or userspace.  Previously
      KVM continued to use the preemption timer to force immediate exits even
      when the timer was disabled by userspace.  Now that KVM leaves the timer
      running instead of truly disabling it, allow userspace to kill it
      entirely in the unlikely event the timer (or KVM) malfunctions.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      804939ea
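      The "soft disabled" state, sketched (field encoding from the SDM; the
      stubs and flag plumbing are illustrative):

          #include <stdbool.h>
          #include <stdint.h>

          #define VMX_PREEMPTION_TIMER_VALUE 0x482E   /* VMCS field encoding */
          static void vmcs_write32(uint32_t field, uint32_t val) { (void)field; (void)val; }

          /* Leave the pin-control bit set; just park the timer at its maximum. */
          static void soft_disable_preemption_timer(bool *hv_timer_soft_disabled)
          {
              vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0xFFFFFFFFu);
              *hv_timer_soft_disabled = true;
          }

          /* Exit-handler side: a fire while soft-disabled is spurious. */
          static int handle_preemption_timer(bool hv_timer_soft_disabled)
          {
              if (hv_timer_soft_disabled)
                  return 1;   /* 1 = resume the guest */
              /* ...otherwise handle the real (or immediate-exit) timer... */
              return 1;
          }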
    • KVM: VMX: Drop hv_timer_armed from 'struct loaded_vmcs' · 9d99cc49
      Sean Christopherson committed
      ... now that it is fully redundant with the pin controls shadow.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9d99cc49
    • KVM: nVMX: Preset *DT exiting in vmcs02 when emulating UMIP · 469debdb
      Sean Christopherson committed
      KVM dynamically toggles SECONDARY_EXEC_DESC to intercept (a subset of)
      instructions that are subject to User-Mode Instruction Prevention, i.e.
      VMCS.SECONDARY_EXEC_DESC == CR4.UMIP when emulating UMIP.  Preset the
      VMCS control when preparing vmcs02 to avoid unnecessary VMWRITEs,
      e.g. KVM will clear VMCS.SECONDARY_EXEC_DESC in prepare_vmcs02_early()
      and then set it in vmx_set_cr4().
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      469debdb
    • KVM: nVMX: Preserve last USE_MSR_BITMAPS when preparing vmcs02 · de0286b7
      Sean Christopherson committed
      KVM dynamically toggles the CPU_BASED_USE_MSR_BITMAPS execution control
      for nested guests based on whether or not both L0 and L1 want to pass
      through the same MSRs to L2.  Preserve the last used value from vmcs02
      so as to avoid multiple VMWRITEs to (re)set/(re)clear the bit on nested
      VM-Entry.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      de0286b7
    • KVM: VMX: Explicitly initialize controls shadow at VMCS allocation · 3af80fec
      Sean Christopherson committed
      Or: Don't re-initialize vmcs02's controls on every nested VM-Entry.
      
      VMWRITEs to the major VMCS controls are deceptively expensive.  Intel
      CPUs with VMCS caching (Westmere and later) also optimize away
      consistency checks on VM-Entry, i.e. skip consistency checks if the
      relevant fields have not changed since the last successful VM-Entry (of
      the cached VMCS).  Because uops are a precious commodity, uCode's dirty
      VMCS field tracking isn't as precise as software would prefer.  Notably,
      writing any of the major VMCS fields effectively marks the entire VMCS
      dirty, i.e. causes the next VM-Entry to perform all consistency checks,
      which consumes several hundred cycles.
      
      Zero out the controls' shadow copies during VMCS allocation and use the
      optimized setter when "initializing" controls.  While this technically
      affects both non-nested and nested virtualization, nested virtualization
      is the primary beneficiary, as avoiding VMWRITEs when preparing vmcs02
      allows hardware to optimize away consistency checks.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3af80fec
    • KVM: nVMX: Don't reset VMCS controls shadow on VMCS switch · ae81d089
      Sean Christopherson committed
      ... now that the shadow copies are per-VMCS.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ae81d089
    • KVM: VMX: Shadow VMCS secondary execution controls · fe7f895d
      Sean Christopherson committed
      Prepare to shadow all major control fields on a per-VMCS basis, which
      allows KVM to avoid costly VMWRITEs when switching between vmcs01 and
      vmcs02.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fe7f895d
    • KVM: VMX: Shadow VMCS primary execution controls · 2183f564
      Sean Christopherson committed
      Prepare to shadow all major control fields on a per-VMCS basis, which
      allows KVM to avoid VMREADs when switching between vmcs01 and vmcs02,
      and more importantly can eliminate costly VMWRITEs to controls when
      preparing vmcs02.
      
      Shadowing exec controls also saves a VMREAD when opening virtual
      INTR/NMI windows, yay...
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2183f564
    • KVM: VMX: Shadow VMCS pin controls · c5f2c766
      Sean Christopherson committed
      Prepare to shadow all major control fields on a per-VMCS basis, which
      allows KVM to avoid costly VMWRITEs when switching between vmcs01 and
      vmcs02.
      
      Shadowing pin controls also allows a future patch to remove the per-VMCS
      'hv_timer_armed' flag, as the shadow copy is a superset of said flag.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c5f2c766
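      The shared pattern behind the three "Shadow VMCS ... controls" patches,
      sketched here for pin controls (the in-tree code generates these
      accessors with a BUILD_CONTROLS_SHADOW() macro; the stubs are
      illustrative):

          #include <stdint.h>

          #define PIN_BASED_VM_EXEC_CONTROL 0x4000   /* VMCS field encoding */
          static void vmcs_write32(uint32_t field, uint32_t val) { (void)field; (void)val; }

          struct loaded_vmcs_sketch {
              struct { uint32_t pin; } controls_shadow;   /* per-VMCS copy */
          };

          /* VMWRITE only when the value actually changes... */
          static inline void pin_controls_set(struct loaded_vmcs_sketch *v, uint32_t val)
          {
              if (v->controls_shadow.pin != val) {
                  vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, val);
                  v->controls_shadow.pin = val;
              }
          }

          /* ...and read from the shadow, saving a VMREAD. */
          static inline uint32_t pin_controls_get(struct loaded_vmcs_sketch *v)
          {
              return v->controls_shadow.pin;
          }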