1. 20 April 2021, 1 commit
    • KVM: VMX: Add basic handling of VM-Exit from SGX enclave · 3c0c2ad1
      Authored by Sean Christopherson
      Add support for handling VM-Exits that originate from a guest SGX
      enclave.  In SGX, an "enclave" is a new CPL3-only execution environment,
      wherein the CPU and memory state is protected by hardware to make the
      state inaccessible to code running outside of the enclave.  When exiting
      an enclave due to an asynchronous event (from the perspective of the
      enclave), e.g. exceptions, interrupts, and VM-Exits, the enclave's state
      is automatically saved and scrubbed (the CPU loads synthetic state), and
      then reloaded when re-entering the enclave.  E.g. after an instruction
      based VM-Exit from an enclave, vmcs.GUEST_RIP will not contain the RIP
      of the enclave instruction that trigered VM-Exit, but will instead point
      to a RIP in the enclave's untrusted runtime (the guest userspace code
      that coordinates entry/exit to/from the enclave).
      
      To help a VMM recognize and handle exits from enclaves, SGX adds bits to
      existing VMCS fields, VM_EXIT_REASON.VMX_EXIT_REASON_FROM_ENCLAVE and
      GUEST_INTERRUPTIBILITY_INFO.GUEST_INTR_STATE_ENCLAVE_INTR.  Define the
      new architectural bits, and add a boolean to struct vcpu_vmx to cache
      VMX_EXIT_REASON_FROM_ENCLAVE.  Clear the bit in exit_reason so that
      checks against exit_reason do not need to account for SGX, e.g.
      "if (exit_reason == EXIT_REASON_EXCEPTION_NMI)" continues to work.
      
      KVM is largely a passive observer of the new bits, e.g. KVM needs to
      account for the bits when propagating information to a nested VMM, but
      otherwise doesn't need to act differently for the majority of VM-Exits
      from enclaves.
      
      The one scenario that is directly impacted is emulation, which is for
      all intents and purposes impossible[1] since KVM does not have access to
      the RIP or instruction stream that triggered the VM-Exit.  The inability
      to emulate is a non-issue for KVM, as most instructions that might
      trigger VM-Exit unconditionally #UD in an enclave (before the VM-Exit
      check).  For the few instructions that conditionally #UD, KVM either never
      sets the exiting control, e.g. PAUSE_EXITING[2], or sets it if and only
      if the feature is not exposed to the guest in order to inject a #UD,
      e.g. RDRAND_EXITING.
      
      But, because it is still possible for a guest to trigger emulation,
      e.g. MMIO, inject a #UD if KVM ever attempts emulation after a VM-Exit
      from an enclave.  This is architecturally accurate for instruction
      VM-Exits, and for MMIO it's the least bad choice, e.g. it's preferable
      to killing the VM.  In practice, only broken or particularly stupid
      guests should ever encounter this behavior.
      
      Add a WARN in skip_emulated_instruction to detect any attempt to
      modify the guest's RIP during an SGX enclave VM-Exit as all such flows
      should either be unreachable or must handle exits from enclaves before
      getting to skip_emulated_instruction.
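      
      A minimal sketch of that guard, under the same simplified types as the
      sketch above (the kernel uses WARN_ON_ONCE() and the caller injects the
      #UD; everything here is illustrative):
      
        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>
        
        struct vcpu_vmx { uint64_t rip; bool sgx_enclave_exit; };
        
        /* Returns false on failure; the caller is expected to inject a #UD. */
        static bool skip_emulated_instruction(struct vcpu_vmx *vmx, uint32_t len)
        {
            if (vmx->sgx_enclave_exit) {
                /* Stand-in for the kernel's WARN_ON_ONCE(). */
                fprintf(stderr, "WARN: RIP update on enclave VM-Exit\n");
                return false;
            }
            vmx->rip += len;   /* advance RIP past the emulated instruction */
            return true;
        }
        
        int main(void)
        {
            struct vcpu_vmx vmx = { .rip = 0x1000, .sgx_enclave_exit = true };
            /* Expect failure: the caller would inject a #UD instead. */
            return skip_emulated_instruction(&vmx, 3) ? 1 : 0;
        }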
      
      [1] Impossible for all practical purposes.  Not truly impossible
          since KVM could implement some form of para-virtualization scheme.
      
      [2] PAUSE_LOOP_EXITING only affects CPL0 and enclaves exist only at
          CPL3, so we also don't need to worry about that interaction.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <315f54a8507d09c292463ef29104e1d4c62e9090.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 17 April 2021, 1 commit
  3. 15 March 2021, 4 commits
    • KVM: x86: Handle triple fault in L2 without killing L1 · cb6a32c2
      Authored by Sean Christopherson
      Synthesize a nested VM-Exit if L2 triggers an emulated triple fault
      instead of exiting to userspace, which likely will kill L1.  Any flow
      that does KVM_REQ_TRIPLE_FAULT is suspect, but the most common scenario
      for L2 killing L1 is if L0 (KVM) intercepts a contributory exception that
      is _not_ intercepted by L1.  E.g. if KVM is intercepting #GPs for the
      VMware backdoor, a #GP that occurs in L2 while vectoring an injected #DF
      will cause KVM to emulate triple fault.
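      
      A sketch of the resulting flow; the helper names are hypothetical
      stand-ins for KVM's internals (the actual patch routes this through a
      nested-ops callback):
      
        #include <stdbool.h>
        #include <stdio.h>
        
        /* Hypothetical stand-ins for KVM internals. */
        static bool is_guest_mode(void)           { return true; }
        static bool l1_wants_shutdown_exits(void) { return true; }
        
        static void handle_triple_fault(void)
        {
            if (is_guest_mode() && l1_wants_shutdown_exits()) {
                /* Reflect a shutdown/triple-fault VM-Exit to L1;
                 * L1 keeps running and can clean up L2. */
                printf("synthesizing nested VM-Exit to L1\n");
            } else {
                /* Old behavior: exit to userspace (KVM_EXIT_SHUTDOWN),
                 * which typically kills the whole L1 VM. */
                printf("KVM_EXIT_SHUTDOWN to userspace\n");
            }
        }
        
        int main(void) { handle_triple_fault(); return 0; }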
      
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210302174515.2812275-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move nVMX's consistency check macro to common code · 648fc8ae
      Authored by Sean Christopherson
      Move KVM's CC() macro to x86.h so that it can be reused by nSVM.
      Debugging VM-Enter is as painful on SVM as it is on VMX.
      
      Rename the more visible macro to KVM_NESTED_VMENTER_CONSISTENCY_CHECK
      to avoid any collisions with the uber-concise "CC".
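      
      A userspace rendition of the macro; the kernel version emits a
      tracepoint instead of printing, and both rely on GCC's
      statement-expression extension:
      
        #include <stdbool.h>
        #include <stdio.h>
        
        /* On consistency-check failure, log the failing expression by
         * stringifying it; evaluates to the check's result either way. */
        #define KVM_NESTED_VMENTER_CONSISTENCY_CHECK(check)                     \
        ({                                                                      \
            bool failed = (check);                                              \
            if (failed)                                                         \
                fprintf(stderr, "VM-Enter check failed: %s\n", #check);         \
            failed;                                                             \
        })
        
        /* File-local alias keeping call sites terse, as the commit describes. */
        #define CC KVM_NESTED_VMENTER_CONSISTENCY_CHECK
        
        int main(void)
        {
            int cr4 = 0;
            if (CC(!(cr4 & (1 << 13))))   /* e.g. CR4.VMXE must be set */
                return 1;                 /* fail the nested VM-Enter */
            return 0;
        }
      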
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210204000117.3303214-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Defer the MMU reload to the normal path on an EPTP switch · c805f5d5
      Authored by Sean Christopherson
      Defer reloading the MMU after a successful EPTP switch.  The VMFUNC
      instruction itself is executed in the previous EPTP context; any side
      effects, e.g. updating RIP, should occur in the old context.  Practically
      speaking, this bug is benign as VMX doesn't touch the MMU when skipping
      an emulated instruction, nor does queuing a single-step #DB.  No other
      post-switch side effects exist.
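      
      A generic sketch of the deferral pattern; the request flag and function
      names are simplified stand-ins for KVM's vCPU-request machinery:
      
        #include <stdint.h>
        #include <stdio.h>
        
        #define REQ_MMU_RELOAD (1u << 0)
        
        static uint32_t requests;
        static uint64_t rip;
        
        static void emulated_eptp_switch(void)
        {
            /* Load the new EPTP, but only *request* the MMU reload ... */
            requests |= REQ_MMU_RELOAD;
            /* ... so side effects still run in the old MMU context. */
            rip += 3;                    /* skip the emulated VMFUNC */
        }
        
        static void vcpu_run_loop(void)
        {
            emulated_eptp_switch();
            /* Requests are serviced on the normal path before re-entry. */
            if (requests & REQ_MMU_RELOAD) {
                requests &= ~REQ_MMU_RELOAD;
                printf("reloading MMU at rip=%#llx\n",
                       (unsigned long long)rip);
            }
        }
        
        int main(void) { vcpu_run_loop(); return 0; }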
      
      Fixes: 41ab9372 ("KVM: nVMX: Emulate EPTP switching for the L1 hypervisor")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210305011101.3597423-14-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: to track if L1 is running L2 VM · 43c11d91
      Authored by Dongli Zhang
      The new per-cpu stat 'nested_run' is introduced to track whether an L1
      VM is running, or has been used to run, an L2 VM.
      
      For example, 'nested_run' helps the host administrator easily determine
      whether any L1 VM has been used to run an L2 VM.  When an issue may be
      related to nested virtualization, the administrator can use 'nested_run'
      to quickly confirm whether nested virtualization is involved, e.g.
      whether a fix like commit 88dddc11 ("KVM: nVMX: do not use dangling
      shadow VMCS after guest reset") is required.
      
      Cc: Joe Jin <joe.jin@oracle.com>
      Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
      Message-Id: <20210305225747.7682-1-dongli.zhang@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 19 February 2021, 2 commits
    • KVM: VMX: Dynamically enable/disable PML based on memslot dirty logging · a85863c2
      Authored by Makarand Sonare
      Currently, if enable_pml=1, PML remains enabled for the entire lifetime
      of the VM, irrespective of whether dirty logging is enabled or disabled.
      When dirty logging is disabled, all the pages of the VM are manually
      marked dirty, so that PML is effectively non-operational.  Setting
      the dirty bits is an expensive operation which can cause severe MMU
      lock contention in a performance sensitive path when dirty logging is
      disabled after a failed or canceled live migration.
      
      Manually setting dirty bits also fails to prevent PML activity if some
      code path clears dirty bits, which can incur unnecessary VM-Exits.
      
      In order to avoid this extra overhead, dynamically enable/disable PML
      when dirty logging gets turned on/off for the first/last memslot.
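      
      Conceptually (a simplified sketch: the "enable PML" bit position
      follows the SDM, the counter is assumed to be maintained by the memslot
      code, and the actual VMCS write is elided):
      
        #include <stdint.h>
        
        #define SECONDARY_EXEC_ENABLE_PML (1u << 17)   /* per the SDM */
        
        struct kvm_sketch {
            int nr_memslots_dirty_logging;   /* maintained by memslot code */
            uint32_t secondary_exec_ctrl;
        };
        
        /* Called when dirty logging is toggled on a memslot: enable PML when
         * the first slot starts logging, disable it when the last one stops. */
        static void update_cpu_dirty_logging(struct kvm_sketch *kvm)
        {
            if (kvm->nr_memslots_dirty_logging)
                kvm->secondary_exec_ctrl |= SECONDARY_EXEC_ENABLE_PML;
            else
                kvm->secondary_exec_ctrl &= ~SECONDARY_EXEC_ENABLE_PML;
            /* The VMCS field is refreshed on the next VM-Entry. */
        }
        
        int main(void)
        {
            struct kvm_sketch kvm = { .nr_memslots_dirty_logging = 1 };
            update_cpu_dirty_logging(&kvm);
            return kvm.secondary_exec_ctrl & SECONDARY_EXEC_ENABLE_PML ? 0 : 1;
        }
      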
      Signed-off-by: Makarand Sonare <makarandsonare@google.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210213005015.1651772-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Disable PML in hardware when running L2 · c3bb9a20
      Authored by Sean Christopherson
      Unconditionally disable PML in vmcs02; KVM emulates PML purely in the
      MMU, e.g. vmx_flush_pml_buffer() doesn't even try to copy the L2 GPAs
      from vmcs02's buffer to vmcs12.  At best, enabling PML is a nop.  At
      worst, it will cause vmx_flush_pml_buffer() to record bogus GFNs in the
      dirty logs.
      
      Initialize vmcs02.GUEST_PML_INDEX such that PML writes would trigger
      VM-Exit if PML was somehow enabled, skip flushing the buffer for guest
      mode since the index is bogus, and freak out if a PML full exit occurs
      when L2 is active.
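      
      A sketch of the three actions (illustrative; PML_ENTITY_NUM is 512 per
      the SDM, and an out-of-range index makes any PML write trigger a
      log-full VM-Exit):
      
        #include <stdint.h>
        #include <stdio.h>
        
        #define PML_ENTITY_NUM 512u   /* the PML buffer holds 512 GPA entries */
        
        struct vmcs02_sketch {
            uint32_t secondary_exec_ctrl;
            uint16_t guest_pml_index;
        };
        
        static void prepare_vmcs02_pml(struct vmcs02_sketch *vmcs02)
        {
            /* 1. PML is emulated in the MMU, never enabled in vmcs02. */
            vmcs02->secondary_exec_ctrl &= ~(1u << 17 /* enable PML */);
            /* 2. Bogus, out-of-range index: any PML write would cause an
             *    immediate log-full VM-Exit if PML were somehow enabled. */
            vmcs02->guest_pml_index = (uint16_t)(PML_ENTITY_NUM * 2);
        }
        
        /* 3. The flush path skips guest mode; a PML-full exit with L2
         *    active is treated as a KVM bug (WARN material). */
        static void flush_pml_buffer(int is_guest_mode)
        {
            if (is_guest_mode)
                return;   /* index is bogus by construction, skip */
            /* ... copy logged GPAs into the dirty bitmap ... */
        }
        
        int main(void)
        {
            struct vmcs02_sketch v = { 0 };
            prepare_vmcs02_pml(&v);
            flush_pml_buffer(1);
            printf("vmcs02 PML index: %u\n", v.guest_pml_index);
            return 0;
        }
      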
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210213005015.1651772-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 18 February 2021, 1 commit
  6. 04 February 2021, 5 commits
    • KVM: VMX: Use GPA legality helpers to replace open coded equivalents · 636e8b73
      Authored by Sean Christopherson
      Replace a variety of open coded GPA checks with the recently introduced
      common helpers.
      
      No functional change intended.
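      
      The helpers boil down to a MAXPHYADDR check plus optional alignment; a
      self-contained sketch with simplified names and an assumed MAXPHYADDR:
      
        #include <assert.h>
        #include <stdbool.h>
        #include <stdint.h>
        
        /* A GPA is legal if no bits are set above the vCPU's MAXPHYADDR. */
        static bool kvm_is_legal_gpa(uint64_t gpa, int maxphyaddr)
        {
            return !(gpa >> maxphyaddr);
        }
        
        /* Many VMCS pointers must also be naturally aligned, e.g. to 4 KiB. */
        static bool kvm_is_legal_aligned_gpa(uint64_t gpa, int maxphyaddr,
                                             uint64_t alignment)
        {
            return kvm_is_legal_gpa(gpa, maxphyaddr) &&
                   !(gpa & (alignment - 1));
        }
        
        int main(void)
        {
            assert(kvm_is_legal_aligned_gpa(0x1000, 46, 4096));
            assert(!kvm_is_legal_gpa(1ull << 52, 46));   /* above MAXPHYADDR */
            return 0;
        }
      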
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210204000117.3303214-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM/nVMX: Use __vmx_vcpu_run in nested_vmx_check_vmentry_hw · 150f17bf
      Authored by Uros Bizjak
      Replace inline assembly in nested_vmx_check_vmentry_hw
      with a call to __vmx_vcpu_run.  The function is not
      performance critical, so (double) GPR save/restore
      in __vmx_vcpu_run can be tolerated, as far as performance
      effects are concerned.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Reviewed-and-tested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      [sean: dropped versioning info from changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20201231002702.22237077-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: prepend vmx/svm prefix to additional kvm_x86_ops functions · b6a7cc35
      Authored by Jason Baron
      A subsequent patch introduces macros in preparation for simplifying the
      definition for vmx_x86_ops and svm_x86_ops. Making the naming more uniform
      expands the coverage of the macros. Add vmx/svm prefix to the following
      functions: update_exception_bitmap(), enable_nmi_window(),
      enable_irq_window(), update_cr8_intercept(), and enable_smi_window().
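      
      The motivation is easiest to see once the ops table is built by a
      token-pasting macro; a sketch of the idea (the macro and struct here
      are hypothetical):
      
        #include <stdio.h>
        
        struct kvm_x86_ops_sketch {
            void (*enable_nmi_window)(void);
            void (*enable_irq_window)(void);
        };
        
        static void vmx_enable_nmi_window(void) { puts("nmi window"); }
        static void vmx_enable_irq_window(void) { puts("irq window"); }
        
        /* Uniform vmx_ prefixes let one macro fill every slot by pasting. */
        #define VMX_OP(name) .name = vmx_##name
        
        static struct kvm_x86_ops_sketch vmx_ops = {
            VMX_OP(enable_nmi_window),
            VMX_OP(enable_irq_window),
        };
        
        int main(void) { vmx_ops.enable_nmi_window(); return 0; }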
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Message-Id: <ed594696f8e2c2b2bfc747504cee9bbb2a269300.1610680941.git.jbaron@akamai.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Rename DR6_INIT to DR6_ACTIVE_LOW · 9a3ecd5e
      Authored by Chenyi Qiang
      DR6_INIT contains the 1-reserved bits as well as the bit that is cleared
      to 0 when the condition (e.g. RTM) occurs.  The value can be used to
      initialize dr6 and also serves as the XOR mask between the #DB exit
      qualification (or payload) and DR6.
      
      Since DR6_INIT is used as an initial value only once, rename it to
      DR6_ACTIVE_LOW and apply it in other places as well, which makes the
      incoming changes for the bus lock debug exception simpler.
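      
      Spelled out as a standalone check (the constants match the
      architectural DR6 layout; DR6_VOLATILE covers B0-B3, BD, BS, BT and
      RTM):
      
        #include <assert.h>
        #include <stdint.h>
        
        #define DR6_BS          (1ull << 14)   /* single-step */
        #define DR6_RTM         (1ull << 16)   /* active-low: 1 = "no RTM event" */
        #define DR6_ACTIVE_LOW  0xffff0ff0ull  /* DR6 when nothing is active */
        #define DR6_VOLATILE    0x0001e00full  /* bits a #DB can change */
        #define DR6_FIXED_1     (DR6_ACTIVE_LOW & ~DR6_VOLATILE)
        
        int main(void)
        {
            /* The #DB exit qualification (or payload) is "active high";
             * XOR with DR6_ACTIVE_LOW sets the 1-reserved bits and flips
             * the active-low RTM bit into its architectural polarity. */
            uint64_t exit_qual = DR6_BS;            /* single-step #DB */
            uint64_t dr6 = exit_qual ^ DR6_ACTIVE_LOW;
        
            assert(dr6 & DR6_BS);    /* the reported condition is set */
            assert(dr6 & DR6_RTM);   /* no RTM event, so the bit reads 1 */
            assert(DR6_FIXED_1 == 0xfffe0ff0ull);   /* Paolo's note below */
            return 0;
        }
      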
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Message-Id: <20210202090433.13441-2-chenyi.qiang@intel.com>
      [Define DR6_FIXED_1 from DR6_ACTIVE_LOW and DR6_VOLATILE. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Convert vcpu_vmx.exit_reason to a union · 8e533240
      Authored by Sean Christopherson
      Convert vcpu_vmx.exit_reason from a u32 to a union (of size u32).  The
      full VM_EXIT_REASON field is comprised of a 16-bit basic exit reason in
      bits 15:0, and single-bit modifiers in bits 31:16.
      
      Historically, KVM has only had to worry about handling the "failed
      VM-Entry" modifier, which could only be set in very specific flows and
      required dedicated handling.  I.e. manually stripping the FAILED_VMENTRY
      bit was a somewhat viable approach.  But even with only a single bit to
      worry about, KVM has had several bugs related to comparing a basic exit
      reason against the full exit reason stored in vcpu_vmx.
      
      Upcoming Intel features, e.g. SGX, will add new modifier bits that can
      be set on more or less any VM-Exit, as opposed to the significantly more
      restricted FAILED_VMENTRY, i.e. correctly handling everything in one-off
      flows isn't scalable.  Tracking exit reason in a union forces code to
      explicitly choose between consuming the full exit reason and the basic
      exit reason, and is a convenient way to document and access the modifiers.
      
      No functional change intended.
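      
      An abridged reconstruction of the union (field set trimmed; the
      bitfield layout assumes little-endian GCC semantics, as the kernel
      does):
      
        #include <assert.h>
        #include <stdint.h>
        
        union vmx_exit_reason {
            struct {
                uint32_t basic          : 16;  /* bits 15:0  */
                uint32_t reserved       : 11;  /* bits 26:16 */
                uint32_t enclave_mode   : 1;   /* bit  27    */
                uint32_t reserved2      : 3;   /* bits 30:28 */
                uint32_t failed_vmentry : 1;   /* bit  31    */
            };
            uint32_t full;
        };
        
        int main(void)
        {
            union vmx_exit_reason er = { .full = (1u << 31) | 2 };
        
            /* Callers must now pick one representation explicitly: */
            assert(er.basic == 2);       /* basic reason, modifiers stripped */
            assert(er.failed_vmentry);   /* modifier, via a named bitfield */
            assert(sizeof(er) == sizeof(uint32_t));
            return 0;
        }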
      
      Cc: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Message-Id: <20201106090315.18606-2-chenyi.qiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 26 January 2021, 2 commits
    • KVM: x86: allow KVM_REQ_GET_NESTED_STATE_PAGES outside guest mode for VMX · 9a78e158
      Authored by Paolo Bonzini
      VMX also uses KVM_REQ_GET_NESTED_STATE_PAGES for the Hyper-V eVMCS,
      which may need to be loaded outside guest mode.  Therefore we cannot
      WARN in that case.
      
      However, that part of nested_get_vmcs12_pages is _not_ needed at
      vmentry time.  Split it out of KVM_REQ_GET_NESTED_STATE_PAGES handling,
      so that both vmentry and migration (and in the latter case, independent
      of is_guest_mode) do the parts that are needed.
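      
      In outline, with illustrative helper names and a plain flag standing in
      for is_guest_mode(vcpu), rather than the exact diff:
      
        #include <stdbool.h>
        
        static bool in_guest_mode;   /* stand-in for is_guest_mode(vcpu) */
        
        /* Maps the Hyper-V enlightened VMCS; may be needed even when the
         * vCPU is *not* in guest mode, e.g. right after migration. */
        static bool nested_get_evmcs_page(void) { return true; }
        
        /* Maps vmcs12 pages (APIC pages, MSR bitmaps, ...); only meaningful
         * at nested VM-Entry, i.e. when in guest mode. */
        static bool nested_get_vmcs12_pages(void) { return true; }
        
        static bool vmx_get_nested_state_pages(void)
        {
            if (!nested_get_evmcs_page())
                return false;
            if (in_guest_mode && !nested_get_vmcs12_pages())
                return false;
            return true;
        }
        
        int main(void) { return vmx_get_nested_state_pages() ? 0 : 1; }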
      
      Cc: <stable@vger.kernel.org> # 5.10.x: f2c7ef3b: KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
      Cc: <stable@vger.kernel.org> # 5.10.x
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Sync unsync'd vmcs02 state to vmcs12 on migration · d51e1d3f
      Authored by Maxim Levitsky
      Even when we are outside the nested guest, some vmcs02 fields
      may not be in sync with vmcs12.  This is intentional, even across
      nested VM-exit, because the sync can be delayed until the nested
      hypervisor performs a VMCLEAR or a VMREAD/VMWRITE that affects those
      rarely accessed fields.
      
      However, during KVM_GET_NESTED_STATE, the vmcs12 has to be up to date to
      be able to restore it.  To fix that, call copy_vmcs02_to_vmcs12_rare()
      before the vmcs12 contents are copied to userspace.
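      
      The shape of the fix, heavily simplified (placeholder types; the real
      function also handles the in-guest-mode and eVMCS cases):
      
        #include <stdbool.h>
        #include <string.h>
        
        struct vmcs12 { char rare_fields[64]; };    /* placeholder layout */
        static struct vmcs12 vmcs12, vmcs02_shadow;
        
        /* Expensive, so normally deferred; must run before export. */
        static void copy_vmcs02_to_vmcs12_rare(void)
        {
            memcpy(vmcs12.rare_fields, vmcs02_shadow.rare_fields,
                   sizeof(vmcs12.rare_fields));
        }
        
        static void vmx_get_nested_state(bool in_guest_mode, char *user_buf)
        {
            if (!in_guest_mode) {
                /* The fix: flush rarely-synced vmcs02 state first so
                 * userspace sees an up-to-date vmcs12. */
                copy_vmcs02_to_vmcs12_rare();
            }
            memcpy(user_buf, &vmcs12, sizeof(vmcs12));
        }
        
        int main(void)
        {
            char buf[sizeof(struct vmcs12)];
            vmx_get_nested_state(false, buf);
            return 0;
        }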
      
      Fixes: 7952d769 ("KVM: nVMX: Sync rarely accessed guest fields only when needed")
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210114205449.8715-2-mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 08 January 2021, 1 commit
  9. 15 November 2020, 2 commits
    • KVM: x86: emulate wait-for-SIPI and SIPI-VMExit · bf0cd88c
      Authored by Yadong Qi
      Background: We have a lightweight HV that needs INIT-VMExit and
      SIPI-VMExit to wake up APs for guests, since it does not monitor
      the Local APIC.  But the virtual wait-for-SIPI (WFS) state is
      currently not supported in nVMX, so when running on top of KVM, the
      L1 HV cannot receive the INIT-VMExit and SIPI-VMExit, which means
      the L2 guest cannot wake up its APs.
      
      According to Intel SDM Chapter 25.2 Other Causes of VM Exits,
      SIPIs cause VM exits when a logical processor is in
      wait-for-SIPI state.
      
      In this patch:
          1. introduce SIPI exit reason,
          2. introduce wait-for-SIPI state for nVMX,
          3. advertise wait-for-SIPI support to guest.
      
      When the L1 hypervisor is not monitoring the Local APIC, L0 needs to
      emulate INIT-VMExit and SIPI-VMExit to L1 in order to emulate
      INIT-SIPI-SIPI for L2.  An L2 LAPIC write is trapped by the L0
      hypervisor (KVM), and L0 should emulate the INIT/SIPI VM-Exit to the
      L1 hypervisor to set the proper state for L2's vCPUs.
      
      Handling procedure:
      Source vCPU:
          L2 write LAPIC.ICR(INIT).
          L0 trap LAPIC.ICR write(INIT): inject a latched INIT event to target
             vCPU.
      Target vCPU:
      L0 emulate an INIT VMExit to L1 if in guest mode.
          L1 set guest VMCS, guest_activity_state=WAIT_SIPI, vmresume.
          L0 set vcpu.mp_state to INIT_RECEIVED if (vmcs12.guest_activity_state
             == WAIT_SIPI).
      
      Source vCPU:
          L2 write LAPIC.ICR(SIPI).
      L0 trap LAPIC.ICR write(SIPI): inject a latched SIPI event to target
         vCPU.
      Target vCPU:
      L0 emulate a SIPI VMExit to L1 if (vcpu.mp_state == INIT_RECEIVED).
          L1 set CS:IP, guest_activity_state=ACTIVE, vmresume.
          L0 resume to L2.
          L2 start-up.
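      
      A compact state-machine sketch of the target-vCPU side (names
      simplified; not the actual KVM code):
      
        #include <stdbool.h>
        #include <stdio.h>
        
        enum mp_state { MP_RUNNABLE, MP_INIT_RECEIVED };
        enum activity { ACT_ACTIVE, ACT_WAIT_SIPI };
        
        struct vcpu {
            bool guest_mode;              /* running L2? */
            enum mp_state mp_state;
            enum activity vmcs12_activity;
        };
        
        static void deliver_init(struct vcpu *v)
        {
            if (v->guest_mode)
                printf("emulate INIT VM-Exit to L1\n");
            /* L1 parks the AP in wait-for-SIPI and resumes it. */
            if (v->vmcs12_activity == ACT_WAIT_SIPI)
                v->mp_state = MP_INIT_RECEIVED;
        }
        
        static void deliver_sipi(struct vcpu *v, unsigned char vector)
        {
            if (v->guest_mode && v->mp_state == MP_INIT_RECEIVED) {
                printf("emulate SIPI VM-Exit to L1, vector %#x\n", vector);
                /* L1 sets CS:IP and ACTIVE state, then resumes L2. */
                v->mp_state = MP_RUNNABLE;
            }
        }
        
        int main(void)
        {
            struct vcpu ap = { true, MP_RUNNABLE, ACT_WAIT_SIPI };
            deliver_init(&ap);
            deliver_sipi(&ap, 0x9f);
            return 0;
        }
      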
      Signed-off-by: Yadong Qi <yadong.qi@intel.com>
      Message-Id: <20200922052343.84388-1-yadong.qi@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20201106065122.403183-1-yadong.qi@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move vendor CR4 validity check to dedicated kvm_x86_ops hook · c2fe3cd4
      Authored by Sean Christopherson
      Split out VMX's checks on CR4.VMXE to a dedicated hook, .is_valid_cr4(),
      and invoke the new hook from kvm_valid_cr4().  This fixes an issue where
      KVM_SET_SREGS would return success while failing to actually set CR4.
      
      Fixing the issue by explicitly checking kvm_x86_ops.set_cr4()'s return
      in __set_sregs() is not a viable option as KVM has already stuffed a
      variety of vCPU state.
      
      Note, kvm_valid_cr4() and is_valid_cr4() have different return types and
      inverted semantics.  This will be remedied in a future patch.
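      
      In outline (a sketch of the split; the vendor check shown is just one
      example of what vmx_is_valid_cr4() rejects):
      
        #include <stdbool.h>
        
        #define X86_CR4_VMXE (1ul << 13)
        
        struct vcpu { bool nested_vmx_on; unsigned long cr4; };
        
        /* Vendor hook: pure validity check, no side effects, returns bool. */
        static bool vmx_is_valid_cr4(struct vcpu *v, unsigned long cr4)
        {
            /* e.g. CR4.VMXE cannot be cleared while nested VMX is on. */
            if (!(cr4 & X86_CR4_VMXE) && v->nested_vmx_on)
                return false;
            return true;
        }
        
        /* Common code rejects the value *before* any vCPU state is stuffed,
         * so KVM_SET_SREGS can no longer "succeed" without setting CR4. */
        static int kvm_valid_cr4(struct vcpu *v, unsigned long cr4)
        {
            if (!vmx_is_valid_cr4(v, cr4))
                return -1;   /* -EINVAL in the kernel */
            return 0;
        }
        
        int main(void)
        {
            struct vcpu v = { .nested_vmx_on = true, .cr4 = X86_CR4_VMXE };
            int rejected = kvm_valid_cr4(&v, 0) != 0;
            return rejected ? 0 : 1;   /* success iff the bad CR4 is rejected */
        }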
      
      Fixes: 5e1746d6 ("KVM: nVMX: Allow setting the VMXE bit in CR4")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20201007014417.29276-5-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  10. 28 September 2020, 21 commits