1. 11 September 2019, 5 commits
  2. 10 September 2019, 1 commit
  3. 22 August 2019, 3 commits
    • KVM: Assert that struct kvm_vcpu is always at offset zero · 12b58f4e
      Committed by Sean Christopherson
      KVM implementations that wrap struct kvm_vcpu with a vendor specific
      struct, e.g. struct vcpu_vmx, must place the vcpu member at offset 0,
      otherwise the usercopy region intended to encompass struct kvm_vcpu_arch
      will instead overlap random chunks of the vendor specific struct.
      E.g. padding a large number of bytes before struct kvm_vcpu triggers
      a usercopy warn when running with CONFIG_HARDENED_USERCOPY=y.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      12b58f4e
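      A minimal sketch of the compile-time check described above; the exact
      placement in the vendor modules is an assumption, but BUILD_BUG_ON() and
      offsetof() are the usual kernel helpers for this kind of assertion:

        /* vcpu_vmx/vcpu_svm embed struct kvm_vcpu; assert at build time that
         * the embedded vcpu sits at offset 0 so the usercopy region derived
         * from kvm_vcpu_arch offsets cannot spill into vendor-private fields. */
        BUILD_BUG_ON(offsetof(struct vcpu_vmx, vcpu) != 0);
        BUILD_BUG_ON(offsetof(struct vcpu_svm, vcpu) != 0);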
    • KVM: x86/mmu: Add explicit access mask for MMIO SPTEs · 4af77151
      Committed by Sean Christopherson
      When shadow paging is enabled, KVM tracks the allowed access type for
      MMIO SPTEs so that it can do a permission check on a MMIO GVA cache hit
      without having to walk the guest's page tables.  The tracking is done
      by retaining the WRITE and USER bits of the access when inserting the
      MMIO SPTE (read access is implicitly allowed), which allows the MMIO
      page fault handler to retrieve and cache the WRITE/USER bits from the
      SPTE.
      
      Unfortunately for EPT, the mask used to retain the WRITE/USER bits is
      hardcoded using the x86 paging versions of the bits.  This funkiness
      happens to work because KVM uses a completely different mask/value for
      MMIO SPTEs when EPT is enabled, and the EPT mask/value just happens to
      overlap exactly with the x86 WRITE/USER bits[*].
      
      Explicitly define the access mask for MMIO SPTEs to accurately reflect
      that EPT does not want to incorporate any access bits into the SPTE, and
      so that KVM isn't subtly relying on EPT's WX bits always being set in
      MMIO SPTEs, e.g. attempting to use other bits for experimentation breaks
      horribly.
      
      Note, vcpu_match_mmio_gva() explicitly prevents matching GVA==0, and all
      TDP flows explicitly set mmio_gva to 0, i.e. zeroing vcpu->arch.access for
      EPT has no (known) functional impact.
      
      [*] Using WX to generate EPT misconfigurations (equivalent to reserved
          bit page fault) ensures KVM can employ its MMIO page fault tricks
          even on platforms without reserved address bits.
      
      Fixes: ce88decf ("KVM: MMU: mmio page fault support")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4af77151
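      A hedged sketch of the idea, assuming a shadow_mmio_access_mask variable
      and a simplified SPTE-construction helper (names and signature are
      illustrative, not taken verbatim from the patch):

        /* Keep only the explicitly allowed access bits when caching the access
         * type in an MMIO SPTE; EPT sets the mask to 0 so no x86 WRITE/USER
         * bits are ever encoded into its MMIO SPTEs. */
        static u64 make_mmio_spte(u64 gfn, unsigned int access)
        {
                access &= shadow_mmio_access_mask;
                return shadow_mmio_value | access | (gfn << PAGE_SHIFT);
        }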
    • x86: kvm: svm: propagate errors from skip_emulated_instruction() · f8ea7c60
      Committed by Vitaly Kuznetsov
      On AMD, kvm_x86_ops->skip_emulated_instruction(vcpu) can, in theory,
      fail: in the !nrips case we call kvm_emulate_instruction(EMULTYPE_SKIP).
      Currently we only do printk(KERN_DEBUG) when this happens, which is
      not ideal. Propagate the error up the stack.
      
      On VMX, skip_emulated_instruction() doesn't fail; the two call sites
      that invoke it explicitly, handle_exception_nmi() and
      handle_task_switch(), can simply ignore the result.
      
      On SVM, we also have two explicit call sites. In svm_queue_exception()
      it seems nothing needs to be done, as we already check whether RIP was
      advanced. In task_switch_interception(), however, we are better off not
      proceeding to kvm_task_switch() in case skip_emulated_instruction()
      failed.
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f8ea7c60
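      A sketch of the SVM side of the change, under the structure described
      above (the !nrips fallback and the return-value propagation are the
      point; the exact function body is simplified):

        static int skip_emulated_instruction(struct kvm_vcpu *vcpu)
        {
                struct vcpu_svm *svm = to_svm(vcpu);

                if (!nrips || !svm->vmcb->control.next_rip) {
                        /* No next_rip from hardware: let the emulator skip the
                         * instruction and report its result to the caller
                         * instead of just printk(KERN_DEBUG)-ing a failure. */
                        return kvm_emulate_instruction(vcpu, EMULTYPE_SKIP);
                }

                kvm_rip_write(vcpu, svm->vmcb->control.next_rip);
                svm_set_interrupt_shadow(vcpu, 0);
                return 1;
        }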
  4. 05 August 2019, 1 commit
    • KVM: Fix leak of vCPU's VMCS value into other pCPU · 17e433b5
      Committed by Wanpeng Li
      After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a
      five-year-old bug is exposed. When running the ebizzy benchmark in three
      80-vCPU VMs on one 80-pCPU Skylake server, a lot of rcu_sched stall
      warnings splatter in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() remains true until finish_swait() is called in
      kvm_vcpu_block(), so voluntarily preempted vCPUs are taken into account
      by the kvm_vcpu_on_spin() loop. This greatly increases the probability
      that kvm_arch_vcpu_runnable(vcpu) is checked and found true; when APICv
      is enabled, the yield-candidate vCPU's VMCS RVI field then leaks (via
      vmx_sync_pir_to_irr()) into the current VMCS of the vCPU that is
      spinning on a taken lock.
      
      This patch fixes it by conservatively checking only a subset of events.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      17e433b5
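      One plausible shape of the conservative check (the function and request
      names are assumptions based on the description, not quoted from the
      patch):

        /* For directed-yield candidates, look only at a cheap subset of events
         * that can be read without touching another pCPU's loaded VMCS. */
        bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu)
        {
                if (READ_ONCE(vcpu->arch.pv.pv_unhalted))
                        return true;

                if (kvm_test_request(KVM_REQ_NMI, vcpu) ||
                    kvm_test_request(KVM_REQ_SMI, vcpu) ||
                    kvm_test_request(KVM_REQ_EVENT, vcpu))
                        return true;

                return false;
        }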
  5. 22 July 2019, 1 commit
    • KVM: X86: Dynamically allocate user_fpu · d9a710e5
      Committed by Wanpeng Li
      After reverting commit 240c35a3 (kvm: x86: Use task structs fpu field
      for user), struct kvm_vcpu is 19456 bytes on my server, while
      PAGE_ALLOC_COSTLY_ORDER (3) is the order at which allocations are deemed
      costly to service. In a serverless scenario, one host can service
      hundreds or thousands of firecracker/kata-container instances; however,
      a new instance will fail to launch once host memory is too fragmented to
      allocate the kvm_vcpu struct. This was observed in some cloud providers'
      production environments.
      
      This patch dynamically allocates user_fpu; kvm_vcpu is now 15168 bytes on
      my Skylake server.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d9a710e5
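      A hedged sketch of the allocation at vCPU-creation time (the cache name
      and error label are assumptions):

        /* Allocate the userspace FPU state separately instead of embedding it
         * in struct kvm_vcpu, keeping the vcpu allocation below
         * PAGE_ALLOC_COSTLY_ORDER on fragmented hosts. */
        vcpu->arch.user_fpu = kmem_cache_zalloc(x86_fpu_cache, GFP_KERNEL_ACCOUNT);
        if (!vcpu->arch.user_fpu)
                goto free_vcpu;         /* illustrative error path */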
  6. 20 July 2019, 2 commits
    • KVM: VMX: dump VMCS on failed entry · 3b20e03a
      Committed by Paolo Bonzini
      This is useful for debugging, and is ratelimited nowadays.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3b20e03a
    • KVM: LAPIC: Inject timer interrupt via posted interrupt · 0c5f81da
      Committed by Wanpeng Li
      Dedicated instances are currently disturbed by unnecessary jitter due
      to the emulated lapic timers firing on the same pCPUs where the
      vCPUs reside.  Unlike ARM, Intel has no hardware virtual timer for the
      guest, so both programming the timer in the guest and the firing of the
      emulated timer incur vmexits.  This patch tries to avoid the vmexit when
      the emulated timer fires, at least in the dedicated-instance scenario
      when nohz_full is enabled.
      
      In that case, the emulated timers can be offloaded to the nearest busy
      housekeeping cpus, since APICv has been available in server processors
      for several years. The guest timer interrupt can then be injected via
      posted interrupts, which are delivered by the housekeeping cpu once the
      emulated timer fires.
      
      The host should be tuned so that vCPUs are placed on isolated physical
      processors, with several surplus pCPUs left for busy housekeeping.
      If mwait/hlt/pause vmexits are disabled so that the vCPUs stay in
      non-root mode, a ~3% redis performance benefit can be observed on a
      Skylake server, and the number of external interrupt vmexits drops
      substantially.  Without the patch:
      
                  VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time   Avg time
      EXTERNAL_INTERRUPT    42916    49.43%   39.30%   0.47us   106.09us   0.71us ( +-   1.09% )
      
      With the patch:
      
                  VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time         Avg time
      EXTERNAL_INTERRUPT    6871     9.29%     2.96%   0.44us    57.88us   0.72us ( +-   4.02% )
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0c5f81da
  7. 15 July 2019, 1 commit
  8. 02 July 2019, 1 commit
    • KVM: nVMX: list VMX MSRs in KVM_GET_MSR_INDEX_LIST · 95c5c7c7
      Committed by Paolo Bonzini
      This allows userspace to know which MSRs are supported by the hypervisor.
      Unfortunately userspace must resort to tricks for everything except
      MSR_IA32_VMX_VMFUNC (which was just added in the previous patch).
      One possibility is to use the feature control MSR, which is tied to nested
      VMX as well and is present on all KVM versions that support feature MSRs.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      95c5c7c7
  9. 20 June 2019, 1 commit
  10. 19 June 2019, 1 commit
  11. 18 June 2019, 23 commits
    • KVM: VMX: Leave preemption timer running when it's disabled · 804939ea
      Committed by Sean Christopherson
      VMWRITEs to the major VMCS controls, pin controls included, are
      deceptively expensive.  CPUs with VMCS caching (Westmere and later) also
      optimize away consistency checks on VM-Entry, i.e. skip consistency
      checks if the relevant fields have not changed since the last successful
      VM-Entry (of the cached VMCS).  Because uops are a precious commodity,
      uCode's dirty VMCS field tracking isn't as precise as software would
      prefer.  Notably, writing any of the major VMCS fields effectively marks
      the entire VMCS dirty, i.e. causes the next VM-Entry to perform all
      consistency checks, which consumes several hundred cycles.
      
      As it pertains to KVM, toggling PIN_BASED_VMX_PREEMPTION_TIMER more than
      doubles the latency of the next VM-Entry (and again when/if the flag is
      toggled back).  In a non-nested scenario, running a "standard" guest
      with the preemption timer enabled, toggling the timer flag is uncommon
      but not rare, e.g. roughly 1 in 10 entries.  Disabling the preemption
      timer can change these numbers due to its use for "immediate exits",
      even when explicitly disabled by userspace.
      
      Nested virtualization in particular is painful, as the timer flag is set
      for the majority of VM-Enters, but prepare_vmcs02() initializes vmcs02's
      pin controls to *clear* the flag since the timer's final state isn't
      known until vmx_vcpu_run().  I.e. the majority of nested VM-Enters end
      up unnecessarily writing pin controls *twice*.
      
      Rather than toggle the timer flag in pin controls, set the timer value
      itself to the largest allowed value to put it into a "soft disabled"
      state, and ignore any spurious preemption timer exits.
      
      Sadly, the timer is a 32-bit value and so theoretically it can fire
      before the heat death of the universe, i.e. spurious exits are possible.
      But because KVM does *not* save the timer value on VM-Exit and because
      the timer runs at a slower rate than the TSC, the maximum timer value
      is still sufficiently large for KVM's purposes.  E.g. on a modern CPU
      with a timer that runs at 1/32 the frequency of a 2.4GHz constant-rate
      TSC, the timer will fire after ~55 seconds of *uninterrupted* guest
      execution.  In other words, spurious VM-Exits are effectively only
      possible if the host is completely tickless on the logical CPU, the
      guest is not using the preemption timer, and the guest is not generating
      VM-Exits for any other reason.
      
      To be safe from bad/weird hardware, disable the preemption timer if its
      maximum delay is less than ten seconds.  Ten seconds is mostly arbitrary
      and was selected in no small part because it's a nice round number.
      For simplicity and paranoia, fall back to __kvm_request_immediate_exit()
      if the preemption timer is disabled by KVM or userspace.  Previously
      KVM continued to use the preemption timer to force immediate exits even
      when the timer was disabled by userspace.  Now that KVM leaves the timer
      running instead of truly disabling it, allow userspace to kill it
      entirely in the unlikely event the timer (or KVM) malfunctions.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      804939ea
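      A hedged sketch of the "soft disabled" state (the helper name is assumed;
      VMX_PREEMPTION_TIMER_VALUE is the VMCS field being parked):

        /* Leave PIN_BASED_VMX_PREEMPTION_TIMER set and park the timer at its
         * maximum value instead of toggling the pin control, which would mark
         * the VMCS dirty and slow the next VM-Entry. */
        vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0xffffffff);

        /* In the exit handler, a timer exit that nobody asked for is simply
         * ignored and the guest is resumed. */
        static int handle_preemption_timer(struct kvm_vcpu *vcpu)
        {
                if (!vmx_timer_is_armed(vcpu))          /* assumed helper */
                        return 1;                       /* spurious, ignore */
                kvm_lapic_expired_hv_timer(vcpu);
                return 1;
        }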
    • KVM: VMX: Drop hv_timer_armed from 'struct loaded_vmcs' · 9d99cc49
      Committed by Sean Christopherson
      ... now that it is fully redundant with the pin controls shadow.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9d99cc49
    • KVM: VMX: Explicitly initialize controls shadow at VMCS allocation · 3af80fec
      Committed by Sean Christopherson
      Or: Don't re-initialize vmcs02's controls on every nested VM-Entry.
      
      VMWRITEs to the major VMCS controls are deceptively expensive.  Intel
      CPUs with VMCS caching (Westmere and later) also optimize away
      consistency checks on VM-Entry, i.e. skip consistency checks if the
      relevant fields have not changed since the last successful VM-Entry (of
      the cached VMCS).  Because uops are a precious commodity, uCode's dirty
      VMCS field tracking isn't as precise as software would prefer.  Notably,
      writing any of the major VMCS fields effectively marks the entire VMCS
      dirty, i.e. causes the next VM-Entry to perform all consistency checks,
      which consumes several hundred cycles.
      
      Zero out the controls' shadow copies during VMCS allocation and use the
      optimized setter when "initializing" controls.  While this technically
      affects both non-nested and nested virtualization, nested virtualization
      is the primary beneficiary, as avoiding VMWRITEs when preparing vmcs02
      allows hardware to optimize away consistency checks.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3af80fec
    • KVM: VMX: Shadow VMCS secondary execution controls · fe7f895d
      Committed by Sean Christopherson
      Prepare to shadow all major control fields on a per-VMCS basis, which
      allows KVM to avoid costly VMWRITEs when switching between vmcs01 and
      vmcs02.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fe7f895d
    • KVM: VMX: Shadow VMCS primary execution controls · 2183f564
      Committed by Sean Christopherson
      Prepare to shadow all major control fields on a per-VMCS basis, which
      allows KVM to avoid VMREADs when switching between vmcs01 and vmcs02,
      and more importantly can eliminate costly VMWRITEs to controls when
      preparing vmcs02.
      
      Shadowing exec controls also saves a VMREAD when opening virtual
      INTR/NMI windows, yay...
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2183f564
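      A hedged sketch of what a shadowed control setter can look like (the
      shadow field layout and helper name are illustrative):

        static inline void exec_controls_set(struct vcpu_vmx *vmx, u32 val)
        {
                /* Skip the VMWRITE entirely when the value is unchanged, so the
                 * CPU's dirty-VMCS tracking (and thus its consistency-check
                 * caching) is not invalidated needlessly. */
                if (vmx->loaded_vmcs->controls_shadow.exec != val) {
                        vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, val);
                        vmx->loaded_vmcs->controls_shadow.exec = val;
                }
        }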
    • KVM: VMX: Shadow VMCS pin controls · c5f2c766
      Committed by Sean Christopherson
      Prepare to shadow all major control fields on a per-VMCS basis, which
      allows KVM to avoid costly VMWRITEs when switching between vmcs01 and
      vmcs02.
      
      Shadowing pin controls also allows a future patch to remove the per-VMCS
      'hv_timer_armed' flag, as the shadow copy is a superset of said flag.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c5f2c766
    • KVM: nVMX: Use adjusted pin controls for vmcs02 · c075c3e4
      Committed by Sean Christopherson
      KVM provides a module parameter to allow disabling virtual NMI support
      to simplify testing (hardware *without* virtual NMI support is hard to
      come by but it does have users).  When preparing vmcs02, use the accessor
      for pin controls to ensure that the module param is respected for nested
      guests.
      
      Opportunistically swap the order of applying L0's and L1's pin controls
      to better align with other controls and to prepare for a future patch
      that will ignore L1's, but not L0's, preemption timer flag.
      
      Fixes: d02fcf50 ("kvm: vmx: Allow disabling virtual NMI support")
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c075c3e4
    • KVM: x86: introduce is_pae_paging · bf03d4f9
      Committed by Paolo Bonzini
      Checking for 32-bit PAE is quite common around code that fiddles with
      the PDPTRs.  Add a function to compress all checks into a single
      invocation.
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bf03d4f9
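      The helper is essentially a one-liner; a sketch consistent with the
      existing is_long_mode()/is_pae()/is_paging() predicates:

        static inline bool is_pae_paging(struct kvm_vcpu *vcpu)
        {
                /* 32-bit PAE paging: not in long mode, but PAE and paging on. */
                return !is_long_mode(vcpu) && is_pae(vcpu) && is_paging(vcpu);
        }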
    • KVM: nVMX: Update vmcs12 for MSR_IA32_DEBUGCTLMSR when it's written · 699a1ac2
      Committed by Sean Christopherson
      KVM unconditionally intercepts WRMSR to MSR_IA32_DEBUGCTLMSR.  In the
      unlikely event that L1 allows L2 to write L1's MSR_IA32_DEBUGCTLMSR but
      saves L2's value on VM-Exit, update vmcs12 during L2's WRMSR so as
      to eliminate the need to VMREAD the value from vmcs02 on nested VM-Exit.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      699a1ac2
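      A hedged sketch of the WRMSR-side update (the condition and field names
      follow the usual vmcs12 naming; the exact code may differ):

        case MSR_IA32_DEBUGCTLMSR:
                /* If L1 lets L2 write DEBUGCTL but asks for it to be saved on
                 * VM-Exit, keep vmcs12 current now and skip the VMREAD later. */
                if (is_guest_mode(vcpu) &&
                    (get_vmcs12(vcpu)->vm_exit_controls & VM_EXIT_SAVE_DEBUG_CONTROLS))
                        get_vmcs12(vcpu)->guest_ia32_debugctl = data;
                break;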
    • KVM: nVMX: Update vmcs12 for SYSENTER MSRs when they're written · de70d279
      Committed by Sean Christopherson
      For L2, KVM always intercepts WRMSR to SYSENTER MSRs.  Update vmcs12 in
      the WRMSR handler so that they don't need to be (re)read from vmcs02 on
      every nested VM-Exit.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      de70d279
    • KVM: nVMX: Update vmcs12 for MSR_IA32_CR_PAT when it's written · 142e4be7
      Committed by Sean Christopherson
      As alluded to by the TODO comment, KVM unconditionally intercepts writes
      to the PAT MSR.  In the unlikely event that L1 allows L2 to write L1's
      PAT directly but saves L2's PAT on VM-Exit, update vmcs12 when L2 writes
      the PAT.  This eliminates the need to VMREAD the value from vmcs02 on
      VM-Exit as vmcs12 is already up to date in all situations.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      142e4be7
    • KVM: nVMX: Don't reread VMCS-agnostic state when switching VMCS · 8ef863e6
      Committed by Sean Christopherson
      When switching between vmcs01 and vmcs02, there is no need to update
      state tracking for values that aren't tied to any particular VMCS as
      the per-vCPU values are already up-to-date (vmx_switch_vmcs() can only
      be called when the vCPU is loaded).
      
      Avoiding the update eliminates a RDMSR, and potentially a RDPKRU and
      posted-interrupt update (cmpxchg64() and more).
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8ef863e6
    • KVM: nVMX: Don't "put" vCPU or host state when switching VMCS · 13b964a2
      Committed by Sean Christopherson
      When switching between vmcs01 and vmcs02, KVM isn't actually switching
      between guest and host.  If guest state is already loaded (the likely,
      if not guaranteed, case), keep the guest state loaded and manually swap
      the loaded_cpu_state pointer after propagating saved host state to the
      new vmcs0{1,2}.
      
      Avoiding the switch between guest and host reduces the latency of
      switching between vmcs01 and vmcs02 by several hundred cycles, and
      reduces the roundtrip time of a nested VM by upwards of 1000 cycles.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      13b964a2
    • KVM: VMX: simplify vmx_prepare_switch_to_{guest,host} · b464f57e
      Committed by Paolo Bonzini
      vmx->loaded_cpu_state can only be NULL or equal to vmx->loaded_vmcs,
      so change it to a bool.  Because the direction of the bool is
      now the opposite of vmx->guest_msrs_dirty, change the direction of
      vmx->guest_msrs_dirty so that they match.
      
      Finally, do not imply that MSRs have to be reloaded when
      vmx->guest_state_loaded is false; instead, set vmx->guest_msrs_ready
      to false explicitly in vmx_prepare_switch_to_host.
      
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b464f57e
    • KVM: VMX: Always signal #GP on WRMSR to MSR_IA32_CR_PAT with bad value · d28f4290
      Committed by Sean Christopherson
      The behavior of WRMSR is in no way dependent on whether or not KVM
      consumes the value.
      
      Fixes: 4566654b ("KVM: vmx: Inject #GP on invalid PAT CR")
      Cc: stable@vger.kernel.org
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d28f4290
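      A hedged sketch of the ordering fix (kvm_pat_valid() is assumed as the
      validity helper; the point is that the check happens unconditionally):

        case MSR_IA32_CR_PAT:
                /* WRMSR semantics don't depend on whether KVM consumes the
                 * value: an invalid PAT must #GP regardless of what happens
                 * to the value afterwards. */
                if (!kvm_pat_valid(data))
                        return 1;       /* caller injects #GP */
                /* value is then stored/forwarded exactly as before */
                break;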
    • KVM: nVMX: Use descriptive names for VMCS sync functions and flags · 3731905e
      Committed by Sean Christopherson
      Nested virtualization involves copying data between many different types
      of VMCSes, e.g. vmcs02, vmcs12, shadow VMCS and eVMCS.  Rename a variety
      of functions and flags to document both the source and destination of
      each sync.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3731905e
    • KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fn · 95b5a48c
      Committed by Sean Christopherson
      Per commit 1b6269db ("KVM: VMX: Handle NMIs before enabling
      interrupts and preemption"), NMIs are handled directly in vmx_vcpu_run()
      to "make sure we handle NMI on the current cpu, and that we don't
      service maskable interrupts before non-maskable ones".  The other
      exceptions handled by complete_atomic_exit(), e.g. async #PF and #MC,
      have similar requirements, and are located there to avoid extra VMREADs
      since VMX bins hardware exceptions and NMIs into a single exit reason.
      
      Clean up the code and eliminate the vaguely named complete_atomic_exit()
      by moving the interrupts-disabled exception and NMI handling into the
      existing handle_external_intrs() callback, and rename the callback to
      a more appropriate name.  Rename VMexit handlers throughout so that the
      atomic and non-atomic counterparts have similar names.
      
      In addition to improving code readability, this also ensures the NMI
      handler is run with the host's debug registers loaded in the unlikely
      event that the user is debugging NMIs.  Accuracy of the last_guest_tsc
      field is also improved when handling NMIs (and #MCs) as the handler
      will run after updating said field.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Naming cleanups. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      95b5a48c
    • KVM: x86: Move kvm_{before,after}_interrupt() calls to vendor code · 165072b0
      Committed by Sean Christopherson
      VMX can conditionally call kvm_{before,after}_interrupt() since KVM
      always uses "ack interrupt on exit" and therefore explicitly handles
      interrupts as opposed to blindly enabling irqs.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      165072b0
    • KVM: VMX: Store the host kernel's IDT base in a global variable · 2342080c
      Committed by Sean Christopherson
      Although the kernel may use multiple IDTs, KVM should only ever see the
      "real" IDT, e.g. the early init IDT is long gone by the time KVM runs
      and the debug stack IDT is only used for small windows of time in very
      specific flows.
      
      Before commit a547c6db ("KVM: VMX: Enable acknowledge interupt on
      vmexit"), the kernel's IDT base was consumed by KVM only when setting
      constant VMCS state, i.e. to set VMCS.HOST_IDTR_BASE.  Because constant
      host state is done once per vCPU, there was ostensibly no need to cache
      the kernel's IDT base.
      
      When support for "ack interrupt on exit" was introduced, KVM added a
      second consumer of the IDT base as handling already-acked interrupts
      requires directly calling the interrupt handler, i.e. KVM uses the IDT
      base to find the address of the handler.  Because interrupts are a fast
      path, KVM cached the IDT base to avoid having to VMREAD HOST_IDTR_BASE.
      Presumably, the IDT base was cached on a per-vCPU basis simply because
      the existing code grabbed the IDT base on a per-vCPU (VMCS) basis.
      
      Note, all post-boot IDTs use the same handlers for external interrupts,
      i.e. the "ack interrupt on exit" use of the IDT base would be unaffected
      even if the cached IDT somehow did not match the current IDT.  And as
      for the original use case of setting VMCS.HOST_IDTR_BASE, if any of the
      above analysis is wrong then KVM has had a bug since the beginning of
      time, given that KVM has effectively been caching the IDT at vCPU
      creation ever since commit a8b732ca01c ("[PATCH] kvm: userspace interface").
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2342080c
    • KVM: VMX: Read cached VM-Exit reason to detect external interrupt · 49def500
      Committed by Sean Christopherson
      Generic x86 code invokes the kvm_x86_ops external interrupt handler on
      all VM-Exits regardless of the actual exit type.  Use the already-cached
      EXIT_REASON to determine if the VM-Exit was due to an interrupt, thus
      avoiding an extra VMREAD (to query VM_EXIT_INTR_INFO) for all other
      types of VM-Exit.
      
      In addition to avoiding the extra VMREAD, checking the EXIT_REASON
      instead of VM_EXIT_INTR_INFO makes it more obvious that
      vmx_handle_external_intr() is called for all VM-Exits, e.g. someone
      unfamiliar with the flow might wonder under what condition(s)
      VM_EXIT_INTR_INFO does not contain a valid interrupt, which is
      simply not possible since KVM always runs with "ack interrupt on exit".
      
      WARN once if VM_EXIT_INTR_INFO doesn't contain a valid interrupt on
      an EXTERNAL_INTERRUPT VM-Exit, as such a condition would indicate a
      hardware bug.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      49def500
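      A hedged sketch of the flow (the helper name and WARN text are
      illustrative; the cached exit reason gates the VMREAD):

        /* Called for every VM-Exit; only EXTERNAL_INTERRUPT exits need the
         * VMREAD of VM_EXIT_INTR_INFO. */
        if (to_vmx(vcpu)->exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT) {
                u32 intr_info = vmcs_read32(VM_EXIT_INTR_INFO);

                /* With "ack interrupt on exit" the info must be valid; anything
                 * else would indicate a hardware problem. */
                WARN_ONCE(!is_external_intr(intr_info),
                          "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info);
        }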
    • kvm: nVMX: small cleanup in handle_exception · 2ea72039
      Committed by Paolo Bonzini
      The reason for skipping handling of NMI and #MC in handle_exception is
      the same, namely they are handled earlier by vmx_complete_atomic_exit.
      Calling the machine check handler (which just returns 1) is misleading,
      don't do it.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2ea72039
    • KVM: VMX: Fix handling of #MC that occurs during VM-Entry · beb8d93b
      Committed by Sean Christopherson
      A previous fix to prevent KVM from consuming stale VMCS state after a
      failed VM-Entry inadvertently blocked KVM's handling of machine checks
      that occur during VM-Entry.
      
      Per Intel's SDM, a #MC during VM-Entry is handled in one of three ways,
      depending on when the #MC is recognized.  As it pertains to this bug
      fix, the third case explicitly states EXIT_REASON_MCE_DURING_VMENTRY
      is handled like any other VM-Exit during VM-Entry, i.e. sets bit 31 to
      indicate the VM-Entry failed.
      
      If a machine-check event occurs during a VM entry, one of the following occurs:
       - The machine-check event is handled as if it occurred before the VM entry:
              ...
       - The machine-check event is handled after VM entry completes:
              ...
       - A VM-entry failure occurs as described in Section 26.7. The basic
         exit reason is 41, for "VM-entry failure due to machine-check event".
      
      Explicitly handle EXIT_REASON_MCE_DURING_VMENTRY as a one-off case in
      vmx_vcpu_run() instead of binning it into vmx_complete_atomic_exit().
      Doing so allows vmx_vcpu_run() to handle VMX_EXIT_REASONS_FAILED_VMENTRY
      in a sane fashion and also simplifies vmx_complete_atomic_exit() since
      VMCS.VM_EXIT_INTR_INFO is guaranteed to be fresh.
      
      Fixes: b060ca3b ("kvm: vmx: Handle VMLAUNCH/VMRESUME failure properly")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      beb8d93b
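      A hedged sketch of the one-off handling in vmx_vcpu_run() (the
      kvm_machine_check() call mirrors the existing #MC exit handling; exact
      code may differ):

        vmx->exit_reason = vmx->fail ? 0xdead : vmcs_read32(VM_EXIT_REASON);

        /* Bit 31 of the exit reason marks a failed VM-Entry; the low bits
         * still identify EXIT_REASON_MCE_DURING_VMENTRY, so check only the
         * low word and forward the machine check to the host right away. */
        if ((u16)vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY)
                kvm_machine_check();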
    • KVM: x86: move MSR_IA32_POWER_CTL handling to common code · 73f624f4
      Committed by Paolo Bonzini
      Make it available to AMD hosts as well, just in case someone is trying
      to use an Intel processor's CPUID setup.
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      73f624f4