1. 02 July 2019, 1 commit
    • KVM: nVMX: list VMX MSRs in KVM_GET_MSR_INDEX_LIST · 95c5c7c7
      Committed by Paolo Bonzini
      This allows userspace to know which MSRs are supported by the hypervisor.
      Unfortunately userspace must resort to tricks for everything except
      MSR_IA32_VMX_VMFUNC (which was just added in the previous patch).
      One possibility is to use the feature control MSR, which is tied to nested
      VMX as well and is present on all KVM versions that support feature MSRs.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
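
      For context, a minimal userspace sketch of the query this enables
      (error handling omitted; the two-call KVM_GET_MSR_INDEX_LIST pattern is
      the documented KVM API, and 0x491 is the architectural index of
      MSR_IA32_VMX_VMFUNC):

          #include <fcntl.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <sys/ioctl.h>
          #include <linux/kvm.h>

          int main(void)
          {
              int kvm = open("/dev/kvm", O_RDWR);
              struct kvm_msr_list probe = { .nmsrs = 0 };
              struct kvm_msr_list *list;

              /* First call fails with E2BIG but fills in the required count. */
              ioctl(kvm, KVM_GET_MSR_INDEX_LIST, &probe);
              list = malloc(sizeof(*list) + probe.nmsrs * sizeof(__u32));
              list->nmsrs = probe.nmsrs;
              ioctl(kvm, KVM_GET_MSR_INDEX_LIST, list);

              for (__u32 i = 0; i < list->nmsrs; i++)
                  if (list->indices[i] == 0x491)  /* MSR_IA32_VMX_VMFUNC */
                      printf("MSR_IA32_VMX_VMFUNC is listed\n");

              free(list);
              return 0;
          }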
  2. 20 June 2019, 1 commit
  3. 18 June 2019, 25 commits
    • KVM: VMX: Leave preemption timer running when it's disabled · 804939ea
      Committed by Sean Christopherson
      VMWRITEs to the major VMCS controls, pin controls included, are
      deceptively expensive.  CPUs with VMCS caching (Westmere and later) also
      optimize away consistency checks on VM-Entry, i.e. skip consistency
      checks if the relevant fields have not changed since the last successful
      VM-Entry (of the cached VMCS).  Because uops are a precious commodity,
      uCode's dirty VMCS field tracking isn't as precise as software would
      prefer.  Notably, writing any of the major VMCS fields effectively marks
      the entire VMCS dirty, i.e. causes the next VM-Entry to perform all
      consistency checks, which consumes several hundred cycles.
      
      As it pertains to KVM, toggling PIN_BASED_VMX_PREEMPTION_TIMER more than
      doubles the latency of the next VM-Entry (and again when/if the flag is
      toggled back).  In a non-nested scenario, running a "standard" guest
      with the preemption timer enabled, toggling the timer flag is uncommon
      but not rare, e.g. roughly 1 in 10 entries.  Disabling the preemption
      timer can change these numbers due to its use for "immediate exits",
      even when explicitly disabled by userspace.
      
      Nested virtualization in particular is painful, as the timer flag is set
      for the majority of VM-Enters, but prepare_vmcs02() initializes vmcs02's
      pin controls to *clear* the flag since the timer's final state isn't
      known until vmx_vcpu_run().  I.e. the majority of nested VM-Enters end
      up unnecessarily writing pin controls *twice*.
      
      Rather than toggle the timer flag in pin controls, set the timer value
      itself to the largest allowed value to put it into a "soft disabled"
      state, and ignore any spurious preemption timer exits.
      
      Sadly, the timer is a 32-bit value and so theoretically it can fire
      before the heat death of the universe, i.e. spurious exits are possible.
      But because KVM does *not* save the timer value on VM-Exit and because
      the timer runs at a slower rate than the TSC, the maximum timer value
      is still sufficiently large for KVM's purposes.  E.g. on a modern CPU
      with a timer that runs at 1/32 the frequency of a 2.4GHz constant-rate
      TSC, the timer will fire after ~55 seconds of *uninterrupted* guest
      execution.  In other words, spurious VM-Exits are effectively only
      possible if the host is completely tickless on the logical CPU, the
      guest is not using the preemption timer, and the guest is not generating
      VM-Exits for any other reason.
      
      To be safe from bad/weird hardware, disable the preemption timer if its
      maximum delay is less than ten seconds.  Ten seconds is mostly arbitrary
      and was selected in no small part because it's a nice round number.
      For simplicity and paranoia, fall back to __kvm_request_immediate_exit()
      if the preemption timer is disabled by KVM or userspace.  Previously
      KVM continued to use the preemption timer to force immediate exits even
      when the timer was disabled by userspace.  Now that KVM leaves the timer
      running instead of truly disabling it, allow userspace to kill it
      entirely in the unlikely event the timer (or KVM) malfunctions.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
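
      As a sanity check on the ~55 second figure: a 32-bit down-counter
      ticking at 1/32 of a 2.4GHz TSC gives

          max_delay = 2^32 ticks * 32 TSC-cycles/tick / 2.4e9 cycles/sec ≈ 57 sec

      which matches the figure above to within rounding.  The "soft disabled"
      state boils down to something like the following sketch (illustrative,
      not the literal upstream diff; req_immediate_exit is KVM's existing
      immediate-exit flag, hv_timer_soft_disabled the new tracking bit):

          /* Park the timer at its maximum value rather than toggling the
           * PIN_BASED_VMX_PREEMPTION_TIMER control off and on. */
          vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, ~0u);

          /* Treat a firing of the parked timer as spurious. */
          static int handle_preemption_timer(struct kvm_vcpu *vcpu)
          {
              struct vcpu_vmx *vmx = to_vmx(vcpu);

              if (!vmx->req_immediate_exit &&
                  !unlikely(vmx->loaded_vmcs->hv_timer_soft_disabled))
                  kvm_lapic_expired_hv_timer(vcpu);

              return 1;
          }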
    • KVM: VMX: Drop hv_timer_armed from 'struct loaded_vmcs' · 9d99cc49
      Committed by Sean Christopherson
      ... now that it is fully redundant with the pin controls shadow.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Explicitly initialize controls shadow at VMCS allocation · 3af80fec
      Committed by Sean Christopherson
      Or: Don't re-initialize vmcs02's controls on every nested VM-Entry.
      
      VMWRITEs to the major VMCS controls are deceptively expensive.  Intel
      CPUs with VMCS caching (Westmere and later) also optimize away
      consistency checks on VM-Entry, i.e. skip consistency checks if the
      relevant fields have not changed since the last successful VM-Entry (of
      the cached VMCS).  Because uops are a precious commodity, uCode's dirty
      VMCS field tracking isn't as precise as software would prefer.  Notably,
      writing any of the major VMCS fields effectively marks the entire VMCS
      dirty, i.e. causes the next VM-Entry to perform all consistency checks,
      which consumes several hundred cycles.
      
      Zero out the controls' shadow copies during VMCS allocation and use the
      optimized setter when "initializing" controls.  While this technically
      affects both non-nested and nested virtualization, nested virtualization
      is the primary beneficiary as avoiding VMWRITEs when preparing vmcs02
      allows hardware to optimize away consistency checks.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Shadow VMCS secondary execution controls · fe7f895d
      Committed by Sean Christopherson
      Prepare to shadow all major control fields on a per-VMCS basis, which
      allows KVM to avoid costly VMWRITEs when switching between vmcs01 and
      vmcs02.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Shadow VMCS primary execution controls · 2183f564
      Committed by Sean Christopherson
      Prepare to shadow all major control fields on a per-VMCS basis, which
      allows KVM to avoid VMREADs when switching between vmcs01 and vmcs02,
      and more importantly can eliminate costly VMWRITEs to controls when
      preparing vmcs02.
      
      Shadowing exec controls also saves a VMREAD when opening virtual
      INTR/NMI windows, yay...
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Shadow VMCS pin controls · c5f2c766
      Committed by Sean Christopherson
      Prepare to shadow all major control fields on a per-VMCS basis, which
      allows KVM to avoid costly VMWRITEs when switching between vmcs01 and
      vmcs02.
      
      Shadowing pin controls also allows a future patch to remove the per-VMCS
      'hv_timer_armed' flag, as the shadow copy is a superset of said flag.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
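
      The shadowing in these three patches follows one pattern per control
      field; a sketch of that pattern (upstream generates the accessors with
      a macro, so the exact structure and names here are illustrative):

          /* Cached copies live alongside the VMCS they shadow. */
          struct vmcs_controls_shadow {
              u32 pin;
              u32 exec;
              u32 secondary_exec;
          };

          /* Skip the expensive VMWRITE when the value is unchanged. */
          static inline void pin_controls_set(struct vcpu_vmx *vmx, u32 val)
          {
              if (vmx->loaded_vmcs->controls_shadow.pin != val) {
                  vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, val);
                  vmx->loaded_vmcs->controls_shadow.pin = val;
              }
          }

          /* Reads never touch the VMCS at all. */
          static inline u32 pin_controls_get(struct vcpu_vmx *vmx)
          {
              return vmx->loaded_vmcs->controls_shadow.pin;
          }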
    • KVM: nVMX: Use adjusted pin controls for vmcs02 · c075c3e4
      Committed by Sean Christopherson
      KVM provides a module parameter to allow disabling virtual NMI support
      to simplify testing (hardware *without* virtual NMI support is hard to
      come by but it does have users).  When preparing vmcs02, use the accessor
      for pin controls to ensure that the module param is respected for nested
      guests.
      
      Opportunistically swap the order of applying L0's and L1's pin controls
      to better align with other controls and to prepare for a future patch
      that will ignore L1's, but not L0's, preemption timer flag.
      
      Fixes: d02fcf50 ("kvm: vmx: Allow disabling virtual NMI support")
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: introduce is_pae_paging · bf03d4f9
      Committed by Paolo Bonzini
      Checking for 32-bit PAE is quite common around code that fiddles with
      the PDPTRs.  Add a function to compress all checks into a single
      invocation.
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
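
      The helper is presumably a one-line composition of the existing
      predicates; a sketch assuming the is_long_mode()/is_pae()/is_paging()
      accessors:

          static inline bool is_pae_paging(struct kvm_vcpu *vcpu)
          {
              /* 32-bit PAE paging: PAE on, paging on, long mode off. */
              return !is_long_mode(vcpu) && is_pae(vcpu) && is_paging(vcpu);
          }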
    • KVM: nVMX: Update vmcs12 for MSR_IA32_DEBUGCTLMSR when it's written · 699a1ac2
      Committed by Sean Christopherson
      KVM unconditionally intercepts WRMSR to MSR_IA32_DEBUGCTLMSR.  In the
      unlikely event that L1 allows L2 to write L1's MSR_IA32_DEBUGCTLMSR but
      saves L2's value on VM-Exit, update vmcs12 during L2's WRMSR so as to
      eliminate the need to VMREAD the value from vmcs02 on nested VM-Exit.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Update vmcs12 for SYSENTER MSRs when they're written · de70d279
      Committed by Sean Christopherson
      For L2, KVM always intercepts WRMSR to SYSENTER MSRs.  Update vmcs12 in
      the WRMSR handler so that they don't need to be (re)read from vmcs02 on
      every nested VM-Exit.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Update vmcs12 for MSR_IA32_CR_PAT when it's written · 142e4be7
      Committed by Sean Christopherson
      As alluded to by the TODO comment, KVM unconditionally intercepts writes
      to the PAT MSR.  In the unlikely event that L1 allows L2 to write L1's
      PAT directly but saves L2's PAT on VM-Exit, update vmcs12 when L2 writes
      the PAT.  This eliminates the need to VMREAD the value from vmcs02 on
      VM-Exit as vmcs12 is already up to date in all situations.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
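
      The shape shared by the three WRMSR patches above, sketched for the PAT
      case (simplified; assumes the usual vmcs12 accessors and a
      kvm_pat_valid() helper):

          case MSR_IA32_CR_PAT:
              if (!kvm_pat_valid(data))
                  return 1;       /* #GP */

              /* Sync vmcs12 at write time so nested VM-Exit doesn't have
               * to VMREAD the value back out of vmcs02. */
              if (is_guest_mode(vcpu) &&
                  get_vmcs12(vcpu)->vm_exit_controls & VM_EXIT_SAVE_IA32_PAT)
                  get_vmcs12(vcpu)->guest_ia32_pat = data;

              vmcs_write64(GUEST_IA32_PAT, data);
              vcpu->arch.pat = data;
              break;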
    • KVM: nVMX: Don't reread VMCS-agnostic state when switching VMCS · 8ef863e6
      Committed by Sean Christopherson
      When switching between vmcs01 and vmcs02, there is no need to update
      state tracking for values that aren't tied to any particular VMCS as
      the per-vCPU values are already up-to-date (vmx_switch_vmcs() can only
      be called when the vCPU is loaded).
      
      Avoiding the update eliminates a RDMSR, and potentially a RDPKRU and
      posted-interrupt update (cmpxchg64() and more).
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Don't "put" vCPU or host state when switching VMCS · 13b964a2
      Committed by Sean Christopherson
      When switching between vmcs01 and vmcs02, KVM isn't actually switching
      between guest and host.  If guest state is already loaded (the likely,
      if not guaranteed, case), keep the guest state loaded and manually swap
      the loaded_cpu_state pointer after propagating saved host state to the
      new vmcs0{1,2}.
      
      Avoiding the switch between guest and host reduces the latency of
      switching between vmcs01 and vmcs02 by several hundred cycles, and
      reduces the roundtrip time of a nested VM by upwards of 1000 cycles.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
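
      A simplified sketch of the resulting switch path, assuming a helper
      (called vmx_sync_vmcs_host_state here) that propagates the HOST_*
      fields to the destination VMCS:

          static void vmx_switch_vmcs(struct kvm_vcpu *vcpu,
                                      struct loaded_vmcs *vmcs)
          {
              struct vcpu_vmx *vmx = to_vmx(vcpu);
              struct loaded_vmcs *prev = vmx->loaded_vmcs;
              int cpu;

              if (prev == vmcs)
                  return;

              cpu = get_cpu();
              vmx->loaded_vmcs = vmcs;
              vmx_vcpu_load(vcpu, cpu);            /* VMPTRLD the new VMCS */
              vmx_sync_vmcs_host_state(vmx, prev); /* copy host fields */
              put_cpu();

              /* Guest state was never unloaded, so nothing to reload. */
          }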
    • KVM: VMX: simplify vmx_prepare_switch_to_{guest,host} · b464f57e
      Committed by Paolo Bonzini
      vmx->loaded_cpu_state can only be NULL or equal to vmx->loaded_vmcs,
      so change it to a bool.  Because the direction of the bool is
      now the opposite of vmx->guest_msrs_dirty, change the direction of
      vmx->guest_msrs_dirty so that they match.
      
      Finally, do not imply that MSRs have to be reloaded when
      vmx->guest_state_loaded is false; instead, set vmx->guest_msrs_ready
      to false explicitly in vmx_prepare_switch_to_host.
      
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Always signal #GP on WRMSR to MSR_IA32_CR_PAT with bad value · d28f4290
      Committed by Sean Christopherson
      The behavior of WRMSR is in no way dependent on whether or not KVM
      consumes the value.
      
      Fixes: 4566654b ("KVM: vmx: Inject #GP on invalid PAT CR")
      Cc: stable@vger.kernel.org
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
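
      Concretely, the fix is an ordering change in the WRMSR handler; a
      before/after sketch (simplified):

          /* Before: the validity check was reachable only when KVM itself
           * consumed the value, so other paths accepted a bad PAT. */
          if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
              if (!kvm_pat_valid(data))
                  return 1;               /* #GP */
              /* ...load the PAT into the VMCS... */
          }

          /* After: validate first, unconditionally; WRMSR semantics don't
           * depend on what KVM does with the value. */
          if (!kvm_pat_valid(data))
              return 1;                   /* #GP */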
    • KVM: nVMX: Use descriptive names for VMCS sync functions and flags · 3731905e
      Committed by Sean Christopherson
      Nested virtualization involves copying data between many different types
      of VMCSes, e.g. vmcs02, vmcs12, shadow VMCS and eVMCS.  Rename a variety
      of functions and flags to document both the source and destination of
      each sync.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fn · 95b5a48c
      Committed by Sean Christopherson
      Per commit 1b6269db ("KVM: VMX: Handle NMIs before enabling
      interrupts and preemption"), NMIs are handled directly in vmx_vcpu_run()
      to "make sure we handle NMI on the current cpu, and that we don't
      service maskable interrupts before non-maskable ones".  The other
      exceptions handled by complete_atomic_exit(), e.g. async #PF and #MC,
      have similar requirements, and are located there to avoid extra VMREADs
      since VMX bins hardware exceptions and NMIs into a single exit reason.
      
      Clean up the code and eliminate the vaguely named complete_atomic_exit()
      by moving the interrupts-disabled exception and NMI handling into the
      existing handle_external_intrs() callback, and rename the callback to
      a more appropriate name.  Rename VM-Exit handlers throughout so that the
      atomic and non-atomic counterparts have similar names.
      
      In addition to improving code readability, this also ensures the NMI
      handler is run with the host's debug registers loaded in the unlikely
      event that the user is debugging NMIs.  Accuracy of the last_guest_tsc
      field is also improved when handling NMIs (and #MCs) as the handler
      will run after updating said field.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Naming cleanups. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
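
      The consolidated irqs-disabled callback plausibly ends up shaped like
      this (a sketch; helper names are illustrative):

          static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
          {
              struct vcpu_vmx *vmx = to_vmx(vcpu);

              /* Runs with irqs still disabled, right after VM-Exit. */
              if (vmx->exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
                  handle_external_interrupt_irqoff(vcpu);
              else if (vmx->exit_reason == EXIT_REASON_EXCEPTION_NMI)
                  handle_exception_nmi_irqoff(vmx);  /* NMI, #MC, async #PF */
          }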
    • KVM: x86: Move kvm_{before,after}_interrupt() calls to vendor code · 165072b0
      Committed by Sean Christopherson
      VMX can conditionally call kvm_{before,after}_interrupt() since KVM
      always uses "ack interrupt on exit" and therefore explicitly handles
      interrupts as opposed to blindly enabling irqs.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Store the host kernel's IDT base in a global variable · 2342080c
      Committed by Sean Christopherson
      Although the kernel may use multiple IDTs, KVM should only ever see the
      "real" IDT, e.g. the early init IDT is long gone by the time KVM runs
      and the debug stack IDT is only used for small windows of time in very
      specific flows.
      
      Before commit a547c6db ("KVM: VMX: Enable acknowledge interupt on
      vmexit"), the kernel's IDT base was consumed by KVM only when setting
      constant VMCS state, i.e. to set VMCS.HOST_IDTR_BASE.  Because constant
      host state is done once per vCPU, there was ostensibly no need to cache
      the kernel's IDT base.
      
      When support for "ack interrupt on exit" was introduced, KVM added a
      second consumer of the IDT base as handling already-acked interrupts
      requires directly calling the interrupt handler, i.e. KVM uses the IDT
      base to find the address of the handler.  Because interrupts are a fast
      path, KVM cached the IDT base to avoid having to VMREAD HOST_IDTR_BASE.
      Presumably, the IDT base was cached on a per-vCPU basis simply because
      the existing code grabbed the IDT base on a per-vCPU (VMCS) basis.
      
      Note, all post-boot IDTs use the same handlers for external interrupts,
      i.e. the "ack interrupt on exit" use of the IDT base would be unaffected
      even if the cached IDT somehow did not match the current IDT.  And as
      for the original use case of setting VMCS.HOST_IDTR_BASE, if any of the
      above analysis is wrong then KVM has had a bug since the beginning of
      time since KVM has effectively been caching the IDT at vCPU creation
      since commit a8b732ca01c ("[PATCH] kvm: userspace interface").
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
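
      The mechanics reduce to capturing the IDT base once via the standard
      store_idt() accessor; a sketch (other constant host state omitted):

          static unsigned long host_idt_base;  /* one IDT base for the host */

          static void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
          {
              struct desc_ptr dt;

              store_idt(&dt);
              host_idt_base = dt.address;      /* cached for intr dispatch */
              vmcs_writel(HOST_IDTR_BASE, host_idt_base);
          }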
    • KVM: VMX: Read cached VM-Exit reason to detect external interrupt · 49def500
      Committed by Sean Christopherson
      Generic x86 code invokes the kvm_x86_ops external interrupt handler on
      all VM-Exits regardless of the actual exit type.  Use the already-cached
      EXIT_REASON to determine if the VM-Exit was due to an interrupt, thus
      avoiding an extra VMREAD (to query VM_EXIT_INTR_INFO) for all other
      types of VM-Exit.
      
      In addition to avoiding the extra VMREAD, checking the EXIT_REASON
      instead of VM_EXIT_INTR_INFO makes it more obvious that
      vmx_handle_external_intr() is called for all VM-Exits, e.g. someone
      unfamiliar with the flow might wonder under what condition(s)
      VM_EXIT_INTR_INFO does not contain a valid interrupt, which is
      simply not possible since KVM always runs with "ack interrupt on exit".
      
      WARN once if VM_EXIT_INTR_INFO doesn't contain a valid interrupt on
      an EXTERNAL_INTERRUPT VM-Exit, as such a condition would indicate a
      hardware bug.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
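
      Tying the last two entries together, a hedged sketch of the interrupt
      path (helper names are illustrative, and the actual dispatch into the
      host handler is assembly):

          static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
          {
              u32 intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
              unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
              gate_desc *desc = (gate_desc *)host_idt_base + vector;

              /* Impossible with "ack interrupt on exit", barring a
               * hardware bug. */
              if (WARN_ONCE(!is_external_intr(intr_info),
                            "KVM: unexpected VM-Exit interrupt info: 0x%x",
                            intr_info))
                  return;

              /* ...tail-call the host handler at gate_offset(desc),
               * emulating the stack frame an IRQ would have pushed... */
          }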
    • kvm: nVMX: small cleanup in handle_exception · 2ea72039
      Committed by Paolo Bonzini
      The reason for skipping handling of NMI and #MC in handle_exception is
      the same, namely they are handled earlier by vmx_complete_atomic_exit.
      Calling the machine check handler (which just returns 1) is misleading;
      don't do it.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Fix handling of #MC that occurs during VM-Entry · beb8d93b
      Committed by Sean Christopherson
      A previous fix to prevent KVM from consuming stale VMCS state after a
      failed VM-Entry inadvertently blocked KVM's handling of machine checks
      that occur during VM-Entry.
      
      Per Intel's SDM, a #MC during VM-Entry is handled in one of three ways,
      depending on when the #MC is recognized.  As it pertains to this bug
      fix, the third case explicitly states EXIT_REASON_MCE_DURING_VMENTRY
      is handled like any other VM-Exit during VM-Entry, i.e. sets bit 31 to
      indicate the VM-Entry failed.
      
      If a machine-check event occurs during a VM entry, one of the following occurs:
       - The machine-check event is handled as if it occurred before the VM entry:
              ...
       - The machine-check event is handled after VM entry completes:
              ...
       - A VM-entry failure occurs as described in Section 26.7. The basic
         exit reason is 41, for "VM-entry failure due to machine-check event".
      
      Explicitly handle EXIT_REASON_MCE_DURING_VMENTRY as a one-off case in
      vmx_vcpu_run() instead of binning it into vmx_complete_atomic_exit().
      Doing so allows vmx_vcpu_run() to handle VMX_EXIT_REASONS_FAILED_VMENTRY
      in a sane fashion and also simplifies vmx_complete_atomic_exit() since
      VMCS.VM_EXIT_INTR_INFO is guaranteed to be fresh.
      
      Fixes: b060ca3b ("kvm: vmx: Handle VMLAUNCH/VMRESUME failure properly")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
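
      The one-off handling amounts to a single early check in vmx_vcpu_run()
      (sketch; the basic exit reason lives in the low 16 bits, and bit 31
      flags the failed VM-Entry):

          vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);

          /* Handle the #MC immediately, even though bit 31 marks this as
           * a failed VM-Entry that later exit processing will reject. */
          if (unlikely((u16)vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY))
              kvm_machine_check();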
    • KVM: x86: move MSR_IA32_POWER_CTL handling to common code · 73f624f4
      Committed by Paolo Bonzini
      Make it available to AMD hosts as well, just in case someone is trying
      to use an Intel processor's CPUID setup.
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: vmx: segment limit check: use access length · fdb28619
      Committed by Eugene Korenevsky
      There is an imperfection in get_vmx_mem_address(): access length is ignored
      when checking the limit. To fix this, pass access length as a function argument.
      The access length is usually obvious since it is used by callers after
      the get_vmx_mem_address() call, but for vmread/vmwrite it depends on
      the state of 64-bit mode.
      Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
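
      The corrected check must cover the last byte of the access rather than
      just the first; its core is one comparison (sketch, variable names
      illustrative):

          /* An access of 'len' bytes at segment offset 'off' fits below
           * the limit only if its final byte does; the u64 cast avoids
           * wraparound for accesses near the top of the segment. */
          if ((u64)off + len - 1 > s.limit)
              exn = true;  /* #GP(0), or #SS(0) for SS-relative accesses */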
    • KVM: x86: Use DR_TRAP_BITS instead of hard-coded 15 · 1fc5d194
      Committed by Liran Alon
      Make all code consistent with kvm_deliver_exception_payload() by using
      appropriate symbolic constant instead of hard-coded number.
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
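
      For reference, the constant (from asm/debugreg.h) expands to exactly
      the hard-coded 15 it replaces:

          #define DR_TRAP0        (0x1)   /* db0 */
          #define DR_TRAP1        (0x2)   /* db1 */
          #define DR_TRAP2        (0x4)   /* db2 */
          #define DR_TRAP3        (0x8)   /* db3 */
          #define DR_TRAP_BITS    (DR_TRAP0|DR_TRAP1|DR_TRAP2|DR_TRAP3)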
  4. 05 June 2019, 3 commits
    • KVM: X86: Provide a capability to disable cstate msr read intercepts · b5170063
      Committed by Wanpeng Li
      Allow the guest to read CORE cstate MSRs when exposing host CPU power
      management capabilities to the guest.  PKG cstate reads remain
      restricted to prevent a guest from obtaining whole-package information
      in a multi-tenant scenario.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
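
      Userspace would opt in through the existing disable-exits capability;
      a sketch assuming the new flag introduced here is named
      KVM_X86_DISABLE_EXITS_CSTATE:

          #include <string.h>
          #include <sys/ioctl.h>
          #include <linux/kvm.h>

          /* vm_fd: an open VM file descriptor from KVM_CREATE_VM. */
          static int disable_cstate_exits(int vm_fd)
          {
              struct kvm_enable_cap cap;

              memset(&cap, 0, sizeof(cap));
              cap.cap = KVM_CAP_X86_DISABLE_EXITS;
              cap.args[0] = KVM_X86_DISABLE_EXITS_CSTATE;  /* assumed name */
              return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
          }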
    • KVM: Directly return result from kvm_arch_check_processor_compat() · f257d6dc
      Committed by Sean Christopherson
      Add a wrapper to invoke kvm_arch_check_processor_compat() so that the
      boilerplate ugliness of checking virtualization support on all CPUs is
      hidden from the arch specific code.  x86's implementation in particular
      is quite heinous, as it unnecessarily propagates the out-param pattern
      into kvm_x86_ops.
      
      While the x86 specific issue could be resolved solely by changing
      kvm_x86_ops, make the change for all architectures as returning a value
      directly is prettier and technically more robust, e.g. s390 doesn't set
      the out param, which could lead to subtle breakage in the (highly
      unlikely) scenario where the out-param was not pre-initialized by the
      caller.
      
      Opportunistically annotate svm_check_processor_compat() with __init.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
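
      The wrapper likely looks something like this sketch: the out-param
      pattern survives only at the cross-CPU call boundary, hidden inside
      common code:

          /* The arch hook now returns its result directly... */
          int kvm_arch_check_processor_compat(void);

          /* ...and common code hides the out-param required by the
           * smp_call_function_single() calling convention. */
          static void check_processor_compat(void *data)
          {
              *(int *)data = kvm_arch_check_processor_compat();
          }

          /* At init, run the check on every online CPU: */
          for_each_online_cpu(cpu) {
              smp_call_function_single(cpu, check_processor_compat, &r, 1);
              if (r < 0)
                  goto out;
          }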
    • KVM: LAPIC: Optimize timer latency further · b6c4bc65
      Committed by Wanpeng Li
      Lapic timer advancement tries to hide the hypervisor overhead between
      the host emulated timer firing and the guest becoming aware that the
      timer has fired.  However, it only hides the time between
      apic_timer_fn/handle_preemption_timer -> wait_lapic_expire, not up to
      the real point of VM-entry mentioned in the original commit d0659d94
      ("KVM: x86: add option to advance tscdeadline hrtimer expiration").
      There are 700+ CPU cycles between the end of wait_lapic_expire and the
      world switch on my Haswell desktop.
      
      This patch narrows the last gap (wait_lapic_expire -> world switch) by
      taking the real overhead between apic_timer_fn/handle_preemption_timer
      and the world switch into account when adaptively tuning timer
      advancement.  It reduces latency by 40% (~1600+ cycles down to ~1000+
      cycles on a Haswell desktop) for kvm-unit-tests/tscdeadline_latency
      when testing busy waits.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 25 May 2019, 1 commit
  6. 16 May 2019, 1 commit
    • Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU" · f93f7ede
      Committed by Sean Christopherson
      The RDPMC-exiting control is dependent on the existence of the RDPMC
      instruction itself, i.e. is not tied to the "Architectural Performance
      Monitoring" feature.  For all intents and purposes, the control exists
      on all CPUs with VMX support, since RDPMC also exists on all CPUs with
      VMX support.  Per Intel's SDM:
      
        The RDPMC instruction was introduced into the IA-32 Architecture in
        the Pentium Pro processor and the Pentium processor with MMX technology.
        The earlier Pentium processors have performance-monitoring counters, but
        they must be read with the RDMSR instruction.
      
      Because RDPMC-exiting always exists, KVM requires the control and refuses
      to load if it's not available.  As a result, hiding the PMU from a guest
      breaks nested virtualization if the guest attempts to use KVM.
      
      While it's not explicitly stated in the RDPMC pseudocode, the VM-Exit
      check for RDPMC-exiting follows standard fault vs. VM-Exit prioritization
      for privileged instructions, e.g. occurs after the CPL/CR0.PE/CR4.PCE
      checks, but before the counter referenced in ECX is checked for validity.
      
      In other words, the original KVM behavior of injecting a #GP was correct,
      and the KVM unit test needs to be adjusted accordingly, e.g. eat the #GP
      when the unit test guest (L3 in this case) executes RDPMC without
      RDPMC-exiting set in the unit test host (L2).
      
      This reverts commit e51bfdb6.
      
      Fixes: e51bfdb6 ("KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU")
      Reported-by: David Hill <hilld@binarystorm.net>
      Cc: Saar Amar <saaramar@microsoft.com>
      Cc: Mihai Carabas <mihai.carabas@oracle.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
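
      The fault-vs-exit priority described above, rendered as pseudocode in
      C syntax (an ordering sketch, not actual KVM or hardware code):

          /* How the CPU evaluates RDPMC: */
          if (cpl > 0 && !(cr4 & X86_CR4_PCE))
              inject_gp(0);                /* 1: privilege faults first */
          else if (rdpmc_exiting)
              vm_exit(EXIT_REASON_RDPMC);  /* 2: then the VM-Exit check */
          else if (!pmc_is_valid(ecx))
              inject_gp(0);                /* 3: ECX validity last */
          else
              rdpmc();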
  7. 01 May 2019, 8 commits