1. 25 May, 2018 1 commit
  2. 24 May, 2018 2 commits
    • KVM: x86: Update cpuid properly when CR4.OSXAVE or CR4.PKE is changed · c4d21882
      Wei Huang committed
      The CPUID bits OSXSAVE (function=0x1) and OSPKE (func=0x7, leaf=0x0)
      allow user apps to detect whether the OS has set CR4.OSXSAVE or CR4.PKE.
      KVM is supposed to update these CPUID bits when CR4 is updated. Current
      KVM code doesn't handle some special cases when updates come from the
      emulator. Here is one example:
      
        Step 1: guest boots
        Step 2: guest OS enables XSAVE ==> CR4.OSXSAVE=1 and CPUID.OSXSAVE=1
        Step 3: guest hot reboot ==> QEMU resets CR4 to 0, but CPUID.OSXSAVE==1
        Step 4: guest OS checks CPUID.OSXSAVE, detects 1, then executes xgetbv
      
      Step 4 above will cause a #UD and a guest crash because the guest OS hasn't
      turned on OSXSAVE yet. This patch solves the problem by comparing
      old_cr4 with cr4. If the related bits have changed,
      kvm_update_cpuid() needs to be called.
      Signed-off-by: Wei Huang <wei@redhat.com>
      Reviewed-by: Bandan Das <bsd@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
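The old_cr4/cr4 comparison described in the commit above can be sketched as follows. This is a minimal illustration, not the patch itself: the helper name `sketch_cr4_change_needs_cpuid_update` is hypothetical, and only the CR4 bit positions (OSXSAVE is bit 18, PKE is bit 22) are architectural facts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural CR4 bit positions: OSXSAVE is bit 18, PKE is bit 22. */
#define X86_CR4_OSXSAVE (UINT64_C(1) << 18)
#define X86_CR4_PKE     (UINT64_C(1) << 22)

/*
 * Hypothetical helper (not a KVM function): returns true when a CR4 write
 * flips a bit that is mirrored into CPUID, i.e. when the cached CPUID bits
 * would have to be recomputed.
 */
static bool sketch_cr4_change_needs_cpuid_update(uint64_t old_cr4,
                                                 uint64_t new_cr4)
{
    uint64_t changed = old_cr4 ^ new_cr4; /* bits that differ */

    return (changed & (X86_CR4_OSXSAVE | X86_CR4_PKE)) != 0;
}
```

In the reboot example, CR4 goes from OSXSAVE=1 to 0, so the XOR exposes the flipped bit and the CPUID refresh would be triggered regardless of which code path performed the write.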
    • x86/kvm: fix LAPIC timer drift when guest uses periodic mode · d8f2f498
      David Vrabel committed
      Since 4.10, commit 8003c9ae (KVM: LAPIC: add APIC Timer
      periodic/oneshot mode VMX preemption timer support), guests using
      periodic LAPIC timers (such as FreeBSD 8.4) would see their timers
      drift significantly over time.
      
      Differences in the underlying clocks and numerical errors mean the
      periods of the two timers (hv and sw) are not the same. This
      difference accumulates with every expiry, resulting in a large
      error between the hv and sw timers.
      
      This means the sw timer may be running slow when compared to the hv
      timer. When the timer is switched from hv to sw, the now active sw
      timer will expire late. The guest VCPU is reentered and it switches to
      using the hv timer. This timer catches up, injecting multiple IRQs
      into the guest (of which the guest only sees one as it does not get to
      run until the hv timer has caught up) and thus the guest's timer rate
      is low (and becomes increasingly slower over time as the sw timer lags
      further and further behind).
      
      I believe a similar problem would occur if the hv timer is the slower
      one, but I have not observed this.
      
      Fix this by synchronizing the deadlines for both timers to the same
      time source on every tick. This prevents the errors from accumulating.
      
      Fixes: 8003c9ae
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: David Vrabel <david.vrabel@nutanix.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
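The accumulation effect described in the commit above can be modelled with a small sketch. It is a toy model under stated assumptions, not the kernel code: the struct and function names are hypothetical, and the "common time source" is represented simply by re-deriving both deadlines from the same base on every tick.

```c
#include <stdint.h>

/* Hypothetical model: each timer keeps its own absolute deadline in ns. */
struct sketch_timer_pair {
    int64_t hv_deadline_ns;
    int64_t sw_deadline_ns;
};

/* Free-running: every expiry advances each deadline by its own period,
 * so the period mismatch is added to the skew on every tick. */
static void sketch_tick_unsynced(struct sketch_timer_pair *t,
                                 int64_t hv_period_ns, int64_t sw_period_ns)
{
    t->hv_deadline_ns += hv_period_ns;
    t->sw_deadline_ns += sw_period_ns;
}

/* Synchronized: both deadlines are recomputed from one shared base on
 * every tick, so the skew never exceeds one tick's worth of mismatch. */
static void sketch_tick_synced(struct sketch_timer_pair *t,
                               int64_t hv_period_ns, int64_t sw_period_ns)
{
    int64_t base = t->hv_deadline_ns; /* shared reference point */

    t->hv_deadline_ns = base + hv_period_ns;
    t->sw_deadline_ns = base + sw_period_ns;
}

/* Positive result: the sw deadline is later than the hv deadline. */
static int64_t sketch_skew_ns(const struct sketch_timer_pair *t)
{
    return t->sw_deadline_ns - t->hv_deadline_ns;
}
```

With a 100 ns mismatch on a 1 ms period, the free-running pair is 100 µs apart after 1000 ticks, while the synchronized pair stays 100 ns apart.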
  3. 15 May, 2018 1 commit
  4. 11 May, 2018 4 commits
  5. 06 May, 2018 1 commit
    • KVM: x86: remove APIC Timer periodic/oneshot spikes · ecf08dad
      Anthoine Bourgeois committed
      Since commit 8003c9ae ("KVM: LAPIC: add APIC Timer periodic/oneshot mode
      VMX preemption timer support"), a Windows 10 guest has shown erratic timer
      spikes.
      
      Here are the results for 150000 iterations of a 1ms timer without any load:

                      Before 8003c9ae | After 8003c9ae
      Max                  1834us     |     86000us
      Mean                 1100us     |      1021us
      Deviation              59us     |       149us

      Here are the results for 150000 iterations of a 1ms timer with a cpu-z stress test:

                      Before 8003c9ae | After 8003c9ae
      Max                 32000us     |    140000us
      Mean                 1006us     |      1997us
      Deviation             140us     |     11095us
      
      The root cause of the problem is that starting an hrtimer with an expiry
      time already in the past can take more than 20 milliseconds to trigger
      the timer function.  It can be solved by forwarding such past timers
      immediately, rather than submitting them to hrtimer_start().
      If the timer is periodic, update the target expiration and call
      hrtimer_start() with it.
      
      v2: Check if the tsc deadline is already expired. Thank you Mika.
      v3: Execute the past timers immediately rather than submitting them to
      hrtimer_start().
      v4: Rearm the periodic timer with advance_periodic_target_expiration(), a
      simpler version of set_target_expiration(). Thank you Paolo.
      
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Anthoine Bourgeois <anthoine.bourgeois@blade-group.com>
      Fixes: 8003c9ae ("KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer support")
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
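The control flow described in the commit above (handle an already-expired timer immediately, and for a periodic timer advance the target before arming) can be sketched as below. The struct, its fields and the function name are hypothetical stand-ins; the real code calls the APIC expiry handler and hrtimer_start(), which this model only records as counters and flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timer model; period_ns == 0 means one-shot. */
struct sketch_timer {
    int64_t target_expiration_ns;
    int64_t period_ns;
    unsigned int fired;        /* expiries handled immediately */
    bool armed;                /* a deadline was handed to the timer backend */
    int64_t armed_deadline_ns; /* the deadline that was armed */
};

static void sketch_start_timer(struct sketch_timer *t, int64_t now_ns)
{
    assert(t->period_ns >= 0);

    if (t->target_expiration_ns <= now_ns) {
        /* Already in the past: run the expiry path right away instead
         * of arming a timer with a stale deadline. */
        t->fired++;
        if (t->period_ns == 0)
            return; /* one-shot: nothing left to arm */
        /* Periodic: move the target one period forward before arming. */
        t->target_expiration_ns += t->period_ns;
    }

    t->armed = true;
    t->armed_deadline_ns = t->target_expiration_ns;
}
```

For example, a periodic timer with target 100 ns and period 1000 ns started at now = 150 ns fires once immediately and is armed for 1100 ns, so the backend never sees a past deadline in that case.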
  6. 28 Apr, 2018 1 commit
  7. 27 Apr, 2018 1 commit
  8. 16 Apr, 2018 2 commits
  9. 13 Apr, 2018 1 commit
  10. 11 Apr, 2018 2 commits
  11. 10 Apr, 2018 1 commit
    • X86/VMX: Disable VMX preemption timer if MWAIT is not intercepted · 386c6ddb
      KarimAllah Ahmed committed
      The VMX-preemption timer is used by KVM as a way to set deadlines for the
      guest (i.e. timer emulation). That was safe till very recently when
      capability KVM_X86_DISABLE_EXITS_MWAIT to disable intercepting MWAIT was
      introduced. According to Intel SDM 25.5.1:
      
      """
      The VMX-preemption timer operates in the C-states C0, C1, and C2; it also
      operates in the shutdown and wait-for-SIPI states. If the timer counts down
      to zero in any state other than the wait-for SIPI state, the logical
      processor transitions to the C0 C-state and causes a VM exit; the timer
      does not cause a VM exit if it counts down to zero in the wait-for-SIPI
      state. The timer is not decremented in C-states deeper than C2.
      """
      
      Now once the guest issues MWAIT with a C-state deeper than C2, the
      preemption timer will never wake it up again, since it has stopped
      ticking! Usually this is compensated for by other activity in the system
      that would wake the core from the deep C-state (and cause a VMExit), for
      example if the host itself is ticking or it receives interrupts.
      
      So disable the VMX-preemption timer if MWAIT is exposed to the guest!
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
      Fixes: 4d5422ce
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
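The decision made by the commit above reduces to a single predicate, sketched here. The struct, field and function names are assumptions made for illustration and are not KVM's actual identifiers.

```c
#include <stdbool.h>

/*
 * Hypothetical per-VM state for this sketch; the field name is an
 * assumption, not KVM's actual structure.
 */
struct sketch_vm {
    bool mwait_in_guest; /* true when MWAIT is not intercepted */
};

/*
 * The preemption timer is only a reliable deadline source if the guest
 * cannot put the core into a C-state deeper than C2 on its own.
 */
static bool sketch_can_use_preemption_timer(bool cpu_has_preemption_timer,
                                            const struct sketch_vm *vm)
{
    return cpu_has_preemption_timer && !vm->mwait_in_guest;
}
```

When the predicate is false, timer emulation would have to fall back to a host-side software timer, which keeps running regardless of the guest's C-state.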
  12. 07 Apr, 2018 1 commit
  13. 05 Apr, 2018 7 commits
    • kvm: x86: fix a compile warning · 3140c156
      Peng Hao committed
      Fix a "warning: no previous prototype".
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Add Force Emulation Prefix for "emulate the next instruction" · 6c86eedc
      Wanpeng Li committed
      There is no easy way to force KVM to run an instruction through the emulator
      (by design, as that would expose the x86 emulator as a significant attack surface).
      However, we do wish to expose the x86 emulator when we are testing it
      (e.g. via kvm-unit-tests). Therefore, this patch adds a "force emulation prefix"
      that is designed to raise #UD, which KVM will trap; its #UD exit handler will
      match the "force emulation prefix" and run the instruction after the prefix through
      the x86 emulator. To avoid exposing the x86 emulator by default, we add a module
      parameter that should be off by default.
      
      A simple test case:
      
          #include <stdio.h>
          #include <string.h>
      
          #define HYPERVISOR_INFO 0x40000000
      
          /* ud2a followed by "kvm" is the force emulation prefix; KVM then
           * emulates the cpuid that follows. idx is passed in eax. */
          #define CPUID(idx, eax, ebx, ecx, edx) \
              asm volatile (\
              "ud2a; .ascii \"kvm\"; cpuid" \
              :"=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx) \
                  :"0"(idx) )
      
          int main(void)
          {
              unsigned int eax, ebx, ecx, edx;
              char string[13];
      
              CPUID(HYPERVISOR_INFO, &eax, &ebx, &ecx, &edx);
              /* The hypervisor signature is returned in ebx, ecx, edx. */
              memcpy(string + 0, &ebx, 4);
              memcpy(string + 4, &ecx, 4);
              memcpy(string + 8, &edx, 4);
      
              string[12] = 0;
              if (strncmp(string, "KVMKVMKVM\0\0\0", 12) == 0)
                  printf("kvm guest\n");
              else
                  printf("bare hardware\n");
              return 0;
          }
      Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      [Correctly handle usermode exits. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
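The prefix match performed by the #UD handler in the commit above can be sketched as a byte comparison. The byte sequence follows from the test case in the commit message (ud2a encodes as 0x0f 0x0b, followed by ASCII "kvm"); the function name, the `enabled` flag standing in for the module parameter, and the return convention are assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* ud2a (0x0f 0x0b) followed by ASCII "kvm", as emitted by the test case. */
static const unsigned char sketch_force_emulation_prefix[] = {
    0x0f, 0x0b, 'k', 'v', 'm'
};

/*
 * Hypothetical helper: returns the number of bytes to skip (the prefix
 * length) when the feature is enabled and the faulting instruction bytes
 * start with the prefix, or 0 when the #UD should be handled normally.
 */
static size_t sketch_match_force_emulation_prefix(const unsigned char *insn,
                                                  size_t len, bool enabled)
{
    const size_t plen = sizeof(sketch_force_emulation_prefix);

    if (!enabled || len < plen)
        return 0;
    if (memcmp(insn, sketch_force_emulation_prefix, plen) != 0)
        return 0;
    return plen;
}
```

A non-zero return would let the caller advance the instruction pointer past the prefix and hand the following instruction (cpuid in the test case) to the emulator.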
    • KVM: X86: Introduce handle_ud() · 082d06ed
      Wanpeng Li committed
      Introduce handle_ud() to handle invalid opcodes; this function will be
      used by later patches.
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: vmx: unify adjacent #ifdefs · 4fde8d57
      Paolo Bonzini committed
      vmx_save_host_state has multiple ifdefs for CONFIG_X86_64 that have
      no other code between them.  Simplify by reducing them to a single
      conditional.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86: kvm: hide the unused 'cpu' variable · 51e8a8cc
      Arnd Bergmann committed
      The local variable was newly introduced but is only accessed in one
      place on x86_64, but not on 32-bit:
      
      arch/x86/kvm/vmx.c: In function 'vmx_save_host_state':
      arch/x86/kvm/vmx.c:2175:6: error: unused variable 'cpu' [-Werror=unused-variable]
      
      This puts it into another #ifdef.
      
      Fixes: 35060ed6 ("x86/kvm/vmx: avoid expensive rdmsr for MSR_GS_BASE")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: remove bogus WARN_ON in handle_ept_misconfig · c75d0edc
      Sean Christopherson committed
      Remove the WARN_ON in handle_ept_misconfig() as it is unnecessary
      and causes false positives.  Return the unmodified result of
      kvm_mmu_page_fault() instead of converting a system error code to
      KVM_EXIT_UNKNOWN so that userspace sees the error code of the
      actual failure, not a generic "we don't know what went wrong".
      
        * kvm_mmu_page_fault() will WARN if reserved bits are set in the
          SPTEs, i.e. it covers the case where an EPT misconfig occurred
          because of a KVM bug.
      
        * The WARN_ON will fire on any system error code that is hit while
          handling the fault, e.g. -ENOMEM from mmu_topup_memory_caches()
          while handling a legitimate MMIO EPT misconfig or -EFAULT from
          kvm_handle_bad_page() if the corresponding HVA is invalid.  In
          either case, userspace should receive the original error code
          and firing a warning is incorrect behavior as KVM is operating
          as designed.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: X86: Fix SMRAM accessing even if VM is shutdown" · 2c151b25
      Sean Christopherson committed
      The bug that led to commit 95e057e2
      was a benign warning (no adverse effects other than the warning
      itself) that was detected by syzkaller.  Further inspection shows
      that the WARN_ON in question, in handle_ept_misconfig(), is
      unnecessary and flawed (this was also briefly discussed in the
      original patch: https://patchwork.kernel.org/patch/10204649).
      
        * The WARN_ON is unnecessary as kvm_mmu_page_fault() will WARN
          if reserved bits are set in the SPTEs, i.e. it covers the case
          where an EPT misconfig occurred because of a KVM bug.
      
        * The WARN_ON is flawed because it will fire on any system error
          code that is hit while handling the fault, e.g. -ENOMEM can be
          returned by mmu_topup_memory_caches() while handling a legitimate
          MMIO EPT misconfig.
      
      The original behavior of returning -EFAULT when userspace munmaps
      an HVA without first removing the memslot is correct and desirable,
      i.e. KVM is letting userspace know it has generated a bad address.
      Returning RET_PF_EMULATE masks the WARN_ON in the EPT misconfig path,
      but does not fix the underlying bug, i.e. the WARN_ON is bogus.
      
      Furthermore, returning RET_PF_EMULATE has the unwanted side effect of
      causing KVM to attempt to emulate an instruction on any page fault
      with an invalid HVA translation, e.g. a not-present EPT violation
      on a VM_PFNMAP VMA whose fault handler failed to insert a PFN.
      
        * There is no guarantee that the fault is directly related to the
          instruction, i.e. the fault could have been triggered by a side
          effect memory access in the guest, e.g. while vectoring a #DB or
          writing a tracing record.  This could cause KVM to effectively
          mask the fault if KVM doesn't model the behavior leading to the
          fault, i.e. emulation could succeed and resume the guest.
      
        * If emulation does fail, KVM will return EMULATION_FAILED instead
          of -EFAULT, which is a red herring as the user will either debug
          a bogus emulation attempt or scratch their head wondering why we
          were attempting emulation in the first place.
      
      TL;DR: revert to returning -EFAULT and remove the bogus WARN_ON in
      handle_ept_misconfig in a future patch.
      
      This reverts commit 95e057e2.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  14. 04 Apr, 2018 2 commits
    • kvm: Add emulation for movups/movupd · 29916968
      Stefan Fritsch committed
      This is very similar to the aligned versions movaps/movapd.
      
      We have seen the corresponding emulation failures with OpenBSD as a guest
      and with Windows 10 with Intel HD Graphics passthrough.
      Signed-off-by: Christian Ehrhardt <christian_ehrhardt@genua.de>
      Signed-off-by: Stefan Fritsch <sf@sfritsch.de>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: raise internal error for exception during invalid protected mode state · add5ff7a
      Sean Christopherson committed
      Exit to userspace with KVM_INTERNAL_ERROR_EMULATION if we encounter
      an exception in Protected Mode while emulating guest due to invalid
      guest state.  Unlike Big RM, KVM doesn't support emulating exceptions
      in PM, i.e. PM exceptions are always injected via the VMCS.  Because
      we will never do VMRESUME due to emulation_required, the exception is
      never realized and we'll keep emulating the faulting instruction over
      and over until we receive a signal.
      
      Exit to userspace iff there is a pending exception, i.e. don't exit
      simply on a requested event. The purpose of this check and exit is to
      aid in debugging a guest that is in all likelihood already doomed.
      Invalid guest state in PM is extremely limited in normal operation,
      e.g. it generally only occurs for a few instructions early in BIOS,
      and any exception at this time is all but guaranteed to be fatal.
      Non-vectored interrupts, e.g. INIT, SIPI and SMI, can be cleanly
      handled/emulated, while checking for vectored interrupts, e.g. INTR
      and NMI, without hitting false positives would add a fair amount of
      complexity for almost no benefit (getting hit by lightning seems
      more likely than encountering this specific scenario).
      
      Add a WARN_ON_ONCE to vmx_queue_exception() if we try to inject an
      exception via the VMCS and emulation_required is true.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  15. 29 Mar, 2018 12 commits
    • KVM: nVMX: Optimization: Dont set KVM_REQ_EVENT when VMExit with nested_run_pending · f497b6c2
      Liran Alon committed
      When vCPU runs L2 and there is a pending event that requires to exit
      from L2 to L1 and nested_run_pending=1, vcpu_enter_guest() will request
      an immediate-exit from L2 (See req_immediate_exit).
      
      Since now handling of req_immediate_exit also makes sure to set
      KVM_REQ_EVENT, there is no need to also set it on vmx_vcpu_run() when
      nested_run_pending=1.
      
      This optimizes cases where VMRESUME was executed by L1 to enter L2 and
      there are no pending events that require an exit from L2 to L1. Previously,
      this would have set KVM_REQ_EVENT unnecessarily.
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending · 1a680e35
      Liran Alon committed
      If L2 VMExits to L0 during event delivery, VMCS02 is filled with
      IDT-vectoring-info, which vmx_complete_interrupts() makes sure to
      reinject before the next resume of L2.
      
      While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
      to the L1 vCPU which currently runs L2 and exited to L0.
      
      When L0 reaches vcpu_enter_guest() and calls inject_pending_event(),
      it notes that a previous event was re-injected to L2 (by
      IDT-vectoring-info) and therefore doesn't check whether there are pending
      L1 events which require an exit from L2 to L1. Thus, L0 enters L2 without
      an immediate VMExit even though there are pending L1 events!
      
      This commit fixes the issue by making sure to check for L1 pending
      events even if a previous event was reinjected to L2 and bailing out
      from inject_pending_event() before evaluating a new pending event in
      case an event was already reinjected.
      
      The bug was observed with the following setup:
      * L0 is a 64-CPU machine which runs KVM.
      * L1 is a 16-CPU machine which runs KVM.
      * L0 & L1 run with APICv disabled.
      (Also reproduced with APICv enabled, but the info below is easier to
      analyze with APICv disabled.)
      * L1 runs a 16-CPU L2 Windows Server 2012 R2 guest.
      During L2 boot, L1 hangs completely, and analyzing the hang reveals that
      one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
      that it has sent to another L1 vCPU. All other L1 vCPUs are
      currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
      forever (as L1 runs with kernel preemption disabled).
      
      Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
      series of events:
      (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
      0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
      ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
      (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
      vec 252 (Fixed|edge)
      (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
      (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
      (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
      rip 0xffffe00069202690 info 83 0
      (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
      0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
      ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
      (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
      EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
      ext_int: 0x00000000 ext_int_err: 0x00000000
      (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
      
      Which can be analyzed as follows:
      (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
      Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
      a pending-interrupt of 0xd2.
      (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
      vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT.
      (3) L0 reaches vcpu_enter_guest(), which calls inject_pending_event(), which
      notes that interrupt 0xd2 was reinjected and therefore calls
      vmx_inject_irq() and returns, without checking for pending L1 events!
      Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
      before calling inject_pending_event().
      (4) L0 resumes L2 without immediate-exit even though there is a pending
      L1 event (the IPI pending in LAPIC's IRR).
      
      We have already reached the buggy scenario, but the events can be
      analyzed further:
      (5+6) L2 VMExit to L0 on EPT_VIOLATION.  This time not during
      event-delivery.
      (7) L0 decides to forward the VMExit to L1 for further handling.
      (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
      LAPIC's IRR is not examined and therefore the IPI is still not delivered
      into L1!
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: x86: Fix misleading comments on handling pending exceptions · a042c26f
      Liran Alon committed
      The reason that exception.pending should block re-injection of
      NMI/interrupt is not described correctly in comment in code.
      Instead, it describes why a pending exception should be injected
      before a pending NMI/interrupt.
      
      Therefore, move currently present comment to code-block evaluating
      a new pending event which explains why exception.pending is evaluated
      first.
      In addition, create a new comment describing that exception.pending
      blocks re-injection of NMI/interrupt because the exception was
      queued by handling vmexit which was due to NMI/interrupt delivery.
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@orcle.com>
      [Used a comment from Sean J <sean.j.christopherson@intel.com>. - Radim]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: x86: Rename interrupt.pending to interrupt.injected · 04140b41
      Liran Alon committed
      For exception and NMI events, KVM code uses the following
      coding convention:
      *) "pending" represents an event that should be injected into the guest at
      some point but whose side effects have not yet occurred.
      *) "injected" represents an event whose side effects have already
      occurred.
      
      However, interrupts don't conform to this coding convention.
      All current code flows mark interrupt.pending when its side effects
      have already taken place (for example, the bit moved from LAPIC IRR to
      ISR). Therefore, it makes sense to just rename
      interrupt.pending to interrupt.injected.
      
      This change follows logic of previous commit 664f8e26 ("KVM: X86:
      Fix loss of exception which has not yet been injected") which changed
      exception to follow this coding convention as well.
      
      It is important to note that in the !lapic_in_kernel(vcpu) case,
      interrupt.pending usage was and still is incorrect.
      In this case, interrupt.pending can only be set using one of the
      following ioctls: KVM_INTERRUPT, KVM_SET_VCPU_EVENTS and
      KVM_SET_SREGS. Looking at how QEMU uses these ioctls, one can see that
      QEMU uses them either to re-set an "interrupt.pending" state it has
      received from KVM (via KVM_GET_VCPU_EVENTS interrupt.pending or
      via KVM_GET_SREGS interrupt_bitmap) or to dispatch a new interrupt
      from QEMU's emulated LAPIC, which resets the bit in IRR and sets the bit
      in ISR before sending the ioctl to KVM. So it seems that "interrupt.pending"
      in this case is indeed also supposed to represent "interrupt.injected".
      However, kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr()
      are misusing (the now named) interrupt.injected in order to report whether
      there is a pending interrupt.
      This leads to nVMX/nSVM not being able to distinguish whether it should exit
      from L2 to L1 on EXTERNAL_INTERRUPT for a pending interrupt or should
      re-inject an injected interrupt.
      Therefore, add a FIXME to these functions for handling this issue.
      
      This patch introduces no semantic change.
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt · 7c5a6a59
      Liran Alon committed
      kvm_inject_realmode_interrupt() is called from one of the injection
      functions which writes event-injection to VMCS: vmx_queue_exception(),
      vmx_inject_irq() and vmx_inject_nmi().
      
      All these functions are called just to cause an event injection into the
      guest. They are not responsible for manipulating the event-pending
      flag. The only purpose of kvm_inject_realmode_interrupt() should be
      to emulate real-mode interrupt injection.
      
      This was also incorrect when called from vmx_queue_exception().
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • x86/kvm: use Enlightened VMCS when running on Hyper-V · 773e8a04
      Vitaly Kuznetsov committed
      Enlightened VMCS is just a structure in memory. The main benefit,
      besides avoiding the somewhat slower VMREAD/VMWRITE, is the clean field
      mask: we tell the underlying hypervisor which fields were modified
      since VMEXIT so there's no need to inspect them all.
      
      Tight CPUID loop test shows significant speedup:
      Before: 18890 cycles
      After: 8304 cycles
      
      Static key is being used to avoid performance penalty for non-Hyper-V
      deployments.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
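The clean field mask idea from the commit above can be illustrated with a toy model. Everything here is hypothetical: the group names, the three-group layout and both function names are invented for the sketch and do not reflect the real Enlightened VMCS layout.

```c
#include <stdint.h>

/* Hypothetical field groups; the real structure has its own grouping. */
enum sketch_group {
    SKETCH_GROUP_CONTROL = 0,
    SKETCH_GROUP_GUEST = 1,
    SKETCH_GROUP_HOST = 2,
    SKETCH_GROUP_COUNT = 3
};

#define SKETCH_ALL_CLEAN ((1u << SKETCH_GROUP_COUNT) - 1)

struct sketch_evmcs {
    uint64_t fields[SKETCH_GROUP_COUNT];
    uint32_t clean_mask; /* bit set: group unchanged since the last entry */
};

/* Guest-hypervisor side: a plain memory store plus a dirty mark. */
static void sketch_evmcs_write(struct sketch_evmcs *v, enum sketch_group g,
                               uint64_t val)
{
    v->fields[g] = val;
    v->clean_mask &= ~(1u << g);
}

/* Underlying-hypervisor side: count the groups that must be re-read,
 * then mark everything clean again. */
static unsigned int sketch_hv_consume(struct sketch_evmcs *v)
{
    unsigned int reloaded = 0;

    for (unsigned int g = 0; g < SKETCH_GROUP_COUNT; g++)
        if (!(v->clean_mask & (1u << g)))
            reloaded++;
    v->clean_mask = SKETCH_ALL_CLEAN;
    return reloaded;
}
```

In this model, the first entry re-reads every group, while a later entry after a single write re-reads only the group that was touched.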
    • x86/kvm: rename HV_X64_MSR_APIC_ASSIST_PAGE to HV_X64_MSR_VP_ASSIST_PAGE · d4abc577
      Ladi Prosek committed
      The assist page has been used only for the paravirtual EOI so far, hence
      the "APIC" in the MSR name. Renaming to match the Hyper-V TLFS where it's
      called "Virtual VP Assist MSR".
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: SVM: Implement pause loop exit logic in SVM · 8566ac8b
      Babu Moger committed
      Bring the PLE (pause loop exit) logic to the AMD svm driver.
      
      While testing, we found this helps in situations where numerous
      pauses are generated. Without these patches we could see continuous
      VMEXITs due to pause interceptions. Tested on an AMD EPYC server with
      boot parameter idle=poll on a VM with 32 vcpus to simulate extensive
      pause behaviour. Here are the VMEXITs in a 10-second interval.
      
      Pauses                  810199                  504
      Total                   882184                  325415
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      [Prevented the window from dropping below the initial value. - Radim]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: SVM: Add pause filter threshold · 1d8fb44a
      Babu Moger committed
      This patch adds the support for pause filtering threshold. This feature
      support is indicated by CPUID Fn8000_000A_EDX. See AMD APM Vol 2 Section
      15.14.4 Pause Intercept Filtering for more details.
      
      In this mode, a 16-bit pause filter threshold field is added in VMCB.
      The threshold value is a cycle count that is used to reset the pause
      counter.  As with simple pause filtering, VMRUN loads the pause count
      value from VMCB into an internal counter. Then, on each pause instruction
      the hardware checks the elapsed number of cycles since the most recent
      pause instruction against the pause Filter Threshold. If the elapsed cycle
      count is greater than the pause filter threshold, then the internal pause
      count is reloaded from VMCB and execution continues. If the elapsed cycle
      count is less than the pause filter threshold, then the internal pause
      count is decremented. If the count value is less than zero and pause
      intercept is enabled, a #VMEXIT is triggered. If advanced pause filtering
      is supported and pause filter threshold field is set to zero, the filter
      will operate in the simpler, count only mode.
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
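The hardware behaviour described in the commit above can be simulated in a few lines. This is a software model under stated assumptions, not hardware-accurate code: all names are hypothetical, the pause intercept is assumed to be enabled, the elapsed-time reference is assumed to start at VMRUN, and an elapsed count exactly equal to the threshold (which the description leaves open) is treated as "not greater", i.e. it decrements.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the pause filter state. */
struct sketch_pause_filter {
    uint16_t filter_count;     /* VMCB pause filter count */
    uint16_t filter_thresh;    /* VMCB threshold; 0 selects count-only mode */
    int32_t counter;           /* internal counter loaded at VMRUN */
    uint64_t last_pause_cycle; /* cycle of the most recent PAUSE (assumed
                                  to start at VMRUN) */
};

/* VMRUN loads the internal counter from the VMCB value. */
static void sketch_pf_vmrun(struct sketch_pause_filter *pf, uint64_t now)
{
    pf->counter = pf->filter_count;
    pf->last_pause_cycle = now;
}

/* Returns true if this PAUSE would trigger a #VMEXIT. */
static bool sketch_pf_on_pause(struct sketch_pause_filter *pf, uint64_t now)
{
    uint64_t elapsed = now - pf->last_pause_cycle;

    pf->last_pause_cycle = now;

    /* Pauses far apart do not look like a spin loop: reload and go on. */
    if (pf->filter_thresh != 0 && elapsed > pf->filter_thresh) {
        pf->counter = pf->filter_count;
        return false;
    }

    /* Closely spaced pause (or count-only mode): burn one credit. */
    pf->counter--;
    return pf->counter < 0;
}
```

With a count of 2 and a threshold of 100 cycles, three pauses 10 cycles apart exhaust the counter and exit on the third, whereas a gap longer than the threshold between pauses reloads the counter and avoids the exit.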
    • KVM: VMX: Bring the common code to header file · c8e88717
      Babu Moger committed
      This patch moves some of the code from vmx to the x86.h header file so that
      it can be shared between vmx and svm. A couple of functions were modified
      to make them common.
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: VMX: Remove ple_window_actual_max · 18abdc34
      Babu Moger committed
      Get rid of ple_window_actual_max, because its benefits are really
      minuscule and the logic is complicated.
      
      The overflows (and underflow) are controlled in __ple_window_grow
      and _ple_window_shrink respectively.
      Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      [Fixed potential wraparound and change the max to UINT_MAX. - Radim]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
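One way to keep window growth and shrinkage free of wraparound, in the spirit of the commit above, is to do the growth arithmetic in a wider type and clamp afterwards. The sketch below is an illustration only: the function names and the rule "modifier below base multiplies or divides, otherwise adds or subtracts" are assumptions made for this example, and it assumes `unsigned int` is 32 bits wide.

```c
#include <limits.h>
#include <stdint.h>

/* Default ceiling used in this sketch. */
#define SKETCH_WINDOW_MAX UINT_MAX

/* Grow val, clamping to max. The arithmetic is done in 64 bits so that
 * neither the multiplication nor the addition can wrap around. */
static unsigned int sketch_grow_window(unsigned int val, unsigned int base,
                                       unsigned int modifier,
                                       unsigned int max)
{
    uint64_t ret = val;

    if (modifier < 1)
        return base; /* growth disabled: fall back to the base value */
    if (modifier < base)
        ret *= modifier;
    else
        ret += modifier;
    return ret < max ? (unsigned int)ret : max;
}

/* Shrink val, never going below min and never wrapping below zero. */
static unsigned int sketch_shrink_window(unsigned int val, unsigned int base,
                                         unsigned int modifier,
                                         unsigned int min)
{
    if (modifier < 1)
        return base; /* shrinking disabled: reset to the base value */
    if (modifier < base)
        val /= modifier;
    else
        val = val > modifier ? val - modifier : 0;
    return val > min ? val : min;
}
```

For example, growing a window that is already near `SKETCH_WINDOW_MAX` by a factor of 2 saturates at the ceiling instead of wrapping to a small value.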
    • KVM: VMX: Fix the module parameters for vmx · 7fbc85a5
      Babu Moger committed
      The vmx module parameters are supposed to be unsigned variants.
      
      Also fixed the checkpatch errors like the one below.
      
      WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'.
      +module_param(ple_gap, uint, S_IRUGO);
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      [Expanded uint to unsigned int in code. - Radim]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  16. 28 Mar, 2018 1 commit
    • KVM: x86: Fix perf timer mode IP reporting · dd60d217
      Andi Kleen committed
      KVM and perf have a special backdoor mechanism to report the IP for interrupts
      re-executed after vm exit. This works for the NMIs that perf normally uses.
      
      However when perf is in timer mode it doesn't work because the timer interrupt
      doesn't get this special treatment. This is common when KVM is running
      nested in another hypervisor which may not implement the PMU, so only
      timer mode is available.
      
      Call the functions to set up the backdoor IP also for non NMI interrupts.
      
      I renamed the functions to set up the backdoor IP reporting to be more
      appropriate for their new use.  The SVM change is only compile tested.
      
      v2: Moved the functions inline.
      For the normal interrupt case the before/after functions are now
      called from x86.c, not arch specific code.
      For the NMI case we still need to call it in the architecture
      specific code, because it's already needed in the low level *_run
      functions.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      [Removed unnecessary calls from arch handle_external_intr. - Radim]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>