1. 12 Jun 2018, 2 commits
    • kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor access · 3c9fa24c
      Committed by Paolo Bonzini
      The functions that were used in the emulation of fxrstor, fxsave, sgdt and
      sidt were originally meant for task switching, and as such they did not
      check privilege levels.  This is very bad when the same functions are used
      in the emulation of unprivileged instructions.  This is CVE-2018-10853.
      
      The obvious fix is to add a new argument to ops->read_std and ops->write_std,
      which decides whether the access is a "system" access or should use the
      processor's CPL.
      
      Fixes: 129a72a0 ("KVM: x86: Introduce segmented_write_std", 2017-01-12)
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
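      As a rough illustration of the new "system" argument (a standalone
      model, not the kernel's emulator code; the struct and helper names
      below are invented for this sketch), the idea is that one accessor can
      either bypass privilege checks for system accesses such as task
      switches, or enforce the current CPL for emulated unprivileged
      instructions:

          #include <stdbool.h>
          #include <stdio.h>

          /* Hypothetical model of a segmented data-access check: a "system"
           * access (e.g. task switching) bypasses the privilege check, while
           * an access on behalf of an unprivileged instruction must respect
           * the current privilege level (CPL) against the segment's DPL. */
          struct segment { int dpl; };

          static bool data_access_ok(const struct segment *seg, int cpl, bool system)
          {
              return system || cpl <= seg->dpl;
          }

          int main(void)
          {
              struct segment kernel_seg = { .dpl = 0 };

              /* Task-switch style access from CPL 3: allowed (old behaviour). */
              printf("system access: %s\n",
                     data_access_ok(&kernel_seg, 3, true) ? "ok" : "fault");
              /* Emulated fxsave/sgdt issued at CPL 3: must now be rejected. */
              printf("user access:   %s\n",
                     data_access_ok(&kernel_seg, 3, false) ? "ok" : "fault");
              return 0;
          }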
    • KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_system · ce14e868
      Committed by Paolo Bonzini
      In the next patch the emulator's .read_std and .write_std callbacks will
      grow another argument, which is not needed in kvm_read_guest_virt and
      kvm_write_guest_virt_system's callers.  Since we have to make separate
      functions, let's give the currently existing names a nicer interface, too.
      
      Fixes: 129a72a0 ("KVM: x86: Introduce segmented_write_std", 2017-01-12)
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 04 Jun 2018, 1 commit
    • KVM: VMX: Optimize tscdeadline timer latency · c5ce8235
      Committed by Wanpeng Li
      Commit d0659d94 ("KVM: x86: add option to advance tscdeadline
      hrtimer expiration") advances the tscdeadline expiration (the timer is
      emulated by an hrtimer) so that the latency incurred by the hypervisor
      (apic_timer_fn -> vmentry) can be avoided. This patch adds the same
      advance-expiration support for the case where the tscdeadline timer is
      emulated by the VMX preemption timer, to reduce the hypervisor latency
      (handle_preemption_timer -> vmentry). The guest can also set an
      expiration that is very small (for example, in Linux an hrtimer can
      feed an expiration in the past); in that case we set delta_tsc to 0,
      leading to an immediate vmexit when delta_tsc is not bigger than the
      advance in ns.
      
      This patch can reduce ~63% latency (~4450 cycles to ~1660 cycles on
      a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
      busy waits.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
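      The clamping described above can be modeled roughly as follows (a
      simplified standalone sketch; the function and variable names are
      invented here, and the real code converts the advance from
      nanoseconds into TSC cycles using the vCPU's TSC frequency):

          #include <stdint.h>
          #include <stdio.h>

          /* Simplified model: arm the preemption timer a bit earlier than the
           * guest's tscdeadline so that the hypervisor-side latency
           * (handle_preemption_timer -> vmentry) is hidden.  If the deadline
           * is already in the past, or closer than the advance, arm it with 0
           * so the vmexit happens immediately. */
          static uint64_t preemption_timer_delta(uint64_t guest_tsc,
                                                 uint64_t tscdeadline,
                                                 uint64_t advance_cycles)
          {
              uint64_t delta = tscdeadline > guest_tsc ? tscdeadline - guest_tsc : 0;

              return delta > advance_cycles ? delta - advance_cycles : 0;
          }

          int main(void)
          {
              /* Deadline 10000 cycles away, 3000 cycles of advance: arm for 7000. */
              printf("%llu\n",
                     (unsigned long long)preemption_timer_delta(100000, 110000, 3000));
              /* Deadline already in the past: arm for 0 -> immediate vmexit. */
              printf("%llu\n",
                     (unsigned long long)preemption_timer_delta(100000, 99000, 3000));
              return 0;
          }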
  3. 02 Jun 2018, 1 commit
  4. 26 May 2018, 1 commit
  5. 23 May 2018, 1 commit
  6. 15 May 2018, 3 commits
  7. 11 May 2018, 2 commits
  8. 16 Apr 2018, 2 commits
  9. 11 Apr 2018, 1 commit
    • X86/KVM: Do not allow DISABLE_EXITS_MWAIT when LAPIC ARAT is not available · 8e9b29b6
      Committed by KarimAllah Ahmed
      If the processor does not have an "Always Running APIC Timer" (aka ARAT),
      we should not give guests direct access to MWAIT. The LAPIC timer would
      stop ticking in deep C-states, so any host deadlines would not wake up
      the host kernel.

      The host kernel's intel_idle driver handles this by switching to broadcast
      mode when ARAT is not available and MWAIT is issued with a deep C-state
      that would stop the LAPIC timer. When MWAIT is passed through, we cannot
      tell when MWAIT is issued.

      So just disable this capability when LAPIC ARAT is not available. I am not
      even sure whether there are any CPUs with VMX support but no LAPIC ARAT.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Reported-by: Wanpeng Li <kernellwp@gmail.com>
      Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
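      A standalone sketch of the gating logic (the kernel performs the
      equivalent checks with CPU-feature helpers inside its capability code;
      the struct and function below are stand-ins invented for this
      illustration):

          #include <stdbool.h>
          #include <stdio.h>

          struct cpu_features {
              bool mwait;
              bool arat;   /* Always Running APIC Timer */
          };

          /* Without ARAT the LAPIC timer may stop in deep C-states entered via
           * a passed-through MWAIT, so keep MWAIT exiting enabled in that case. */
          static bool can_pass_through_mwait(const struct cpu_features *c)
          {
              return c->mwait && c->arat;
          }

          int main(void)
          {
              struct cpu_features with_arat = { .mwait = true, .arat = true };
              struct cpu_features without_arat = { .mwait = true, .arat = false };

              printf("ARAT present: allow DISABLE_EXITS_MWAIT = %d\n",
                     can_pass_through_mwait(&with_arat));
              printf("ARAT missing: allow DISABLE_EXITS_MWAIT = %d\n",
                     can_pass_through_mwait(&without_arat));
              return 0;
          }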
  10. 05 Apr 2018, 3 commits
    • kvm: x86: fix a compile warning · 3140c156
      Committed by Peng Hao
      Fix a "warning: no previous prototype" compiler warning.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Add Force Emulation Prefix for "emulate the next instruction" · 6c86eedc
      Committed by Wanpeng Li
      There is no easy way to force KVM to run an instruction through the emulator
      (by design, as that would expose the x86 emulator as a significant attack
      surface). However, we do wish to expose the x86 emulator when testing it
      (e.g. via kvm-unit-tests). Therefore, this patch adds a "force emulation
      prefix" that is designed to raise #UD; KVM traps the #UD, and its #UD exit
      handler matches the "force emulation prefix" and runs the instruction that
      follows the prefix through the x86 emulator. To avoid exposing the x86
      emulator by default, this is controlled by a module parameter that is off
      by default.
      
      A simple testcase here:
      
          #include <stdio.h>
          #include <string.h>
      
          #define HYPERVISOR_INFO 0x40000000
      
          #define CPUID(idx, eax, ebx, ecx, edx) \
              asm volatile (\
              "ud2a; .ascii \"kvm\"; cpuid" \
              :"=b" (*ebx), "=a" (*eax), "=c" (*ecx), "=d" (*edx) \
                  :"0"(idx) );
      
          void main()
          {
              unsigned int eax, ebx, ecx, edx;
              char string[13];
      
              CPUID(HYPERVISOR_INFO, &eax, &ebx, &ecx, &edx);
              *(unsigned int *)(string + 0) = ebx;
              *(unsigned int *)(string + 4) = ecx;
              *(unsigned int *)(string + 8) = edx;
      
              string[12] = 0;
              if (strncmp(string, "KVMKVMKVM\0\0\0", 12) == 0)
                  printf("kvm guest\n");
              else
                  printf("bare hardware\n");
          }
      Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      [Correctly handle usermode exits. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Introduce handle_ud() · 082d06ed
      Committed by Wanpeng Li
      Introduce handle_ud() to handle invalid opcode (#UD) exceptions; this
      function will be used by later patches.
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  11. 29 Mar 2018, 5 commits
    • KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending · 1a680e35
      Committed by Liran Alon
      In case of an L2 VMExit to L0 during event delivery, vmcs02 is filled with
      the IDT-vectoring info, which vmx_complete_interrupts() makes sure to
      reinject before the next resume of L2.

      While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
      to the L1 vCPU which currently runs L2 and exited to L0.

      When L0 reaches vcpu_enter_guest() and calls inject_pending_event(),
      it will note that a previous event was re-injected to L2 (by
      IDT-vectoring info) and therefore won't check if there are pending L1
      events which require an exit from L2 to L1. Thus, L0 enters L2 without
      an immediate VMExit even though there are pending L1 events!
      
      This commit fixes the issue by making sure to check for L1 pending
      events even if a previous event was reinjected to L2 and bailing out
      from inject_pending_event() before evaluating a new pending event in
      case an event was already reinjected.
      
      The bug was observed by the following setup:
      * L0 is a 64CPU machine which runs KVM.
      * L1 is a 16CPU machine which runs KVM.
      * L0 & L1 runs with APICv disabled.
      (Also reproduced with APICv enabled but easier to analyze below info
      with APICv disabled)
      * L1 runs a 16CPU L2 Windows Server 2012 R2 guest.
      During L2 boot, L1 hangs completely, and analyzing the hang reveals that
      one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
      that it has sent to another L1 vCPU. All other L1 vCPUs are
      currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
      forever (as L1 runs with kernel preemption disabled).
      
      Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
      series of events:
      (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
      0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
      ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
      (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
      vec 252 (Fixed|edge)
      (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
      (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
      (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
      rip 0xffffe00069202690 info 83 0
      (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
      0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
      ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
      (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
      EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
      ext_int: 0x00000000 ext_int_err: 0x00000000
      (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
      
      Which can be analyzed as follows:
      (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
      Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
      a pending-interrupt of 0xd2.
      (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
      vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT.
      (3) L0 reaches vcpu_enter_guest(), which calls inject_pending_event(); it
      notes that interrupt 0xd2 was reinjected and therefore calls
      vmx_inject_irq() and returns, without checking for pending L1 events!
      Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
      before calling inject_pending_event().
      (4) L0 resumes L2 without immediate-exit even though there is a pending
      L1 event (The IPI pending in LAPIC's IRR).
      
      We have already reached the buggy scenario, but the events can be
      further analyzed:
      (5+6) L2 VMExit to L0 on EPT_VIOLATION.  This time not during
      event-delivery.
      (7) L0 decides to forward the VMExit to L1 for further handling.
      (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
      LAPIC's IRR is not examined and therefore the IPI is still not delivered
      into L1!
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
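      The shape of the fixed decision can be sketched as a standalone model
      of the nested (L2 running) case only; the struct, enum and function
      names below are invented for the sketch and mirror the wording of the
      commit message, not the actual kvm/x86 code:

          #include <stdbool.h>
          #include <stdio.h>

          struct vcpu_model {
              bool reinject_event;     /* event recovered from IDT-vectoring info */
              bool l1_event_pending;   /* e.g. an IPI sitting in L1's LAPIC IRR */
          };

          enum action { RESUME_L2, REQUEST_IMMEDIATE_EXIT };

          static enum action inject_pending_event_model(const struct vcpu_model *v)
          {
              /* Old behaviour: if an event had to be re-injected
               * (v->reinject_event), the pending L1 event was never considered
               * and L2 was simply resumed.  The fix adds the check below
               * regardless of re-injection, so L0 requests an immediate VMExit
               * and lets L1 handle its pending event. */
              if (v->l1_event_pending)
                  return REQUEST_IMMEDIATE_EXIT;
              return RESUME_L2;
          }

          int main(void)
          {
              struct vcpu_model buggy_case = { .reinject_event = true,
                                               .l1_event_pending = true };

              printf("%s\n",
                     inject_pending_event_model(&buggy_case) == REQUEST_IMMEDIATE_EXIT ?
                     "immediate exit" : "resume L2");
              return 0;
          }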
    • KVM: x86: Fix misleading comments on handling pending exceptions · a042c26f
      Committed by Liran Alon
      The reason that exception.pending should block re-injection of an
      NMI/interrupt is not described correctly by the comment in the code.
      Instead, the comment describes why a pending exception should be injected
      before a pending NMI/interrupt.

      Therefore, move the currently present comment to the code block that
      evaluates a new pending event, where it explains why exception.pending
      is evaluated first.
      In addition, create a new comment describing that exception.pending
      blocks re-injection of an NMI/interrupt because the exception was
      queued while handling a vmexit that occurred during NMI/interrupt delivery.
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@orcle.com>
      [Used a comment from Sean J <sean.j.christopherson@intel.com>. - Radim]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: x86: Rename interrupt.pending to interrupt.injected · 04140b41
      Committed by Liran Alon
      For exception & NMI events, the KVM code uses the following
      coding convention:
      *) "pending" represents an event that should be injected into the guest
      at some point, but its side-effects have not yet occurred.
      *) "injected" represents an event whose side-effects have already
      occurred.

      However, interrupts don't conform to this coding convention.
      All current code flows mark interrupt.pending when its side-effects
      have already taken place (for example, the bit has moved from the LAPIC
      IRR to the ISR). Therefore, it makes sense to just rename
      interrupt.pending to interrupt.injected.
      
      This change follows the logic of the previous commit 664f8e26 ("KVM: X86:
      Fix loss of exception which has not yet been injected"), which changed
      exceptions to follow this coding convention as well.
      
      It is important to note that in case of !lapic_in_kernel(vcpu),
      the usage of interrupt.pending was, and still is, incorrect.
      In this case, interrupt.pending can only be set using one of the
      following ioctls: KVM_INTERRUPT, KVM_SET_VCPU_EVENTS and
      KVM_SET_SREGS. Looking at how QEMU uses these ioctls, one can see that
      QEMU uses them either to re-set an "interrupt.pending" state it has
      received from KVM (via KVM_GET_VCPU_EVENTS interrupt.pending or
      via KVM_GET_SREGS interrupt_bitmap) or when dispatching a new interrupt
      from QEMU's emulated LAPIC, which resets the bit in the IRR and sets the
      bit in the ISR before sending the ioctl to KVM. So it seems that indeed
      "interrupt.pending" in this case is also supposed to represent
      "interrupt.injected".
      However, kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr()
      are misusing the (now renamed) interrupt.injected in order to report
      whether there is a pending interrupt.
      This leads to nVMX/nSVM not being able to distinguish whether they should
      exit from L2 to L1 on EXTERNAL_INTERRUPT because of a pending interrupt,
      or re-inject an injected interrupt.
      Therefore, add a FIXME at these functions for handling this issue.

      This patch introduces no semantic change.
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
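      The convention can be summarized with a small illustrative definition
      (the field layout below only echoes the commit message and is not the
      exact kvm_vcpu_arch layout):

          #include <stdbool.h>

          /* Illustration of the naming convention the commit aligns interrupts
           * with: "pending" means the event should be delivered at some point
           * but its side-effects have not happened yet; "injected" means the
           * side-effects have already happened (e.g. the vector moved from the
           * LAPIC IRR to the ISR), so on re-entry the event is simply
           * re-injected. */
          struct event_state {
              bool pending;    /* queued, side-effects not yet applied */
              bool injected;   /* side-effects applied, awaiting (re-)delivery */
              unsigned char nr;
          };

          struct vcpu_events {
              struct event_state exception;  /* already follows the convention */
              struct event_state nmi;        /* already follows the convention */
              struct event_state interrupt;  /* .pending is renamed .injected  */
          };

          int main(void)
          {
              struct vcpu_events ev = { 0 };

              (void)ev;   /* compile-time illustration only */
              return 0;
          }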
    • KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt · 7c5a6a59
      Committed by Liran Alon
      kvm_inject_realmode_interrupt() is called from one of the injection
      functions which write the event injection into the VMCS:
      vmx_queue_exception(), vmx_inject_irq() and vmx_inject_nmi().

      All these functions are called just to cause an event injection into the
      guest. They are not responsible for manipulating the event-pending
      flag. The only purpose of kvm_inject_realmode_interrupt() should be
      to emulate real-mode interrupt injection.

      This was also incorrect when called from vmx_queue_exception().
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • x86/kvm: rename HV_X64_MSR_APIC_ASSIST_PAGE to HV_X64_MSR_VP_ASSIST_PAGE · d4abc577
      Committed by Ladi Prosek
      The assist page has been used only for the paravirtual EOI so far, hence
      the "APIC" in the MSR name. Rename it to match the Hyper-V TLFS, where it's
      called the "Virtual VP Assist MSR".
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  12. 28 Mar 2018, 1 commit
    • KVM: x86: Fix perf timer mode IP reporting · dd60d217
      Committed by Andi Kleen
      KVM and perf have a special backdoor mechanism to report the IP for interrupts
      re-executed after vm exit. This works for the NMIs that perf normally uses.
      
      However, when perf is in timer mode it doesn't work because the timer interrupt
      doesn't get this special treatment. This is common when KVM is running
      nested in another hypervisor which may not implement the PMU, so only
      timer mode is available.
      
      Call the functions to set up the backdoor IP also for non-NMI interrupts.

      I renamed the functions that set up the backdoor IP reporting to be more
      appropriate for their new use.  The SVM change is only compile tested.
      
      v2: Moved the functions inline.
      For the normal interrupt case the before/after functions are now
      called from x86.c, not arch specific code.
      For the NMI case we still need to call it in the architecture
      specific code, because it's already needed in the low level *_run
      functions.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      [Removed unnecessary calls from arch handle_external_intr. - Radim]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
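      Conceptually the "backdoor" works by bracketing the host-side handling
      of a guest-induced interrupt with markers that let perf attribute the
      sample to the guest RIP instead of the host handler. A minimal
      standalone model of that bracketing follows; the names are invented for
      this sketch (the kernel keeps a per-CPU pointer to the currently
      running vCPU):

          #include <stdio.h>

          struct vcpu { unsigned long guest_rip; };

          static struct vcpu *current_guest_vcpu;   /* per-CPU in the real kernel */

          /* Record which vCPU was interrupted before running the host handler,
           * and clear the marker afterwards. */
          static void before_guest_interrupt(struct vcpu *v) { current_guest_vcpu = v; }
          static void after_guest_interrupt(void)            { current_guest_vcpu = NULL; }

          /* If the perf sample raced with guest execution, report the guest RIP
           * rather than the host interrupt handler's IP. */
          static unsigned long perf_sample_ip(unsigned long host_ip)
          {
              return current_guest_vcpu ? current_guest_vcpu->guest_rip : host_ip;
          }

          int main(void)
          {
              struct vcpu v = { .guest_rip = 0x401000UL };

              before_guest_interrupt(&v);
              printf("sample ip: %#lx\n", perf_sample_ip(0xc0ffeeUL));
              after_guest_interrupt();
              printf("sample ip: %#lx\n", perf_sample_ip(0xc0ffeeUL));
              return 0;
          }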
  13. 24 Mar 2018, 1 commit
  14. 21 Mar 2018, 1 commit
    • KVM: nVMX: Do not load EOI-exitmap while running L2 · e40ff1d6
      Committed by Liran Alon
      When the L1 IOAPIC redirection table is written, a KVM_REQ_SCAN_IOAPIC
      request is set on all vCPUs. This is done so that all vCPUs will
      recalculate their IOAPIC-handled vectors and load them into their
      EOI-exitmap.

      However, it could be that one of the vCPUs is currently running
      L2. In this case, load_eoi_exitmap() will be called, which would
      write to vmcs02->eoi_exit_bitmap; this is wrong because
      vmcs02->eoi_exit_bitmap should always be equal to
      vmcs12->eoi_exit_bitmap. Furthermore, at this point
      KVM_REQ_SCAN_IOAPIC has already been consumed and therefore we will
      never update vmcs01->eoi_exit_bitmap. This could lead to the remote_irr
      of some IOAPIC level-triggered entry remaining set forever.
      
      Fix this issue by delaying the load of the EOI-exitmap to when the vCPU
      is running L1.

      One may wonder why not just delay the entire KVM_REQ_SCAN_IOAPIC
      processing to when the vCPU is running L1. This is done in order to
      correctly handle the case where the LAPIC & IO-APIC of L1 are passed
      through into L2. In this case, vmcs12->virtual_interrupt_delivery should
      be 0. In the current nVMX implementation, that results in
      vmcs02->virtual_interrupt_delivery also being 0. Thus,
      vmcs02->eoi_exit_bitmap is not used. Therefore, every L2 EOI causes
      a #VMExit into L0 (either on an MSR_WRITE to an x2APIC MSR or an
      APIC_ACCESS/APIC_WRITE/EPT_MISCONFIG to the APIC MMIO page).
      In order for such an L2 EOI to be broadcast, if needed, from the LAPIC
      to the IO-APIC, vcpu->arch.ioapic_handled_vectors must be updated
      while L2 is running. Therefore, the patch makes sure to delay only the
      loading of the EOI-exitmap but not the update of
      vcpu->arch.ioapic_handled_vectors.
      Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
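      The shape of the deferral can be sketched as a standalone model (the
      field and function names below are modeled on the commit message, not
      copied from the actual diff, and the nested-vmexit hook name is an
      assumption of this sketch):

          #include <stdbool.h>
          #include <stdio.h>

          /* Model: when the IOAPIC scan finishes while the vCPU is running L2,
           * remember that the EOI-exitmap load is pending instead of writing it
           * into vmcs02; perform the load once the vCPU is back in L1. */
          struct vcpu_model {
              bool guest_mode;               /* true while running L2 */
              bool load_eoi_exitmap_pending;
              unsigned long eoi_exit_bitmap; /* stands in for the vmcs01 field */
          };

          static void scan_ioapic_done(struct vcpu_model *v, unsigned long bitmap)
          {
              if (v->guest_mode) {
                  v->load_eoi_exitmap_pending = true;  /* defer: vmcs02 mirrors vmcs12 */
                  return;
              }
              v->eoi_exit_bitmap = bitmap;             /* safe: vCPU is running L1 */
          }

          static void nested_vmexit_to_l1(struct vcpu_model *v, unsigned long bitmap)
          {
              v->guest_mode = false;
              if (v->load_eoi_exitmap_pending) {
                  v->load_eoi_exitmap_pending = false;
                  v->eoi_exit_bitmap = bitmap;
              }
          }

          int main(void)
          {
              struct vcpu_model v = { .guest_mode = true };

              scan_ioapic_done(&v, 0x8);     /* deferred, bitmap untouched */
              nested_vmexit_to_l1(&v, 0x8);  /* applied on return to L1 */
              printf("eoi_exit_bitmap = %#lx\n", v.eoi_exit_bitmap);
              return 0;
          }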
  15. 17 Mar 2018, 10 commits
  16. 07 Mar 2018, 4 commits
  17. 02 Mar 2018, 1 commit