1. 25 Aug, 2017 5 commits
    • W
      KVM: X86: Fix loss of exception which has not yet been injected · 664f8e26
      Committed by Wanpeng Li
      vmx_complete_interrupts() assumes that the exception is always injected,
      so it can be dropped by kvm_clear_exception_queue().  However,
      an exception cannot be injected immediately if it is: 1) originally
      destined to a nested guest; 2) trapped to cause a vmexit; 3) happening
      right after VMLAUNCH/VMRESUME, i.e. when nested_run_pending is true.
      
      This patch applies to exceptions the same algorithm that is used for
      NMIs, replacing exception.reinject with "exception.injected" (equivalent
      to nmi_injected).
      
      exception.pending now represents an exception that is queued and whose
      side effects (e.g., update RFLAGS.RF or DR7) have not been applied yet.
      If exception.pending is true, the exception might result in a nested
      vmexit instead, too (in which case the side effects must not be applied).
      
      exception.injected instead represents an exception that is going to be
      injected into the guest at the next vmentry.
      Reported-by: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
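The pending/injected split described in this commit message can be pictured with a small state model. The sketch below is a toy and not KVM's code: `struct exc_queue`, `process_exception()` and `complete_exception()` are hypothetical names, and the decision whether L1 intercepts the exception is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the queued-exception state. */
struct exc_queue {
    bool pending;   /* queued; side effects (RFLAGS.RF, DR7) not applied yet */
    bool injected;  /* will be injected into the guest at the next vmentry */
};

enum exc_outcome { EXC_NONE, EXC_NESTED_VMEXIT, EXC_INJECT };

/* Called before vmentry. */
static enum exc_outcome process_exception(struct exc_queue *q,
                                          bool l1_intercepts,
                                          bool nested_run_pending)
{
    assert(q);
    if (q->pending) {
        if (l1_intercepts) {
            if (nested_run_pending)
                return EXC_NONE;        /* vmexit must wait; stays pending */
            q->pending = false;
            return EXC_NESTED_VMEXIT;   /* side effects must not be applied */
        }
        /* Side effects would be applied here, exactly once. */
        q->pending = false;
        q->injected = true;
    }
    return q->injected ? EXC_INJECT : EXC_NONE;
}

/* Called after vmexit: only a delivered exception may be dropped. */
static void complete_exception(struct exc_queue *q, bool delivered)
{
    assert(q);
    if (delivered)
        q->injected = false;
}
```

In this model an exception that was not delivered keeps `injected` set, so it is offered again at the next entry instead of being lost.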
    • W
      KVM: VMX: use kvm_event_needs_reinjection · 274bba52
      Committed by Wanpeng Li
      Use the kvm_event_needs_reinjection() helper rather than open-coding the check.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Y
      KVM: MMU: Expose the LA57 feature to VM. · fd8cb433
      Committed by Yu Zhang
      This patch exposes the 5-level page table feature (LA57) to the VM.
      At the same time, the canonical virtual address check is
      extended to support both 48-bit and 57-bit address widths.
      Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
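A canonical-address check that handles both widths can be sketched as follows. This is an illustration under stated assumptions, not the kernel's implementation: `is_canonical()` and `guest_vaddr_bits()` are hypothetical names. An address is canonical when all bits from the top implemented bit upward are equal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: 57 bits with LA57 enabled, otherwise 48. */
static unsigned int guest_vaddr_bits(bool la57_enabled)
{
    return la57_enabled ? 57 : 48;
}

/* Bits 63..(vaddr_bits - 1) must be all zeros or all ones. */
static bool is_canonical(uint64_t va, unsigned int vaddr_bits)
{
    assert(vaddr_bits == 48 || vaddr_bits == 57);
    uint64_t upper = va >> (vaddr_bits - 1);
    uint64_t all_ones = UINT64_MAX >> (vaddr_bits - 1);
    return upper == 0 || upper == all_ones;
}
```

The same address can be non-canonical for a 48-bit guest and canonical for a 57-bit guest, which is why the width has to follow the guest's paging mode.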
    • Y
      KVM: MMU: Add 5 level EPT & Shadow page table support. · 855feb67
      Committed by Yu Zhang
      Extend the shadow paging code so that a 5-level shadow page
      table can be constructed if the VM is running in 5-level paging
      mode.
      
      Also extend the EPT code so that a 5-level EPT table can be
      constructed if the MAXPHYADDR of the VM exceeds 48 bits. Unlike the
      shadow logic, KVM should still use a 4-level EPT table for a VM
      whose physical address width is less than 48 bits, even when
      the VM is running in 5-level paging mode.
      Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      [Unconditionally reset the MMU context in kvm_cpuid_update.
       Changing MAXPHYADDR invalidates the reserved bit bitmasks.
       - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
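The difference between the two level choices can be written down in a few lines. The helpers below are hypothetical and only restate the rule from the commit message: the EPT depth depends on the guest's physical address width, while the shadow depth follows the guest's paging mode (a 64-bit guest is assumed).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: EPT depth depends only on the guest's MAXPHYADDR. */
static int ept_levels(unsigned int guest_maxphyaddr)
{
    return guest_maxphyaddr > 48 ? 5 : 4;
}

/* Hypothetical helper: shadow depth follows the guest's paging mode. */
static int shadow_levels(bool guest_uses_5level_paging)
{
    return guest_uses_5level_paging ? 5 : 4;
}

/* A few cases that exercise both rules. */
static void check_levels(void)
{
    assert(ept_levels(46) == 4);
    assert(ept_levels(52) == 5);
    assert(shadow_levels(true) == 5);
    assert(shadow_levels(false) == 4);
}
```

A guest in 5-level paging mode with a 46-bit physical address width would therefore get `shadow_levels(true) == 5` under shadow paging but `ept_levels(46) == 4` under EPT.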
    • P
      kvm: vmx: Raise #UD on unsupported XSAVES/XRSTORS · 3db13480
      Committed by Paolo Bonzini
      A guest may not be configured to support XSAVES/XRSTORS, even when the host
      does. If the guest does not support XSAVES/XRSTORS, clear the secondary
      execution control so that the processor will raise #UD.
      
      Also clear the "allowed-1" bit for XSAVES/XRSTORS exiting in the
      IA32_VMX_PROCBASED_CTLS2 MSR, and pass through VMCS12's control in
      the VMCS02.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 24 Aug, 2017 3 commits
    • J
      kvm: vmx: Raise #UD on unsupported RDSEED · 75f4fc8d
      Committed by Jim Mattson
      A guest may not be configured to support RDSEED, even when the host
      does. If the guest does not support RDSEED, intercept the instruction
      and synthesize #UD. Also clear the "allowed-1" bit for RDSEED exiting
      in the IA32_VMX_PROCBASED_CTLS2 MSR.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • J
      kvm: vmx: Raise #UD on unsupported RDRAND · 45ec368c
      Committed by Jim Mattson
      A guest may not be configured to support RDRAND, even when the host
      does. If the guest does not support RDRAND, intercept the instruction
      and synthesize #UD. Also clear the "allowed-1" bit for RDRAND exiting
      in the IA32_VMX_PROCBASED_CTLS2 MSR.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
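The RDRAND and RDSEED commits follow the same pattern, which can be sketched as below. The bit values are the "RDRAND exiting" (bit 11) and "RDSEED exiting" (bit 16) secondary controls from the Intel SDM; `struct sec_ctls` and `update_exiting()` are hypothetical names for illustration and do not mirror the actual KVM data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECONDARY_EXEC_RDRAND_EXITING 0x00000800u /* bit 11 */
#define SECONDARY_EXEC_RDSEED_EXITING 0x00010000u /* bit 16 */

/* Hypothetical container for the two values the commits touch. */
struct sec_ctls {
    uint32_t exec_control;    /* controls used while running the guest */
    uint32_t nested_allowed1; /* "allowed-1" bits reported to L1 */
};

static void update_exiting(struct sec_ctls *c, uint32_t exiting_bit,
                           bool guest_has_insn)
{
    assert(c);
    if (guest_has_insn) {
        c->exec_control &= ~exiting_bit;    /* instruction runs natively */
        c->nested_allowed1 |= exiting_bit;  /* L1 may intercept it itself */
    } else {
        c->exec_control |= exiting_bit;     /* intercept; handler raises #UD */
        c->nested_allowed1 &= ~exiting_bit; /* hide the control from L1 */
    }
}
```

The caller would pass the result of the guest CPUID lookup for the instruction as `guest_has_insn`.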
    • P
      KVM: VMX: cache secondary exec controls · 80154d77
      Committed by Paolo Bonzini
      Currently, secondary execution controls are divided into three groups:
      
      - static, depending mostly on the module arguments or the processor
        (vmx_secondary_exec_control)
      
      - static, depending on CPUID (vmx_cpuid_update)
      
      - dynamic, depending on nested VMX or local APIC state
      
      Because walking CPUID is expensive, prepare_vmcs02 is using only
      the first group.  This however is unnecessarily complicated.  Just
      cache the static secondary execution controls, and then prepare_vmcs02
      does not need to compute them every time.  Computation of all static
      secondary execution controls is now kept in a single function,
      vmx_compute_secondary_exec_control.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 18 Aug, 2017 5 commits
  4. 12 Aug, 2017 1 commit
  5. 10 Aug, 2017 2 commits
    • P
      kvm: nVMX: Add support for fast unprotection of nested guest page tables · eebed243
      Committed by Paolo Bonzini
      This is the same as commit 14727754 ("kvm: svm: Add support for
      additional SVM NPF error codes", 2016-11-23), but for Intel processors.
      In this case, the exit qualification field's bit 8 says whether the
      EPT violation occurred while translating the guest's final physical
      address or rather while translating the guest page tables.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
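Decoding the bit mentioned in this commit message might look like the sketch below. The macro and enum names are hypothetical; only the meaning of bit 8 comes from the text above (set when the violation hit the guest's final physical address, clear when it hit a guest page-table access).

```c
#include <assert.h>
#include <stdint.h>

/* Bit 8 of the EPT-violation exit qualification (name is hypothetical). */
#define EPT_VIOLATION_FINAL_ADDRESS (1ULL << 8)

enum ept_fault_stage {
    FAULT_ON_GUEST_PAGE_TABLE, /* while translating the guest page tables */
    FAULT_ON_FINAL_ADDRESS     /* while accessing the final physical address */
};

static enum ept_fault_stage classify_ept_violation(uint64_t exit_qualification)
{
    return (exit_qualification & EPT_VIOLATION_FINAL_ADDRESS)
               ? FAULT_ON_FINAL_ADDRESS
               : FAULT_ON_GUEST_PAGE_TABLE;
}

/* A few cases: only bit 8 decides the classification. */
static void check_classification(void)
{
    assert(classify_ept_violation(0x100) == FAULT_ON_FINAL_ADDRESS);
    assert(classify_ept_violation(0x082) == FAULT_ON_GUEST_PAGE_TABLE);
    assert(classify_ept_violation(0x182) == FAULT_ON_FINAL_ADDRESS);
}
```

A handler could use the page-table case as the hint that a write-protected guest page table was hit and that unprotecting the page is cheaper than emulating the access.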
    • W
      KVM: X86: Fix residual mmio emulation request to userspace · bbeac283
      Committed by Wanpeng Li
      Reported by syzkaller:
      
      With kvm-intel.unrestricted_guest=0:
      
         WARNING: CPU: 5 PID: 1014 at /home/kernel/data/kvm/arch/x86/kvm//x86.c:7227 kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
         CPU: 5 PID: 1014 Comm: warn_test Tainted: G        W  OE   4.13.0-rc3+ #8
         RIP: 0010:kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
         Call Trace:
          ? put_pid+0x3a/0x50
          ? rcu_read_lock_sched_held+0x79/0x80
          ? kmem_cache_free+0x2f2/0x350
          kvm_vcpu_ioctl+0x340/0x700 [kvm]
          ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
          ? __fget+0xfc/0x210
          do_vfs_ioctl+0xa4/0x6a0
          ? __fget+0x11d/0x210
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x23/0xc2
          ? __this_cpu_preempt_check+0x13/0x20
      
      The syzkaller folks reported a residual MMIO emulation request to userspace:
      vm86 fails to emulate injecting a real-mode interrupt (it fails to read CS) and
      incurs a triple fault. The vCPU returns to userspace with vcpu->mmio_needed == true
      and a KVM_EXIT_SHUTDOWN exit reason. However, the syzkaller testcase creates
      several threads that run the same vCPU; a thread that launches this vCPU after
      another thread has left it with vcpu->mmio_needed == true and KVM_EXIT_SHUTDOWN
      will trigger the warning.
      
         #define _GNU_SOURCE
         #include <pthread.h>
         #include <stdio.h>
         #include <stdlib.h>
         #include <string.h>
         #include <sys/ioctl.h>
         #include <sys/wait.h>
         #include <sys/types.h>
         #include <sys/stat.h>
         #include <sys/mman.h>
         #include <fcntl.h>
         #include <unistd.h>
         #include <linux/kvm.h>
      
         int kvmcpu;
         struct kvm_run *run;
      
         void* thr(void* arg)
         {
           int res;
           res = ioctl(kvmcpu, KVM_RUN, 0);
           printf("ret1=%d exit_reason=%d suberror=%d\n",
               res, run->exit_reason, run->internal.suberror);
           return 0;
         }
      
         void test()
         {
           int i, kvm, kvmvm;
           pthread_t th[4];
      
           kvm = open("/dev/kvm", O_RDWR);
           kvmvm = ioctl(kvm, KVM_CREATE_VM, 0);
           kvmcpu = ioctl(kvmvm, KVM_CREATE_VCPU, 0);
           run = (struct kvm_run*)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, kvmcpu, 0);
           srand(getpid());
           for (i = 0; i < 4; i++) {
             pthread_create(&th[i], 0, thr, 0);
             usleep(rand() % 10000);
           }
           for (i = 0; i < 4; i++)
             pthread_join(th[i], 0);
         }
      
         int main()
         {
           for (;;) {
             int pid = fork();
             if (pid < 0)
               exit(1);
             if (pid == 0) {
               test();
               exit(0);
             }
             int status;
             while (waitpid(pid, &status, __WALL) != pid) {}
           }
           return 0;
         }
      
      This patch fixes it by resetting vcpu->mmio_needed when the triple fault
      is received, so that no stale request is left behind.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
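The idea of the fix can be shown with a toy model. Everything here is hypothetical (`struct toy_vcpu`, `handle_triple_fault()`, `leftover_request()`); it only illustrates that the flag is cleared on the shutdown path, so a later run of the same vCPU does not find a stale MMIO request.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_exit { TOY_EXIT_MMIO, TOY_EXIT_SHUTDOWN };

/* Hypothetical stand-in for the per-vCPU state involved. */
struct toy_vcpu {
    bool mmio_needed; /* an MMIO emulation request is waiting for userspace */
};

/* Triple fault: report shutdown and drop any half-finished MMIO request. */
static enum toy_exit handle_triple_fault(struct toy_vcpu *v)
{
    assert(v);
    v->mmio_needed = false;
    return TOY_EXIT_SHUTDOWN;
}

/* The condition the warning in the report checks for, in toy form. */
static bool leftover_request(const struct toy_vcpu *v, enum toy_exit reason)
{
    assert(v);
    return reason != TOY_EXIT_MMIO && v->mmio_needed;
}
```

Without the reset in `handle_triple_fault()`, `leftover_request()` would report true for the shutdown exit, which corresponds to the state the second thread ran into.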
  6. 08 Aug, 2017 2 commits
  7. 07 Aug, 2017 8 commits
  8. 03 Aug, 2017 4 commits
    • W
      KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit" · 6550c4df
      Committed by Wanpeng Li
      ------------[ cut here ]------------
       WARNING: CPU: 5 PID: 2288 at arch/x86/kvm/vmx.c:11124 nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
       CPU: 5 PID: 2288 Comm: qemu-system-x86 Not tainted 4.13.0-rc2+ #7
       RIP: 0010:nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
      Call Trace:
        vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
        ? vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
        kvm_arch_vcpu_ioctl_run+0x5dd/0x1be0 [kvm]
        ? vmx_vcpu_load+0x1be/0x220 [kvm_intel]
        ? kvm_arch_vcpu_load+0x62/0x230 [kvm]
        kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? __fget+0xfc/0x210
        do_vfs_ioctl+0xa4/0x6a0
        ? __fget+0x11d/0x210
        SyS_ioctl+0x79/0x90
        do_syscall_64+0x8f/0x750
        ? trace_hardirqs_on_thunk+0x1a/0x1c
        entry_SYSCALL64_slow_path+0x25/0x25
      
      This can be reproduced by booting the L1 guest with the 'noapic' grub
      parameter, which tells the kernel not to make use of any IOAPICs that may
      be present in the system.
      
      The external_intr variable in nested_vmx_vmexit() is actually the req_int_win
      variable passed from vcpu_enter_guest(), which means that L0's userspace
      requests an irq window. I observed that the scenario (!kvm_cpu_has_interrupt(vcpu) &&
      L0's userspace requests an irq window) is true, so there is no interrupt that
      L1 needs to inject into L2; we should not attempt to emulate "Acknowledge
      interrupt on exit" for the irq window request in this scenario.
      
      This patch fixes it by not attempting to emulate "Acknowledge interrupt on exit"
      if L1 has no interrupt to inject into L2.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      [Added code comment to make it obvious that the behavior is not correct.
       We should do a userspace exit with open interrupt window instead of the
       nested VM exit.  This patch still improves the behavior, so it was
       accepted as a (temporary) workaround.]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • D
      KVM: nVMX: mark vmcs12 pages dirty on L2 exit · c9f04407
      Committed by David Matlack
      The host physical addresses of L1's Virtual APIC Page and Posted
      Interrupt descriptor are loaded into the VMCS02. The CPU may write
      to these pages via their host physical address while L2 is running,
      bypassing address-translation-based dirty tracking (e.g. EPT write
      protection). Mark them dirty on every exit from L2 to prevent them
      from getting out of sync with dirty tracking.
      
      Also mark the virtual APIC page and the posted interrupt descriptor
      dirty when KVM is virtualizing posted interrupt processing.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • D
      kvm: nVMX: don't flush VMCS12 during VMXOFF or VCPU teardown · 8ca44e88
      Committed by David Matlack
      According to the Intel SDM, software cannot rely on the current VMCS to be
      coherent after a VMXOFF or shutdown. So this is a valid way to handle VMCS12
      flushes.
      
      24.11.1 Software Use of Virtual-Machine Control Structures
      ...
        If a logical processor leaves VMX operation, any VMCSs active on
        that logical processor may be corrupted (see below). To prevent
        such corruption of a VMCS that may be used either after a return
        to VMX operation or on another logical processor, software should
        execute VMCLEAR for that VMCS before executing the VMXOFF instruction
        or removing power from the processor (e.g., as part of a transition
        to the S3 and S4 power states).
      ...
      
      This fixes a "suspicious rcu_dereference_check() usage!" warning during
      kvm_vm_release() because nested_release_vmcs12() calls
      kvm_vcpu_write_guest_page() without holding kvm->srcu.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • P
      KVM: nVMX: do not pin the VMCS12 · 9f744c59
      Committed by Paolo Bonzini
      Since the current implementation of VMCS12 does a memcpy in and out
      of guest memory, we do not need current_vmcs12 and current_vmcs12_page
      anymore.  current_vmptr is enough to read and write the VMCS12.
      
      And David Matlack noted:
      
        This patch also fixes dirty tracking (memslot->dirty_bitmap) of the
        VMCS12 page by using kvm_write_guest. nested_release_page() only marks
        the struct page dirty.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      [Added David Matlack's note and nested_release_page_clean() fix.]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  9. 02 Aug, 2017 2 commits
    • P
      KVM: nVMX: fixes to nested virt interrupt injection · b96fb439
      Committed by Paolo Bonzini
      There are three issues in nested_vmx_check_exception:
      
      1) it is not taking PFEC_MATCH/PFEC_MASK into account, as reported
      by Wanpeng Li;
      
      2) it should rebuild the interruption info and exit qualification fields
      from scratch, as reported by Jim Mattson, because the values from the
      L2->L0 vmexit may be invalid (e.g. if an emulated instruction causes
      a page fault, the EPT misconfig's exit qualification is incorrect).
      
      3) CR2 and DR6 should not be written for exception intercept vmexits
      (CR2 only for AMD).
      
      This patch fixes the first two and adds a comment about the last,
      outlining the fix.
      
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
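For the first issue, the rule that PFEC_MASK and PFEC_MATCH impose on page-fault intercepts can be sketched as follows. The function name is hypothetical; the rule itself is the one the Intel SDM gives for bit 14 of the exception bitmap: with the bit set, a #PF exits when the masked error code equals the match value, and with the bit clear it exits when they differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PF_VECTOR 14

/* Hypothetical helper: does a #PF in L2 cause a vmexit to L1? */
static bool page_fault_causes_vmexit(uint32_t exception_bitmap,
                                     uint32_t pfec_mask,
                                     uint32_t pfec_match,
                                     uint32_t error_code)
{
    bool mismatch = (error_code & pfec_mask) != pfec_match;
    bool bit = (exception_bitmap >> PF_VECTOR) & 1u;

    /* bit set: exit on match; bit clear: exit on mismatch */
    return mismatch != bit;
}

/* A few cases covering all four combinations. */
static void check_pfec_rule(void)
{
    uint32_t pf_bit = 1u << PF_VECTOR;

    assert(page_fault_causes_vmexit(pf_bit, 0, 0, 0x5));  /* always matches */
    assert(!page_fault_causes_vmexit(0, 0, 0, 0x5));
    assert(!page_fault_causes_vmexit(pf_bit, 1, 1, 0x0)); /* mismatch, bit set */
    assert(page_fault_causes_vmexit(0, 1, 1, 0x0));       /* mismatch, bit clear */
}
```

With mask and match both zero every error code matches, so the bitmap bit alone decides, which is the behavior a check that ignores the two fields would implement in every case.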
    • P
      KVM: nVMX: do not fill vm_exit_intr_error_code in prepare_vmcs12 · 7313c698
      Committed by Paolo Bonzini
      Do this in the caller of nested_vmx_vmexit instead.
      
      nested_vmx_check_exception was doing a vmwrite to the vmcs02's
      VM_EXIT_INTR_ERROR_CODE field, so that prepare_vmcs12 would move
      the field to vmcs12->vm_exit_intr_error_code.  However that isn't
      possible on pre-Haswell machines.  Moving the vmcs12 write to the
      callers fixes it.
      Reported-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      [Changed nested_vmx_reflect_vmexit() return type to (int)1 from (bool)1,
       thanks to fengguang.wu@intel.com]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  10. 27 Jul, 2017 2 commits
    • W
      KVM: nVMX: Fix loss of L2's NMI blocking state · 2d6144e3
      Committed by Wanpeng Li
      Run kvm-unit-tests/eventinj.flat in L1 with ept=0 on both L0 and L1:
      
      Before NMI IRET test
      Sending NMI to self
      NMI isr running stack 0x461000
      Sending nested NMI to self
      After nested NMI to self
      Nested NMI isr running rip=40038e
      After iret
      After NMI to self
      FAIL: NMI
      
      Commit 4c4a6f79 (KVM: nVMX: track NMI blocking state separately
      for each VMCS) tracks NMI blocking state separately for vmcs01 and
      vmcs02. However it is not enough:
      
       - L2 (kvm-unit-tests/eventinj.flat) generates an NMI that will fault
         on IRET, so L2 can generate a #PF which can be intercepted by L0.
       - L0 walks L1's guest page table, sees that the mapping is invalid,
         resumes the L1 guest and injects the #PF into L1.  At this point the
         vmcs02 has nmi_known_unmasked=true.
       - L1 sets bit 3 (blocking by NMI) in the interruptibility-state field
         of vmcs12 (and fixes the shadow page table) before resuming the L2 guest.
       - L1 executes VMRESUME to resume L2, causing a vmexit to L0.
       - During VMRESUME emulation, prepare_vmcs02 sets bit 3 in the
         interruptibility-state field of vmcs02, but nmi_known_unmasked is
         still true.
       - L2 immediately exits to L0 with another page fault, because L0 still has
         not updated the NGVA->HPA page tables.  However, nmi_known_unmasked is
         true so vmx_recover_nmi_blocking does not do anything.
      
      The fix is to update nmi_known_unmasked when preparing vmcs02 from vmcs12.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
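The fix described in this commit message amounts to keeping the cached flag in sync whenever the interruptibility state is loaded from vmcs12. A minimal sketch, assuming a hypothetical `struct toy_vmcs` in place of the real per-VMCS state (bit 3, "blocking by NMI", is the bit named in the message):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GUEST_INTR_STATE_NMI 0x00000008u /* bit 3: blocking by NMI */

/* Hypothetical stand-in for the per-VMCS state of vmcs02. */
struct toy_vmcs {
    uint32_t interruptibility; /* guest interruptibility-state field */
    bool nmi_known_unmasked;   /* cached: NMIs are known not to be blocked */
};

/* Load L1's value into vmcs02 and refresh the cache at the same time. */
static void load_interruptibility(struct toy_vmcs *vmcs02,
                                  uint32_t vmcs12_interruptibility)
{
    assert(vmcs02);
    vmcs02->interruptibility = vmcs12_interruptibility;
    vmcs02->nmi_known_unmasked =
        !(vmcs12_interruptibility & GUEST_INTR_STATE_NMI);
}
```

In the failing sequence above, the cache stayed true after L1 set bit 3, so the recovery path skipped its work; refreshing it here removes that stale state.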
    • W
      KVM: nVMX: Fix posted intr delivery when vcpu is in guest mode · 06a5524f
      Committed by Wincy Van
      The PI vector for L0 and L1 must be different. If dest vcpu0
      is in guest mode while vcpu1 is delivering a non-nested PI to
      vcpu0, there won't be any vmexit, so the non-nested interrupt
      will be delayed.
      Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  11. 24 Jul, 2017 1 commit
  12. 20 Jul, 2017 1 commit
    • W
      KVM: VMX: Fix invalid guest state detection after task-switch emulation · f244deed
      Committed by Wanpeng Li
      This can be reproduced with EPT=1, unrestricted_guest=N, emulate_invalid_state=Y,
      or with EPT=0. The trace of kvm-unit-tests/taskswitch2.flat looks like the one
      below; it tries to emulate a task switch with invalid guest state:
      
      kvm_exit: reason TASK_SWITCH rip 0x0 info 40000058 0
      kvm_emulate_insn: 42000:0:0f 0b (0x2)
      kvm_emulate_insn: 42000:0:0f 0b (0x2) failed
      kvm_inj_exception: #UD (0x0)
      kvm_entry: vcpu 0
      kvm_exit: reason TASK_SWITCH rip 0x0 info 40000058 0
      kvm_emulate_insn: 42000:0:0f 0b (0x2)
      kvm_emulate_insn: 42000:0:0f 0b (0x2) failed
      kvm_inj_exception: #UD (0x0)
      ......................
      
      It appears that the task-switch emulation updates rflags (and vm86
      flag) only after the segments are loaded, causing vmx->emulation_required
      to be set, when in fact invalid guest state emulation is not needed.
      
      This patch fixes it by updating vmx->emulation_required after the
      rflags (and vm86 flag) is updated in task-switch emulation.
      
      Thanks to Radim for moving the update to vmx_set_rflags and adding Paolo's
      suggestion for the check.
      Suggested-by: Nadav Amit <nadav.amit@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  13. 19 Jul, 2017 2 commits
    • J
      KVM: nVMX: Disallow VM-entry in MOV-SS shadow · b3f1dfb6
      Committed by Jim Mattson
      Immediately following MOV-to-SS/POP-to-SS, VM-entry is
      disallowed. This check comes after the check for a valid VMCS. When
      this check fails, the instruction pointer should fall through to the
      next instruction, the ALU flags should be set to indicate VMfailValid,
      and the VM-instruction error should be set to 26 ("VM entry with
      events blocked by MOV SS").
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
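The failure path described in this commit message can be sketched as below. The struct and function names are hypothetical; the RFLAGS bit positions and the VMfailValid convention (ZF set, the other arithmetic flags cleared) follow the Intel SDM, and "blocking by MOV SS" is bit 1 of the interruptibility state. Advancing the instruction pointer is left out of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GUEST_INTR_STATE_MOV_SS 0x00000002u /* bit 1: blocking by MOV SS */
#define VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS 26u

#define RFLAGS_CF 0x001u
#define RFLAGS_PF 0x004u
#define RFLAGS_AF 0x010u
#define RFLAGS_ZF 0x040u
#define RFLAGS_SF 0x080u
#define RFLAGS_OF 0x800u

/* Hypothetical result type for the emulated VM-entry check. */
struct vmentry_check {
    bool failed;
    uint32_t vm_instruction_error;
    uint64_t rflags;
};

static struct vmentry_check check_mov_ss_shadow(uint32_t interruptibility,
                                                uint64_t rflags)
{
    struct vmentry_check r = { false, 0, rflags };

    if (interruptibility & GUEST_INTR_STATE_MOV_SS) {
        r.failed = true;
        r.vm_instruction_error = VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS;
        /* VMfailValid: ZF = 1; CF, PF, AF, SF and OF = 0 */
        r.rflags &= ~(uint64_t)(RFLAGS_CF | RFLAGS_PF | RFLAGS_AF |
                                RFLAGS_SF | RFLAGS_OF);
        r.rflags |= RFLAGS_ZF;
    }
    return r;
}

/* Two cases: entry in the MOV-SS shadow fails, otherwise nothing changes. */
static void check_mov_ss(void)
{
    struct vmentry_check r = check_mov_ss_shadow(GUEST_INTR_STATE_MOV_SS, 0x8d7);
    assert(r.failed && r.vm_instruction_error == 26u && r.rflags == 0x42);

    r = check_mov_ss_shadow(0, 0x2);
    assert(!r.failed && r.vm_instruction_error == 0 && r.rflags == 0x2);
}
```

Because the failure is VMfailValid, the emulated instruction completes and execution continues with the instruction after it, as the message states.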
    • P
      KVM: nVMX: track NMI blocking state separately for each VMCS · 4c4a6f79
      Committed by Paolo Bonzini
      vmx_recover_nmi_blocking is using a cached value of the guest
      interruptibility info, which is stored in vmx->nmi_known_unmasked.
      vmx_recover_nmi_blocking is run for both normal and nested guests,
      so the cached value must be per-VMCS.
      
      This fixes eventinj.flat in a nested non-EPT environment.  With EPT it
      works, because the EPT violation handler doesn't have the
      vmx->nmi_known_unmasked optimization (it is unnecessary because, unlike
      vmx_recover_nmi_blocking, it can just look at the exit qualification).
      
      Thanks to Wanpeng Li for debugging the testcase and providing an initial
      patch.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  14. 14 Jul, 2017 2 commits