1. 07 8月, 2017 3 次提交
  2. 03 8月, 2017 6 次提交
    • W
      KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit" · 6550c4df
      Wanpeng Li 提交于
      ------------[ cut here ]------------
       WARNING: CPU: 5 PID: 2288 at arch/x86/kvm/vmx.c:11124 nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
       CPU: 5 PID: 2288 Comm: qemu-system-x86 Not tainted 4.13.0-rc2+ #7
       RIP: 0010:nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
      Call Trace:
        vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
        ? vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
        kvm_arch_vcpu_ioctl_run+0x5dd/0x1be0 [kvm]
        ? vmx_vcpu_load+0x1be/0x220 [kvm_intel]
        ? kvm_arch_vcpu_load+0x62/0x230 [kvm]
        kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? __fget+0xfc/0x210
        do_vfs_ioctl+0xa4/0x6a0
        ? __fget+0x11d/0x210
        SyS_ioctl+0x79/0x90
        do_syscall_64+0x8f/0x750
        ? trace_hardirqs_on_thunk+0x1a/0x1c
        entry_SYSCALL64_slow_path+0x25/0x25
      
      This can be reproduced by booting L1 guest w/ 'noapic' grub parameter, which
      means that tells the kernel to not make use of any IOAPICs that may be present
      in the system.
      
      Actually external_intr variable in nested_vmx_vmexit() is the req_int_win
      variable passed from vcpu_enter_guest() which means that the L0's userspace
      requests an irq window. I observed the scenario (!kvm_cpu_has_interrupt(vcpu) &&
      L0's userspace reqeusts an irq window) is true, so there is no interrupt which
      L1 requires to inject to L2, we should not attempt to emualte "Acknowledge
      interrupt on exit" for the irq window requirement in this scenario.
      
      This patch fixes it by not attempt to emulate "Acknowledge interrupt on exit"
      if there is no L1 requirement to inject an interrupt to L2.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      [Added code comment to make it obvious that the behavior is not correct.
       We should do a userspace exit with open interrupt window instead of the
       nested VM exit.  This patch still improves the behavior, so it was
       accepted as a (temporary) workaround.]
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      6550c4df
    • D
      KVM: nVMX: mark vmcs12 pages dirty on L2 exit · c9f04407
      David Matlack 提交于
      The host physical addresses of L1's Virtual APIC Page and Posted
      Interrupt descriptor are loaded into the VMCS02. The CPU may write
      to these pages via their host physical address while L2 is running,
      bypassing address-translation-based dirty tracking (e.g. EPT write
      protection). Mark them dirty on every exit from L2 to prevent them
      from getting out of sync with dirty tracking.
      
      Also mark the virtual APIC page and the posted interrupt descriptor
      dirty when KVM is virtualizing posted interrupt processing.
      Signed-off-by: NDavid Matlack <dmatlack@google.com>
      Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      c9f04407
    • D
      kvm: nVMX: don't flush VMCS12 during VMXOFF or VCPU teardown · 8ca44e88
      David Matlack 提交于
      According to the Intel SDM, software cannot rely on the current VMCS to be
      coherent after a VMXOFF or shutdown. So this is a valid way to handle VMCS12
      flushes.
      
      24.11.1 Software Use of Virtual-Machine Control Structures
      ...
        If a logical processor leaves VMX operation, any VMCSs active on
        that logical processor may be corrupted (see below). To prevent
        such corruption of a VMCS that may be used either after a return
        to VMX operation or on another logical processor, software should
        execute VMCLEAR for that VMCS before executing the VMXOFF instruction
        or removing power from the processor (e.g., as part of a transition
        to the S3 and S4 power states).
      ...
      
      This fixes a "suspicious rcu_dereference_check() usage!" warning during
      kvm_vm_release() because nested_release_vmcs12() calls
      kvm_vcpu_write_guest_page() without holding kvm->srcu.
      Signed-off-by: NDavid Matlack <dmatlack@google.com>
      Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      8ca44e88
    • P
      KVM: nVMX: do not pin the VMCS12 · 9f744c59
      Paolo Bonzini 提交于
      Since the current implementation of VMCS12 does a memcpy in and out
      of guest memory, we do not need current_vmcs12 and current_vmcs12_page
      anymore.  current_vmptr is enough to read and write the VMCS12.
      
      And David Matlack noted:
      
        This patch also fixes dirty tracking (memslot->dirty_bitmap) of the
        VMCS12 page by using kvm_write_guest. nested_release_page() only marks
        the struct page dirty.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      [Added David Matlack's note and nested_release_page_clean() fix.]
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      9f744c59
    • L
      KVM: X86: init irq->level in kvm_pv_kick_cpu_op · ebd28fcb
      Longpeng(Mike) 提交于
      'lapic_irq' is a local variable and its 'level' field isn't
      initialized, so 'level' is random, it doesn't matter but
      makes UBSAN unhappy:
      
      UBSAN: Undefined behaviour in .../lapic.c:...
      load of value 10 is not a valid value for type '_Bool'
      ...
      Call Trace:
       [<ffffffff81f030b6>] dump_stack+0x1e/0x20
       [<ffffffff81f03173>] ubsan_epilogue+0x12/0x55
       [<ffffffff81f03b96>] __ubsan_handle_load_invalid_value+0x118/0x162
       [<ffffffffa1575173>] kvm_apic_set_irq+0xc3/0xf0 [kvm]
       [<ffffffffa1575b20>] kvm_irq_delivery_to_apic_fast+0x450/0x910 [kvm]
       [<ffffffffa15858ea>] kvm_irq_delivery_to_apic+0xfa/0x7a0 [kvm]
       [<ffffffffa1517f4e>] kvm_emulate_hypercall+0x62e/0x760 [kvm]
       [<ffffffffa113141a>] handle_vmcall+0x1a/0x30 [kvm_intel]
       [<ffffffffa114e592>] vmx_handle_exit+0x7a2/0x1fa0 [kvm_intel]
      ...
      Signed-off-by: NLongpeng(Mike) <longpeng2@huawei.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      ebd28fcb
    • W
      KVM: X86: Fix loss of pending INIT due to race · f4ef1910
      Wanpeng Li 提交于
      When SMP VM start, AP may lost INIT because of receiving INIT between
      kvm_vcpu_ioctl_x86_get/set_vcpu_events.
      
             vcpu 0                             vcpu 1
                                         kvm_vcpu_ioctl_x86_get_vcpu_events
                                           events->smi.latched_init = 0
        send INIT to vcpu1
          set vcpu1's pending_events
                                         kvm_vcpu_ioctl_x86_set_vcpu_events
                                            if (events->smi.latched_init == 0)
                                              clear INIT in pending_events
      
      This patch fixes it by just update SMM related flags if we are in SMM.
      
      Thanks Peng Hao for the report and original commit message.
      Reported-by: NPeng Hao <peng.hao2@zte.com.cn>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      f4ef1910
  3. 02 8月, 2017 2 次提交
    • P
      KVM: nVMX: fixes to nested virt interrupt injection · b96fb439
      Paolo Bonzini 提交于
      There are three issues in nested_vmx_check_exception:
      
      1) it is not taking PFEC_MATCH/PFEC_MASK into account, as reported
      by Wanpeng Li;
      
      2) it should rebuild the interruption info and exit qualification fields
      from scratch, as reported by Jim Mattson, because the values from the
      L2->L0 vmexit may be invalid (e.g. if an emulated instruction causes
      a page fault, the EPT misconfig's exit qualification is incorrect).
      
      3) CR2 and DR6 should not be written for exception intercept vmexits
      (CR2 only for AMD).
      
      This patch fixes the first two and adds a comment about the last,
      outlining the fix.
      
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b96fb439
    • P
      KVM: nVMX: do not fill vm_exit_intr_error_code in prepare_vmcs12 · 7313c698
      Paolo Bonzini 提交于
      Do this in the caller of nested_vmx_vmexit instead.
      
      nested_vmx_check_exception was doing a vmwrite to the vmcs02's
      VM_EXIT_INTR_ERROR_CODE field, so that prepare_vmcs12 would move
      the field to vmcs12->vm_exit_intr_error_code.  However that isn't
      possible on pre-Haswell machines.  Moving the vmcs12 write to the
      callers fixes it.
      Reported-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      [Changed nested_vmx_reflect_vmexit() return type to (int)1 from (bool)1,
       thanks to fengguang.wu@intel.com]
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      7313c698
  4. 27 7月, 2017 4 次提交
    • W
      KVM: LAPIC: Fix reentrancy issues with preempt notifiers · 1d518c68
      Wanpeng Li 提交于
      Preempt can occur in the preemption timer expiration handler:
      
                CPU0                    CPU1
      
        preemption timer vmexit
        handle_preemption_timer(vCPU0)
          kvm_lapic_expired_hv_timer
            hv_timer_is_use == true
        sched_out
                                 sched_in
                                 kvm_arch_vcpu_load
                                   kvm_lapic_restart_hv_timer
                                     restart_apic_timer
                                       start_hv_timer
                                         already-expired timer or sw timer triggerd in the window
                                       start_sw_timer
                                         cancel_hv_timer
                                 /* back in kvm_lapic_expired_hv_timer */
                                 cancel_hv_timer
                                   WARN_ON(!apic->lapic_timer.hv_timer_in_use);  ==> Oops
      
      This can be reproduced if CONFIG_PREEMPT is enabled.
      
      ------------[ cut here ]------------
       WARNING: CPU: 4 PID: 2972 at /home/kernel/linux/arch/x86/kvm//lapic.c:1563 kvm_lapic_expired_hv_timer+0x9e/0xb0 [kvm]
       CPU: 4 PID: 2972 Comm: qemu-system-x86 Tainted: G           OE   4.13.0-rc2+ #16
       RIP: 0010:kvm_lapic_expired_hv_timer+0x9e/0xb0 [kvm]
      Call Trace:
        handle_preemption_timer+0xe/0x20 [kvm_intel]
        vmx_handle_exit+0xb8/0xd70 [kvm_intel]
        kvm_arch_vcpu_ioctl_run+0xdd1/0x1be0 [kvm]
        ? kvm_arch_vcpu_load+0x47/0x230 [kvm]
        ? kvm_arch_vcpu_load+0x62/0x230 [kvm]
        kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? __fget+0xfc/0x210
        do_vfs_ioctl+0xa4/0x6a0
        ? __fget+0x11d/0x210
        SyS_ioctl+0x79/0x90
        do_syscall_64+0x81/0x220
        entry_SYSCALL64_slow_path+0x25/0x25
       ------------[ cut here ]------------
       WARNING: CPU: 4 PID: 2972 at /home/kernel/linux/arch/x86/kvm//lapic.c:1498 cancel_hv_timer.isra.40+0x4f/0x60 [kvm]
       CPU: 4 PID: 2972 Comm: qemu-system-x86 Tainted: G        W  OE   4.13.0-rc2+ #16
       RIP: 0010:cancel_hv_timer.isra.40+0x4f/0x60 [kvm]
      Call Trace:
        kvm_lapic_expired_hv_timer+0x3e/0xb0 [kvm]
        handle_preemption_timer+0xe/0x20 [kvm_intel]
        vmx_handle_exit+0xb8/0xd70 [kvm_intel]
        kvm_arch_vcpu_ioctl_run+0xdd1/0x1be0 [kvm]
        ? kvm_arch_vcpu_load+0x47/0x230 [kvm]
        ? kvm_arch_vcpu_load+0x62/0x230 [kvm]
        kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? __fget+0xfc/0x210
        do_vfs_ioctl+0xa4/0x6a0
        ? __fget+0x11d/0x210
        SyS_ioctl+0x79/0x90
        do_syscall_64+0x81/0x220
        entry_SYSCALL64_slow_path+0x25/0x25
      
      This patch fixes it by making the caller of cancel_hv_timer, start_hv_timer
      and start_sw_timer be in preemption-disabled regions, which trivially
      avoid any reentrancy issue with preempt notifier.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      [Add more WARNs. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1d518c68
    • W
      KVM: nVMX: Fix loss of L2's NMI blocking state · 2d6144e3
      Wanpeng Li 提交于
      Run kvm-unit-tests/eventinj.flat in L1 w/ ept=0 on both L0 and L1:
      
      Before NMI IRET test
      Sending NMI to self
      NMI isr running stack 0x461000
      Sending nested NMI to self
      After nested NMI to self
      Nested NMI isr running rip=40038e
      After iret
      After NMI to self
      FAIL: NMI
      
      Commit 4c4a6f79 (KVM: nVMX: track NMI blocking state separately
      for each VMCS) tracks NMI blocking state separately for vmcs01 and
      vmcs02. However it is not enough:
      
       - The L2 (kvm-unit-tests/eventinj.flat) generates NMI that will fault
         on IRET, so the L2 can generate #PF which can be intercepted by L0.
       - L0 walks L1's guest page table and sees the mapping is invalid, it
         resumes the L1 guest and injects the #PF into L1.  At this point the
         vmcs02 has nmi_known_unmasked=true.
       - L1 sets set bit 3 (blocking by NMI) in the interruptibility-state field
         of vmcs12 (and fixes the shadow page table) before resuming L2 guest.
       - L1 executes VMRESUME to resume L2, causing a vmexit to L0
       - during VMRESUME emulation, prepare_vmcs02 sets bit 3 in the
         interruptibility-state field of vmcs02, but nmi_known_unmasked is
         still true.
       - L2 immediately exits to L0 with another page fault, because L0 still has
         not updated the NGVA->HPA page tables.  However, nmi_known_unmasked is
         true so vmx_recover_nmi_blocking does not do anything.
      
      The fix is to update nmi_known_unmasked when preparing vmcs02 from vmcs12.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      2d6144e3
    • W
      KVM: nVMX: Fix posted intr delivery when vcpu is in guest mode · 06a5524f
      Wincy Van 提交于
      The PI vector for L0 and L1 must be different. If dest vcpu0
      is in guest mode while vcpu1 is delivering a non-nested PI to
      vcpu0, there wont't be any vmexit so that the non-nested interrupt
      will be delayed.
      Signed-off-by: NWincy Van <fanwenyi0529@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      06a5524f
    • P
      KVM: x86: do mask out upper bits of PAE CR3 · a512177e
      Paolo Bonzini 提交于
      This reverts the change of commit f85c758d,
      as the behavior it modified was intended.
      
      The VM is running in 32-bit PAE mode, and Table 4-7 of the Intel manual
      says:
      
      Table 4-7. Use of CR3 with PAE Paging
      Bit Position(s)	Contents
      4:0		Ignored
      31:5		Physical address of the 32-Byte aligned
      		page-directory-pointer table used for linear-address
      		translation
      63:32		Ignored (these bits exist only on processors supporting
      		the Intel-64 architecture)
      
      To placate the static checker, write the mask explicitly as an
      unsigned long constant instead of using a 32-bit unsigned constant.
      
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: f85c758dSigned-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a512177e
  5. 24 7月, 2017 1 次提交
  6. 20 7月, 2017 2 次提交
    • R
      kvm: x86: hyperv: avoid livelock in oneshot SynIC timers · f1ff89ec
      Roman Kagan 提交于
      If the SynIC timer message delivery fails due to SINT message slot being
      busy, there's no point to attempt starting the timer again until we're
      notified of the slot being released by the guest (via EOM or EOI).
      
      Even worse, when a oneshot timer fails to deliver its message, its
      re-arming with an expiration time in the past leads to immediate retry
      of the delivery, and so on, without ever letting the guest vcpu to run
      and release the slot, which results in a livelock.
      
      To avoid that, only start the timer when there's no timer message
      pending delivery.  When there is, meaning the slot is busy, the
      processing will be restarted upon notification from the guest that the
      slot is released.
      Signed-off-by: NRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      f1ff89ec
    • W
      KVM: VMX: Fix invalid guest state detection after task-switch emulation · f244deed
      Wanpeng Li 提交于
      This can be reproduced by EPT=1, unrestricted_guest=N, emulate_invalid_state=Y
      or EPT=0, the trace of kvm-unit-tests/taskswitch2.flat is like below, it tries
      to emulate invalid guest state task-switch:
      
      kvm_exit: reason TASK_SWITCH rip 0x0 info 40000058 0
      kvm_emulate_insn: 42000:0:0f 0b (0x2)
      kvm_emulate_insn: 42000:0:0f 0b (0x2) failed
      kvm_inj_exception: #UD (0x0)
      kvm_entry: vcpu 0
      kvm_exit: reason TASK_SWITCH rip 0x0 info 40000058 0
      kvm_emulate_insn: 42000:0:0f 0b (0x2)
      kvm_emulate_insn: 42000:0:0f 0b (0x2) failed
      kvm_inj_exception: #UD (0x0)
      ......................
      
      It appears that the task-switch emulation updates rflags (and vm86
      flag) only after the segments are loaded, causing vmx->emulation_required
      to be set, when in fact invalid guest state emulation is not needed.
      
      This patch fixes it by updating vmx->emulation_required after the
      rflags (and vm86 flag) is updated in task-switch emulation.
      
      Thanks Radim for moving the update to vmx__set_flags and adding Paolo's
      suggestion for the check.
      Suggested-by: NNadav Amit <nadav.amit@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      f244deed
  7. 19 7月, 2017 4 次提交
  8. 14 7月, 2017 5 次提交
  9. 13 7月, 2017 13 次提交