1. 17 11月, 2014 3 次提交
    • N
      KVM: x86: Fix lost interrupt on irr_pending race · f210f757
      Nadav Amit 提交于
      apic_find_highest_irr assumes irr_pending is set if any vector in APIC_IRR is
      set.  If this assumption is broken and apicv is disabled, the injection of
      interrupts may be deferred until another interrupt is delivered to the guest.
      Ultimately, if no other interrupt should be injected to that vCPU, the pending
      interrupt may be lost.
      
      commit 56cc2406 ("KVM: nVMX: fix "acknowledge interrupt on exit" when APICv
      is in use") changed the behavior of apic_clear_irr so irr_pending is cleared
      after setting APIC_IRR vector. After this commit, if apic_set_irr and
      apic_clear_irr run simultaneously, a race may occur, resulting in APIC_IRR
      vector set, and irr_pending cleared. In the following example, assume a single
      vector is set in IRR prior to calling apic_clear_irr:
      
      apic_set_irr				apic_clear_irr
      ------------				--------------
      apic->irr_pending = true;
      					apic_clear_vector(...);
      					vec = apic_search_irr(apic);
      					// => vec == -1
      apic_set_vector(...);
      					apic->irr_pending = (vec != -1);
      					// => apic->irr_pending == false
      
      Nonetheless, it appears the race might even occur prior to this commit:
      
      apic_set_irr				apic_clear_irr
      ------------				--------------
      apic->irr_pending = true;
      					apic->irr_pending = false;
      					apic_clear_vector(...);
      					if (apic_search_irr(apic) != -1)
      						apic->irr_pending = true;
      					// => apic->irr_pending == false
      apic_set_vector(...);
      
      Fixing this issue by:
      1. Restoring the previous behavior of apic_clear_irr: clear irr_pending, call
         apic_clear_vector, and then if APIC_IRR is non-zero, set irr_pending.
      2. On apic_set_irr: first call apic_set_vector, then set irr_pending.
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f210f757
    • P
      KVM: compute correct map even if all APICs are software disabled · a3e339e1
      Paolo Bonzini 提交于
      Logical destination mode can be used to send NMI IPIs even when all
      APICs are software disabled, so if all APICs are software disabled we
      should still look at the DFRs.
      
      So the DFRs should all be the same, even if some or all APICs are
      software disabled.  However, the SDM does not say this, so tweak
      the logic as follows:
      
      - if one APIC is enabled and has LDR != 0, use that one to build the map.
      This picks the right DFR in case an OS is only setting it for the
      software-enabled APICs, or in case an OS is using logical addressing
      on some APICs while leaving the rest in reset state (using LDR was
      suggested by Radim).
      
      - if all APICs are disabled, pick a random one to build the map.
      We use the last one with LDR != 0 for simplicity.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a3e339e1
    • N
      KVM: x86: Software disabled APIC should still deliver NMIs · 173beedc
      Nadav Amit 提交于
      Currently, the APIC logical map does not consider VCPUs whose local-apic is
      software-disabled.  However, NMIs, INIT, etc. should still be delivered to such
      VCPUs. Therefore, the APIC mode should first be determined, and then the map,
      considering all VCPUs should be constructed.
      
      To address this issue, first find the APIC mode, and only then construct the
      logical map.
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      173beedc
  2. 14 11月, 2014 1 次提交
  3. 13 11月, 2014 1 次提交
    • C
      kvm: svm: move WARN_ON in svm_adjust_tsc_offset · d913b904
      Chris J Arges 提交于
      When running the tsc_adjust kvm-unit-test on an AMD processor with the
      IA32_TSC_ADJUST feature enabled, the WARN_ON in svm_adjust_tsc_offset can be
      triggered. This WARN_ON checks for a negative adjustment in case __scale_tsc
      is called; however it may trigger unnecessary warnings.
      
      This patch moves the WARN_ON to trigger only if __scale_tsc will actually be
      called from svm_adjust_tsc_offset. In addition make adj in kvm_set_msr_common
      s64 since this can have signed values.
      Signed-off-by: NChris J Arges <chris.j.arges@canonical.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d913b904
  4. 12 11月, 2014 2 次提交
    • A
      x86, kvm, vmx: Don't set LOAD_IA32_EFER when host and guest match · 54b98bff
      Andy Lutomirski 提交于
      There's nothing to switch if the host and guest values are the same.
      I am unable to find evidence that this makes any difference
      whatsoever.
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      [I could see a difference on Nehalem.  From 5 runs:
      
       userspace exit, guest!=host   12200 11772 12130 12164 12327
       userspace exit, guest=host    11983 11780 11920 11919 12040
       lightweight exit, guest!=host  3214  3220  3238  3218  3337
       lightweight exit, guest=host   3178  3193  3193  3187  3220
      
       This passes the t-test with 99% confidence for userspace exit,
       98.5% confidence for lightweight exit. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      54b98bff
    • A
      x86, kvm, vmx: Always use LOAD_IA32_EFER if available · f6577a5f
      Andy Lutomirski 提交于
      At least on Sandy Bridge, letting the CPU switch IA32_EFER is much
      faster than switching it manually.
      
      I benchmarked this using the vmexit kvm-unit-test (single run, but
      GOAL multiplied by 5 to do more iterations):
      
      Test                                  Before      After    Change
      cpuid                                   2000       1932    -3.40%
      vmcall                                  1914       1817    -5.07%
      mov_from_cr8                              13         13     0.00%
      mov_to_cr8                                19         19     0.00%
      inl_from_pmtimer                       19164      10619   -44.59%
      inl_from_qemu                          15662      10302   -34.22%
      inl_from_kernel                         3916       3802    -2.91%
      outl_to_kernel                          2230       2194    -1.61%
      mov_dr                                   172        176     2.33%
      ipi                                (skipped)  (skipped)
      ipi+halt                           (skipped)  (skipped)
      ple-round-robin                           13         13     0.00%
      wr_tsc_adjust_msr                       1920       1845    -3.91%
      rd_tsc_adjust_msr                       1892       1814    -4.12%
      mmio-no-eventfd:pci-mem                16394      11165   -31.90%
      mmio-wildcard-eventfd:pci-mem           4607       4645     0.82%
      mmio-datamatch-eventfd:pci-mem          4601       4610     0.20%
      portio-no-eventfd:pci-io               11507       7942   -30.98%
      portio-wildcard-eventfd:pci-io          2239       2225    -0.63%
      portio-datamatch-eventfd:pci-io         2250       2234    -0.71%
      
      I haven't explicitly computed the significance of these numbers,
      but this isn't subtle.
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      [The results were reproducible on all of Nehalem, Sandy Bridge and
       Ivy Bridge.  The slowness of manual switching is because writing
       to EFER with WRMSR triggers a TLB flush, even if the only bit you're
       touching is SCE (so the page table format is not affected).  Doing
       the write as part of vmentry/vmexit, instead, does not flush the TLB,
       probably because all processors that have EPT also have VPID. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f6577a5f
  5. 10 11月, 2014 1 次提交
  6. 08 11月, 2014 7 次提交
  7. 07 11月, 2014 18 次提交
  8. 03 11月, 2014 7 次提交
    • T
      kvm: kvmclock: use get_cpu() and put_cpu() · c6338ce4
      Tiejun Chen 提交于
      We can use get_cpu() and put_cpu() to replace
      preempt_disable()/cpu = smp_processor_id() and
      preempt_enable() for slightly better code.
      Signed-off-by: NTiejun Chen <tiejun.chen@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c6338ce4
    • R
      KVM: x86: optimize some accesses to LVTT and SPIV · f30ebc31
      Radim Krčmář 提交于
      We mirror a subset of these registers in separate variables.
      Using them directly should be faster.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f30ebc31
    • R
      KVM: x86: detect LVTT changes under APICv · a323b409
      Radim Krčmář 提交于
      APIC-write VM exits are "trap-like": they save CS:RIP values for the
      instruction after the write, and more importantly, the handler will
      already see the new value in the virtual-APIC page.  This means that
      apic_reg_write cannot use kvm_apic_get_reg to omit timer cancelation
      when mode changes.
      
      timer_mode_mask shouldn't be changing as it depends on cpuid.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a323b409
    • R
      KVM: x86: detect SPIV changes under APICv · e462755c
      Radim Krčmář 提交于
      APIC-write VM exits are "trap-like": they save CS:RIP values for the
      instruction after the write, and more importantly, the handler will
      already see the new value in the virtual-APIC page.
      
      This caused a bug if you used KVM_SET_IRQCHIP to set the SW-enabled bit
      in the SPIV register.  The chain of events is as follows:
      
      * When the irqchip is added to the destination VM, the apic_sw_disabled
      static key is incremented (1)
      
      * When the KVM_SET_IRQCHIP ioctl is invoked, it is decremented (0)
      
      * When the guest disables the bit in the SPIV register, e.g. as part of
      shutdown, apic_set_spiv does not notice the change and the static key is
      _not_ incremented.
      
      * When the guest is destroyed, the static key is decremented (-1),
      resulting in this trace:
      
        WARNING: at kernel/jump_label.c:81 __static_key_slow_dec+0xa6/0xb0()
        jump label: negative count!
      
        [<ffffffff816bf898>] dump_stack+0x19/0x1b
        [<ffffffff8107c6f1>] warn_slowpath_common+0x61/0x80
        [<ffffffff8107c76c>] warn_slowpath_fmt+0x5c/0x80
        [<ffffffff811931e6>] __static_key_slow_dec+0xa6/0xb0
        [<ffffffff81193226>] static_key_slow_dec_deferred+0x16/0x20
        [<ffffffffa0637698>] kvm_free_lapic+0x88/0xa0 [kvm]
        [<ffffffffa061c63e>] kvm_arch_vcpu_uninit+0x2e/0xe0 [kvm]
        [<ffffffffa05ff301>] kvm_vcpu_uninit+0x21/0x40 [kvm]
        [<ffffffffa067cec7>] vmx_free_vcpu+0x47/0x70 [kvm_intel]
        [<ffffffffa061bc50>] kvm_arch_vcpu_free+0x50/0x60 [kvm]
        [<ffffffffa061ca22>] kvm_arch_destroy_vm+0x102/0x260 [kvm]
        [<ffffffff810b68fd>] ? synchronize_srcu+0x1d/0x20
        [<ffffffffa06030d1>] kvm_put_kvm+0xe1/0x1c0 [kvm]
        [<ffffffffa06036f8>] kvm_vcpu_release+0x18/0x20 [kvm]
        [<ffffffff81215c62>] __fput+0x102/0x310
        [<ffffffff81215f4e>] ____fput+0xe/0x10
        [<ffffffff810ab664>] task_work_run+0xb4/0xe0
        [<ffffffff81083944>] do_exit+0x304/0xc60
        [<ffffffff816c8dfc>] ? _raw_spin_unlock_irq+0x2c/0x50
        [<ffffffff810fd22d>] ?  trace_hardirqs_on_caller+0xfd/0x1c0
        [<ffffffff8108432c>] do_group_exit+0x4c/0xc0
        [<ffffffff810843b4>] SyS_exit_group+0x14/0x20
        [<ffffffff816d33a9>] system_call_fastpath+0x16/0x1b
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e462755c
    • C
      KVM: x86: Enable Intel AVX-512 for guest · 612263b3
      Chao Peng 提交于
      Expose Intel AVX-512 feature bits to guest. Also add checks for
      xcr0 AVX512 related bits according to spec:
      http://download-software.intel.com/sites/default/files/managed/71/2e/319433-017.pdfSigned-off-by: NChao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      612263b3
    • R
      KVM: x86: fix deadline tsc interrupt injection · 1e0ad70c
      Radim Krčmář 提交于
      The check in kvm_set_lapic_tscdeadline_msr() was trying to prevent a
      situation where we lose a pending deadline timer in a MSR write.
      Losing it is fine, because it effectively occurs before the timer fired,
      so we should be able to cancel or postpone it.
      
      Another problem comes from interaction with QEMU, or other userspace
      that can set deadline MSR without a good reason, when timer is already
      pending:  one guest's deadline request results in more than one
      interrupt because one is injected immediately on MSR write from
      userspace and one through hrtimer later.
      
      The solution is to remove the injection when replacing a pending timer
      and to improve the usual QEMU path, we inject without a hrtimer when the
      deadline has already passed.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Reported-by: NNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1e0ad70c
    • R
      KVM: x86: add apic_timer_expired() · 5d87db71
      Radim Krčmář 提交于
      Make the code reusable.
      
      If the timer was already pending, we shouldn't be waiting in a queue,
      so wake_up can be skipped, simplifying the path.
      
      There is no 'reinject' case => the comment is removed.
      Current race behaves correctly.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5d87db71