1. 11 7月, 2014 6 次提交
    • P
      KVM: x86: use kvm_read_guest_page for emulator accesses · 44583cba
      Paolo Bonzini 提交于
      Emulator accesses are always done a page at a time, either by the emulator
      itself (for fetches) or because we need to query the MMU for address
      translations.  Speed up these accesses by using kvm_read_guest_page
      and, in the case of fetches, by inlining kvm_read_guest_virt_helper and
      dropping the loop around kvm_read_guest_page.
      
      This final tweak saves 30-100 more clock cycles (4-10%), bringing the
      count (as measured by kvm-unit-tests) down to 720-1100 clock cycles on
      a Sandy Bridge Xeon host, compared to 2300-3200 before the whole series
      and 925-1700 after the first two low-hanging fruit changes.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      44583cba
    • B
      KVM: emulate: move init_decode_cache to emulate.c · 1498507a
      Bandan Das 提交于
      Core emulator functions all belong in emulator.c,
      x86 should have no knowledge of emulator internals
      Signed-off-by: NBandan Das <bsd@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1498507a
    • P
      KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation · 6addfc42
      Paolo Bonzini 提交于
      Despite the provisions to emulate up to 130 consecutive instructions, in
      practice KVM will emulate just one before exiting handle_invalid_guest_state,
      because x86_emulate_instruction always sets KVM_REQ_EVENT.
      
      However, we only need to do this if an interrupt could be injected,
      which happens a) if an interrupt shadow bit (STI or MOV SS) has gone
      away; b) if the interrupt flag has just been set (other instructions
      than STI can set it without enabling an interrupt shadow).
      
      This cuts another 700-900 cycles from the cost of emulating an
      instruction (measured on a Sandy Bridge Xeon: 1650-2600 cycles
      before the patch on kvm-unit-tests, 925-1700 afterwards).
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6addfc42
    • P
      KVM: x86: return all bits from get_interrupt_shadow · 37ccdcbe
      Paolo Bonzini 提交于
      For the next patch we will need to know the full state of the
      interrupt shadow; we will then set KVM_REQ_EVENT when one bit
      is cleared.
      
      However, right now get_interrupt_shadow only returns the one
      corresponding to the emulated instruction, or an unconditional
      0 if the emulated instruction does not have an interrupt shadow.
      This is confusing and does not allow us to check for cleared
      bits as mentioned above.
      
      Clean the callback up, and modify toggle_interruptibility to
      match the comment above the call.  As a small result, the
      call to set_interrupt_shadow will be skipped in the common
      case where int_shadow == 0 && mask == 0.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      37ccdcbe
    • M
      KVM: svm: writes to MSR_K7_HWCR generates GPE in guest · 22d48b2d
      Matthias Lange 提交于
      Since commit 575203 the MCE subsystem in the Linux kernel for AMD sets bit 18
      in MSR_K7_HWCR. Running such a kernel as a guest in KVM on an AMD host results
      in a GPE injected into the guest because kvm_set_msr_common returns 1. This
      patch fixes this by masking bit 18 from the MSR value desired by the guest.
      Signed-off-by: NMatthias Lange <matthias.lange@kernkonzept.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      22d48b2d
    • N
      KVM: x86: Pending interrupt may be delivered after INIT · 5f7552d4
      Nadav Amit 提交于
      We encountered a scenario in which after an INIT is delivered, a pending
      interrupt is delivered, although it was sent before the INIT.  As the SDM
      states in section 10.4.7.1, the ISR and the IRR should be cleared after INIT as
      KVM does.  This also means that pending interrupts should be cleared.  This
      patch clears upon reset (and INIT) the pending interrupts; and at the same
      occassion clears the pending exceptions, since they may cause a similar issue.
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5f7552d4
  2. 10 7月, 2014 1 次提交
    • T
      KVM: x86: fix TSC matching · 0d3da0d2
      Tomasz Grabiec 提交于
      I've observed kvmclock being marked as unstable on a modern
      single-socket system with a stable TSC and qemu-1.6.2 or qemu-2.0.0.
      
      The culprit was failure in TSC matching because of overflow of
      kvm_arch::nr_vcpus_matched_tsc in case there were multiple TSC writes
      in a single synchronization cycle.
      
      Turns out that qemu does multiple TSC writes during init, below is the
      evidence of that (qemu-2.0.0):
      
      The first one:
      
       0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
       0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
       0xffffffffa04cfd6b : kvm_arch_vcpu_postcreate+0x4b/0x80 [kvm]
       0xffffffffa04b8188 : kvm_vm_ioctl+0x418/0x750 [kvm]
      
      The second one:
      
       0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
       0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
       0xffffffffa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel]
       0xffffffffa04be83b : do_set_msr+0x3b/0x60 [kvm]
       0xffffffffa04c10a8 : msr_io+0xc8/0x160 [kvm]
       0xffffffffa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm]
       0xffffffffa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm]
      
       #0  kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780
       #1  kvm_put_msrs at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270
       #2  kvm_arch_put_registers at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909
       #3  kvm_cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1641
       #4  cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:330
       #5  cpu_synchronize_all_post_init () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:521
       #6  main at /build/buildd/qemu-2.0.0+dfsg/vl.c:4390
      
      The third one:
      
       0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
       0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
       0xffffffffa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel]
       0xffffffffa04be83b : do_set_msr+0x3b/0x60 [kvm]
       0xffffffffa04c10a8 : msr_io+0xc8/0x160 [kvm]
       0xffffffffa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm]
       0xffffffffa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm]
      
       #0  kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780
       #1  kvm_put_msrs  at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270
       #2  kvm_arch_put_registers  at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909
       #3  kvm_cpu_synchronize_post_reset  at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1635
       #4  cpu_synchronize_post_reset  at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:323
       #5  cpu_synchronize_all_post_reset () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:512
       #6  main  at /build/buildd/qemu-2.0.0+dfsg/vl.c:4482
      
      The fix is to count each vCPU only once when matched, so that
      nr_vcpus_matched_tsc holds the size of the matched set. This is
      achieved by reusing generation counters. Every vCPU with
      this_tsc_generation == cur_tsc_generation is in the matched set. The
      match set is cleared by setting cur_tsc_generation to a value which no
      other vCPU is set to (by incrementing it).
      
      I needed to bump up the counter size form u8 to u64 to ensure it never
      overflows. Otherwise in cases TSC is not written the same number of
      times on each vCPU the counter could overflow and incorrectly indicate
      some vCPUs as being in the matched set. This scenario seems unlikely
      but I'm not sure if it can be disregarded.
      Signed-off-by: NTomasz Grabiec <tgrabiec@cloudius-systems.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0d3da0d2
  3. 30 6月, 2014 1 次提交
  4. 19 6月, 2014 1 次提交
  5. 18 6月, 2014 1 次提交
  6. 22 5月, 2014 1 次提交
  7. 14 5月, 2014 1 次提交
  8. 13 5月, 2014 1 次提交
  9. 08 5月, 2014 1 次提交
    • M
      kvm/x86: implement hv EOI assist · b63cf42f
      Michael S. Tsirkin 提交于
      It seems that it's easy to implement the EOI assist
      on top of the PV EOI feature: simply convert the
      page address to the format expected by PV EOI.
      
      Notes:
      -"No EOI required" is set only if interrupt injected
       is edge triggered; this is true because level interrupts are going
       through IOAPIC which disables PV EOI.
       In any case, if guest triggers EOI the bit will get cleared on exit.
      -For migration, set of HV_X64_MSR_APIC_ASSIST_PAGE sets
       KVM_PV_EOI_EN internally, so restoring HV_X64_MSR_APIC_ASSIST_PAGE
       seems sufficient
       In any case, bit is cleared on exit so worst case it's never re-enabled
      -no handling of PV EOI data is performed at HV_X64_MSR_EOI write;
       HV_X64_MSR_EOI is a separate optimization - it's an X2APIC
       replacement that lets you do EOI with an MSR and not IO.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b63cf42f
  10. 06 5月, 2014 2 次提交
  11. 24 4月, 2014 4 次提交
  12. 18 4月, 2014 1 次提交
    • M
      KVM: support any-length wildcard ioeventfd · f848a5a8
      Michael S. Tsirkin 提交于
      It is sometimes benefitial to ignore IO size, and only match on address.
      In hindsight this would have been a better default than matching length
      when KVM_IOEVENTFD_FLAG_DATAMATCH is not set, In particular, this kind
      of access can be optimized on VMX: there no need to do page lookups.
      This can currently be done with many ioeventfds but in a suboptimal way.
      
      However we can't change kernel/userspace ABI without risk of breaking
      some applications.
      Use len = 0 to mean "ignore length for matching" in a more optimal way.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      f848a5a8
  13. 15 4月, 2014 2 次提交
  14. 27 3月, 2014 1 次提交
  15. 20 3月, 2014 1 次提交
    • S
      x86, kvm: Fix CPU hotplug callback registration · 460dd42e
      Srivatsa S. Bhat 提交于
      Subsystems that want to register CPU hotplug callbacks, as well as perform
      initialization for the CPUs that are already online, often do it as shown
      below:
      
      	get_online_cpus();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	register_cpu_notifier(&foobar_cpu_notifier);
      
      	put_online_cpus();
      
      This is wrong, since it is prone to ABBA deadlocks involving the
      cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
      with CPU hotplug operations).
      
      Instead, the correct and race-free way of performing the callback
      registration is:
      
      	cpu_notifier_register_begin();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	/* Note the use of the double underscored version of the API */
      	__register_cpu_notifier(&foobar_cpu_notifier);
      
      	cpu_notifier_register_done();
      
      Fix the kvm code in x86 by using this latter form of callback registration.
      
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      460dd42e
  16. 17 3月, 2014 2 次提交
    • P
      KVM: x86: handle missing MPX in nested virtualization · 93c4adc7
      Paolo Bonzini 提交于
      When doing nested virtualization, we may be able to read BNDCFGS but
      still not be allowed to write to GUEST_BNDCFGS in the VMCS.  Guard
      writes to the field with vmx_mpx_supported(), and similarly hide the
      MSR from userspace if the processor does not support the field.
      
      We could work around this with the generic MSR save/load machinery,
      but there is only a limited number of MSR save/load slots and it is
      not really worthwhile to waste one for a scenario that should not
      happen except in the nested virtualization case.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      93c4adc7
    • P
      KVM: x86: introduce kvm_supported_xcr0() · 4ff41732
      Paolo Bonzini 提交于
      XSAVE support for KVM is already using host_xcr0 & KVM_SUPPORTED_XCR0 as
      a "dynamic" version of KVM_SUPPORTED_XCR0.
      
      However, this is not enough because the MPX bits should not be presented
      to the guest unless kvm_x86_ops confirms the support.  So, replace all
      instances of host_xcr0 & KVM_SUPPORTED_XCR0 with a new function
      kvm_supported_xcr0() that also has this check.
      
      Note that here:
      
      		if (xstate_bv & ~KVM_SUPPORTED_XCR0)
      			return -EINVAL;
      		if (xstate_bv & ~host_cr0)
      			return -EINVAL;
      
      the code is equivalent to
      
      		if ((xstate_bv & ~KVM_SUPPORTED_XCR0) ||
      		    (xstate_bv & ~host_cr0)
      			return -EINVAL;
      
      i.e. "xstate_bv & (~KVM_SUPPORTED_XCR0 | ~host_cr0)" which is in turn
      equal to "xstate_bv & ~(KVM_SUPPORTED_XCR0 & host_cr0)".  So we should
      also use the new function there.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4ff41732
  17. 13 3月, 2014 1 次提交
    • G
      kvm: x86: ignore ioapic polarity · 100943c5
      Gabriel L. Somlo 提交于
      Both QEMU and KVM have already accumulated a significant number of
      optimizations based on the hard-coded assumption that ioapic polarity
      will always use the ActiveHigh convention, where the logical and
      physical states of level-triggered irq lines always match (i.e.,
      active(asserted) == high == 1, inactive == low == 0). QEMU guests
      are expected to follow directions given via ACPI and configure the
      ioapic with polarity 0 (ActiveHigh). However, even when misbehaving
      guests (e.g. OS X <= 10.9) set the ioapic polarity to 1 (ActiveLow),
      QEMU will still use the ActiveHigh signaling convention when
      interfacing with KVM.
      
      This patch modifies KVM to completely ignore ioapic polarity as set by
      the guest OS, enabling misbehaving guests to work alongside those which
      comply with the ActiveHigh polarity specified by QEMU's ACPI tables.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NGabriel L. Somlo <somlo@cmu.edu>
      [Move documentation to KVM_IRQ_LINE, add ia64. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      100943c5
  18. 11 3月, 2014 4 次提交
    • P
      KVM: x86: Allow the guest to run with dirty debug registers · c77fb5fe
      Paolo Bonzini 提交于
      When not running in guest-debug mode, the guest controls the debug
      registers and having to take an exit for each DR access is a waste
      of time.  If the guest gets into a state where each context switch
      causes DR to be saved and restored, this can take away as much as 40%
      of the execution time from the guest.
      
      After this patch, VMX- and SVM-specific code can set a flag in
      switch_db_regs, telling vcpu_enter_guest that on the next exit the debug
      registers might be dirty and need to be reloaded (syncing will be taken
      care of by a new callback in kvm_x86_ops).  This flag can be set on the
      first access to a debug registers, so that multiple accesses to the
      debug registers only cause one vmexit.
      
      Note that since the guest will be able to read debug registers and
      enable breakpoints in DR7, we need to ensure that they are synchronized
      on entry to the guest---including DR6 that was not synced before.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c77fb5fe
    • P
      KVM: x86: change vcpu->arch.switch_db_regs to a bit mask · 360b948d
      Paolo Bonzini 提交于
      The next patch will add another bit that we can test with the
      same "if".
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      360b948d
    • J
      KVM: x86: Remove return code from enable_irq/nmi_window · c9a7953f
      Jan Kiszka 提交于
      It's no longer possible to enter enable_irq_window in guest mode when
      L1 intercepts external interrupts and we are entering L2. This is now
      caught in vcpu_enter_guest. So we can remove the check from the VMX
      version of enable_irq_window, thus the need to return an error code from
      both enable_irq_window and enable_nmi_window.
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c9a7953f
    • J
      KVM: nVMX: Rework interception of IRQs and NMIs · b6b8a145
      Jan Kiszka 提交于
      Move the check for leaving L2 on pending and intercepted IRQs or NMIs
      from the *_allowed handler into a dedicated callback. Invoke this
      callback at the relevant points before KVM checks if IRQs/NMIs can be
      injected. The callback has the task to switch from L2 to L1 if needed
      and inject the proper vmexit events.
      
      The rework fixes L2 wakeups from HLT and provides the foundation for
      preemption timer emulation.
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b6b8a145
  19. 04 3月, 2014 2 次提交
    • A
      x86: kvm: introduce periodic global clock updates · 332967a3
      Andrew Jones 提交于
      commit 0061d53d introduced a mechanism to execute a global clock
      update for a vm. We can apply this periodically in order to propagate
      host NTP corrections. Also, if all vcpus of a vm are pinned, then
      without an additional trigger, no guest NTP corrections can propagate
      either, as the current trigger is only vcpu cpu migration.
      Signed-off-by: NAndrew Jones <drjones@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      332967a3
    • A
      x86: kvm: rate-limit global clock updates · 7e44e449
      Andrew Jones 提交于
      When we update a vcpu's local clock it may pick up an NTP correction.
      We can't wait an indeterminate amount of time for other vcpus to pick
      up that correction, so commit 0061d53d introduced a global clock
      update. However, we can't request a global clock update on every vcpu
      load either (which is what happens if the tsc is marked as unstable).
      The solution is to rate-limit the global clock updates. Marcelo
      calculated that we should delay the global clock updates no more
      than 0.1s as follows:
      
      Assume an NTP correction c is applied to one vcpu, but not the other,
      then in n seconds the delta of the vcpu system_timestamps will be
      c * n. If we assume a correction of 500ppm (worst-case), then the two
      vcpus will diverge 50us in 0.1s, which is a considerable amount.
      Signed-off-by: NAndrew Jones <drjones@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7e44e449
  20. 28 2月, 2014 2 次提交
    • A
      kvm: x86: fix emulator buffer overflow (CVE-2014-0049) · a08d3b3b
      Andrew Honig 提交于
      The problem occurs when the guest performs a pusha with the stack
      address pointing to an mmio address (or an invalid guest physical
      address) to start with, but then extending into an ordinary guest
      physical address.  When doing repeated emulated pushes
      emulator_read_write sets mmio_needed to 1 on the first one.  On a
      later push when the stack points to regular memory,
      mmio_nr_fragments is set to 0, but mmio_is_needed is not set to 0.
      
      As a result, KVM exits to userspace, and then returns to
      complete_emulated_mmio.  In complete_emulated_mmio
      vcpu->mmio_cur_fragment is incremented.  The termination condition of
      vcpu->mmio_cur_fragment == vcpu->mmio_nr_fragments is never achieved.
      The code bounces back and fourth to userspace incrementing
      mmio_cur_fragment past it's buffer.  If the guest does nothing else it
      eventually leads to a a crash on a memcpy from invalid memory address.
      
      However if a guest code can cause the vm to be destroyed in another
      vcpu with excellent timing, then kvm_clear_async_pf_completion_queue
      can be used by the guest to control the data that's pointed to by the
      call to cancel_work_item, which can be used to gain execution.
      
      Fixes: f78146b0Signed-off-by: NAndrew Honig <ahonig@google.com>
      Cc: stable@vger.kernel.org (3.5+)
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a08d3b3b
    • T
      KVM: x86: Break kvm_for_each_vcpu loop after finding the VP_INDEX · 684851a1
      Takuya Yoshikawa 提交于
      No need to scan the entire VCPU array.
      Signed-off-by: NTakuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      684851a1
  21. 26 2月, 2014 3 次提交
  22. 22 2月, 2014 1 次提交