1. 14 Jul, 2016 (1 commit)
  2. 16 Jun, 2016 (1 commit)
    • KVM: x86: support using the vmx preemption timer for tsc deadline timer · ce7a058a
      Authored by Yunhong Jiang
      The VMX preemption timer can be used to virtualize the TSC deadline timer.
      The VMX preemption timer is armed when the vCPU is running, and a VMExit
      will happen if the virtual TSC deadline timer expires.
      
      When the vCPU thread is blocked because of HLT, KVM will switch to use
      an hrtimer, and then go back to the VMX preemption timer when the vCPU
      thread is unblocked.
      
      This solution avoids the complexity of the OS hrtimer system and the
      cost of host timer interrupt handling, replacing them with a little
      math (for guest->host TSC and host TSC->preemption timer conversion)
      and a cheaper VM exit.  This benefits latency on isolated pCPUs.
      
      [A word about performance... Yunhong reported a 30% reduction in average
       latency from cyclictest.  I made a similar test with tscdeadline_latency
       from kvm-unit-tests, and measured
      
       - ~20 clock cycles loss (out of ~3200, so less than 1% but still
         statistically significant) in the worst case where the test halts
         just after programming the TSC deadline timer
      
       - ~800 clock cycles gain (25% reduction in latency) in the best case
         where the test busy waits.
      
       I removed the VMX bits from Yunhong's patch, to concentrate them in the
       next patch - Paolo]
      Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ce7a058a
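
      The conversion math can be pictured with a small standalone sketch.
      This is not the kernel code: the helper names and the 48-bit
      fixed-point TSC-ratio convention are assumptions for illustration.
      The guest-cycle delta to the deadline is scaled back to host TSC
      cycles, and the preemption timer is armed with that value shifted
      right by the rate exponent the CPU advertises in IA32_VMX_MISC
      bits 4:0.

        /* Minimal sketch, assuming a 48-bit fixed-point guest/host TSC ratio. */
        #include <stdint.h>
        #include <stdio.h>

        static uint64_t guest_delta_to_host_cycles(uint64_t guest_deadline,
                                                   uint64_t guest_tsc_now,
                                                   uint64_t tsc_ratio_frac48)
        {
            if (guest_deadline <= guest_tsc_now)
                return 0;                       /* deadline already passed */
            __uint128_t delta = guest_deadline - guest_tsc_now;
            /* invert the guest scaling: host cycles = guest cycles / ratio */
            return (uint64_t)((delta << 48) / tsc_ratio_frac48);
        }

        /* The preemption timer counts down at the TSC rate divided by
         * 2^rate_shift, where rate_shift comes from IA32_VMX_MISC bits 4:0. */
        static uint64_t host_cycles_to_preemption_ticks(uint64_t host_cycles,
                                                        unsigned int rate_shift)
        {
            return host_cycles >> rate_shift;
        }

        int main(void)
        {
            /* example: 1:1 TSC ratio, divider 2^5 advertised by the CPU */
            uint64_t ticks = host_cycles_to_preemption_ticks(
                    guest_delta_to_host_cycles(1000000, 400000, 1ULL << 48), 5);
            printf("arm the preemption timer with %llu ticks\n",
                   (unsigned long long)ticks);
            return 0;
        }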
  3. 19 May, 2016 (3 commits)
  4. 03 Mar, 2016 (1 commit)
  5. 09 Feb, 2016 (2 commits)
  6. 26 Nov, 2015 (2 commits)
    • kvm/x86: Hyper-V synthetic interrupt controller · 5c919412
      Authored by Andrey Smetanin
      SynIC (synthetic interrupt controller) is a lapic extension,
      which is controlled via MSRs and maintains for each vCPU
       - 16 synthetic interrupt "lines" (SINTs); each can be configured to
         trigger a specific interrupt vector optionally with auto-EOI
         semantics
       - a message page in the guest memory with 16 256-byte per-SINT message
         slots
       - an event flag page in the guest memory with 16 2048-bit per-SINT
         event flag areas
      
      The host triggers a SINT whenever it delivers a new message to the
      corresponding slot or flips an event flag bit in the corresponding area.
      The guest informs the host that it can try delivering a message by
      explicitly asserting EOI in lapic or writing to End-Of-Message (EOM)
      MSR.
      
      Userspace (QEMU) triggers interrupts and receives EOM notifications
      via irqfd with resampler; for that, a GSI is allocated for each
      configured SINT, and the irq_routing API is extended to support
      GSI-SINT mapping.
      
      Changes v4:
      * added activation of SynIC by vcpu KVM_ENABLE_CAP
      * added per SynIC active flag
      * added deactivation of APICv upon SynIC activation
      
      Changes v3:
      * added KVM_CAP_HYPERV_SYNIC and KVM_IRQ_ROUTING_HV_SINT notes into
      docs
      
      Changes v2:
      * do not use posted interrupts for Hyper-V SynIC AutoEOI vectors
      * add Hyper-V SynIC vectors into EOI exit bitmap
      * Hyper-V SynIC SINT MSR write logic simplified
      Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: Denis V. Lunev <den@openvz.org>
      CC: Gleb Natapov <gleb@kernel.org>
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Roman Kagan <rkagan@virtuozzo.com>
      CC: Denis V. Lunev <den@openvz.org>
      CC: qemu-devel@nongnu.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5c919412
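
      The guest-page layout spelled out above can be written down as a pair
      of C structs.  This is a sketch that follows only the sizes given in
      the commit message; the field names are not the Hyper-V TLFS
      definitions.

        #include <assert.h>
        #include <stdint.h>

        #define SYNIC_SINT_COUNT      16
        #define SYNIC_MSG_SLOT_SIZE   256
        #define SYNIC_EVT_FLAGS_BITS  2048

        struct synic_message_slot {
            uint8_t raw[SYNIC_MSG_SLOT_SIZE];          /* one 256-byte message */
        };

        struct synic_message_page {
            struct synic_message_slot slot[SYNIC_SINT_COUNT];
        };

        struct synic_event_flags_area {
            uint64_t flags[SYNIC_EVT_FLAGS_BITS / 64]; /* 2048 bits per SINT */
        };

        struct synic_event_flags_page {
            struct synic_event_flags_area area[SYNIC_SINT_COUNT];
        };

        int main(void)
        {
            /* both pages fit exactly in one 4 KiB guest page */
            assert(sizeof(struct synic_message_page) == 4096);
            assert(sizeof(struct synic_event_flags_page) == 4096);
            return 0;
        }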
    • kvm/x86: per-vcpu apicv deactivation support · d62caabb
      Authored by Andrey Smetanin
      The decision on whether to use hardware APIC virtualization used to be
      taken globally, based on the availability of the feature in the CPU
      and the value of a module parameter.
      
      However, under certain circumstances we want to control it on a
      per-vcpu basis.  In particular, when userspace activates the Hyper-V
      synthetic interrupt controller (SynIC), APICv has to be disabled
      because it is incompatible with SynIC's auto-EOI behavior.
      
      To achieve that, introduce 'apicv_active' flag on struct
      kvm_vcpu_arch, and kvm_vcpu_deactivate_apicv() function to turn APICv
      off.  The flag is initialized based on the module parameter and CPU
      capability, and consulted whenever an APICv-specific action is
      performed.
      Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: Denis V. Lunev <den@openvz.org>
      CC: Gleb Natapov <gleb@kernel.org>
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Roman Kagan <rkagan@virtuozzo.com>
      CC: Denis V. Lunev <den@openvz.org>
      CC: qemu-devel@nongnu.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d62caabb
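
      A toy model of the mechanism described above, in plain C: the per-vcpu
      flag is seeded from the module parameter and the CPU capability, APICv
      paths consult it, and SynIC activation clears it.  The struct and
      function names here are made up for illustration, not the KVM code.

        #include <stdbool.h>
        #include <stdio.h>

        static bool enable_apicv_param = true;   /* module parameter stand-in */
        static bool cpu_has_apicv = true;        /* hardware capability stand-in */

        struct toy_vcpu {
            bool apicv_active;
        };

        static void toy_vcpu_init(struct toy_vcpu *v)
        {
            /* seeded from the module parameter and the CPU capability */
            v->apicv_active = enable_apicv_param && cpu_has_apicv;
        }

        static void toy_deactivate_apicv(struct toy_vcpu *v)
        {
            /* e.g. when userspace enables Hyper-V SynIC with auto-EOI */
            v->apicv_active = false;
            /* the real code would also refresh the VMCS/VMCB controls here */
        }

        static void toy_deliver_interrupt(const struct toy_vcpu *v)
        {
            if (v->apicv_active)
                printf("use hardware APIC virtualization\n");
            else
                printf("use the legacy software injection path\n");
        }

        int main(void)
        {
            struct toy_vcpu v;
            toy_vcpu_init(&v);
            toy_deliver_interrupt(&v);
            toy_deactivate_apicv(&v);
            toy_deliver_interrupt(&v);
            return 0;
        }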
  7. 01 Oct, 2015 (3 commits)
  8. 23 Jul, 2015 (1 commit)
  9. 04 Jul, 2015 (1 commit)
  10. 04 Jun, 2015 (2 commits)
  11. 07 May, 2015 (2 commits)
    • kvm: x86: Deliver MSI IRQ to only lowest prio cpu if msi_redir_hint is true · d1ebdbf9
      Authored by James Sullivan
      An MSI interrupt should only be delivered to the lowest priority CPU
      when it has RH=1, regardless of the delivery mode. Modified
      kvm_is_dm_lowest_prio() to check for either irq->delivery_mode == APIC_DM_LOWPRI
      or irq->msi_redir_hint.
      
      Moved kvm_is_dm_lowest_prio() into lapic.h and renamed to
      kvm_lowest_prio_delivery().
      
      Changed a check in kvm_irq_delivery_to_apic_fast() from
      irq->delivery_mode == APIC_DM_LOWPRI to kvm_is_dm_lowest_prio().
      Signed-off-by: James Sullivan <sullivan.james.f@gmail.com>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d1ebdbf9
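
      The predicate described above fits in a few lines.  The sketch below is
      a standalone stand-in, not the KVM definitions: the struct, the constant
      value and the function name are assumptions for illustration.

        #include <stdbool.h>

        #define TOY_DM_LOWEST_PRIO  0x1   /* illustrative delivery-mode value */

        struct toy_lapic_irq {
            unsigned int delivery_mode;
            bool msi_redir_hint;          /* RH bit taken from the MSI address */
        };

        /* lowest-priority delivery: either the delivery mode says so, or the
         * MSI redirection hint is set (RH=1), regardless of the mode */
        static bool toy_lowest_prio_delivery(const struct toy_lapic_irq *irq)
        {
            return irq->delivery_mode == TOY_DM_LOWEST_PRIO ||
                   irq->msi_redir_hint;
        }

        int main(void)
        {
            struct toy_lapic_irq fixed_with_rh = {
                .delivery_mode = 0x0,     /* fixed delivery mode */
                .msi_redir_hint = true,   /* RH=1: still lowest priority */
            };
            return toy_lowest_prio_delivery(&fixed_with_rh) ? 0 : 1;
        }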
    • KVM: x86: INIT and reset sequences are different · d28bc9dd
      Authored by Nadav Amit
      The x86 architecture defines differences between the reset and INIT
      sequences.  INIT does not initialize the FPU (including MMX, XMM, YMM,
      etc.), TSC, PMU, MSRs (in general), MTRRs, machine-check registers,
      APIC ID, APIC arbitration ID, or the BSP flag.
      
      References (from Intel SDM):
      
      "If the MP protocol has completed and a BSP is chosen, subsequent INITs (either
      to a specific processor or system wide) do not cause the MP protocol to be
      repeated." [8.4.2: MP Initialization Protocol Requirements and Restrictions]
      
      [Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT]
      
      "If the processor is reset by asserting the INIT# pin, the x87 FPU state is not
      changed." [9.2: X87 FPU INITIALIZATION]
      
      "The state of the local APIC following an INIT reset is the same as it is after
      a power-up or hardware reset, except that the APIC ID and arbitration ID
      registers are not affected." [10.4.7.3: Local APIC State After an INIT Reset
      ("Wait-for-SIPI" State)]
      Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1428924848-28212-1-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d28bc9dd
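
      The state split listed above can be pictured as one reset routine with
      an init_event flag; state that INIT must preserve is only touched on a
      real power-up/RESET.  This is a sketch with invented names, not the KVM
      reset code.

        #include <stdbool.h>
        #include <stdio.h>

        static void toy_reset_common(void)
        {
            printf("reset GPRs, segments, control registers, ...\n");
        }

        static void toy_reset_on_power_up_only(void)
        {
            printf("reset FPU/MMX/XMM/YMM, TSC, PMU, MSRs, MTRRs, "
                   "APIC ID, arbitration ID, BSP flag\n");
        }

        static void toy_vcpu_reset(bool init_event)
        {
            toy_reset_common();               /* done for RESET and INIT */
            if (!init_event)
                toy_reset_on_power_up_only(); /* skipped when handling INIT */
        }

        int main(void)
        {
            toy_vcpu_reset(false);   /* power-up or RESET */
            toy_vcpu_reset(true);    /* INIT: preserves the state above */
            return 0;
        }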
  12. 08 Apr, 2015 (1 commit)
    • KVM: x86: simplify kvm_apic_map · 3b5a5ffa
      Authored by Radim Krčmář
      recalculate_apic_map() uses two passes over all VCPUs.  This is a relic
      from the time when we selected a global mode in the first pass and set
      up the optimized table in the second pass (to have a consistent mode).
      
      Recent changes made mixed mode unoptimized, so we can do it in one pass.
      The format of the logical MDA is a function of the mode, so we encode it
      in apic_logical_id() and drop the obsolete variables from the struct.
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1423766494-26150-5-git-send-email-rkrcmar@redhat.com>
      [Add lid_bits temporary in apic_logical_id. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3b5a5ffa
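
      The "format of the logical MDA is a function of the mode" point can be
      illustrated with a standalone decomposition of the LDR into a (cluster,
      mask) pair; the enum and function below are illustrative, not the kernel
      apic_logical_id() implementation.

        #include <stdint.h>
        #include <stdio.h>

        enum toy_logical_mode { TOY_FLAT, TOY_CLUSTER, TOY_X2APIC };

        static void toy_decompose_ldr(enum toy_logical_mode mode, uint32_t ldr,
                                      unsigned int *cluster, unsigned int *mask)
        {
            switch (mode) {
            case TOY_FLAT:      /* xAPIC flat: 8-bit bitmap, single cluster */
                *cluster = 0;
                *mask = (ldr >> 24) & 0xff;
                break;
            case TOY_CLUSTER:   /* xAPIC cluster: 4-bit cluster, 4-bit bitmap */
                *cluster = (ldr >> 28) & 0xf;
                *mask = (ldr >> 24) & 0xf;
                break;
            case TOY_X2APIC:    /* x2APIC: 16-bit cluster, 16-bit bitmap */
                *cluster = ldr >> 16;
                *mask = ldr & 0xffff;
                break;
            }
        }

        int main(void)
        {
            unsigned int cluster, mask;
            toy_decompose_ldr(TOY_X2APIC, 0x00030004, &cluster, &mask);
            printf("x2APIC LDR 0x00030004 -> cluster %u, mask 0x%x\n",
                   cluster, mask);
            return 0;
        }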
  13. 27 Mar, 2015 (1 commit)
  14. 04 Feb, 2015 (1 commit)
    • KVM: nVMX: Enable nested posted interrupt processing · 705699a1
      Authored by Wincy Van
      If a vcpu has an interrupt pending in VMX non-root mode, injecting that
      interrupt requires a vmexit.  With posted interrupt processing, the
      vmexit is not needed, and interrupts are fully taken care of by hardware.
      In nested VMX, this feature saves many more vmexits than in non-nested
      VMX.
      
      When L1 asks L0 to deliver L1's posted interrupt vector, and the target
      VCPU is in non-root mode, we use a physical ipi to deliver POSTED_INTR_NV
      to the target vCPU.  Using POSTED_INTR_NV avoids unexpected interrupts
      if a concurrent vmexit happens and L1's vector is different from L0's.
      The IPI triggers posted interrupt processing in the target physical CPU.
      
      In case the target vCPU was not in guest mode, complete the posted
      interrupt delivery on the next entry to L2.
      Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      705699a1
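
      The delivery decision described above, in pseudocode-style C.  The
      helpers and fields are invented names for the sketch, not the KVM/VMX
      API: send the notification vector if the target is currently running L2,
      otherwise remember to complete the delivery on the next L2 entry.

        #include <stdbool.h>
        #include <stdio.h>

        struct toy_vcpu {
            bool in_guest_mode;        /* currently running L2 (non-root) */
            bool nested_pi_pending;    /* complete delivery on next L2 entry */
        };

        static void toy_send_notification_ipi(int pcpu)
        {
            /* stand-in for sending POSTED_INTR_NV to the physical CPU */
            printf("IPI POSTED_INTR_NV -> pcpu %d\n", pcpu);
        }

        static void toy_deliver_l1_posted_interrupt(struct toy_vcpu *v, int pcpu)
        {
            if (v->in_guest_mode)
                toy_send_notification_ipi(pcpu);   /* hardware handles the rest */
            else
                v->nested_pi_pending = true;       /* finish at next L2 entry */
        }

        int main(void)
        {
            struct toy_vcpu v = { .in_guest_mode = true };
            toy_deliver_l1_posted_interrupt(&v, 3);
            v.in_guest_mode = false;
            toy_deliver_l1_posted_interrupt(&v, 3);
            printf("pending for next L2 entry: %d\n", v.nested_pi_pending);
            return 0;
        }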
  15. 30 Jan, 2015 (1 commit)
  16. 09 Jan, 2015 (1 commit)
    • KVM: x86: add option to advance tscdeadline hrtimer expiration · d0659d94
      Authored by Marcelo Tosatti
      For the hrtimer which emulates the tscdeadline timer in the guest,
      add an option to advance expiration, and busy spin on VM-entry waiting
      for the actual expiration time to elapse.
      
      This allows achieving low latencies in cyclictest (or any scenario
      which requires strict timing regarding timer expiration).
      
      This reduces average cyclictest latency from 12us to 8us on a Core i5
      desktop.
      
      Note: this option requires tuning to find the appropriate value
      for a particular hardware/guest combination. One method is to measure the
      average delay between apic_timer_fn and VM-entry.
      Another method is to start with 1000ns, and increase the value
      in say 500ns increments until avg cyclictest numbers stop decreasing.
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d0659d94
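
      The "advance and busy-wait" idea can be sketched in userspace as below.
      This is only an illustration: the real option is expressed in
      nanoseconds and lives in the hrtimer path, while the sketch uses TSC
      cycles and invented names.

        #include <stdint.h>
        #include <x86intrin.h>

        static uint64_t timer_advance_cycles = 3000;   /* tuned per host */

        static uint64_t toy_early_expiration(uint64_t deadline_tsc)
        {
            /* arm the hrtimer to fire a bit before the real deadline */
            return deadline_tsc - timer_advance_cycles;
        }

        static void toy_spin_until(uint64_t deadline_tsc)
        {
            /* busy spin the small remaining window just before VM entry */
            while (__rdtsc() < deadline_tsc)
                _mm_pause();
        }

        int main(void)
        {
            uint64_t deadline = __rdtsc() + 100000;
            (void)toy_early_expiration(deadline);  /* would arm the hrtimer */
            toy_spin_until(deadline);              /* wait out the remainder */
            return 0;
        }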
  17. 04 Dec, 2014 (1 commit)
    • KVM: x86: allow 256 logical x2APICs again · 45c3094a
      Authored by Radim Krčmář
      While fixing an x2apic bug,
       17d68b76 KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
      we've made only one cluster available.  This means that the number of
      logically addressable x2APICs was reduced to 16, and VCPUs kept
      overwriting themselves in that region, so even the first cluster wasn't
      set up correctly.
      
      This patch extends x2APIC support back to the logical_map's limit, and
      keeps the CVE fixed as messages for non-present APICs are dropped.
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      45c3094a
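
      As a quick sanity check of the numbers above (assuming the optimized
      logical map holds 16 clusters of 16 entries each):

        #include <stdio.h>

        int main(void)
        {
            const int bits_per_cluster = 16;   /* one bit per APIC in a cluster */
            const int clusters = 16;           /* assumed size of the logical map */
            printf("single cluster: %d logical x2APICs\n", bits_per_cluster);
            printf("full map:       %d logical x2APICs\n",
                   bits_per_cluster * clusters);
            return 0;
        }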
  18. 03 Nov, 2014 (4 commits)
    • KVM: x86: optimize some accesses to LVTT and SPIV · f30ebc31
      Authored by Radim Krčmář
      We mirror a subset of these registers in separate variables.
      Using them directly should be faster.
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f30ebc31
    • KVM: x86: detect LVTT changes under APICv · a323b409
      Authored by Radim Krčmář
      APIC-write VM exits are "trap-like": they save CS:RIP values for the
      instruction after the write, and more importantly, the handler will
      already see the new value in the virtual-APIC page.  This means that
      apic_reg_write cannot use kvm_apic_get_reg to omit timer cancelation
      when mode changes.
      
      timer_mode_mask shouldn't be changing as it depends on cpuid.
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a323b409
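
      The pitfall described above boils down to comparing against a register
      that has already been updated.  A sketch (with invented names, not the
      KVM code): the handler must receive the old LVTT value explicitly
      instead of re-reading the virtual-APIC page.

        #include <stdint.h>
        #include <stdio.h>

        static void toy_update_timer_mode(uint32_t old_lvtt, uint32_t new_lvtt,
                                          uint32_t timer_mode_mask)
        {
            /* re-reading the register here would compare new against new,
             * so the mode change would never be noticed */
            if ((old_lvtt ^ new_lvtt) & timer_mode_mask)
                printf("timer mode changed: cancel the pending timer\n");
        }

        int main(void)
        {
            const uint32_t timer_mode_mask = 0x3u << 17;  /* LVTT mode bits */
            uint32_t old_lvtt = 0x000000fe;               /* one-shot, vector 0xfe */
            uint32_t new_lvtt = old_lvtt | (1u << 17);    /* guest set periodic */
            toy_update_timer_mode(old_lvtt, new_lvtt, timer_mode_mask);
            return 0;
        }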
    • KVM: x86: detect SPIV changes under APICv · e462755c
      Authored by Radim Krčmář
      APIC-write VM exits are "trap-like": they save CS:RIP values for the
      instruction after the write, and more importantly, the handler will
      already see the new value in the virtual-APIC page.
      
      This caused a bug if you used KVM_SET_IRQCHIP to set the SW-enabled bit
      in the SPIV register.  The chain of events is as follows:
      
      * When the irqchip is added to the destination VM, the apic_sw_disabled
      static key is incremented (1)
      
      * When the KVM_SET_IRQCHIP ioctl is invoked, it is decremented (0)
      
      * When the guest disables the bit in the SPIV register, e.g. as part of
      shutdown, apic_set_spiv does not notice the change and the static key is
      _not_ incremented.
      
      * When the guest is destroyed, the static key is decremented (-1),
      resulting in this trace:
      
        WARNING: at kernel/jump_label.c:81 __static_key_slow_dec+0xa6/0xb0()
        jump label: negative count!
      
        [<ffffffff816bf898>] dump_stack+0x19/0x1b
        [<ffffffff8107c6f1>] warn_slowpath_common+0x61/0x80
        [<ffffffff8107c76c>] warn_slowpath_fmt+0x5c/0x80
        [<ffffffff811931e6>] __static_key_slow_dec+0xa6/0xb0
        [<ffffffff81193226>] static_key_slow_dec_deferred+0x16/0x20
        [<ffffffffa0637698>] kvm_free_lapic+0x88/0xa0 [kvm]
        [<ffffffffa061c63e>] kvm_arch_vcpu_uninit+0x2e/0xe0 [kvm]
        [<ffffffffa05ff301>] kvm_vcpu_uninit+0x21/0x40 [kvm]
        [<ffffffffa067cec7>] vmx_free_vcpu+0x47/0x70 [kvm_intel]
        [<ffffffffa061bc50>] kvm_arch_vcpu_free+0x50/0x60 [kvm]
        [<ffffffffa061ca22>] kvm_arch_destroy_vm+0x102/0x260 [kvm]
        [<ffffffff810b68fd>] ? synchronize_srcu+0x1d/0x20
        [<ffffffffa06030d1>] kvm_put_kvm+0xe1/0x1c0 [kvm]
        [<ffffffffa06036f8>] kvm_vcpu_release+0x18/0x20 [kvm]
        [<ffffffff81215c62>] __fput+0x102/0x310
        [<ffffffff81215f4e>] ____fput+0xe/0x10
        [<ffffffff810ab664>] task_work_run+0xb4/0xe0
        [<ffffffff81083944>] do_exit+0x304/0xc60
        [<ffffffff816c8dfc>] ? _raw_spin_unlock_irq+0x2c/0x50
        [<ffffffff810fd22d>] ?  trace_hardirqs_on_caller+0xfd/0x1c0
        [<ffffffff8108432c>] do_group_exit+0x4c/0xc0
        [<ffffffff810843b4>] SyS_exit_group+0x14/0x20
        [<ffffffff816d33a9>] system_call_fastpath+0x16/0x1b
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e462755c
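
      The warning above is an unbalanced reference count.  The walk-through
      below mirrors the event chain with a plain counter and shows the fixed
      behaviour, where the software-enable bit is compared against a cached
      value; the names are illustrative, not the kernel static-key API.

        #include <assert.h>
        #include <stdio.h>

        static int apic_sw_disabled_count;     /* stands in for the static key */

        static void toy_set_spiv(int *cached_spiv, int new_spiv)
        {
            const int sw_enable_bit = 1 << 8;  /* APIC software-enable bit */
            /* only touch the count when the bit actually changes */
            if ((*cached_spiv ^ new_spiv) & sw_enable_bit) {
                if (new_spiv & sw_enable_bit)
                    apic_sw_disabled_count--;  /* APIC became SW-enabled */
                else
                    apic_sw_disabled_count++;  /* APIC became SW-disabled */
            }
            *cached_spiv = new_spiv;
        }

        int main(void)
        {
            int spiv = 0;                      /* created SW-disabled */
            apic_sw_disabled_count++;          /* irqchip added       -> 1 */
            toy_set_spiv(&spiv, 1 << 8);       /* KVM_SET_IRQCHIP     -> 0 */
            toy_set_spiv(&spiv, 0);            /* guest disables bit  -> 1 */
            apic_sw_disabled_count--;          /* guest destroyed     -> 0 */
            assert(apic_sw_disabled_count == 0);
            printf("balanced: %d\n", apic_sw_disabled_count);
            return 0;
        }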
    • KVM: x86: some apic broadcast modes do not work · 394457a9
      Authored by Nadav Amit
      KVM does not deliver x2APIC broadcast messages with physical mode.  Intel SDM
      (10.12.9 ICR Operation in x2APIC Mode) states: "A destination ID value of
      FFFF_FFFFH is used for broadcast of interrupts in both logical destination and
      physical destination modes."
      
      In addition, the local-apic enables cluster mode broadcast. As Intel SDM
      10.6.2.2 says: "Broadcast to all local APICs is achieved by setting all
      destination bits to one." This patch enables cluster mode broadcast.
      
      The fix tries to handle broadcast in the different modes through unified
      code.
      
      One rare case occurs when the source of the IPI has its APIC disabled.
      In that case, the source can still issue IPIs, but since it is not
      obliged to have the same LAPIC mode as the enabled APICs, we cannot rely
      on it.  Since this is a rare case, it is left unoptimized and handled on
      the slow path.
      Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      [As per Radim's review, use unsigned int for X2APIC_BROADCAST, return bool from
       kvm_apic_broadcast. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      394457a9
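
      A unified broadcast check along the lines described above might look
      like the sketch below (constants and names are illustrative, not the
      KVM definitions): 0xFFFFFFFF in x2APIC mode, an all-ones 8-bit
      destination in xAPIC mode, which also covers the cluster-mode
      "all clusters, all members" case.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define TOY_X2APIC_BROADCAST  0xffffffffu
        #define TOY_XAPIC_BROADCAST   0xffu

        static bool toy_apic_broadcast(uint32_t dest_id, bool x2apic_format)
        {
            if (x2apic_format)
                return dest_id == TOY_X2APIC_BROADCAST;
            return (dest_id & 0xff) == TOY_XAPIC_BROADCAST;
        }

        int main(void)
        {
            printf("x2APIC 0xffffffff -> %d\n",
                   toy_apic_broadcast(0xffffffffu, true));
            printf("xAPIC  0xff       -> %d\n",
                   toy_apic_broadcast(0xffu, false));
            return 0;
        }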
  19. 27 Jan, 2014 (1 commit)
  20. 13 Dec, 2013 (1 commit)
    • KVM: x86: Convert vapic synchronization to _cached functions (CVE-2013-6368) · fda4e2e8
      Authored by Andy Honig
      In kvm_lapic_sync_from_vapic and kvm_lapic_sync_to_vapic there is the
      potential to corrupt kernel memory if userspace provides an address that
      is at the end of a page.  This patch converts those functions to use
      kvm_write_guest_cached and kvm_read_guest_cached.  It also checks the
      vapic_address specified by userspace during ioctl processing and returns
      an error to userspace if the address is not a valid GPA.
      
      This is generally not guest triggerable, because the required write is
      done by firmware that runs before the guest.  Also, it only affects AMD
      processors and oldish Intel that do not have the FlexPriority feature
      (unless you disable FlexPriority, of course; then newer processors are
      also affected).
      
      Fixes: b93463aa ('KVM: Accelerated apic support')
      Reported-by: Andrew Honig <ahonig@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Andrew Honig <ahonig@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fda4e2e8
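
      The hazard behind this fix is a fixed-size access that starts near the
      end of a page and spills into the next one.  The check below only
      illustrates that condition; it is not the actual fix, which switches to
      the cached guest accessors and validates the GPA at ioctl time.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define TOY_PAGE_SIZE   4096u
        #define TOY_ACCESS_LEN  4u      /* size of the structure being accessed */

        static bool toy_access_stays_in_page(uint64_t gpa, unsigned int len)
        {
            return (gpa % TOY_PAGE_SIZE) + len <= TOY_PAGE_SIZE;
        }

        int main(void)
        {
            printf("gpa 0x1000 ok:  %d\n",
                   toy_access_stays_in_page(0x1000, TOY_ACCESS_LEN));
            printf("gpa 0x1ffe bad: %d\n",
                   toy_access_stays_in_page(0x1ffe, TOY_ACCESS_LEN));
            return 0;
        }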
  21. 27 Jun, 2013 (1 commit)
  22. 14 May, 2013 (1 commit)
  23. 17 Apr, 2013 (2 commits)
  24. 16 Apr, 2013 (2 commits)
  25. 07 Apr, 2013 (1 commit)
  26. 13 Mar, 2013 (1 commit)
    • KVM: x86: Rework INIT and SIPI handling · 66450a21
      Authored by Jan Kiszka
      A VCPU sending INIT or SIPI to some other VCPU races for setting the
      remote VCPU's mp_state.  If we were unlucky, KVM_MP_STATE_INIT_RECEIVED
      was overwritten by kvm_emulate_halt and thus got lost.
      
      This introduces APIC events for those two signals, keeping them in
      kvm_apic until kvm_apic_accept_events is run over the target vcpu
      context. kvm_apic_has_events reports to kvm_arch_vcpu_runnable if there
      are pending events, thus if vcpu blocking should end.
      
      The patch comes with the side effect of effectively obsoleting
      KVM_MP_STATE_SIPI_RECEIVED. We still accept it from user space, but
      immediately translate it to KVM_MP_STATE_INIT_RECEIVED + KVM_APIC_SIPI.
      The vcpu itself will no longer enter the KVM_MP_STATE_SIPI_RECEIVED
      state. That also means we no longer exit to user space after receiving a
      SIPI event.
      
      Furthermore, we already reset the VCPU on INIT, only fixing up the code
      segment later on when SIPI arrives.  Moreover, we fix INIT handling for
      the BSP: it never enters wait-for-SIPI but directly starts over on INIT.
      Tested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      66450a21
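
      A toy model of the pending-event idea: INIT and SIPI are recorded as
      bits on the target's apic and consumed in that vcpu's own context.  Bit
      names and helpers below are invented for the sketch, not the KVM
      implementation.

        #include <stdbool.h>
        #include <stdio.h>

        #define TOY_APIC_INIT  (1u << 0)
        #define TOY_APIC_SIPI  (1u << 1)

        struct toy_apic {
            unsigned int pending_events;
            unsigned int sipi_vector;
        };

        static bool toy_apic_has_events(const struct toy_apic *a)
        {
            /* consulted by the "is this vcpu runnable?" check */
            return a->pending_events != 0;
        }

        static void toy_apic_accept_events(struct toy_apic *a)
        {
            /* runs in the target vcpu's own context, so no mp_state races */
            if (a->pending_events & TOY_APIC_INIT) {
                a->pending_events &= ~TOY_APIC_INIT;
                printf("reset vcpu, AP enters wait-for-SIPI\n");
            }
            if (a->pending_events & TOY_APIC_SIPI) {
                a->pending_events &= ~TOY_APIC_SIPI;
                printf("start AP at vector 0x%x\n", a->sipi_vector);
            }
        }

        int main(void)
        {
            struct toy_apic ap = { 0 };
            ap.pending_events |= TOY_APIC_INIT;   /* sender records INIT */
            ap.pending_events |= TOY_APIC_SIPI;   /* sender records SIPI */
            ap.sipi_vector = 0x9a;
            if (toy_apic_has_events(&ap))
                toy_apic_accept_events(&ap);
            return 0;
        }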
  27. 29 Jan, 2013 (1 commit)