1. 18 6月, 2019 2 次提交
  2. 05 6月, 2019 3 次提交
    • W
      KVM: LAPIC: Optimize timer latency further · b6c4bc65
      Wanpeng Li 提交于
      Advance lapic timer tries to hidden the hypervisor overhead between the
      host emulated timer fires and the guest awares the timer is fired. However,
      it just hidden the time between apic_timer_fn/handle_preemption_timer ->
      wait_lapic_expire, instead of the real position of vmentry which is
      mentioned in the orignial commit d0659d94 ("KVM: x86: add option to
      advance tscdeadline hrtimer expiration"). There is 700+ cpu cycles between
      the end of wait_lapic_expire and before world switch on my haswell desktop.
      
      This patch tries to narrow the last gap(wait_lapic_expire -> world switch),
      it takes the real overhead time between apic_timer_fn/handle_preemption_timer
      and before world switch into consideration when adaptively tuning timer
      advancement. The patch can reduce 40% latency (~1600+ cycles to ~1000+ cycles
      on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
      busy waits.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b6c4bc65
    • W
      KVM: LAPIC: Delay trace_kvm_wait_lapic_expire tracepoint to after vmexit · ec0671d5
      Wanpeng Li 提交于
      wait_lapic_expire() call was moved above guest_enter_irqoff() because of
      its tracepoint, which violated the RCU extended quiescent state invoked
      by guest_enter_irqoff()[1][2]. This patch simply moves the tracepoint
      below guest_exit_irqoff() in vcpu_enter_guest(). Snapshot the delta before
      VM-Enter, but trace it after VM-Exit. This can help us to move
      wait_lapic_expire() just before vmentry in the later patch.
      
      [1] Commit 8b89fe1f ("kvm: x86: move tracepoints outside extended quiescent state")
      [2] https://patchwork.kernel.org/patch/7821111/
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      [Track whether wait_lapic_expire was called, and do not invoke the tracepoint
       if not. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ec0671d5
    • W
      KVM: LAPIC: Extract adaptive tune timer advancement logic · 84ea3aca
      Wanpeng Li 提交于
      Extract adaptive tune timer advancement logic to a single function.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      [Rename new function. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      84ea3aca
  3. 01 5月, 2019 5 次提交
  4. 19 4月, 2019 5 次提交
    • S
      KVM: lapic: Convert guest TSC to host time domain if necessary · b6aa57c6
      Sean Christopherson 提交于
      To minimize the latency of timer interrupts as observed by the guest,
      KVM adjusts the values it programs into the host timers to account for
      the host's overhead of programming and handling the timer event.  In
      the event that the adjustments are too aggressive, i.e. the timer fires
      earlier than the guest expects, KVM busy waits immediately prior to
      entering the guest.
      
      Currently, KVM manually converts the delay from nanoseconds to clock
      cycles.  But, the conversion is done in the guest's time domain, while
      the delay occurs in the host's time domain.  This is perfectly ok when
      the guest and host are using the same TSC ratio, but if the guest is
      using a different ratio then the delay may not be accurate and could
      wait too little or too long.
      
      When the guest is not using the host's ratio, convert the delay from
      guest clock cycles to host nanoseconds and use ndelay() instead of
      __delay() to provide more accurate timing.  Because converting to
      nanoseconds is relatively expensive, e.g. requires division and more
      multiplication ops, continue using __delay() directly when guest and
      host TSCs are running at the same ratio.
      
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b6aa57c6
    • S
      KVM: lapic: Allow user to disable adaptive tuning of timer advancement · c3941d9e
      Sean Christopherson 提交于
      The introduction of adaptive tuning of lapic timer advancement did not
      allow for the scenario where userspace would want to disable adaptive
      tuning but still employ timer advancement, e.g. for testing purposes or
      to handle a use case where adaptive tuning is unable to settle on a
      suitable time.  This is epecially pertinent now that KVM places a hard
      threshold on the maximum advancment time.
      
      Rework the timer semantics to accept signed values, with a value of '-1'
      being interpreted as "use adaptive tuning with KVM's internal default",
      and any other value being used as an explicit advancement time, e.g. a
      time of '0' effectively disables advancement.
      
      Note, this does not completely restore the original behavior of
      lapic_timer_advance_ns.  Prior to tracking the advancement per vCPU,
      which is necessary to support autotuning, userspace could adjust
      lapic_timer_advance_ns for *running* vCPU.  With per-vCPU tracking, the
      module params are snapshotted at vCPU creation, i.e. applying a new
      advancement effectively requires restarting a VM.
      
      Dynamically updating a running vCPU is possible, e.g. a helper could be
      added to retrieve the desired delay, choosing between the global module
      param and the per-VCPU value depending on whether or not auto-tuning is
      (globally) enabled, but introduces a great deal of complexity.  The
      wrapper itself is not complex, but understanding and documenting the
      effects of dynamically toggling auto-tuning and/or adjusting the timer
      advancement is nigh impossible since the behavior would be dependent on
      KVM's implementation as well as compiler optimizations.  In other words,
      providing stable behavior would require extremely careful consideration
      now and in the future.
      
      Given that the expected use of a manually-tuned timer advancement is to
      "tune once, run many", use the vastly simpler approach of recognizing
      changes to the module params only when creating a new vCPU.
      
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c3941d9e
    • S
      KVM: lapic: Track lapic timer advance per vCPU · 39497d76
      Sean Christopherson 提交于
      Automatically adjusting the globally-shared timer advancement could
      corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting
      the advancement value.  That could be partially fixed by using a local
      variable for the arithmetic, but it would still be susceptible to a
      race when setting timer_advance_adjust_done.
      
      And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the
      correct calibration for a given vCPU may not apply to all vCPUs.
      
      Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is
      effectively violated when finding a stable advancement takes an extended
      amount of timer.
      
      Opportunistically change the definition of lapic_timer_advance_ns to
      a u32 so that it matches the style of struct kvm_timer.  Explicitly
      pass the param to kvm_create_lapic() so that it doesn't have to be
      exposed to lapic.c, thus reducing the probability of unintentionally
      using the global value instead of the per-vCPU value.
      
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      39497d76
    • S
      KVM: lapic: Disable timer advancement if adaptive tuning goes haywire · 57bf67e7
      Sean Christopherson 提交于
      To minimize the latency of timer interrupts as observed by the guest,
      KVM adjusts the values it programs into the host timers to account for
      the host's overhead of programming and handling the timer event.  Now
      that the timer advancement is automatically tuned during runtime, it's
      effectively unbounded by default, e.g. if KVM is running as L1 the
      advancement can measure in hundreds of milliseconds.
      
      Disable timer advancement if adaptive tuning yields an advancement of
      more than 5000ns, as large advancements can break reasonable assumptions
      of the guest, e.g. that a timer configured to fire after 1ms won't
      arrive on the next instruction.  Although KVM busy waits to mitigate the
      case of a timer event arriving too early, complications can arise when
      shifting the interrupt too far, e.g. kvm-unit-test's vmx.interrupt test
      will fail when its "host" exits on interrupts as KVM may inject the INTR
      before the guest executes STI+HLT.   Arguably the unit test is "broken"
      in the sense that delaying a timer interrupt by 1ms doesn't technically
      guarantee the interrupt will arrive after STI+HLT, but it's a reasonable
      assumption that KVM should support.
      
      Furthermore, an unbounded advancement also effectively unbounds the time
      spent busy waiting, e.g. if the guest programs a timer with a very large
      delay.
      
      5000ns is a somewhat arbitrary threshold.  When running on bare metal,
      which is the intended use case, timer advancement is expected to be in
      the general vicinity of 1000ns.  5000ns is high enough that false
      positives are unlikely, while not being so high as to negatively affect
      the host's performance/stability.
      
      Note, a future patch will enable userspace to disable KVM's adaptive
      tuning, which will allow priveleged userspace will to specifying an
      advancement value in excess of this arbitrary threshold in order to
      satisfy an abnormal use case.
      
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      57bf67e7
    • L
      KVM: x86: Consider LAPIC TSC-Deadline timer expired if deadline too short · c09d65d9
      Liran Alon 提交于
      If guest sets MSR_IA32_TSCDEADLINE to value such that in host
      time-domain it's shorter than lapic_timer_advance_ns, we can
      reach a case that we call hrtimer_start() with expiration time set at
      the past.
      
      Because lapic_timer.timer is init with HRTIMER_MODE_ABS_PINNED, it
      is not allowed to run in softirq and therefore will never expire.
      
      To avoid such a scenario, verify that deadline expiration time is set on
      host time-domain further than (now + lapic_timer_advance_ns).
      
      A future patch can also consider adding a min_timer_deadline_ns module parameter,
      similar to min_timer_period_us to avoid races that amount of ns it takes
      to run logic could still call hrtimer_start() with expiration timer set
      at the past.
      Reviewed-by: NJoao Martins <joao.m.martins@oracle.com>
      Signed-off-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c09d65d9
  5. 16 4月, 2019 1 次提交
  6. 21 2月, 2019 1 次提交
    • B
      kvm: x86: Add memcg accounting to KVM allocations · 254272ce
      Ben Gardon 提交于
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      
      There remain a few allocations which should be charged to the VM's
      cgroup but are not. In x86, they include:
      	vcpu->arch.pio_data
      There allocations are unaccounted in this patch because they are mapped
      to userspace, and accounting them to a cgroup causes problems. This
      should be addressed in a future patch.
      Signed-off-by: NBen Gardon <bgardon@google.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      254272ce
  7. 26 1月, 2019 1 次提交
    • G
      KVM: x86: Mark expected switch fall-throughs · b2869f28
      Gustavo A. R. Silva 提交于
      In preparation to enabling -Wimplicit-fallthrough, mark switch
      cases where we are expecting to fall through.
      
      This patch fixes the following warnings:
      
      arch/x86/kvm/lapic.c:1037:27: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/lapic.c:1876:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/hyperv.c:1637:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/svm.c:4396:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/mmu.c:4372:36: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/x86.c:3835:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/x86.c:7938:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/vmx/vmx.c:2015:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/vmx/vmx.c:1773:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      
      Warning level 3 was used: -Wimplicit-fallthrough=3
      
      This patch is part of the ongoing efforts to enabling -Wimplicit-fallthrough.
      Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b2869f28
  8. 15 12月, 2018 1 次提交
  9. 27 11月, 2018 2 次提交
    • Y
      KVM: x86: fix empty-body warnings · 354cb410
      Yi Wang 提交于
      We get the following warnings about empty statements when building
      with 'W=1':
      
      arch/x86/kvm/lapic.c:632:53: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      arch/x86/kvm/lapic.c:1907:42: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      arch/x86/kvm/lapic.c:1936:65: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      arch/x86/kvm/lapic.c:1975:44: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      
      Rework the debug helper macro to get rid of these warnings.
      Signed-off-by: NYi Wang <wang.yi59@zte.com.cn>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      354cb410
    • W
      KVM: LAPIC: Fix pv ipis use-before-initialization · 38ab012f
      Wanpeng Li 提交于
      Reported by syzkaller:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
       PGD 800000040410c067 P4D 800000040410c067 PUD 40410d067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP PTI
       CPU: 3 PID: 2567 Comm: poc Tainted: G           OE     4.19.0-rc5 #16
       RIP: 0010:kvm_pv_send_ipi+0x94/0x350 [kvm]
       Call Trace:
        kvm_emulate_hypercall+0x3cc/0x700 [kvm]
        handle_vmcall+0xe/0x10 [kvm_intel]
        vmx_handle_exit+0xc1/0x11b0 [kvm_intel]
        vcpu_enter_guest+0x9fb/0x1910 [kvm]
        kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm]
        kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm]
        do_vfs_ioctl+0xa5/0x690
        ksys_ioctl+0x6d/0x80
        __x64_sys_ioctl+0x1a/0x20
        do_syscall_64+0x83/0x6e0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The reason is that the apic map has not yet been initialized, the testcase
      triggers pv_send_ipi interface by vmcall which results in kvm->arch.apic_map
      is dereferenced. This patch fixes it by checking whether or not apic map is
      NULL and bailing out immediately if that is the case.
      
      Fixes: 4180bf1b (KVM: X86: Implement "send IPI" hypercall)
      Reported-by: NWei Wu <ww9210@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wei Wu <ww9210@gmail.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      38ab012f
  10. 29 10月, 2018 1 次提交
  11. 17 10月, 2018 4 次提交
  12. 20 9月, 2018 1 次提交
    • V
      x86/kvm/lapic: always disable MMIO interface in x2APIC mode · d1766202
      Vitaly Kuznetsov 提交于
      When VMX is used with flexpriority disabled (because of no support or
      if disabled with module parameter) MMIO interface to lAPIC is still
      available in x2APIC mode while it shouldn't be (kvm-unit-tests):
      
      PASS: apic_disable: Local apic enabled in x2APIC mode
      PASS: apic_disable: CPUID.1H:EDX.APIC[bit 9] is set
      FAIL: apic_disable: *0xfee00030: 50014
      
      The issue appears because we basically do nothing while switching to
      x2APIC mode when APIC access page is not used. apic_mmio_{read,write}
      only check if lAPIC is disabled before proceeding to actual write.
      
      When APIC access is virtualized we correctly manipulate with VMX controls
      in vmx_set_virtual_apic_mode() and we don't get vmexits from memory writes
      in x2APIC mode so there's no issue.
      
      Disabling MMIO interface seems to be easy. The question is: what do we
      do with these reads and writes? If we add apic_x2apic_mode() check to
      apic_mmio_in_range() and return -EOPNOTSUPP these reads and writes will
      go to userspace. When lAPIC is in kernel, Qemu uses this interface to
      inject MSIs only (see kvm_apic_mem_write() in hw/i386/kvm/apic.c). This
      somehow works with disabled lAPIC but when we're in xAPIC mode we will
      get a real injected MSI from every write to lAPIC. Not good.
      
      The simplest solution seems to be to just ignore writes to the region
      and return ~0 for all reads when we're in x2APIC mode. This is what this
      patch does. However, this approach is inconsistent with what currently
      happens when flexpriority is enabled: we allocate APIC access page and
      create KVM memory region so in x2APIC modes all reads and writes go to
      this pre-allocated page which is, btw, the same for all vCPUs.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d1766202
  13. 08 9月, 2018 1 次提交
    • W
      KVM: LAPIC: Fix pv ipis out-of-bounds access · bdf7ffc8
      Wanpeng Li 提交于
      Dan Carpenter reported that the untrusted data returns from kvm_register_read()
      results in the following static checker warning:
        arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi()
        error: buffer underflow 'map->phys_map' 's32min-s32max'
      
      KVM guest can easily trigger this by executing the following assembly sequence
      in Ring0:
      
      mov $10, %rax
      mov $0xFFFFFFFF, %rbx
      mov $0xFFFFFFFF, %rdx
      mov $0, %rsi
      vmcall
      
      As this will cause KVM to execute the following code-path:
      vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() -> kvm_pv_send_ipi()
      which will reach out-of-bounds access.
      
      This patch fixes it by adding a check to kvm_pv_send_ipi() against map->max_apic_id,
      ignoring destinations that are not present and delivering the rest. We also check
      whether or not map->phys_map[min + i] is NULL since the max_apic_id is set to the
      max apic id, some phys_map maybe NULL when apic id is sparse, especially kvm
      unconditionally set max_apic_id to 255 to reserve enough space for any xAPIC ID.
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      [Add second "if (min > map->max_apic_id)" to complete the fix. -Radim]
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      bdf7ffc8
  14. 06 8月, 2018 1 次提交
    • W
      KVM: X86: Implement "send IPI" hypercall · 4180bf1b
      Wanpeng Li 提交于
      Using hypercall to send IPIs by one vmexit instead of one by one for
      xAPIC/x2APIC physical mode and one vmexit per-cluster for x2APIC cluster
      mode. Intel guest can enter x2apic cluster mode when interrupt remmaping
      is enabled in qemu, however, latest AMD EPYC still just supports xapic
      mode which can get great improvement by Exit-less IPIs. This patchset
      lets a guest send multicast IPIs, with at most 128 destinations per
      hypercall in 64-bit mode and 64 vCPUs per hypercall in 32-bit mode.
      
      Hardware: Xeon Skylake 2.5GHz, 2 sockets, 40 cores, 80 threads, the VM
      is 80 vCPUs, IPI microbenchmark(https://lkml.org/lkml/2017/12/19/141):
      
      x2apic cluster mode, vanilla
      
       Dry-run:                         0,            2392199 ns
       Self-IPI:                  6907514,           15027589 ns
       Normal IPI:              223910476,          251301666 ns
       Broadcast IPI:                   0,         9282161150 ns
       Broadcast lock:                  0,         8812934104 ns
      
      x2apic cluster mode, pv-ipi
      
       Dry-run:                         0,            2449341 ns
       Self-IPI:                  6720360,           15028732 ns
       Normal IPI:              228643307,          255708477 ns
       Broadcast IPI:                   0,         7572293590 ns  => 22% performance boost
       Broadcast lock:                  0,         8316124651 ns
      
      x2apic physical mode, vanilla
      
       Dry-run:                         0,            3135933 ns
       Self-IPI:                  8572670,           17901757 ns
       Normal IPI:              226444334,          255421709 ns
       Broadcast IPI:                   0,        19845070887 ns
       Broadcast lock:                  0,        19827383656 ns
      
      x2apic physical mode, pv-ipi
      
       Dry-run:                         0,            2446381 ns
       Self-IPI:                  6788217,           15021056 ns
       Normal IPI:              219454441,          249583458 ns
       Broadcast IPI:                   0,         7806540019 ns  => 154% performance boost
       Broadcast lock:                  0,         9143618799 ns
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4180bf1b
  15. 20 6月, 2018 1 次提交
  16. 24 5月, 2018 1 次提交
    • D
      x86/kvm: fix LAPIC timer drift when guest uses periodic mode · d8f2f498
      David Vrabel 提交于
      Since 4.10, commit 8003c9ae (KVM: LAPIC: add APIC Timer
      periodic/oneshot mode VMX preemption timer support), guests using
      periodic LAPIC timers (such as FreeBSD 8.4) would see their timers
      drift significantly over time.
      
      Differences in the underlying clocks and numerical errors means the
      periods of the two timers (hv and sw) are not the same. This
      difference will accumulate with every expiry resulting in a large
      error between the hv and sw timer.
      
      This means the sw timer may be running slow when compared to the hv
      timer. When the timer is switched from hv to sw, the now active sw
      timer will expire late. The guest VCPU is reentered and it switches to
      using the hv timer. This timer catches up, injecting multiple IRQs
      into the guest (of which the guest only sees one as it does not get to
      run until the hv timer has caught up) and thus the guest's timer rate
      is low (and becomes increasing slower over time as the sw timer lags
      further and further behind).
      
      I believe a similar problem would occur if the hv timer is the slower
      one, but I have not observed this.
      
      Fix this by synchronizing the deadlines for both timers to the same
      time source on every tick. This prevents the errors from accumulating.
      
      Fixes: 8003c9ae
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NDavid Vrabel <david.vrabel@nutanix.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      d8f2f498
  17. 15 5月, 2018 1 次提交
  18. 06 5月, 2018 1 次提交
    • A
      KVM: x86: remove APIC Timer periodic/oneshot spikes · ecf08dad
      Anthoine Bourgeois 提交于
      Since the commit "8003c9ae: add APIC Timer periodic/oneshot mode VMX
      preemption timer support", a Windows 10 guest has some erratic timer
      spikes.
      
      Here the results on a 150000 times 1ms timer without any load:
      	  Before 8003c9ae | After 8003c9ae
      Max           1834us          |  86000us
      Mean          1100us          |   1021us
      Deviation       59us          |    149us
      Here the results on a 150000 times 1ms timer with a cpu-z stress test:
      	  Before 8003c9ae | After 8003c9ae
      Max          32000us          | 140000us
      Mean          1006us          |   1997us
      Deviation      140us          |  11095us
      
      The root cause of the problem is starting hrtimer with an expiry time
      already in the past can take more than 20 milliseconds to trigger the
      timer function.  It can be solved by forward such past timers
      immediately, rather than submitting them to hrtimer_start().
      In case the timer is periodic, update the target expiration and call
      hrtimer_start with it.
      
      v2: Check if the tsc deadline is already expired. Thank you Mika.
      v3: Execute the past timers immediately rather than submitting them to
      hrtimer_start().
      v4: Rearm the periodic timer with advance_periodic_target_expiration() a
      simpler version of set_target_expiration(). Thank you Paolo.
      
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NAnthoine Bourgeois <anthoine.bourgeois@blade-group.com>
      8003c9ae ("KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer support")
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      ecf08dad
  19. 17 3月, 2018 1 次提交
  20. 02 3月, 2018 1 次提交
  21. 24 2月, 2018 1 次提交
    • P
      KVM: x86: move LAPIC initialization after VMCS creation · 0b2e9904
      Paolo Bonzini 提交于
      The initial reset of the local APIC is performed before the VMCS has been
      created, but it tries to do a vmwrite:
      
       vmwrite error: reg 810 value 4a00 (err 18944)
       CPU: 54 PID: 38652 Comm: qemu-kvm Tainted: G        W I      4.16.0-0.rc2.git0.1.fc28.x86_64 #1
       Hardware name: Intel Corporation S2600CW/S2600CW, BIOS SE5C610.86B.01.01.0003.090520141303 09/05/2014
       Call Trace:
        vmx_set_rvi [kvm_intel]
        vmx_hwapic_irr_update [kvm_intel]
        kvm_lapic_reset [kvm]
        kvm_create_lapic [kvm]
        kvm_arch_vcpu_init [kvm]
        kvm_vcpu_init [kvm]
        vmx_create_vcpu [kvm_intel]
        kvm_vm_ioctl [kvm]
      
      Move it later, after the VMCS has been created.
      
      Fixes: 4191db26 ("KVM: x86: Update APICv on APIC reset")
      Cc: stable@vger.kernel.org
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0b2e9904
  22. 16 1月, 2018 2 次提交
  23. 28 11月, 2017 2 次提交