1. 15 Nov, 2019 (16 commits)
    • KVM: x86/vPMU: Add lazy mechanism to release perf_event per vPMC · b35e5548
      Committed by Like Xu
      Currently, a host perf_event is created to emulate each vPMC's
      functionality. It is unpredictable whether a disabled perf_event will be
      reused. If disabled perf_events are not reused for a considerable period
      of time, these obsolete perf_events increase host context-switch overhead
      that could have been avoided.
      
      If the guest doesn't WRMSR any of the vPMC's MSRs during an entire vCPU
      sched time slice, and the vPMC's independent enable bit isn't set, we can
      predict that the guest has finished using this vPMC; request KVM_REQ_PMU
      in kvm_arch_sched_in() and release those perf_events in the first call of
      kvm_pmu_handle_event() after the vCPU is scheduled in.
      
      This lazy mechanism delays the event release to the beginning of the next
      scheduled time slice if the vPMC's MSRs weren't changed during this time
      slice. If the guest comes back to use this vPMC in the next time slice, a
      new perf event is re-created via perf_event_create_kernel_counter() as usual.
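      The release decision can be sketched with a minimal, hypothetical model;
      the struct fields and function names below are illustrative stand-ins, not
      the actual KVM internals:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the lazy-release bookkeeping: a vPMC whose MSRs
 * were not touched during the last time slice and whose enable bit is
 * clear is assumed idle, so its host perf_event is released on the
 * next sched-in. All names here are hypothetical. */
struct vpmc {
    bool has_event; /* a host perf_event is currently attached */
    bool enabled;   /* the vPMC's independent enable bit */
    bool touched;   /* its MSRs were written this time slice */
};

/* Models the first pmu-event handling after the vCPU is scheduled in
 * (the kvm_pmu_handle_event() call). Returns true if a stale
 * perf_event was released. */
bool lazy_release_on_sched_in(struct vpmc *pmc)
{
    bool released = false;

    if (pmc->has_event && !pmc->touched && !pmc->enabled) {
        pmc->has_event = false; /* release the obsolete perf_event */
        released = true;
    }
    pmc->touched = false;       /* start tracking the new time slice */
    return released;
}
```

      An idle counter loses its event on sched-in, while one that is enabled or
      was recently written keeps it, matching the prediction described above.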
      Suggested-by: Wei Wang <wei.w.wang@intel.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b35e5548
    • KVM: x86/vPMU: Reuse perf_event to avoid unnecessary pmc_reprogram_counter · a6da0d77
      Committed by Like Xu
      The perf_event_create_kernel_counter() call in pmc_reprogram_counter() is
      a heavyweight, high-frequency operation, especially when the host disables
      the watchdog (maximum 21000000 ns), which leads to unacceptable latency in
      the guest NMI handler and limits the use of vPMUs in the guest.
      
      When a vPMC is fully enabled, the legacy reprogram_*_counter() stops and
      releases its existing perf_event (if any) every time, even though in most
      cases an almost identical perf_event would be created and configured again.
      
      For each vPMC, if the requested config ('u64 eventsel' for gp and 'u8 ctrl'
      for fixed) is the same as its current config AND a new sample period based
      on pmc->counter is accepted by the host perf interface, the current event
      can be reused safely, behaving just like a newly created one. Otherwise,
      release the stale perf_event and reprogram a new one as usual.
      
      Calling pmc_pause_counter() (disable, read and reset the event) and
      pmc_resume_counter() (recalibrate the period and re-enable the event) as
      the guest expects is lightweight compared to releasing and re-creating the
      event unconditionally. Rather than comparing the filterable event->attr
      or hw.config, a new 'u64 current_config' field is added to save the last
      originally programmed config for each vPMC.
      
      Based on this implementation, the number of calls to pmc_reprogram_counter
      is reduced by ~82.5% for a gp sampling event and ~99.9% for a fixed event.
      In multiplexing perf sampling mode, the average latency of the guest NMI
      handler is reduced from 104923 ns to 48393 ns (~2.16x speedup). If the host
      disables the watchdog, the minimum latency of the guest NMI handler is sped
      up by ~3413x (from 20407603 ns to 5979 ns), and by ~786x on average.
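      The reuse check can be sketched as a small, hypothetical model; the names
      below (pmc_model, program_counter) are illustrative, not the real KVM
      symbols, and period acceptance is assumed to always succeed:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model: a counter is fully reprogrammed (the heavyweight
 * path) only when the requested config differs from the saved
 * current_config; otherwise the existing event is paused and resumed. */
struct pmc_model {
    unsigned long long current_config; /* last programmed config */
    bool has_event;                    /* a perf_event is attached */
    int reprogram_calls;               /* heavyweight create calls */
};

/* Returns true if the existing event was reused (pause + resume),
 * false if a full release-and-create was needed. */
bool program_counter(struct pmc_model *pmc, unsigned long long config)
{
    if (pmc->has_event && pmc->current_config == config) {
        /* models pmc_pause_counter(): disable, read, reset, then
         * pmc_resume_counter(): recalibrate period, re-enable */
        return true;
    }
    /* release any stale event and create a new one */
    pmc->has_event = true;
    pmc->current_config = config;
    pmc->reprogram_calls++;
    return false;
}
```

      Repeated programming with an unchanged config takes the cheap reuse path,
      which is where the ~82.5%/~99.9% reduction in reprogram calls comes from.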
      Suggested-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a6da0d77
    • KVM: x86/vPMU: Introduce a new kvm_pmu_ops->msr_idx_to_pmc callback · c900c156
      Committed by Like Xu
      Introduce a new callback, msr_idx_to_pmc, that returns a struct kvm_pmc *,
      and change kvm_pmu_is_valid_msr to return ".msr_idx_to_pmc(vcpu, msr) ||
      .is_valid_msr(vcpu, msr)"; AMD simply returns false from .is_valid_msr.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c900c156
    • KVM: x86/vPMU: Rename pmu_ops callbacks from msr_idx to rdpmc_ecx · 98ff80f5
      Committed by Like Xu
      The legacy pmu_ops->msr_idx_to_pmc is only called in kvm_pmu_rdpmc, so
      this function actually receives the contents of ECX before RDPMC and
      translates it to a kvm_pmc. Clarify its semantics by renaming the
      existing msr_idx_to_pmc to rdpmc_ecx_to_pmc, and is_valid_msr_idx to
      is_valid_rdpmc_ecx; likewise for the wrapper kvm_pmu_is_valid_msr_idx.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      98ff80f5
    • KVM: nVMX: Update vmcs01 TPR_THRESHOLD if L2 changed L1 TPR · 02d496cf
      Committed by Liran Alon
      When L1 doesn't use TPR-Shadow to run L2, L0 configures vmcs02 without
      TPR-Shadow and installs intercepts on CR8 access (load and store).
      
      If L1 does not intercept L2 CR8 accesses, L0's intercepts on those accesses
      emulate load/store on L1's LAPIC TPR. If, in this case, L2 lowers the
      TPR such that there is now an injectable interrupt to L1,
      apic_update_ppr() requests KVM_REQ_EVENT, which triggers a call
      to update_cr8_intercept() to update TPR-Threshold to the highest pending
      IRR priority.
      
      However, this update to TPR-Threshold is done while the active vmcs is
      vmcs02 instead of vmcs01. Thus, when L0 later emulates an exit from L2 to
      L1, L1 still runs with a high TPR-Threshold. This results in every VMEntry
      to L1 immediately exiting on TPR_BELOW_THRESHOLD, and it continues to do so
      indefinitely until some condition causes KVM_REQ_EVENT to be set.
      (Note that the TPR_BELOW_THRESHOLD exit handler does not set KVM_REQ_EVENT
      until apic_update_ppr() notices a new injectable interrupt for PPR.)
      
      To fix this issue, change update_cr8_intercept() so that if L2 lowers
      L1's TPR in a way that requires lowering L1's TPR-Threshold, the update to
      TPR-Threshold is saved and applied to vmcs01 when L0 emulates an exit from
      L2 to L1.
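      The save-and-apply flow can be sketched with a hypothetical model; the
      struct and function names below are illustrative, not the actual vmx
      state fields:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the fix: while vmcs02 is active (L2 running),
 * a lowered TPR-Threshold for L1 is only recorded, and it is flushed
 * to vmcs01 when the exit from L2 to L1 is emulated. */
struct vcpu_model {
    bool in_guest_mode;            /* vmcs02 is the active vmcs */
    int vmcs01_tpr_threshold;      /* TPR_THRESHOLD field of vmcs01 */
    int pending_l1_tpr_threshold;  /* deferred value for vmcs01 */
    bool l1_tpr_threshold_dirty;   /* a deferred update is pending */
};

void update_tpr_threshold(struct vcpu_model *v, int threshold)
{
    if (v->in_guest_mode) {
        /* don't touch the active vmcs02; remember the value */
        v->pending_l1_tpr_threshold = threshold;
        v->l1_tpr_threshold_dirty = true;
    } else {
        v->vmcs01_tpr_threshold = threshold;
    }
}

void emulate_exit_to_l1(struct vcpu_model *v)
{
    v->in_guest_mode = false;
    if (v->l1_tpr_threshold_dirty) {
        v->vmcs01_tpr_threshold = v->pending_l1_tpr_threshold;
        v->l1_tpr_threshold_dirty = false;
    }
}
```

      With the deferred update applied at the emulated exit, L1 no longer
      resumes with the stale high TPR-Threshold that caused the exit storm.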
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      02d496cf
    • KVM: VMX: Refactor update_cr8_intercept() · 132f4f7e
      Committed by Liran Alon
      No functional changes.
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      132f4f7e
    • KVM: SVM: Remove check if APICv enabled in SVM update_cr8_intercept() handler · 49d654d8
      Committed by Liran Alon
      This check is unnecessary, as the x86 update_cr8_intercept(), which calls
      this VMX/SVM-specific callback, already performs it.
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      49d654d8
    • KVM: APIC: add helper func to remove duplicate code in kvm_pv_send_ipi · 1a686237
      Committed by Miaohe Lin
      There is some duplicated code in kvm_pv_send_ipi when dealing with the IPI
      bitmap. Add a helper function to remove it, eliminate the odd 'out' label,
      and get rid of unnecessary kvm_lapic_irq field initialization.
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1a686237
    • KVM: X86: avoid unused setup_syscalls_segments call when SYSCALL check failed · 5b4ce93a
      Committed by Miaohe Lin
      When the SYSCALL/SYSENTER ability check fails, cs and ss are initialized
      but never used. Delay initializing cs and ss until the SYSCALL/SYSENTER
      ability check has passed.
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5b4ce93a
    • KVM: VMX: Consume pending LAPIC INIT event when exit on INIT_SIGNAL · e64a8508
      Committed by Liran Alon
      Intel SDM section 25.2 OTHER CAUSES OF VM EXITS specifies the following
      on INIT signals: "Such exits do not modify register state or clear pending
      events as they would outside of VMX operation."
      
      When commit 4b9852f4 ("KVM: x86: Fix INIT signal handling in various CPU states")
      was applied, I interpreted the above Intel SDM statement to mean that an
      INIT_SIGNAL exit doesn't consume the pending LAPIC INIT event.
      
      However, when Nadav Amit ran the matching kvm-unit-test on a bare-metal
      machine, it turned out my interpretation was wrong, i.e. an INIT_SIGNAL
      exit does consume the pending LAPIC INIT event.
      (See: https://www.spinics.net/lists/kvm/msg196757.html)
      
      Therefore, fix KVM code to behave as observed on bare-metal.
      
      Fixes: 4b9852f4 ("KVM: x86: Fix INIT signal handling in various CPU states")
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e64a8508
    • KVM: x86: Prevent set vCPU into INIT/SIPI_RECEIVED state when INIT are latched · 27cbe7d6
      Committed by Liran Alon
      Commit 4b9852f4 ("KVM: x86: Fix INIT signal handling in various CPU states")
      fixed KVM to also latch a pending LAPIC INIT event when the vCPU is in VMX
      operation.
      
      However, the current KVM_SET_MP_STATE API allows userspace to put the vCPU
      into KVM_MP_STATE_SIPI_RECEIVED or KVM_MP_STATE_INIT_RECEIVED even when the
      vCPU is in VMX operation.
      
      Fix this by introducing a utility method that checks whether the vCPU state
      latches INIT signals, and use it in the KVM_SET_MP_STATE handler.
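      The shape of that check can be sketched as follows; the model below is
      hypothetical (simplified names, and latching is assumed to happen in VMX
      operation or SMM, per the related commits), not the actual KVM helper:

```c
#include <assert.h>
#include <stdbool.h>

enum mp_state {
    MP_STATE_RUNNABLE,
    MP_STATE_INIT_RECEIVED,
    MP_STATE_SIPI_RECEIVED,
};

struct vcpu_model {
    bool in_vmx_operation;
    bool in_smm;
    enum mp_state mp_state;
};

/* Models the utility check: in these states the vCPU latches INIT
 * signals instead of acting on them immediately. */
static bool vcpu_latches_init(const struct vcpu_model *v)
{
    return v->in_vmx_operation || v->in_smm;
}

/* Models the KVM_SET_MP_STATE handler: refuse INIT/SIPI_RECEIVED
 * states while INIT is latched. Returns 0 on success, -1 on reject. */
int set_mp_state(struct vcpu_model *v, enum mp_state state)
{
    if ((state == MP_STATE_INIT_RECEIVED ||
         state == MP_STATE_SIPI_RECEIVED) && vcpu_latches_init(v))
        return -1;
    v->mp_state = state;
    return 0;
}
```

      The same predicate guards both problematic states, so userspace can no
      longer force an inconsistent MP state onto a vCPU in VMX operation.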
      
      Fixes: 4b9852f4 ("KVM: x86: Fix INIT signal handling in various CPU states")
      Reported-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      27cbe7d6
    • KVM: x86: Evaluate latched_init in KVM_SET_VCPU_EVENTS when vCPU not in SMM · ff90afa7
      Committed by Liran Alon
      Commit 4b9852f4 ("KVM: x86: Fix INIT signal handling in various CPU states")
      fixed KVM to also latch a pending LAPIC INIT event when the vCPU is in VMX
      operation.
      
      However, the current KVM_SET_VCPU_EVENTS API defines this field as part of
      SMM state and only sets a pending LAPIC INIT event if the vCPU is specified
      to be in SMM mode (events->smi.smm is set).
      
      Change the KVM_SET_VCPU_EVENTS handler to set the pending LAPIC INIT event
      from the latched_init field regardless of whether the vCPU is in SMM mode.
      
      Fixes: 4b9852f4 ("KVM: x86: Fix INIT signal handling in various CPU states")
      Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ff90afa7
    • x86: retpolines: eliminate retpoline from msr event handlers · 74c504a6
      Committed by Andrea Arcangeli
      It's enough to check the MSR value and issue a direct call.
      
      After this commit is applied, here are the most common retpolines executed
      under a high-resolution timer workload in the guest on a VMX host:
      
      [..]
      @[
          trace_retpoline+1
          __trace_retpoline+30
          __x86_indirect_thunk_rax+33
          do_syscall_64+89
          entry_SYSCALL_64_after_hwframe+68
      ]: 267
      @[]: 2256
      @[
          trace_retpoline+1
          __trace_retpoline+30
          __x86_indirect_thunk_rax+33
          __kvm_wait_lapic_expire+284
          vmx_vcpu_run.part.97+1091
          vcpu_enter_guest+377
          kvm_arch_vcpu_ioctl_run+261
          kvm_vcpu_ioctl+559
          do_vfs_ioctl+164
          ksys_ioctl+96
          __x64_sys_ioctl+22
          do_syscall_64+89
          entry_SYSCALL_64_after_hwframe+68
      ]: 2390
      @[]: 33410
      
      @total: 315707
      
      Note that the highest hit above is in __delay, so it is probably not worth
      optimizing even though it is more frequent than 2k hits per second.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      74c504a6
    • KVM: retpolines: x86: eliminate retpoline from svm.c exit handlers · 3dcb2a3f
      Committed by Andrea Arcangeli
      It's enough to check the exit value and issue a direct call to avoid
      the retpoline for all the common vmexit reasons.
      
      After this commit is applied, here are the most common retpolines executed
      under a high-resolution timer workload in the guest on an SVM host:
      
      [..]
      @[
          trace_retpoline+1
          __trace_retpoline+30
          __x86_indirect_thunk_rax+33
          ktime_get_update_offsets_now+70
          hrtimer_interrupt+131
          smp_apic_timer_interrupt+106
          apic_timer_interrupt+15
          start_sw_timer+359
          restart_apic_timer+85
          kvm_set_msr_common+1497
          msr_interception+142
          vcpu_enter_guest+684
          kvm_arch_vcpu_ioctl_run+261
          kvm_vcpu_ioctl+559
          do_vfs_ioctl+164
          ksys_ioctl+96
          __x64_sys_ioctl+22
          do_syscall_64+89
          entry_SYSCALL_64_after_hwframe+68
      ]: 1940
      @[
          trace_retpoline+1
          __trace_retpoline+30
          __x86_indirect_thunk_r12+33
          force_qs_rnp+217
          rcu_gp_kthread+1270
          kthread+268
          ret_from_fork+34
      ]: 4644
      @[]: 25095
      @[
          trace_retpoline+1
          __trace_retpoline+30
          __x86_indirect_thunk_rax+33
          lapic_next_event+28
          clockevents_program_event+148
          hrtimer_start_range_ns+528
          start_sw_timer+356
          restart_apic_timer+85
          kvm_set_msr_common+1497
          msr_interception+142
          vcpu_enter_guest+684
          kvm_arch_vcpu_ioctl_run+261
          kvm_vcpu_ioctl+559
          do_vfs_ioctl+164
          ksys_ioctl+96
          __x64_sys_ioctl+22
          do_syscall_64+89
          entry_SYSCALL_64_after_hwframe+68
      ]: 41474
      @[
          trace_retpoline+1
          __trace_retpoline+30
          __x86_indirect_thunk_rax+33
          clockevents_program_event+148
          hrtimer_start_range_ns+528
          start_sw_timer+356
          restart_apic_timer+85
          kvm_set_msr_common+1497
          msr_interception+142
          vcpu_enter_guest+684
          kvm_arch_vcpu_ioctl_run+261
          kvm_vcpu_ioctl+559
          do_vfs_ioctl+164
          ksys_ioctl+96
          __x64_sys_ioctl+22
          do_syscall_64+89
          entry_SYSCALL_64_after_hwframe+68
      ]: 41474
      @[
          trace_retpoline+1
          __trace_retpoline+30
          __x86_indirect_thunk_rax+33
          ktime_get+58
          clockevents_program_event+84
          hrtimer_start_range_ns+528
          start_sw_timer+356
          restart_apic_timer+85
          kvm_set_msr_common+1497
          msr_interception+142
          vcpu_enter_guest+684
          kvm_arch_vcpu_ioctl_run+261
          kvm_vcpu_ioctl+559
          do_vfs_ioctl+164
          ksys_ioctl+96
          __x64_sys_ioctl+22
          do_syscall_64+89
          entry_SYSCALL_64_after_hwframe+68
      ]: 41887
      @[
          trace_retpoline+1
          __trace_retpoline+30
          __x86_indirect_thunk_rax+33
          lapic_next_event+28
          clockevents_program_event+148
          hrtimer_try_to_cancel+168
          hrtimer_cancel+21
          kvm_set_lapic_tscdeadline_msr+43
          kvm_set_msr_common+1497
          msr_interception+142
          vcpu_enter_guest+684
          kvm_arch_vcpu_ioctl_run+261
          kvm_vcpu_ioctl+559
          do_vfs_ioctl+164
          ksys_ioctl+96
          __x64_sys_ioctl+22
          do_syscall_64+89
          entry_SYSCALL_64_after_hwframe+68
      ]: 42723
      @[
          trace_retpoline+1
          __trace_retpoline+30
          __x86_indirect_thunk_rax+33
          clockevents_program_event+148
          hrtimer_try_to_cancel+168
          hrtimer_cancel+21
          kvm_set_lapic_tscdeadline_msr+43
          kvm_set_msr_common+1497
          msr_interception+142
          vcpu_enter_guest+684
          kvm_arch_vcpu_ioctl_run+261
          kvm_vcpu_ioctl+559
          do_vfs_ioctl+164
          ksys_ioctl+96
          __x64_sys_ioctl+22
          do_syscall_64+89
          entry_SYSCALL_64_after_hwframe+68
      ]: 42766
      @[
          trace_retpoline+1
          __trace_retpoline+30
          __x86_indirect_thunk_rax+33
          ktime_get+58
          clockevents_program_event+84
          hrtimer_try_to_cancel+168
          hrtimer_cancel+21
          kvm_set_lapic_tscdeadline_msr+43
          kvm_set_msr_common+1497
          msr_interception+142
          vcpu_enter_guest+684
          kvm_arch_vcpu_ioctl_run+261
          kvm_vcpu_ioctl+559
          do_vfs_ioctl+164
          ksys_ioctl+96
          __x64_sys_ioctl+22
          do_syscall_64+89
          entry_SYSCALL_64_after_hwframe+68
      ]: 42848
      @[
          trace_retpoline+1
          __trace_retpoline+30
          __x86_indirect_thunk_rax+33
          ktime_get+58
          start_sw_timer+279
          restart_apic_timer+85
          kvm_set_msr_common+1497
          msr_interception+142
          vcpu_enter_guest+684
          kvm_arch_vcpu_ioctl_run+261
          kvm_vcpu_ioctl+559
          do_vfs_ioctl+164
          ksys_ioctl+96
          __x64_sys_ioctl+22
          do_syscall_64+89
          entry_SYSCALL_64_after_hwframe+68
      ]: 499845
      
      @total: 1780243
      
      SVM has no TSC-based programmable preemption timer, so it invokes
      ktime_get() frequently.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3dcb2a3f
    • KVM: retpolines: x86: eliminate retpoline from vmx.c exit handlers · 4289d272
      Committed by Andrea Arcangeli
      It's enough to check the exit value and issue a direct call to avoid
      the retpoline for all the common vmexit reasons.
      
      Of course CONFIG_RETPOLINE already forbids gcc from using indirect jumps
      when compiling switch() statements; however, switch() would still only
      allow the compiler to bisect the case value. It's more efficient to
      prioritize the most frequent vmexits instead.
      
      Halts may be slow paths from the point of view of the guest, but not
      necessarily so from the point of view of the host, if the host runs at
      full CPU capacity and no host CPU is ever left idle.
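      The direct-call fast path described above can be sketched like this; the
      enum values, handler names, and table are hypothetical stand-ins for the
      vmx/svm exit-handler machinery:

```c
#include <assert.h>

/* Sketch of prioritized direct-call dispatch: the hottest exit
 * reasons are tested explicitly so the common cases take a direct
 * call, and only the long tail goes through the (retpolined)
 * indirect handler table. All names here are illustrative. */
enum exit_reason { EXIT_MSR_WRITE, EXIT_HLT, EXIT_OTHER, NR_EXITS };

typedef int (*exit_handler_t)(void);

static int handle_msr_write(void) { return 1; }
static int handle_hlt(void)       { return 2; }
static int handle_other(void)     { return 3; }

static exit_handler_t handlers[NR_EXITS] = {
    handle_msr_write, handle_hlt, handle_other,
};

int handle_exit(enum exit_reason reason)
{
    /* hottest vmexits first: direct calls, no retpoline */
    if (reason == EXIT_MSR_WRITE)
        return handle_msr_write();
    if (reason == EXIT_HLT)
        return handle_hlt();
    /* rare exits still take the indirect (retpolined) call */
    return handlers[reason]();
}
```

      Unlike a switch(), which lets the compiler bisect the case value, this
      ordering guarantees the most frequent reasons are checked first.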
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4289d272
    • KVM: x86: optimize more exit handlers in vmx.c · f399e60c
      Committed by Andrea Arcangeli
      Eliminate the wasteful call/ret in the non-RETPOLINE case and unnecessary
      fentry dynamic-tracing hook points.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f399e60c
  2. 11 Nov, 2019 (1 commit)
  3. 02 Nov, 2019 (1 commit)
    • KVM: x86: switch KVMCLOCK base to monotonic raw clock · 53fafdbb
      Committed by Marcelo Tosatti
      Commit 0bc48bea ("KVM: x86: update master clock before computing kvmclock_offset")
      switches the order of operations to avoid the conversion
      switches the order of operations to avoid the conversion
      
      TSC (without frequency correction) ->
      system_timestamp (with frequency correction),
      
      which might cause a time jump.
      
      However, it leaves any other masterclock update unsafe, which includes,
      at the moment:
      
              * HV_X64_MSR_REFERENCE_TSC MSR write.
              * TSC writes.
              * Host suspend/resume.
      
      Avoid the time-jump issue by using the frequency-uncorrected
      CLOCK_MONOTONIC_RAW clock.
      
      It is the responsibility of the guest's timekeeping software
      to track and correct a reference clock such as UTC.
      
      This fixes a forward time jump (which can result in
      failure to bring up a vCPU) during vCPU hotplug:
      
      Oct 11 14:48:33 storage kernel: CPU2 has been hot-added
      Oct 11 14:48:34 storage kernel: CPU3 has been hot-added
      Oct 11 14:49:22 storage kernel: smpboot: Booting Node 0 Processor 2 APIC 0x2          <-- time jump of almost 1 minute
      Oct 11 14:49:22 storage kernel: smpboot: do_boot_cpu failed(-1) to wakeup CPU#2
      Oct 11 14:49:23 storage kernel: smpboot: Booting Node 0 Processor 3 APIC 0x3
      Oct 11 14:49:23 storage kernel: kvm-clock: cpu 3, msr 0:7ff640c1, secondary cpu clock
      
      Which happens because:
      
                      /*
                       * Wait 10s total for a response from AP
                       */
                      boot_error = -1;
                      timeout = jiffies + 10*HZ;
                      while (time_before(jiffies, timeout)) {
                               ...
                      }
      Analyzed-by: Igor Mammedov <imammedo@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      53fafdbb
  4. 25 Oct, 2019 (1 commit)
  5. 22 Oct, 2019 (21 commits)