    KVM: x86/pmu: Introduce pmc->is_paused to reduce the call time of perf interfaces · 44bda389
    Committed by Like Xu
    mainline inclusion
    from mainline-v5.14
    commit e79f49c3
    category: feature
    bugzilla: https://gitee.com/openeuler/kernel/issues/I5RD6Y
    CVE: NA
    
    -------------
    
    Based on our observations, after any vm-exit associated with vPMU, at
    least two perf interfaces are called for guest counter emulation, such
    as perf_event_{pause, read_value, period}(), and each one will
    {lock, unlock} the same perf_event_ctx. These calls become even more
    frequent when the guest uses counters in a multiplexed manner.
    
    Holding a lock once and completing the KVM request operations in the perf
    context would introduce a set of impractical new interfaces. So we can
    further optimize the vPMU implementation by avoiding repeated calls to
    these interfaces in the KVM context for at least one pattern:
    
    After we call perf_event_pause() once, the event will be disabled and its
    internal count will be reset to 0. So there is no need to pause it again
    or read its value. Once the event is paused, the event period will not be
    updated until the next time it's resumed or reprogrammed. And there is
    also no need to call perf_event_period() twice for a non-running counter,
    considering the perf_event for a running counter is never paused.
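
The pattern above can be illustrated with a small user-space model. This is only a sketch, not the code from the patch: struct mock_event, struct mock_pmc, the mock_event_*() helpers and the ctx_locks counter are hypothetical stand-ins for the perf_event, the kvm_pmc and the perf_event_ctx lock, chosen so that the effect of an is_paused flag on the number of simulated lock round trips can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a perf_event (not the kernel structure). */
struct mock_event {
	uint64_t count;          /* events accumulated since the last reset */
	bool enabled;
	unsigned long ctx_locks; /* simulated perf_event_ctx lock/unlock pairs */
};

/* Models a pause with reset: disable the event, return and clear its count. */
static uint64_t mock_event_pause(struct mock_event *ev)
{
	uint64_t count;

	ev->ctx_locks++;
	count = ev->count;
	ev->count = 0;
	ev->enabled = false;
	return count;
}

/* Models reading the current event value under the context lock. */
static uint64_t mock_event_read_value(struct mock_event *ev)
{
	ev->ctx_locks++;
	return ev->count;
}

/* Models re-enabling the event under the context lock. */
static void mock_event_enable(struct mock_event *ev)
{
	ev->ctx_locks++;
	ev->enabled = true;
}

/* Hypothetical stand-in for the guest counter state kept by KVM. */
struct mock_pmc {
	uint64_t counter;        /* value already folded in from the event */
	bool is_paused;          /* true once the event has been paused */
	struct mock_event *event;
};

/* Pause at most once: a second call returns without touching the event. */
static void pmc_pause(struct mock_pmc *pmc)
{
	if (!pmc->event || pmc->is_paused)
		return;

	pmc->counter += mock_event_pause(pmc->event);
	pmc->is_paused = true;
	assert(pmc->event->count == 0); /* paused event holds nothing new */
}

/* A paused event was reset to 0, so the cached counter is already complete. */
static uint64_t pmc_read(struct mock_pmc *pmc)
{
	uint64_t counter = pmc->counter;

	if (pmc->event && !pmc->is_paused)
		counter += mock_event_read_value(pmc->event);
	return counter;
}

/* Resuming clears the flag so later reads consult the event again. */
static void pmc_resume(struct mock_pmc *pmc)
{
	if (!pmc->event)
		return;

	mock_event_enable(pmc->event);
	pmc->is_paused = false;
}
```

In this model, calling pmc_pause() twice followed by pmc_read() costs one simulated lock round trip instead of three, which is the kind of saving the is_paused flag is meant to provide.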
    
    Based on this implementation, for the following common usage of
    sampling 4 events using perf on a 4u8g guest:
    
      echo 0 > /proc/sys/kernel/watchdog
      echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
      echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate
      echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
      for i in `seq 1 1 10`
      do
      taskset -c 0 perf record \
      -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \
      /root/br_instr a
      done
    
    the average latency of the guest NMI handler is reduced from
    37646.7 ns to 32929.3 ns (~1.14x speed up) on the Intel ICX server.
    Also, in addition to collecting more samples, no loss of sampling
    accuracy was observed compared to before the optimization.

    Signed-off-by: Like Xu <likexu@tencent.com>
    Message-Id: <20210728120705.6855-1-likexu@tencent.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Acked-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: yezengruan <yezengruan@huawei.com>
    Reviewed-by: Keqian Zhu <zhukeqian1@huawei.com>
    Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
pmu.c 14.1 KB