1. 08 7月, 2020 8 次提交
  2. 02 7月, 2020 3 次提交
  3. 01 5月, 2020 1 次提交
    • C
      x86/perf: Add hardware performance events support for Zhaoxin CPU. · 3a4ac121
      CodyYao-oc 提交于
      Zhaoxin CPU has provided facilities for monitoring performance
      via PMU (Performance Monitor Unit), but the functionality is unused so far.
      Therefore, add support for zhaoxin pmu to make performance related
      hardware events available.
      
      The PMU is mostly an Intel Architectural PerfMon-v2 with a novel
      errata for the ZXC line. It supports the following events:
      
        -----------------------------------------------------------------------------------------------------------------------------------
        Event                      | Event  | Umask |          Description
      			     | Select |       |
        -----------------------------------------------------------------------------------------------------------------------------------
        cpu-cycles                 |  82h   |  00h  | unhalt core clock
        instructions               |  00h   |  00h  | number of instructions at retirement.
        cache-references           |  15h   |  05h  | number of fillq pushs at the current cycle.
        cache-misses               |  1ah   |  05h  | number of l2 miss pushed by fillq.
        branch-instructions        |  28h   |  00h  | counts the number of branch instructions retired.
        branch-misses              |  29h   |  00h  | mispredicted branch instructions at retirement.
        bus-cycles                 |  83h   |  00h  | unhalt bus clock
        stalled-cycles-frontend    |  01h   |  01h  | Increments each cycle the # of Uops issued by the RAT to RS.
        stalled-cycles-backend     |  0fh   |  04h  | RS0/1/2/3/45 empty
        L1-dcache-loads            |  68h   |  05h  | number of retire/commit load.
        L1-dcache-load-misses      |  4bh   |  05h  | retired load uops whose data source followed an L1 miss.
        L1-dcache-stores           |  69h   |  06h  | number of retire/commit Store,no LEA
        L1-dcache-store-misses     |  62h   |  05h  | cache lines in M state evicted out of L1D due to Snoop HitM or dirty line replacement.
        L1-icache-loads            |  00h   |  03h  | number of l1i cache access for valid normal fetch,including un-cacheable access.
        L1-icache-load-misses      |  01h   |  03h  | number of l1i cache miss for valid normal fetch,including un-cacheable miss.
        L1-icache-prefetches       |  0ah   |  03h  | number of prefetch.
        L1-icache-prefetch-misses  |  0bh   |  03h  | number of prefetch miss.
        dTLB-loads                 |  68h   |  05h  | number of retire/commit load
        dTLB-load-misses           |  2ch   |  05h  | number of load operations miss all level tlbs and cause a tablewalk.
        dTLB-stores                |  69h   |  06h  | number of retire/commit Store,no LEA
        dTLB-store-misses          |  30h   |  05h  | number of store operations miss all level tlbs and cause a tablewalk.
        dTLB-prefetches            |  64h   |  05h  | number of hardware pte prefetch requests dispatched out of the prefetch FIFO.
        dTLB-prefetch-misses       |  65h   |  05h  | number of hardware pte prefetch requests miss the l1d data cache.
        iTLB-load                  |  00h   |  00h  | actually counter instructions.
        iTLB-load-misses           |  34h   |  05h  | number of code operations miss all level tlbs and cause a tablewalk.
        -----------------------------------------------------------------------------------------------------------------------------------
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NCodyYao-oc <CodyYao-oc@zhaoxin.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1586747669-4827-1-git-send-email-CodyYao-oc@zhaoxin.com
      3a4ac121
  4. 17 1月, 2020 2 次提交
    • K
      perf/x86/amd: Add support for Large Increment per Cycle Events · 57388912
      Kim Phillips 提交于
      Description of hardware operation
      ---------------------------------
      
      The core AMD PMU has a 4-bit wide per-cycle increment for each
      performance monitor counter.  That works for most events, but
      now with AMD Family 17h and above processors, some events can
      occur more than 15 times in a cycle.  Those events are called
      "Large Increment per Cycle" events. In order to count these
      events, two adjacent h/w PMCs get their count signals merged
      to form 8 bits per cycle total.  In addition, the PERF_CTR count
      registers are merged to be able to count up to 64 bits.
      
      Normally, events like instructions retired, get programmed on a single
      counter like so:
      
      PERF_CTL0 (MSR 0xc0010200) 0x000000000053ff0c # event 0x0c, umask 0xff
      PERF_CTR0 (MSR 0xc0010201) 0x0000800000000001 # r/w 48-bit count
      
      The next counter at MSRs 0xc0010202-3 remains unused, or can be used
      independently to count something else.
      
      When counting Large Increment per Cycle events, such as FLOPs,
      however, we now have to reserve the next counter and program the
      PERF_CTL (config) register with the Merge event (0xFFF), like so:
      
      PERF_CTL0 (msr 0xc0010200) 0x000000000053ff03 # FLOPs event, umask 0xff
      PERF_CTR0 (msr 0xc0010201) 0x0000800000000001 # rd 64-bit cnt, wr lo 48b
      PERF_CTL1 (msr 0xc0010202) 0x0000000f004000ff # Merge event, enable bit
      PERF_CTR1 (msr 0xc0010203) 0x0000000000000000 # wr hi 16-bits count
      
      The count is widened from the normal 48-bits to 64 bits by having the
      second counter carry the higher 16 bits of the count in its lower 16
      bits of its counter register.
      
      The odd counter, e.g., PERF_CTL1, is programmed with the enabled Merge
      event before the even counter, PERF_CTL0.
      
      The Large Increment feature is available starting with Family 17h.
      For more details, search any Family 17h PPR for the "Large Increment
      per Cycle Events" section, e.g., section 2.1.15.3 on p. 173 in this
      version:
      
      https://www.amd.com/system/files/TechDocs/56176_ppr_Family_17h_Model_71h_B0_pub_Rev_3.06.zip
      
      Description of software operation
      ---------------------------------
      
      The following steps are taken in order to support reserving and
      enabling the extra counter for Large Increment per Cycle events:
      
      1. In the main x86 scheduler, we reduce the number of available
      counters by the number of Large Increment per Cycle events being
      scheduled, tracked by a new cpuc variable 'n_pair' and a new
      amd_put_event_constraints_f17h().  This improves the counter
      scheduler success rate.
      
      2. In perf_assign_events(), if a counter is assigned to a Large
      Increment event, we increment the current counter variable, so the
      counter used for the Merge event is removed from assignment
      consideration by upcoming event assignments.
      
      3. In find_counter(), if a counter has been found for the Large
      Increment event, we set the next counter as used, to prevent other
      events from using it.
      
      4. We perform steps 2 & 3 also in the x86 scheduler fastpath, i.e.,
      we add Merge event accounting to the existing used_mask logic.
      
      5. Finally, we add on the programming of Merge event to the
      neighbouring PMC counters in the counter enable/disable{_all}
      code paths.
      
      Currently, software does not support a single PMU with mixed 48- and
      64-bit counting, so Large increment event counts are limited to 48
      bits.  In set_period, we zero-out the upper 16 bits of the count, so
      the hardware doesn't copy them to the even counter's higher bits.
      
      Simple invocation example showing counting 8 FLOPs per 256-bit/%ymm
      vaddps instruction executed in a loop 100 million times:
      
      perf stat -e cpu/fp_ret_sse_avx_ops.all/,cpu/instructions/ <workload>
      
       Performance counter stats for '<workload>':
      
             800,000,000      cpu/fp_ret_sse_avx_ops.all/u
             300,042,101      cpu/instructions/u
      
      Prior to this patch, the reported SSE/AVX FLOPs retired count would
      be wrong.
      
      [peterz: lots of renames and edits to the code]
      Signed-off-by: NKim Phillips <kim.phillips@amd.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      57388912
    • K
      perf/x86/amd: Constrain Large Increment per Cycle events · 471af006
      Kim Phillips 提交于
      AMD Family 17h processors and above gain support for Large Increment
      per Cycle events.  Unfortunately there is no CPUID or equivalent bit
      that indicates whether the feature exists or not, so we continue to
      determine eligibility based on a CPU family number comparison.
      
      For Large Increment per Cycle events, we add a f17h-and-compatibles
      get_event_constraints_f17h() that returns an even counter bitmask:
      Large Increment per Cycle events can only be placed on PMCs 0, 2,
      and 4 out of the currently available 0-5.  The only currently
      public event that requires this feature to report valid counts
      is PMCx003 "Retired SSE/AVX Operations".
      
      Note that the CPU family logic in amd_core_pmu_init() is changed
      so as to be able to selectively add initialization for features
      available in ranges of backward-compatible CPU families.  This
      Large Increment per Cycle feature is expected to be retained
      in future families.
      
      A side-effect of assigning a new get_constraints function for f17h
      disables calling the old (prior to f15h) amd_get_event_constraints
      implementation left enabled by commit e40ed154 ("perf/x86: Add perf
      support for AMD family-17h processors"), which is no longer
      necessary since those North Bridge event codes are obsoleted.
      
      Also fix a spelling mistake whilst in the area (calulating ->
      calculating).
      
      Fixes: e40ed154 ("perf/x86: Add perf support for AMD family-17h processors")
      Signed-off-by: NKim Phillips <kim.phillips@amd.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20191114183720.19887-2-kim.phillips@amd.com
      471af006
  5. 28 10月, 2019 2 次提交
    • A
      perf/x86/intel: Implement LBR callstack context synchronization · 421ca868
      Alexey Budankov 提交于
      Implement intel_pmu_lbr_swap_task_ctx() method updating counters
      of the events that requested LBR callstack data on a sample.
      
      The counter can be zero for the case when task context belongs to
      a thread that has just come from a block on a futex and the context
      contains saved (lbr_stack_state == LBR_VALID) LBR register values.
      
      For the values to be restored at LBR registers on the next thread's
      switch-in event it swaps the counter value with the one that is
      expected to be non zero at the previous equivalent task perf event
      context.
      
      Swap operation type ensures the previous task perf event context
      stays consistent with the amount of events that requested LBR
      callstack data on a sample.
      Signed-off-by: NAlexey Budankov <alexey.budankov@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/261ac742-9022-c3f4-5885-1eae7415b091@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      421ca868
    • A
      perf/core, perf/x86: Introduce swap_task_ctx() method at 'struct pmu' · fc1adfe3
      Alexey Budankov 提交于
      Declare swap_task_ctx() methods at the generic and x86 specific
      pmu types to bridge calls to platform specific PMU code on optimized
      context switch path between equivalent task perf event contexts.
      Signed-off-by: NAlexey Budankov <alexey.budankov@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/9a0aa84a-f062-9b64-3133-373658550c4b@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fc1adfe3
  6. 28 8月, 2019 1 次提交
  7. 25 6月, 2019 2 次提交
  8. 03 6月, 2019 4 次提交
  9. 10 5月, 2019 1 次提交
    • S
      perf/x86/intel: Fix INTEL_FLAGS_EVENT_CONSTRAINT* masking · 6b89d4c1
      Stephane Eranian 提交于
      On Intel Westmere, a cmdline as follows:
      
        $ perf record -e cpu/event=0xc4,umask=0x2,name=br_inst_retired.near_call/p ....
      
      was failing. Yet the event+ umask support PEBS.
      
      It turns out this is due to a bug in the the PEBS event constraint table for
      westmere. All forms of BR_INST_RETIRED.* support PEBS. Therefore the constraint
      mask should ignore the umask. The name of the macro INTEL_FLAGS_EVENT_CONSTRAINT()
      hint that this is the case but it was not. That macros was checking both the
      event code and event umask. Therefore, it was only matching on 0x00c4.
      There are code+umask macros, they all have *UEVENT*.
      
      This bug fixes the issue by checking only the event code in the mask.
      Both single and range version are modified.
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/20190509214556.123493-1-eranian@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6b89d4c1
  10. 16 4月, 2019 7 次提交
    • K
      perf/x86/intel: Add Icelake support · 60176089
      Kan Liang 提交于
      Add Icelake core PMU perf code, including constraint tables and the main
      enable code.
      
      Icelake expanded the generic counters to always 8 even with HT on, but a
      range of events cannot be scheduled on the extra 4 counters.
      Add new constraint ranges to describe this to the scheduler.
      The number of constraints that need to be checked is larger now than
      with earlier CPUs.
      At some point we may need a new data structure to look them up more
      efficiently than with linear search. So far it still seems to be
      acceptable however.
      
      Icelake added a new fixed counter SLOTS. Full support for it is added
      later in the patch series.
      
      The cache events table is identical to Skylake.
      
      Compare to PEBS instruction event on generic counter, fixed counter 0
      has less skid. Force instruction:ppp always in fixed counter 0.
      Originally-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-9-kan.liang@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      60176089
    • P
      perf/x86: Support constraint ranges · 63b79f6e
      Peter Zijlstra 提交于
      Icelake extended the general counters to 8, even when SMT is enabled.
      However only a (large) subset of the events can be used on all 8
      counters.
      
      The events that can or cannot be used on all counters are organized
      in ranges.
      
      A lot of scheduler constraints are required to handle all this.
      
      To avoid blowing up the tables add event code ranges to the constraint
      tables, and a new inline function to match them.
      Originally-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> # developer hat on
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> # maintainer hat on
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-8-kan.liang@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      63b79f6e
    • A
      perf/x86/lbr: Avoid reading the LBRs when adaptive PEBS handles them · d3617b98
      Andi Kleen 提交于
      With adaptive PEBS the CPU can directly supply the LBR information,
      so we don't need to read it again. But the LBRs still need to be
      enabled. Add a special count to the cpuc that distinguishes these
      two cases, and avoid reading the LBRs unnecessarily when PEBS is
      active.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-7-kan.liang@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d3617b98
    • K
      perf/x86/intel: Support adaptive PEBS v4 · c22497f5
      Kan Liang 提交于
      Adaptive PEBS is a new way to report PEBS sampling information. Instead
      of a fixed size record for all PEBS events it allows to configure the
      PEBS record to only include the information needed. Events can then opt
      in to use such an extended record, or stay with a basic record which
      only contains the IP.
      
      The major new feature is to support LBRs in PEBS record.
      Besides normal LBR, this allows (much faster) large PEBS, while still
      supporting callstacks through callstack LBR. So essentially a lot of
      profiling can now be done without frequent interrupts, dropping the
      overhead significantly.
      
      The main requirement still is to use a period, and not use frequency
      mode, because frequency mode requires reevaluating the frequency on each
      overflow.
      
      The floating point state (XMM) is also supported, which allows efficient
      profiling of FP function arguments.
      
      Introduce specific drain function to handle variable length records.
      Use a new callback to parse the new record format, and also handle the
      STATUS field now being at a different offset.
      
      Add code to set up the configuration register. Since there is only a
      single register, all events either get the full super set of all events,
      or only the basic record.
      Originally-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-6-kan.liang@linux.intel.com
      [ Renamed GPRS => GP. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      c22497f5
    • K
      perf/x86: Support outputting XMM registers · 878068ea
      Kan Liang 提交于
      Starting from Icelake, XMM registers can be collected in PEBS record.
      But current code only output the pt_regs.
      
      Add a new struct x86_perf_regs for both pt_regs and xmm_regs. The
      xmm_regs will be used later to keep a pointer to PEBS record which has
      XMM information.
      
      XMM registers are 128 bit. To simplify the code, they are handled like
      two different registers, which means setting two bits in the register
      bitmap. This also allows only sampling the lower 64bit bits in XMM.
      
      The index of XMM registers starts from 32. There are 16 XMM registers.
      So all reserved space for regs are used. Remove REG_RESERVED.
      
      Add PERF_REG_X86_XMM_MAX, which stands for the max number of all x86
      regs including both GPRs and XMM.
      
      Add REG_NOSUPPORT for 32bit to exclude unsupported registers.
      
      Previous platforms can not collect XMM information in PEBS record.
      Adding pebs_no_xmm_regs to indicate the unsupported platforms.
      
      The common code still validates the supported registers. However, it
      cannot check model specific registers, e.g. XMM. Add extra check in
      x86_pmu_hw_config() to reject invalid config of regs_user and regs_intr.
      The regs_user never supports XMM collection.
      The regs_intr only supports XMM collection when sampling PEBS event on
      icelake and later platforms.
      Originally-by: NAndi Kleen <ak@linux.intel.com>
      Suggested-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-3-kan.liang@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      878068ea
    • S
      perf/x86/intel: Force resched when TFA sysctl is modified · f447e4eb
      Stephane Eranian 提交于
      This patch provides guarantee to the sysadmin that when TFA is disabled, no PMU
      event is using PMC3 when the echo command returns. Vice-Versa, when TFA
      is enabled, PMU can use PMC3 immediately (to eliminate possible multiplexing).
      
        $ perf stat -a -I 1000 --no-merge -e branches,branches,branches,branches
           1.000123979    125,768,725,208      branches
           1.000562520    125,631,000,456      branches
           1.000942898    125,487,114,291      branches
           1.001333316    125,323,363,620      branches
           2.004721306    125,514,968,546      branches
           2.005114560    125,511,110,861      branches
           2.005482722    125,510,132,724      branches
           2.005851245    125,508,967,086      branches
           3.006323475    125,166,570,648      branches
           3.006709247    125,165,650,056      branches
           3.007086605    125,164,639,142      branches
           3.007459298    125,164,402,912      branches
           4.007922698    125,045,577,140      branches
           4.008310775    125,046,804,324      branches
           4.008670814    125,048,265,111      branches
           4.009039251    125,048,677,611      branches
           5.009503373    125,122,240,217      branches
           5.009897067    125,122,450,517      branches
      
      Then on another connection, sysadmin does:
      
        $ echo  1 >/sys/devices/cpu/allow_tsx_force_abort
      
      Then perf stat adjusts the events immediately:
      
           5.010286029    125,121,393,483      branches
           5.010646308    125,120,556,786      branches
           6.011113588    124,963,351,832      branches
           6.011510331    124,964,267,566      branches
           6.011889913    124,964,829,130      branches
           6.012262996    124,965,841,156      branches
           7.012708299    124,419,832,234      branches [79.69%]
           7.012847908    124,416,363,853      branches [79.73%]
           7.013225462    124,400,723,712      branches [79.73%]
           7.013598191    124,376,154,434      branches [79.70%]
           8.014089834    124,250,862,693      branches [74.98%]
           8.014481363    124,267,539,139      branches [74.94%]
           8.014856006    124,259,519,786      branches [74.98%]
           8.014980848    124,225,457,969      branches [75.04%]
           9.015464576    124,204,235,423      branches [75.03%]
           9.015858587    124,204,988,490      branches [75.04%]
           9.016243680    124,220,092,486      branches [74.99%]
           9.016620104    124,231,260,146      branches [74.94%]
      
      And vice-versa if the syadmin does:
      
        $ echo  0 >/sys/devices/cpu/allow_tsx_force_abort
      
      Events are again spread over the 4 counters:
      
          10.017096277    124,276,230,565      branches [74.96%]
          10.017237209    124,228,062,171      branches [75.03%]
          10.017478637    124,178,780,626      branches [75.03%]
          10.017853402    124,198,316,177      branches [75.03%]
          11.018334423    124,602,418,933      branches [85.40%]
          11.018722584    124,602,921,320      branches [85.42%]
          11.019095621    124,603,956,093      branches [85.42%]
          11.019467742    124,595,273,783      branches [85.42%]
          12.019945736    125,110,114,864      branches
          12.020330764    125,109,334,472      branches
          12.020688740    125,109,818,865      branches
          12.021054020    125,108,594,014      branches
          13.021516774    125,109,164,018      branches
          13.021903640    125,108,794,510      branches
          13.022270770    125,107,756,978      branches
          13.022630819    125,109,380,471      branches
          14.023114989    125,133,140,817      branches
          14.023501880    125,133,785,858      branches
          14.023868339    125,133,852,700      branches
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Cc: nelson.dsouza@intel.com
      Cc: tonyj@suse.com
      Link: https://lkml.kernel.org/r/20190408173252.37932-3-eranian@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f447e4eb
    • K
      perf/x86: Fix incorrect PEBS_REGS · 9d5dcc93
      Kan Liang 提交于
      PEBS_REGS used as mask for the supported registers for large PEBS.
      However, the mask cannot filter the sample_regs_user/sample_regs_intr
      correctly.
      
      (1ULL << PERF_REG_X86_*) should be used to replace PERF_REG_X86_*, which
      is only the index.
      
      Rename PEBS_REGS to PEBS_GP_REGS, because the mask is only for general
      purpose registers.
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Fixes: 2fe1bc1f ("perf/x86: Enable free running PEBS for REGS_USER/INTR")
      Link: https://lkml.kernel.org/r/20190402194509.2832-2-kan.liang@linux.intel.com
      [ Renamed it to PEBS_GP_REGS - as 'GPRS' is used elsewhere ;-) ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      9d5dcc93
  11. 03 4月, 2019 1 次提交
    • P
      perf/x86: Remove PERF_X86_EVENT_COMMITTED · 1f6a1e2d
      Peter Zijlstra 提交于
      The flag PERF_X86_EVENT_COMMITTED is used to find uncommitted events
      for which to call put_event_constraint() when scheduling fails.
      
      These are the newly added events to the list, and must form, per
      definition, the tail of cpuc->event_list[]. By computing the list
      index of the last successfull schedule, then iteration can start there
      and the flag is redundant.
      
      There are only 3 callers of x86_schedule_events(), notably:
      
       - x86_pmu_add()
       - x86_pmu_commit_txn()
       - validate_group()
      
      For x86_pmu_add(), cpuc->n_events isn't updated until after
      schedule_events() succeeds, therefore cpuc->n_events points to the
      desired index.
      
      For x86_pmu_commit_txn(), cpuc->n_events is updated, but we can
      trivially compute the desired value with cpuc->n_txn -- the number of
      events added in this transaction.
      
      For validate_group(), we can make the rule for x86_pmu_add() work by
      simply setting cpuc->n_events to 0 before calling schedule_events().
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NStephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      1f6a1e2d
  12. 15 3月, 2019 1 次提交
  13. 06 3月, 2019 2 次提交
  14. 11 2月, 2019 2 次提交
    • J
      perf/x86: Add check_period PMU callback · 81ec3f3c
      Jiri Olsa 提交于
      Vince (and later on Ravi) reported crashes in the BTS code during
      fuzzing with the following backtrace:
      
        general protection fault: 0000 [#1] SMP PTI
        ...
        RIP: 0010:perf_prepare_sample+0x8f/0x510
        ...
        Call Trace:
         <IRQ>
         ? intel_pmu_drain_bts_buffer+0x194/0x230
         intel_pmu_drain_bts_buffer+0x160/0x230
         ? tick_nohz_irq_exit+0x31/0x40
         ? smp_call_function_single_interrupt+0x48/0xe0
         ? call_function_single_interrupt+0xf/0x20
         ? call_function_single_interrupt+0xa/0x20
         ? x86_schedule_events+0x1a0/0x2f0
         ? x86_pmu_commit_txn+0xb4/0x100
         ? find_busiest_group+0x47/0x5d0
         ? perf_event_set_state.part.42+0x12/0x50
         ? perf_mux_hrtimer_restart+0x40/0xb0
         intel_pmu_disable_event+0xae/0x100
         ? intel_pmu_disable_event+0xae/0x100
         x86_pmu_stop+0x7a/0xb0
         x86_pmu_del+0x57/0x120
         event_sched_out.isra.101+0x83/0x180
         group_sched_out.part.103+0x57/0xe0
         ctx_sched_out+0x188/0x240
         ctx_resched+0xa8/0xd0
         __perf_event_enable+0x193/0x1e0
         event_function+0x8e/0xc0
         remote_function+0x41/0x50
         flush_smp_call_function_queue+0x68/0x100
         generic_smp_call_function_single_interrupt+0x13/0x30
         smp_call_function_single_interrupt+0x3e/0xe0
         call_function_single_interrupt+0xf/0x20
         </IRQ>
      
      The reason is that while event init code does several checks
      for BTS events and prevents several unwanted config bits for
      BTS event (like precise_ip), the PERF_EVENT_IOC_PERIOD allows
      to create BTS event without those checks being done.
      
      Following sequence will cause the crash:
      
      If we create an 'almost' BTS event with precise_ip and callchains,
      and it into a BTS event it will crash the perf_prepare_sample()
      function because precise_ip events are expected to come
      in with callchain data initialized, but that's not the
      case for intel_pmu_drain_bts_buffer() caller.
      
      Adding a check_period callback to be called before the period
      is changed via PERF_EVENT_IOC_PERIOD. It will deny the change
      if the event would become BTS. Plus adding also the limit_period
      check as well.
      Reported-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20190204123532.GA4794@kravaSigned-off-by: NIngo Molnar <mingo@kernel.org>
      81ec3f3c
    • A
      perf/x86/kvm: Avoid unnecessary work in guest filtering · 9b545c04
      Andi Kleen 提交于
      KVM added a workaround for PEBS events leaking into guests with
      commit:
      
        26a4f3c0 ("perf/x86: disable PEBS on a guest entry.")
      
      This uses the VT entry/exit list to add an extra disable of the
      PEBS_ENABLE MSR.
      
      Intel also added a fix for this issue to microcode updates on
      Haswell/Broadwell/Skylake.
      
      It turns out using the MSR entry/exit list makes VM exits
      significantly slower. The list is only needed for disabling
      PEBS, because the GLOBAL_CTRL change gets optimized by
      KVM into changing the VMCS.
      
      Check for the microcode updates that have the microcode
      fix for leaking PEBS, and disable the extra entry/exit list
      entry for PEBS_ENABLE. In addition we always clear the
      GLOBAL_CTRL for the PEBS counter while running in the guest,
      which is enough to make them never fire at the wrong
      side of the host/guest transition.
      
      The overhead for VM exits with the filtering active with the patch is
      reduced from 8% to 4%.
      
      The microcode patch has already been merged into future platforms.
      This patch is one-off thing. The quirks is used here.
      
      For other old platforms which doesn't have microcode patch and quirks,
      extra disable of the PEBS_ENABLE MSR is still required.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: bp@alien8.de
      Link: https://lkml.kernel.org/r/1549319013-4522-2-git-send-email-kan.liang@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9b545c04
  15. 22 11月, 2018 1 次提交
    • J
      perf/x86/intel: Add generic branch tracing check to intel_pmu_has_bts() · 67266c10
      Jiri Olsa 提交于
      Currently we check the branch tracing only by checking for the
      PERF_COUNT_HW_BRANCH_INSTRUCTIONS event of PERF_TYPE_HARDWARE
      type. But we can define the same event with the PERF_TYPE_RAW
      type.
      
      Changing the intel_pmu_has_bts() code to check on event's final
      hw config value, so both HW types are covered.
      
      Adding unlikely to intel_pmu_has_bts() condition calls, because
      it was used in the original code in intel_bts_constraints.
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/20181121101612.16272-2-jolsa@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      67266c10
  16. 02 10月, 2018 1 次提交
    • A
      perf/x86/intel: Add a separate Arch Perfmon v4 PMI handler · af3bdb99
      Andi Kleen 提交于
      Implements counter freezing for Arch Perfmon v4 (Skylake and
      newer). This allows to speed up the PMI handler by avoiding
      unnecessary MSR writes and make it more accurate.
      
      The Arch Perfmon v4 PMI handler is substantially different than
      the older PMI handler.
      
      Differences to the old handler:
      
      - It relies on counter freezing, which eliminates several MSR
        writes from the PMI handler and lowers the overhead significantly.
      
        It makes the PMI handler more accurate, as all counters get
        frozen atomically as soon as any counter overflows. So there is
        much less counting of the PMI handler itself.
      
        With the freezing we don't need to disable or enable counters or
        PEBS. Only BTS which does not support auto-freezing still needs to
        be explicitly managed.
      
      - The PMU acking is done at the end, not the beginning.
        This makes it possible to avoid manual enabling/disabling
        of the PMU, instead we just rely on the freezing/acking.
      
      - The APIC is acked before reenabling the PMU, which avoids
        problems with LBRs occasionally not getting unfreezed on Skylake.
      
      - Looping is only needed to workaround a corner case which several PMIs
        are very close to each other. For common cases, the counters are freezed
        during PMI handler. It doesn't need to do re-check.
      
      This patch:
      
      - Adds code to enable v4 counter freezing
      - Fork <=v3 and >=v4 PMI handlers into separate functions.
      - Add kernel parameter to disable counter freezing. It took some time to
        debug counter freezing, so in case there are new problems we added an
        option to turn it off. Would not expect this to be used until there
        are new bugs.
      - Only for big core. The patch for small core will be posted later
        separately.
      
      Performance:
      
      When profiling a kernel build on Kabylake with different perf options,
      measuring the length of all NMI handlers using the nmi handler
      trace point:
      
      V3 is without counter freezing.
      V4 is with counter freezing.
      The value is the average cost of the PMI handler.
      (lower is better)
      
      perf options    `           V3(ns) V4(ns)  delta
      -c 100000                   1088   894     -18%
      -g -c 100000                1862   1646    -12%
      --call-graph lbr -c 100000  3649   3367    -8%
      --c.g. dwarf -c 100000      2248   1982    -12%
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Link: http://lkml.kernel.org/r/1533712328-2834-2-git-send-email-kan.liang@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      af3bdb99
  17. 25 7月, 2018 1 次提交