1. 01 Feb, 2021 2 commits
    • K
      perf/x86/intel: Add perf core PMU support for Sapphire Rapids · 61b985e3
      Kan Liang committed
      Add perf core PMU support for the Intel Sapphire Rapids server, which is
      the successor of the Intel Ice Lake server. The enabling code is based
      on Ice Lake, but there are several new features introduced.
      
      The event encoding is changed and simplified, e.g., the event codes
      which are below 0x90 are restricted to counters 0-3. The event codes
      which are above 0x90 are likely to have no restrictions. The event
      constraints, extra_regs(), and hardware cache events table are changed
      accordingly.
      
      A new Precise Distribution (PDist) facility is introduced, which
      further minimizes the skid when a precise event is programmed on GP
      counter 0. Enable the Precise Distribution (PDist) facility with the
      :ppp event modifier. For this facility to work, the period must be
      initialized with a value larger than 127. Add spr_limit_period() to
      apply the limit for :ppp events.
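
      The period limit described above can be sketched as follows. This is a
      minimal illustration, not the kernel implementation: the struct, the
      function name sketch_limit_period() and the constant are hypothetical
      stand-ins, and precise_ip == 3 is taken to correspond to :ppp.

```c
#include <stdint.h>

/* Hypothetical stand-in for an event; only the precision level is modeled. */
struct sketch_event {
    unsigned int precise_ip;   /* 3 corresponds to the :ppp modifier */
};

/* Smallest period that satisfies "larger than 127" (assumed reading). */
#define SKETCH_PDIST_MIN_PERIOD 128ULL

/* Raise too-small periods for :ppp events; leave all other events untouched. */
uint64_t sketch_limit_period(const struct sketch_event *ev, uint64_t left)
{
    if (ev->precise_ip == 3 && left < SKETCH_PDIST_MIN_PERIOD)
        return SKETCH_PDIST_MIN_PERIOD;
    return left;
}
```

      With this sketch, a :ppp event that asks for a period of 100 is raised
      to 128, while a :p event keeps its requested period.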
      
      Two new data source fields, data block & address block, are added in the
      PEBS Memory Info Record for the load latency event. To enable the
      feature,
      - An auxiliary event has to be enabled together with the load latency
        event on Sapphire Rapids. A new flag PMU_FL_MEM_LOADS_AUX is
        introduced to indicate the case. A new event, mem-loads-aux, is
        exposed to sysfs for the user tool.
        Add a check in hw_config(). If the auxiliary event is not detected,
        return a unique error, -ENODATA.
      - The union perf_mem_data_src is extended to support the new fields.
      - Ice Lake and earlier models do not support block information, but the
        fields may be set by HW on some machines. Add pebs_no_block to
        explicitly mark the earlier platforms which don't support the new
        block fields. Accesses to the new block fields are ignored on those
        platforms.
      
      A new Store Latency facility is introduced, which leverages the PEBS
      facility to provide additional information about sampled stores. The
      additional information includes the data address, memory auxiliary
      info (e.g. Data Source, STLB miss) and the latency of the store
      access. To enable the facility, the new event (0x02cd) has to be
      programmed on GP counter 0. A new flag, PERF_X86_EVENT_PEBS_STLAT, is
      introduced to indicate the event. store_latency_data() is introduced
      to parse the memory auxiliary info.
      
      The layout of the access latency field of the PEBS Memory Info Record
      has been changed. Two latencies, the instruction latency (bits 15:0)
      and the cache access latency (bits 47:32), are recorded.
      - The cache access latency is similar to the previous memory access
        latency.
        For loads, the latency runs from the actual cache access until the
        data is returned by the memory subsystem.
        For stores, the latency starts when the demand write accesses the L1
        data cache and lasts until the cacheline write is completed in the
        memory subsystem.
        The cache access latency is stored in the low 32 bits of the sample
        type PERF_SAMPLE_WEIGHT_STRUCT.
      - The instruction latency starts at the dispatch of the load operation
        for execution and lasts until completion of the instruction it belongs
        to.
        Add a new flag PMU_FL_INSTR_LATENCY to indicate the instruction
        latency support. The instruction latency is stored in bits 47:32
        of the sample type PERF_SAMPLE_WEIGHT_STRUCT.
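
      Note that the two latencies sit at different bit positions in the PEBS
      record and in the sample weight. The sketch below shows the
      rearrangement with plain shifts and masks; the type and function names
      are hypothetical and only the bit positions come from the text above.

```c
#include <stdint.h>

/* Hypothetical container for the two latencies; not a kernel type. */
struct sketch_latency {
    uint16_t instr;   /* instruction latency, bits 15:0 of the PEBS field */
    uint16_t cache;   /* cache access latency, bits 47:32 of the PEBS field */
};

/* Split the raw PEBS access latency field into its two components. */
struct sketch_latency sketch_split_latency(uint64_t pebs_latency)
{
    struct sketch_latency l;

    l.instr = (uint16_t)(pebs_latency & 0xffff);
    l.cache = (uint16_t)((pebs_latency >> 32) & 0xffff);
    return l;
}

/*
 * Build the 64-bit sample weight: cache latency in the low 32 bits,
 * instruction latency in bits 47:32.
 */
uint64_t sketch_pack_weight(struct sketch_latency l)
{
    return (uint64_t)l.cache | ((uint64_t)l.instr << 32);
}
```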
      
      Extend the PERF_METRICS MSR to feature TMA method level 2 metrics. The
      lower half of the register holds the TMA level 1 metrics (legacy). The
      upper half is also divided into four 8-bit fields for the new level 2
      metrics. Expose all eight Topdown metrics events to user space.
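
      As an illustration of that layout, the eight 8-bit fields could be
      read out as below. The helper name and the index convention (0-3 for
      level 1, 4-7 for level 2) are assumptions made for this sketch; how a
      raw field is scaled into a metric value is not covered here.

```c
#include <stdint.h>

/*
 * Return one raw 8-bit metric field from a PERF_METRICS value.
 * idx 0-3 selects a level 1 field (lower half of the register),
 * idx 4-7 selects a level 2 field (upper half).
 */
uint8_t sketch_metric_field(uint64_t perf_metrics, unsigned int idx)
{
    return (uint8_t)((perf_metrics >> (8 * (idx & 7))) & 0xff);
}
```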
      
      The full description of the SPR features can be found in the Intel
      Architecture Instruction Set Extensions and Future Features
      Programming Reference, 319433-041.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1611873611-156687-5-git-send-email-kan.liang@linux.intel.com
      61b985e3
    • K
      perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT · 2a6c6b7d
      Kan Liang committed
      The current PERF_SAMPLE_WEIGHT sample type is very useful for expressing
      the cost of an action represented by the sample. This allows the profiler
      to scale the samples to be more informative to the programmer. It could
      also help to locate a hotspot, e.g., when profiling by memory latencies,
      the expensive loads appear higher up in the histograms. But the current
      PERF_SAMPLE_WEIGHT sample type is determined by a single factor. This
      could be a problem if users want two or more factors to contribute to
      the weight. For example, the Golden Cove core PMU can provide both the
      instruction latency and the cache latency information as factors for
      memory profiling.
      
      For current X86 platforms, although meminfo::latency is defined as a
      u64, only the lower 32 bits contain valid data in practice (no
      memory access could last longer than 4G cycles). The higher 32 bits can
      be used to store new factors.
      
      Add a new sample type, PERF_SAMPLE_WEIGHT_STRUCT, to indicate the new
      sample weight structure. It shares the same space as the
      PERF_SAMPLE_WEIGHT sample type.
      
      Users can apply either the PERF_SAMPLE_WEIGHT sample type or the
      PERF_SAMPLE_WEIGHT_STRUCT sample type to retrieve the sample weight, but
      they cannot apply both sample types simultaneously.
      
      Currently, only X86 and PowerPC use the PERF_SAMPLE_WEIGHT sample type.
      - For PowerPC, nothing changes for the PERF_SAMPLE_WEIGHT sample type,
        and the new PERF_SAMPLE_WEIGHT_STRUCT sample type has no effect.
        PowerPC can restructure the weight field similarly later.
      - For X86, the same value will be dumped for the PERF_SAMPLE_WEIGHT
        sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample type for now.
        The following patches will apply the new factors for the
        PERF_SAMPLE_WEIGHT_STRUCT sample type.
      
      The field in the union perf_sample_weight should be shared among
      different architectures. A generic name is required, but it's hard to
      abstract a name that applies to all architectures. For example, on X86,
      the fields store various kinds of latency, while on PowerPC the field
      stores MMCRA[TECX/TECM], which should not be treated as latency. So a
      general name prefix 'var$NUM' is used here.
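
      A rough sketch of the idea: one 64-bit slot viewed either as a single
      value or as generically named sub-fields, plus a check that rejects
      requesting both sample types together. The union name, the flag values
      and the exact field suffixes and widths are assumptions for this
      sketch, and the field order assumes a little-endian machine.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values; the real ones are defined in the perf uapi header. */
#define SKETCH_SAMPLE_WEIGHT        (1ULL << 0)
#define SKETCH_SAMPLE_WEIGHT_STRUCT (1ULL << 1)

/* One 64-bit slot, seen as a plain value or as 'var$NUM'-style sub-fields. */
union sketch_sample_weight {
    uint64_t full;            /* view used by the plain weight sample type */
    struct {
        uint32_t var1_dw;     /* low 32 bits */
        uint16_t var2_w;      /* bits 47:32 */
        uint16_t var3_w;      /* bits 63:48 */
    };
};

/* Both sample types share the same space, so asking for both is invalid. */
bool sketch_weight_type_valid(uint64_t sample_type)
{
    const uint64_t both = SKETCH_SAMPLE_WEIGHT | SKETCH_SAMPLE_WEIGHT_STRUCT;

    return (sample_type & both) != both;
}
```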
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1611873611-156687-2-git-send-email-kan.liang@linux.intel.com
      2a6c6b7d
  2. 03 Dec, 2020 2 commits
  3. 10 Nov, 2020 3 commits
  4. 29 Oct, 2020 1 commit
  5. 26 Oct, 2020 1 commit
  6. 10 Sep, 2020 1 commit
    • K
      perf/x86/intel/ds: Fix x86_pmu_stop warning for large PEBS · 35d1ce6b
      Kan Liang committed
      A warning like the one below may be triggered when sampling with large PEBS.
      
      [  410.411250] perf: interrupt took too long (72145 > 71975), lowering
      kernel.perf_event_max_sample_rate to 2000
      [  410.724923] ------------[ cut here ]------------
      [  410.729822] WARNING: CPU: 0 PID: 16397 at arch/x86/events/core.c:1422
      x86_pmu_stop+0x95/0xa0
      [  410.933811]  x86_pmu_del+0x50/0x150
      [  410.937304]  event_sched_out.isra.0+0xbc/0x210
      [  410.941751]  group_sched_out.part.0+0x53/0xd0
      [  410.946111]  ctx_sched_out+0x193/0x270
      [  410.949862]  __perf_event_task_sched_out+0x32c/0x890
      [  410.954827]  ? set_next_entity+0x98/0x2d0
      [  410.958841]  __schedule+0x592/0x9c0
      [  410.962332]  schedule+0x5f/0xd0
      [  410.965477]  exit_to_usermode_loop+0x73/0x120
      [  410.969837]  prepare_exit_to_usermode+0xcd/0xf0
      [  410.974369]  ret_from_intr+0x2a/0x3a
      [  410.977946] RIP: 0033:0x40123c
      [  411.079661] ---[ end trace bc83adaea7bb664a ]---
      
      In the non-overflow context, e.g., context switch, with large PEBS, perf
      may stop an event twice. An example is below.
      
        //max_samples_per_tick is adjusted to 2
        //NMI is triggered
        intel_pmu_handle_irq()
           handle_pmi_common()
             drain_pebs()
               __intel_pmu_pebs_event()
                 perf_event_overflow()
                   __perf_event_account_interrupt()
                     hwc->interrupts = 1
                     return 0
        //A context switch happens right after the NMI.
        //In the same tick, the perf_throttled_seq is not changed.
        perf_event_task_sched_out()
           perf_pmu_sched_task()
             intel_pmu_drain_pebs_buffer()
               __intel_pmu_pebs_event()
                 perf_event_overflow()
                   __perf_event_account_interrupt()
                     ++hwc->interrupts >= max_samples_per_tick
                     return 1
                 x86_pmu_stop();  # First stop
           perf_event_context_sched_out()
             task_ctx_sched_out()
               ctx_sched_out()
                 event_sched_out()
                   x86_pmu_del()
                     x86_pmu_stop();  # Second stop and trigger the warning
      
      Perf should only invoke perf_event_overflow() in the overflow
      context.
      
      Currently, drain_pebs() is called from:
      - handle_pmi_common()			-- overflow context
      - intel_pmu_pebs_sched_task()		-- non-overflow context
      - intel_pmu_pebs_disable()		-- non-overflow context
      - intel_pmu_auto_reload_read()		-- possible overflow context
        With PERF_SAMPLE_READ + PERF_FORMAT_GROUP, the function may be
        invoked in the NMI handler. But, before calling the function, the
        PEBS buffer has already been drained. The __intel_pmu_pebs_event()
        will not be called in the possible overflow context.
      
      To fix the issue, an indicator is required to distinguish between the
      overflow context, i.e. handle_pmi_common(), and other cases.
      The dummy regs pointer can be used as the indicator.
      
      In the non-overflow context, perf should treat the last record the same
      as other PEBS records, and should not invoke the generic overflow handler.
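
      The double stop can be reproduced with a toy model. Everything below
      is hypothetical (names, struct, the throttle limit of 2 taken from the
      example trace); the in_overflow_ctx flag plays the role that the dummy
      regs pointer plays as the indicator.

```c
#include <stdbool.h>

/* Toy model of an event: only throttle accounting and stop calls are tracked. */
struct sketch_event {
    int interrupts;   /* overflows accounted in the current tick */
    int stops;        /* how many times the event has been stopped */
};

enum { SKETCH_MAX_SAMPLES_PER_TICK = 2 };

/* Model of the generic overflow handler: true means "throttled, stop it". */
bool sketch_overflow(struct sketch_event *ev)
{
    return ++ev->interrupts >= SKETCH_MAX_SAMPLES_PER_TICK;
}

/* Handle the last PEBS record of a drain. */
void sketch_handle_last_record(struct sketch_event *ev, bool in_overflow_ctx)
{
    if (!in_overflow_ctx)
        return;               /* treated like any other record: no throttling */
    if (sketch_overflow(ev))
        ev->stops++;          /* stop requested by the overflow handler */
}

/* Scheduling the event out always stops it once. */
void sketch_sched_out(struct sketch_event *ev)
{
    ev->stops++;
}
```

      Starting from interrupts = 1, draining in the non-overflow context and
      then scheduling out yields one stop; treating the same drain as an
      overflow yields two, which is the situation the warning reports.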
      
      Fixes: 21509084 ("perf/x86/intel: Handle multiple records in the PEBS buffer")
      Reported-by: Like Xu <like.xu@linux.intel.com>
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Like Xu <like.xu@linux.intel.com>
      Link: https://lkml.kernel.org/r/20200902210649.2743-1-kan.liang@linux.intel.com
      35d1ce6b
  7. 08 Jul, 2020 1 commit
  8. 11 Feb, 2020 1 commit
    • K
      perf/x86/intel: Fix inaccurate period in context switch for auto-reload · f861854e
      Kan Liang committed
      Perf doesn't take the remaining period into account across a context
      switch when auto-reload is enabled in fixed-period sampling mode.
      
      Below is the MSR trace of the following perf command.
      (The MSR trace is simplified from an ftrace log.)
      
          #perf record -e cycles:p -c 2000000 -- ./triad_loop
      
            //The MSR trace of task schedule out
            //perf disable all counters, disable PEBS, disable GP counter 0,
            //read GP counter 0, and re-enable all counters.
            //The counter 0 stops at 0xfffffff82840
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
            write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 0
            write_msr: MSR_P6_EVNTSEL0(186), value 40003003c
            rdpmc: 0, value fffffff82840
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff
      
            //The MSR trace of the same task schedule in again
            //perf disable all counters, enable and set GP counter 0,
            //enable PEBS, and re-enable all counters.
            //0xffffffe17b80 (-2000000) is written to GP counter 0.
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
            write_msr: MSR_IA32_PMC0(4c1), value ffffffe17b80
            write_msr: MSR_P6_EVNTSEL0(186), value 40043003c
            write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 1
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff
      
      When the same task is scheduled in again, the counter should start from
      the previously remaining period. However, it starts from the fixed
      period -2000000 again.
      
      A special variant of intel_pmu_save_and_restart() is used for
      auto-reload, which doesn't update hwc->period_left.
      When the monitored task is scheduled in again, perf doesn't know the
      remaining period. The fixed period is used, which is inaccurate.
      
      With auto-reload, the counter always has a negative counter value. So
      the remaining period is -value. Update the period_left in
      intel_pmu_save_and_restart_reload().
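
      The "-value" arithmetic can be checked against the traces with a small
      sketch. The 48-bit counter width is an assumption inferred from the
      12-hex-digit values in the trace, and the function name is
      hypothetical.

```c
#include <stdint.h>

/* Counter width assumed from the trace values (12 hex digits). */
#define SKETCH_CNTVAL_BITS 48

/* Remaining period of a counter that counts up from a negative value to 0. */
uint64_t sketch_period_left(uint64_t raw_counter)
{
    const uint64_t mask = (1ULL << SKETCH_CNTVAL_BITS) - 1;

    /* Two's complement negation, truncated to the counter width. */
    return (~raw_counter + 1) & mask;
}
```

      For the reload value 0xffffffe17b80 this gives 2000000, matching the
      -c 2000000 period of the command; for the value 0xffffffe25cbc read at
      schedule-out with the patch, it gives 1942340 events still to count.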
      
      With the patch:
      
            //The MSR trace of task schedule out
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
            write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 0
            write_msr: MSR_P6_EVNTSEL0(186), value 40003003c
            rdpmc: 0, value ffffffe25cbc
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff
      
            //The MSR trace of the same task schedule in again
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
            write_msr: MSR_IA32_PMC0(4c1), value ffffffe25cbc
            write_msr: MSR_P6_EVNTSEL0(186), value 40043003c
            write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 1
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff
      
      Fixes: d31fc13f ("perf/x86/intel: Fix event update for auto-reload")
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20200121190125.3389-1-kan.liang@linux.intel.com
      f861854e
  9. 10 Dec, 2019 1 commit
  10. 28 Aug, 2019 1 commit
  11. 25 Jul, 2019 1 commit
  12. 25 Jun, 2019 3 commits
  13. 21 May, 2019 1 commit
    • S
      perf/x86/intel/ds: Fix EVENT vs. UEVENT PEBS constraints · 23e3983a
      Stephane Eranian committed
      This patch fixes a bug revealed by the following commit:
      
        6b89d4c1 ("perf/x86/intel: Fix INTEL_FLAGS_EVENT_CONSTRAINT* masking")
      
      That patch modified INTEL_FLAGS_EVENT_CONSTRAINT() to only look at the event code
      when matching a constraint. If code+umask were needed, then the
      INTEL_FLAGS_UEVENT_CONSTRAINT() macro was needed instead.
      This broke some of the constraints for PEBS events.
      
      Several of them, including the ones used for cycles:p, cycles:pp and cycles:ppp,
      fell into that category and caused the event to be rejected in PEBS mode.
      In other words, on some platforms a command line such as:
      
        $ perf top -e cycles:pp
      
      would fail with -EINVAL.
      
      This patch fixes this bug by properly using INTEL_FLAGS_UEVENT_CONSTRAINT()
      when needed in the PEBS constraint tables.
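
      One way such a mismatch can arise is sketched below: a constraint whose
      code includes umask bits can never match if only the event-code bits of
      the config are compared. The struct, the masks and the example value
      0x01c2 are illustrative assumptions, not the kernel's macros or tables.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_EVENT_MASK  0x00ffULL   /* event code only (bits 7:0) */
#define SKETCH_UEVENT_MASK 0xffffULL   /* event code plus umask (bits 15:0) */

/* Simplified constraint: a code to compare against and the bits to compare. */
struct sketch_constraint {
    uint64_t code;
    uint64_t cmask;
};

/* A config matches when its masked bits equal the constraint's code. */
bool sketch_match(const struct sketch_constraint *c, uint64_t config)
{
    return (config & c->cmask) == c->code;
}
```

      With code 0x01c2, an event-only mask compares 0xc2 against 0x01c2 and
      never matches, whereas the code+umask mask matches the same config.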
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/20190521005246.423-1-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      23e3983a
  14. 29 Apr, 2019 1 commit
    • I
      x86/paravirt: Standardize 'insn_buff' variable names · 1fc654cf
      Ingo Molnar committed
      We currently have 6 (!) separate naming variants to name temporary instruction
      buffers that are used for code patching:
      
       - insnbuf
       - insnbuff
       - insn_buff
       - insn_buffer
       - ibuf
       - ibuffer
      
      These are used as local variables, percpu fields and function parameters.
      
      Standardize all the names to a single variant: 'insn_buff'.
      
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1fc654cf
  15. 16 Apr, 2019 6 commits
    • K
      perf/x86/intel: Add Icelake support · 60176089
      Kan Liang committed
      Add Icelake core PMU perf code, including constraint tables and the main
      enable code.
      
      Icelake expanded the generic counters to 8, even with HT on, but a
      range of events cannot be scheduled on the extra 4 counters.
      Add new constraint ranges to describe this to the scheduler.
      The number of constraints that need to be checked is larger now than
      with earlier CPUs.
      At some point we may need a new data structure to look them up more
      efficiently than with linear search. So far it still seems to be
      acceptable however.
      
      Icelake added a new fixed counter SLOTS. Full support for it is added
      later in the patch series.
      
      The cache events table is identical to Skylake.
      
      Compared to the PEBS instruction event on a generic counter, fixed counter 0
      has less skid. Always force instruction:ppp onto fixed counter 0.
      Originally-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-9-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      60176089
    • P
      perf/x86: Support constraint ranges · 63b79f6e
      Peter Zijlstra committed
      Icelake extended the general counters to 8, even when SMT is enabled.
      However only a (large) subset of the events can be used on all 8
      counters.
      
      The events that can or cannot be used on all counters are organized
      in ranges.
      
      A lot of scheduler constraints are required to handle all this.
      
      To avoid blowing up the tables add event code ranges to the constraint
      tables, and a new inline function to match them.
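
      A range constraint of this kind might be matched as in the sketch
      below. The struct layout and names are hypothetical; the single
      unsigned comparison is just one compact way to test "start <= code <=
      start + size".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical range constraint: codes in [code, code + size] match. */
struct sketch_range_constraint {
    uint64_t code;    /* first event code of the range */
    uint64_t size;    /* number of additional codes covered */
    uint64_t cmask;   /* which config bits form the event code */
};

bool sketch_range_match(const struct sketch_range_constraint *c, uint64_t config)
{
    uint64_t ecode = config & c->cmask;

    /* If ecode < code, the unsigned subtraction wraps to a huge value. */
    return (ecode - c->code) <= c->size;
}
```

      One table entry then covers a whole run of event codes, for example
      0x10 through 0x1f with code = 0x10 and size = 0x0f.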
      Originally-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> # developer hat on
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> # maintainer hat on
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-8-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      63b79f6e
    • K
      perf/x86/intel: Support adaptive PEBS v4 · c22497f5
      Kan Liang committed
      Adaptive PEBS is a new way to report PEBS sampling information. Instead
      of a fixed size record for all PEBS events, it allows configuring the
      PEBS record to only include the information needed. Events can then opt
      in to use such an extended record, or stay with a basic record which
      only contains the IP.
      
      The major new feature is support for LBRs in the PEBS record.
      Besides normal LBR, this allows (much faster) large PEBS, while still
      supporting callstacks through callstack LBR. So essentially a lot of
      profiling can now be done without frequent interrupts, dropping the
      overhead significantly.
      
      The main requirement still is to use a period, and not use frequency
      mode, because frequency mode requires reevaluating the frequency on each
      overflow.
      
      The floating point state (XMM) is also supported, which allows efficient
      profiling of FP function arguments.
      
      Introduce a specific drain function to handle variable length records.
      Use a new callback to parse the new record format, and also handle the
      STATUS field now being at a different offset.
      
      Add code to set up the configuration register. Since there is only a
      single register, all events either get the full superset of all events,
      or only the basic record.
      Originally-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-6-kan.liang@linux.intel.com
      [ Renamed GPRS => GP. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c22497f5
    • K
      perf/x86/intel/ds: Extract code of event update in short period · 477f00f9
      Kan Liang committed
      drain_pebs() could be called twice in a short period for an auto-reload
      event in pmu::read(). intel_pmu_save_and_restart_reload() should be
      called to update event->count.
      
      This case should also be handled on Icelake. Extract the code for
      later reuse.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-5-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      477f00f9
    • A
      perf/x86/intel: Extract memory code PEBS parser for reuse · 48f38aa4
      Andi Kleen committed
      Extract some code related to memory profiling from the PEBS record
      parser into separate functions. It can be reused by the upcoming
      adaptive PEBS parser. No functional changes.
      Rename intel_hsw_weight to intel_get_tsx_weight, and
      intel_hsw_transaction to intel_get_tsx_transaction, because the input is
      no longer in the HSW PEBS format.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-4-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      48f38aa4
    • K
      perf/x86: Support outputting XMM registers · 878068ea
      Kan Liang committed
      Starting from Icelake, XMM registers can be collected in the PEBS record.
      But the current code only outputs the pt_regs.
      
      Add a new struct x86_perf_regs for both pt_regs and xmm_regs. The
      xmm_regs will be used later to keep a pointer to PEBS record which has
      XMM information.
      
      XMM registers are 128 bits wide. To simplify the code, they are handled
      like two different registers, which means setting two bits in the
      register bitmap. This also allows sampling only the lower 64 bits of an
      XMM register.
      
      The index of XMM registers starts from 32. There are 16 XMM registers.
      So all reserved space for regs is used. Remove REG_RESERVED.
      
      Add PERF_REG_X86_XMM_MAX, which stands for the max number of all x86
      regs including both GPRs and XMM.
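
      The bitmap arithmetic implied here (16 registers, two bits each,
      starting at index 32, so the indices end at 64) can be sketched as
      follows. The macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_XMM0_INDEX 32u   /* first XMM bit in the register bitmap */
#define SKETCH_NUM_XMM    16u   /* XMM0 through XMM15 */

/*
 * Bitmap mask selecting XMM register n. Each register occupies two
 * consecutive bits: the first for its lower 64 bits, the second for its
 * upper 64 bits. Returns 0 for an out-of-range register number.
 */
uint64_t sketch_xmm_mask(unsigned int n, bool low_half_only)
{
    unsigned int bit;
    uint64_t mask;

    if (n >= SKETCH_NUM_XMM)
        return 0;

    bit = SKETCH_XMM0_INDEX + 2 * n;
    mask = 1ULL << bit;
    if (!low_half_only)
        mask |= 1ULL << (bit + 1);
    return mask;
}
```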
      
      Add REG_NOSUPPORT for 32bit to exclude unsupported registers.
      
      Previous platforms cannot collect XMM information in the PEBS record.
      Add pebs_no_xmm_regs to indicate the unsupported platforms.
      
      The common code still validates the supported registers. However, it
      cannot check model specific registers, e.g. XMM. Add an extra check in
      x86_pmu_hw_config() to reject invalid config of regs_user and regs_intr.
      The regs_user never supports XMM collection.
      The regs_intr only supports XMM collection when sampling a PEBS event on
      Icelake and later platforms.
      Originally-by: Andi Kleen <ak@linux.intel.com>
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-3-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      878068ea
  16. 11 Feb, 2019 1 commit
    • A
      perf/x86/kvm: Avoid unnecessary work in guest filtering · 9b545c04
      Andi Kleen committed
      KVM added a workaround for PEBS events leaking into guests with
      commit:
      
        26a4f3c0 ("perf/x86: disable PEBS on a guest entry.")
      
      This uses the VT entry/exit list to add an extra disable of the
      PEBS_ENABLE MSR.
      
      Intel also added a fix for this issue to microcode updates on
      Haswell/Broadwell/Skylake.
      
      It turns out using the MSR entry/exit list makes VM exits
      significantly slower. The list is only needed for disabling
      PEBS, because the GLOBAL_CTRL change gets optimized by
      KVM into changing the VMCS.
      
      Check for the microcode updates that have the microcode
      fix for leaking PEBS, and disable the extra entry/exit list
      entry for PEBS_ENABLE. In addition we always clear the
      GLOBAL_CTRL for the PEBS counter while running in the guest,
      which is enough to make them never fire at the wrong
      side of the host/guest transition.
      
      The overhead for VM exits with the filtering active with the patch is
      reduced from 8% to 4%.
      
      The microcode patch has already been merged into future platforms.
      This patch is a one-off, so a quirk is used here.
      
      For other old platforms which have neither the microcode patch nor the
      quirk, the extra disable of the PEBS_ENABLE MSR is still required.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: bp@alien8.de
      Link: https://lkml.kernel.org/r/1549319013-4522-2-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9b545c04
  17. 03 Dec, 2018 1 commit
    • I
      x86: Fix various typos in comments · a97673a1
      Ingo Molnar committed
      Go over arch/x86/ and fix common typos in comments,
      and a typo in an actual function argument name.
      
      No change in functionality intended.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a97673a1
  18. 25 Jul, 2018 4 commits
  19. 15 Jul, 2018 1 commit
    • H
      x86/events/intel/ds: Fix bts_interrupt_threshold alignment · 2c991e40
      Hugh Dickins committed
      Markus reported that BTS is sporadically missing the tail of the trace
      in the perf_event data buffer: [decode error (1): instruction overflow]
      shown in GDB; and bisected it to the conversion of debug_store to PTI.
      
      A little "optimization" crept into alloc_bts_buffer(), which mistakenly
      placed bts_interrupt_threshold away from the 24-byte record boundary.
      Intel SDM Vol 3B 17.4.9 says "This address must point to an offset from
      the BTS buffer base that is a multiple of the BTS record size."
      
      Revert "max" from a byte count to a record count, to calculate the
      bts_interrupt_threshold correctly, which turns out to fix the problem seen.
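
      The difference between the two ways of counting can be illustrated as
      follows. This is not the kernel code: the function names and the choice
      to keep 1/16 of the buffer free after the threshold are assumptions;
      only the 24-byte record size comes from the text above.

```c
#include <stdint.h>

#define SKETCH_BTS_RECORD_SIZE 24u

/* Threshold offset computed in records, then converted to bytes. */
uint32_t sketch_threshold_by_records(uint32_t buffer_size)
{
    uint32_t max = buffer_size / SKETCH_BTS_RECORD_SIZE;   /* whole records */
    uint32_t reserve = max / 16;                           /* records kept free */

    return (max - reserve) * SKETCH_BTS_RECORD_SIZE;       /* always aligned */
}

/* Same idea computed in bytes: the result may fall inside a record. */
uint32_t sketch_threshold_by_bytes(uint32_t buffer_size)
{
    uint32_t max = buffer_size - buffer_size % SKETCH_BTS_RECORD_SIZE;

    return max - max / 16;
}
```

      For a 65536-byte buffer the record-based variant yields offset 61440,
      a multiple of 24, while the byte-based variant yields 61425, which is
      not on a record boundary.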
      
      Fixes: c1961a46 ("x86/events/intel/ds: Map debug buffers in cpu_entry_area")
      Reported-and-tested-by: Markus T Metzger <markus.t.metzger@intel.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: stable@vger.kernel.org # v4.14+
      Link: https://lkml.kernel.org/r/alpine.LSU.2.11.1807141248290.1614@eggly.anvils
      2c991e40
  20. 05 Apr, 2018 1 commit
    • S
      perf/x86/intel: Move regs->flags EXACT bit init · d1e7e602
      Stephane Eranian committed
      This patch removes a redundant store on regs->flags introduced
      by commit:
      
        71eb9ee9 ("perf/x86/intel: Fix linear IP of PEBS real_ip on Haswell and later CPUs")
      
      We were clearing the PERF_EFLAGS_EXACT but it was overwritten by
      regs->flags = pebs->flags later on.
      
      The PERF_EFLAGS_EXACT is a software flag using bit 3 of regs->flags.
      X86 marks this bit as Reserved. To make sure this bit is zero before
      we do any IP processing, we clear it explicitly.
      
      Patch also removes the following assignment:
      
      	regs->flags = pebs->flags | (regs->flags & PERF_EFLAGS_VM);
      
      This is because there is no longer any regs->flags to preserve, since
      set_linear_ip() is not called until later.

      Signed-off-by: Stephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/1522909791-32498-1-git-send-email-eranian@google.com
      [ Improve capitalization, punctuation and clarity of comments. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d1e7e602
  21. 27 Mar, 2018 1 commit
    • S
      perf/x86/intel: Fix linear IP of PEBS real_ip on Haswell and later CPUs · 71eb9ee9
      Stephane Eranian committed
      This patch fixes a bug in how pebs->real_ip is handled in the PEBS
      handler. real_ip only exists on Haswell and later processors. It is
      actually the eventing IP, i.e., where the event occurred, as opposed
      to pebs->ip, which is the PEBS interrupt IP and is always off
      by one.
      
      The problem is that the real_ip just like the IP needs to be fixed up
      because PEBS does not record all the machine state registers, and
      in particular the code segment (cs). This is why we have the set_linear_ip()
      function. The problem was that set_linear_ip() was only used on the pebs->ip
      and not the pebs->real_ip.
      
      We have profiles which ran into invalid callstacks because of this.
      Here is an example:
      
       .....  0: ffffffffffffff80 recent entry, marker kernel v
       .....  1: 000000000040044d <= user address in kernel space!
       .....  2: fffffffffffffe00 marker enter user v
       .....  3: 000000000040044d
       .....  4: 00000000004004b6 oldest entry
      
      Debugging output in get_perf_callchain():
      
       [  857.769909] CALLCHAIN: CPU8 ip=40044d regs->cs=10 user_mode(regs)=0
      
      The problem is that the kernel entry in 1: points to a user level
      address. How can that be?
      
      The reason is that with PEBS sampling the instruction that caused the event
      to occur and the instruction where the CPU was when the interrupt was posted
      may be far apart. And sometime during that time window, the privilege level may
      change. This happens, for instance, when the PEBS sample is taken close to a
      kernel entry point. Here the PEBS eventing IP (real_ip) captured a user level
      instruction. But by the time the PMU interrupt fired, the processor had already
      entered kernel space. This is why the debug output shows a user address with
      user_mode() false.
      
      The problem comes from PEBS not recording the code segment (cs) register.
      The register is used in x86_64 to determine if executing in kernel vs user
      space. This is okay because the kernel has a software workaround called
      set_linear_ip(). But the issue in setup_pebs_sample_data() is that
      set_linear_ip() is never called on the real_ip value when it is available
      (Haswell and later) and precise_ip > 1.
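The fix-up described above can be sketched in user-space C. This is an illustration of the idea only, not the kernel's set_linear_ip(): the struct, the sketch_* names, and the selector values are assumptions made for this example, and the "top bit set means kernel address" test is a simplification for x86_64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative selector values; assumed for this sketch. */
#define SKETCH_KERNEL_CS 0x10u
#define SKETCH_USER_CS   0x33u

/* Minimal stand-in for the register state the sample carries. */
struct sketch_regs {
    uint64_t ip;
    uint16_t cs;
};

/* Assumption: on x86_64, kernel addresses have the top bit set. */
static bool sketch_is_kernel_ip(uint64_t ip)
{
    return (ip >> 63) != 0;
}

/*
 * PEBS does not record cs, so derive it from the address being stored.
 * Applying this to the eventing IP keeps ip and cs consistent.
 */
static void sketch_set_linear_ip(struct sketch_regs *regs, uint64_t ip)
{
    assert(regs != NULL);
    regs->cs = sketch_is_kernel_ip(ip) ? SKETCH_KERNEL_CS : SKETCH_USER_CS;
    regs->ip = ip;
}

/* User mode is indicated by a requested privilege level of 3 in cs. */
static bool sketch_user_mode(const struct sketch_regs *regs)
{
    return (regs->cs & 3u) == 3u;
}
```

With the address from the debug output, calling sketch_set_linear_ip(&regs, 0x40044d) on a regs whose cs was 0x10 leaves it with a user cs, so sketch_user_mode() returns true and the sample is no longer a user address paired with a kernel code segment.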
      
      This patch fixes this problem and eliminates the callchain discrepancy.
      
      The patch restructures the code around set_linear_ip() to minimize the number
      of times the IP has to be set.

      Signed-off-by: Stephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/1521788507-10231-1-git-send-email-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      71eb9ee9
  22. 20 Mar, 2018 1 commit
  23. 09 Mar, 2018 2 commits
    • K
      perf/x86/intel/ds: Introduce ->read() function for auto-reload events and... · 5bee2cc6
      Kan Liang committed
      perf/x86/intel/ds: Introduce ->read() function for auto-reload events and flush the PEBS buffer there
      
      There is no way to get exact auto-reload times and values which are needed
      for event updates unless we flush the PEBS buffer.
      
      Introduce intel_pmu_auto_reload_read() to drain the PEBS buffer for
      auto reload event. To prevent races with the hardware, we can only
      call drain_pebs() when the PMU is disabled.

      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Link: http://lkml.kernel.org/r/1518474035-21006-4-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5bee2cc6
    • K
      perf/x86/intel: Fix event update for auto-reload · d31fc13f
      Kan Liang committed
      There is a bug when reading event->count with large PEBS enabled.
      
      Here is an example:
      
        # ./read_count
        0x71f0
        0x122c0
        0x1000000001c54
        0x100000001257d
        0x200000000bdc5
      
      In fixed period mode, the auto-reload mechanism could be enabled for
      PEBS events, but the calculation of event->count does not take the
      auto-reload values into account.
      
      Anyone who reads event->count will get the wrong result, e.g. x86_pmu_read().
      
      This bug was introduced with the auto-reload mechanism enabled since
      commit:
      
        851559e3 ("perf/x86/intel: Use the PEBS auto reload mechanism when possible")
      
      Introduce intel_pmu_save_and_restart_reload() to calculate the
      event->count only for auto-reload.
      
      Since the counter increments a negative counter value and overflows on
      the sign switch, giving the interval:
      
              [-period, 0]
      
      the difference between two consecutive reads is:
      
       A) value2 - value1;
          when no overflows have happened in between,
       B) (0 - value1) + (value2 - (-period));
          when one overflow happened in between,
       C) (0 - value1) + (n - 1) * (period) + (value2 - (-period));
          when @n overflows happened in between.
      
      Here A) is the obvious difference, B) is the extension to the discrete
      interval, where the first term is to the top of the interval and the
      second term is from the bottom of the next interval and C) the extension
      to multiple intervals, where the middle term is the whole intervals
      covered.
      
      The equation for all cases is:
      
          value2 - value1 + n * period
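As a quick check of the algebra, the three cases and the combined equation can be compared in a small stand-alone sketch. The function names below are hypothetical and are not the kernel's intel_pmu_save_and_restart_reload(); counter values are modelled as signed integers in [-period, 0], as in the interval above.

```c
#include <assert.h>
#include <stdint.h>

/* Combined form: value2 - value1 + n * period. */
static int64_t reload_delta(int64_t value1, int64_t value2,
                            int64_t period, int64_t n)
{
    assert(period > 0 && n >= 0);
    return value2 - value1 + n * period;
}

/* Case-by-case form, following A), B) and C) from the message. */
static int64_t reload_delta_by_case(int64_t value1, int64_t value2,
                                    int64_t period, int64_t n)
{
    assert(period > 0 && n >= 0);
    if (n == 0)
        return value2 - value1;              /* A) no overflow in between */
    return (0 - value1)                      /* up to the top of the interval */
         + (n - 1) * period                  /* whole intervals covered */
         + (value2 - (-period));             /* from the bottom of the last one */
}
```

For example, with period = 1000, value1 = -300, value2 = -900 and one overflow in between, both forms give 400: 300 counts to reach zero, then 100 counts up from -1000.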
      
      Previously the event->count is updated right before the sample output.
      But for case A, there is no PEBS record ready. It needs to be specially
      handled.
      
      Remove the auto-reload code from x86_perf_event_set_period() since
      we'll no longer call that function in this case.
      
      Based-on-code-from: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Fixes: 851559e3 ("perf/x86/intel: Use the PEBS auto reload mechanism when possible")
      Link: http://lkml.kernel.org/r/1518474035-21006-2-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d31fc13f
  24. 25 Jan, 2018 1 commit
    • P
      perf/x86: Fix perf,x86,cpuhp deadlock · efe951d3
      Peter Zijlstra committed
      More lockdep gifts, a 5-way lockup race:
      
      	perf_event_create_kernel_counter()
      	  perf_event_alloc()
      	    perf_try_init_event()
      	      x86_pmu_event_init()
      		__x86_pmu_event_init()
      		  x86_reserve_hardware()
       #0		    mutex_lock(&pmc_reserve_mutex);
      		    reserve_ds_buffer()
       #1		      get_online_cpus()
      
      	perf_event_release_kernel()
      	  _free_event()
      	    hw_perf_event_destroy()
      	      x86_release_hardware()
       #0		mutex_lock(&pmc_reserve_mutex)
      		release_ds_buffer()
       #1		  get_online_cpus()
      
       #1	do_cpu_up()
      	  perf_event_init_cpu()
       #2	    mutex_lock(&pmus_lock)
       #3	    mutex_lock(&ctx->mutex)
      
      	sys_perf_event_open()
      	  mutex_lock_double()
       #3	    mutex_lock(ctx->mutex)
       #4	    mutex_lock_nested(ctx->mutex, 1);
      
      	perf_try_init_event()
       #4	  mutex_lock_nested(ctx->mutex, 1)
      	  x86_pmu_event_init()
      	    intel_pmu_hw_config()
      	      x86_add_exclusive()
       #0		mutex_lock(&pmc_reserve_mutex)
      
      Fix it by using ordering constructs instead of locking.

      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      efe951d3
  25. 05 Jan, 2018 1 commit