1. 10 Nov, 2020 2 commits
  2. 10 Sep, 2020 1 commit
    • perf/x86/intel/ds: Fix x86_pmu_stop warning for large PEBS · 35d1ce6b
      Kan Liang committed
      The warning below may be triggered when sampling with large PEBS.
      
      [  410.411250] perf: interrupt took too long (72145 > 71975), lowering
      kernel.perf_event_max_sample_rate to 2000
      [  410.724923] ------------[ cut here ]------------
      [  410.729822] WARNING: CPU: 0 PID: 16397 at arch/x86/events/core.c:1422
      x86_pmu_stop+0x95/0xa0
      [  410.933811]  x86_pmu_del+0x50/0x150
      [  410.937304]  event_sched_out.isra.0+0xbc/0x210
      [  410.941751]  group_sched_out.part.0+0x53/0xd0
      [  410.946111]  ctx_sched_out+0x193/0x270
      [  410.949862]  __perf_event_task_sched_out+0x32c/0x890
      [  410.954827]  ? set_next_entity+0x98/0x2d0
      [  410.958841]  __schedule+0x592/0x9c0
      [  410.962332]  schedule+0x5f/0xd0
      [  410.965477]  exit_to_usermode_loop+0x73/0x120
      [  410.969837]  prepare_exit_to_usermode+0xcd/0xf0
      [  410.974369]  ret_from_intr+0x2a/0x3a
      [  410.977946] RIP: 0033:0x40123c
      [  411.079661] ---[ end trace bc83adaea7bb664a ]---
      
      In the non-overflow context, e.g., context switch, with large PEBS, perf
      may stop an event twice. An example is below.
      
        //max_samples_per_tick is adjusted to 2
        //NMI is triggered
        intel_pmu_handle_irq()
           handle_pmi_common()
             drain_pebs()
               __intel_pmu_pebs_event()
                 perf_event_overflow()
                   __perf_event_account_interrupt()
                     hwc->interrupts = 1
                     return 0
        //A context switch happens right after the NMI.
        //In the same tick, the perf_throttled_seq is not changed.
        perf_event_task_sched_out()
           perf_pmu_sched_task()
             intel_pmu_drain_pebs_buffer()
               __intel_pmu_pebs_event()
                 perf_event_overflow()
                   __perf_event_account_interrupt()
                     ++hwc->interrupts >= max_samples_per_tick
                     return 1
                 x86_pmu_stop();  # First stop
           perf_event_context_sched_out()
             task_ctx_sched_out()
               ctx_sched_out()
                 event_sched_out()
                   x86_pmu_del()
                     x86_pmu_stop();  # Second stop and trigger the warning
      
      Perf should only invoke the perf_event_overflow() in the overflow
      context.
      
      Currently, drain_pebs() is called from:
      - handle_pmi_common()			-- overflow context
      - intel_pmu_pebs_sched_task()		-- non-overflow context
      - intel_pmu_pebs_disable()		-- non-overflow context
      - intel_pmu_auto_reload_read()		-- possible overflow context
        With PERF_SAMPLE_READ + PERF_FORMAT_GROUP, the function may be
        invoked in the NMI handler. But, before calling the function, the
        PEBS buffer has already been drained. The __intel_pmu_pebs_event()
        will not be called in the possible overflow context.
      
      To fix the issue, an indicator is required to distinguish between the
      overflow context aka handle_pmi_common() and other cases.
      The dummy regs pointer can be used as the indicator.
      
      In the non-overflow context, perf should treat the last record the same
      as other PEBS records and should not invoke the generic overflow handler.
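
      The double stop described above can be replayed with a small model. This is
      only a sketch in Python, not kernel code: the Event class, account_interrupt(),
      drain_pebs() and the in_overflow_context flag are illustrative names, and
      throttling is reduced to the counter comparison shown in the trace.

```python
MAX_SAMPLES_PER_TICK = 2  # value from the example above


class Event:
    """Toy stand-in for a perf event; not the kernel's data structure."""

    def __init__(self):
        self.interrupts = 0   # models hwc->interrupts
        self.stopped = False
        self.warnings = 0     # counts stops of an already stopped event

    def stop(self):
        # Models x86_pmu_stop(): stopping twice is what triggers the warning.
        if self.stopped:
            self.warnings += 1
        self.stopped = True


def account_interrupt(event):
    """Return True when the event must be throttled (simplified)."""
    event.interrupts += 1
    return event.interrupts >= MAX_SAMPLES_PER_TICK


def drain_pebs(event, in_overflow_context, fixed):
    """Process the last PEBS record of a drain."""
    # With the fix, only the overflow (NMI) path runs the overflow handler.
    if fixed and not in_overflow_context:
        return
    if account_interrupt(event):
        event.stop()


def replay(fixed):
    """Replay NMI, then context switch, then x86_pmu_del(); return warnings."""
    event = Event()
    drain_pebs(event, in_overflow_context=True, fixed=fixed)   # NMI
    drain_pebs(event, in_overflow_context=False, fixed=fixed)  # context switch
    event.stop()                                               # x86_pmu_del()
    return event.warnings
```

      In this model, replay(fixed=False) stops the event twice, while
      replay(fixed=True) stops it only once.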
      
      Fixes: 21509084 ("perf/x86/intel: Handle multiple records in the PEBS buffer")
      Reported-by: Like Xu <like.xu@linux.intel.com>
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Like Xu <like.xu@linux.intel.com>
      Link: https://lkml.kernel.org/r/20200902210649.2743-1-kan.liang@linux.intel.com
  3. 08 Jul, 2020 1 commit
  4. 11 Feb, 2020 1 commit
    • perf/x86/intel: Fix inaccurate period in context switch for auto-reload · f861854e
      Kan Liang committed
      Perf doesn't take the left period into account when auto-reload is
      enabled with fixed period sampling mode in context switch.
      
      Below is the MSR trace of the following perf command.
      (The MSR trace is simplified from a ftrace log.)
      
          #perf record -e cycles:p -c 2000000 -- ./triad_loop
      
            //The MSR trace of task schedule out
            //perf disable all counters, disable PEBS, disable GP counter 0,
            //read GP counter 0, and re-enable all counters.
            //The counter 0 stops at 0xfffffff82840
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
            write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 0
            write_msr: MSR_P6_EVNTSEL0(186), value 40003003c
            rdpmc: 0, value fffffff82840
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff
      
            //The MSR trace of the same task schedule in again
            //perf disable all counters, enable and set GP counter 0,
            //enable PEBS, and re-enable all counters.
            //0xffffffe17b80 (-2000000) is written to GP counter 0.
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
            write_msr: MSR_IA32_PMC0(4c1), value ffffffe17b80
            write_msr: MSR_P6_EVNTSEL0(186), value 40043003c
            write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 1
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff
      
      When the same task is scheduled in again, the counter should start from
      the previous left period. However, it starts from the fixed period -2000000 again.
      
      A special variant of intel_pmu_save_and_restart() is used for
      auto-reload, which doesn't update the hwc->period_left.
      When the monitored task schedules in again, perf doesn't know the left
      period. The fixed period is used, which is inaccurate.
      
      With auto-reload, the counter always has a negative counter value. So
      the left period is -value. Update the period_left in
      intel_pmu_save_and_restart_reload().
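
      The arithmetic behind "the left period is -value" can be checked against the
      raw values in the traces. The sketch below is plain Python for illustration;
      left_period() is a hypothetical helper, and the 48-bit counter width is an
      assumption read off the 12-hex-digit values shown above.

```python
COUNTER_WIDTH = 48  # assumed counter width, matching the values in the trace


def left_period(raw_value, width=COUNTER_WIDTH):
    """Interpret a raw counter value as a negative number and return -value."""
    signed = raw_value - (1 << width)  # auto-reload keeps the counter negative
    return -signed


print(left_period(0xffffffe17b80))  # fixed period written on schedule in
print(left_period(0xfffffff82840))  # counter value read on schedule out
```

      Under that assumption, 0xffffffe17b80 is the full period of 2000000, while the
      counter that stopped at 0xfffffff82840 had only 513984 events left, which is
      the value that should have been restored.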
      
      With the patch:
      
            //The MSR trace of task schedule out
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
            write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 0
            write_msr: MSR_P6_EVNTSEL0(186), value 40003003c
            rdpmc: 0, value ffffffe25cbc
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff
      
            //The MSR trace of the same task schedule in again
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
            write_msr: MSR_IA32_PMC0(4c1), value ffffffe25cbc
            write_msr: MSR_P6_EVNTSEL0(186), value 40043003c
            write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 1
            write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff
      
      Fixes: d31fc13f ("perf/x86/intel: Fix event update for auto-reload")
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20200121190125.3389-1-kan.liang@linux.intel.com
  5. 10 Dec, 2019 1 commit
  6. 28 Aug, 2019 1 commit
  7. 25 Jul, 2019 1 commit
  8. 25 Jun, 2019 3 commits
  9. 21 May, 2019 1 commit
    • perf/x86/intel/ds: Fix EVENT vs. UEVENT PEBS constraints · 23e3983a
      Stephane Eranian committed
      This patch fixes a bug revealed by the following commit:
      
        6b89d4c1 ("perf/x86/intel: Fix INTEL_FLAGS_EVENT_CONSTRAINT* masking")
      
      That patch modified INTEL_FLAGS_EVENT_CONSTRAINT() to only look at the event code
      when matching a constraint. If code+umask were needed, then the
      INTEL_FLAGS_UEVENT_CONSTRAINT() macro was needed instead.
      This broke with some of the constraints for PEBS events.
      
      Several of them, including the one used for cycles:p, cycles:pp, cycles:ppp
      fell in that category and caused the event to be rejected in PEBS mode.
      In other words, on some platforms a cmdline such as:
      
        $ perf top -e cycles:pp
      
      would fail with -EINVAL.
      
      This patch fixes this bug by properly using INTEL_FLAGS_UEVENT_CONSTRAINT()
      when needed in the PEBS constraint tables.
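
      The effect of the two macros can be illustrated with a toy matcher. This is a
      sketch, assuming a constraint matches when the masked event config equals the
      constraint's code; matches(), the mask constants and the example values are
      illustrative and are not quoted from the kernel tables.

```python
EVENT_MASK = 0x00FF   # event code only (assumed layout: bits 0-7)
UEVENT_MASK = 0xFFFF  # event code + umask (assumed layout: bits 0-15)


def matches(config, code, cmask):
    """A constraint matches when the masked config equals the constraint code."""
    return (config & cmask) == code


# Illustrative PEBS constraint whose code carries both a umask (0x01) and an
# event code (0xc2); example values only.
CONSTRAINT_CODE = 0x01C2
CONFIG = 0x01C2

print(matches(CONFIG, CONSTRAINT_CODE, EVENT_MASK))   # event-code-only mask
print(matches(CONFIG, CONSTRAINT_CODE, UEVENT_MASK))  # code+umask mask
```

      With the event-code-only mask the umask bits are stripped from the config
      before the comparison, so a table entry that still encodes a umask can never
      match; the code+umask mask keeps both sides comparable.
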
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/20190521005246.423-1-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 29 Apr, 2019 1 commit
    • x86/paravirt: Standardize 'insn_buff' variable names · 1fc654cf
      Ingo Molnar committed
      We currently have 6 (!) separate naming variants to name temporary instruction
      buffers that are used for code patching:
      
       - insnbuf
       - insnbuff
       - insn_buff
       - insn_buffer
       - ibuf
       - ibuffer
      
      These are used as local variables, percpu fields and function parameters.
      
      Standardize all the names to a single variant: 'insn_buff'.
      
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 16 Apr, 2019 6 commits
    • perf/x86/intel: Add Icelake support · 60176089
      Kan Liang committed
      Add Icelake core PMU perf code, including constraint tables and the main
      enable code.
      
      Icelake expanded the generic counters to always 8 even with HT on, but a
      range of events cannot be scheduled on the extra 4 counters.
      Add new constraint ranges to describe this to the scheduler.
      The number of constraints that need to be checked is larger now than
      with earlier CPUs.
      At some point we may need a new data structure to look them up more
      efficiently than with linear search. So far it still seems to be
      acceptable however.
      
      Icelake added a new fixed counter SLOTS. Full support for it is added
      later in the patch series.
      
      The cache events table is identical to Skylake.
      
      Compared to the PEBS instruction event on a generic counter, fixed counter 0
      has less skid. Always force instruction:ppp onto fixed counter 0.
      Originally-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-9-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Support constraint ranges · 63b79f6e
      Peter Zijlstra committed
      Icelake extended the general counters to 8, even when SMT is enabled.
      However only a (large) subset of the events can be used on all 8
      counters.
      
      The events that can or cannot be used on all counters are organized
      in ranges.
      
      A lot of scheduler constraints are required to handle all this.
      
      To avoid blowing up the tables add event code ranges to the constraint
      tables, and a new inline function to match them.
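
      A range entry can be matched with a single comparison per table entry. The
      sketch below is illustrative Python, not the kernel's inline function;
      in_range(), its parameters and the example range are hypothetical.

```python
def in_range(event_code, first, size, cmask=0xFF):
    """True when the masked event code lies in [first, first + size]."""
    return 0 <= (event_code & cmask) - first <= size


# One table entry covering event codes 0xa0..0xaf (hypothetical range)
# replaces sixteen single-code entries.
print(in_range(0x01A5, first=0xA0, size=0x0F))
print(in_range(0x00B0, first=0xA0, size=0x0F))
```
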
      Originally-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> # developer hat on
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> # maintainer hat on
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-8-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Support adaptive PEBS v4 · c22497f5
      Kan Liang committed
      Adaptive PEBS is a new way to report PEBS sampling information. Instead
      of a fixed size record for all PEBS events, it allows configuring the
      PEBS record to include only the information needed. Events can then opt
      in to use such an extended record, or stay with a basic record which
      only contains the IP.
      
      The major new feature is to support LBRs in PEBS record.
      Besides normal LBR, this allows (much faster) large PEBS, while still
      supporting callstacks through callstack LBR. So essentially a lot of
      profiling can now be done without frequent interrupts, dropping the
      overhead significantly.
      
      The main requirement still is to use a period, and not use frequency
      mode, because frequency mode requires reevaluating the frequency on each
      overflow.
      
      The floating point state (XMM) is also supported, which allows efficient
      profiling of FP function arguments.
      
      Introduce specific drain function to handle variable length records.
      Use a new callback to parse the new record format, and also handle the
      STATUS field now being at a different offset.
      
      Add code to set up the configuration register. Since there is only a
      single register, all events either get the full superset of the fields
      requested by all events, or only the basic record.
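
      The "single register" consequence can be sketched as a bitwise OR over what
      each opted-in event requests. This is illustrative Python only: the group
      names and bit positions are placeholders, and shared_config() is a
      hypothetical helper, not a kernel function.

```python
from functools import reduce

# Placeholder bit assignments for the optional record groups.
MEMINFO, GP, XMMS, LBRS = 1 << 0, 1 << 1, 1 << 2, 1 << 3


def shared_config(requests):
    """OR together what every opted-in event asks for."""
    return reduce(lambda acc, bits: acc | bits, requests, 0)


# One event wants general purpose registers, another wants LBRs: both end up
# with records that contain both groups.
print(bin(shared_config([GP, LBRS])))
```
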
      Originally-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-6-kan.liang@linux.intel.com
      [ Renamed GPRS => GP. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel/ds: Extract code of event update in short period · 477f00f9
      Kan Liang committed
      The drain_pebs() could be called twice in a short period for auto-reload
      event in pmu::read(). The intel_pmu_save_and_restart_reload() should be
      called to update the event->count.
      
      This case should also be handled on Icelake. Extract the code for
      later reuse.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-5-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Extract memory code PEBS parser for reuse · 48f38aa4
      Andi Kleen committed
      Extract some code related to memory profiling from the PEBS record
      parser into separate functions. It can be reused by the upcoming
      adaptive PEBS parser. No functional changes.
      Rename intel_hsw_weight to intel_get_tsx_weight, and
      intel_hsw_transaction to intel_get_tsx_transaction, because the input is
      no longer the HSW PEBS format.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-4-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Support outputting XMM registers · 878068ea
      Kan Liang committed
      Starting from Icelake, XMM registers can be collected in the PEBS record.
      But the current code only outputs the pt_regs.
      
      Add a new struct x86_perf_regs for both pt_regs and xmm_regs. The
      xmm_regs will be used later to keep a pointer to PEBS record which has
      XMM information.
      
      XMM registers are 128 bits wide. To simplify the code, they are handled like
      two different registers, which means setting two bits in the register
      bitmap. This also allows sampling only the lower 64 bits of an XMM register.
      
      The index of XMM registers starts from 32. There are 16 XMM registers.
      So all reserved space for regs are used. Remove REG_RESERVED.
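
      The bitmap layout described above can be written out as a short sketch. It is
      illustrative Python: xmm_mask() and all_xmm_mask() are hypothetical helpers,
      and mapping the lower 64 bits to the first of the two bits is an assumption.

```python
XMM0_INDEX = 32   # first XMM index, from the text above
NUM_XMM = 16      # number of XMM registers, from the text above


def xmm_mask(reg, low_only=False):
    """Bitmap bits for XMM register `reg`; each register takes two bits."""
    low = 1 << (XMM0_INDEX + 2 * reg)
    return low if low_only else low | (low << 1)


def all_xmm_mask():
    """Bits for all XMM registers; the per-register masks do not overlap."""
    return sum(xmm_mask(r) for r in range(NUM_XMM))


print(hex(all_xmm_mask()))
```

      Sixteen registers at two bits each occupy indices 32 to 63, which is why no
      reserved space remains in a 64-bit register bitmap.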
      
      Add PERF_REG_X86_XMM_MAX, which stands for the max number of all x86
      regs including both GPRs and XMM.
      
      Add REG_NOSUPPORT for 32bit to exclude unsupported registers.
      
      Previous platforms cannot collect XMM information in the PEBS record.
      Add pebs_no_xmm_regs to mark the unsupported platforms.
      
      The common code still validates the supported registers. However, it
      cannot check model specific registers, e.g. XMM. Add extra check in
      x86_pmu_hw_config() to reject invalid config of regs_user and regs_intr.
      The regs_user never supports XMM collection.
      The regs_intr only supports XMM collection when sampling PEBS event on
      icelake and later platforms.
      Originally-by: Andi Kleen <ak@linux.intel.com>
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: https://lkml.kernel.org/r/20190402194509.2832-3-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 11 Feb, 2019 1 commit
    • perf/x86/kvm: Avoid unnecessary work in guest filtering · 9b545c04
      Andi Kleen committed
      KVM added a workaround for PEBS events leaking into guests with
      commit:
      
        26a4f3c0 ("perf/x86: disable PEBS on a guest entry.")
      
      This uses the VT entry/exit list to add an extra disable of the
      PEBS_ENABLE MSR.
      
      Intel also added a fix for this issue to microcode updates on
      Haswell/Broadwell/Skylake.
      
      It turns out using the MSR entry/exit list makes VM exits
      significantly slower. The list is only needed for disabling
      PEBS, because the GLOBAL_CTRL change gets optimized by
      KVM into changing the VMCS.
      
      Check for the microcode updates that have the microcode
      fix for leaking PEBS, and disable the extra entry/exit list
      entry for PEBS_ENABLE. In addition we always clear the
      GLOBAL_CTRL for the PEBS counter while running in the guest,
      which is enough to make them never fire at the wrong
      side of the host/guest transition.
      
      The overhead for VM exits with the filtering active with the patch is
      reduced from 8% to 4%.
      
      The microcode patch has already been merged into future platforms.
      This patch is a one-off, so a quirk is used here.
      
      For other old platforms that have neither the microcode patch nor the quirk,
      the extra disable of the PEBS_ENABLE MSR is still required.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: bp@alien8.de
      Link: https://lkml.kernel.org/r/1549319013-4522-2-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  13. 03 Dec, 2018 1 commit
    • x86: Fix various typos in comments · a97673a1
      Ingo Molnar committed
      Go over arch/x86/ and fix common typos in comments,
      and a typo in an actual function argument name.
      
      No change in functionality intended.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  14. 25 Jul, 2018 4 commits
  15. 15 Jul, 2018 1 commit
    • x86/events/intel/ds: Fix bts_interrupt_threshold alignment · 2c991e40
      Hugh Dickins committed
      Markus reported that BTS is sporadically missing the tail of the trace
      in the perf_event data buffer: [decode error (1): instruction overflow]
      shown in GDB; and bisected it to the conversion of debug_store to PTI.
      
      A little "optimization" crept into alloc_bts_buffer(), which mistakenly
      placed bts_interrupt_threshold away from the 24-byte record boundary.
      Intel SDM Vol 3B 17.4.9 says "This address must point to an offset from
      the BTS buffer base that is a multiple of the BTS record size."
      
      Revert "max" from a byte count to a record count, to calculate the
      bts_interrupt_threshold correctly, which turns out to fix the problem seen.
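
      The alignment difference between the two ways of computing "max" can be shown
      numerically. This is a sketch under stated assumptions, not the kernel code:
      the 64 KiB buffer size and the placement of the threshold one sixteenth below
      the end of the buffer are assumptions for illustration, and both function
      names are hypothetical.

```python
BTS_RECORD_SIZE = 24         # bytes, from the text above
BTS_BUFFER_SIZE = 64 * 1024  # assumed buffer size, for illustration only


def threshold_byte_based():
    """'max' kept as a byte count: the 1/16 margin is not record-aligned."""
    max_bytes = BTS_BUFFER_SIZE - BTS_BUFFER_SIZE % BTS_RECORD_SIZE
    return max_bytes - max_bytes // 16


def threshold_record_based():
    """'max' kept as a record count: the margin is a whole number of records."""
    max_records = BTS_BUFFER_SIZE // BTS_RECORD_SIZE
    return (max_records - max_records // 16) * BTS_RECORD_SIZE


print(threshold_byte_based() % BTS_RECORD_SIZE)    # nonzero: misaligned
print(threshold_record_based() % BTS_RECORD_SIZE)  # zero: on a record boundary
```

      Both functions return an offset from the buffer base; only the record-based
      one is guaranteed to be a multiple of the record size.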
      
      Fixes: c1961a46 ("x86/events/intel/ds: Map debug buffers in cpu_entry_area")
      Reported-and-tested-by: Markus T Metzger <markus.t.metzger@intel.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: stable@vger.kernel.org # v4.14+
      Link: https://lkml.kernel.org/r/alpine.LSU.2.11.1807141248290.1614@eggly.anvils
  16. 05 Apr, 2018 1 commit
    • perf/x86/intel: Move regs->flags EXACT bit init · d1e7e602
      Stephane Eranian committed
      This patch removes a redundant store on regs->flags introduced
      by commit:
      
        71eb9ee9 ("perf/x86/intel: Fix linear IP of PEBS real_ip on Haswell and later CPUs")
      
      We were clearing the PERF_EFLAGS_EXACT but it was overwritten by
      regs->flags = pebs->flags later on.
      
      The PERF_EFLAGS_EXACT is a software flag using bit 3 of regs->flags.
      X86 marks this bit as Reserved. To make sure this bit is zero before
      we do any IP processing, we clear it explicitly.
      
      Patch also removes the following assignment:
      
      	regs->flags = pebs->flags | (regs->flags & PERF_EFLAGS_VM);
      
      There is no regs->flags left to preserve, because set_linear_ip() is
      not called until later.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/1522909791-32498-1-git-send-email-eranian@google.com
      [ Improve capitalization, punctuation and clarity of comments. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  17. 27 Mar, 2018 1 commit
    • perf/x86/intel: Fix linear IP of PEBS real_ip on Haswell and later CPUs · 71eb9ee9
      Stephane Eranian committed
      This patch fixes a bug in how pebs->real_ip is handled in the PEBS
      handler. real_ip only exists on Haswell and later processors. It is
      actually the eventing IP, i.e., where the event occurred, as opposed
      to pebs->ip, which is the PEBS interrupt IP and is always off
      by one.
      
      The problem is that the real_ip, just like the IP, needs to be fixed up
      because PEBS does not record all the machine state registers, and
      in particular the code segment (cs). This is why we have the set_linear_ip()
      function. The problem was that set_linear_ip() was only used on the pebs->ip
      and not the pebs->real_ip.
      
      We have profiles which ran into invalid callstacks because of this.
      Here is an example:
      
       .....  0: ffffffffffffff80 recent entry, marker kernel v
       .....  1: 000000000040044d <= user address in kernel space!
       .....  2: fffffffffffffe00 marker enter user v
       .....  3: 000000000040044d
       .....  4: 00000000004004b6 oldest entry
      
      Debugging output in get_perf_callchain():
      
       [  857.769909] CALLCHAIN: CPU8 ip=40044d regs->cs=10 user_mode(regs)=0
      
      The problem is that the kernel entry in 1: points to a user level
      address. How can that be?
      
      The reason is that with PEBS sampling the instruction that caused the event
      to occur and the instruction where the CPU was when the interrupt was posted
      may be far apart. And sometime during that time window, the privilege level may
      change. This happens, for instance, when the PEBS sample is taken close to a
      kernel entry point. Here the PEBS eventing IP (real_ip) captured a user level
      instruction. But by the time the PMU interrupt fired, the processor had already
      entered kernel space. This is why the debug output shows a user address with
      user_mode() false.
      
      The problem comes from PEBS not recording the code segment (cs) register.
      The register is used in x86_64 to determine if executing in kernel vs user
      space. This is okay because the kernel has a software workaround called
      set_linear_ip(). But the issue in setup_pebs_sample_data() is that
      set_linear_ip() is never called on the real_ip value when it is available
      (Haswell and later) and precise_ip > 1.
      
      This patch fixes this problem and eliminates the callchain discrepancy.
      
      The patch restructures the code around set_linear_ip() to minimize the number
      of times the IP has to be set.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/1521788507-10231-1-git-send-email-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  18. 20 Mar, 2018 1 commit
  19. 09 Mar, 2018 2 commits
    • perf/x86/intel/ds: Introduce ->read() function for auto-reload events and flush the PEBS buffer there · 5bee2cc6
      Kan Liang committed
      There is no way to get exact auto-reload times and values which are needed
      for event updates unless we flush the PEBS buffer.
      
      Introduce intel_pmu_auto_reload_read() to drain the PEBS buffer for
      auto reload event. To prevent races with the hardware, we can only
      call drain_pebs() when the PMU is disabled.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Link: http://lkml.kernel.org/r/1518474035-21006-4-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Fix event update for auto-reload · d31fc13f
      Kan Liang committed
      There is a bug when reading event->count with large PEBS enabled.
      
      Here is an example:
      
        # ./read_count
        0x71f0
        0x122c0
        0x1000000001c54
        0x100000001257d
        0x200000000bdc5
      
      In fixed period mode, the auto-reload mechanism could be enabled for
      PEBS events, but the calculation of event->count does not take the
      auto-reload values into account.
      
      Anyone who reads event->count will get the wrong result, e.g x86_pmu_read().
      
      This bug was introduced with the auto-reload mechanism enabled since
      commit:
      
        851559e3 ("perf/x86/intel: Use the PEBS auto reload mechanism when possible")
      
      Introduce intel_pmu_save_and_restart_reload() to calculate the
      event->count only for auto-reload.
      
      The counter increments from a negative value and overflows on
      the sign switch, which gives the interval:
      
              [-period, 0]
      
      so the difference between two consecutive reads is:
      
       A) value2 - value1;
          when no overflows have happened in between,
       B) (0 - value1) + (value2 - (-period));
          when one overflow happened in between,
       C) (0 - value1) + (n - 1) * (period) + (value2 - (-period));
          when @n overflows happened in between.
      
      Here A) is the obvious difference, B) is the extension to the discrete
      interval, where the first term is to the top of the interval and the
      second term is from the bottom of the next interval and C) the extension
      to multiple intervals, where the middle term is the whole intervals
      covered.
      
      The equation for all cases is:
      
          value2 - value1 + n * period
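
As a minimal sketch of the arithmetic above (the function names are hypothetical and not taken from the kernel source), the unified equation and the term-by-term form of case C can be written side by side. For n >= 1 both return the same delta:

```c
#include <stdint.h>

/*
 * Hypothetical helper: counts accumulated between two reads of an
 * auto-reload counter that runs over the interval [-period, 0].
 * n is the number of overflows that happened between the reads.
 */
int64_t auto_reload_delta(int64_t value1, int64_t value2,
                          int64_t period, int64_t n)
{
    return value2 - value1 + n * period;
}

/* Case C written out term by term, for comparison (valid for n >= 1). */
int64_t auto_reload_delta_expanded(int64_t value1, int64_t value2,
                                   int64_t period, int64_t n)
{
    return (0 - value1)             /* up to the top of the first interval */
         + (n - 1) * period         /* whole intervals covered */
         + (value2 - (-period));    /* from the bottom of the last interval */
}
```

With n = 0 the first function reduces to case A, value2 - value1.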
      
      Previously the event->count was updated right before the sample output.
      But for case A, there is no PEBS record ready. It needs to be specially
      handled.
      
      Remove the auto-reload code from x86_perf_event_set_period() since
      we'll no longer call that function in this case.
      
      Based-on-code-from: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Fixes: 851559e3 ("perf/x86/intel: Use the PEBS auto reload mechanism when possible")
      Link: http://lkml.kernel.org/r/1518474035-21006-2-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  20. 25 Jan, 2018 1 commit
    • P
      perf/x86: Fix perf,x86,cpuhp deadlock · efe951d3
      Peter Zijlstra committed
      More lockdep gifts, a 5-way lockup race:
      
      	perf_event_create_kernel_counter()
      	  perf_event_alloc()
      	    perf_try_init_event()
      	      x86_pmu_event_init()
      		__x86_pmu_event_init()
      		  x86_reserve_hardware()
       #0		    mutex_lock(&pmc_reserve_mutex);
      		    reserve_ds_buffer()
       #1		      get_online_cpus()
      
      	perf_event_release_kernel()
      	  _free_event()
      	    hw_perf_event_destroy()
      	      x86_release_hardware()
       #0		mutex_lock(&pmc_reserve_mutex)
      		release_ds_buffer()
       #1		  get_online_cpus()
      
       #1	do_cpu_up()
      	  perf_event_init_cpu()
       #2	    mutex_lock(&pmus_lock)
       #3	    mutex_lock(&ctx->mutex)
      
      	sys_perf_event_open()
      	  mutex_lock_double()
       #3	    mutex_lock(ctx->mutex)
       #4	    mutex_lock_nested(ctx->mutex, 1);
      
      	perf_try_init_event()
       #4	  mutex_lock_nested(ctx->mutex, 1)
      	  x86_pmu_event_init()
      	    intel_pmu_hw_config()
      	      x86_add_exclusive()
       #0		mutex_lock(&pmc_reserve_mutex)
      
      Fix it by using ordering constructs instead of locking.

      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  21. 05 Jan, 2018 1 commit
  22. 24 Dec, 2017 2 commits
    • H
      x86/events/intel/ds: Map debug buffers in cpu_entry_area · c1961a46
      Hugh Dickins committed
      The BTS and PEBS buffers both have their virtual addresses programmed into
      the hardware.  This means that any access to them is performed via the page
      tables.  The times that the hardware accesses these are entirely dependent
      on how the performance monitoring hardware events are set up.  In other
      words, there is no way for the kernel to tell when the hardware might
      access these buffers.
      
      To avoid perf crashes, allocate pages for the 'debug_store' buffers and
      map them into the cpu_entry_area.
      
      The PEBS fixup buffer does not need this treatment.
      
      [ tglx: Got rid of the kaiser_add_mapping() complication ]

      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: keescook@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • T
      x86/cpu_entry_area: Add debugstore entries to cpu_entry_area · 10043e02
      Thomas Gleixner committed
      The Intel PEBS/BTS debug store is a design trainwreck as it expects virtual
      addresses which must be visible in any execution context.
      
      So it is required to make these mappings visible to user space when kernel
      page table isolation is active.
      
      Provide enough room for the buffer mappings in the cpu_entry_area so the
      buffers are available in the user space visible page tables.
      
      At the point where the kernel side entry area is populated there is no
      buffer available yet, but the kernel PMD must be populated. To achieve this,
      set the entries for these buffers to non-present.

      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  23. 02 Nov, 2017 1 commit
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman committed
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information.
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier should be applied
      to a file was done in a spreadsheet of side-by-side results from the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few thousand files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      The criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  The results of that were:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were any new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.

      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  24. 29 Aug, 2017 1 commit
    • K
      perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR · fc7ce9c7
      Kan Liang committed
      For understanding how the workload maps to memory channels and hardware
      behavior, it's very important to collect address maps with physical
      addresses. For example, 3D XPoint access can only be found by filtering
      the physical address.
      
      Add a new sample type for physical address.
      
      perf already has a facility to collect the data virtual address. This patch
      introduces a function to convert the virtual address to a physical address.
      The function is quite generic and can be extended to any architecture as
      long as a virtual address is provided.
      
       - For kernel direct mapping addresses, virt_to_phys is used to convert
         the virtual addresses to physical addresses.
      
       - For user virtual addresses, __get_user_pages_fast is used to walk the
         page tables for the user physical address.
      
       - This does not work for vmalloc addresses right now. These are not
         resolved, but code to do that could be added.
      
      The new sample type requires collecting the virtual address. The
      virtual address will not be output unless SAMPLE_ADDR is applied.
      
      For security, the physical address can only be exposed to root or
      privileged user.
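
From user space, requesting the new sample type might look like the sketch below. The helper name is hypothetical; it assumes a UAPI header recent enough to define PERF_SAMPLE_PHYS_ADDR, and it leaves event selection (type and config) to the caller:

```c
#include <linux/perf_event.h>
#include <string.h>

/*
 * Hypothetical helper: prepare an attr that asks for both the data
 * virtual address and its physical translation in every sample.
 * The caller still has to pick an event before calling perf_event_open().
 */
void request_phys_addr(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    /* PERF_SAMPLE_ADDR is set too so the virtual address is also output. */
    attr->sample_type = PERF_SAMPLE_ADDR | PERF_SAMPLE_PHYS_ADDR;
}
```

Because of the security restriction described above, opening such an event as an unprivileged user should be expected to fail.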

      Tested-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: mpe@ellerman.id.au
      Link: http://lkml.kernel.org/r/1503967969-48278-1-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  25. 25 Aug, 2017 2 commits
    • A
      perf/x86: Fix data source decoding for Skylake · 6ae5fa61
      Andi Kleen committed
      Skylake changed the encoding of the PEBS data source field.
      Some combinations are not available anymore, but some new cases
      e.g. for L4 cache hit are added.
      
      Fix up the conversion table for Skylake, similar to what was done
      for Nehalem.
      
      On Skylake server the encoding for L4 actually means persistent
      memory. Handle this case too.
      
      To properly describe it in the abstracted perf format I had to add
      some new fields. Since a hit can have only one level, add a new
      field that is an enumeration, not a bit field, to describe
      the level. It can describe any level. Some numbers are also
      used to describe PMEM and LFB.
      
      Also add a new generic remote flag that can be combined with
      the generic level to signify a remote cache.
      
      And there is an extension field for the snoop indication to handle
      the Forward state.
      
      I didn't add a generic flag for hops because it's not needed
      for Skylake.
      
      I changed the existing encodings for older CPUs to also fill in the
      new level and remote fields.
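
A minimal sketch of how the enumerated level and the remote flag combine, assuming the UAPI header exposes them as the mem_lvl_num and mem_remote members of union perf_mem_data_src with the PERF_MEM_LVLNUM_* and PERF_MEM_REMOTE_REMOTE constants (the helper name is hypothetical):

```c
#include <linux/perf_event.h>

/* Hypothetical helper: describe a hit in a remote L3 cache. */
union perf_mem_data_src remote_l3_hit(void)
{
    union perf_mem_data_src src = { .val = 0 };

    /* A hit has exactly one level, so the level is a number, not a bit mask. */
    src.mem_lvl_num = PERF_MEM_LVLNUM_L3;
    /* The generic remote flag is combined with the level. */
    src.mem_remote = PERF_MEM_REMOTE_REMOTE;
    return src;
}
```

Under the same assumption, persistent memory would be described by storing PERF_MEM_LVLNUM_PMEM in the level field instead.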

      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: http://lkml.kernel.org/r/20170816222156.19953-3-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • A
      perf/x86: Move Nehalem PEBS code to flag · 95298355
      Andi Kleen committed
      Minor cleanup: use an explicit x86_pmu flag to handle the
      missing Lock / TLB information on Nehalem, instead of always
      checking the model number for each PEBS sample.

      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: http://lkml.kernel.org/r/20170816222156.19953-2-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  26. 21 Jul, 2017 1 commit
    • J
      perf/x86/intel: Add proper condition to run sched_task callbacks · df6c3db8
      Jiri Olsa committed
      We have 2 functions using the same sched_task callback:
      
        - PEBS drain for free running counters
        - LBR save/restore
      
      Both of them are called from intel_pmu_sched_task() and
      either of them can be unintentionally triggered when the
      other one is configured to run.
      
      Let's say there's PEBS drain configured in sched_task
      callback for the event, but in the callback itself
      (intel_pmu_sched_task()) we will also run the code for
      LBR save/restore, which we did not ask for, but the
      code in intel_pmu_sched_task() does not check for that.
      
      This can lead to extra cycles in some perf monitoring,
      like when we monitor a PEBS event without LBR data.
      
        # perf record --no-timestamp -c 10000 -e cycles:p ./perf bench sched pipe -l 1000000
      
        (We need PEBS, non freq/non timestamp event to enable
         the sched_task callback)
      
      The perf stat of cycles and msr:write_msr for above
      command before the change:
        ...
        Performance counter stats for './perf record --no-timestamp -c 10000 -e cycles:p \
                                       ./perf bench sched pipe -l 1000000' (5 runs):
      
          18,519,557,441      cycles:k
              91,195,527      msr:write_msr
      
            29.334476406 seconds time elapsed
      
      And after the change:
        ...
        Performance counter stats for './perf record --no-timestamp -c 10000 -e cycles:p \
                                       ./perf bench sched pipe -l 1000000' (5 runs):
      
          18,704,973,540      cycles:k
              27,184,720      msr:write_msr
      
            16.977875900 seconds time elapsed
      
      There's no effect on cycles:k because the sched_task happens
      with events switched off, however the msr:write_msr tracepoint
      counter together with the almost 50% speedup in elapsed time show the
      improvement.
      
      Monitoring an LBR event and having the extra PEBS drain processing
      in the sched_task callback showed just a little speedup, because
      the drain function does not do much extra work in case there
      is no PEBS data.
      
      Add conditions to recognize the configured work that needs
      to be done in the x86_pmu's sched_task callback.
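
The idea can be modelled in a few lines of plain C. This is only a user-space sketch with hypothetical names and counters standing in for the real work, not the kernel code:

```c
#include <stdbool.h>

/* Hypothetical per-CPU state: which work was actually configured. */
struct sched_cb_state {
    bool pebs_drain_wanted;  /* an event asked for a PEBS drain */
    bool lbr_wanted;         /* an event asked for LBR save/restore */
    int pebs_drains;         /* counters standing in for the real work */
    int lbr_switches;
};

/* Shared callback: run each piece of work only if it was configured. */
void sched_task_cb(struct sched_cb_state *s)
{
    if (s->pebs_drain_wanted)
        s->pebs_drains++;    /* stands in for draining the PEBS buffer */
    if (s->lbr_wanted)
        s->lbr_switches++;   /* stands in for LBR save/restore */
}
```

With only the PEBS drain configured, the LBR branch is skipped, which corresponds to the avoided MSR writes in the numbers above.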

      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Link: http://lkml.kernel.org/r/20170719075247.GA27506@krava
      Signed-off-by: Ingo Molnar <mingo@kernel.org>