1. 07 1月, 2016 6 次提交
    • N
      perf tools: Add dynamic sort key for tracepoint events · c7c2a5e4
      Namhyung Kim 提交于
      The existing sort keys are less useful for tracepoint events in that
      they are always sampled at the same place, the function where the
      tracepoint is located.
      
      For example, a 'perf report' on sched:sched_switch event looks like the
      following:
      
        # Overhead  Command          Shared Object     Symbol
        # ........  ...............  ................  ..............
        #
            47.22%  swapper          [kernel.vmlinux]  [k] __schedule
            21.67%  transmission-gt  [kernel.vmlinux]  [k] __schedule
             8.23%  netctl-auto      [kernel.vmlinux]  [k] __schedule
             5.53%  kworker/0:1H     [kernel.vmlinux]  [k] __schedule
             1.98%  Xephyr           [kernel.vmlinux]  [k] __schedule
             1.33%  irq/33-iwlwifi   [kernel.vmlinux]  [k] __schedule
             1.17%  wpa_cli          [kernel.vmlinux]  [k] __schedule
             1.13%  rcu_preempt      [kernel.vmlinux]  [k] __schedule
             0.85%  ksoftirqd/0      [kernel.vmlinux]  [k] __schedule
             0.77%  Timer            [kernel.vmlinux]  [k] __schedule
      
      In fact, tracepoints have meaningful information in their fields but
      there's no way to use in 'perf report' currently.  The dynamic sort keys
      are introduced in this patc to overcome this limitation.
      
      The sched:sched_switch events have following fields:
      
        # sudo cat /sys/kernel/debug/tracing/events/sched/sched_switch/format
        name: sched_switch
        ID: 268
        format:
      	field:unsigned short common_type;         offset:0; size:2; signed:0;
      	field:unsigned char common_flags;         offset:2; size:1; signed:0;
      	field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
      	field:int common_pid;                     offset:4; size:4; signed:1;
      
      	field:char prev_comm[16]; offset:8;  size:16; signed:1;
      	field:pid_t prev_pid;     offset:24; size:4;  signed:1;
      	field:int prev_prio;      offset:28; size:4;  signed:1;
      	field:long prev_state;    offset:32; size:8;  signed:1;
      	field:char next_comm[16]; offset:40; size:16; signed:1;
      	field:pid_t next_pid;     offset:56; size:4;  signed:1;
      	field:int next_prio;      offset:60; size:4;  signed:1;
      
        print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==>
                    next_comm=%s next_pid=%d next_prio=%d",
          REC->prev_comm, REC->prev_pid, REC->prev_prio,
          REC->prev_state & (2048-1) ? __print_flags(REC->prev_state & (2048-1),
          "|", { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" },
          { 64, "x" }, { 128, "K"}, { 256, "W" }, { 512, "P" }, { 1024, "N" }) : "R",
          REC->prev_state & 2048 ? "+" : "", REC->next_comm, REC->next_pid, REC->next_prio
      
      With dynamic sort keys, you can use <event.field> as a sort key.  Those
      dynamic keys are checked and created on demand.  For instance, below is
      to sort by next_pid field output on the same data file:
      
        $ perf report -s comm,sched:sched_switch.next_pid --stdio
        ...
        # Overhead  Command            next_pid
        # ........  ...............  ..........
        #
            21.23%  transmission-gt           0
            20.86%  swapper               17773
             6.62%  netctl-auto               0
             5.25%  swapper                 109
             5.21%  kworker/0:1H              0
             1.98%  Xephyr                    0
             1.98%  swapper                6524
             1.98%  swapper               27478
             1.37%  swapper               27476
             1.17%  swapper                 233
      
      Multiple dynamic sort keys are also supported:
      
        $ perf report -s comm,sched:sched_switch.next_pid,sched:sched_switch.next_comm --stdio
        ...
        # Overhead  Command            next_pid         next_comm
        # ........  ...............  ..........  ................
        #
            20.86%  swapper               17773   transmission-gt
             9.64%  transmission-gt           0         swapper/0
             9.16%  transmission-gt           0         swapper/2
             5.25%  swapper                 109      kworker/0:1H
             5.21%  kworker/0:1H              0         swapper/0
             2.14%  netctl-auto               0         swapper/2
             1.98%  netctl-auto               0         swapper/0
             1.98%  swapper                6524            Xephyr
             1.98%  swapper               27478       netctl-auto
             1.78%  transmission-gt           0         swapper/3
             1.53%  Xephyr                    0         swapper/0
             1.29%  netctl-auto               0         swapper/1
             1.29%  swapper               27476       netctl-auto
             1.21%  netctl-auto               0         swapper/3
             1.17%  swapper                 233    irq/33-iwlwifi
      
      Note that pid 0 exists for each cpu so have comm of 'swapper/N'.
      Signed-off-by: NNamhyung Kim <namhyung@kernel.org>
      Acked-by: NJiri Olsa <jolsa@kernel.org>
      Tested-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450804030-29193-6-git-send-email-namhyung@kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      c7c2a5e4
    • N
      perf tools: Pass evlist to setup_sorting() · 40184c46
      Namhyung Kim 提交于
      This is a preparation to support dynamic sort keys for tracepoint
      events.  Dynamic sort keys can be created for specific fields in trace
      events so it needs the event information.
      Signed-off-by: NNamhyung Kim <namhyung@kernel.org>
      Acked-by: NJiri Olsa <jolsa@kernel.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450804030-29193-5-git-send-email-namhyung@kernel.org
      [ Moving the evlist creation earlier in top was split to a previous patch ]
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      40184c46
    • N
      perf top: Create the evlist sooner · 54f8f403
      Namhyung Kim 提交于
      This is a preparation to support dynamic sort keys for tracepoint
      events.  Dynamic sort keys can be created for specific fields in trace
      events so it needs the event information, so we need to pass the evlist
      to the sort routines, create it sooner so that the next patch can do
      that.
      Signed-off-by: NNamhyung Kim <namhyung@kernel.org>
      Acked-by: NJiri Olsa <jolsa@kernel.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450804030-29193-5-git-send-email-namhyung@kernel.org
      [ Split from the patch passing the evlist to the sort routines ]
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      54f8f403
    • N
      tools lib traceevent: Factor out and export print_event_field[s]() · be45d40e
      Namhyung Kim 提交于
      The print_event_field() and print_event_fields() functions print basic
      information of a given field or event without the print format.  They'll
      be used by dynamic sort keys later.
      
      Committer note:
      
      Rename it to pevent_print_field[s]() to get proper namespacing, as
      discussed with Steven Rostedt.
      Signed-off-by: NNamhyung Kim <namhyung@kernel.org>
      Acked-by: NJiri Olsa <jolsa@kernel.org>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450876121-22494-1-git-send-email-namhyung@kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      be45d40e
    • N
      perf hist: Save raw_data/size for tracepoint events · 72392834
      Namhyung Kim 提交于
      The raw_data and raw_size fields are to provide tracepoint specific
      information.  They will be used by dynamic sort keys later.
      Signed-off-by: NNamhyung Kim <namhyung@kernel.org>
      Acked-by: NJiri Olsa <jolsa@kernel.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450923377-18641-1-git-send-email-namhyung@kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      72392834
    • N
      perf hist: Pass struct sample to __hists__add_entry() · fd36f3dd
      Namhyung Kim 提交于
      This is a preparation to add more info into the hist_entry.  Also it
      already passes too many argument, so passing sample directly will reduce
      the overhead of the function call.
      Signed-off-by: NNamhyung Kim <namhyung@kernel.org>
      Acked-by: NJiri Olsa <jolsa@kernel.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450804030-29193-2-git-send-email-namhyung@kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      fd36f3dd
  2. 06 1月, 2016 18 次提交
    • V
      perf/x86/amd: Remove l1-dcache-stores event for AMD · 9cc2617d
      Vince Weaver 提交于
      This is a long standing bug with the l1-dcache-stores generic event on
      AMD machines.  My perf_event testsuite has been complaining about this
      for years and I'm finally getting around to trying to get it fixed.
      
      The data_cache_refills:system event does not make sense for l1-dcache-stores.
      Maybe this was a typo and it was meant to be for l1-dcache-store-misses?
      
      In any case, the values returned are nowhere near correct for l1-dcache-stores
      and in fact the umask values for the event have completely changed with
      fam15h so it makes even less sense than ever.  So just remove it.
      Signed-off-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1512091134350.24311@vincent-weaver-1.umelst.maine.eduSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9cc2617d
    • H
      perf/x86/intel/uncore: Add Knights Landing uncore PMU support · 77af0037
      Harish Chegondi 提交于
      Knights Landing uncore performance monitoring (perfmon) is derived from
      Haswell-EP uncore perfmon with several differences. One notable difference
      is in PCI device IDs. Knights Landing uses common PCI device ID for
      multiple instances of an uncore PMU device type. In Haswell-EP, each
      instance of a PMU device type has a unique device ID.
      
      Knights Landing uncore components that have performance monitoring units
      are UBOX, CHA, EDC, MC, M2PCIe, IRP and PCU. Perfmon registers in EDC, MC,
      IRP, and M2PCIe reside in the PCIe configuration space. Perfmon registers
      in UBOX, CHA and PCU are accessed via the MSR interface.
      
      For more details, please refer to the public document:
      
        https://software.intel.com/sites/default/files/managed/15/8d/IntelXeonPhi%E2%84%A2x200ProcessorPerformanceMonitoringReferenceManual_Volume1_Registers_v0%206.pdfSigned-off-by: NHarish Chegondi <harish.chegondi@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Harish Chegondi <harish.chegondi@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/8ac513981264c3eb10343a3f523f19cc5a2d12fe.1449470704.git.harish.chegondi@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      77af0037
    • H
      perf/x86/intel/uncore: Remove hard coding of PMON box control MSR offset · dae25530
      Harish Chegondi 提交于
      Call uncore_pci_box_ctl() function to get the PMON box control MSR offset
      instead of hard coding the offset. This would allow us to use this
      snbep_uncore_pci_init_box() function for other PCI PMON devices whose box
      control MSR offset is different from SNBEP_PCI_PMON_BOX_CTL.
      Signed-off-by: NHarish Chegondi <harish.chegondi@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Harish Chegondi <harish.chegondi@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/872e8ef16cfc38e5ff3b45fac1094e6f1722e4ad.1449470704.git.harish.chegondi@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      dae25530
    • H
      perf/x86/intel: Add perf core PMU support for Intel Knights Landing · 1e7b9390
      Harish Chegondi 提交于
      Knights Landing core is based on Silvermont core with several differences.
      Like Silvermont, Knights Landing has 8 pairs of LBR MSRs. However, the
      LBR MSRs addresses match those of the Xeon cores' first 8 pairs of LBR MSRs
      Unlike Silvermont, Knights Landing supports hyperthreading. Knights Landing
      offcore response events config register mask is different from that of the
      Silvermont.
      
      This patch was developed based on a patch from Andi Kleen.
      
      For more details, please refer to the public document:
      
        https://software.intel.com/sites/default/files/managed/15/8d/IntelXeonPhi%E2%84%A2x200ProcessorPerformanceMonitoringReferenceManual_Volume1_Registers_v0%206.pdfSigned-off-by: NHarish Chegondi <harish.chegondi@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Harish Chegondi <harish.chegondi@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/d14593c7311f78c93c9cf6b006be843777c5ad5c.1449517401.git.harish.chegondi@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1e7b9390
    • K
      perf/x86/intel/uncore: Add Broadwell-EP uncore support · d6980ef3
      Kan Liang 提交于
      The uncore subsystem for Broadwell-EP is similar to Haswell-EP.
      There are some differences in pci device IDs, box number and
      constraints. This patch extends the Broadwell-DE codes to support
      Broadwell-EP.
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1449176411-9499-1-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d6980ef3
    • H
      perf/x86/rapl: Use unified perf_event_sysfs_show instead of special interface · d3bcd64b
      Huang Rui 提交于
      Actually, rapl_sysfs_show is a duplicate of perf_event_sysfs_show. We
      prefer to use the unified interface.
      Signed-off-by: NHuang Rui <ray.huang@amd.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dasaratharaman Chandramouli<dasaratharaman.chandramouli@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1449223661-2437-1-git-send-email-ray.huang@amd.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d3bcd64b
    • S
      perf/x86: Enable cycles:pp for Intel Atom · 673d188b
      Stephane Eranian 提交于
      This patch updates the PEBS support for Intel Atom to provide
      an alias for the cycles:pp event used by perf record/top by default
      nowadays.
      
      On Atom, only INST_RETIRED:ANY supports PEBS, so we use this event
      instead with a large cmask to count cycles. Given that Core2 has
      the same issue, we use the intel_pebs_aliases_core2() function for Atom
      as well.
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/1449172990-30183-3-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      673d188b
    • S
      perf/x86: fix PEBS issues on Intel Atom/Core2 · 1424a09a
      Stephane Eranian 提交于
      This patch fixes broken PEBS support on Intel Atom and Core2
      due to wrong pointer arithmetic in intel_pmu_drain_pebs_core().
      
      The get_next_pebs_record_by_bit() was called on PEBS format fmt0
      which does not use the pebs_record_nhm layout.
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Fixes: 21509084 ("perf/x86/intel: Handle multiple records in the PEBS buffer")
      Link: http://lkml.kernel.org/r/1449182000-31524-3-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1424a09a
    • S
      perf/x86: Fix LBR related crashes on Intel Atom · 6fc2e830
      Stephane Eranian 提交于
      This patches fixes the LBR kernel crashes on Intel Atom.
      
      The kernel was assuming that if the CPU supports 64-bit format
      LBR, then it has an LBR_SELECT MSR. Atom uses 64-bit LBR format
      but does not have LBR_SELECT. That was causing NULL pointer
      dereferences in a couple of places.
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Fixes: 96f3eda6 ("perf/x86/intel: Fix static checker warning in lbr enable")
      Link: http://lkml.kernel.org/r/1449182000-31524-2-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6fc2e830
    • S
      perf/x86: Fix filter_events() bug with event mappings · 61b87cae
      Stephane Eranian 提交于
      This patch fixes a bug in the filter_events() function.
      
      The patch fixes the bug whereby if some mappings did not
      exist, e.g., STALLED_CYCLES_FRONTEND, then any event after it
      in the attrs array would disappear from the published list of
      events in /sys/devices/cpu/events. This could be verified
      easily on any system post SNB (which do not publish
      STALLED_CYCLES_FRONTEND):
      
      	$ ./perf stat -e cycles,ref-cycles true
      	Performance counter stats for 'true':
                    1,217,348      cycles
      	<not supported>      ref-cycles
      
      The problem is that in filter_events() there is an assumption
      that the argument (attrs) is organized in increasing continuous
      event indexes related to the event_map(). But if we remove the
      non-supported events by shifing the position in the array, then
      the lookup x86_pmu.event_map() needs to compensate for it, otherwise
      we are looking up the wrong index. This patch corrects this problem
      by compensating for the deleted events and with that ref-cycles
      reappears (here shown on Haswell):
      
      	$ perf stat -e ref-cycles,cycles true
      	Performance counter stats for 'true':
               4,525,910      ref-cycles
               1,064,920      cycles
             0.002943888 seconds time elapsed
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: jolsa@kernel.org
      Cc: kan.liang@intel.com
      Fixes: 8300daa2 ("perf/x86: Filter out undefined events from sysfs events attribute")
      Link: http://lkml.kernel.org/r/1449516805-6637-1-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      61b87cae
    • A
      perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp · 72469764
      Andi Kleen 提交于
      Add a new 'three-p' precise level, that uses INST_RETIRED.PREC_DIST as
      base. The basic mechanism of abusing the inverse cmask to get all
      cycles works the same as before.
      
      PREC_DIST is available on Sandy Bridge or later. It had some problems
      on Sandy Bridge, so we only use it on IvyBridge and later. I tested it
      on Broadwell and Skylake.
      
      PREC_DIST has special support for avoiding shadow effects, which can
      give better results compare to UOPS_RETIRED. The drawback is that
      PREC_DIST can only schedule on counter 1, but that is ok for cycle
      sampling, as there is normally no need to do multiple cycle sampling
      runs in parallel. It is still possible to run perf top in parallel, as
      that doesn't use precise mode. Also of course the multiplexing can
      still allow parallel operation.
      
      :pp stays with the previous event.
      
      Example:
      
      Sample a loop with 10 sqrt with old cycles:pp
      
      	  0.14 │10:   sqrtps %xmm1,%xmm0     <--------------
      	  9.13 │      sqrtps %xmm1,%xmm0
      	 11.58 │      sqrtps %xmm1,%xmm0
      	 11.51 │      sqrtps %xmm1,%xmm0
      	  6.27 │      sqrtps %xmm1,%xmm0
      	 10.38 │      sqrtps %xmm1,%xmm0
      	 12.20 │      sqrtps %xmm1,%xmm0
      	 12.74 │      sqrtps %xmm1,%xmm0
      	  5.40 │      sqrtps %xmm1,%xmm0
      	 10.14 │      sqrtps %xmm1,%xmm0
      	 10.51 │    ↑ jmp    10
      
      We expect all 10 sqrt to get roughly the sample number of samples.
      
      But you can see that the instruction directly after the JMP is
      systematically underestimated in the result, due to sampling shadow
      effects.
      
      With the new PREC_DIST based sampling this problem is gone and all
      instructions show up roughly evenly:
      
      	  9.51 │10:   sqrtps %xmm1,%xmm0
      	 11.74 │      sqrtps %xmm1,%xmm0
      	 11.84 │      sqrtps %xmm1,%xmm0
      	  6.05 │      sqrtps %xmm1,%xmm0
      	 10.46 │      sqrtps %xmm1,%xmm0
      	 12.25 │      sqrtps %xmm1,%xmm0
      	 12.18 │      sqrtps %xmm1,%xmm0
      	  5.26 │      sqrtps %xmm1,%xmm0
      	 10.13 │      sqrtps %xmm1,%xmm0
      	 10.43 │      sqrtps %xmm1,%xmm0
      	  0.16 │    ↑ jmp    10
      
      Even with PREC_DIST there is still sampling skid and the result is not
      completely even, but systematic shadow effects are significantly
      reduced.
      
      The improvements are mainly expected to make a difference in high IPC
      code. With low IPC it should be similar.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: hpa@zytor.com
      Link: http://lkml.kernel.org/r/1448929689-13771-2-git-send-email-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      72469764
    • A
      perf/x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake · 442f5c74
      Andi Kleen 提交于
      I added UOPS_RETIRED.ALL by mistake to the Skylake PEBS event list for
      cycles:pp. But the event is not documented for Skylake, and has some
      issues.
      
      The recommended replacement for cycles:pp is to use
      INST_RETIRED.ANY+pebs as a base, similar to what CPUs before Sandy
      Bridge did. This new event is called INST_RETIRED.TOTAL_CYCLES_PS. The
      event is not really new, but has been already used by perf before
      Sandy Bridge for the original cycles:p
      
      Note the SDM doesn't document that event either, but it's being
      documented in the latest version of the event list on:
      
        https://download.01.org/perfmon/SKL
      
      This patch does:
      
       - Remove UOPS_RETIRED.ALL from the Skylake PEBS event list
      
       - Add INST_RETIRED.ANY to the Skylake PEBS event list, and an table entry to
         allow cmask=16,inv=1 for cycles:pp
      
       - We don't need an extra entry for the base INST_RETIRED event,
         because it is already covered by the catch-all PEBS table entry.
      
       - Switch Skylake to use the Core2 PEBS alias (which is
         INST_RETIRED.TOTAL_CYCLES_PS)
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: hpa@zytor.com
      Link: http://lkml.kernel.org/r/1448929689-13771-1-git-send-email-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      442f5c74
    • A
      perf/x86: Allow zero PEBS status with only single active event · 01330d72
      Andi Kleen 提交于
      Normally we drop PEBS events with a zero status field. But when
      there is only a single PEBS event active we can assume the
      PEBS record is for that event. The PEBS buffer is always flushed
      when PEBS events are disabled, so there is no risk of mishandling
      state PEBS records this way.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1449177740-5422-2-git-send-email-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      01330d72
    • A
      perf/x86: Remove warning for zero PEBS status · 957ea1fd
      Andi Kleen 提交于
      The recent commit:
      
        75f80859 ("perf/x86/intel/pebs: Robustify PEBS buffer drain")
      
      causes lots of warnings on different CPUs before Skylake
      when running PEBS intensive workloads.
      
      They can have a zero status field in the PEBS record when
      PEBS is racing with clearing of GLOBAl_STATUS.
      
      This also can cause hangs (it seems there are still
      problems with printk in NMI).
      
      Disable the warning, but still ignore the record.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1449177740-5422-1-git-send-email-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      957ea1fd
    • P
      perf/core: Collapse more IPI loops · 7b648018
      Peter Zijlstra 提交于
      This patch collapses the two 'hard' cases, which are
      perf_event_{dis,en}able().
      
      I cannot seem to convince myself the current code is correct.
      
      So starting with perf_event_disable(); we don't strictly need to test
      for event->state == ACTIVE, ctx->is_active is enough. If the event is
      not scheduled while the ctx is, __perf_event_disable() still does the
      right thing.  Its a little less efficient to IPI in that case,
      over-all simpler.
      
      For perf_event_enable(); the same goes, but I think that's actually
      broken in its current form. The current condition is: ctx->is_active
      && event->state == OFF, that means it doesn't do anything when
      !ctx->active && event->state == OFF. This is wrong, it should still
      mark the event INACTIVE in that case, otherwise we'll still not try
      and schedule the event once the context becomes active again.
      
      This patch implements the two function using the new
      event_function_call() and does away with the tricky event->state
      tests.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NAlexander Shishkin <alexander.shishkin@intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      7b648018
    • I
    • P
      perf: Fix race in swevent hash · 12ca6ad2
      Peter Zijlstra 提交于
      There's a race on CPU unplug where we free the swevent hash array
      while it can still have events on. This will result in a
      use-after-free which is BAD.
      
      Simply do not free the hash array on unplug. This leaves the thing
      around and no use-after-free takes place.
      
      When the last swevent dies, we do a for_each_possible_cpu() iteration
      anyway to clean these up, at which time we'll free it, so no leakage
      will occur.
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      12ca6ad2
    • P
      perf: Fix race in perf_event_exec() · c1274499
      Peter Zijlstra 提交于
      I managed to tickle this warning:
      
        [ 2338.884942] ------------[ cut here ]------------
        [ 2338.890112] WARNING: CPU: 13 PID: 35162 at ../kernel/events/core.c:2702 task_ctx_sched_out+0x6b/0x80()
        [ 2338.900504] Modules linked in:
        [ 2338.903933] CPU: 13 PID: 35162 Comm: bash Not tainted 4.4.0-rc4-dirty #244
        [ 2338.911610] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
        [ 2338.923071]  ffffffff81f1468e ffff8807c6457cb8 ffffffff815c680c 0000000000000000
        [ 2338.931382]  ffff8807c6457cf0 ffffffff810c8a56 ffffe8ffff8c1bd0 ffff8808132ed400
        [ 2338.939678]  0000000000000286 ffff880813170380 ffff8808132ed400 ffff8807c6457d00
        [ 2338.947987] Call Trace:
        [ 2338.950726]  [<ffffffff815c680c>] dump_stack+0x4e/0x82
        [ 2338.956474]  [<ffffffff810c8a56>] warn_slowpath_common+0x86/0xc0
        [ 2338.963195]  [<ffffffff810c8b4a>] warn_slowpath_null+0x1a/0x20
        [ 2338.969720]  [<ffffffff811a49cb>] task_ctx_sched_out+0x6b/0x80
        [ 2338.976244]  [<ffffffff811a62d2>] perf_event_exec+0xe2/0x180
        [ 2338.982575]  [<ffffffff8121fb6f>] setup_new_exec+0x6f/0x1b0
        [ 2338.988810]  [<ffffffff8126de83>] load_elf_binary+0x393/0x1660
        [ 2338.995339]  [<ffffffff811dc772>] ? get_user_pages+0x52/0x60
        [ 2339.001669]  [<ffffffff8121e297>] search_binary_handler+0x97/0x200
        [ 2339.008581]  [<ffffffff8121f8b3>] do_execveat_common.isra.33+0x543/0x6e0
        [ 2339.016072]  [<ffffffff8121fcea>] SyS_execve+0x3a/0x50
        [ 2339.021819]  [<ffffffff819fc165>] stub_execve+0x5/0x5
        [ 2339.027469]  [<ffffffff819fbeb2>] ? entry_SYSCALL_64_fastpath+0x12/0x71
        [ 2339.034860] ---[ end trace ee1337c59a0ddeac ]---
      
      Which is a WARN_ON_ONCE() indicating that cpuctx->task_ctx is not
      what we expected it to be.
      
      This is because context switches can swap the task_struct::perf_event_ctxp[]
      pointer around. Therefore you have to either disable preemption when looking
      at current, or hold ctx->lock.
      
      Fix perf_event_enable_on_exec(), it loads current->perf_event_ctxp[]
      before disabling interrupts, therefore a preemption in the right place
      can swap contexts around and we're using the wrong one.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Link: http://lkml.kernel.org/r/20151210195740.GG6357@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c1274499
  3. 18 12月, 2015 16 次提交