1. 07 Jan, 2016 12 commits
    •
      perf tools: Skip dynamic fields not defined for current event · 361459f1
      Namhyung Kim committed
      When there are multiple events, each dynamic sort key is defined just
      for one event.  In this case other events will always show "N/A" for
      those fields.  But they are meaningless and consume precious screen
      width.
      
      Let's skip those undefined dynamic fields.
      
        $ perf record -e kmem:kmalloc,kmem:kfree -a sleep 1
      
        $ perf report -s 'comm,kmalloc.*' --stdio
        # To display the perf.data header info, please use --header/--header-only options.
        #
        #
        # Total Lost Samples: 0
        #
        # Samples: 20K of event 'kmem:kmalloc'
        # Event count (approx.): 20533
        #
        # Overhead  Command           call_site                 ptr  bytes_req  bytes_alloc            gfp_flags
        # ........  .......  ..................  ..................  .........  ...........  ...................
        #
            99.89%  perf       ffffffffa01d4396  0xffff8803ffb79720         96           96    GFP_NOFS|GFP_ZERO
             0.06%  sleep      ffffffff8114e1cd  0xffff8803d228a000       4096         4096           GFP_KERNEL
             0.03%  perf       ffffffff811d6ae6  0xffff8803f7678f00        240          256  GFP_KERNEL|GFP_ZERO
             0.00%  perf       ffffffff812263c1  0xffff880406172380        128          128           GFP_KERNEL
             0.00%  perf       ffffffff812264b9  0xffff8803ffac1600        504          512           GFP_KERNEL
             0.00%  perf       ffffffff81226634  0xffff880401dc5280         28           32           GFP_KERNEL
             0.00%  sleep      ffffffff81226da9  0xffff8803ffac3a00        392          512           GFP_KERNEL
      
        # Samples: 20K of event 'kmem:kfree'
        # Event count (approx.): 20597
        #
        # Overhead  Command
        # ........  ..............
        #
            99.63%  perf
             0.14%  sleep
             0.11%  irq/36-iwlwifi
             0.11%  kworker/u16:0
             0.01%  Xorg
             0.00%  firefox
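
      The column selection described above can be sketched roughly as
      follows.  This is a minimal illustration, not perf's actual code: the
      SortKey class and columns_for() helper are hypothetical names, and the
      key list simply mirrors the kmalloc example output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SortKey:
    """Hypothetical model of a sort key (not a perf data structure)."""
    name: str                    # column header
    event: Optional[str] = None  # owning event for dynamic keys, None for static keys


def columns_for(event: str, keys: List[SortKey]) -> List[str]:
    """Return the column headers to print for `event`.

    Static keys always apply; a dynamic key applies only to the event
    it was defined for, so undefined columns are skipped, not shown as N/A.
    """
    return [k.name for k in keys if k.event is None or k.event == event]


# Keys corresponding to: perf report -s 'comm,kmalloc.*'
KEYS = [
    SortKey("Command"),
    SortKey("call_site", "kmem:kmalloc"),
    SortKey("ptr", "kmem:kmalloc"),
    SortKey("bytes_req", "kmem:kmalloc"),
    SortKey("bytes_alloc", "kmem:kmalloc"),
    SortKey("gfp_flags", "kmem:kmalloc"),
]
```

      With these keys, columns_for("kmem:kfree", KEYS) keeps only the
      static Command column, which matches the second table above.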
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450804030-29193-12-git-send-email-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    •
      perf tools: Support '<event>.*' dynamic sort key · 3b099bf5
      Namhyung Kim committed
      Support '*' character for field name to add all (non-common) fields as
      sort keys easily.
      
        $ perf report -s 'switch.*' --stdio
        ...
        # Overhead    prev_comm  prev_pid   prev_prio  prev_state     next_comm  next_pid  next_prio
        # ........  ...........  .........  .........  ..........  ............  ........  .........
        #
             3.82%    swapper/0         0         120           0   netctl-auto     18711        120
             3.75%  netctl-auto     18711         120           1     swapper/0         0        120
             2.24%    swapper/1         0         120           0   netctl-auto     18709        120
             2.24%  netctl-auto     18709         120           1     swapper/1         0        120
             1.80%    swapper/2         0         120           0   rcu_preempt         7        120
             1.80%    swapper/2         0         120           0   netctl-auto     18711        120
             1.80%  rcu_preempt         7         120           1     swapper/2         0        120
             1.80%  netctl-auto     18711         120           1     swapper/2         0        120
        ...
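
      As a rough sketch of the wildcard expansion, assuming "non-common"
      means every field whose name does not start with "common_" (the
      prefix used in the tracepoint format files).  The helper name
      expand_field_spec() is hypothetical, and the field list is copied
      from the sched_switch format listing:

```python
from typing import List

# Field names of sched:sched_switch, in format-file order.
SCHED_SWITCH_FIELDS = [
    "common_type", "common_flags", "common_preempt_count", "common_pid",
    "prev_comm", "prev_pid", "prev_prio", "prev_state",
    "next_comm", "next_pid", "next_prio",
]


def expand_field_spec(field_spec: str, fields: List[str]) -> List[str]:
    """Expand the field part of an '<event>.<field>' sort key.

    '*' selects every non-common field in declaration order;
    anything else must name exactly one existing field.
    """
    if field_spec == "*":
        return [f for f in fields if not f.startswith("common_")]
    if field_spec not in fields:
        raise KeyError(f"unknown field: {field_spec}")
    return [field_spec]
```

      Expanding '*' over this list gives the seven columns shown in the
      report header above, in the same order.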
      Suggested-and-acked-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450804030-29193-11-git-send-email-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    •
      perf tools: Support shortcuts for events in dynamic sort keys · 5d0cff93
      Namhyung Kim committed
      The dynamic sort key requires an event name, but specifying the full
      event name is rather inconvenient.  This patch adds more compact ways
      to identify the event.
      
        1. If session has just one event, event name can be omitted.
        2. Events can be accessed by index preceded by a percent sign.
        3. A part of the name can be used, if it's not ambiguous.  The partial
           name should not contain ':' in it.
        4. Full system + event name is still used, it should contain ':'.
      
      So all of the commands in the example below do the same thing:
      
        $ perf record -e sched:sched_switch -a sleep 1
      
        $ perf report -s next_pid,next_comm
        $ perf report -s %1.next_pid,%1.next_comm
        $ perf report -s switch.next_pid,switch.next_comm
        $ perf report -s sched:sched_switch.next_pid,sched:sched_switch.next_comm
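
      The four lookup rules can be sketched as below.  This is only an
      illustration under stated assumptions: the function names are
      hypothetical, the '%' index is assumed to be 1-based (as '%1' in the
      example suggests), and a partial name is assumed to match as a
      substring of the full event name.

```python
from typing import List, Optional, Tuple


def resolve_event(spec: Optional[str], events: List[str]) -> str:
    """Map the event part of a sort key to a full 'system:event' name."""
    if spec is None:                       # rule 1: name omitted
        if len(events) == 1:
            return events[0]
        raise ValueError("event name required with multiple events")
    if spec.startswith("%"):               # rule 2: index (assumed 1-based)
        index = int(spec[1:])
        if not 1 <= index <= len(events):
            raise ValueError(f"no event at index {index}")
        return events[index - 1]
    if ":" in spec:                        # rule 4: full system:event name
        if spec in events:
            return spec
        raise ValueError(f"unknown event: {spec}")
    matches = [e for e in events if spec in e]   # rule 3: partial name
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous event: {spec}")
    return matches[0]


def resolve_sort_key(key: str, events: List[str]) -> Tuple[str, str]:
    """Split '<event>.<field>' (event optional) and resolve the event."""
    event_part, dot, field = key.rpartition(".")
    return resolve_event(event_part if dot else None, events), field


EVENTS = ["sched:sched_switch"]
EXAMPLE_KEYS = [
    "next_pid",
    "%1.next_pid",
    "switch.next_pid",
    "sched:sched_switch.next_pid",
]
```

      Every key in EXAMPLE_KEYS resolves to the same (event, field) pair,
      while a partial name matching several events is rejected as ambiguous.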
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450804030-29193-10-git-send-email-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    •
      perf report/top: Add --raw-trace option · 053a3989
      Namhyung Kim committed
      The --raw-trace option allows disabling pretty printing by the event's
      print_fmt or plugin.  Besides that, each dynamic sort key now can
      receive a 'raw' suffix separated by '/' to ask for the raw trace of a
      specific field.
      
        $ perf report -s comm,kmem:kmalloc.gfp_flags
        ...
        # Overhead  Command            gfp_flags
        # ........  .......  ...................
        #
            99.89%  perf       GFP_NOFS|GFP_ZERO
             0.06%  sleep             GFP_KERNEL
             0.03%  perf     GFP_KERNEL|GFP_ZERO
             0.01%  perf              GFP_KERNEL
      
      Now
      
        $ perf report -s comm,kmem:kmalloc.gfp_flags --raw-trace
      or
        $ perf report -s comm,kmem:kmalloc.gfp_flags/raw
        ...
        # Overhead  Command   gfp_flags
        # ........  .......  ..........
        #
            99.89%  perf          32848
             0.06%  sleep           208
             0.03%  perf          32976
             0.01%  perf            208
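
      To show how the raw numbers relate to the pretty-printed names, here
      is a small decoding sketch.  The mask values are NOT taken from kernel
      headers; they are inferred from the two tables above (208 is
      GFP_KERNEL, 32848 is GFP_NOFS|GFP_ZERO, 32976 is GFP_KERNEL|GFP_ZERO)
      and will differ on other kernel versions.  pretty_flags() is a
      hypothetical helper.

```python
# (mask, name) pairs inferred from the sample output only; larger
# compound masks come first so they win over their subsets.
FLAG_NAMES = [
    (208, "GFP_KERNEL"),
    (80, "GFP_NOFS"),
    (32768, "GFP_ZERO"),
]


def pretty_flags(value: int) -> str:
    """Render a raw flags value as 'NAME|NAME', similar in spirit to
    what a print format does for flag fields."""
    parts = []
    remaining = value
    for mask, name in FLAG_NAMES:
        if remaining & mask == mask:   # all bits of this mask are set
            parts.append(name)
            remaining &= ~mask
    if remaining:                      # leftover bits with no known name
        parts.append(hex(remaining))
    return "|".join(parts) if parts else "0"
```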
      Suggested-and-Acked-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450804030-29193-9-git-send-email-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    •
      perf tools: Add 'trace' sort key · a34bb6a0
      Namhyung Kim committed
      The 'trace' sort key shows tracepoint event output using either the
      print fmt or a plugin.  For example, the sched_switch event (using a
      plugin) will show output like below:
      
        # perf record -e sched:sched_switch -a usleep 10
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.197 MB perf.data (69 samples) ]
        #
      
        $ perf report -s trace --stdio
        ...
        # Overhead  Trace output
        # ........  ...................................................
        #
             9.48%  swapper/0:0 [120] R ==> transmission-gt:17773 [120]
             9.48%  transmission-gt:17773 [120] S ==> swapper/0:0 [120]
             9.04%  swapper/2:0 [120] R ==> transmission-gt:17773 [120]
             8.92%  transmission-gt:17773 [120] S ==> swapper/2:0 [120]
             5.25%  swapper/0:0 [120] R ==> kworker/0:1H:109 [100]
             5.21%  kworker/0:1H:109 [100] S ==> swapper/0:0 [120]
             1.78%  swapper/3:0 [120] R ==> transmission-gt:17773 [120]
             1.78%  transmission-gt:17773 [120] S ==> swapper/3:0 [120]
             1.53%  Xephyr:6524 [120] S ==> swapper/0:0 [120]
             1.53%  swapper/0:0 [120] R ==> Xephyr:6524 [120]
             1.17%  swapper/2:0 [120] R ==> irq/33-iwlwifi:233 [49]
             1.13%  irq/33-iwlwifi:233 [49] S ==> swapper/2:0 [120]
      
      Note that the 'trace' sort key works only for tracepoint events.  If
      it's used for other types of events, just "N/A" will be printed.
      Suggested-and-acked-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450804030-29193-8-git-send-email-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    •
      perf tools: Try to show pretty printed output for dynamic sort keys · 60517d28
      Namhyung Kim committed
      Each tracepoint event has a print format string to improve
      readability.  Try to parse the formatted output and match the field
      name.  If a match is found, use it for the result.  If not, fall back
      to the original output.
      
      For example, sort on kmem:kmalloc.gfp_flags looks like below:
      (Note: libtraceevent plugins are not installed on my system.  They might
      affect the output below)
      
      Before:
        # Overhead  Command   gfp_flags
        # ........  .......  ..........
        #
            99.89%  perf          32848
             0.06%  sleep           208
             0.03%  perf          32976
             0.01%  perf            208
      
      After:
        # Overhead  Command            gfp_flags
        # ........  .......  ...................
        #
            99.89%  perf       GFP_NOFS|GFP_ZERO
             0.06%  sleep             GFP_KERNEL
             0.03%  perf     GFP_KERNEL|GFP_ZERO
             0.01%  perf              GFP_KERNEL
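
      The match-or-fall-back idea can be sketched like this.  It is a
      simplified illustration rather than the patch's implementation: it
      assumes the formatted output is a sequence of space-separated
      "name=value" tokens, the sample line is made up from the first kmalloc
      row shown in this series, and pretty_field() is a hypothetical helper.

```python
def pretty_field(formatted: str, field: str, raw_value: str) -> str:
    """Return the pretty-printed value of `field` found in `formatted`,
    or `raw_value` when the field name does not appear in the output.

    Limitation of this sketch: values containing spaces are not handled.
    """
    for token in formatted.split():
        name, eq, value = token.partition("=")
        if eq and name == field:
            return value
    return raw_value


# Illustrative formatted output for one kmem:kmalloc sample.
SAMPLE_OUTPUT = (
    "call_site=ffffffffa01d4396 ptr=0xffff8803ffb79720 "
    "bytes_req=96 bytes_alloc=96 gfp_flags=GFP_NOFS|GFP_ZERO"
)
```

      For example, looking up gfp_flags in SAMPLE_OUTPUT yields the symbolic
      name, while a field missing from the output keeps its raw value.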
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450804030-29193-7-git-send-email-namhyung@kernel.org
      [ Fixed clash with earlier, updated patch in this patchkit ]
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    •
      perf tools: Add dynamic sort key for tracepoint events · c7c2a5e4
      Namhyung Kim committed
      The existing sort keys are less useful for tracepoint events in that
      they are always sampled at the same place, the function where the
      tracepoint is located.
      
      For example, a 'perf report' on sched:sched_switch event looks like the
      following:
      
        # Overhead  Command          Shared Object     Symbol
        # ........  ...............  ................  ..............
        #
            47.22%  swapper          [kernel.vmlinux]  [k] __schedule
            21.67%  transmission-gt  [kernel.vmlinux]  [k] __schedule
             8.23%  netctl-auto      [kernel.vmlinux]  [k] __schedule
             5.53%  kworker/0:1H     [kernel.vmlinux]  [k] __schedule
             1.98%  Xephyr           [kernel.vmlinux]  [k] __schedule
             1.33%  irq/33-iwlwifi   [kernel.vmlinux]  [k] __schedule
             1.17%  wpa_cli          [kernel.vmlinux]  [k] __schedule
             1.13%  rcu_preempt      [kernel.vmlinux]  [k] __schedule
             0.85%  ksoftirqd/0      [kernel.vmlinux]  [k] __schedule
             0.77%  Timer            [kernel.vmlinux]  [k] __schedule
      
      In fact, tracepoints have meaningful information in their fields but
      there's currently no way to use it in 'perf report'.  The dynamic sort
      keys are introduced in this patch to overcome this limitation.
      
      The sched:sched_switch events have the following fields:
      
        # sudo cat /sys/kernel/debug/tracing/events/sched/sched_switch/format
        name: sched_switch
        ID: 268
        format:
      	field:unsigned short common_type;         offset:0; size:2; signed:0;
      	field:unsigned char common_flags;         offset:2; size:1; signed:0;
      	field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
      	field:int common_pid;                     offset:4; size:4; signed:1;
      
      	field:char prev_comm[16]; offset:8;  size:16; signed:1;
      	field:pid_t prev_pid;     offset:24; size:4;  signed:1;
      	field:int prev_prio;      offset:28; size:4;  signed:1;
      	field:long prev_state;    offset:32; size:8;  signed:1;
      	field:char next_comm[16]; offset:40; size:16; signed:1;
      	field:pid_t next_pid;     offset:56; size:4;  signed:1;
      	field:int next_prio;      offset:60; size:4;  signed:1;
      
        print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==>
                    next_comm=%s next_pid=%d next_prio=%d",
          REC->prev_comm, REC->prev_pid, REC->prev_prio,
          REC->prev_state & (2048-1) ? __print_flags(REC->prev_state & (2048-1),
          "|", { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" },
          { 64, "x" }, { 128, "K"}, { 256, "W" }, { 512, "P" }, { 1024, "N" }) : "R",
          REC->prev_state & 2048 ? "+" : "", REC->next_comm, REC->next_pid, REC->next_prio
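
      The offset and size columns in this listing are what make it possible
      to pull a single field out of a sample's raw data.  A rough sketch of
      that idea follows; it is not libtraceevent's API.  parse_format() and
      read_field() are hypothetical helpers, only a few format lines are
      reproduced, and little-endian byte order is assumed.

```python
import re
import struct

FORMAT_TEXT = """
field:int common_pid;     offset:4;  size:4;  signed:1;
field:char prev_comm[16]; offset:8;  size:16; signed:1;
field:char next_comm[16]; offset:40; size:16; signed:1;
field:pid_t next_pid;     offset:56; size:4;  signed:1;
"""

FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);"
    r"\s*size:(?P<size>\d+);\s*signed:(?P<signed>[01]);"
)


def parse_format(text):
    """Map field name -> (offset, size, signed, is_array)."""
    fields = {}
    for m in FIELD_RE.finditer(text):
        decl = m.group("decl").strip()
        name = decl.split()[-1].split("[")[0]   # drop the type and any [N]
        fields[name] = (int(m.group("offset")), int(m.group("size")),
                        m.group("signed") == "1", "[" in decl)
    return fields


def read_field(raw, fields, name):
    """Extract one field from a raw record (little-endian assumed)."""
    offset, size, signed, is_array = fields[name]
    chunk = raw[offset:offset + size]
    if is_array:                                # char[N]: NUL-terminated string
        return chunk.split(b"\0", 1)[0].decode()
    return int.from_bytes(chunk, "little", signed=signed)


# Build a fake 64-byte record to exercise the helpers.
FIELDS = parse_format(FORMAT_TEXT)
record = bytearray(64)
struct.pack_into("<16s", record, 40, b"transmission-gt")
struct.pack_into("<i", record, 56, 17773)
RAW = bytes(record)
```

      Reading next_pid and next_comm from RAW gives back the values packed
      into the fake record, which is the lookup a dynamic sort key needs.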
      
      With dynamic sort keys, you can use <event.field> as a sort key.  Those
      dynamic keys are checked and created on demand.  For instance, the
      following sorts by the next_pid field on the same data file:
      
        $ perf report -s comm,sched:sched_switch.next_pid --stdio
        ...
        # Overhead  Command            next_pid
        # ........  ...............  ..........
        #
            21.23%  transmission-gt           0
            20.86%  swapper               17773
             6.62%  netctl-auto               0
             5.25%  swapper                 109
             5.21%  kworker/0:1H              0
             1.98%  Xephyr                    0
             1.98%  swapper                6524
             1.98%  swapper               27478
             1.37%  swapper               27476
             1.17%  swapper                 233
      
      Multiple dynamic sort keys are also supported:
      
        $ perf report -s comm,sched:sched_switch.next_pid,sched:sched_switch.next_comm --stdio
        ...
        # Overhead  Command            next_pid         next_comm
        # ........  ...............  ..........  ................
        #
            20.86%  swapper               17773   transmission-gt
             9.64%  transmission-gt           0         swapper/0
             9.16%  transmission-gt           0         swapper/2
             5.25%  swapper                 109      kworker/0:1H
             5.21%  kworker/0:1H              0         swapper/0
             2.14%  netctl-auto               0         swapper/2
             1.98%  netctl-auto               0         swapper/0
             1.98%  swapper                6524            Xephyr
             1.98%  swapper               27478       netctl-auto
             1.78%  transmission-gt           0         swapper/3
             1.53%  Xephyr                    0         swapper/0
             1.29%  netctl-auto               0         swapper/1
             1.29%  swapper               27476       netctl-auto
             1.21%  netctl-auto               0         swapper/3
             1.17%  swapper                 233    irq/33-iwlwifi
      
      Note that pid 0 exists for each cpu, so it has a comm of 'swapper/N'.
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450804030-29193-6-git-send-email-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    •
      perf tools: Pass evlist to setup_sorting() · 40184c46
      Namhyung Kim committed
      This is a preparation to support dynamic sort keys for tracepoint
      events.  Dynamic sort keys can be created for specific fields in trace
      events so it needs the event information.
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450804030-29193-5-git-send-email-namhyung@kernel.org
      [ Moving the evlist creation earlier in top was split to a previous patch ]
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    •
      perf top: Create the evlist sooner · 54f8f403
      Namhyung Kim committed
      This is a preparation to support dynamic sort keys for tracepoint
      events.  Dynamic sort keys can be created for specific fields in trace
      events, so the sort routines need the event information and we need to
      pass the evlist to them.  Create the evlist sooner so that the next
      patch can do that.
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450804030-29193-5-git-send-email-namhyung@kernel.org
      [ Split from the patch passing the evlist to the sort routines ]
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    •
      tools lib traceevent: Factor out and export print_event_field[s]() · be45d40e
      Namhyung Kim committed
      The print_event_field() and print_event_fields() functions print basic
      information of a given field or event without the print format.  They'll
      be used by dynamic sort keys later.
      
      Committer note:
      
      Rename it to pevent_print_field[s]() to get proper namespacing, as
      discussed with Steven Rostedt.
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450876121-22494-1-git-send-email-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    •
      perf hist: Save raw_data/size for tracepoint events · 72392834
      Namhyung Kim committed
      The raw_data and raw_size fields are to provide tracepoint specific
      information.  They will be used by dynamic sort keys later.
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450923377-18641-1-git-send-email-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    •
      perf hist: Pass struct sample to __hists__add_entry() · fd36f3dd
      Namhyung Kim committed
      This is a preparation to add more info into the hist_entry.  Also it
      already passes too many arguments, so passing sample directly will
      reduce the overhead of the function call.
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/1450804030-29193-2-git-send-email-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  2. 06 Jan, 2016 18 commits
    •
      perf/x86/amd: Remove l1-dcache-stores event for AMD · 9cc2617d
      Vince Weaver committed
      This is a long standing bug with the l1-dcache-stores generic event on
      AMD machines.  My perf_event testsuite has been complaining about this
      for years and I'm finally getting around to trying to get it fixed.
      
      The data_cache_refills:system event does not make sense for l1-dcache-stores.
      Maybe this was a typo and it was meant to be for l1-dcache-store-misses?
      
      In any case, the values returned are nowhere near correct for l1-dcache-stores
      and in fact the umask values for the event have completely changed with
      fam15h so it makes even less sense than ever.  So just remove it.
      Signed-off-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1512091134350.24311@vincent-weaver-1.umelst.maine.edu
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    •
      perf/x86/intel/uncore: Add Knights Landing uncore PMU support · 77af0037
      Harish Chegondi committed
      Knights Landing uncore performance monitoring (perfmon) is derived from
      Haswell-EP uncore perfmon with several differences. One notable difference
      is in PCI device IDs. Knights Landing uses common PCI device ID for
      multiple instances of an uncore PMU device type. In Haswell-EP, each
      instance of a PMU device type has a unique device ID.
      
      Knights Landing uncore components that have performance monitoring units
      are UBOX, CHA, EDC, MC, M2PCIe, IRP and PCU. Perfmon registers in EDC, MC,
      IRP, and M2PCIe reside in the PCIe configuration space. Perfmon registers
      in UBOX, CHA and PCU are accessed via the MSR interface.
      
      For more details, please refer to the public document:
      
        https://software.intel.com/sites/default/files/managed/15/8d/IntelXeonPhi%E2%84%A2x200ProcessorPerformanceMonitoringReferenceManual_Volume1_Registers_v0%206.pdf
      Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Harish Chegondi <harish.chegondi@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/8ac513981264c3eb10343a3f523f19cc5a2d12fe.1449470704.git.harish.chegondi@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    •
      perf/x86/intel/uncore: Remove hard coding of PMON box control MSR offset · dae25530
      Harish Chegondi committed
      Call uncore_pci_box_ctl() function to get the PMON box control MSR offset
      instead of hard coding the offset. This would allow us to use this
      snbep_uncore_pci_init_box() function for other PCI PMON devices whose box
      control MSR offset is different from SNBEP_PCI_PMON_BOX_CTL.
      Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Harish Chegondi <harish.chegondi@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/872e8ef16cfc38e5ff3b45fac1094e6f1722e4ad.1449470704.git.harish.chegondi@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    •
      perf/x86/intel: Add perf core PMU support for Intel Knights Landing · 1e7b9390
      Harish Chegondi committed
      Knights Landing core is based on Silvermont core with several differences.
      Like Silvermont, Knights Landing has 8 pairs of LBR MSRs. However, the
      LBR MSR addresses match those of the Xeon cores' first 8 pairs of LBR MSRs.
      Unlike Silvermont, Knights Landing supports hyperthreading. The Knights
      Landing offcore response events config register mask is different from
      that of Silvermont.
      
      This patch was developed based on a patch from Andi Kleen.
      
      For more details, please refer to the public document:
      
        https://software.intel.com/sites/default/files/managed/15/8d/IntelXeonPhi%E2%84%A2x200ProcessorPerformanceMonitoringReferenceManual_Volume1_Registers_v0%206.pdf
      Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Harish Chegondi <harish.chegondi@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/d14593c7311f78c93c9cf6b006be843777c5ad5c.1449517401.git.harish.chegondi@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    •
      perf/x86/intel/uncore: Add Broadwell-EP uncore support · d6980ef3
      Kan Liang committed
      The uncore subsystem for Broadwell-EP is similar to Haswell-EP.
      There are some differences in pci device IDs, box number and
      constraints. This patch extends the Broadwell-DE codes to support
      Broadwell-EP.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1449176411-9499-1-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    •
      perf/x86/rapl: Use unified perf_event_sysfs_show instead of special interface · d3bcd64b
      Huang Rui committed
      Actually, rapl_sysfs_show is a duplicate of perf_event_sysfs_show. We
      prefer to use the unified interface.
      Signed-off-by: Huang Rui <ray.huang@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1449223661-2437-1-git-send-email-ray.huang@amd.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    •
      perf/x86: Enable cycles:pp for Intel Atom · 673d188b
      Stephane Eranian committed
      This patch updates the PEBS support for Intel Atom to provide
      an alias for the cycles:pp event used by perf record/top by default
      nowadays.
      
      On Atom, only INST_RETIRED:ANY supports PEBS, so we use this event
      instead with a large cmask to count cycles. Given that Core2 has
      the same issue, we use the intel_pebs_aliases_core2() function for Atom
      as well.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/1449172990-30183-3-git-send-email-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    •
      perf/x86: fix PEBS issues on Intel Atom/Core2 · 1424a09a
      Stephane Eranian committed
      This patch fixes broken PEBS support on Intel Atom and Core2
      due to wrong pointer arithmetic in intel_pmu_drain_pebs_core().
      
      The get_next_pebs_record_by_bit() was called on PEBS format fmt0
      which does not use the pebs_record_nhm layout.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Fixes: 21509084 ("perf/x86/intel: Handle multiple records in the PEBS buffer")
      Link: http://lkml.kernel.org/r/1449182000-31524-3-git-send-email-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1424a09a
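To see why stepping through a fmt0 buffer with the wrong record layout goes wrong, here is a small user-space sketch. It is not the kernel fix. The two layouts are assumptions modelled on the kernel's pebs_record_core (flags, ip and 16 general-purpose registers) and pebs_record_nhm (the same fields plus status, dla, dse, lat); the class and function names are hypothetical.

```python
import ctypes

U64 = ctypes.c_uint64
GPRS = ["ax", "bx", "cx", "dx", "si", "di", "bp", "sp"] + ["r%d" % n for n in range(8, 16)]

CORE_FIELDS = [("flags", U64), ("ip", U64)] + [(reg, U64) for reg in GPRS]
NHM_FIELDS = CORE_FIELDS + [("status", U64), ("dla", U64), ("dse", U64), ("lat", U64)]


class PebsRecordCore(ctypes.Structure):
    _fields_ = CORE_FIELDS


class PebsRecordNhm(ctypes.Structure):
    _fields_ = NHM_FIELDS


def walk_ips(buf, stride):
    """Read the ip field of each record, stepping through buf by stride bytes."""
    ips = []
    for off in range(0, len(buf) - ctypes.sizeof(PebsRecordCore) + 1, stride):
        ips.append(PebsRecordCore.from_buffer_copy(buf, off).ip)
    return ips


# Three fmt0-style records with recognisable ip values.
buf = b"".join(bytes(PebsRecordCore(flags=0, ip=ip)) for ip in (0x1000, 0x2000, 0x3000))

good = walk_ips(buf, ctypes.sizeof(PebsRecordCore))   # correct fmt0 stride
bad = walk_ips(buf, ctypes.sizeof(PebsRecordNhm))     # stride of the larger layout
print(good, bad)
```

With the larger stride the second read lands in the middle of a record, so the value taken for ip is really some other register.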
    • S
      perf/x86: Fix LBR related crashes on Intel Atom · 6fc2e830
      Stephane Eranian committed
      This patch fixes the LBR kernel crashes on Intel Atom.
      
      The kernel was assuming that if the CPU supports 64-bit format
      LBR, then it has an LBR_SELECT MSR. Atom uses 64-bit LBR format
      but does not have LBR_SELECT. That was causing NULL pointer
      dereferences in a couple of places.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Fixes: 96f3eda6 ("perf/x86/intel: Fix static checker warning in lbr enable")
      Link: http://lkml.kernel.org/r/1449182000-31524-2-git-send-email-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6fc2e830
    • S
      perf/x86: Fix filter_events() bug with event mappings · 61b87cae
      Stephane Eranian committed
      This patch fixes a bug in the filter_events() function.
      
      The patch fixes the bug whereby if some mappings did not
      exist, e.g., STALLED_CYCLES_FRONTEND, then any event after it
      in the attrs array would disappear from the published list of
      events in /sys/devices/cpu/events. This could be verified
      easily on any system post SNB (which do not publish
      STALLED_CYCLES_FRONTEND):
      
      	$ ./perf stat -e cycles,ref-cycles true
      	Performance counter stats for 'true':
                    1,217,348      cycles
      	<not supported>      ref-cycles
      
      The problem is that in filter_events() there is an assumption
      that the argument (attrs) is organized in increasing continuous
      event indexes related to the event_map(). But if we remove the
      non-supported events by shifting the position in the array, then
      the lookup x86_pmu.event_map() needs to compensate for it, otherwise
      we are looking up the wrong index. This patch corrects this problem
      by compensating for the deleted events and with that ref-cycles
      reappears (here shown on Haswell):
      
      	$ perf stat -e ref-cycles,cycles true
      	Performance counter stats for 'true':
               4,525,910      ref-cycles
               1,064,920      cycles
             0.002943888 seconds time elapsed
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: jolsa@kernel.org
      Cc: kan.liang@intel.com
      Fixes: 8300daa2 ("perf/x86: Filter out undefined events from sysfs events attribute")
      Link: http://lkml.kernel.org/r/1449516805-6637-1-git-send-email-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      61b87cae
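The index drift described in the filter_events() entry above is easy to reproduce in a toy model. The sketch below is not the kernel function. The event table, its codes and the function names are hypothetical, and a zero entry stands for "no mapping on this CPU".

```python
# Hypothetical generic-event table: position is the event index,
# 0 means "this CPU has no mapping for the event".
EVENT_NAMES = ["cycles", "instructions", "stalled-cycles-frontend",
               "stalled-cycles-backend", "ref-cycles"]
EVENT_MAP = [0x003C, 0x00C0, 0, 0, 0x0300]


def filter_events_buggy(attrs):
    attrs = list(attrs)
    i = 0
    while i < len(attrs):
        if EVENT_MAP[i]:           # looks up by current array position
            i += 1
        else:
            del attrs[i]           # shift the tail down, re-check slot i
    return attrs


def filter_events_fixed(attrs):
    attrs = list(attrs)
    i = offset = 0
    while i < len(attrs):
        if EVENT_MAP[i + offset]:  # compensate for entries already removed
            i += 1
        else:
            del attrs[i]
            offset += 1
    return attrs


print(filter_events_buggy(EVENT_NAMES))
print(filter_events_fixed(EVENT_NAMES))
```

In the buggy variant, once the first unmapped event is removed, every later entry is checked against the wrong map slot, so ref-cycles is dropped along with the stalled-cycles events. Tracking how many entries were removed restores it.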
    • A
      perf/x86: Use INST_RETIRED.PREC_DIST for cycles:ppp · 72469764
      Andi Kleen committed
      Add a new 'three-p' precise level, that uses INST_RETIRED.PREC_DIST as
      base. The basic mechanism of abusing the inverse cmask to get all
      cycles works the same as before.
      
      PREC_DIST is available on Sandy Bridge or later. It had some problems
      on Sandy Bridge, so we only use it on IvyBridge and later. I tested it
      on Broadwell and Skylake.
      
      PREC_DIST has special support for avoiding shadow effects, which can
      give better results compared to UOPS_RETIRED. The drawback is that
      PREC_DIST can only schedule on counter 1, but that is ok for cycle
      sampling, as there is normally no need to do multiple cycle sampling
      runs in parallel. It is still possible to run perf top in parallel, as
      that doesn't use precise mode. Also of course the multiplexing can
      still allow parallel operation.
      
      :pp stays with the previous event.
      
      Example:
      
      Sample a loop with 10 sqrt with old cycles:pp
      
      	  0.14 │10:   sqrtps %xmm1,%xmm0     <--------------
      	  9.13 │      sqrtps %xmm1,%xmm0
      	 11.58 │      sqrtps %xmm1,%xmm0
      	 11.51 │      sqrtps %xmm1,%xmm0
      	  6.27 │      sqrtps %xmm1,%xmm0
      	 10.38 │      sqrtps %xmm1,%xmm0
      	 12.20 │      sqrtps %xmm1,%xmm0
      	 12.74 │      sqrtps %xmm1,%xmm0
      	  5.40 │      sqrtps %xmm1,%xmm0
      	 10.14 │      sqrtps %xmm1,%xmm0
      	 10.51 │    ↑ jmp    10
      
      We expect all 10 sqrt to get roughly the same number of samples.
      
      But you can see that the instruction directly after the JMP is
      systematically underestimated in the result, due to sampling shadow
      effects.
      
      With the new PREC_DIST based sampling this problem is gone and all
      instructions show up roughly evenly:
      
      	  9.51 │10:   sqrtps %xmm1,%xmm0
      	 11.74 │      sqrtps %xmm1,%xmm0
      	 11.84 │      sqrtps %xmm1,%xmm0
      	  6.05 │      sqrtps %xmm1,%xmm0
      	 10.46 │      sqrtps %xmm1,%xmm0
      	 12.25 │      sqrtps %xmm1,%xmm0
      	 12.18 │      sqrtps %xmm1,%xmm0
      	  5.26 │      sqrtps %xmm1,%xmm0
      	 10.13 │      sqrtps %xmm1,%xmm0
      	 10.43 │      sqrtps %xmm1,%xmm0
      	  0.16 │    ↑ jmp    10
      
      Even with PREC_DIST there is still sampling skid and the result is not
      completely even, but systematic shadow effects are significantly
      reduced.
      
      The improvements are mainly expected to make a difference in high IPC
      code. With low IPC it should be similar.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: hpa@zytor.com
      Link: http://lkml.kernel.org/r/1448929689-13771-2-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      72469764
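The shadow effect in the two annotated listings of the cycles:ppp entry above can be put into numbers. The sketch below only re-uses the sqrtps percentages from those listings; the variable and function names are made up for illustration.

```python
# Sample percentages of the ten sqrtps instructions, in program order,
# copied from the two listings in the commit message.
OLD_PP = [0.14, 9.13, 11.58, 11.51, 6.27, 10.38, 12.20, 12.74, 5.40, 10.14]
NEW_PPP = [9.51, 11.74, 11.84, 6.05, 10.46, 12.25, 12.18, 5.26, 10.13, 10.43]


def first_vs_mean(samples):
    """Ratio of the first instruction's share to the average share."""
    mean = sum(samples) / len(samples)
    return samples[0] / mean


old_ratio = first_vs_mean(OLD_PP)
new_ratio = first_vs_mean(NEW_PPP)
print("cycles:pp  first/mean = %.2f" % old_ratio)
print("cycles:ppp first/mean = %.2f" % new_ratio)
```

A ratio near 1 means the instruction at the loop head gets about its fair share of samples. With the old event it gets almost none, and with PREC_DIST it is close to the average.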
    • A
      perf/x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake · 442f5c74
      Andi Kleen committed
      I added UOPS_RETIRED.ALL by mistake to the Skylake PEBS event list for
      cycles:pp. But the event is not documented for Skylake, and has some
      issues.
      
      The recommended replacement for cycles:pp is to use
      INST_RETIRED.ANY+pebs as a base, similar to what CPUs before Sandy
      Bridge did. This new event is called INST_RETIRED.TOTAL_CYCLES_PS. The
      event is not really new, but has been already used by perf before
      Sandy Bridge for the original cycles:p.
      
      Note the SDM doesn't document that event either, but it's being
      documented in the latest version of the event list on:
      
        https://download.01.org/perfmon/SKL
      
      This patch does:
      
       - Remove UOPS_RETIRED.ALL from the Skylake PEBS event list
      
       - Add INST_RETIRED.ANY to the Skylake PEBS event list, and a table entry to
         allow cmask=16,inv=1 for cycles:pp
      
       - We don't need an extra entry for the base INST_RETIRED event,
         because it is already covered by the catch-all PEBS table entry.
      
       - Switch Skylake to use the Core2 PEBS alias (which is
         INST_RETIRED.TOTAL_CYCLES_PS)
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: hpa@zytor.com
      Link: http://lkml.kernel.org/r/1448929689-13771-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      442f5c74
    • A
      perf/x86: Allow zero PEBS status with only single active event · 01330d72
      Andi Kleen committed
      Normally we drop PEBS events with a zero status field. But when
      there is only a single PEBS event active we can assume the
      PEBS record is for that event. The PEBS buffer is always flushed
      when PEBS events are disabled, so there is no risk of mishandling
      stale PEBS records this way.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1449177740-5422-2-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      01330d72
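The rule from the zero-status entry above can be written as a small decision function. This is a sketch of the idea only, with hypothetical names. It treats the set of enabled PEBS counters as a bitmask and uses the usual power-of-two test to detect "exactly one bit set".

```python
def resolve_pebs_status(status, pebs_enabled):
    """Return the status to use for a PEBS record.

    A zero status is normally dropped (stays 0). If exactly one PEBS
    counter is enabled, the record can only belong to that counter,
    so its bit is substituted.
    """
    exactly_one = pebs_enabled != 0 and (pebs_enabled & (pebs_enabled - 1)) == 0
    if status == 0 and exactly_one:
        return pebs_enabled
    return status


print(resolve_pebs_status(0, 0b0100))   # single active event: attributed to it
print(resolve_pebs_status(0, 0b0110))   # two active events: still ambiguous
```

A non-zero status is always passed through unchanged, so the substitution only affects records that would otherwise be ignored.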
    • A
      perf/x86: Remove warning for zero PEBS status · 957ea1fd
      Andi Kleen committed
      The recent commit:
      
        75f80859 ("perf/x86/intel/pebs: Robustify PEBS buffer drain")
      
      causes lots of warnings on different CPUs before Skylake
      when running PEBS intensive workloads.
      
      They can have a zero status field in the PEBS record when
      PEBS is racing with clearing of GLOBAL_STATUS.
      
      This also can cause hangs (it seems there are still
      problems with printk in NMI).
      
      Disable the warning, but still ignore the record.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1449177740-5422-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      957ea1fd
    • P
      perf/core: Collapse more IPI loops · 7b648018
      Peter Zijlstra committed
      This patch collapses the two 'hard' cases, which are
      perf_event_{dis,en}able().
      
      I cannot seem to convince myself the current code is correct.
      
      So starting with perf_event_disable(); we don't strictly need to test
      for event->state == ACTIVE, ctx->is_active is enough. If the event is
      not scheduled while the ctx is, __perf_event_disable() still does the
      right thing.  It's a little less efficient to IPI in that case, but
      over-all simpler.
      
      For perf_event_enable(); the same goes, but I think that's actually
      broken in its current form. The current condition is: ctx->is_active
      && event->state == OFF, that means it doesn't do anything when
      !ctx->is_active && event->state == OFF. This is wrong, it should still
      mark the event INACTIVE in that case, otherwise we'll still not try
      and schedule the event once the context becomes active again.
      
      This patch implements the two functions using the new
      event_function_call() and does away with the tricky event->state
      tests.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Alexander Shishkin <alexander.shishkin@intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7b648018
    • I
    • P
      perf: Fix race in swevent hash · 12ca6ad2
      Peter Zijlstra committed
      There's a race on CPU unplug where we free the swevent hash array
      while it can still have events on. This will result in a
      use-after-free which is BAD.
      
      Simply do not free the hash array on unplug. This leaves the thing
      around and no use-after-free takes place.
      
      When the last swevent dies, we do a for_each_possible_cpu() iteration
      anyway to clean these up, at which time we'll free it, so no leakage
      will occur.
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      12ca6ad2
    • P
      perf: Fix race in perf_event_exec() · c1274499
      Peter Zijlstra committed
      I managed to tickle this warning:
      
        [ 2338.884942] ------------[ cut here ]------------
        [ 2338.890112] WARNING: CPU: 13 PID: 35162 at ../kernel/events/core.c:2702 task_ctx_sched_out+0x6b/0x80()
        [ 2338.900504] Modules linked in:
        [ 2338.903933] CPU: 13 PID: 35162 Comm: bash Not tainted 4.4.0-rc4-dirty #244
        [ 2338.911610] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
        [ 2338.923071]  ffffffff81f1468e ffff8807c6457cb8 ffffffff815c680c 0000000000000000
        [ 2338.931382]  ffff8807c6457cf0 ffffffff810c8a56 ffffe8ffff8c1bd0 ffff8808132ed400
        [ 2338.939678]  0000000000000286 ffff880813170380 ffff8808132ed400 ffff8807c6457d00
        [ 2338.947987] Call Trace:
        [ 2338.950726]  [<ffffffff815c680c>] dump_stack+0x4e/0x82
        [ 2338.956474]  [<ffffffff810c8a56>] warn_slowpath_common+0x86/0xc0
        [ 2338.963195]  [<ffffffff810c8b4a>] warn_slowpath_null+0x1a/0x20
        [ 2338.969720]  [<ffffffff811a49cb>] task_ctx_sched_out+0x6b/0x80
        [ 2338.976244]  [<ffffffff811a62d2>] perf_event_exec+0xe2/0x180
        [ 2338.982575]  [<ffffffff8121fb6f>] setup_new_exec+0x6f/0x1b0
        [ 2338.988810]  [<ffffffff8126de83>] load_elf_binary+0x393/0x1660
        [ 2338.995339]  [<ffffffff811dc772>] ? get_user_pages+0x52/0x60
        [ 2339.001669]  [<ffffffff8121e297>] search_binary_handler+0x97/0x200
        [ 2339.008581]  [<ffffffff8121f8b3>] do_execveat_common.isra.33+0x543/0x6e0
        [ 2339.016072]  [<ffffffff8121fcea>] SyS_execve+0x3a/0x50
        [ 2339.021819]  [<ffffffff819fc165>] stub_execve+0x5/0x5
        [ 2339.027469]  [<ffffffff819fbeb2>] ? entry_SYSCALL_64_fastpath+0x12/0x71
        [ 2339.034860] ---[ end trace ee1337c59a0ddeac ]---
      
      Which is a WARN_ON_ONCE() indicating that cpuctx->task_ctx is not
      what we expected it to be.
      
      This is because context switches can swap the task_struct::perf_event_ctxp[]
      pointer around. Therefore you have to either disable preemption when looking
      at current, or hold ctx->lock.
      
      Fix perf_event_enable_on_exec(), it loads current->perf_event_ctxp[]
      before disabling interrupts, therefore a preemption in the right place
      can swap contexts around and we're using the wrong one.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Link: http://lkml.kernel.org/r/20151210195740.GG6357@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c1274499
  3. 18 Dec, 2015 10 commits
    • I
      Merge tag 'perf-core-for-mingo-3' of... · d64fe8e6
      Ingo Molnar committed
      Merge tag 'perf-core-for-mingo-3' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
      
      Pull new perf tool feature from Arnaldo Carvalho de Melo:
      
      " User visible changes:
      
        - Generate perf.data files from 'perf stat', to tap into the scripting
          capabilities perf has instead of defining a 'perf stat' specific scripting
          support to calculate event ratios, etc. Simple example:
      
          $ perf stat record -e cycles usleep 1
      
           Performance counter stats for 'usleep 1':
      
                 1,134,996      cycles
      
               0.000670644 seconds time elapsed
      
          $ perf stat report
      
           Performance counter stats for '/home/acme/bin/perf stat record -e cycles usleep 1':
      
                 1,134,996      cycles
      
               0.000670644 seconds time elapsed
      
          $
      
          It generates PERF_RECORD_ userspace records to store the details:
      
          $ perf report -D | grep PERF_RECORD
          0xf0 [0x28]: PERF_RECORD_THREAD_MAP nr: 1 thread: 27637
          0x118 [0x12]: PERF_RECORD_CPU_MAP nr: 1 cpu: 65535
          0x12a [0x40]: PERF_RECORD_STAT_CONFIG
          0x16a [0x30]: PERF_RECORD_STAT
          -1 -1 0x19a [0x40]: PERF_RECORD_MMAP -1/0: [0xffffffff81000000(0x1f000000) @ 0xffffffff81000000]: x [kernel.kallsyms]_text
          0x1da [0x18]: PERF_RECORD_STAT_ROUND
          [acme@ssdandy linux]$
      
          An effort was made to make perf.data files generated like this to not
          generate cryptic messages when processed by older tools.
      
          The 'perf script' bits need rebasing, will go up later.
      
        Jiri's cover letter for this series:
      
        The initial attempt defined its own formula lang and allowed triggering user's
        script on the end of the stat command:
      
          http://marc.info/?l=linux-kernel&m=136742146322273&w=2
      
        This patchset abandons the idea of new formula language and rather adds support
        to:
      
          - store stat data into perf.data file
          - add python support to process stat events
      
        Basically it allows to store stat data into perf.data and post process it with
        python scripts in a similar way we do for sampling data.
      
        The stat data are stored in new stat, stat-round, stat-config user events.
          stat        - stored for each read syscall of the counter
          stat round  - stored for each interval or end of the command invocation
          stat config - stores all the config information needed to process data
                        so report tool could restore the same output as record
      
        The python script can now define 'stat__<eventname>_<modifier>' functions
        to get stat events data and 'stat__interval' to get stat-round data.
      
        See CPI script example in scripts/python/stat-cpi.py."
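As a rough illustration of the handler naming convention described in the cover letter above, the sketch below defines two 'stat__<eventname>_<modifier>' handlers and a 'stat__interval' handler that computes cycles per instruction. It is a simplified stand-in, not the stat-cpi.py script itself. The argument order (cpu, thread, time, val, ena, run) is an assumption, and the counter values fed in at the end are made up.

```python
from collections import defaultdict

# (time, cpu, thread) -> {event name: counter value}
samples = defaultdict(dict)


def stat__cycles_u(cpu, thread, time, val, ena, run):
    samples[(time, cpu, thread)]["cycles"] = val


def stat__instructions_u(cpu, thread, time, val, ena, run):
    samples[(time, cpu, thread)]["instructions"] = val


def stat__interval(time):
    """Called once per stat-round; report CPI for everything seen at `time`."""
    results = {}
    for (t, cpu, thread), counts in samples.items():
        if t == time and "cycles" in counts and counts.get("instructions"):
            results[(cpu, thread)] = counts["cycles"] / counts["instructions"]
    for (cpu, thread), cpi in sorted(results.items()):
        print("%d: cpu %d, thread %d -> cpi %.2f" % (time, cpu, thread, cpi))
    return results


# Drive the handlers by hand with made-up numbers (perf would normally call them).
stat__cycles_u(0, 0, 1000, 1200, 1, 1)
stat__instructions_u(0, 0, 1000, 800, 1, 1)
cpi_by_key = stat__interval(1000)
```

The per-event handlers only store data; the ratio is computed in the interval handler because that is the point where all counters for the round have been delivered.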
      
      Also a few other changes:
      
      User visible changes:
      
        - Make command line options always available, even when they
          depend on some feature being enabled, warning the user about
          use of such options (Wang Nan)
      
        - Support --vmlinux in perf record, useful, so far, for eBPF,
          where we will set up events that will be used in the record
          session (He Kuang)
      
        - Automatically disable collecting branch flags and cycles with
          --call-graph lbr. This allows avoiding a bunch of extra MSR
          reads in the PMI on Skylake.  (Andi Kleen)
      
      Infrastructure changes:
      
        - Dump the stack when a 'perf test -v ' entry segfaults, so far we
          would have to run it under gdb with 'set follow-fork-mode child'
          set to get a proper backtrace (Arnaldo Carvalho de Melo)
      
        - Initialize the refcnt in 'struct thread' to 1 and fixup its
          users accordingly, so that we try to have the same refcount
          model across the perf codebase (Arnaldo Carvalho de Melo)
      
        - More prep work for moving the subcmd infrastructure out of
          tools/perf/ and into tools/lib/subcmd/ to be used by other
          tools/ living utilities (Josh Poimboeuf)
      
        - Fix 'perf test' hist testcases when kptr_restrict is on (Namhyung Kim)
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d64fe8e6
    • I
      Merge branch 'perf/urgent' into perf/core, to make sure a cherry-picked commit... · 141a361e
      Ingo Molnar committed
      Merge branch 'perf/urgent' into perf/core, to make sure a cherry-picked commit does not create conflicts
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      141a361e
    • I
      Merge tag 'perf-urgent-for-mingo' of... · 2d2e7ac1
      Ingo Molnar committed
      Merge tag 'perf-urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
      
      Pull perf/urgent tooling fix from Arnaldo Carvalho de Melo:
      
        User visible changes:
      
          - Fix 'perf list' segfault due to lack of support for PERF_COUNT_SW_BPF_OUTPUT
            in an array used just for printing available events, robustify the code
            involved (Arnaldo Carvalho de Melo)
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2d2e7ac1
    • I
      Merge tag 'perf-core-for-mingo-2.1' of... · b21daaed
      Ingo Molnar committed
      Merge tag 'perf-core-for-mingo-2.1' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
      
      Pull perf/core improvements from Arnaldo Carvalho de Melo:
      
      User visible changes:
      
        - Add record.build-id config option to 'perf record', to allow configuring
          in the ~/.perfconfig file if and how build-ids should be processed, allowing
          a permanent setting for options such as -B and -N: (Namhyung Kim)
      
          $ perf record -h -B -N
      
           Usage: perf record [<options>] [<command>]
              or: perf record [<options>] -- <command> [<options>]
      
              -B, --no-buildid       do not collect buildids in perf.data
              -N, --no-buildid-cache do not update the buildid cache
      
          $
      
      Infrastructure changes:
      
        - Move code for options parsing and subcommand handling from tools/perf/
          to tools/lib/subcmd/, so that it can be used by other tools/ living
          utilities (Josh Poimboeuf)
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b21daaed
    • J
      perf stat report: Allow to override aggr_mode · 89af4e05
      Jiri Olsa committed
      Allow overriding the recorded aggr_mode. It's possible to use perf stat
      like:
      
         $ perf stat report -A
         $ perf stat report --per-core
         $ perf stat report --per-socket
      
      To customize the recorded aggregate mode regardless of what was used during
      the stat record command.
      Reported-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1446734469-11352-19-git-send-email-jolsa@kernel.org
      [ Renamed 'stat' parameter to 'st' to fix 'already defined' build error with older distros (e.g. RHEL6.7) ]
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      89af4e05
    • J
      perf stat report: Process event update events · fa6ea781
      Jiri Olsa committed
      Adding processing of event update events, so perf stat report can store
      additional info for events - unit,scale,name.
      
      Committer note:
      
      Before:
      
        # perf stat record -e power/energy-cores/ -a
        ^C
        Performance counter stats for 'system wide':
      
                   77.41 Joules power/energy-cores/
      
             1.597176695 seconds time elapsed
      
        # perf stat report
      
        Performance counter stats for '/home/acme/bin/perf stat record -e power/energy-cores/ -a':
      
         332,488,114,176      power/energy-cores/
      
             1.597176695 seconds time elapsed
      
        #
      
      After, using the same perf.data file generated in the "Before" case
      above:
      
        # perf stat report
      
        Performance counter stats for '/home/acme/bin/perf stat record -e power/energy-cores/ -a':
      
                   77.41 Joules power/energy-cores/
      
             1.597176695 seconds time elapsed
      
        #
      Reported-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1446734469-11352-17-git-send-email-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      fa6ea781
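The "Before" and "After" numbers in the event update entry above are related by the event's scale factor. The short check below assumes a scale of 2^-32 Joules per count, which is the value commonly exported for RAPL energy events but is not stated in the commit message. The names are hypothetical.

```python
RAW_COUNT = 332488114176      # raw value printed by 'perf stat report' before the patch
ASSUMED_SCALE = 2.0 ** -32    # assumption: scale carried by the event update event
ASSUMED_UNIT = "Joules"       # unit carried by the event update event


def scaled(raw, scale):
    """Apply the per-event scale to a raw counter value."""
    return raw * scale


joules = scaled(RAW_COUNT, ASSUMED_SCALE)
print("%.2f %s" % (joules, ASSUMED_UNIT))
```

Under that assumption the raw count reproduces the 77.41 Joules shown at record time, which suggests the report side was simply missing the unit and scale until it learned to process the update events.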
    • J
      perf stat report: Process stat and stat round events · a56f9390
      Jiri Olsa committed
      Adding processing of stat and stat round events.
      
      The stat data come in stat events, using generic function
      process_stat_round_event to store data under perf_evsel object.
      
      The stat-round events come each interval or as the last event in non
      interval mode. The function process_stat_round_event processes stored data
      for each perf_evsel object and prints it out.
      
      Committer note:
      
      After this patch:
      
        $ perf stat record usleep 1
      
         Performance counter stats for 'usleep 1':
      
              0.498381  task-clock (msec)       #    0.571 CPUs utilized
                     2  context-switches        #    0.004 M/sec
                     0  cpu-migrations          #    0.000 K/sec
                   149  page-faults             #    0.299 M/sec
             1,271,635  cycles                  #    2.552 GHz
               928,712  stalled-cycles-frontend #   73.03% frontend cycles idle
               663,286  stalled-cycles-backend  #   52.16% backend  cycles idle
               792,614  instructions            #    0.62  insns per cycle
                                                #    1.17  stalled cycles per insn
               136,850  branches                #  274.589 M/sec
         <not counted>  branch-misses            (0.00%)
      
           0.000873419 seconds time elapsed
      
        $
        $ perf stat report
      
         Performance counter stats for '/home/acme/bin/perf stat record usleep 1':
      
              0.498381  task-clock (msec)       #    0.571 CPUs utilized
                     2  context-switches        #    0.004 M/sec
                     0  cpu-migrations          #    0.000 K/sec
                   149  page-faults             #    0.299 M/sec
             1,271,635  cycles                  #    2.552 GHz
               928,712  stalled-cycles-frontend #   73.03% frontend cycles idle
               663,286  stalled-cycles-backend  #   52.16% backend  cycles idle
               792,614  instructions            #    0.62  insns per cycle
                                                #    1.17  stalled cycles per insn
               136,850  branches                #  274.589 M/sec
         <not counted>  branch-misses            (0.00%)
      
           0.000873419 seconds time elapsed
      
        $
      Reported-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1446734469-11352-16-git-send-email-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      a56f9390
    • J
      perf stat report: Move csv_sep initialization before report command · 6edb78a2
      Jiri Olsa committed
      So we have csv_sep properly initialized before report command leg.
      Reported-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1446734469-11352-18-git-send-email-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      6edb78a2
    • J
      perf stat report: Add support to initialize aggr_map from file · 68d702f7
      Jiri Olsa committed
      Using perf.data's perf_env data to initialize aggregate config.
      Reported-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1446734469-11352-15-git-send-email-jolsa@kernel.org
      [ s/stat/st/g, s/socket/socket_id/g to fix 'already defined' build error with older distros (e.g. RHEL6.7) ]
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      68d702f7
    • J
      perf stat report: Process stat config event · 62ba18ba
      Jiri Olsa committed
      Adding processing of stat config event and initialize stat_config
      object.
      Reported-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1446734469-11352-14-git-send-email-jolsa@kernel.org
      [ Renamed 'stat' parameter to 'st' to fix 'already defined' build error with older distros (e.g. RHEL6.7) ]
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      62ba18ba