1. 06 May, 2020 (1 commit)
  2. 30 Apr, 2020 (13 commits)
• perf vendor events power9: Add hv_24x7 socket/chip level metric events · 354575c0
  Authored by Kajol Jain
The hv_24x7 feature in IBM® POWER9™ processor-based servers provides the
facility to continuously collect large numbers of hardware performance
metrics efficiently and accurately.

This patch adds an hv_24x7 metric file for the different socket/chip
resources.
      
      Result:
      
      power9 platform:
      
        command:# ./perf stat --metric-only -M Memory_RD_BW_Chip -C 0 -I 1000
      
           1.000096188          0.9           0.3
           2.000285720          0.5           0.1
           3.000424990          0.4           0.1
      
        command:# ./perf stat --metric-only -M PowerBUS_Frequency -C 0 -I 1000
      
           1.000097981          2.3           2.3
           2.000291713          2.3           2.3
           3.000421719          2.3           2.3
           4.000550912          2.3           2.3
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Anju T Sudhakar <anju@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jin Yao <yao.jin@linux.intel.com>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Cc: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michael Petlan <mpetlan@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lore.kernel.org/lkml/20200401203340.31402-8-kjain@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      354575c0
• perf tools: Enable Hz/hz printing for --metric-only option · 3351c6da
  Authored by Kajol Jain
Commit 54b50916 ("perf stat: Implement --metric-only mode") added the
function 'valid_only_metric()', which drops "Hz" or "hz" if it is part
of the "ScaleUnit". This patch enables Hz/hz printing, since hv_24x7
supports a couple of frequency events.
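
For context, a minimal sketch of what that filter looks like. The exact
pre-patch body is an assumption (see valid_only_metric() in perf's stat
display code for the real thing); the gist is that units containing
"Hz"/"hz" were rejected for --metric-only output, and this patch drops
that rejection:

  /* sketch only: assumed pre-patch shape of the --metric-only unit filter */
  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  static bool valid_only_metric(const char *unit)
  {
          if (!unit)
                  return false;
          if (strstr(unit, "/sec") ||
              strstr(unit, "hz") ||          /* dropped by this patch */
              strstr(unit, "Hz") ||          /* dropped by this patch */
              strstr(unit, "CPUs utilized"))
                  return false;
          return true;
  }

  int main(void)
  {
          /* a GHz-style unit: rejected before the patch, accepted after */
          printf("%d\n", valid_only_metric("GHz"));
          return 0;
  }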
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Anju T Sudhakar <anju@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jin Yao <yao.jin@linux.intel.com>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Cc: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michael Petlan <mpetlan@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lore.kernel.org/lkml/20200401203340.31402-7-kjain@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      3351c6da
• perf tests expr: Added test for runtime param in metric expression · 9022608e
  Authored by Kajol Jain
Added a test case for parsing "?" in metric expressions.
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Anju T Sudhakar <anju@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jin Yao <yao.jin@linux.intel.com>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Cc: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michael Petlan <mpetlan@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lore.kernel.org/lkml/20200401203340.31402-6-kjain@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      9022608e
• perf metricgroups: Enhance JSON/metric infrastructure to handle "?" · 1e1a873d
  Authored by Kajol Jain
This patch enhances the current metric infrastructure to handle "?" in a
metric expression. The "?" can be used for parameters whose value is not
known while creating the metric events and which can be replaced later,
at runtime, with the proper value. It also adds the flexibility to create
multiple events out of a single metric event added in the JSON file.

The patch adds the function 'arch_get_runtimeparam', an arch-specific
function that returns the number of metric events that need to be
created. By default it returns 1.

This infrastructure is needed for the hv_24x7 socket/chip level events.
"hv_24x7" chip level events need the specific chip-id for which the data
is requested. 'arch_get_runtimeparam' is implemented in header.c, where
it extracts the number of sockets from the sysfs file "sockets" under
"/sys/devices/hv_24x7/interface/".

With this patch we basically create as many metric events as defined by
the runtime parameter.

For that, a loop is added in 'metricgroup__add_metric' which creates
multiple events at runtime, depending on the return value of
'arch_get_runtimeparam', and merges those events into 'group_list'.

To achieve that, the parameter value is passed via the `expr__find_other`
function and the "?" in the metric expression is replaced with this
value.

So in the JSON file there is a single metric event, out of which multiple
events are created.

To understand which count belongs to which parameter value, the parameter
value is also printed in the generic_metric function.
      
      For example,
      
        command:# ./perf stat  -M PowerBUS_Frequency -C 0 -I 1000
          1.000101867  9,356,933  hv_24x7/pm_pb_cyc,chip=0/ #  2.3 GHz  PowerBUS_Frequency_0
          1.000101867  9,366,134  hv_24x7/pm_pb_cyc,chip=1/ #  2.3 GHz  PowerBUS_Frequency_1
          2.000314878  9,365,868  hv_24x7/pm_pb_cyc,chip=0/ #  2.3 GHz  PowerBUS_Frequency_0
          2.000314878  9,366,092  hv_24x7/pm_pb_cyc,chip=1/ #  2.3 GHz  PowerBUS_Frequency_1
      
So here, the _0 and _1 suffixes after PowerBUS_Frequency indicate the parameter value (the chip).
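
To make the flow above concrete, here is a small, self-contained sketch of
the "?" expansion. It is illustrative only: the helpers below are
hypothetical stand-ins for what arch_get_runtimeparam() and the "?"
substitution via expr__find_other() do in the real code, and the example
expression string is made up:

  /* Sketch: expand one JSON metric containing "?" into one event per
   * runtime value (e.g. one per chip).  Not the actual perf code. */
  #include <stdio.h>
  #include <string.h>

  static int get_runtime_param_count(void)      /* hypothetical stand-in */
  {
          return 2;                             /* e.g. two chips/sockets */
  }

  static void expand_metric(const char *name, const char *expr_with_qmark)
  {
          int count = get_runtime_param_count();

          for (int i = 0; i < count; i++) {
                  char expr[256];
                  const char *q = strchr(expr_with_qmark, '?');

                  if (q)  /* substitute the runtime value for "?" */
                          snprintf(expr, sizeof(expr), "%.*s%d%s",
                                   (int)(q - expr_with_qmark),
                                   expr_with_qmark, i, q + 1);
                  else
                          snprintf(expr, sizeof(expr), "%s", expr_with_qmark);

                  printf("%s_%d -> %s\n", name, i, expr);
          }
  }

  int main(void)
  {
          /* made-up expression; the real one lives in the power9 JSON files */
          expand_metric("PowerBUS_Frequency", "hv_24x7@pm_pb_cyc,chip=?@");
          return 0;
  }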
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Anju T Sudhakar <anju@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jin Yao <yao.jin@linux.intel.com>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Cc: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michael Petlan <mpetlan@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lore.kernel.org/lkml/20200401203340.31402-5-kjain@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      1e1a873d
• perf pmu: Fix function name in comment, it's get_cpuid_str(), not get_cpustr() · 454a8be0
  Authored by Shaokun Zhang
get_cpuid_str() is used in tools/perf/arch/xxx/util/header.c; fix the
name in the comment.
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
      Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lore.kernel.org/lkml/1588141992-48382-1-git-send-email-zhangshaokun@hisilicon.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      454a8be0
• perf report: Fix warning assignment of 0/1 to bool variable · 6fa9c3e7
  Authored by Zou Wei
      Fixes coccicheck warning:
      
        tools/perf/builtin-report.c:1403:2-34: WARNING: Assignment of 0/1 to bool variable
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/1587904683-3510-1-git-send-email-zou_wei@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      6fa9c3e7
• perf tools: Remove unneeded semicolons · 8284bbea
  Authored by Zou Wei
      Fixes coccicheck warnings:
      
        tools/perf/builtin-diff.c:1565:2-3: Unneeded semicolon
        tools/perf/builtin-lock.c:778:2-3: Unneeded semicolon
        tools/perf/builtin-mem.c:126:2-3: Unneeded semicolon
        tools/perf/util/intel-pt-decoder/intel-pt-pkt-decoder.c:555:2-3: Unneeded semicolon
        tools/perf/util/ordered-events.c:317:2-3: Unneeded semicolon
        tools/perf/util/synthetic-events.c:1131:2-3: Unneeded semicolon
        tools/perf/util/trace-event-read.c:78:2-3: Unneeded semicolon
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/1588065523-71423-1-git-send-email-zou_wei@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      8284bbea
• perf c2c: Remove unneeded semicolon · 2cca512a
  Authored by Zou Wei
      Fixes coccicheck warnings:
      
       tools/perf/builtin-c2c.c:1712:2-3: Unneeded semicolon
       tools/perf/builtin-c2c.c:1928:2-3: Unneeded semicolon
       tools/perf/builtin-c2c.c:2962:2-3: Unneeded semicolon
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/1588064336-70456-1-git-send-email-zou_wei@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      2cca512a
• libtraceevent: Remove unneeded semicolon · eebe80c9
  Authored by Zou Wei
      Fixes coccicheck warning:
      
       tools/lib/traceevent/kbuffer-parse.c:441:2-3: Unneeded semicolon
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: http://lore.kernel.org/lkml/1588065121-71236-1-git-send-email-zou_wei@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      eebe80c9
• perf script: Remove extraneous newline in perf_sample__fprintf_regs() · fad1f1e7
  Authored by Stephane Eranian
When printing iregs, a double newline was printed because
perf_sample__fprintf_regs() was printing its own newline and then, at the
end of all fields, perf script was adding another. This was causing blank
lines in the output:
      
      Before:
      
        $ perf script -Fip,iregs
                   401b8d ABI:2    DX:0x100    SI:0x4a8340    DI:0x4a9340
      
                   401b8d ABI:2    DX:0x100    SI:0x4a9340    DI:0x4a8340
      
                   401b8d ABI:2    DX:0x100    SI:0x4a8340    DI:0x4a9340
      
                   401b8d ABI:2    DX:0x100    SI:0x4a9340    DI:0x4a8340
      
      After:
      
        $ perf script -Fip,iregs
                   401b8d ABI:2    DX:0x100    SI:0x4a8340    DI:0x4a9340
                   401b8d ABI:2    DX:0x100    SI:0x4a9340    DI:0x4a8340
                   401b8d ABI:2    DX:0x100    SI:0x4a8340    DI:0x4a9340
      
      Committer testing:
      
      First we need to figure out how to request that registers be recorded,
      so we use:
      
        # perf record -h reg
      
         Usage: perf record [<options>] [<command>]
            or: perf record [<options>] -- <command> [<options>]
      
            -I, --intr-regs[=<any register>]
                                  sample selected machine registers on interrupt, use '-I?' to list register names
                --buildid-all     Record build-id of all DSOs regardless of hits
                --user-regs[=<any register>]
                                  sample selected machine registers on interrupt, use '--user-regs=?' to list register names
      
        #
      
Ok, now let's ask for them all:
      
        # perf record -a --intr-regs --user-regs sleep 1
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 4.105 MB perf.data (2760 samples) ]
        #
      
Let's look at the first 6 output lines:
      
        # perf script -Fip,iregs | head -6
         ffffffff8a06f2f4 ABI:2    AX:0xffffd168fee0a980    BX:0xffff8a23b087f000    CX:0xfffeb69aaeb25d73    DX:0xffff8a253e8310f0    SI:0xfffffff9bafe7359    DI:0xffffb1690204fb10    BP:0xffffd168fee0a950    SP:0xffffb1690204fb88    IP:0xffffffff8a06f2f4 FLAGS:0x4e    CS:0x10    SS:0x18    R8:0x1495f0a91129a    R9:0xffff8a23b087f000   R10:0x1   R11:0xffffffff   R12:0x0   R13:0xffff8a253e827e00   R14:0xffffd168fee0aa5c   R15:0xffffd168fee0a980
      
         ffffffff8a06f2f4 ABI:2    AX:0x0    BX:0xffffd168fee0a950    CX:0x5684cc1118491900    DX:0x0    SI:0xffffd168fee0a9d0    DI:0x202    BP:0xffffb1690204fd70    SP:0xffffb1690204fd20    IP:0xffffffff8a06f2f4 FLAGS:0x24e    CS:0x10    SS:0x18    R8:0x0    R9:0xffffd168fee0a9d0   R10:0x1   R11:0xffffffff   R12:0xffffffff8a23e480   R13:0xffff8a23b087f240   R14:0xffff8a23b087f000   R15:0xffffd168fee0a950
      
         ffffffff8a06f2f4 ABI:2    AX:0x0    BX:0x0    CX:0x7f25f334335b    DX:0x0    SI:0x2400    DI:0x4    BP:0x7fff5f264570    SP:0x7fff5f264538    IP:0xffffffff8a06f2f4 FLAGS:0x24e    CS:0x10    SS:0x2b    R8:0x0    R9:0x2312d20   R10:0x0   R11:0x246   R12:0x22cc0e0   R13:0x0   R14:0x0   R15:0x22d0780
      
        #
      
      Reproduced, apply the patch and:
      
      [root@five ~]# perf script -Fip,iregs | head -6
       ffffffff8a06f2f4 ABI:2    AX:0xffffd168fee0a980    BX:0xffff8a23b087f000    CX:0xfffeb69aaeb25d73    DX:0xffff8a253e8310f0    SI:0xfffffff9bafe7359    DI:0xffffb1690204fb10    BP:0xffffd168fee0a950    SP:0xffffb1690204fb88    IP:0xffffffff8a06f2f4 FLAGS:0x4e    CS:0x10    SS:0x18    R8:0x1495f0a91129a    R9:0xffff8a23b087f000   R10:0x1   R11:0xffffffff   R12:0x0   R13:0xffff8a253e827e00   R14:0xffffd168fee0aa5c   R15:0xffffd168fee0a980
       ffffffff8a06f2f4 ABI:2    AX:0x0    BX:0xffffd168fee0a950    CX:0x5684cc1118491900    DX:0x0    SI:0xffffd168fee0a9d0    DI:0x202    BP:0xffffb1690204fd70    SP:0xffffb1690204fd20    IP:0xffffffff8a06f2f4 FLAGS:0x24e    CS:0x10    SS:0x18    R8:0x0    R9:0xffffd168fee0a9d0   R10:0x1   R11:0xffffffff   R12:0xffffffff8a23e480   R13:0xffff8a23b087f240   R14:0xffff8a23b087f000   R15:0xffffd168fee0a950
       ffffffff8a06f2f4 ABI:2    AX:0x0    BX:0x0    CX:0x7f25f334335b    DX:0x0    SI:0x2400    DI:0x4    BP:0x7fff5f264570    SP:0x7fff5f264538    IP:0xffffffff8a06f2f4 FLAGS:0x24e    CS:0x10    SS:0x2b    R8:0x0    R9:0x2312d20   R10:0x0   R11:0x246   R12:0x22cc0e0   R13:0x0   R14:0x0   R15:0x22d0780
       ffffffff8a24074b ABI:2    AX:0xcb    BX:0xcb    CX:0x0    DX:0x0    SI:0xffffb1690204ff58    DI:0xcb    BP:0xffffb1690204ff58    SP:0xffffb1690204ff40    IP:0xffffffff8a24074b FLAGS:0x24e    CS:0x10    SS:0x18    R8:0x0    R9:0x0   R10:0x0   R11:0x0   R12:0x0   R13:0x0   R14:0x0   R15:0x0
       ffffffff8a310600 ABI:2    AX:0x0    BX:0xffffffff8b8c39a0    CX:0x0    DX:0xffff8a2503890300    SI:0xffffb1690204ff20    DI:0xffff8a23e4080000    BP:0xffff8a23e4080000    SP:0xffffb1690204fec0    IP:0xffffffff8a310600 FLAGS:0x28e    CS:0x10    SS:0x18    R8:0x0    R9:0x0   R10:0x0   R11:0x0   R12:0xffffffffffffffea   R13:0xffff8a23e4080020   R14:0x0   R15:0x0
       ffffffff8a11b688 ABI:2    AX:0x0    BX:0xffff8a237b7c8800    CX:0xffffb1690204fae0    DX:0x78    SI:0xffff8a237b7c8800    DI:0xffffb1690204fa10    BP:0xffffb1690204fb00    SP:0xffffb1690204fa00    IP:0xffffffff8a11b688 FLAGS:0x8a    CS:0x10    SS:0x18    R8:0x1495f0a917eba    R9:0xffffd168fde19a48   R10:0xffffb1690204fd98   R11:0xffff8a253e82afb0   R12:0xffff8a237b7c8800   R13:0xffffb1690204fb00   R14:0x0   R15:0xffff8a237b7c8800
      [root@five ~]#
      
To see it more clearly, let's get just two of those registers per sample:
      
        # perf record -a --intr-regs=ax,bx --user-regs=cx,dx sleep 1
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 3.502 MB perf.data (1653 samples) ]
        #
      
Extra info: let's see what gets set up in that 'struct perf_event_attr':
      
        # perf evlist -v
        cycles: size: 120, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|REGS_USER|REGS_INTR, read_format: ID, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 2, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1, sample_regs_user: 0xc, sample_regs_intr: 0x3
        #
      
Cool, some PERF_SAMPLE_REGS_USER|PERF_SAMPLE_REGS_INTR +
attr.sample_regs_user and attr.sample_regs_intr register masks. Now let's
see if those newlines are gone, in a more compact fashion:
      
        # perf script -Fip,iregs,uregs
         ffffffff8a56df78 ABI:2    AX:0xffff8a25137b6028    BX:0xffff8a2502f18000  ABI:2    CX:0x7f204460e49b    DX:0xf42920
         ffffffff8a56df78 ABI:2    AX:0xffff8a25137b6028    BX:0xffff8a2502f18000  ABI:2    CX:0x7f204460e49b    DX:0xf42920
         ffffffff8a56df78 ABI:2    AX:0xffff8a25137b6028    BX:0xffff8a2502f18000  ABI:2    CX:0x7f204460e49b    DX:0xf42920
         ffffffff8a56df78 ABI:2    AX:0xffff8a25137b6028    BX:0xffff8a2502f18000  ABI:2    CX:0x7f204460e49b    DX:0xf42920
         ffffffff8a56df78 ABI:2    AX:0xffff8a25137b6028    BX:0xffff8a2502f18000  ABI:2    CX:0x7f204460e49b    DX:0xf42920
         ffffffff8a56df78 ABI:2    AX:0xffff8a25137b6028    BX:0xffff8a2502f18000  ABI:2    CX:0x7f204460e49b    DX:0xf42920
         ffffffff8a29b78d ABI:2    AX:0x2a20ffcd6000    BX:0x2ec7d9000  ABI:2    CX:0x7f204460e49b    DX:0xf42920
        #
      
      And where was that?
      
        # perf script -Fip,iregs,uregs,sym,dso
         ffffffff8a56df78 strrchr (/lib/modules/5.7.0-rc2/build/vmlinux) ABI:2    AX:0xffff8a25137b6028    BX:0xffff8a2502f18000  ABI:2    CX:0x7f204460e49b    DX:0xf42920
         ffffffff8a56df78 strrchr (/lib/modules/5.7.0-rc2/build/vmlinux) ABI:2    AX:0xffff8a25137b6028    BX:0xffff8a2502f18000  ABI:2    CX:0x7f204460e49b    DX:0xf42920
         ffffffff8a56df78 strrchr (/lib/modules/5.7.0-rc2/build/vmlinux) ABI:2    AX:0xffff8a25137b6028    BX:0xffff8a2502f18000  ABI:2    CX:0x7f204460e49b    DX:0xf42920
         ffffffff8a56df78 strrchr (/lib/modules/5.7.0-rc2/build/vmlinux) ABI:2    AX:0xffff8a25137b6028    BX:0xffff8a2502f18000  ABI:2    CX:0x7f204460e49b    DX:0xf42920
         ffffffff8a56df78 strrchr (/lib/modules/5.7.0-rc2/build/vmlinux) ABI:2    AX:0xffff8a25137b6028    BX:0xffff8a2502f18000  ABI:2    CX:0x7f204460e49b    DX:0xf42920
         ffffffff8a56df78 strrchr (/lib/modules/5.7.0-rc2/build/vmlinux) ABI:2    AX:0xffff8a25137b6028    BX:0xffff8a2502f18000  ABI:2    CX:0x7f204460e49b    DX:0xf42920
         ffffffff8a29b78d __vma_link_rb (/lib/modules/5.7.0-rc2/build/vmlinux) ABI:2    AX:0x2a20ffcd6000    BX:0x2ec7d9000  ABI:2    CX:0x7f204460e49b    DX:0xf42920
        #
Signed-off-by: Stephane Eranian <eranian@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20200418231908.152212-1-eranian@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      fad1f1e7
• perf synthetic events: Remove use of sscanf from /proc reading · 2069425e
  Authored by Ian Rogers
      The synthesize benchmark, run on a single process and thread, shows
      perf_event__synthesize_mmap_events as the hottest function with fgets
      and sscanf taking the majority of execution time.
      
      fscanf performs similarly well. Replace the scanf call with manual
      reading of each field of the /proc/pid/maps line, and remove some
      unnecessary buffering.
      
      This change also addresses potential, but unlikely, buffer overruns for
      the string values read by scanf.
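
As an illustration of the manual parsing (this is not the actual perf
code, which also moves to the buffered io reader added in the next
patch), a stand-alone sketch of walking one /proc/<pid>/maps line field
by field instead of handing it to sscanf:

  /* Illustrative only: parse start, end, perms, offset, dev, inode and
   * pathname from a maps line without sscanf. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static void parse_maps_line(const char *line)
  {
          char *end;
          unsigned long long start, vm_end, pgoff;
          char prot[5] = {0};
          const char *path;

          start  = strtoull(line, &end, 16);      /* "start-"    */
          vm_end = strtoull(end + 1, &end, 16);   /* "end "      */
          memcpy(prot, end + 1, 4);               /* "r-xp"      */
          pgoff  = strtoull(end + 6, &end, 16);   /* file offset */
          strtoull(end + 1, &end, 16);            /* dev major   */
          strtoull(end + 1, &end, 16);            /* dev minor   */
          strtoull(end + 1, &end, 10);            /* inode       */
          path = end + strspn(end, " ");          /* pathname    */

          printf("%llx-%llx %s %llx %s\n", start, vm_end, prot, pgoff, path);
  }

  int main(void)
  {
          parse_maps_line("559f1bf01000-559f1bf23000 r-xp 00000000 fd:01 123456 /usr/bin/cat");
          return 0;
  }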
      
      Performance before is:
      
        $ sudo perf bench internals synthesize -m 16 -M 16 -s -t
# Running 'internals/synthesize' benchmark:
        Computing performance of single threaded perf event synthesis by
        synthesizing events on the perf process itself:
          Average synthesis took: 102.810 usec (+- 0.027 usec)
          Average num. events: 17.000 (+- 0.000)
          Average time per event 6.048 usec
          Average data synthesis took: 106.325 usec (+- 0.018 usec)
          Average num. events: 89.000 (+- 0.000)
          Average time per event 1.195 usec
        Computing performance of multi threaded perf event synthesis by
        synthesizing events on CPU 0:
          Number of synthesis threads: 16
            Average synthesis took: 68103.100 usec (+- 441.234 usec)
            Average num. events: 30703.000 (+- 0.730)
            Average time per event 2.218 usec
      
      And after is:
      
        $ sudo perf bench internals synthesize -m 16 -M 16 -s -t
# Running 'internals/synthesize' benchmark:
        Computing performance of single threaded perf event synthesis by
        synthesizing events on the perf process itself:
          Average synthesis took: 50.388 usec (+- 0.031 usec)
          Average num. events: 17.000 (+- 0.000)
          Average time per event 2.964 usec
          Average data synthesis took: 52.693 usec (+- 0.020 usec)
          Average num. events: 89.000 (+- 0.000)
          Average time per event 0.592 usec
        Computing performance of multi threaded perf event synthesis by
        synthesizing events on CPU 0:
          Number of synthesis threads: 16
            Average synthesis took: 45022.400 usec (+- 552.740 usec)
            Average num. events: 30624.200 (+- 10.037)
            Average time per event 1.470 usec
      
On an Intel Xeon 6154, compiling with Debian gcc 9.2.1.
      
      Committer testing:
      
On an AMD Ryzen 5 3600X 6-Core Processor:
      
      Before:
      
        # perf bench internals synthesize --min-threads 12 --max-threads 12 --st --mt
        # Running 'internals/synthesize' benchmark:
        Computing performance of single threaded perf event synthesis by
        synthesizing events on the perf process itself:
          Average synthesis took: 267.491 usec (+- 0.176 usec)
          Average num. events: 56.000 (+- 0.000)
          Average time per event 4.777 usec
          Average data synthesis took: 277.257 usec (+- 0.169 usec)
          Average num. events: 287.000 (+- 0.000)
          Average time per event 0.966 usec
        Computing performance of multi threaded perf event synthesis by
        synthesizing events on CPU 0:
          Number of synthesis threads: 12
            Average synthesis took: 81599.500 usec (+- 346.315 usec)
            Average num. events: 36096.100 (+- 2.523)
            Average time per event 2.261 usec
        #
      
      After:
      
        # perf bench internals synthesize --min-threads 12 --max-threads 12 --st --mt
        # Running 'internals/synthesize' benchmark:
        Computing performance of single threaded perf event synthesis by
        synthesizing events on the perf process itself:
          Average synthesis took: 110.125 usec (+- 0.080 usec)
          Average num. events: 56.000 (+- 0.000)
          Average time per event 1.967 usec
          Average data synthesis took: 118.518 usec (+- 0.057 usec)
          Average num. events: 287.000 (+- 0.000)
          Average time per event 0.413 usec
        Computing performance of multi threaded perf event synthesis by
        synthesizing events on CPU 0:
          Number of synthesis threads: 12
            Average synthesis took: 43490.700 usec (+- 284.527 usec)
            Average num. events: 37028.500 (+- 0.563)
            Average time per event 1.175 usec
        #
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrey Zhizhikin <andrey.z@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lore.kernel.org/lkml/20200415054050.31645-4-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      2069425e
• tools api: Add a lightweight buffered reading api · e95770af
  Authored by Ian Rogers
      The synthesize benchmark shows the majority of execution time going to
      fgets and sscanf, necessary to parse /proc/pid/maps. Add a new buffered
      reading library that will be used to replace these calls in a follow-up
      CL. Add tests for the library to perf test.
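
A stand-alone sketch of the kind of reader this adds: a small fixed
buffer refilled with read(2) plus a cheap per-character accessor, so
callers can parse /proc files without fgets()/sscanf(). The struct and
helper names below are simplified stand-ins rather than the actual
tools/lib/api interface:

  /* Illustrative buffered reader; not the real tools/lib/api/io code. */
  #include <fcntl.h>
  #include <unistd.h>

  struct buf_io {
          int  fd;
          char buf[4096];
          int  len;       /* valid bytes in buf    */
          int  pos;       /* next byte to hand out */
  };

  static void buf_io_init(struct buf_io *io, int fd)
  {
          io->fd = fd;
          io->len = io->pos = 0;
  }

  /* Return the next byte, or -1 on EOF/error, refilling as needed. */
  static int buf_io_get_char(struct buf_io *io)
  {
          if (io->pos >= io->len) {
                  ssize_t n = read(io->fd, io->buf, sizeof(io->buf));

                  if (n <= 0)
                          return -1;
                  io->len = (int)n;
                  io->pos = 0;
          }
          return (unsigned char)io->buf[io->pos++];
  }

  int main(void)
  {
          struct buf_io io;
          int c;

          buf_io_init(&io, open("/proc/self/maps", O_RDONLY));
          while ((c = buf_io_get_char(&io)) != -1 && c != '\n')
                  ;       /* consume the first maps line, byte by byte */
          return 0;
  }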
      
      Committer tests:
      
        $ perf test api
        63: Test api io                                           : Ok
        $
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrey Zhizhikin <andrey.z@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lore.kernel.org/lkml/20200415054050.31645-3-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      e95770af
• perf bench: Add a multi-threaded synthesize benchmark · 13edc237
  Authored by Ian Rogers
      By default this isn't run as it reads /proc and may not have access.
      For consistency, modify the single threaded benchmark to compute an
      average time per event.
      
      Committer testing:
      
        $ grep -m1 "model name" /proc/cpuinfo
        model name	: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
        $ grep "model name" /proc/cpuinfo  | wc -l
        8
        $
        $ perf bench internals synthesize -h
        # Running 'internals/synthesize' benchmark:
      
         Usage: perf bench internals synthesize <options>
      
            -I, --multi-iterations <n>
                                  Number of iterations used to compute multi-threaded average
            -i, --single-iterations <n>
                                  Number of iterations used to compute single-threaded average
            -M, --max-threads <n>
                                  Maximum number of threads in multithreaded bench
            -m, --min-threads <n>
                                  Minimum number of threads in multithreaded bench
            -s, --st              Run single threaded benchmark
            -t, --mt              Run multi-threaded benchmark
      
        $
        $ perf bench internals synthesize -t
        # Running 'internals/synthesize' benchmark:
        Computing performance of multi threaded perf event synthesis by
        synthesizing events on CPU 0:
          Number of synthesis threads: 1
            Average synthesis took: 65449.000 usec (+- 586.442 usec)
            Average num. events: 9405.400 (+- 0.306)
            Average time per event 6.959 usec
          Number of synthesis threads: 2
            Average synthesis took: 37838.300 usec (+- 130.259 usec)
            Average num. events: 9501.800 (+- 20.469)
            Average time per event 3.982 usec
          Number of synthesis threads: 3
            Average synthesis took: 48551.400 usec (+- 225.686 usec)
            Average num. events: 9544.000 (+- 0.000)
            Average time per event 5.087 usec
          Number of synthesis threads: 4
            Average synthesis took: 29632.500 usec (+- 50.808 usec)
            Average num. events: 9544.000 (+- 0.000)
            Average time per event 3.105 usec
          Number of synthesis threads: 5
            Average synthesis took: 33920.400 usec (+- 284.509 usec)
            Average num. events: 9544.000 (+- 0.000)
            Average time per event 3.554 usec
          Number of synthesis threads: 6
            Average synthesis took: 27604.100 usec (+- 72.344 usec)
            Average num. events: 9548.000 (+- 0.000)
            Average time per event 2.891 usec
          Number of synthesis threads: 7
            Average synthesis took: 25406.300 usec (+- 933.371 usec)
            Average num. events: 9545.500 (+- 0.167)
            Average time per event 2.662 usec
          Number of synthesis threads: 8
            Average synthesis took: 24110.400 usec (+- 73.229 usec)
            Average num. events: 9551.000 (+- 0.000)
            Average time per event 2.524 usec
        $
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrey Zhizhikin <andrey.z@gmail.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lore.kernel.org/lkml/20200415054050.31645-2-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      13edc237
3. 23 Apr, 2020 (4 commits)
• perf record: Add num-synthesize-threads option · d99c22ea
  Authored by Stephane Eranian
This option controls the degree of parallelism of the synthesize_mmap()
code, which scans /proc/PID/task/PID/maps and can be time consuming. It
mimics the way perf top handles the option. If not specified, it defaults
to 1 thread, i.e. the default behavior before this option.
      
      On a desktop computer the processing of /proc/PID/task/PID/maps isn't
      slow enough to warrant parallel processing and the thread creation has
      some cost - hence the default of 1. On a loaded server with
      >100 cores it is possible to see synthesis times in the order of
      seconds and in this case having the option is desirable.
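
For illustration, a rough sketch of the fan-out this option controls. In
perf the work is done over /proc entries by the synthesis code itself;
the splitting below is a simplified assumption, not the actual
implementation:

  /* Sketch: split a PID list across a fixed number of synthesis threads.
   * synthesize_one_pid() is a hypothetical stand-in for per-PID synthesis. */
  #include <pthread.h>
  #include <stdio.h>

  #define NR_PIDS    8
  #define NR_THREADS 2

  static int pids[NR_PIDS] = {1, 2, 3, 4, 5, 6, 7, 8};

  static void synthesize_one_pid(int pid)
  {
          printf("synthesizing maps/comm for pid %d\n", pid);
  }

  struct chunk { int start, end; };

  static void *worker(void *arg)
  {
          struct chunk *c = arg;

          for (int i = c->start; i < c->end; i++)
                  synthesize_one_pid(pids[i]);
          return NULL;
  }

  int main(void)
  {
          pthread_t tid[NR_THREADS];
          struct chunk chunk[NR_THREADS];
          int per = NR_PIDS / NR_THREADS;

          for (int t = 0; t < NR_THREADS; t++) {
                  chunk[t].start = t * per;
                  chunk[t].end = (t == NR_THREADS - 1) ? NR_PIDS : (t + 1) * per;
                  pthread_create(&tid[t], NULL, worker, &chunk[t]);
          }
          for (int t = 0; t < NR_THREADS; t++)
                  pthread_join(tid[t], NULL);
          return 0;
  }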
      
      As the processing is a synchronization point, it is legitimate to worry if
      Amdahl's law will apply to this patch. Profiling with this patch in
      place:
      https://lore.kernel.org/lkml/20200415054050.31645-4-irogers@google.com/
      shows:
      ...
            - 32.59% __perf_event__synthesize_threads
               - 32.54% __event__synthesize_thread
                  + 22.13% perf_event__synthesize_mmap_events
                  + 6.68% perf_event__get_comm_ids.constprop.0
                  + 1.49% process_synthesized_event
                  + 1.29% __GI___readdir64
                  + 0.60% __opendir
      ...
That is, the synchronization point (process_synthesized_event) accounts for
1.49% of execution time and there is plenty left to run in parallel. This
is shown in the benchmark in this patch:
      
      https://lore.kernel.org/lkml/20200415054050.31645-2-irogers@google.com/
      
        Computing performance of multi threaded perf event synthesis by
        synthesizing events on CPU 0:
         Number of synthesis threads: 1
           Average synthesis took: 127729.000 usec (+- 3372.880 usec)
           Average num. events: 21548.600 (+- 0.306)
           Average time per event 5.927 usec
         Number of synthesis threads: 2
           Average synthesis took: 88863.500 usec (+- 385.168 usec)
           Average num. events: 21552.800 (+- 0.327)
           Average time per event 4.123 usec
         Number of synthesis threads: 3
           Average synthesis took: 83257.400 usec (+- 348.617 usec)
           Average num. events: 21553.200 (+- 0.327)
           Average time per event 3.863 usec
         Number of synthesis threads: 4
           Average synthesis took: 75093.000 usec (+- 422.978 usec)
           Average num. events: 21554.200 (+- 0.200)
           Average time per event 3.484 usec
         Number of synthesis threads: 5
           Average synthesis took: 64896.600 usec (+- 353.348 usec)
           Average num. events: 21558.000 (+- 0.000)
           Average time per event 3.010 usec
         Number of synthesis threads: 6
           Average synthesis took: 59210.200 usec (+- 342.890 usec)
           Average num. events: 21560.000 (+- 0.000)
           Average time per event 2.746 usec
         Number of synthesis threads: 7
           Average synthesis took: 54093.900 usec (+- 306.247 usec)
           Average num. events: 21562.000 (+- 0.000)
           Average time per event 2.509 usec
         Number of synthesis threads: 8
           Average synthesis took: 48938.700 usec (+- 341.732 usec)
           Average num. events: 21564.000 (+- 0.000)
           Average time per event 2.269 usec
      
Here the average time per synthesized event goes from 5.927 usec with 1
thread to 2.269 usec with 8, roughly a 2.6x speedup. This isn't a linear
speed-up, as not all of the synthesize code has been made parallel. If the
synthesis time were about 10 seconds, using 8 threads may bring this down
to less than 4.
Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tony Jones <tonyj@suse.de>
      Cc: yuzhoujian <yuzhoujian@didichuxing.com>
Link: http://lore.kernel.org/lkml/20200422155038.9380-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      d99c22ea
• perf test session topology: Fix data path · dbd660e6
  Authored by Tommi Rantala
Commit 2d4f2799 ("perf data: Add global path holder") missed the path
conversion in tests/topology.c, causing the "Session topology" testcase
to "hang" (wait forever for input from stdin) when doing "ssh $VM perf
test".

It can be reproduced by running "cat | perf test topo", and made to crash
by replacing cat with true:
      
        $ true | perf test -v topo
        40: Session topology                                      :
        --- start ---
        test child forked, pid 3638
        templ file: /tmp/perf-test-QPvAch
        incompatible file format
        incompatible file format (rerun with -v to learn more)
        free(): invalid pointer
        test child interrupted
        ---- end ----
        Session topology: FAILED!
      
      Committer testing:
      
      Reproduced the above result before the patch and after it is back
      working:
      
        # true | perf test -v topo
        41: Session topology                                      :
        --- start ---
        test child forked, pid 19374
        templ file: /tmp/perf-test-YOTEQg
        CPU 0, core 0, socket 0
        CPU 1, core 1, socket 0
        CPU 2, core 2, socket 0
        CPU 3, core 3, socket 0
        CPU 4, core 0, socket 0
        CPU 5, core 1, socket 0
        CPU 6, core 2, socket 0
        CPU 7, core 3, socket 0
        test child finished with 0
        ---- end ----
        Session topology: Ok
        #
      
      Fixes: 2d4f2799 ("perf data: Add global path holder")
Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Link: http://lore.kernel.org/lkml/20200423115341.562782-1-tommi.t.rantala@nokia.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      dbd660e6
• perf stat: Improve runtime stat for interval mode · 197ba86f
  Authored by Jin Yao
For interval mode, the metric is printed after the '#' character if it
exists, but it is not calculated from the counts generated in that
interval.
      
      See the following examples:
      
        root@kbl-ppc:~# perf stat -M CPI -I1000 --interval-count 2
        #           time             counts unit events
             1.000422803            764,809      inst_retired.any          #      2.9 CPI
             1.000422803          2,234,932      cycles
             2.001464585          1,960,061      inst_retired.any          #      1.6 CPI
             2.001464585          4,022,591      cycles
      
      The second CPI should not be 1.6 (4,022,591/1,960,061 is 2.1)
      
        root@kbl-ppc:~# perf stat -e cycles,instructions -I1000 --interval-count 2
        #           time             counts unit events
             1.000429493          2,869,311      cycles
             1.000429493            816,875      instructions              #    0.28  insn per cycle
             2.001516426          9,260,973      cycles
             2.001516426          5,250,634      instructions              #    0.87  insn per cycle
      
      The second 'insn per cycle' should not be 0.87 (5,250,634/9,260,973 is
      0.57).
      
The current code uses a global variable 'rt_stat' for tracking and
updating the std dev of the runtime stat. Unlike the counts, 'rt_stat' is
not reset per interval, while the counts are:

  perf_stat_process_counter()
  {
          if (config->interval)
                  init_stats(ps->res_stats);
  }

So for interval mode, the 'rt_stat' variable should be reset too.

This patch resets 'rt_stat' before read_counters(), so the runtime stat
is only calculated from the counts generated in this interval.
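
To see how an unreset running stat can produce the bogus 1.6 above, here
is a small self-contained demo. It is only an approximation of the
mechanism (perf keeps the shadow state in 'rt_stat', not a bare average),
but it reproduces the numbers from the example: dividing a running
average of cycles across intervals by this interval's instructions gives
1.6, while using only this interval's counts gives 2.1:

  /* Demo only; not perf code. */
  #include <stdio.h>

  int main(void)
  {
          /* per-interval (cycles, instructions) counts from the example above */
          double cycles[2] = { 2234932, 4022591 };
          double insns[2]  = { 764809, 1960061 };
          double cyc_running_avg = 0;

          for (int i = 0; i < 2; i++) {
                  cyc_running_avg = (cyc_running_avg * i + cycles[i]) / (i + 1);

                  printf("interval %d: CPI without reset %.1f, with reset %.1f\n",
                         i + 1,
                         cyc_running_avg / insns[i],  /* stale running average */
                         cycles[i] / insns[i]);       /* this interval only    */
          }
          return 0;
  }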
      
      With this patch:
      
        root@kbl-ppc:~# perf stat -M CPI -I1000 --interval-count 2
        #           time             counts unit events
             1.000420924          2,408,818      inst_retired.any          #      2.1 CPI
             1.000420924          5,010,111      cycles
             2.001448579          2,798,407      inst_retired.any          #      1.6 CPI
             2.001448579          4,599,861      cycles
      
        root@kbl-ppc:~# perf stat -e cycles,instructions -I1000 --interval-count 2
        #           time             counts unit events
             1.000428555          2,769,714      cycles
             1.000428555            774,462      instructions              #    0.28  insn per cycle
             2.001471562          3,595,904      cycles
             2.001471562          1,243,703      instructions              #    0.35  insn per cycle
      
Now the second 'insn per cycle' and CPI are calculated from the counts
generated in this interval.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Kajol Jain <kjain@linux.ibm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jin Yao <yao.jin@intel.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20200420145417.6864-1-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      197ba86f
• perf stat: Zero all the 'ena' and 'run' array slot stats for interval mode · 0e0bf1ea
  Authored by Jin Yao
As the code comments in perf_stat_process_counter() say, we calculate the
counter's data every interval, and the display code shows the
ps->res_stats avg value, so we need to zero the stats for interval mode.

But the current code only zeros res_stats[0]; it doesn't zero
res_stats[1] and res_stats[2], which hold the counter's 'ena' and 'run'
values.

This patch zeros the whole res_stats[] array for interval mode.
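
In sketch form, the fix amounts to clearing every slot instead of only
the first; init_stats() below is a stub standing in for perf's helper,
and the surrounding code is simplified:

  /* Stand-alone sketch of zeroing avg/ena/run stats each interval. */
  #include <string.h>

  struct stats { double mean, M2; unsigned long n; };  /* simplified */

  static void init_stats(struct stats *s)
  {
          memset(s, 0, sizeof(*s));
  }

  static void reset_interval_stats(struct stats res_stats[3])
  {
          for (int i = 0; i < 3; i++)     /* was: only &res_stats[0] */
                  init_stats(&res_stats[i]);
  }

  int main(void)
  {
          struct stats res_stats[3];

          reset_interval_stats(res_stats);
          return 0;
  }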
      
      Fixes: 51fd2df1 ("perf stat: Fix interval output values")
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jin Yao <yao.jin@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20200409070755.17261-1-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      0e0bf1ea
4. 22 Apr, 2020 (6 commits)
  5. 18 Apr, 2020 (16 commits)
• perf hist: Add fast path for duplicate entries check · 12e89e65
  Authored by Kan Liang
Perf checks for duplicate entries in a callchain before adding an entry.
However, the check is very slow, especially with deeper call stacks.
~50% of the elapsed time of perf report is spent on the check when the
call stack always has a depth of 32.

hist_entry__cmp() is used to compare the new entry with the old entries.
It goes through all the available sorts in the sort_list and calls the
specific cmp of each sort, which is very slow.

Actually, for most cases there are no duplicate entries in the callchain;
the symbols are usually different. It's much faster to do a quick check
on the symbols first, and only do the full cmp when the symbols are
exactly the same.

The quick check only checks symbols, not the dso. Export
_sort__sym_cmp.
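
A rough sketch of the fast-path idea (the real change is in perf's hist
code and uses the exported _sort__sym_cmp(); the types and helpers below
are simplified stand-ins):

  /* Cheap symbol comparison first; full compare only on a symbol match. */
  #include <stdbool.h>
  #include <string.h>

  struct entry { const char *sym; const char *dso; long ip; };

  static int full_cmp(const struct entry *a, const struct entry *b)
  {
          int ret = strcmp(a->dso, b->dso);   /* stands in for walking sort_list */

          return ret ? ret : (a->ip > b->ip) - (a->ip < b->ip);
  }

  static bool is_duplicate(const struct entry *a, const struct entry *b)
  {
          /* fast path: different symbols can never be duplicates */
          if (strcmp(a->sym, b->sym))
                  return false;
          /* slow path: symbols match, fall back to the full comparison */
          return full_cmp(a, b) == 0;
  }

  int main(void)
  {
          struct entry a = { "f43", "tchain_edit", 0x401106 };
          struct entry b = { "f42", "tchain_edit", 0x40114c };

          return is_duplicate(&a, &b);    /* 0: not a duplicate */
  }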
      
        $ perf record --call-graph lbr ./tchain_edit_64
      
        Without the patch
        $time perf report --stdio
        real    0m21.142s
        user    0m21.110s
        sys     0m0.033s
      
        With the patch
$ time perf report --stdio
        real    0m10.977s
        user    0m10.948s
        sys     0m0.027s
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Pavel Gerasimov <pavel.gerasimov@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
Link: http://lore.kernel.org/lkml/20200319202517.23423-18-kan.liang@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      12e89e65
• perf c2c: Add option to enable the LBR stitching approach · d80da766
  Authored by Kan Liang
With the LBR stitching approach, the reconstructed LBR call stack can
break the HW limitation. However, it may reconstruct invalid call stacks
in some cases, e.g. exception handling such as setjmp/longjmp. Also, it
may impact the processing time, especially when the number of samples
with stitched LBRs is huge.

Add an option to enable the approach.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Pavel Gerasimov <pavel.gerasimov@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
Link: http://lore.kernel.org/lkml/20200319202517.23423-17-kan.liang@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      d80da766
• perf top: Add option to enable the LBR stitching approach · 13e0c844
  Authored by Kan Liang
With the LBR stitching approach, the reconstructed LBR call stack can
break the HW limitation. However, it may reconstruct invalid call stacks
in some cases, e.g. exception handling such as setjmp/longjmp. Also, it
may impact the processing time, especially when the number of samples
with stitched LBRs is huge.

Add an option to enable the approach. The option must be used together
with --call-graph lbr.
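
For instance, an invocation might look like the following (illustrative;
check perf top --help for the exact option spelling on your build):

  # perf top --call-graph lbr --stitch-lbr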
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Pavel Gerasimov <pavel.gerasimov@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
Link: http://lore.kernel.org/lkml/20200319202517.23423-16-kan.liang@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      13e0c844
• perf script: Add option to enable the LBR stitching approach · 680d125c
  Authored by Kan Liang
With the LBR stitching approach, the reconstructed LBR call stack can
break the HW limitation. However, it may reconstruct invalid call stacks
in some cases, e.g. exception handling such as setjmp/longjmp. Also, it
may impact the processing time, especially when the number of samples
with stitched LBRs is huge.
      
      Add an option to enable the approach.
      
      Committer testing:
      
      Using the same perf.data as with the latest cset committer testing
      section:
      
        $ perf script --stitch-lbr
        <SNIP>
        tchain_edit 11131 15164.984292:     437491 cycles:u:
                          401106 f43+0x0 (/wb/tchain_edit)
                          40114c f42+0x18 (/wb/tchain_edit)
                          401172 f41+0xe (/wb/tchain_edit)
                          401194 f40+0x0 (/wb/tchain_edit)
                          40119b f39+0x0 (/wb/tchain_edit)
                          4011a2 f38+0x0 (/wb/tchain_edit)
                          4011a9 f37+0x0 (/wb/tchain_edit)
                          4011b0 f36+0x0 (/wb/tchain_edit)
                          4011b7 f35+0x0 (/wb/tchain_edit)
                          4011be f34+0x0 (/wb/tchain_edit)
                          4011c5 f33+0x0 (/wb/tchain_edit)
                          4011cc f32+0x0 (/wb/tchain_edit)
                          401207 f31+0x34 (/wb/tchain_edit)
                          401212 f30+0x0 (/wb/tchain_edit)
                          401219 f29+0x0 (/wb/tchain_edit)
                          401220 f28+0x0 (/wb/tchain_edit)
                          401227 f27+0x0 (/wb/tchain_edit)
                          40122e f26+0x0 (/wb/tchain_edit)
                          401235 f25+0x0 (/wb/tchain_edit)
                          40123c f24+0x0 (/wb/tchain_edit)
                          401243 f23+0x0 (/wb/tchain_edit)
                          40124a f22+0x0 (/wb/tchain_edit)
                          401251 f21+0x0 (/wb/tchain_edit)
                          401258 f20+0x0 (/wb/tchain_edit)
                          40125f f19+0x0 (/wb/tchain_edit)
                          401266 f18+0x0 (/wb/tchain_edit)
                          40126d f17+0x0 (/wb/tchain_edit)
                          401274 f16+0x0 (/wb/tchain_edit)
                          40127b f15+0x0 (/wb/tchain_edit)
                          401282 f14+0x0 (/wb/tchain_edit)
                          401289 f13+0x0 (/wb/tchain_edit)
                          401290 f12+0x0 (/wb/tchain_edit)
                          401297 f11+0x0 (/wb/tchain_edit)
                          40129e f10+0x0 (/wb/tchain_edit)
                          4012a5 f9+0x0 (/wb/tchain_edit)
                          4012ac f8+0x0 (/wb/tchain_edit)
                          4012b3 f7+0x0 (/wb/tchain_edit)
                          4012ba f6+0x0 (/wb/tchain_edit)
                          4012c1 f5+0x0 (/wb/tchain_edit)
                          4012c8 f4+0x0 (/wb/tchain_edit)
                          4012cf f3+0x0 (/wb/tchain_edit)
                          4012d6 f2+0x0 (/wb/tchain_edit)
                          4012dd f1+0x0 (/wb/tchain_edit)
                          4012e4 main+0x0 (/wb/tchain_edit)
                    7f41a5016f41 __libc_start_main+0xf1 (/usr/lib64/libc-2.29.so)
        <SNIP>
        $
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Pavel Gerasimov <pavel.gerasimov@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
Link: http://lore.kernel.org/lkml/20200319202517.23423-15-kan.liang@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      680d125c
• perf report: Add option to enable the LBR stitching approach · b1d1429b
  Authored by Kan Liang
With the LBR stitching approach, the reconstructed LBR call stack can
break the HW limitation. However, it may reconstruct invalid call stacks
in some cases, e.g. exception handling such as setjmp/longjmp. Also, it
may impact the processing time, especially when the number of samples
with stitched LBRs is huge.
      
      Add an option to enable the approach.
      
        # To display the perf.data header info, please use
        # --header/--header-only options.
        #
        #
        # Total Lost Samples: 0
        #
        # Samples: 6K of event 'cycles'
        # Event count (approx.): 6492797701
        #
        # Children      Self  Command          Shared Object       Symbol
        # ........  ........  ...............  ..................
        # .................................
        #
          99.99%    99.99%  tchain_edit      tchain_edit        [.] f43
                  |
                  ---main
                     f1
                     f2
                     f3
                     f4
                     f5
                     f6
                     f7
                     f8
                     f9
                     f10
                     f11
                     f12
                     f13
                     f14
                     f15
                     f16
                     f17
                     f18
                     f19
                     f20
                     f21
                     f22
                     f23
                     f24
                     f25
                     f26
                     f27
                     f28
                     f29
                     f30
                     f31
                     |
                      --99.65%--f32
                                f33
                                f34
                                f35
                                f36
                                f37
                                f38
                                f39
                                f40
                                f41
                                f42
                                f43
      
      Committer testing:
      
        $ perf record --call-graph lbr /wb/tchain_edit
        [ perf record: Woken up 23 times to write data ]
        [ perf record: Captured and wrote 5.578 MB perf.data (6839 samples) ]
        $ perf report --header-only | egrep 'cpu(desc|.*capabilities)'
        # cpudesc : Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
        # cpu pmu capabilities: branches=32, max_precise=3, pmu_name=skylake
        $
      
      Before:
      
        $ perf report --no-children --stdio
        # To display the perf.data header info, please use --header/--header-only options.
        #
        #
        # Total Lost Samples: 0
        #
        # Samples: 6K of event 'cycles:u'
        # Event count (approx.): 6459523879
        #
        # Overhead  Command      Shared Object     Symbol
        # ........  ...........  ................  .......................
        #
            99.95%  tchain_edit  tchain_edit       [.] f43
                    |
                     --99.92%--f43
                               f42
                               f41
                               f40
                               f39
                               f38
                               f37
                               f36
                               f35
                               f34
                               f33
                               f32
                               f31
                               f30
                               f29
                               f28
                               f27
                               f26
                               f25
                               f24
                               f23
                               f22
                               f21
                               f20
                               f19
                               f18
                               f17
                               f16
                               f15
                               f14
                               f13
                               f12
                               f11
      
             0.03%  tchain_edit  tchain_edit       [.] f42
             0.01%  tchain_edit  tchain_edit       [.] f41
             0.00%  tchain_edit  tchain_edit       [.] f31
             0.00%  tchain_edit  ld-2.29.so        [.] _dl_relocate_object
             0.00%  tchain_edit  ld-2.29.so        [.] memmove
             0.00%  tchain_edit  [unknown]         [k] 0xffffffff93a00b17
      
      After:
      
        $ perf report --stitch-lbr --no-children --stdio
        # To display the perf.data header info, please use --header/--header-only options.
        #
        #
        # Total Lost Samples: 0
        #
        # Samples: 6K of event 'cycles:u'
        # Event count (approx.): 6459496645
        #
        # Overhead  Command      Shared Object     Symbol
        # ........  ...........  ................  ........................
        #
            99.97%  tchain_edit  tchain_edit       [.] f43
                    |
                     --99.93%--f43
                               f42
                               f41
                               f40
                               f39
                               f38
                               f37
                               f36
                               f35
                               f34
                               f33
                               f32
                               f31
                               f30
                               f29
                               f28
                               f27
                               f26
                               f25
                               f24
                               f23
                               f22
                               f21
                               f20
                               f19
                               f18
                               f17
                               f16
                               f15
                               f14
                               f13
                               f12
                               f11
                               f10
                               f9
                               f8
                               f7
                               f6
                               f5
                               f4
                               f3
                               f2
                               f1
                               main
                               __libc_start_main
      
             0.02%  tchain_edit  [unknown]         [k] 0xffffffff93a00b17
             0.01%  tchain_edit  tchain_edit       [.] f31
             0.00%  tchain_edit  ld-2.29.so        [.] _dl_important_hwcaps
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Tested-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Pavel Gerasimov <pavel.gerasimov@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
      Link: http://lore.kernel.org/lkml/20200319202517.23423-14-kan.liang@linux.intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      b1d1429b
    • K
      perf callchain: Stitch LBR call stack · ff165628
      Submitted by Kan Liang
      In LBR call stack mode, the depth of the reconstructed LBR call stack
      is limited by the number of LBR registers.
      
        For example, on skylake, the depth of reconstructed LBR call stack is
        always <= 32.
      
        # To display the perf.data header info, please use
        # --header/--header-only options.
        #
        #
        # Total Lost Samples: 0
        #
        # Samples: 6K of event 'cycles'
        # Event count (approx.): 6487119731
        #
        # Children      Self  Command          Shared Object       Symbol
        # ........  ........  ...............  ..................
        # ................................
      
          99.97%    99.97%  tchain_edit      tchain_edit        [.] f43
                  |
                   --99.64%--f11
                             f12
                             f13
                             f14
                             f15
                             f16
                             f17
                             f18
                             f19
                             f20
                             f21
                             f22
                             f23
                             f24
                             f25
                             f26
                             f27
                             f28
                             f29
                             f30
                             f31
                             f32
                             f33
                             f34
                             f35
                             f36
                             f37
                             f38
                             f39
                             f40
                             f41
                             f42
                             f43
      
      For a call stack deeper than the LBR limit, the HW overwrites the
      oldest branches in the LBR registers, so only a partial call stack
      can be reconstructed.
      
      However, the overwritten LBRs may still be retrieved from the previous
      sample; at that moment, the HW had not yet overwritten those LBR
      registers. Perf tools can stitch those overwritten LBRs onto the
      current call stack to get a more complete call stack.
      
      To determine whether LBRs can be stitched, perf tools need to compare
      the current sample with the previous sample.
      
      - They should have identical LBR records (same from, to and flags
        values, and the same physical index of the LBR registers).
      
      - The search starts from the base-of-stack of the current sample.
      
      Once perf decides to stitch the previous LBRs, the corresponding LBR
      cursor nodes are copied to 'lists', which tracks the LBR cursor nodes
      that are going to be stitched.
      
      When the stitching is over, the nodes are not freed immediately; they
      are moved to 'free_lists' so that the next stitching can reuse the
      space. Both 'lists' and 'free_lists' are freed when all samples have
      been processed.
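
      A simplified, self-contained sketch of the bookkeeping described
      above; the struct layouts and helper names are assumptions, not the
      perf source. Overlapping LBR records must be identical before
      stitching, and cursor nodes cycle between 'lists' and 'free_lists'
      so their space can be reused:

        #include <stdbool.h>
        #include <stdlib.h>
        #include <string.h>

        struct lbr_entry { unsigned long from, to, flags; };

        struct stitch_node {
                unsigned long ip;               /* saved cursor node, heavily simplified */
                struct stitch_node *next;
        };

        struct stitch_ctx {
                struct stitch_node *lists;      /* nodes that are going to be stitched  */
                struct stitch_node *free_lists; /* finished nodes kept around for reuse */
        };

        /* LBRs can only be stitched if the overlapping records are
         * identical (same from, to and flags values). */
        static bool lbr_records_match(const struct lbr_entry *prev,
                                      const struct lbr_entry *cur, int nr)
        {
                return memcmp(prev, cur, nr * sizeof(*cur)) == 0;
        }

        /* Take a node from free_lists if one is available, otherwise
         * allocate, and queue it on 'lists' for the callchain being built. */
        static struct stitch_node *stitch_node_get(struct stitch_ctx *ctx,
                                                   unsigned long ip)
        {
                struct stitch_node *n = ctx->free_lists;

                if (n)
                        ctx->free_lists = n->next;
                else
                        n = malloc(sizeof(*n));
                if (!n)
                        return NULL;
                n->ip = ip;
                n->next = ctx->lists;
                ctx->lists = n;
                return n;
        }

        /* After the callchain is finished, keep the nodes for the next
         * sample instead of freeing them right away. */
        static void stitch_nodes_recycle(struct stitch_ctx *ctx)
        {
                while (ctx->lists) {
                        struct stitch_node *n = ctx->lists;

                        ctx->lists = n->next;
                        n->next = ctx->free_lists;
                        ctx->free_lists = n;
                }
        }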
      
      Committer notes:
      
      Fix the intel-pt.c initialization of the union with 'struct
      branch_flags', which breaks the build with its unnamed union on older
      gcc versions.
      
      Uninline thread__free_stitch_list(), as it grew big and started
      dragging includes into thread.h; move it to thread.c, where the
      headers it needs are already there.
      
      This fixes the build on several systems, such as debian:experimental
      when cross-building to the MIPS32 architecture; in the other cases
      what was needed was being included by sheer luck.
      
        In file included from builtin-sched.c:11:
        util/thread.h: In function 'thread__free_stitch_list':
        util/thread.h:169:3: error: implicit declaration of function 'free' [-Werror=implicit-function-declaration]
          169 |   free(pos);
              |   ^~~~
        util/thread.h:169:3: error: incompatible implicit declaration of built-in function 'free' [-Werror]
        util/thread.h:19:1: note: include '<stdlib.h>' or provide a declaration of 'free'
           18 | #include "callchain.h"
          +++ |+#include <stdlib.h>
           19 |
        util/thread.h:174:3: error: incompatible implicit declaration of built-in function 'free' [-Werror]
          174 |   free(pos);
              |   ^~~~
        util/thread.h:174:3: note: include '<stdlib.h>' or provide a declaration of 'free'
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Pavel Gerasimov <pavel.gerasimov@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
      Link: http://lore.kernel.org/lkml/20200319202517.23423-13-kan.liang@linux.intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      ff165628
    • K
      perf callchain: Save previous cursor nodes for LBR stitching approach · 7f1d3931
      Submitted by Kan Liang
      The cursor nodes that are generated from a sample are eventually
      added to the callchain. To avoid regenerating cursor nodes from
      previous samples, the previous cursor nodes are also saved for the
      LBR stitching approach.
      
      Some options, e.g. hide-unresolved, may hide some LBRs. Add a
      variable 'valid' in struct callchain_cursor_node to indicate this
      case. The LBR stitching approach will later append only the valid
      cursor nodes from previous samples.
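
      A small illustrative sketch of the 'valid' flag idea (the struct and
      function here are simplified stand-ins, not the real
      callchain_cursor_node):

        #include <stdbool.h>
        #include <stdio.h>

        struct cursor_node_sketch {
                unsigned long ip;
                bool valid;        /* false when e.g. --hide-unresolved hid the entry */
        };

        /* Append only the valid nodes saved from the previous sample. */
        static int append_saved_nodes(const struct cursor_node_sketch *saved, int nr)
        {
                int appended = 0;

                for (int i = 0; i < nr; i++)
                        if (saved[i].valid)
                                appended++;   /* the real code would call add_callchain_ip() */
                return appended;
        }

        int main(void)
        {
                struct cursor_node_sketch saved[] = {
                        { 0x1000, true }, { 0x2000, false }, { 0x3000, true },
                };

                printf("appended %d of 3 saved nodes\n", append_saved_nodes(saved, 3));
                return 0;
        }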
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Pavel Gerasimov <pavel.gerasimov@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
      Link: http://lore.kernel.org/lkml/20200319202517.23423-12-kan.liang@linux.intel.com
      [ Use zfree() instead of open coded equivalent, and use it when freeing members of structs ]
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      7f1d3931
    • K
      perf thread: Save previous sample for LBR stitching approach · 9c6c3f47
      Submitted by Kan Liang
      To retrieve the overwritten LBRs from the previous sample for the LBR
      stitching approach, perf has to save the previous sample.
      
      Only allocate the struct lbr_stitch once, when the LBR stitching
      approach is enabled and the kernel supports hw_idx.
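
      A sketch of that lazy, one-time allocation, using simplified
      stand-ins for the real perf structures (the real code uses
      zalloc()/zfree()):

        #include <stdbool.h>
        #include <stdlib.h>

        struct lbr_stitch_sketch {
                /* previous sample, previous cursor nodes, ... (omitted) */
                int placeholder;
        };

        struct thread_sketch {
                struct lbr_stitch_sketch *lbr_stitch;
                bool lbr_stitch_enable;
        };

        /* Allocate the stitch state at most once, and only when stitching
         * is enabled and the kernel reports the LBR hw_idx. */
        static struct lbr_stitch_sketch *thread__lbr_stitch(struct thread_sketch *t,
                                                            bool has_hw_idx)
        {
                if (!t->lbr_stitch_enable || !has_hw_idx)
                        return NULL;
                if (!t->lbr_stitch)
                        t->lbr_stitch = calloc(1, sizeof(*t->lbr_stitch));
                return t->lbr_stitch;
        }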
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Pavel Gerasimov <pavel.gerasimov@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
      Link: http://lore.kernel.org/lkml/20200319202517.23423-11-kan.liang@linux.intel.com
      [ Use zalloc()/zfree() for thread->lbr_stitch ]
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      9c6c3f47
    • K
      perf thread: Add a knob for LBR stitch approach · 771fd155
      Submitted by Kan Liang
      The LBR stitch approach should be disabled by default, because:

      - The stitching approach is based on LBR call stack technology. The
        known limitations of LBR call stack technology still apply to the
        approach, e.g. exception handling such as setjmp/longjmp will have
        calls/returns that do not match.

      - The approach is not foolproof. There can be cases where it creates
        incorrect call stacks from incorrect matches. There is no attempt
        to validate any matches in another way.

      The 'lbr_stitch_enable' flag indicates whether the LBR stitch
      approach is enabled; it is disabled by default. The following patch
      will introduce a new option for each tool to enable the LBR stitch
      approach.
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Pavel Gerasimov <pavel.gerasimov@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
      Link: http://lore.kernel.org/lkml/20200319202517.23423-10-kan.liang@linux.intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      771fd155
    • K
      perf machine: Factor out lbr_callchain_add_lbr_ip() · e2b23483
      Submitted by Kan Liang
      Both the caller and callee orderings need to add ips from the LBR to
      the callchain. Factor out lbr_callchain_add_lbr_ip() to improve code
      readability.
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Pavel Gerasimov <pavel.gerasimov@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
      Link: http://lore.kernel.org/lkml/20200319202517.23423-9-kan.liang@linux.intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      e2b23483
    • K
      perf machine: Factor out lbr_callchain_add_kernel_ip() · dd3e249a
      Submitted by Kan Liang
      Both the caller and callee orderings need to add kernel ips to the
      callchain. Factor out lbr_callchain_add_kernel_ip() to improve code
      readability.
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Pavel Gerasimov <pavel.gerasimov@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
      Link: http://lore.kernel.org/lkml/20200319202517.23423-8-kan.liang@linux.intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      dd3e249a
    • K
      perf machine: Refine the function for LBR call stack reconstruction · e48b8311
      Submitted by Kan Liang
      The LBR only collects the user call stack. To reconstruct a complete
      call stack, both the kernel call stack and the user call stack are
      required. The function resolve_lbr_callchain_sample() mixes the
      kernel call stack and the user call stack.
      
      Now, with the help of the HW idx, the perf tool can reconstruct a
      more complete call stack by adding some user call stack from the
      previous sample. However, the current implementation is hard to
      extend to support that.
      
      Current code path for resolve_lbr_callchain_sample()
      
        for (j = 0; j < mix_chain_nr; j++) {
             if (ORDER_CALLEE) {
                   if (kernel callchain)
                        Fill callchain info
                   else if (LBR callchain)
                        Fill callchain info
             } else {
                   if (LBR callchain)
                        Fill callchain info
                   else if (kernel callchain)
                        Fill callchain info
             }
             add_callchain_ip();
        }
      
      With the patch,
      
        if (ORDER_CALLEE) {
             for (j = 0; j < NUM of kernel callchain) {
                   Fill callchain info
                   add_callchain_ip();
             }
             for (; j < mix_chain_nr) {
                   Fill callchain info
                   add_callchain_ip();
             }
        } else {
             for (j = 0; j < NUM of LBR callchain) {
                   Fill callchain info
                   add_callchain_ip();
             }
             for (; j < mix_chain_nr) {
                   Fill callchain info
                   add_callchain_ip();
             }
        }
      
      No functional changes.
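
      For reference, a self-contained sketch (not the perf source) of the
      restructured traversal: the kernel part and the LBR part of the mixed
      chain are walked in two plain loops instead of checking the order on
      every iteration:

        #include <stdio.h>

        static void add_ip(unsigned long ip)   /* stands in for add_callchain_ip() */
        {
                printf("  %#lx\n", ip);
        }

        static void resolve_mixed_chain(const unsigned long *kernel, int nr_kernel,
                                        const unsigned long *lbr, int nr_lbr,
                                        int order_callee)
        {
                int j;

                if (order_callee) {
                        for (j = 0; j < nr_kernel; j++)  /* kernel part first */
                                add_ip(kernel[j]);
                        for (j = 0; j < nr_lbr; j++)     /* then the LBR (user) part */
                                add_ip(lbr[j]);
                } else {
                        for (j = 0; j < nr_lbr; j++)     /* caller order: LBR part first */
                                add_ip(lbr[j]);
                        for (j = 0; j < nr_kernel; j++)
                                add_ip(kernel[j]);
                }
        }

        int main(void)
        {
                unsigned long kernel[] = { 0xffffffffa0000010, 0xffffffffa0000020 };
                unsigned long lbr[]    = { 0x401000, 0x402000, 0x403000 };

                puts("callee order:");
                resolve_mixed_chain(kernel, 2, lbr, 3, 1);
                puts("caller order:");
                resolve_mixed_chain(kernel, 2, lbr, 3, 0);
                return 0;
        }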
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Pavel Gerasimov <pavel.gerasimov@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
      Link: http://lore.kernel.org/lkml/20200319202517.23423-7-kan.liang@linux.intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      e48b8311
    • K
      perf machine: Remove the indent in resolve_lbr_callchain_sample · f8603267
      Submitted by Kan Liang
      The indent is unnecessary in resolve_lbr_callchain_sample.  Removing it
      will make the following patch simpler.
      
      Current code path for resolve_lbr_callchain_sample()
      
              /* LBR only affects the user callchain */
              if (i != chain_nr) {
                      body of the function
                      ....
                      return 1;
              }
      
              return 0;
      
      With the patch,
      
              /* LBR only affects the user callchain */
              if (i == chain_nr)
                      return 0;
      
              body of the function
              ...
              return 1;
      
      No functional changes.
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Pavel Gerasimov <pavel.gerasimov@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
      Link: http://lore.kernel.org/lkml/20200319202517.23423-6-kan.liang@linux.intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      f8603267
    • K
      perf header: Support CPU PMU capabilities · 6f91ea28
      Submitted by Kan Liang
      To stitch the LBR call stack, the max LBR information is required, so
      the CPU PMU capabilities information has to be stored in the perf
      header.
      
      Add a new feature, HEADER_CPU_PMU_CAPS, for the CPU PMU capabilities.
      Retrieve all CPU PMU capabilities, not just the max LBR information.
      
      Add the variable max_branches to facilitate future usage.
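
      As an illustration of what gets collected (a standalone sketch, not
      the perf header code), every readable file under
      /sys/devices/cpu/caps/ contributes one name=value pair, which is how
      the capability string shown below is formed:

        #include <dirent.h>
        #include <limits.h>
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                const char *caps = "/sys/devices/cpu/caps";
                DIR *dir = opendir(caps);
                struct dirent *ent;
                char path[PATH_MAX], value[64];

                if (!dir)
                        return 1;
                while ((ent = readdir(dir)) != NULL) {
                        FILE *f;

                        if (ent->d_name[0] == '.')    /* skip "." and ".." */
                                continue;
                        snprintf(path, sizeof(path), "%s/%s", caps, ent->d_name);
                        f = fopen(path, "r");
                        if (!f)
                                continue;
                        if (fgets(value, sizeof(value), f)) {
                                value[strcspn(value, "\n")] = '\0';
                                printf("%s=%s\n", ent->d_name, value);
                        }
                        fclose(f);
                }
                closedir(dir);
                return 0;
        }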
      
      Committer testing:
      
        # ls -la /sys/devices/cpu/caps/
        total 0
        drwxr-xr-x. 2 root root    0 Apr 17 10:53 .
        drwxr-xr-x. 6 root root    0 Apr 17 07:02 ..
        -r--r--r--. 1 root root 4096 Apr 17 10:53 max_precise
        #
        # cat /sys/devices/cpu/caps/max_precise
        0
        # perf record sleep 1
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.033 MB perf.data (7 samples) ]
        #
        # perf report --header-only | egrep 'cpu(desc|.*capabilities)'
        # cpudesc : AMD Ryzen 5 3600X 6-Core Processor
        # cpu pmu capabilities: max_precise=0
        #
      
      And then on an Intel machine:
      
        $ ls -la /sys/devices/cpu/caps/
        total 0
        drwxr-xr-x. 2 root root    0 Apr 17 10:51 .
        drwxr-xr-x. 6 root root    0 Apr 17 10:04 ..
        -r--r--r--. 1 root root 4096 Apr 17 11:37 branches
        -r--r--r--. 1 root root 4096 Apr 17 10:51 max_precise
        -r--r--r--. 1 root root 4096 Apr 17 11:37 pmu_name
        $ cat /sys/devices/cpu/caps/max_precise
        3
        $ cat /sys/devices/cpu/caps/branches
        32
        $ cat /sys/devices/cpu/caps/pmu_name
        skylake
        $ perf record sleep 1
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.001 MB perf.data (8 samples) ]
        $ perf report --header-only | egrep 'cpu(desc|.*capabilities)'
        # cpudesc : Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
        # cpu pmu capabilities: branches=32, max_precise=3, pmu_name=skylake
        $
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Tested-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Pavel Gerasimov <pavel.gerasimov@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
      Link: http://lore.kernel.org/lkml/20200319202517.23423-3-kan.liang@linux.intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      6f91ea28
    • J
      perf parser: Add support to specify rXXX event with pmu · 3a6c51e4
      Submitted by Jiri Olsa
      The current rXXXX event specification creates an event under the
      PERF_TYPE_RAW pmu type. This change allows rXXXX to be used within
      the pmu syntax, so that pmu's type is used, via the following syntax:
      
        -e 'cpu/r3c/'
        -e 'cpum_cf/r0/'
      
      The XXXX number goes directly into perf_event_attr::config, the same
      way as with a '-e rXXXX' event. The perf_event_attr::type is filled
      with the pmu type.
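
      A hedged, standalone sketch of what the new spelling boils down to at
      the perf_event_attr level (this is not the parser code, and the sysfs
      lookup and fallback here are assumptions): the XXXX number lands in
      ::config, while the pmu syntax supplies ::type from the named PMU
      instead of PERF_TYPE_RAW:

        #include <linux/perf_event.h>
        #include <sys/syscall.h>
        #include <sys/types.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                        int cpu, int group_fd, unsigned long flags)
        {
                return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
        }

        int main(void)
        {
                struct perf_event_attr attr;
                unsigned int type = PERF_TYPE_RAW;   /* what plain 'rc0' would use */
                FILE *f = fopen("/sys/devices/cpu/type", "r");

                if (f) {                             /* 'cpu/rc0/' uses the cpu PMU's type */
                        if (fscanf(f, "%u", &type) != 1)
                                type = PERF_TYPE_RAW;
                        fclose(f);
                }

                memset(&attr, 0, sizeof(attr));
                attr.size   = sizeof(attr);
                attr.type   = type;
                attr.config = 0xc0;                  /* the XXXX part goes straight to config */

                int fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
                if (fd >= 0) {
                        printf("event opened, type=%u config=%#llx\n",
                               type, (unsigned long long)attr.config);
                        close(fd);
                } else {
                        perror("perf_event_open");
                }
                return 0;
        }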
      
      Committer testing:
      
      So, let's see what goes into perf_event_attr::config for, say, the
      'instructions' PERF_TYPE_HARDWARE (0) event. First we should look at
      how to encode this event as a PERF_TYPE_RAW event for this specific
      CPU, an AMD Ryzen 5:
      
        # cat /sys/devices/cpu/events/instructions
        event=0xc0
        #
      
      Then try with it _and_ the 'instructions' event, just to see that
      they are close enough:
      
        # perf stat -e rc0,instructions sleep 1
      
         Performance counter stats for 'sleep 1':
      
                   919,794      rc0
                   919,898      instructions
      
               1.000754579 seconds time elapsed
      
               0.000715000 seconds user
               0.000000000 seconds sys
        #
      
      Now we should try, before this patch, the PMU event encoding:
      
        # perf stat -e cpu/rc0/ sleep 1
        event syntax error: 'cpu/rc0/'
                                 \___ unknown term
      
        valid terms: event,edge,inv,umask,cmask,config,config1,config2,name,period,percore
        #
      
      Now with this patch, the three ways of specifying the 'instructions' CPU
      counter are accepted:
      
        # perf stat -e cpu/rc0/,rc0,instructions sleep 1
      
         Performance counter stats for 'sleep 1':
      
                   892,948      cpu/rc0/
                   893,052      rc0
                   893,156      instructions
      
               1.000931819 seconds time elapsed
      
               0.000916000 seconds user
               0.000000000 seconds sys
      
        #
      Requested-by: NThomas Richter <tmricht@linux.ibm.com>
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Tested-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Tested-by: NThomas Richter <tmricht@linux.ibm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Michael Petlan <mpetlan@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sumanth Korikkar <sumanthk@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Link: http://lore.kernel.org/lkml/20200416221405.437788-1-jolsa@kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      3a6c51e4
    • I
      perf doc: allow ASCIIDOC_EXTRA to be an argument · e9cfa47e
      Submitted by Ian Rogers
      This will allow parent makefiles to pass values to asciidoc.
      Signed-off-by: NIan Rogers <irogers@google.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Igor Lubashev <ilubashe@akamai.com>
      Cc: Jin Yao <yao.jin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Jiwei Sun <jiwei.sun@windriver.com>
      Cc: John Garry <john.garry@huawei.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Leo Yan <leo.yan@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yonghong Song <yhs@fb.com>
      Cc: bpf@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Cc: yuzhoujian <yuzhoujian@didichuxing.com>
      Link: http://lore.kernel.org/lkml/20200416162058.201954-2-irogers@google.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      e9cfa47e