1. 04 Nov 2020, 10 commits
  2. 03 Nov 2020, 12 commits
    • perf tools: Add missing swap for cgroup events · 2c589d93
      Authored by Namhyung Kim
      A swap function for PERF_RECORD_CGROUP was missing; add it (a minimal
      sketch of the byte-swap pattern follows this entry).
      
      Fixes: ba78c1c5 ("perf tools: Basic support for CGROUP event")
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Link: http://lore.kernel.org/lkml/20201102140228.303657-1-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      2c589d93
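
      A minimal standalone sketch of the byte-swap pattern such a handler
      implements (struct and field names here are illustrative, not the perf
      source): records written on a machine with different endianness must
      have every multi-byte field byte-swapped before use.

        #include <byteswap.h>
        #include <inttypes.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Illustrative stand-in for a PERF_RECORD_CGROUP payload. */
        struct cgroup_event_example {
                uint64_t id;
                char     path[64];
        };

        /* Swap the fixed-width fields; the path string needs no swapping. */
        static void cgroup_event_swap(struct cgroup_event_example *ev)
        {
                ev->id = bswap_64(ev->id);
        }

        int main(void)
        {
                struct cgroup_event_example ev = {
                        .id = 0x0123456789abcdefULL, .path = "/test",
                };

                cgroup_event_swap(&ev);
                printf("swapped id: 0x%016" PRIx64 "\n", ev.id);
                return 0;
        }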
    • perf tools: Add missing swap for ino_generation · fe01adb7
      Authored by Jiri Olsa
      We are missing the byte swap for the ino_generation field; add it.
      
      Fixes: 5c5e854b ("perf tools: Add attr->mmap2 support")
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20201101233103.3537427-2-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      fe01adb7
    • perf tools: Initialize output buffer in build_id__sprintf · 6311951d
      Authored by Jiri Olsa
      We display garbage for undefined build_id objects because we don't
      initialize the output buffer (a reduced sketch of the fix pattern
      follows this entry).
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20201101233103.3537427-1-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      6311951d
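
      A reduced sketch of the failure mode and the one-line fix pattern
      described above (simplified, not the actual perf function):

        #include <stdio.h>

        /* If size is 0 (no build id recorded), the loop writes nothing, so
         * the buffer must be terminated up front or callers print whatever
         * happened to be in it. */
        static int sprintf_build_id(char *bf, const unsigned char *id, int size)
        {
                char *p = bf;
                int i;

                bf[0] = '\0';   /* always hand back a valid (possibly empty) string */
                for (i = 0; i < size; i++)
                        p += sprintf(p, "%02x", id[i]);
                return p - bf;
        }

        int main(void)
        {
                char buf[41] = { 'X', 'X', 'X' };   /* simulate pre-existing junk */

                sprintf_build_id(buf, NULL, 0);
                printf("undefined build id -> \"%s\"\n", buf);
                return 0;
        }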
    • perf hists browser: Increase size of 'buf' in perf_evsel__hists_browse() · 86449b12
      Authored by Song Liu
      Building perf with gcc-9.1.1 generates the following warning:
      
          CC       ui/browsers/hists.o
        ui/browsers/hists.c: In function 'perf_evsel__hists_browse':
        ui/browsers/hists.c:3078:61: error: '%d' directive output may be \
        truncated writing between 1 and 11 bytes into a region of size \
        between 2 and 12 [-Werror=format-truncation=]
      
         3078 |       "Max event group index to sort is %d (index from 0 to %d)",
              |                                                             ^~
        ui/browsers/hists.c:3078:7: note: directive argument in the range [-2147483648, 8]
         3078 |       "Max event group index to sort is %d (index from 0 to %d)",
              |       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        In file included from /usr/include/stdio.h:937,
                         from ui/browsers/hists.c:5:
      
      IOW, the string in line 3078 might be too long for buf[] of 64 bytes.
      
      Fix this by increasing the size of buf[] to 128 (a standalone
      illustration follows this entry).
      
      Fixes: dbddf174  ("perf report/top TUI: Support hotkeys to let user select any event for sorting")
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Jin Yao <yao.jin@linux.intel.com>
      Cc: stable@vger.kernel.org # v5.7+
      Link: http://lore.kernel.org/lkml/20201030235431.534417-1-songliubraving@fb.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      86449b12
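
      A standalone illustration of the warning and the fix (buffer size and
      variable names are illustrative): gcc computes the worst case for each
      %d directive, up to 11 characters for a 32-bit int, and warns when that
      may not fit in the remaining space of the destination buffer.

        #include <stdio.h>

        int main(void)
        {
                int nr_members = 8;
                char buf[128];  /* was 64; 128 covers the worst-case %d expansion */

                /* With char buf[64], gcc 9+ emits -Wformat-truncation here
                 * because the two %d directives may each need up to 11 bytes. */
                snprintf(buf, sizeof(buf),
                         "Max event group index to sort is %d (index from 0 to %d)",
                         nr_members - 1, nr_members - 1);
                puts(buf);
                return 0;
        }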
    • perf scripting python: Avoid declaring function pointers with a visibility attribute · d0e7b0c7
      Authored by Arnaldo Carvalho de Melo
      To avoid this:
      
        util/scripting-engines/trace-event-python.c: In function 'python_start_script':
        util/scripting-engines/trace-event-python.c:1595:2: error: 'visibility' attribute ignored [-Werror=attributes]
         1595 |  PyMODINIT_FUNC (*initfunc)(void);
              |  ^~~~~~~~~~~~~~
      
      That started breaking when building with PYTHON=python3 and these gcc
      versions (I haven't checked with the clang ones, maybe it breaks there
      as well):
      
        # export PERF_TARBALL=http://192.168.86.5/perf/perf-5.9.0.tar.xz
        # dm  fedora:33 fedora:rawhide
           1   107.80 fedora:33         : Ok   gcc (GCC) 10.2.1 20201005 (Red Hat 10.2.1-5), clang version 11.0.0 (Fedora 11.0.0-1.fc33)
           2    92.47 fedora:rawhide    : Ok   gcc (GCC) 10.2.1 20201016 (Red Hat 10.2.1-6), clang version 11.0.0 (Fedora 11.0.0-1.fc34)
        #
      
      Avoid that by ditching that 'initfunc' function pointer with its:
      
          #define Py_EXPORTED_SYMBOL __attribute__ ((visibility ("default")))
          #define PyMODINIT_FUNC Py_EXPORTED_SYMBOL PyObject*
      
      Instead, just call PyImport_AppendInittab() at the end of the ifdef
      python3 block with the functions that were being assigned to that
      initfunc (a minimal sketch of this pattern follows this entry).
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      d0e7b0c7
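
      A minimal standalone sketch of the pattern the patch switches to (the
      module name matches perf's scripting module, but the init body is
      simplified): register the module init function with
      PyImport_AppendInittab() before Py_Initialize(), instead of storing it
      in a PyMODINIT_FUNC-typed pointer.

        /* build with: cc demo.c $(python3-config --embed --cflags --ldflags) */
        #include <Python.h>

        static struct PyModuleDef perf_trace_context_module = {
                PyModuleDef_HEAD_INIT, "perf_trace_context", NULL, -1, NULL,
        };

        static PyObject *initfunc_perf_trace_context(void)
        {
                return PyModule_Create(&perf_trace_context_module);
        }

        int main(void)
        {
                /* Must be called before Py_Initialize(). */
                PyImport_AppendInittab("perf_trace_context", initfunc_perf_trace_context);
                Py_Initialize();
                PyRun_SimpleString("import perf_trace_context; print(perf_trace_context)");
                Py_Finalize();
                return 0;
        }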
    • perf tools: Remove broken __no_tail_call attribute · 9ae1e990
      Authored by Peter Zijlstra
      The GCC-specific __attribute__((optimize)) attribute does not do what
      is commonly expected, and the GCC people explicitly recommend against
      using it in production code.
      
      Unlike what is often expected, it doesn't add to the optimization
      flags; it fully replaces them, losing any and all optimization flags
      provided on the compiler command line.
      
      The only guaranteed means of inhibiting tail calls is placing a
      volatile asm statement with side effects after the call, such that the
      tail call simply cannot be done (a small illustration follows this
      entry).
      
      Given the original commit wasn't specific on which calls were the problem, this
      removal might re-introduce the problem, which can then be re-analyzed and cured
      properly.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Miguel Ojeda <ojeda@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Arvind Sankar <nivedita@alum.mit.edu>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Martin Liška <mliska@suse.cz>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lore.kernel.org/lkml/20201028081123.GT2628@hirez.programming.kicks-ass.net
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      9ae1e990
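
      A sketch of the technique described above (the macro name is
      illustrative): an empty volatile asm with a memory clobber placed
      right after the call keeps the call from being the last thing the
      function does, so the compiler cannot turn it into a tail call.

        #include <stdio.h>

        #define prevent_tail_call() __asm__ volatile("" : : : "memory")

        static int helper(int x)
        {
                return x * 2;
        }

        static int caller(int x)
        {
                int ret = helper(x);    /* otherwise a tail-call candidate at -O2 */

                prevent_tail_call();
                return ret;
        }

        int main(void)
        {
                printf("%d\n", caller(21));
                return 0;
        }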
    • perf vendor events: Fix DRAM_BW_Use 0 issue for CLX/SKX · 0dfbe4c6
      Authored by Jin Yao
      Ian reports an issue that the metric DRAM_BW_Use often remains 0.
      
      The metric expression for DRAM_BW_Use on CLX/SKX:
      
      "( 64 * ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) / 1000000000 ) / duration_time"
      
      The counts of uncore_imc/cas_count_read/ and uncore_imc/cas_count_write/
      are scaled up by 64 to turn a count of cache lines into bytes; the
      result is then divided by 1000000000 to give GB.
      
      However, the counts of uncore_imc/cas_count_read/ and
      uncore_imc/cas_count_write/ have already been scaled.
      
      The scale values are from sysfs, such as
      /sys/devices/uncore_imc_0/events/cas_count_read.scale.
      It's 6.103515625e-5 (64 / 1024.0 / 1024.0).
      
      So if we use the original metric expression, the result is not correct.
      
      But the difficulty is that, for the SKL client, the counts are not scaled.
      
      The metric expression for DRAM_BW_Use on SKL:
      
      "64 * ( arb@event\\=0x81\\,umask\\=0x1@ + arb@event\\=0x84\\,umask\\=0x1@ ) / 1000000 / duration_time / 1000"
      
      root@kbl-ppc:~# perf stat -M DRAM_BW_Use -a -- sleep 1
      
       Performance counter stats for 'system wide':
      
                     190      arb/event=0x84,umask=0x1/ #     1.86 DRAM_BW_Use
              29,093,178      arb/event=0x81,umask=0x1/
           1,000,703,287 ns   duration_time
      
             1.000703287 seconds time elapsed
      
      The result is expected.
      
      So the easy way is to just change the metric expression for CLX/SKX.
      This patch changes the metric expression to:
      
      "( ( ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) * 1048576 ) / 1000000000 ) / duration_time"
      
      1048576 = 1024 * 1024.
      
      Before (tested on CLX):
      
      root@lkp-csl-2sp5 ~# perf stat -M DRAM_BW_Use -a -- sleep 1
      
       Performance counter stats for 'system wide':
      
                  765.35 MiB  uncore_imc/cas_count_read/ #     0.00 DRAM_BW_Use
                    5.42 MiB  uncore_imc/cas_count_write/
              1001515088 ns   duration_time
      
             1.001515088 seconds time elapsed
      
      After:
      
      root@lkp-csl-2sp5 ~# perf stat -M DRAM_BW_Use -a -- sleep 1
      
       Performance counter stats for 'system wide':
      
                  767.95 MiB  uncore_imc/cas_count_read/ #     0.80 DRAM_BW_Use
                    5.02 MiB  uncore_imc/cas_count_write/
              1001900010 ns   duration_time
      
             1.001900010 seconds time elapsed
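
      A quick standalone check of the rewritten expression against the
      "After" numbers above (assuming, as the message says, that the
      cas_count_* values come back already scaled to MiB):

        #include <stdio.h>

        int main(void)
        {
                double read_mib   = 767.95;        /* uncore_imc/cas_count_read/  */
                double write_mib  = 5.02;          /* uncore_imc/cas_count_write/ */
                double duration_s = 1.001900010;   /* duration_time in seconds    */

                /* ( (read + write) * 1048576 ) / 1000000000 / duration_time */
                double gb = (read_mib + write_mib) * 1048576.0 / 1e9;

                printf("DRAM_BW_Use = %.2f GB/s\n", gb / duration_s); /* ~0.80 */
                return 0;
        }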
      
      Fixes: 038d3b53 ("perf vendor events intel: Update CascadelakeX events to v1.08")
      Fixes: b5ff7f27 ("perf vendor events: Update SkylakeX events to v1.21")
      Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/20201023005334.7869-1-yao.jin@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      0dfbe4c6
    • perf trace: Fix segfault when trying to trace events by cgroup · a6293f36
      Authored by Stanislav Ivanichkin
        # ./perf trace -e sched:sched_switch -G test -a sleep 1
        perf: Segmentation fault
        Obtained 11 stack frames.
        ./perf(sighandler_dump_stack+0x43) [0x55cfdc636db3]
        /lib/x86_64-linux-gnu/libc.so.6(+0x3efcf) [0x7fd23eecafcf]
        ./perf(parse_cgroups+0x36) [0x55cfdc673f36]
        ./perf(+0x3186ed) [0x55cfdc70d6ed]
        ./perf(parse_options_subcommand+0x629) [0x55cfdc70e999]
        ./perf(cmd_trace+0x9c2) [0x55cfdc5ad6d2]
        ./perf(+0x1e8ae0) [0x55cfdc5ddae0]
        ./perf(+0x1e8ded) [0x55cfdc5ddded]
        ./perf(main+0x370) [0x55cfdc556f00]
        /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x7fd23eeadb96]
        ./perf(_start+0x29) [0x55cfdc557389]
        Segmentation fault
        #
      
       It happens because the "struct trace" in option->value is passed to
       the parse_cgroups() function instead of the "struct evlist" it expects
       (a reduced sketch of this class of bug follows this entry).
      
      Fixes: 9ea42ba4 ("perf trace: Support setting cgroups as targets")
      Signed-off-by: Stanislav Ivanichkin <sivanichkin@yandex-team.ru>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Cc: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
      Link: http://lore.kernel.org/lkml/20201027094357.94881-1-sivanichkin@yandex-team.ru
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      a6293f36
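
      A reduced, self-contained sketch of this class of bug (types and values
      are simplified and hypothetical, not the perf source): the option table
      hands the callback a pointer to one structure while the callback
      dereferences it as another.

        #include <stdio.h>

        struct evlist { int nr_entries; };
        struct trace  { struct evlist *evlist; /* ...many other fields... */ };
        struct option { void *value; };

        /* The callback expects opt->value to be a struct evlist pointer. */
        static int parse_cgroups(const struct option *opt)
        {
                struct evlist *evlist = opt->value;

                return evlist->nr_entries;  /* wild read/crash if value is a struct trace */
        }

        int main(void)
        {
                struct evlist evlist = { .nr_entries = 1 };
                struct trace trace = { .evlist = &evlist };

                /* The bug wired the option up with &trace, so parse_cgroups()
                 * reinterpreted a struct trace as a struct evlist.  What the
                 * callback actually needs is the evlist. */
                struct option opt = { .value = trace.evlist };

                printf("entries: %d\n", parse_cgroups(&opt));
                return 0;
        }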
    • perf tools: Fix crash with non-jited bpf progs · ab8bf5f2
      Authored by Tommi Rantala
      The addr in PERF_RECORD_KSYMBOL events for non-jited bpf progs points
      to the bpf interpreter, i.e. within the kernel text section. When
      processing the unregister event, this causes unexpected removal of
      vmlinux_map, crashing perf later in cleanup:
      
        # perf record -- timeout --signal=INT 2s /usr/share/bcc/tools/execsnoop
        PCOMM            PID    PPID   RET ARGS
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.208 MB perf.data (5155 samples) ]
        perf: tools/include/linux/refcount.h:131: refcount_sub_and_test: Assertion `!(new > val)' failed.
        Aborted (core dumped)
      
        # perf script -D|grep KSYM
        0 0xa40 [0x48]: PERF_RECORD_KSYMBOL addr ffffffffa9b6b530 len 0 type 1 flags 0x0 name bpf_prog_f958f6eb72ef5af6
        0 0xab0 [0x48]: PERF_RECORD_KSYMBOL addr ffffffffa9b6b530 len 0 type 1 flags 0x0 name bpf_prog_8c42dee26e8cd4c2
        0 0xb20 [0x48]: PERF_RECORD_KSYMBOL addr ffffffffa9b6b530 len 0 type 1 flags 0x0 name bpf_prog_f958f6eb72ef5af6
        108563691893 0x33d98 [0x58]: PERF_RECORD_KSYMBOL addr ffffffffa9b6b3b0 len 0 type 1 flags 0x0 name bpf_prog_bc5697a410556fc2_syscall__execve
        108568518458 0x34098 [0x58]: PERF_RECORD_KSYMBOL addr ffffffffa9b6b3f0 len 0 type 1 flags 0x0 name bpf_prog_45e2203c2928704d_do_ret_sys_execve
        109301967895 0x34830 [0x58]: PERF_RECORD_KSYMBOL addr ffffffffa9b6b3b0 len 0 type 1 flags 0x1 name bpf_prog_bc5697a410556fc2_syscall__execve
        109302007356 0x348b0 [0x58]: PERF_RECORD_KSYMBOL addr ffffffffa9b6b3f0 len 0 type 1 flags 0x1 name bpf_prog_45e2203c2928704d_do_ret_sys_execve
        perf: tools/include/linux/refcount.h:131: refcount_sub_and_test: Assertion `!(new > val)' failed.
      
      Here the addresses match the bpf interpreter:
      
        # grep -e ffffffffa9b6b530 -e ffffffffa9b6b3b0 -e ffffffffa9b6b3f0 /proc/kallsyms
        ffffffffa9b6b3b0 t __bpf_prog_run224
        ffffffffa9b6b3f0 t __bpf_prog_run192
        ffffffffa9b6b530 t __bpf_prog_run32
      
      Fix by not allowing vmlinux_map to be removed by a PERF_RECORD_KSYMBOL
      unregister event (a self-contained mock of the guard follows this entry).
      Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Tested-by: Jiri Olsa <jolsa@redhat.com>
      Link: https://lore.kernel.org/r/20201016114718.54332-1-tommi.t.rantala@nokia.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      ab8bf5f2
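
      A self-contained mock of the guard described above (names are
      hypothetical, not the perf source): a KSYMBOL unregister may only drop
      per-bpf-prog maps, never the kernel map that non-jited progs resolve to.

        #include <stdbool.h>
        #include <stdio.h>

        struct map { const char *name; };

        static struct map vmlinux_map  = { "vmlinux" };
        static struct map bpf_prog_map = { "bpf_prog_f958f6eb72ef5af6" };

        /* Stand-in for map lookup: non-jited progs report the interpreter's
         * address, which resolves to the kernel map. */
        static struct map *find_map(bool jited)
        {
                return jited ? &bpf_prog_map : &vmlinux_map;
        }

        static void process_ksymbol_unregister(bool jited)
        {
                struct map *map = find_map(jited);

                if (map != &vmlinux_map)        /* the guard: never drop the kernel map */
                        printf("removing %s\n", map->name);
                else
                        printf("keeping %s (addr points into kernel text)\n", map->name);
        }

        int main(void)
        {
                process_ksymbol_unregister(true);   /* jited prog: its map is removed */
                process_ksymbol_unregister(false);  /* non-jited: vmlinux map is kept */
                return 0;
        }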
    • tools headers UAPI: Update process_madvise affected files · 263e452e
      Authored by Arnaldo Carvalho de Melo
      To pick the changes from:
      
        ecb8ac8b ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
      
      That addresses these perf build warnings:
      
        Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/unistd.h' differs from latest version at 'include/uapi/asm-generic/unistd.h'
        diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
        Warning: Kernel ABI header at 'tools/perf/arch/x86/entry/syscalls/syscall_64.tbl' differs from latest version at 'arch/x86/entry/syscalls/syscall_64.tbl'
        diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      263e452e
    • perf tools: Update copy of libbpf's hashmap.c · e555b4b8
      Authored by Arnaldo Carvalho de Melo
      To pick the changes in:
      
        85367030 ("libbpf: Centralize poisoning and poison reallocarray()")
        7d9c71e1 ("libbpf: Extract generic string hashing function for reuse")
      
      Those don't entail any changes in tools/perf.
      
      This addresses this perf build warning:
      
        Warning: Kernel ABI header at 'tools/perf/util/hashmap.h' differs from latest version at 'tools/lib/bpf/hashmap.h'
        diff -u tools/perf/util/hashmap.h tools/lib/bpf/hashmap.h
      
      This is not a kernel ABI; it's just that this uses the mechanism in
      place for checking kernel ABI file drift.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      e555b4b8
    • perf tools: Remove LTO compiler options when building perl support · b773ea65
      Authored by Justin M. Forbes
      To avoid breaking the build by mixing files compiled with things coming
      from distro specific compiler options for perl with the rest of perf,
      i.e. to avoid this:
      
        `.gnu.debuglto_.debug_macro' referenced in section `.gnu.debuglto_.debug_macro' of /tmp/build/perf/util/scripting-engines/perf-in.o: defined in discarded section `.gnu.debuglto_.debug_macro[wm4.stdcpredef.h.19.8dc41bed5d9037ff9622e015fb5f0ce3]' of /tmp/build/perf/util/scripting-engines/perf-in.o
      
      Noticed on Fedora 33.
      Signed-off-by: Justin M. Forbes <jforbes@fedoraproject.org>
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1593431
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: https://src.fedoraproject.org/rpms/kernel-tools/c/589a32b62f0c12516ab7b34e3dd30d450145bfa4?branch=master
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      b773ea65
  3. 15 Oct 2020, 18 commits
    • perf c2c: Update documentation for metrics reorganization · 744aec4d
      Authored by Leo Yan
      The output format for metrics has been reorganized; update the
      documentation to reflect the changes.
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Cc: Al Grant <al.grant@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: James Clark <james.clark@arm.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/20201015144548.18482-10-leo.yan@linaro.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      744aec4d
    • perf c2c: Add metrics "RMT Load Hit" · 91d933c2
      Authored by Leo Yan
      The metrics "LLC Ld Miss" and "Load Dram" overlap with each other in
      their accounting items:
      
        "LLC Ld Miss" = "lcl_dram" + "rmt_dram" + "rmt_hit" + "rmt_hitm"
        "Load Dram"   = "lcl_dram" + "rmt_dram"
      
      Furthermore, the metric "LLC Ld Miss" is not informative for showing
      statistics, because it only contains a summary value and cannot give
      out breakdown details.
      
      For this reason, add a new metric "RMT Load Hit", which is used to
      present remote cache hits; it contains two items:
      
        "RMT Load Hit" = remote hit ("rmt_hit") + remote hitm ("rmt_hitm")
      
      As a result, the metric "LLC Ld Miss" is perfectly divided into the two
      metrics "RMT Load Hit" and "Load Dram".  It's not necessary to keep the
      metric "LLC Ld Miss", so remove it.
      
      Before:
      
        #        ----------- Cacheline ----------      Tot  ------- Load Hitm -------    Total    Total    Total  ---- Stores ----  ----- Core Load Hit -----  - LLC Load Hit --      LLC  --- Load Dram ----
        # Index             Address  Node  PA cnt     Hitm    Total  LclHitm  RmtHitm  records    Loads   Stores    L1Hit   L1Miss       FB       L1       L2    LclHit  LclHitm  Ld Miss       Lcl       Rmt
        # .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  ........  .......  .......  ........  ........
        #
              0      0x55f07d580100     0    1499   85.89%      481      481        0     7243     3879     3364     2599      765      548     2615       66       169      481        0         0         0
              1      0x55f07d580080     0       1   13.93%       78       78        0      664      664        0        0        0      187      361       27        11       78        0         0         0
              2      0x55f07d5800c0     0       1    0.18%        1        1        0      405      405        0        0        0      131        0       10       263        1        0         0         0
      
      After:
      
        #        ----------- Cacheline ----------      Tot  ------- Load Hitm -------    Total    Total    Total  ---- Stores ----  ----- Core Load Hit -----  - LLC Load Hit --  - RMT Load Hit --  --- Load Dram ----
        # Index             Address  Node  PA cnt     Hitm    Total  LclHitm  RmtHitm  records    Loads   Stores    L1Hit   L1Miss       FB       L1       L2    LclHit  LclHitm    RmtHit  RmtHitm       Lcl       Rmt
        # .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  ........  .......  ........  .......  ........  ........
        #
              0      0x55f07d580100     0    1499   85.89%      481      481        0     7243     3879     3364     2599      765      548     2615       66       169      481         0        0         0         0
              1      0x55f07d580080     0       1   13.93%       78       78        0      664      664        0        0        0      187      361       27        11       78         0        0         0         0
              2      0x55f07d5800c0     0       1    0.18%        1        1        0      405      405        0        0        0      131        0       10       263        1         0        0         0         0
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Tested-by: Joe Mario <jmario@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Link: https://lore.kernel.org/r/20201014050921.5591-9-leo.yan@linaro.org
      91d933c2
    • perf c2c: Correct LLC load hit metrics · 77c15869
      Authored by Leo Yan
      "rmt_hit" is accounted into two metrics: it is accounted into the
      metric "LLC Ld Miss" (see the function llc_miss() for the "llcmiss"
      calculation), and it is also accounted into the metric "LLC Load Hit".
      Taken literally, it is contradictory that "rmt_hit" counts towards both
      "LLC Ld Miss" (LLC miss) and "LLC Load Hit" (LLC hit).
      
      This easily introduces confusion: "LLC Load Hit" gives the impression
      that all items belonging to it are LLC hits; in fact "rmt_hit" is an
      LLC miss that hits a remote cache.
      
      To give clear semantics to the metric "LLC Load Hit", move "rmt_hit"
      out of it and change "LLC Load Hit" to contain two items:
      
        LLC Load Hit = LLC's hit ("ld_llchit") + LLC's hitm ("lcl_hitm")
      
      For output alignment, adjust the header for "LLC Load Hit".
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Tested-by: Joe Mario <jmario@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Link: https://lore.kernel.org/r/20201014050921.5591-8-leo.yan@linaro.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      77c15869
    • perf c2c: Change header for LLC local hit · ed626a3e
      Authored by Leo Yan
      Replace the header string "Lcl" with "LclHit", which expresses more
      explicitly that the event type is an LLC local hit.
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Tested-by: Joe Mario <jmario@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Link: https://lore.kernel.org/r/20201014050921.5591-7-leo.yan@linaro.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      ed626a3e
    • perf c2c: Use more explicit headers for HITM · 0fbe2fe9
      Authored by Leo Yan
      Local and remote HITM use the headers 'Lcl' and 'Rmt' respectively.
      If we want to extend the tool to display these two dimensions under
      any one metric, users cannot understand the semantics based only on
      the header string 'Lcl' or 'Rmt'.
      
      To express the meaning of the HITM items explicitly, this patch changes
      the header strings to "LclHitm" and "RmtHitm"; these strings are more
      readable and allow extending metrics that use the HITM items.
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Tested-by: Joe Mario <jmario@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Link: https://lore.kernel.org/r/20201014050921.5591-6-leo.yan@linaro.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      0fbe2fe9
    • perf c2c: Change header from "LLC Load Hitm" to "Load Hitm" · fdd32d7e
      Authored by Leo Yan
      The metric "LLC Load Hitm" contains two items: one is "local Hitm" and
      the other is "remote Hitm".
      
      "local Hitm" means an L3 hit that was serviced by another processor
      core with a cross-core snoop where modified copies were found; there is
      no doubt that "local Hitm" belongs to LLC accesses.
      
      But "remote Hitm", based on the code in util/mem-events, is the event
      for a remote cache hit that was serviced by another processor core with
      modified copies.  Thus remote Hitm is a remote cache hit and is
      actually an LLC load miss.
      
      Now the display format gives users the impression that "local Hitm" and
      "remote Hitm" both belong to the LLC load, but as described above this
      is not the case.
      
      This patch changes the header from "LLC Load Hitm" to "Load Hitm"; this
      avoids giving the wrong impression that all Hitm items belong to the LLC.
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Tested-by: Joe Mario <jmario@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Link: https://lore.kernel.org/r/20201014050921.5591-5-leo.yan@linaro.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      fdd32d7e
    • perf c2c: Organize metrics based on memory hierarchy · 6d662d73
      Authored by Leo Yan
      The metrics are not organized based on the memory hierarchy, i.e. the
      tool doesn't order the metrics from the closest node (e.g. L1/L2 cache)
      to the farthest node (e.g. L3 cache and DRAM).
      
      To output the metrics in a friendlier form, this patch refines the
      metric order based on the memory hierarchy:
      
        "Core Load Hit" => "LLC Load Hit" => "LLC Ld Miss" => "Load Dram"
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Tested-by: Joe Mario <jmario@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Link: https://lore.kernel.org/r/20201014050921.5591-4-leo.yan@linaro.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      6d662d73
    • perf c2c: Display "Total Stores" as a standalone metrics · 4f28641b
      Authored by Leo Yan
      The total stores count is displayed under the metric "Store Reference";
      to output it in the same format as total records and all loads, extract
      the total stores number as a standalone metric "Total Stores".
      
      After this patch, the tool shows the summary numbers ("Total records",
      "Total loads", "Total Stores") in a unified form.
      
      Before:
      
        #        ----------- Cacheline ----------      Tot  ----- LLC Load Hitm -----    Total    Total  ---- Store Reference ----  --- Load Dram ----      LLC  ----- Core Load Hit -----  -- LLC Load Hit --
        # Index             Address  Node  PA cnt     Hitm    Total      Lcl      Rmt  records    Loads    Total    L1Hit   L1Miss       Lcl       Rmt  Ld Miss       FB       L1       L2       Llc       Rmt
        # .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  .......  ........  ........  .......  .......  .......  .......  ........  ........
        #
              0      0x55f07d580100     0    1499   85.89%      481      481        0     7243     3879     3364     2599      765         0         0        0      548     2615       66       169         0
              1      0x55f07d580080     0       1   13.93%       78       78        0      664      664        0        0        0         0         0        0      187      361       27        11         0
              2      0x55f07d5800c0     0       1    0.18%        1        1        0      405      405        0        0        0         0         0        0      131        0       10       263         0
      
      After:
      
        #        ----------- Cacheline ----------      Tot  ----- LLC Load Hitm -----    Total    Total    Total  ---- Stores ----  --- Load Dram ----      LLC  ----- Core Load Hit -----  -- LLC Load Hit --
        # Index             Address  Node  PA cnt     Hitm    Total      Lcl      Rmt  records    Loads   Stores    L1Hit   L1Miss       Lcl       Rmt  Ld Miss       FB       L1       L2       Llc       Rmt
        # .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  .......  ........  ........  .......  .......  .......  .......  ........  ........
        #
              0      0x55f07d580100     0    1499   85.89%      481      481        0     7243     3879     3364     2599      765         0         0        0      548     2615       66       169         0
              1      0x55f07d580080     0       1   13.93%       78       78        0      664      664        0        0        0         0         0        0      187      361       27        11         0
              2      0x55f07d5800c0     0       1    0.18%        1        1        0      405      405        0        0        0         0         0        0      131        0       10       263         0
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Tested-by: Joe Mario <jmario@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Link: https://lore.kernel.org/r/20201014050921.5591-3-leo.yan@linaro.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      4f28641b
    • perf c2c: Display the total numbers continuously · b596e979
      Authored by Leo Yan
      When viewing the statistics in "breakdown" mode, it's good to show the
      summary numbers for the total records, all stores and all loads first;
      the subsequent columns can then be used to break them down into more
      detailed items.
      
      To achieve this, this patch displays the summary numbers for
      records/stores/loads contiguously and places them before the breakdown
      items; this allows users to easily read the summarized statistics.
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Tested-by: Joe Mario <jmario@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Link: https://lore.kernel.org/r/20201014050921.5591-2-leo.yan@linaro.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      b596e979
    • perf bench: Use condition variables in numa. · f9299385
      Authored by Ian Rogers
      The existing approach to synchronization between threads in the numa
      benchmark uses unbalanced mutexes.
      
      This synchronization causes thread sanitizer to warn of locks being
      taken twice on a thread without an unlock, as well as unlocks with no
      corresponding locks.
      
      This change replaces the synchronization with more regular condition
      variables.
      
      While this fixes one class of thread sanitizer warnings, there still
      remain warnings of data races due to threads reading and writing shared
      memory without any atomics.
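
      A minimal sketch of the balanced lock/condition-variable pattern this
      moves towards (not the benchmark code itself): every thread locks and
      unlocks the same mutex on its own stack, and waits on a condition
      variable until all workers have arrived.

        /* build with: cc demo.c -lpthread */
        #include <pthread.h>
        #include <stdio.h>

        #define NR_THREADS 4

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
        static int nr_ready;

        static void *worker(void *arg)
        {
                pthread_mutex_lock(&lock);
                if (++nr_ready == NR_THREADS)
                        pthread_cond_broadcast(&cond);           /* last one wakes everybody */
                else
                        while (nr_ready < NR_THREADS)
                                pthread_cond_wait(&cond, &lock); /* atomically drops the lock while waiting */
                pthread_mutex_unlock(&lock);

                printf("thread %ld past the barrier\n", (long)arg);
                return NULL;
        }

        int main(void)
        {
                pthread_t th[NR_THREADS];
                long i;

                for (i = 0; i < NR_THREADS; i++)
                        pthread_create(&th[i], NULL, worker, (void *)i);
                for (i = 0; i < NR_THREADS; i++)
                        pthread_join(th[i], NULL);
                return 0;
        }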
      
      Committer testing:
      
        Basic run on a non-NUMA machine.
      
        # perf bench numa
      
                # List of available benchmarks for collection 'numa':
      
                   mem: Benchmark for NUMA workloads
                   all: Run all NUMA benchmarks
      
        # perf bench numa all
        # Running numa/mem benchmark...
      
         # Running main, "perf bench numa numa-mem"
         #
         # Running test on: Linux five 5.8.12-200.fc32.x86_64 #1 SMP Mon Sep 28 12:17:31 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
         #
      
         # Running RAM-bw-local, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp  1 --no-data_rand_walk"
                 20.076 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.073 secs average thread-runtime
                  0.190 % difference between max/avg runtime
                241.828 GB data processed, per thread
                241.828 GB data processed, total
                  0.083 nsecs/byte/thread runtime
                 12.045 GB/sec/thread speed
                 12.045 GB/sec total speed
      
         # Running RAM-bw-local-NOTHP, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp  1 --no-data_rand_walk --thp -1"
                 20.045 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.014 secs average thread-runtime
                  0.111 % difference between max/avg runtime
                234.304 GB data processed, per thread
                234.304 GB data processed, total
                  0.086 nsecs/byte/thread runtime
                 11.689 GB/sec/thread speed
                 11.689 GB/sec total speed
      
         # Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp  1 --no-data_rand_walk"
      
        Test not applicable, system has only 1 nodes.
      
         # Running RAM-bw-local-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 0x2 -s 20 -zZq --thp  1 --no-data_rand_walk"
                 20.138 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.121 secs average thread-runtime
                  0.342 % difference between max/avg runtime
                135.961 GB data processed, per thread
                271.922 GB data processed, total
                  0.148 nsecs/byte/thread runtime
                  6.752 GB/sec/thread speed
                 13.503 GB/sec total speed
      
         # Running RAM-bw-remote-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 1x2 -s 20 -zZq --thp  1 --no-data_rand_walk"
      
        Test not applicable, system has only 1 nodes.
      
         # Running RAM-bw-cross, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,8 -M 1,0 -s 20 -zZq --thp  1 --no-data_rand_walk"
      
        Test not applicable, system has only 1 nodes.
      
         # Running  1x3-convergence, "perf bench numa mem -p 1 -t 3 -P 512 -s 100 -zZ0qcm --thp  1"
                  0.747 secs latency to NUMA-converge
                  0.747 secs slowest (max) thread-runtime
                  0.000 secs fastest (min) thread-runtime
                  0.714 secs average thread-runtime
                 50.000 % difference between max/avg runtime
                  3.228 GB data processed, per thread
                  9.683 GB data processed, total
                  0.231 nsecs/byte/thread runtime
                  4.321 GB/sec/thread speed
                 12.964 GB/sec total speed
      
         # Running  1x4-convergence, "perf bench numa mem -p 1 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
                  1.127 secs latency to NUMA-converge
                  1.127 secs slowest (max) thread-runtime
                  1.000 secs fastest (min) thread-runtime
                  1.089 secs average thread-runtime
                  5.624 % difference between max/avg runtime
                  3.765 GB data processed, per thread
                 15.062 GB data processed, total
                  0.299 nsecs/byte/thread runtime
                  3.342 GB/sec/thread speed
                 13.368 GB/sec total speed
      
         # Running  1x6-convergence, "perf bench numa mem -p 1 -t 6 -P 1020 -s 100 -zZ0qcm --thp  1"
                  1.003 secs latency to NUMA-converge
                  1.003 secs slowest (max) thread-runtime
                  0.000 secs fastest (min) thread-runtime
                  0.889 secs average thread-runtime
                 50.000 % difference between max/avg runtime
                  2.141 GB data processed, per thread
                 12.847 GB data processed, total
                  0.469 nsecs/byte/thread runtime
                  2.134 GB/sec/thread speed
                 12.805 GB/sec total speed
      
         # Running  2x3-convergence, "perf bench numa mem -p 2 -t 3 -P 1020 -s 100 -zZ0qcm --thp  1"
                  1.814 secs latency to NUMA-converge
                  1.814 secs slowest (max) thread-runtime
                  1.000 secs fastest (min) thread-runtime
                  1.716 secs average thread-runtime
                 22.440 % difference between max/avg runtime
                  3.747 GB data processed, per thread
                 22.483 GB data processed, total
                  0.484 nsecs/byte/thread runtime
                  2.065 GB/sec/thread speed
                 12.393 GB/sec total speed
      
         # Running  3x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp  1"
                  2.065 secs latency to NUMA-converge
                  2.065 secs slowest (max) thread-runtime
                  1.000 secs fastest (min) thread-runtime
                  1.947 secs average thread-runtime
                 25.788 % difference between max/avg runtime
                  2.855 GB data processed, per thread
                 25.694 GB data processed, total
                  0.723 nsecs/byte/thread runtime
                  1.382 GB/sec/thread speed
                 12.442 GB/sec total speed
      
         # Running  4x4-convergence, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
                  1.912 secs latency to NUMA-converge
                  1.912 secs slowest (max) thread-runtime
                  1.000 secs fastest (min) thread-runtime
                  1.775 secs average thread-runtime
                 23.852 % difference between max/avg runtime
                  1.479 GB data processed, per thread
                 23.668 GB data processed, total
                  1.293 nsecs/byte/thread runtime
                  0.774 GB/sec/thread speed
                 12.378 GB/sec total speed
      
         # Running  4x4-convergence-NOTHP, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp  1 --thp -1"
                  1.783 secs latency to NUMA-converge
                  1.783 secs slowest (max) thread-runtime
                  1.000 secs fastest (min) thread-runtime
                  1.633 secs average thread-runtime
                 21.960 % difference between max/avg runtime
                  1.345 GB data processed, per thread
                 21.517 GB data processed, total
                  1.326 nsecs/byte/thread runtime
                  0.754 GB/sec/thread speed
                 12.067 GB/sec total speed
      
         # Running  4x6-convergence, "perf bench numa mem -p 4 -t 6 -P 1020 -s 100 -zZ0qcm --thp  1"
                  5.396 secs latency to NUMA-converge
                  5.396 secs slowest (max) thread-runtime
                  4.000 secs fastest (min) thread-runtime
                  4.928 secs average thread-runtime
                 12.937 % difference between max/avg runtime
                  2.721 GB data processed, per thread
                 65.306 GB data processed, total
                  1.983 nsecs/byte/thread runtime
                  0.504 GB/sec/thread speed
                 12.102 GB/sec total speed
      
         # Running  4x8-convergence, "perf bench numa mem -p 4 -t 8 -P 512 -s 100 -zZ0qcm --thp  1"
                  3.121 secs latency to NUMA-converge
                  3.121 secs slowest (max) thread-runtime
                  2.000 secs fastest (min) thread-runtime
                  2.836 secs average thread-runtime
                 17.962 % difference between max/avg runtime
                  1.194 GB data processed, per thread
                 38.192 GB data processed, total
                  2.615 nsecs/byte/thread runtime
                  0.382 GB/sec/thread speed
                 12.236 GB/sec total speed
      
         # Running  8x4-convergence, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
                  4.302 secs latency to NUMA-converge
                  4.302 secs slowest (max) thread-runtime
                  3.000 secs fastest (min) thread-runtime
                  4.045 secs average thread-runtime
                 15.133 % difference between max/avg runtime
                  1.631 GB data processed, per thread
                 52.178 GB data processed, total
                  2.638 nsecs/byte/thread runtime
                  0.379 GB/sec/thread speed
                 12.128 GB/sec total speed
      
         # Running  8x4-convergence-NOTHP, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp  1 --thp -1"
                  4.418 secs latency to NUMA-converge
                  4.418 secs slowest (max) thread-runtime
                  3.000 secs fastest (min) thread-runtime
                  4.104 secs average thread-runtime
                 16.045 % difference between max/avg runtime
                  1.664 GB data processed, per thread
                 53.254 GB data processed, total
                  2.655 nsecs/byte/thread runtime
                  0.377 GB/sec/thread speed
                 12.055 GB/sec total speed
      
         # Running  3x1-convergence, "perf bench numa mem -p 3 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
                  0.973 secs latency to NUMA-converge
                  0.973 secs slowest (max) thread-runtime
                  0.000 secs fastest (min) thread-runtime
                  0.955 secs average thread-runtime
                 50.000 % difference between max/avg runtime
                  4.124 GB data processed, per thread
                 12.372 GB data processed, total
                  0.236 nsecs/byte/thread runtime
                  4.238 GB/sec/thread speed
                 12.715 GB/sec total speed
      
         # Running  4x1-convergence, "perf bench numa mem -p 4 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
                  0.820 secs latency to NUMA-converge
                  0.820 secs slowest (max) thread-runtime
                  0.000 secs fastest (min) thread-runtime
                  0.808 secs average thread-runtime
                 50.000 % difference between max/avg runtime
                  2.555 GB data processed, per thread
                 10.220 GB data processed, total
                  0.321 nsecs/byte/thread runtime
                  3.117 GB/sec/thread speed
                 12.468 GB/sec total speed
      
         # Running  8x1-convergence, "perf bench numa mem -p 8 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
                  0.667 secs latency to NUMA-converge
                  0.667 secs slowest (max) thread-runtime
                  0.000 secs fastest (min) thread-runtime
                  0.607 secs average thread-runtime
                 50.000 % difference between max/avg runtime
                  1.009 GB data processed, per thread
                  8.069 GB data processed, total
                  0.661 nsecs/byte/thread runtime
                  1.512 GB/sec/thread speed
                 12.095 GB/sec total speed
      
         # Running 16x1-convergence, "perf bench numa mem -p 16 -t 1 -P 256 -s 100 -zZ0qcm --thp  1"
                  1.546 secs latency to NUMA-converge
                  1.546 secs slowest (max) thread-runtime
                  1.000 secs fastest (min) thread-runtime
                  1.485 secs average thread-runtime
                 17.664 % difference between max/avg runtime
                  1.162 GB data processed, per thread
                 18.594 GB data processed, total
                  1.331 nsecs/byte/thread runtime
                  0.752 GB/sec/thread speed
                 12.025 GB/sec total speed
      
         # Running 32x1-convergence, "perf bench numa mem -p 32 -t 1 -P 128 -s 100 -zZ0qcm --thp  1"
                  0.812 secs latency to NUMA-converge
                  0.812 secs slowest (max) thread-runtime
                  0.000 secs fastest (min) thread-runtime
                  0.739 secs average thread-runtime
                 50.000 % difference between max/avg runtime
                  0.309 GB data processed, per thread
                  9.874 GB data processed, total
                  2.630 nsecs/byte/thread runtime
                  0.380 GB/sec/thread speed
                 12.166 GB/sec total speed
      
         # Running  2x1-bw-process, "perf bench numa mem -p 2 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
                 20.044 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.020 secs average thread-runtime
                  0.109 % difference between max/avg runtime
                125.750 GB data processed, per thread
                251.501 GB data processed, total
                  0.159 nsecs/byte/thread runtime
                  6.274 GB/sec/thread speed
                 12.548 GB/sec total speed
      
         # Running  3x1-bw-process, "perf bench numa mem -p 3 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
                 20.148 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.090 secs average thread-runtime
                  0.367 % difference between max/avg runtime
                 85.267 GB data processed, per thread
                255.800 GB data processed, total
                  0.236 nsecs/byte/thread runtime
                  4.232 GB/sec/thread speed
                 12.696 GB/sec total speed
      
         # Running  4x1-bw-process, "perf bench numa mem -p 4 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
                 20.169 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.100 secs average thread-runtime
                  0.419 % difference between max/avg runtime
                 63.144 GB data processed, per thread
                252.576 GB data processed, total
                  0.319 nsecs/byte/thread runtime
                  3.131 GB/sec/thread speed
                 12.523 GB/sec total speed
      
         # Running  8x1-bw-process, "perf bench numa mem -p 8 -t 1 -P  512 -s 20 -zZ0q --thp  1"
                 20.175 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.107 secs average thread-runtime
                  0.433 % difference between max/avg runtime
                 31.267 GB data processed, per thread
                250.133 GB data processed, total
                  0.645 nsecs/byte/thread runtime
                  1.550 GB/sec/thread speed
                 12.398 GB/sec total speed
      
         # Running  8x1-bw-process-NOTHP, "perf bench numa mem -p 8 -t 1 -P  512 -s 20 -zZ0q --thp  1 --thp -1"
                 20.216 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.113 secs average thread-runtime
                  0.535 % difference between max/avg runtime
                 30.998 GB data processed, per thread
                247.981 GB data processed, total
                  0.652 nsecs/byte/thread runtime
                  1.533 GB/sec/thread speed
                 12.266 GB/sec total speed
      
         # Running 16x1-bw-process, "perf bench numa mem -p 16 -t 1 -P 256 -s 20 -zZ0q --thp  1"
                 20.234 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.174 secs average thread-runtime
                  0.577 % difference between max/avg runtime
                 15.377 GB data processed, per thread
                246.039 GB data processed, total
                  1.316 nsecs/byte/thread runtime
                  0.760 GB/sec/thread speed
                 12.160 GB/sec total speed
      
         # Running  1x4-bw-thread, "perf bench numa mem -p 1 -t 4 -T 256 -s 20 -zZ0q --thp  1"
                 20.040 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.028 secs average thread-runtime
                  0.099 % difference between max/avg runtime
                 66.832 GB data processed, per thread
                267.328 GB data processed, total
                  0.300 nsecs/byte/thread runtime
                  3.335 GB/sec/thread speed
                 13.340 GB/sec total speed
      
         # Running  1x8-bw-thread, "perf bench numa mem -p 1 -t 8 -T 256 -s 20 -zZ0q --thp  1"
                 20.064 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.034 secs average thread-runtime
                  0.160 % difference between max/avg runtime
                 32.911 GB data processed, per thread
                263.286 GB data processed, total
                  0.610 nsecs/byte/thread runtime
                  1.640 GB/sec/thread speed
                 13.122 GB/sec total speed
      
         # Running 1x16-bw-thread, "perf bench numa mem -p 1 -t 16 -T 128 -s 20 -zZ0q --thp  1"
                 20.092 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.052 secs average thread-runtime
                  0.230 % difference between max/avg runtime
                 16.131 GB data processed, per thread
                258.088 GB data processed, total
                  1.246 nsecs/byte/thread runtime
                  0.803 GB/sec/thread speed
                 12.845 GB/sec total speed
      
         # Running 1x32-bw-thread, "perf bench numa mem -p 1 -t 32 -T 64 -s 20 -zZ0q --thp  1"
                 20.099 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.063 secs average thread-runtime
                  0.247 % difference between max/avg runtime
                  7.962 GB data processed, per thread
                254.773 GB data processed, total
                  2.525 nsecs/byte/thread runtime
                  0.396 GB/sec/thread speed
                 12.676 GB/sec total speed
      
         # Running  2x3-bw-process, "perf bench numa mem -p 2 -t 3 -P 512 -s 20 -zZ0q --thp  1"
                 20.150 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.120 secs average thread-runtime
                  0.372 % difference between max/avg runtime
                 44.827 GB data processed, per thread
                268.960 GB data processed, total
                  0.450 nsecs/byte/thread runtime
                  2.225 GB/sec/thread speed
                 13.348 GB/sec total speed
      
         # Running  4x4-bw-process, "perf bench numa mem -p 4 -t 4 -P 512 -s 20 -zZ0q --thp  1"
                 20.258 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.168 secs average thread-runtime
                  0.636 % difference between max/avg runtime
                 17.079 GB data processed, per thread
                273.263 GB data processed, total
                  1.186 nsecs/byte/thread runtime
                  0.843 GB/sec/thread speed
                 13.489 GB/sec total speed
      
         # Running  4x6-bw-process, "perf bench numa mem -p 4 -t 6 -P 512 -s 20 -zZ0q --thp  1"
                 20.559 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.382 secs average thread-runtime
                  1.359 % difference between max/avg runtime
                 10.758 GB data processed, per thread
                258.201 GB data processed, total
                  1.911 nsecs/byte/thread runtime
                  0.523 GB/sec/thread speed
                 12.559 GB/sec total speed
      
         # Running  4x8-bw-process, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp  1"
                 20.744 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.516 secs average thread-runtime
                  1.792 % difference between max/avg runtime
                  8.069 GB data processed, per thread
                258.201 GB data processed, total
                  2.571 nsecs/byte/thread runtime
                  0.389 GB/sec/thread speed
                 12.447 GB/sec total speed
      
         # Running  4x8-bw-process-NOTHP, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp  1 --thp -1"
                 20.855 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.561 secs average thread-runtime
                  2.050 % difference between max/avg runtime
                  8.069 GB data processed, per thread
                258.201 GB data processed, total
                  2.585 nsecs/byte/thread runtime
                  0.387 GB/sec/thread speed
                 12.381 GB/sec total speed
      
         # Running  3x3-bw-process, "perf bench numa mem -p 3 -t 3 -P 512 -s 20 -zZ0q --thp  1"
                 20.134 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.077 secs average thread-runtime
                  0.333 % difference between max/avg runtime
                 28.091 GB data processed, per thread
                252.822 GB data processed, total
                  0.717 nsecs/byte/thread runtime
                  1.395 GB/sec/thread speed
                 12.557 GB/sec total speed
      
         # Running  5x5-bw-process, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp  1"
                 20.588 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.375 secs average thread-runtime
                  1.427 % difference between max/avg runtime
                 10.177 GB data processed, per thread
                254.436 GB data processed, total
                  2.023 nsecs/byte/thread runtime
                  0.494 GB/sec/thread speed
                 12.359 GB/sec total speed
      
         # Running 2x16-bw-process, "perf bench numa mem -p 2 -t 16 -P 512 -s 20 -zZ0q --thp  1"
                 20.657 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.429 secs average thread-runtime
                  1.589 % difference between max/avg runtime
                  8.170 GB data processed, per thread
                261.429 GB data processed, total
                  2.528 nsecs/byte/thread runtime
                  0.395 GB/sec/thread speed
                 12.656 GB/sec total speed
      
         # Running 1x32-bw-process, "perf bench numa mem -p 1 -t 32 -P 2048 -s 20 -zZ0q --thp  1"
                 22.981 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 21.996 secs average thread-runtime
                  6.486 % difference between max/avg runtime
                  8.863 GB data processed, per thread
                283.606 GB data processed, total
                  2.593 nsecs/byte/thread runtime
                  0.386 GB/sec/thread speed
                 12.341 GB/sec total speed
      
         # Running numa02-bw, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp  1"
                 20.047 secs slowest (max) thread-runtime
                 19.000 secs fastest (min) thread-runtime
                 20.026 secs average thread-runtime
                  2.611 % difference between max/avg runtime
                  8.441 GB data processed, per thread
                270.111 GB data processed, total
                  2.375 nsecs/byte/thread runtime
                  0.421 GB/sec/thread speed
                 13.474 GB/sec total speed
      
         # Running numa02-bw-NOTHP, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp  1 --thp -1"
                 20.088 secs slowest (max) thread-runtime
                 19.000 secs fastest (min) thread-runtime
                 20.025 secs average thread-runtime
                  2.709 % difference between max/avg runtime
                  8.411 GB data processed, per thread
                269.142 GB data processed, total
                  2.388 nsecs/byte/thread runtime
                  0.419 GB/sec/thread speed
                 13.398 GB/sec total speed
      
         # Running numa01-bw-thread, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp  1"
                 20.293 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.175 secs average thread-runtime
                  0.721 % difference between max/avg runtime
                  7.918 GB data processed, per thread
                253.374 GB data processed, total
                  2.563 nsecs/byte/thread runtime
                  0.390 GB/sec/thread speed
                 12.486 GB/sec total speed
      
         # Running numa01-bw-thread-NOTHP, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp  1 --thp -1"
                 20.411 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.226 secs average thread-runtime
                  1.006 % difference between max/avg runtime
                  7.931 GB data processed, per thread
                253.778 GB data processed, total
                  2.574 nsecs/byte/thread runtime
                  0.389 GB/sec/thread speed
                 12.434 GB/sec total speed
      
        #
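
       The derived throughput figures in each run follow directly from the raw
       numbers. As an illustration, a minimal, hypothetical C helper (not part
       of perf bench itself) shows the arithmetic, assuming "GB" here means
       10^9 bytes, checked against the numa02-bw run above:

         #include <stdio.h>

         /*
          * Recompute the derived metrics of one run from the raw numbers.
          * Hypothetical helper for illustration only, not perf bench code.
          */
         void print_derived(double max_secs, double gb_thread, double gb_total)
         {
                 printf("%8.3f nsecs/byte/thread runtime\n",
                        max_secs * 1e9 / (gb_thread * 1e9));
                 printf("%8.3f GB/sec/thread speed\n", gb_thread / max_secs);
                 printf("%8.3f GB/sec total speed\n", gb_total / max_secs);
         }

         int main(void)
         {
                 /* raw numbers from the numa02-bw run above */
                 print_derived(20.047, 8.441, 270.111); /* ~2.375, ~0.421, ~13.474 */
                 return 0;
         }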
       Signed-off-by: Ian Rogers <irogers@google.com>
       Acked-by: Jiri Olsa <jolsa@redhat.com>
       Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
       Link: https://lore.kernel.org/r/20201012161611.366482-1-irogers@google.com
       Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      f9299385
    • J
      perf jevents: Fix event code for events referencing std arch events · caf7f968
      John Garry 提交于
       The event code for events referencing std arch events is incorrectly
       evaluated in json_events().
       
       The issue is that je.event is evaluated properly in try_fixup(), but is
       later NULLified by the real_event() call, because "event" may be NULL.
       
       Fix this by setting "event" to je.event in try_fixup().
       
       Also remove support for overwriting the event code for events using std
       arch events, as it is not used.
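       
       For illustration, a minimal C sketch of the bug pattern; all names and
       bodies are hypothetical stand-ins, not the actual jevents.c code:
       
         #include <stdio.h>
         
         struct json_event {
                 const char *event;      /* event code string, e.g. "event=0x2b" */
         };
         
         /* je.event gets the proper code; the fix also keeps the caller's
          * "event" pointer in sync so it is no longer NULL afterwards */
         void try_fixup(struct json_event *je, const char **event)
         {
                 je->event = "event=0x2b";       /* je.event evaluated properly */
                 *event = je->event;             /* the fix */
         }
         
         /* mirrors the described behaviour: a NULL "event" stays NULL */
         const char *real_event(const char *name, const char *event)
         {
                 (void)name;
                 return event;
         }
         
         int main(void)
         {
                 struct json_event je = { .event = NULL };
                 const char *event = NULL;
         
                 try_fixup(&je, &event);
                 /* without "*event = je->event", the fixed-up code would be lost */
                 const char *e = real_event("std_event", event);
                 printf("%s\n", e ? e : "(lost)");
                 return 0;
         }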
       Signed-off-by: John Garry <john.garry@huawei.com>
       Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
       Acked-by: Jiri Olsa <jolsa@redhat.com>
       Link: https://lore.kernel.org/r/1602170368-11892-1-git-send-email-john.garry@huawei.com
       Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      caf7f968
    • J
      perf diff: Support hot streams comparison · 2a09a84c
      Jin Yao 提交于
       This patch enables perf diff with the "--stream" option.
       
       "--stream": Enable hot streams comparison
       
       Now let's see an example.
      
      perf record -b ...      Generate perf.data.old with branch data
      perf record -b ...      Generate perf.data with branch data
      perf diff --stream
      
      [ Matched hot streams ]
      
      hot chain pair 1:
                  cycles: 1, hits: 27.77%                  cycles: 1, hits: 9.24%
              ---------------------------              --------------------------
                            main div.c:39                           main div.c:39
                            main div.c:44                           main div.c:44
      
      hot chain pair 2:
                 cycles: 34, hits: 20.06%                cycles: 27, hits: 16.98%
              ---------------------------              --------------------------
                __random_r random_r.c:360               __random_r random_r.c:360
                __random_r random_r.c:388               __random_r random_r.c:388
                __random_r random_r.c:388               __random_r random_r.c:388
                __random_r random_r.c:380               __random_r random_r.c:380
                __random_r random_r.c:357               __random_r random_r.c:357
                    __random random.c:293                   __random random.c:293
                    __random random.c:293                   __random random.c:293
                    __random random.c:291                   __random random.c:291
                    __random random.c:291                   __random random.c:291
                    __random random.c:291                   __random random.c:291
                    __random random.c:288                   __random random.c:288
                           rand rand.c:27                          rand rand.c:27
                           rand rand.c:26                          rand rand.c:26
                                 rand@plt                                rand@plt
                                 rand@plt                                rand@plt
                    compute_flag div.c:25                   compute_flag div.c:25
                    compute_flag div.c:22                   compute_flag div.c:22
                            main div.c:40                           main div.c:40
                            main div.c:40                           main div.c:40
                            main div.c:39                           main div.c:39
      
      hot chain pair 3:
                   cycles: 9, hits: 4.48%                  cycles: 6, hits: 4.51%
              ---------------------------              --------------------------
                __random_r random_r.c:360               __random_r random_r.c:360
                __random_r random_r.c:388               __random_r random_r.c:388
                __random_r random_r.c:388               __random_r random_r.c:388
                __random_r random_r.c:380               __random_r random_r.c:380
      
      [ Hot streams in old perf data only ]
      
      hot chain 1:
                  cycles: 18, hits: 6.75%
               --------------------------
                __random_r random_r.c:360
                __random_r random_r.c:388
                __random_r random_r.c:388
                __random_r random_r.c:380
                __random_r random_r.c:357
                    __random random.c:293
                    __random random.c:293
                    __random random.c:291
                    __random random.c:291
                    __random random.c:291
                    __random random.c:288
                           rand rand.c:27
                           rand rand.c:26
                                 rand@plt
                                 rand@plt
                    compute_flag div.c:25
                    compute_flag div.c:22
                            main div.c:40
      
      hot chain 2:
                  cycles: 29, hits: 2.78%
               --------------------------
                    compute_flag div.c:22
                            main div.c:40
                            main div.c:40
                            main div.c:39
      
      [ Hot streams in new perf data only ]
      
      hot chain 1:
                                                           cycles: 4, hits: 4.54%
                                                       --------------------------
                                                                    main div.c:42
                                                            compute_flag div.c:28
      
      hot chain 2:
                                                           cycles: 5, hits: 3.51%
                                                       --------------------------
                                                                    main div.c:39
                                                                    main div.c:44
                                                                    main div.c:42
                                                            compute_flag div.c:28
       Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
       Acked-by: Jiri Olsa <jolsa@kernel.org>
       Link: https://lore.kernel.org/r/20201009022845.13141-8-yao.jin@linux.intel.com
       Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      2a09a84c
    • J
      perf streams: Report hot streams · 5bbd6bad
      Jin Yao 提交于
       We show the streams separately, divided into the following sections.
      
      1. "Matched hot streams"
      
      2. "Hot streams in old perf data only"
      
      3. "Hot streams in new perf data only".
      
      For each stream, we report the cycles and hot percent (hits%).
      
      For example,
      
           cycles: 2, hits: 4.08%
       --------------------------
                    main div.c:42
            compute_flag div.c:28
       Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
       Acked-by: Jiri Olsa <jolsa@kernel.org>
       Link: https://lore.kernel.org/r/20201009022845.13141-7-yao.jin@linux.intel.com
       Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      5bbd6bad
    • J
      perf streams: Calculate the sum of total streams hits · 28904f4d
      Jin Yao 提交于
       We have used callchain_node->hit to measure how hot one stream is.
       This patch calculates the sum of hits over all streams.
       
       Thus, in the next patch, we can use the following formula to report the
       hot percent of one stream:
       
       hot percent = callchain_node->hit / sum of total hits
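       
       As an illustration, a minimal C sketch of that computation (types and
       names are hypothetical, not the actual perf code):
       
         #include <stdint.h>
         #include <stdio.h>
         
         /* Hypothetical, simplified view of a stream: only the field needed here. */
         struct stream {
                 uint64_t hit;           /* callchain_node->hit of this stream */
         };
         
         /* sum of hits over all streams of one event */
         uint64_t total_streams_hits(const struct stream *streams, int nr)
         {
                 uint64_t sum = 0;
         
                 for (int i = 0; i < nr; i++)
                         sum += streams[i].hit;
                 return sum;
         }
         
         int main(void)
         {
                 struct stream streams[] = { { 27 }, { 20 }, { 3 } };
                 uint64_t sum = total_streams_hits(streams, 3);
         
                 /* hot percent = callchain_node->hit / sum of total hits */
                 for (int i = 0; i < 3; i++)
                         printf("stream %d: hits %.2f%%\n",
                                i, 100.0 * streams[i].hit / sum);
                 return 0;
         }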
       Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
       Acked-by: Jiri Olsa <jolsa@kernel.org>
       Link: https://lore.kernel.org/r/20201009022845.13141-6-yao.jin@linux.intel.com
       Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      28904f4d
    • J
      perf streams: Link stream pair · fa79aa64
      Jin Yao 提交于
       In the previous patch, we created an evsel_streams for each event, and
       the top N hottest streams are saved in a stream array in evsel_streams.
       
       This patch compares all streams between two evsel_streams.
       
       Once two streams are fully matched, they are linked as a pair. From the
       pair, we know which streams are matched.
       Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
       Acked-by: Jiri Olsa <jolsa@kernel.org>
       Link: https://lore.kernel.org/r/20201009022845.13141-5-yao.jin@linux.intel.com
       Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      fa79aa64
    • J
      perf streams: Compare two streams · 47ef8398
      Jin Yao 提交于
       A stream is the branch history aggregated from the branch records in
       perf samples. For now, we support the callchain as a stream.
       
       If the callchain entries of one stream fully match the callchain
       entries of another stream, we consider the two streams matched.
      
      For example,
      
         cycles: 1, hits: 26.80%                 cycles: 1, hits: 27.30%
         -----------------------                 -----------------------
                   main div.c:39                           main div.c:39
                   main div.c:44                           main div.c:44
      
       The two streams above are matched (we don't consider the case where the
       source code has changed).
       
       The matching logic is: compare the chain strings first; if they don't
       match, fall back to comparing dso addresses.
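       
       A minimal C sketch of that matching logic, with hypothetical simplified
       types (the real perf code works on its own callchain structures):
       
         #include <stdbool.h>
         #include <string.h>
         
         /* Hypothetical, simplified callchain entry of a stream. */
         struct chain_entry {
                 const char *srcline;            /* e.g. "main div.c:39" */
                 const char *dso;
                 unsigned long long addr;
         };
         
         /* compare the chain string first; if that does not match,
          * fall back to the dso + address comparison */
         bool entries_match(const struct chain_entry *a, const struct chain_entry *b)
         {
                 if (a->srcline && b->srcline && !strcmp(a->srcline, b->srcline))
                         return true;
                 return !strcmp(a->dso, b->dso) && a->addr == b->addr;
         }
         
         /* two streams are matched when every callchain entry matches */
         bool streams_match(const struct chain_entry *a, int nr_a,
                            const struct chain_entry *b, int nr_b)
         {
                 if (nr_a != nr_b)
                         return false;
                 for (int i = 0; i < nr_a; i++)
                         if (!entries_match(&a[i], &b[i]))
                                 return false;
                 return true;
         }
         
         int main(void)
         {
                 struct chain_entry a[] = {
                         { "main div.c:39", "div", 0x401000 },
                         { "main div.c:44", "div", 0x401020 },
                 };
                 struct chain_entry b[] = {
                         { "main div.c:39", "div", 0x401000 },
                         { "main div.c:44", "div", 0x401020 },
                 };
         
                 return streams_match(a, 2, b, 2) ? 0 : 1;       /* matched */
         }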
       Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
       Acked-by: Jiri Olsa <jolsa@kernel.org>
       Link: https://lore.kernel.org/r/20201009022845.13141-4-yao.jin@linux.intel.com
       Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      47ef8398
    • J
      perf streams: Get the evsel_streams by evsel_idx · dd1d8418
      Jin Yao 提交于
       In the previous patch, we created the evsel_streams array.
       
       This patch returns the evsel_streams entry specified by evsel_idx.
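       
       A rough sketch of what such a lookup looks like (names and fields are
       hypothetical, not the actual perf code):
       
         #include <stddef.h>
         
         /* Hypothetical shape of the lookup; the real structure also carries
          * the per-event stream array. */
         struct evsel_streams {
                 int evsel_idx;
         };
         
         struct evsel_streams *evsel_streams__get(struct evsel_streams *es,
                                                  int nr_evsel, int evsel_idx)
         {
                 for (int i = 0; i < nr_evsel; i++)
                         if (es[i].evsel_idx == evsel_idx)
                                 return &es[i];
                 return NULL;
         }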
       Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
       Acked-by: Jiri Olsa <jolsa@kernel.org>
       Link: https://lore.kernel.org/r/20201009022845.13141-3-yao.jin@linux.intel.com
       Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      dd1d8418
    • J
      perf streams: Introduce branch history "streams" · 480accbb
      Jin Yao 提交于
       We define a stream as the branch history aggregated from the branch
       records in perf samples. For example, the callchains aggregated from
       the branch records are considered streams. By browsing the hot streams,
       we can understand the hot code paths.
       
       For now, only the callchain is supported as a stream. To measure how
       hot a stream is, we use callchain_node->hit; higher is hotter.
       
       There may be many sampled callchains, so we focus only on the top N
       hottest callchains, where N is a user-defined parameter or a predefined
       default value (nr_streams_max).
       
       This patch creates an evsel_streams array per event and saves the top N
       hottest streams in a stream array.
       
       So now we can get the per-event top N hottest streams.
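       
       A rough sketch of the per-event bookkeeping this describes (field names
       are illustrative, not necessarily those of the real perf structures):
       
         #include <stdint.h>
         
         struct stream {
                 struct callchain_node *cnode;   /* hotness comes from cnode->hit */
                 struct stream *pair;            /* filled in by later patches */
         };
         
         struct evsel_streams {
                 struct stream *streams;         /* top N hottest callchains */
                 int nr_streams_max;             /* N: user defined or default */
                 int nr_streams;                 /* how many were actually kept */
                 int evsel_idx;                  /* which event this array belongs to */
                 uint64_t streams_hits;          /* sum used for the hot percent */
         };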
       Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
       Acked-by: Jiri Olsa <jolsa@kernel.org>
       Link: https://lore.kernel.org/r/20201009022845.13141-2-yao.jin@linux.intel.com
       Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      480accbb