1. 13 Aug 2020, 4 commits
    • perf tools: Rename 'enum dso_kernel_type' to 'enum dso_space_type' · 1c695c88
      Committed by Jiri Olsa
      Rename enum dso_kernel_type to enum dso_space_type, which seems like a
      better fit.
      
      Committer notes:
      
      This is used with 'struct dso'->kernel, which once was a boolean, so
      DSO_SPACE__USER is zero, !zero means some sort of kernel space, be it
      the host kernel space or a guest kernel space.
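      A minimal standalone C sketch of what the committer note describes (the
      enum and member names follow the commit; the file itself is an
      illustration, not the perf source):

      ```c
      #include <assert.h>

      /* Formerly 'enum dso_kernel_type': the values describe which address
       * space a DSO belongs to, hence the rename. */
      enum dso_space_type {
          DSO_SPACE__USER = 0,     /* keeps the old boolean's "false" meaning */
          DSO_SPACE__KERNEL,       /* host kernel space */
          DSO_SPACE__KERNEL_GUEST  /* guest kernel space */
      };

      struct dso {
          enum dso_space_type kernel; /* once a bool: !zero => kernel space */
      };

      int main(void)
      {
          struct dso dso = { .kernel = DSO_SPACE__KERNEL_GUEST };

          /* Old boolean-style "is this kernel space?" checks keep working. */
          assert(dso.kernel != DSO_SPACE__USER);
          return 0;
      }
      ```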
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf test: Allow multiple probes in record+script_probe_vfs_getname.sh · 194cb6b5
      Committed by Michael Petlan
      Sometimes when adding a kprobe by perf, it results in multiple probe
      points, such as the following:
      
        # ./perf probe -l
          probe:vfs_getname    (on getname_flags:73@fs/namei.c with pathname)
          probe:vfs_getname_1  (on getname_flags:73@fs/namei.c with pathname)
          probe:vfs_getname_2  (on getname_flags:73@fs/namei.c with pathname)
        # cat /sys/kernel/debug/tracing/kprobe_events
        p:probe/vfs_getname _text+5501804 pathname=+0(+0(%gpr31)):string
        p:probe/vfs_getname_1 _text+5505388 pathname=+0(+0(%gpr31)):string
        p:probe/vfs_getname_2 _text+5508396 pathname=+0(+0(%gpr31)):string
      
      In this test, we need to record all of them and expect any of them in
      the perf-script output, since it's not clear which one will be used for
      the desired syscall:
      
        # perf stat -e probe:vfs_getname\* -- touch /tmp/nic
      
         Performance counter stats for 'touch /tmp/nic':
      
                      31      probe:vfs_getname_2
                       0      probe:vfs_getname_1
                       1      probe:vfs_getname
             0.001421826 seconds time elapsed
      
             0.001506000 seconds user
             0.000000000 seconds sys
      
      If the test relies only on probe:vfs_getname, it might easily miss the
      relevant data.
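      The matching the test needs can be sketched in shell; the sample line
      below is fabricated for illustration, and only the glob-style regex
      reflects the change:

      ```shell
      # Accept any of probe:vfs_getname, probe:vfs_getname_1, ... in the
      # perf-script output (fabricated sample line for illustration):
      sample='touch 12345 [002] 123.456: probe:vfs_getname_2: pathname="/tmp/nic"'

      if echo "$sample" | grep -Eq 'probe:vfs_getname(_[0-9]+)?:'; then
          echo "PASS"    # any of the autogenerated variants matches
      fi
      ```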
      Signed-off-by: Michael Petlan <mpetlan@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      LPU-Reference: 20200722135845.29958-1-mpetlan@redhat.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf bench mem: Always memset source before memcpy · 1beaef29
      Committed by Vincent Whitchurch
      For memcpy, the source pages are memset to zero only when --cycles is
      used.  This leads to wildly different results with or without --cycles,
      since all source pages are likely to be mapped to the same zero page
      without explicit writes.
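      The idea behind the fix can be shown in a few lines of C (a sketch of
      the principle, not the perf bench code): touching the source buffer
      forces the kernel to back each page with distinct memory instead of the
      shared zero page.

      ```c
      #include <stdlib.h>
      #include <string.h>
      #include <assert.h>

      /* Touch the source before copying so the copy measures real memory
       * traffic; returns 0 on success. */
      static int bench_copy(size_t len)
      {
          char *src = malloc(len), *dst = malloc(len);

          if (!src || !dst)
              return -1;
          memset(src, 1, len);   /* force real page allocation */
          memcpy(dst, src, len); /* now reads distinct, populated pages */
          int ok = dst[len - 1] == 1 ? 0 : -1;
          free(src);
          free(dst);
          return ok;
      }

      int main(void)
      {
          assert(bench_copy(1 << 20) == 0);
          return 0;
      }
      ```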
      
      Before this fix:
      
      $ export cmd="./perf stat -e LLC-loads -- ./perf bench \
        mem memcpy -s 1024MB -l 100 -f default"
      $ $cmd
      
               2,935,826      LLC-loads
             3.821677452 seconds time elapsed
      
      $ $cmd --cycles
      
             217,533,436      LLC-loads
             8.616725985 seconds time elapsed
      
      After this fix:
      
      $ $cmd
      
             214,459,686      LLC-loads
             8.674301124 seconds time elapsed
      
      $ $cmd --cycles
      
             214,758,651      LLC-loads
             8.644480006 seconds time elapsed
      
      Fixes: 47b5757b ("perf bench mem: Move boilerplate memory allocation to the infrastructure")
      Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kernel@axis.com
      Link: http://lore.kernel.org/lkml/20200810133404.30829-1-vincent.whitchurch@axis.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf sched: Prefer sched_waking event when it exists · d566a9c2
      Committed by David Ahern
      Commit fbd705a0 ("sched: Introduce the 'trace_sched_waking'
      tracepoint") added sched_waking tracepoint which should be preferred
      over sched_wakeup when analyzing scheduling delays.
      
      Update 'perf sched record' to collect sched_waking events when they
      exist and fall back to sched_wakeup otherwise. Similarly, update the
      timehist command to skip sched_wakeup events if the session includes
      sched_waking (i.e., sched_waking is preferred over sched_wakeup).
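      The fallback logic can be sketched as a shell helper; the tracefs path
      is the usual one, but the helper itself is hypothetical, not the actual
      perf code:

      ```shell
      pick_wakeup_event() {
          # $1: tracing events directory, normally /sys/kernel/debug/tracing/events
          if [ -d "$1/sched/sched_waking" ]; then
              echo sched:sched_waking   # preferred when the kernel has it
          else
              echo sched:sched_wakeup   # fallback for older kernels
          fi
      }

      # Simulate both kernel generations with temporary directories:
      old=$(mktemp -d); new=$(mktemp -d)
      mkdir -p "$new/sched/sched_waking"
      pick_wakeup_event "$new"   # prints sched:sched_waking
      pick_wakeup_event "$old"   # prints sched:sched_wakeup
      ```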
      Signed-off-by: David Ahern <dsahern@kernel.org>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Link: http://lore.kernel.org/lkml/20200807164844.44870-1-dsahern@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  2. 12 Aug 2020, 5 commits
    • perf bench: Fix a couple of spelling mistakes in options text · f9f95068
      Committed by Colin Ian King
      There are a couple of spelling mistakes in the text. Fix these.
      Signed-off-by: Colin King <colin.king@canonical.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kernel-janitors@vger.kernel.org
      Link: http://lore.kernel.org/lkml/20200812064647.200132-1-colin.king@canonical.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf bench numa: Fix benchmark names · 85372c69
      Committed by Alexander Gordeev
      Standard benchmark names let users know the test's specifics. For
      example, the "2x1-bw-process" name tells that two processes of one
      thread each are run and the RAM bandwidth is measured.
      
      Several benchmark names do not correspond to their actual running
      configuration. Fix that, and also some whitespace and comment
      inconsistencies.
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/6b6f2084f132ee8e9203dc7c32f9deb209b87a68.1597004831.git.agordeev@linux.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf bench numa: Fix number of processes in "2x3-convergence" test · 72d69c2a
      Committed by Alexander Gordeev
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/d949f5f48e17fc816f3beecf8479f1b2480345e4.1597004831.git.agordeev@linux.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf trace beauty: Use the autogenerated protocol family table · f3cf7fa9
      Committed by Arnaldo Carvalho de Melo
      That helps us not to lose new protocol families when they are
      introduced, replacing that hardcoded, dated family->string table.
      
      To recap what this allows us to do:
      
        # perf trace -e syscalls:sys_enter_socket/max-stack=10/ --filter=family==INET --max-events=1
           0.000 fetchmail/41097 syscalls:sys_enter_socket(family: INET, type: DGRAM|CLOEXEC|NONBLOCK, protocol: IP)
                                             __GI___socket (inlined)
                                             reopen (/usr/lib64/libresolv-2.31.so)
                                             send_dg (/usr/lib64/libresolv-2.31.so)
                                             __res_context_send (/usr/lib64/libresolv-2.31.so)
                                             __GI___res_context_query (inlined)
                                             __GI___res_context_search (inlined)
                                             _nss_dns_gethostbyname4_r (/usr/lib64/libnss_dns-2.31.so)
                                             gaih_inet.constprop.0 (/usr/lib64/libc-2.31.so)
                                             __GI_getaddrinfo (inlined)
                                             [0x15cb2] (/usr/bin/fetchmail)
        #
      
      More work is still needed to allow for the more natural strace-like
      syscall name usage instead of the trace event name:
      
        # perf trace -e socket/max-stack=10,family==INET/ --max-events=1
      
      I.e. to allow for modifiers to follow the syscall name and for logical
      expressions to be accepted as filters to use with that syscall, be it as
      trace event filters or BPF based ones.
      
      Using -v we can see how the trace event filter is built:
      
        # perf trace -v -e syscalls:sys_enter_socket/call-graph=dwarf/ --filter=family==INET --max-events=2
        <SNIP>
        New filter for syscalls:sys_enter_socket: (family==0x2) && (common_pid != 41384 && common_pid != 2836)
        <SNIP>
      
        $ tools/perf/trace/beauty/socket.sh | grep -w 2
      	[2] = "INET",
        $
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf trace beauty: Add script to autogenerate socket families table · 58277f50
      Committed by Arnaldo Carvalho de Melo
      To use with 'perf trace', to convert the protocol families to strings,
      e.g:
      
        $ tools/perf/trace/beauty/socket.sh
        static const char *socket_families[] = {
        	[0] = "UNSPEC",
        	[1] = "LOCAL",
        	[2] = "INET",
        	[3] = "AX25",
        	[4] = "IPX",
        	[5] = "APPLETALK",
        	[6] = "NETROM",
        	[7] = "BRIDGE",
        	[8] = "ATMPVC",
        	[9] = "X25",
        	[10] = "INET6",
        	[11] = "ROSE",
        	[12] = "DECnet",
        	[13] = "NETBEUI",
        	[14] = "SECURITY",
        	[15] = "KEY",
        	[16] = "NETLINK",
        	[17] = "PACKET",
        	[18] = "ASH",
        	[19] = "ECONET",
        	[20] = "ATMSVC",
        	[21] = "RDS",
        	[22] = "SNA",
        	[23] = "IRDA",
        	[24] = "PPPOX",
        	[25] = "WANPIPE",
        	[26] = "LLC",
        	[27] = "IB",
        	[28] = "MPLS",
        	[29] = "CAN",
        	[30] = "TIPC",
        	[31] = "BLUETOOTH",
        	[32] = "IUCV",
        	[33] = "RXRPC",
        	[34] = "ISDN",
        	[35] = "PHONET",
        	[36] = "IEEE802154",
        	[37] = "CAIF",
        	[38] = "ALG",
        	[39] = "NFC",
        	[40] = "VSOCK",
        	[41] = "KCM",
        	[42] = "QIPCRTR",
        	[43] = "SMC",
        	[44] = "XDP",
        };
        $
      
      This uses a copy of include/linux/socket.h that is kept in a directory
      used just for these table-generation scripts and for checking whether
      the kernel copy has changed and may have something new for these
      tables.
      
      This allows us to:
      
      - Avoid accessing files outside tools/, in the kernel sources, that may
        be changed in unexpected ways and thus break these scripts.
      
      - Notice when those files change, check that the changes don't break
        these scripts, and update them to automatically pick up new
        definitions, a new socket family, for instance.
      
      - Not add them to tools/include/, where they may end up used while
        building the tools, requiring dragging yet more stuff from the
        kernel or plainly breaking the build in some of the myriad
        environments where perf may be built.
      
      This will replace the previous static array in tools/perf/ that was
      dated and was already missing the AF_KCM, AF_QIPCRTR, AF_SMC and AF_XDP
      families.
      
      The next cset will wire this up to the perf build process.
      
      At some point this must be made into a library to be used in places such
      as libtraceevent, bpftrace, etc.
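      A condensed sketch of how such a generator works (the real script is
      tools/perf/trace/beauty/socket.sh; this version parses a here-document
      instead of the copied socket.h and assumes GNU sed):

      ```shell
      gen_families() {
          # Turn '#define AF_FOO <n>' lines into C array initializers.
          regex='^#define[[:space:]]+AF_([A-Z0-9_]+)[[:space:]]+([0-9]+)'
          printf 'static const char *socket_families[] = {\n'
          sed -nE "s/$regex/\t[\2] = \"\1\",/p"
          printf '};\n'
      }

      cat <<'EOF' | gen_families
      #define AF_UNSPEC 0
      #define AF_INET 2
      #define AF_XDP 44
      EOF
      ```

      Run against the three sample defines above, this emits a
      socket_families[] array with entries for UNSPEC, INET and XDP; a real
      generator would also have to skip defines such as AF_MAX, which this
      deliberately minimal sketch ignores.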
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  3. 07 Aug 2020, 3 commits
    • perf record: Skip side-band event setup if HAVE_LIBBPF_SUPPORT is not set · 1101c872
      Committed by Jin Yao
      We received an error report that perf-record caused a 'Segmentation
      fault' on a newly installed system (e.g. a fresh Ubuntu install).
      
        (gdb) backtrace
        #0  __read_once_size (size=4, res=<synthetic pointer>, p=0x14) at /root/0-jinyao/acme/tools/include/linux/compiler.h:139
        #1  atomic_read (v=0x14) at /root/0-jinyao/acme/tools/include/asm/../../arch/x86/include/asm/atomic.h:28
        #2  refcount_read (r=0x14) at /root/0-jinyao/acme/tools/include/linux/refcount.h:65
        #3  perf_mmap__read_init (map=map@entry=0x0) at mmap.c:177
        #4  0x0000561ce5c0de39 in perf_evlist__poll_thread (arg=0x561ce68584d0) at util/sideband_evlist.c:62
        #5  0x00007fad78491609 in start_thread (arg=<optimized out>) at pthread_create.c:477
        #6  0x00007fad7823c103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      
      The root cause is that evlist__add_bpf_sb_event() just returns 0 if
      HAVE_LIBBPF_SUPPORT is not defined (inline function path), so it does
      not create a valid evsel for the side-band event.
      
      But perf-record still creates the BPF side-band thread to process
      side-band events, and then the error happens.
      
      We can reproduce this issue by removing libelf-dev, e.g.:
      1. apt-get remove libelf-dev
      2. perf record -a -- sleep 1
      
        root@test:~# ./perf record -a -- sleep 1
        perf: Segmentation fault
        Obtained 6 stack frames.
        ./perf(+0x28eee8) [0x5562d6ef6ee8]
        /lib/x86_64-linux-gnu/libc.so.6(+0x46210) [0x7fbfdc65f210]
        ./perf(+0x342e74) [0x5562d6faae74]
        ./perf(+0x257e39) [0x5562d6ebfe39]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7fbfdc990609]
        /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fbfdc73b103]
        Segmentation fault (core dumped)
      
      To fix this issue, either:
      
      1. Install the missing libraries so that HAVE_LIBBPF_SUPPORT gets
         defined, e.g. apt-get install libelf-dev and other related
         libraries; or
      
      2. Use this patch to skip the side-band event setup if
         HAVE_LIBBPF_SUPPORT is not set.
      
      Committer notes:
      
      The side band thread is not used just with BPF, it is also used with
      --switch-output-event, so narrow the ifdef to the BPF specific part.
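      The shape of the bug and of the fix can be shown in a standalone C
      sketch (names echo the commit description; this is an illustration
      compiled without HAVE_LIBBPF_SUPPORT, not the perf source):

      ```c
      #include <assert.h>
      #include <stddef.h>

      struct evsel { int dummy; };
      static struct evsel *bpf_sb_evsel; /* side-band evsel, may stay NULL */
      static int sb_thread_started;

      #ifdef HAVE_LIBBPF_SUPPORT
      static int add_bpf_sb_event(void)
      {
          static struct evsel e;
          bpf_sb_evsel = &e;
          return 0;
      }
      #else
      /* The inline stub returns 0 ("success") but creates no evsel: the root
       * cause, since callers took 0 to mean a valid evsel exists. */
      static int add_bpf_sb_event(void) { return 0; }
      #endif

      static void start_sb_thread(void)
      {
          /* The fix: skip the BPF part when no evsel was created; the
           * --switch-output-event side-band use stays outside the ifdef. */
          if (bpf_sb_evsel)
              sb_thread_started = 1;
      }

      int main(void)
      {
          add_bpf_sb_event();
          start_sb_thread(); /* no NULL dereference either way */
          assert(sb_thread_started == (bpf_sb_evsel != NULL));
          return 0;
      }
      ```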
      
      Fixes: 23cbb41c ("perf record: Move side band evlist setup to separate routine")
      Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jin Yao <yao.jin@intel.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/20200805022937.29184-1-yao.jin@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf tools powerpc: Add support for extended regs in power10 · 66655986
      Committed by Athira Rajeev
      Add support for the registers new in power10 (MMCR3, SIER2, SIER3) to
      sample_reg_mask on the tool side, for use with the `-I?` option. Also
      add a PVR check to send the extended mask for power10 to the kernel
      while capturing extended regs in each sample.
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
      Reviewed-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf tools powerpc: Add support for extended register capability · 33583e69
      Committed by Anju T Sudhakar
      Add extended regs to sample_reg_mask on the tool side for use with the
      `-I?` option. The perf tools side uses the extended mask to display the
      platform-supported register names (with the -I? option) to the user
      and also sends this mask to the kernel to capture the extended
      registers in each sample. Hence, decide the mask value based on the
      processor version.
      
      Currently definitions for `mfspr`, `SPRN_PVR` are part of
      `arch/powerpc/util/header.c`. Move this to a header file so that these
      definitions can be re-used in other source files as well.
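      The run-time selection described above can be sketched in C; the PVR
      field layout follows the usual powerpc convention (version in the top
      16 bits), while the version numbers and mask bits below are
      illustrative assumptions, not the kernel's actual values:

      ```c
      #include <assert.h>
      #include <stdint.h>

      #define PVR_VER(pvr)  ((pvr) >> 16) /* processor version field */

      /* Illustrative values -- check the real powerpc headers. */
      #define PVR_POWER9    0x004E
      #define PVR_POWER10   0x0080
      #define MASK_ISA_300  (1ULL << 0)   /* stand-in extended-reg masks */
      #define MASK_ISA_31   (3ULL << 0)   /* power10 adds MMCR3/SIER2/SIER3 */

      static uint64_t extended_mask_for(uint32_t pvr)
      {
          switch (PVR_VER(pvr)) {
          case PVR_POWER10: return MASK_ISA_31;
          case PVR_POWER9:  return MASK_ISA_300;
          default:          return 0; /* no extended regs */
          }
      }

      int main(void)
      {
          assert(extended_mask_for(0x00800100u) == MASK_ISA_31);
          assert(extended_mask_for(0x004E1200u) == MASK_ISA_300);
          assert(extended_mask_for(0x003F0000u) == 0);
          return 0;
      }
      ```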
      Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
      Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
      Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Reviewed-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      [Decide extended mask at run time based on platform]
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  4. 06 Aug 2020, 21 commits
  5. 04 Aug 2020, 5 commits
  6. 31 Jul 2020, 2 commits
    • perf bench: Add benchmark of find_next_bit · 7c43b0c1
      Committed by Ian Rogers
      for_each_set_bit, or similar functions like for_each_cpu, may be hot
      within the kernel. If many bits are set, one could imagine that on
      Intel a "bt" instruction on every bit may be faster than the function
      call and word-length find_next_bit logic. Add a benchmark to measure
      this.
      
      This benchmark on AMD Rome and Intel Skylake-X shows that "bt" is not
      a good option except for very small bitmaps.
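      The two strategies being compared can be reduced to a single-word C
      sketch (the benchmark itself uses the kernel's bitmap API over
      multi-word bitmaps):

      ```c
      #include <assert.h>

      /* find_next_bit style: jump from set bit to set bit */
      static int count_skip(unsigned long w)
      {
          int n = 0;

          while (w) {
              w &= w - 1; /* clear the lowest set bit */
              n++;
          }
          return n;
      }

      /* test_bit-loop ("bt" per position) style: probe every bit */
      static int count_probe(unsigned long w, int nbits)
      {
          int n = 0;

          for (int i = 0; i < nbits; i++)
              if (w & (1UL << i))
                  n++;
          return n;
      }

      int main(void)
      {
          unsigned long w = 0xf0f0f0f0UL;

          /* Same answer; cost differs: count_skip does work per set bit,
           * count_probe per bit of the bitmap -- which is why sparse
           * bitmaps heavily favor for_each_set_bit in the measurements. */
          assert(count_skip(w) == count_probe(w, (int)(8 * sizeof(w))));
          assert(count_skip(w) == 16);
          return 0;
      }
      ```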
      
      Committer testing:
      
        # perf bench
        Usage:
        	perf bench [<common options>] <collection> <benchmark> [<options>]
      
                # List of all available benchmark collections:
      
                 sched: Scheduler and IPC benchmarks
               syscall: System call benchmarks
                   mem: Memory access benchmarks
                  numa: NUMA scheduling and MM benchmarks
                 futex: Futex stressing benchmarks
                 epoll: Epoll stressing benchmarks
             internals: Perf-internals benchmarks
                   all: All benchmarks
      
        # perf bench mem
      
                # List of available benchmarks for collection 'mem':
      
                memcpy: Benchmark for memcpy() functions
                memset: Benchmark for memset() functions
              find_bit: Benchmark for find_bit() functions
                   all: Run all memory access benchmarks
      
        # perf bench mem find_bit
        # Running 'mem/find_bit' benchmark:
        100000 operations 1 bits set of 1 bits
          Average for_each_set_bit took: 730.200 usec (+- 6.468 usec)
          Average test_bit loop took:    366.200 usec (+- 4.652 usec)
        100000 operations 1 bits set of 2 bits
          Average for_each_set_bit took: 781.000 usec (+- 24.247 usec)
          Average test_bit loop took:    550.200 usec (+- 4.152 usec)
        100000 operations 2 bits set of 2 bits
          Average for_each_set_bit took: 1113.400 usec (+- 112.340 usec)
          Average test_bit loop took:    1098.500 usec (+- 182.834 usec)
        100000 operations 1 bits set of 4 bits
          Average for_each_set_bit took: 843.800 usec (+- 8.772 usec)
          Average test_bit loop took:    948.800 usec (+- 10.278 usec)
        100000 operations 2 bits set of 4 bits
          Average for_each_set_bit took: 1185.800 usec (+- 114.345 usec)
          Average test_bit loop took:    1473.200 usec (+- 175.498 usec)
        100000 operations 4 bits set of 4 bits
          Average for_each_set_bit took: 1769.667 usec (+- 233.177 usec)
          Average test_bit loop took:    1864.933 usec (+- 187.470 usec)
        100000 operations 1 bits set of 8 bits
          Average for_each_set_bit took: 898.000 usec (+- 21.755 usec)
          Average test_bit loop took:    1768.400 usec (+- 23.672 usec)
        100000 operations 2 bits set of 8 bits
          Average for_each_set_bit took: 1244.900 usec (+- 116.396 usec)
          Average test_bit loop took:    2201.800 usec (+- 145.398 usec)
        100000 operations 4 bits set of 8 bits
          Average for_each_set_bit took: 1822.533 usec (+- 231.554 usec)
          Average test_bit loop took:    2569.467 usec (+- 168.453 usec)
        100000 operations 8 bits set of 8 bits
          Average for_each_set_bit took: 2845.100 usec (+- 441.365 usec)
          Average test_bit loop took:    3023.300 usec (+- 219.575 usec)
        100000 operations 1 bits set of 16 bits
          Average for_each_set_bit took: 923.400 usec (+- 17.560 usec)
          Average test_bit loop took:    3240.000 usec (+- 16.492 usec)
        100000 operations 2 bits set of 16 bits
          Average for_each_set_bit took: 1264.300 usec (+- 114.034 usec)
          Average test_bit loop took:    3714.400 usec (+- 158.898 usec)
        100000 operations 4 bits set of 16 bits
          Average for_each_set_bit took: 1817.867 usec (+- 222.199 usec)
          Average test_bit loop took:    4015.333 usec (+- 154.162 usec)
        100000 operations 8 bits set of 16 bits
          Average for_each_set_bit took: 2826.350 usec (+- 433.457 usec)
          Average test_bit loop took:    4460.350 usec (+- 210.762 usec)
        100000 operations 16 bits set of 16 bits
          Average for_each_set_bit took: 4615.600 usec (+- 809.350 usec)
          Average test_bit loop took:    5129.960 usec (+- 320.821 usec)
        100000 operations 1 bits set of 32 bits
          Average for_each_set_bit took: 904.400 usec (+- 14.250 usec)
          Average test_bit loop took:    6194.000 usec (+- 29.254 usec)
        100000 operations 2 bits set of 32 bits
          Average for_each_set_bit took: 1252.700 usec (+- 116.432 usec)
          Average test_bit loop took:    6652.400 usec (+- 154.352 usec)
        100000 operations 4 bits set of 32 bits
          Average for_each_set_bit took: 1824.200 usec (+- 229.133 usec)
          Average test_bit loop took:    6961.733 usec (+- 154.682 usec)
        100000 operations 8 bits set of 32 bits
          Average for_each_set_bit took: 2823.950 usec (+- 432.296 usec)
          Average test_bit loop took:    7351.900 usec (+- 193.626 usec)
        100000 operations 16 bits set of 32 bits
          Average for_each_set_bit took: 4552.560 usec (+- 785.141 usec)
          Average test_bit loop took:    7998.360 usec (+- 305.629 usec)
        100000 operations 32 bits set of 32 bits
          Average for_each_set_bit took: 7557.067 usec (+- 1407.702 usec)
          Average test_bit loop took:    9072.400 usec (+- 513.209 usec)
        100000 operations 1 bits set of 64 bits
          Average for_each_set_bit took: 896.800 usec (+- 14.389 usec)
          Average test_bit loop took:    11927.200 usec (+- 68.862 usec)
        100000 operations 2 bits set of 64 bits
          Average for_each_set_bit took: 1230.400 usec (+- 111.731 usec)
          Average test_bit loop took:    12478.600 usec (+- 189.382 usec)
        100000 operations 4 bits set of 64 bits
          Average for_each_set_bit took: 1844.733 usec (+- 244.826 usec)
          Average test_bit loop took:    12911.467 usec (+- 206.246 usec)
        100000 operations 8 bits set of 64 bits
          Average for_each_set_bit took: 2779.300 usec (+- 413.612 usec)
          Average test_bit loop took:    13372.650 usec (+- 239.623 usec)
        100000 operations 16 bits set of 64 bits
          Average for_each_set_bit took: 4423.920 usec (+- 748.240 usec)
          Average test_bit loop took:    13995.800 usec (+- 318.427 usec)
        100000 operations 32 bits set of 64 bits
          Average for_each_set_bit took: 7580.600 usec (+- 1462.407 usec)
          Average test_bit loop took:    15063.067 usec (+- 516.477 usec)
        100000 operations 64 bits set of 64 bits
          Average for_each_set_bit took: 13391.514 usec (+- 2765.371 usec)
          Average test_bit loop took:    16974.914 usec (+- 916.936 usec)
        100000 operations 1 bits set of 128 bits
          Average for_each_set_bit took: 1153.800 usec (+- 124.245 usec)
          Average test_bit loop took:    26959.000 usec (+- 714.047 usec)
        100000 operations 2 bits set of 128 bits
          Average for_each_set_bit took: 1445.200 usec (+- 113.587 usec)
          Average test_bit loop took:    25798.800 usec (+- 512.908 usec)
        100000 operations 4 bits set of 128 bits
          Average for_each_set_bit took: 1990.933 usec (+- 219.362 usec)
          Average test_bit loop took:    25589.400 usec (+- 348.288 usec)
        100000 operations 8 bits set of 128 bits
          Average for_each_set_bit took: 2963.000 usec (+- 419.487 usec)
          Average test_bit loop took:    25690.050 usec (+- 262.025 usec)
        100000 operations 16 bits set of 128 bits
          Average for_each_set_bit took: 4585.200 usec (+- 741.734 usec)
          Average test_bit loop took:    26125.040 usec (+- 274.127 usec)
        100000 operations 32 bits set of 128 bits
          Average for_each_set_bit took: 7626.200 usec (+- 1404.950 usec)
          Average test_bit loop took:    27038.867 usec (+- 442.554 usec)
        100000 operations 64 bits set of 128 bits
          Average for_each_set_bit took: 13343.371 usec (+- 2686.460 usec)
          Average test_bit loop took:    28936.543 usec (+- 883.257 usec)
        100000 operations 128 bits set of 128 bits
          Average for_each_set_bit took: 23442.950 usec (+- 4880.541 usec)
          Average test_bit loop took:    32484.125 usec (+- 1691.931 usec)
        100000 operations 1 bits set of 256 bits
          Average for_each_set_bit took: 1183.000 usec (+- 32.073 usec)
          Average test_bit loop took:    50114.600 usec (+- 198.880 usec)
        100000 operations 2 bits set of 256 bits
          Average for_each_set_bit took: 1550.000 usec (+- 124.550 usec)
          Average test_bit loop took:    50334.200 usec (+- 128.425 usec)
        100000 operations 4 bits set of 256 bits
          Average for_each_set_bit took: 2164.333 usec (+- 246.359 usec)
          Average test_bit loop took:    49959.867 usec (+- 188.035 usec)
        100000 operations 8 bits set of 256 bits
          Average for_each_set_bit took: 3211.200 usec (+- 454.829 usec)
          Average test_bit loop took:    50140.850 usec (+- 176.046 usec)
        100000 operations 16 bits set of 256 bits
          Average for_each_set_bit took: 5181.640 usec (+- 882.726 usec)
          Average test_bit loop took:    51003.160 usec (+- 419.601 usec)
        100000 operations 32 bits set of 256 bits
          Average for_each_set_bit took: 8369.333 usec (+- 1513.150 usec)
          Average test_bit loop took:    52096.700 usec (+- 573.022 usec)
        100000 operations 64 bits set of 256 bits
          Average for_each_set_bit took: 13866.857 usec (+- 2649.393 usec)
          Average test_bit loop took:    53989.600 usec (+- 938.808 usec)
        100000 operations 128 bits set of 256 bits
          Average for_each_set_bit took: 23588.350 usec (+- 4724.222 usec)
          Average test_bit loop took:    57300.625 usec (+- 1625.962 usec)
        100000 operations 256 bits set of 256 bits
          Average for_each_set_bit took: 42752.200 usec (+- 9202.084 usec)
          Average test_bit loop took:    64426.933 usec (+- 3402.326 usec)
        100000 operations 1 bits set of 512 bits
          Average for_each_set_bit took: 1632.000 usec (+- 229.954 usec)
          Average test_bit loop took:    98090.000 usec (+- 1120.435 usec)
        100000 operations 2 bits set of 512 bits
          Average for_each_set_bit took: 1937.700 usec (+- 148.902 usec)
          Average test_bit loop took:    100364.100 usec (+- 1433.219 usec)
        100000 operations 4 bits set of 512 bits
          Average for_each_set_bit took: 2528.000 usec (+- 243.654 usec)
          Average test_bit loop took:    99932.067 usec (+- 955.868 usec)
        100000 operations 8 bits set of 512 bits
          Average for_each_set_bit took: 3734.100 usec (+- 512.359 usec)
          Average test_bit loop took:    98944.750 usec (+- 812.070 usec)
        100000 operations 16 bits set of 512 bits
          Average for_each_set_bit took: 5551.400 usec (+- 846.605 usec)
          Average test_bit loop took:    98691.600 usec (+- 654.753 usec)
        100000 operations 32 bits set of 512 bits
          Average for_each_set_bit took: 8594.500 usec (+- 1446.072 usec)
          Average test_bit loop took:    99176.867 usec (+- 579.990 usec)
        100000 operations 64 bits set of 512 bits
          Average for_each_set_bit took: 13840.743 usec (+- 2527.055 usec)
          Average test_bit loop took:    100758.743 usec (+- 833.865 usec)
        100000 operations 128 bits set of 512 bits
          Average for_each_set_bit took: 23185.925 usec (+- 4532.910 usec)
          Average test_bit loop took:    103786.700 usec (+- 1475.276 usec)
        100000 operations 256 bits set of 512 bits
          Average for_each_set_bit took: 40322.400 usec (+- 8341.802 usec)
          Average test_bit loop took:    109433.378 usec (+- 2742.615 usec)
        100000 operations 512 bits set of 512 bits
          Average for_each_set_bit took: 71804.540 usec (+- 15436.546 usec)
          Average test_bit loop took:    120255.440 usec (+- 5252.777 usec)
        100000 operations 1 bits set of 1024 bits
          Average for_each_set_bit took: 1859.600 usec (+- 27.969 usec)
          Average test_bit loop took:    187676.000 usec (+- 1337.770 usec)
        100000 operations 2 bits set of 1024 bits
          Average for_each_set_bit took: 2273.600 usec (+- 139.420 usec)
          Average test_bit loop took:    188176.000 usec (+- 684.357 usec)
        100000 operations 4 bits set of 1024 bits
          Average for_each_set_bit took: 2940.400 usec (+- 268.213 usec)
          Average test_bit loop took:    189172.600 usec (+- 593.295 usec)
        100000 operations 8 bits set of 1024 bits
          Average for_each_set_bit took: 4224.200 usec (+- 547.933 usec)
          Average test_bit loop took:    190257.250 usec (+- 621.021 usec)
        100000 operations 16 bits set of 1024 bits
          Average for_each_set_bit took: 6090.560 usec (+- 877.975 usec)
          Average test_bit loop took:    190143.880 usec (+- 503.753 usec)
        100000 operations 32 bits set of 1024 bits
          Average for_each_set_bit took: 9178.800 usec (+- 1475.136 usec)
          Average test_bit loop took:    190757.100 usec (+- 494.757 usec)
        100000 operations 64 bits set of 1024 bits
          Average for_each_set_bit took: 14441.457 usec (+- 2545.497 usec)
          Average test_bit loop took:    192299.486 usec (+- 795.251 usec)
        100000 operations 128 bits set of 1024 bits
          Average for_each_set_bit took: 23623.825 usec (+- 4481.182 usec)
          Average test_bit loop took:    194885.550 usec (+- 1300.817 usec)
        100000 operations 256 bits set of 1024 bits
          Average for_each_set_bit took: 40194.956 usec (+- 8109.056 usec)
          Average test_bit loop took:    200259.311 usec (+- 2566.085 usec)
        100000 operations 512 bits set of 1024 bits
          Average for_each_set_bit took: 70983.560 usec (+- 15074.982 usec)
          Average test_bit loop took:    210527.460 usec (+- 4968.980 usec)
        100000 operations 1024 bits set of 1024 bits
          Average for_each_set_bit took: 136530.345 usec (+- 31584.400 usec)
          Average test_bit loop took:    233329.691 usec (+- 10814.036 usec)
        100000 operations 1 bits set of 2048 bits
          Average for_each_set_bit took: 3077.600 usec (+- 76.376 usec)
          Average test_bit loop took:    402154.400 usec (+- 518.571 usec)
        100000 operations 2 bits set of 2048 bits
          Average for_each_set_bit took: 3508.600 usec (+- 148.350 usec)
          Average test_bit loop took:    403814.500 usec (+- 1133.027 usec)
        100000 operations 4 bits set of 2048 bits
          Average for_each_set_bit took: 4219.333 usec (+- 285.844 usec)
          Average test_bit loop took:    404312.533 usec (+- 985.751 usec)
        100000 operations 8 bits set of 2048 bits
          Average for_each_set_bit took: 5670.550 usec (+- 615.238 usec)
          Average test_bit loop took:    405321.800 usec (+- 1038.487 usec)
        100000 operations 16 bits set of 2048 bits
          Average for_each_set_bit took: 7785.080 usec (+- 992.522 usec)
          Average test_bit loop took:    406746.160 usec (+- 1015.478 usec)
        100000 operations 32 bits set of 2048 bits
          Average for_each_set_bit took: 11163.800 usec (+- 1627.320 usec)
          Average test_bit loop took:    406124.267 usec (+- 898.785 usec)
        100000 operations 64 bits set of 2048 bits
          Average for_each_set_bit took: 16964.629 usec (+- 2806.130 usec)
          Average test_bit loop took:    406618.514 usec (+- 798.356 usec)
        100000 operations 128 bits set of 2048 bits
          Average for_each_set_bit took: 27219.625 usec (+- 4988.458 usec)
          Average test_bit loop took:    410149.325 usec (+- 1705.641 usec)
        100000 operations 256 bits set of 2048 bits
          Average for_each_set_bit took: 45138.578 usec (+- 8831.021 usec)
          Average test_bit loop took:    415462.467 usec (+- 2725.418 usec)
        100000 operations 512 bits set of 2048 bits
          Average for_each_set_bit took: 77450.540 usec (+- 15962.238 usec)
          Average test_bit loop took:    426089.180 usec (+- 5171.788 usec)
        100000 operations 1024 bits set of 2048 bits
          Average for_each_set_bit took: 138023.636 usec (+- 29826.959 usec)
          Average test_bit loop took:    446346.636 usec (+- 9904.417 usec)
        100000 operations 2048 bits set of 2048 bits
          Average for_each_set_bit took: 251072.600 usec (+- 55947.692 usec)
          Average test_bit loop took:    484855.983 usec (+- 18970.431 usec)
        #
      Signed-off-by: Ian Rogers <irogers@google.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lore.kernel.org/lkml/20200729220034.1337168-1-irogers@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      7c43b0c1
    • W
      perf tools: Fix record failure when mixed with ARM SPE event · bd3c628f
      Submitted by Wei Li
      When recording with cache-misses and an arm_spe_x event, I found that
      the record will just fail without showing any error info if I put
      cache-misses after the 'arm_spe_x' event.
      
        [root@localhost 0620]# perf record -e cache-misses \
      				-e arm_spe_0/ts_enable=1,pct_enable=1,pa_enable=1,load_filter=1,jitter=1,store_filter=1,min_latency=0/ sleep 1
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.067 MB perf.data ]
        [root@localhost 0620]#
        [root@localhost 0620]# perf record -e arm_spe_0/ts_enable=1,pct_enable=1,pa_enable=1,load_filter=1,jitter=1,store_filter=1,min_latency=0/ \
      				     -e  cache-misses sleep 1
        [root@localhost 0620]#
      
      The current code can only work if the only event to be traced is an
      'arm_spe_x' event, or if it is the last event to be specified.
      Otherwise the last event type will be checked against all the
      arm_spe_pmus[i]->types, none will match, and an out-of-bounds 'i'
      index will be used in arm_spe_recording_init().
      
      We don't currently support multiple concurrent arm_spe_x events; that
      is checked in arm_spe_recording_options(), which will show the
      relevant info. So fix this issue by adding the check and recording the
      first 'arm_spe_pmu' found.
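      The lookup pattern described above can be sketched in a few lines.
      This is a minimal standalone illustration, not the actual perf tools
      code: the struct, function name, and types here are simplified
      stand-ins. It shows why recording the first matched PMU and returning
      NULL on no match avoids the out-of-bounds index.
      
      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <stdio.h>
      
      /* Hypothetical simplified PMU descriptor (stand-in for the real one). */
      struct pmu { int type; };
      
      /* The buggy pattern left 'i' equal to nspes when the last event type
       * matched no PMU, and then read pmus[i] out of bounds. The fixed
       * pattern below records the first match and makes "no match" explicit
       * by returning NULL, which the caller must check. */
      static struct pmu *find_spe_pmu(struct pmu **pmus, size_t nspes,
                                      const int *evt_types, size_t nevts)
      {
              struct pmu *found = NULL;
              size_t e, i;
      
              for (e = 0; e < nevts; e++) {
                      for (i = 0; i < nspes; i++) {
                              if (evt_types[e] == pmus[i]->type) {
                                      if (!found)     /* keep first match only */
                                              found = pmus[i];
                                      break;
                              }
                      }
              }
              return found;   /* NULL when no event matched any SPE PMU */
      }
      
      int main(void)
      {
              struct pmu spe = { .type = 42 };
              struct pmu *pmus[] = { &spe };
              int types_spe_first[] = { 42, 7 }; /* SPE event, then cache-misses */
              int types_no_spe[]   = { 7 };      /* no SPE event at all */
      
              /* Order no longer matters: the first match is kept. */
              assert(find_spe_pmu(pmus, 1, types_spe_first, 2) == &spe);
              /* No match yields NULL instead of an out-of-bounds read. */
              assert(find_spe_pmu(pmus, 1, types_no_spe, 1) == NULL);
              printf("ok\n");
              return 0;
      }
      ```
      
      The design point is simply that the inner search must produce a
      testable "not found" result rather than letting the loop index escape
      its bounds.
      
      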
      
      Fixes: ffd3d18c ("perf tools: Add ARM Statistical Profiling Extensions (SPE) support")
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
      Tested-by: Leo Yan <leo.yan@linaro.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kim Phillips <kim.phillips@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mike Leach <mike.leach@linaro.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Suzuki Poulouse <suzuki.poulose@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lore.kernel.org/lkml/20200724071111.35593-2-liwei391@huawei.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      bd3c628f