1. 15 Oct, 2020 11 commits
    • perf c2c: Display the total numbers continuously · b596e979
      Leo Yan committed
      To view the statistics in "breakdown" mode, it's good to show the
      summary numbers for the total records, all stores and all loads first;
      the subsequent columns can then break these down into more detailed items.
      
      To achieve this, this patch displays the summary numbers for
      records/stores/loads contiguously and places them before the breakdown
      items, so that users can easily read the summarized statistics.

      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Tested-by: Joe Mario <jmario@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Link: https://lore.kernel.org/r/20201014050921.5591-2-leo.yan@linaro.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf bench: Use condition variables in numa. · f9299385
      Ian Rogers committed
      The numa benchmark currently synchronizes its threads with unbalanced
      mutexes.
      
      This synchronization causes thread sanitizer to warn of locks being
      taken twice on a thread without an unlock, as well as unlocks with no
      corresponding locks.
      
      This change replaces the synchronization with more regular condition
      variables.
      
      While this fixes one class of thread sanitizer warnings, there still
      remain warnings of data races due to threads reading and writing shared
      memory without any atomics.
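
A minimal sketch of the kind of condition-variable handshake described above. It is not the code from this patch: the `start_gate` struct, its fields, and `run_workers()` are hypothetical names chosen for illustration. Each mutex lock is paired with an unlock on the same thread, which is the property thread sanitizer checks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical shared state; names are illustrative, not taken from numa.c. */
struct start_gate {
	pthread_mutex_t lock;
	pthread_cond_t all_ready;	/* signalled when the last worker checks in */
	pthread_cond_t go;		/* broadcast when work may begin */
	int nr_workers;
	int nr_ready;
	int nr_done;
	bool started;
};

static void *worker(void *arg)
{
	struct start_gate *g = arg;

	pthread_mutex_lock(&g->lock);
	/* Tell the main thread once every worker has reached the gate. */
	if (++g->nr_ready == g->nr_workers)
		pthread_cond_signal(&g->all_ready);
	/* Loop on the predicate to tolerate spurious wakeups. */
	while (!g->started)
		pthread_cond_wait(&g->go, &g->lock);
	g->nr_done++;	/* stand-in for the real work */
	pthread_mutex_unlock(&g->lock);
	return NULL;
}

/* Start nr_workers threads, release them together, return how many ran. */
static int run_workers(int nr_workers)
{
	struct start_gate g = { .nr_workers = nr_workers };
	pthread_t tids[64];
	int i;

	assert(nr_workers > 0 && nr_workers <= 64);
	pthread_mutex_init(&g.lock, NULL);
	pthread_cond_init(&g.all_ready, NULL);
	pthread_cond_init(&g.go, NULL);

	for (i = 0; i < nr_workers; i++) {
		if (pthread_create(&tids[i], NULL, worker, &g))
			abort();
	}

	/* Wait for all workers, then open the gate for everyone at once. */
	pthread_mutex_lock(&g.lock);
	while (g.nr_ready < g.nr_workers)
		pthread_cond_wait(&g.all_ready, &g.lock);
	g.started = true;
	pthread_cond_broadcast(&g.go);
	pthread_mutex_unlock(&g.lock);

	for (i = 0; i < nr_workers; i++)
		pthread_join(tids[i], NULL);

	pthread_cond_destroy(&g.go);
	pthread_cond_destroy(&g.all_ready);
	pthread_mutex_destroy(&g.lock);
	return g.nr_done;
}
```

Calling `run_workers(4)` from a program linked with `-pthread` should return 4. All shared fields are only touched under `lock`, so this sketch sidesteps the separate data-race warnings mentioned above rather than illustrating them.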
      
      Committer testing:
      
        Basic run on a non-NUMA machine.
      
        # perf bench numa
      
                # List of available benchmarks for collection 'numa':
      
                   mem: Benchmark for NUMA workloads
                   all: Run all NUMA benchmarks
      
        # perf bench numa all
        # Running numa/mem benchmark...
      
         # Running main, "perf bench numa numa-mem"
         #
         # Running test on: Linux five 5.8.12-200.fc32.x86_64 #1 SMP Mon Sep 28 12:17:31 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
         #
      
         # Running RAM-bw-local, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp  1 --no-data_rand_walk"
                 20.076 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.073 secs average thread-runtime
                  0.190 % difference between max/avg runtime
                241.828 GB data processed, per thread
                241.828 GB data processed, total
                  0.083 nsecs/byte/thread runtime
                 12.045 GB/sec/thread speed
                 12.045 GB/sec total speed
      
         # Running RAM-bw-local-NOTHP, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp  1 --no-data_rand_walk --thp -1"
                 20.045 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.014 secs average thread-runtime
                  0.111 % difference between max/avg runtime
                234.304 GB data processed, per thread
                234.304 GB data processed, total
                  0.086 nsecs/byte/thread runtime
                 11.689 GB/sec/thread speed
                 11.689 GB/sec total speed
      
         # Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp  1 --no-data_rand_walk"
      
        Test not applicable, system has only 1 nodes.
      
         # Running RAM-bw-local-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 0x2 -s 20 -zZq --thp  1 --no-data_rand_walk"
                 20.138 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.121 secs average thread-runtime
                  0.342 % difference between max/avg runtime
                135.961 GB data processed, per thread
                271.922 GB data processed, total
                  0.148 nsecs/byte/thread runtime
                  6.752 GB/sec/thread speed
                 13.503 GB/sec total speed
      
         # Running RAM-bw-remote-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 1x2 -s 20 -zZq --thp  1 --no-data_rand_walk"
      
        Test not applicable, system has only 1 nodes.
      
         # Running RAM-bw-cross, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,8 -M 1,0 -s 20 -zZq --thp  1 --no-data_rand_walk"
      
        Test not applicable, system has only 1 nodes.
      
         # Running  1x3-convergence, "perf bench numa mem -p 1 -t 3 -P 512 -s 100 -zZ0qcm --thp  1"
                  0.747 secs latency to NUMA-converge
                  0.747 secs slowest (max) thread-runtime
                  0.000 secs fastest (min) thread-runtime
                  0.714 secs average thread-runtime
                 50.000 % difference between max/avg runtime
                  3.228 GB data processed, per thread
                  9.683 GB data processed, total
                  0.231 nsecs/byte/thread runtime
                  4.321 GB/sec/thread speed
                 12.964 GB/sec total speed
      
         # Running  1x4-convergence, "perf bench numa mem -p 1 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
                  1.127 secs latency to NUMA-converge
                  1.127 secs slowest (max) thread-runtime
                  1.000 secs fastest (min) thread-runtime
                  1.089 secs average thread-runtime
                  5.624 % difference between max/avg runtime
                  3.765 GB data processed, per thread
                 15.062 GB data processed, total
                  0.299 nsecs/byte/thread runtime
                  3.342 GB/sec/thread speed
                 13.368 GB/sec total speed
      
         # Running  1x6-convergence, "perf bench numa mem -p 1 -t 6 -P 1020 -s 100 -zZ0qcm --thp  1"
                  1.003 secs latency to NUMA-converge
                  1.003 secs slowest (max) thread-runtime
                  0.000 secs fastest (min) thread-runtime
                  0.889 secs average thread-runtime
                 50.000 % difference between max/avg runtime
                  2.141 GB data processed, per thread
                 12.847 GB data processed, total
                  0.469 nsecs/byte/thread runtime
                  2.134 GB/sec/thread speed
                 12.805 GB/sec total speed
      
         # Running  2x3-convergence, "perf bench numa mem -p 2 -t 3 -P 1020 -s 100 -zZ0qcm --thp  1"
                  1.814 secs latency to NUMA-converge
                  1.814 secs slowest (max) thread-runtime
                  1.000 secs fastest (min) thread-runtime
                  1.716 secs average thread-runtime
                 22.440 % difference between max/avg runtime
                  3.747 GB data processed, per thread
                 22.483 GB data processed, total
                  0.484 nsecs/byte/thread runtime
                  2.065 GB/sec/thread speed
                 12.393 GB/sec total speed
      
         # Running  3x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp  1"
                  2.065 secs latency to NUMA-converge
                  2.065 secs slowest (max) thread-runtime
                  1.000 secs fastest (min) thread-runtime
                  1.947 secs average thread-runtime
                 25.788 % difference between max/avg runtime
                  2.855 GB data processed, per thread
                 25.694 GB data processed, total
                  0.723 nsecs/byte/thread runtime
                  1.382 GB/sec/thread speed
                 12.442 GB/sec total speed
      
         # Running  4x4-convergence, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
                  1.912 secs latency to NUMA-converge
                  1.912 secs slowest (max) thread-runtime
                  1.000 secs fastest (min) thread-runtime
                  1.775 secs average thread-runtime
                 23.852 % difference between max/avg runtime
                  1.479 GB data processed, per thread
                 23.668 GB data processed, total
                  1.293 nsecs/byte/thread runtime
                  0.774 GB/sec/thread speed
                 12.378 GB/sec total speed
      
         # Running  4x4-convergence-NOTHP, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp  1 --thp -1"
                  1.783 secs latency to NUMA-converge
                  1.783 secs slowest (max) thread-runtime
                  1.000 secs fastest (min) thread-runtime
                  1.633 secs average thread-runtime
                 21.960 % difference between max/avg runtime
                  1.345 GB data processed, per thread
                 21.517 GB data processed, total
                  1.326 nsecs/byte/thread runtime
                  0.754 GB/sec/thread speed
                 12.067 GB/sec total speed
      
         # Running  4x6-convergence, "perf bench numa mem -p 4 -t 6 -P 1020 -s 100 -zZ0qcm --thp  1"
                  5.396 secs latency to NUMA-converge
                  5.396 secs slowest (max) thread-runtime
                  4.000 secs fastest (min) thread-runtime
                  4.928 secs average thread-runtime
                 12.937 % difference between max/avg runtime
                  2.721 GB data processed, per thread
                 65.306 GB data processed, total
                  1.983 nsecs/byte/thread runtime
                  0.504 GB/sec/thread speed
                 12.102 GB/sec total speed
      
         # Running  4x8-convergence, "perf bench numa mem -p 4 -t 8 -P 512 -s 100 -zZ0qcm --thp  1"
                  3.121 secs latency to NUMA-converge
                  3.121 secs slowest (max) thread-runtime
                  2.000 secs fastest (min) thread-runtime
                  2.836 secs average thread-runtime
                 17.962 % difference between max/avg runtime
                  1.194 GB data processed, per thread
                 38.192 GB data processed, total
                  2.615 nsecs/byte/thread runtime
                  0.382 GB/sec/thread speed
                 12.236 GB/sec total speed
      
         # Running  8x4-convergence, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
                  4.302 secs latency to NUMA-converge
                  4.302 secs slowest (max) thread-runtime
                  3.000 secs fastest (min) thread-runtime
                  4.045 secs average thread-runtime
                 15.133 % difference between max/avg runtime
                  1.631 GB data processed, per thread
                 52.178 GB data processed, total
                  2.638 nsecs/byte/thread runtime
                  0.379 GB/sec/thread speed
                 12.128 GB/sec total speed
      
         # Running  8x4-convergence-NOTHP, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp  1 --thp -1"
                  4.418 secs latency to NUMA-converge
                  4.418 secs slowest (max) thread-runtime
                  3.000 secs fastest (min) thread-runtime
                  4.104 secs average thread-runtime
                 16.045 % difference between max/avg runtime
                  1.664 GB data processed, per thread
                 53.254 GB data processed, total
                  2.655 nsecs/byte/thread runtime
                  0.377 GB/sec/thread speed
                 12.055 GB/sec total speed
      
         # Running  3x1-convergence, "perf bench numa mem -p 3 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
                  0.973 secs latency to NUMA-converge
                  0.973 secs slowest (max) thread-runtime
                  0.000 secs fastest (min) thread-runtime
                  0.955 secs average thread-runtime
                 50.000 % difference between max/avg runtime
                  4.124 GB data processed, per thread
                 12.372 GB data processed, total
                  0.236 nsecs/byte/thread runtime
                  4.238 GB/sec/thread speed
                 12.715 GB/sec total speed
      
         # Running  4x1-convergence, "perf bench numa mem -p 4 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
                  0.820 secs latency to NUMA-converge
                  0.820 secs slowest (max) thread-runtime
                  0.000 secs fastest (min) thread-runtime
                  0.808 secs average thread-runtime
                 50.000 % difference between max/avg runtime
                  2.555 GB data processed, per thread
                 10.220 GB data processed, total
                  0.321 nsecs/byte/thread runtime
                  3.117 GB/sec/thread speed
                 12.468 GB/sec total speed
      
         # Running  8x1-convergence, "perf bench numa mem -p 8 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
                  0.667 secs latency to NUMA-converge
                  0.667 secs slowest (max) thread-runtime
                  0.000 secs fastest (min) thread-runtime
                  0.607 secs average thread-runtime
                 50.000 % difference between max/avg runtime
                  1.009 GB data processed, per thread
                  8.069 GB data processed, total
                  0.661 nsecs/byte/thread runtime
                  1.512 GB/sec/thread speed
                 12.095 GB/sec total speed
      
         # Running 16x1-convergence, "perf bench numa mem -p 16 -t 1 -P 256 -s 100 -zZ0qcm --thp  1"
                  1.546 secs latency to NUMA-converge
                  1.546 secs slowest (max) thread-runtime
                  1.000 secs fastest (min) thread-runtime
                  1.485 secs average thread-runtime
                 17.664 % difference between max/avg runtime
                  1.162 GB data processed, per thread
                 18.594 GB data processed, total
                  1.331 nsecs/byte/thread runtime
                  0.752 GB/sec/thread speed
                 12.025 GB/sec total speed
      
         # Running 32x1-convergence, "perf bench numa mem -p 32 -t 1 -P 128 -s 100 -zZ0qcm --thp  1"
                  0.812 secs latency to NUMA-converge
                  0.812 secs slowest (max) thread-runtime
                  0.000 secs fastest (min) thread-runtime
                  0.739 secs average thread-runtime
                 50.000 % difference between max/avg runtime
                  0.309 GB data processed, per thread
                  9.874 GB data processed, total
                  2.630 nsecs/byte/thread runtime
                  0.380 GB/sec/thread speed
                 12.166 GB/sec total speed
      
         # Running  2x1-bw-process, "perf bench numa mem -p 2 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
                 20.044 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.020 secs average thread-runtime
                  0.109 % difference between max/avg runtime
                125.750 GB data processed, per thread
                251.501 GB data processed, total
                  0.159 nsecs/byte/thread runtime
                  6.274 GB/sec/thread speed
                 12.548 GB/sec total speed
      
         # Running  3x1-bw-process, "perf bench numa mem -p 3 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
                 20.148 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.090 secs average thread-runtime
                  0.367 % difference between max/avg runtime
                 85.267 GB data processed, per thread
                255.800 GB data processed, total
                  0.236 nsecs/byte/thread runtime
                  4.232 GB/sec/thread speed
                 12.696 GB/sec total speed
      
         # Running  4x1-bw-process, "perf bench numa mem -p 4 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
                 20.169 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.100 secs average thread-runtime
                  0.419 % difference between max/avg runtime
                 63.144 GB data processed, per thread
                252.576 GB data processed, total
                  0.319 nsecs/byte/thread runtime
                  3.131 GB/sec/thread speed
                 12.523 GB/sec total speed
      
         # Running  8x1-bw-process, "perf bench numa mem -p 8 -t 1 -P  512 -s 20 -zZ0q --thp  1"
                 20.175 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.107 secs average thread-runtime
                  0.433 % difference between max/avg runtime
                 31.267 GB data processed, per thread
                250.133 GB data processed, total
                  0.645 nsecs/byte/thread runtime
                  1.550 GB/sec/thread speed
                 12.398 GB/sec total speed
      
         # Running  8x1-bw-process-NOTHP, "perf bench numa mem -p 8 -t 1 -P  512 -s 20 -zZ0q --thp  1 --thp -1"
                 20.216 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.113 secs average thread-runtime
                  0.535 % difference between max/avg runtime
                 30.998 GB data processed, per thread
                247.981 GB data processed, total
                  0.652 nsecs/byte/thread runtime
                  1.533 GB/sec/thread speed
                 12.266 GB/sec total speed
      
         # Running 16x1-bw-process, "perf bench numa mem -p 16 -t 1 -P 256 -s 20 -zZ0q --thp  1"
                 20.234 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.174 secs average thread-runtime
                  0.577 % difference between max/avg runtime
                 15.377 GB data processed, per thread
                246.039 GB data processed, total
                  1.316 nsecs/byte/thread runtime
                  0.760 GB/sec/thread speed
                 12.160 GB/sec total speed
      
         # Running  1x4-bw-thread, "perf bench numa mem -p 1 -t 4 -T 256 -s 20 -zZ0q --thp  1"
                 20.040 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.028 secs average thread-runtime
                  0.099 % difference between max/avg runtime
                 66.832 GB data processed, per thread
                267.328 GB data processed, total
                  0.300 nsecs/byte/thread runtime
                  3.335 GB/sec/thread speed
                 13.340 GB/sec total speed
      
         # Running  1x8-bw-thread, "perf bench numa mem -p 1 -t 8 -T 256 -s 20 -zZ0q --thp  1"
                 20.064 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.034 secs average thread-runtime
                  0.160 % difference between max/avg runtime
                 32.911 GB data processed, per thread
                263.286 GB data processed, total
                  0.610 nsecs/byte/thread runtime
                  1.640 GB/sec/thread speed
                 13.122 GB/sec total speed
      
         # Running 1x16-bw-thread, "perf bench numa mem -p 1 -t 16 -T 128 -s 20 -zZ0q --thp  1"
                 20.092 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.052 secs average thread-runtime
                  0.230 % difference between max/avg runtime
                 16.131 GB data processed, per thread
                258.088 GB data processed, total
                  1.246 nsecs/byte/thread runtime
                  0.803 GB/sec/thread speed
                 12.845 GB/sec total speed
      
         # Running 1x32-bw-thread, "perf bench numa mem -p 1 -t 32 -T 64 -s 20 -zZ0q --thp  1"
                 20.099 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.063 secs average thread-runtime
                  0.247 % difference between max/avg runtime
                  7.962 GB data processed, per thread
                254.773 GB data processed, total
                  2.525 nsecs/byte/thread runtime
                  0.396 GB/sec/thread speed
                 12.676 GB/sec total speed
      
         # Running  2x3-bw-process, "perf bench numa mem -p 2 -t 3 -P 512 -s 20 -zZ0q --thp  1"
                 20.150 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.120 secs average thread-runtime
                  0.372 % difference between max/avg runtime
                 44.827 GB data processed, per thread
                268.960 GB data processed, total
                  0.450 nsecs/byte/thread runtime
                  2.225 GB/sec/thread speed
                 13.348 GB/sec total speed
      
         # Running  4x4-bw-process, "perf bench numa mem -p 4 -t 4 -P 512 -s 20 -zZ0q --thp  1"
                 20.258 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.168 secs average thread-runtime
                  0.636 % difference between max/avg runtime
                 17.079 GB data processed, per thread
                273.263 GB data processed, total
                  1.186 nsecs/byte/thread runtime
                  0.843 GB/sec/thread speed
                 13.489 GB/sec total speed
      
         # Running  4x6-bw-process, "perf bench numa mem -p 4 -t 6 -P 512 -s 20 -zZ0q --thp  1"
                 20.559 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.382 secs average thread-runtime
                  1.359 % difference between max/avg runtime
                 10.758 GB data processed, per thread
                258.201 GB data processed, total
                  1.911 nsecs/byte/thread runtime
                  0.523 GB/sec/thread speed
                 12.559 GB/sec total speed
      
         # Running  4x8-bw-process, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp  1"
                 20.744 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.516 secs average thread-runtime
                  1.792 % difference between max/avg runtime
                  8.069 GB data processed, per thread
                258.201 GB data processed, total
                  2.571 nsecs/byte/thread runtime
                  0.389 GB/sec/thread speed
                 12.447 GB/sec total speed
      
         # Running  4x8-bw-process-NOTHP, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp  1 --thp -1"
                 20.855 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.561 secs average thread-runtime
                  2.050 % difference between max/avg runtime
                  8.069 GB data processed, per thread
                258.201 GB data processed, total
                  2.585 nsecs/byte/thread runtime
                  0.387 GB/sec/thread speed
                 12.381 GB/sec total speed
      
         # Running  3x3-bw-process, "perf bench numa mem -p 3 -t 3 -P 512 -s 20 -zZ0q --thp  1"
                 20.134 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.077 secs average thread-runtime
                  0.333 % difference between max/avg runtime
                 28.091 GB data processed, per thread
                252.822 GB data processed, total
                  0.717 nsecs/byte/thread runtime
                  1.395 GB/sec/thread speed
                 12.557 GB/sec total speed
      
         # Running  5x5-bw-process, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp  1"
                 20.588 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.375 secs average thread-runtime
                  1.427 % difference between max/avg runtime
                 10.177 GB data processed, per thread
                254.436 GB data processed, total
                  2.023 nsecs/byte/thread runtime
                  0.494 GB/sec/thread speed
                 12.359 GB/sec total speed
      
         # Running 2x16-bw-process, "perf bench numa mem -p 2 -t 16 -P 512 -s 20 -zZ0q --thp  1"
                 20.657 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.429 secs average thread-runtime
                  1.589 % difference between max/avg runtime
                  8.170 GB data processed, per thread
                261.429 GB data processed, total
                  2.528 nsecs/byte/thread runtime
                  0.395 GB/sec/thread speed
                 12.656 GB/sec total speed
      
         # Running 1x32-bw-process, "perf bench numa mem -p 1 -t 32 -P 2048 -s 20 -zZ0q --thp  1"
                 22.981 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 21.996 secs average thread-runtime
                  6.486 % difference between max/avg runtime
                  8.863 GB data processed, per thread
                283.606 GB data processed, total
                  2.593 nsecs/byte/thread runtime
                  0.386 GB/sec/thread speed
                 12.341 GB/sec total speed
      
         # Running numa02-bw, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp  1"
                 20.047 secs slowest (max) thread-runtime
                 19.000 secs fastest (min) thread-runtime
                 20.026 secs average thread-runtime
                  2.611 % difference between max/avg runtime
                  8.441 GB data processed, per thread
                270.111 GB data processed, total
                  2.375 nsecs/byte/thread runtime
                  0.421 GB/sec/thread speed
                 13.474 GB/sec total speed
      
         # Running numa02-bw-NOTHP, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp  1 --thp -1"
                 20.088 secs slowest (max) thread-runtime
                 19.000 secs fastest (min) thread-runtime
                 20.025 secs average thread-runtime
                  2.709 % difference between max/avg runtime
                  8.411 GB data processed, per thread
                269.142 GB data processed, total
                  2.388 nsecs/byte/thread runtime
                  0.419 GB/sec/thread speed
                 13.398 GB/sec total speed
      
         # Running numa01-bw-thread, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp  1"
                 20.293 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.175 secs average thread-runtime
                  0.721 % difference between max/avg runtime
                  7.918 GB data processed, per thread
                253.374 GB data processed, total
                  2.563 nsecs/byte/thread runtime
                  0.390 GB/sec/thread speed
                 12.486 GB/sec total speed
      
         # Running numa01-bw-thread-NOTHP, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp  1 --thp -1"
                 20.411 secs slowest (max) thread-runtime
                 20.000 secs fastest (min) thread-runtime
                 20.226 secs average thread-runtime
                  1.006 % difference between max/avg runtime
                  7.931 GB data processed, per thread
                253.778 GB data processed, total
                  2.574 nsecs/byte/thread runtime
                  0.389 GB/sec/thread speed
                 12.434 GB/sec total speed
      
        #
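
The derived figures in each block above can be cross-checked from the first lines of that block. The sketch below is inferred from the printed numbers, not from the benchmark source. It assumes decimal gigabytes (1e9 bytes) and that speeds are computed against the slowest thread runtime. The helper names are hypothetical.

```c
#include <assert.h>

#define NSEC_PER_SEC_D	1e9
#define BYTES_PER_GB_D	1e9	/* assumption: decimal GB, inferred from the output */

/* Throughput in GB/sec for a given amount of data and runtime. */
static double gb_per_sec(double gb, double secs)
{
	assert(secs > 0.0);
	return gb / secs;
}

/* Nanoseconds spent per byte for one thread's share of the data. */
static double nsecs_per_byte(double gb_per_thread, double secs)
{
	assert(gb_per_thread > 0.0);
	return (secs * NSEC_PER_SEC_D) / (gb_per_thread * BYTES_PER_GB_D);
}
```

For the RAM-bw-local run, `gb_per_sec(241.828, 20.076)` and `nsecs_per_byte(241.828, 20.076)` land within about 0.001 of the printed 12.045 GB/sec and 0.083 nsecs/byte. The small residual is consistent with the printed inputs themselves being rounded.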

      Signed-off-by: Ian Rogers <irogers@google.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Link: https://lore.kernel.org/r/20201012161611.366482-1-irogers@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf jevents: Fix event code for events referencing std arch events · caf7f968
      John Garry committed
      The event code for events referencing std arch events is incorrectly
      evaluated in json_events().
      
      The issue is that je.event is evaluated properly from try_fixup(), but
      later NULLified from the real_event() call, as "event" may be NULL.
      
      Fix by setting "event" to the same value as je.event in try_fixup().
      
      Also remove support for overwriting event code for events using std arch
      events, as it is not used.

      Signed-off-by: John Garry <john.garry@huawei.com>
      Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Link: https://lore.kernel.org/r/1602170368-11892-1-git-send-email-john.garry@huawei.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf diff: Support hot streams comparison · 2a09a84c
      Jin Yao committed
      This patch adds the "--stream" option to perf diff.
      
      "--stream": Enable hot streams comparison
      
      For example:
      
      perf record -b ...      Generate perf.data.old with branch data
      perf record -b ...      Generate perf.data with branch data
      perf diff --stream
      
      [ Matched hot streams ]
      
      hot chain pair 1:
                  cycles: 1, hits: 27.77%                  cycles: 1, hits: 9.24%
              ---------------------------              --------------------------
                            main div.c:39                           main div.c:39
                            main div.c:44                           main div.c:44
      
      hot chain pair 2:
                 cycles: 34, hits: 20.06%                cycles: 27, hits: 16.98%
              ---------------------------              --------------------------
                __random_r random_r.c:360               __random_r random_r.c:360
                __random_r random_r.c:388               __random_r random_r.c:388
                __random_r random_r.c:388               __random_r random_r.c:388
                __random_r random_r.c:380               __random_r random_r.c:380
                __random_r random_r.c:357               __random_r random_r.c:357
                    __random random.c:293                   __random random.c:293
                    __random random.c:293                   __random random.c:293
                    __random random.c:291                   __random random.c:291
                    __random random.c:291                   __random random.c:291
                    __random random.c:291                   __random random.c:291
                    __random random.c:288                   __random random.c:288
                           rand rand.c:27                          rand rand.c:27
                           rand rand.c:26                          rand rand.c:26
                                 rand@plt                                rand@plt
                                 rand@plt                                rand@plt
                    compute_flag div.c:25                   compute_flag div.c:25
                    compute_flag div.c:22                   compute_flag div.c:22
                            main div.c:40                           main div.c:40
                            main div.c:40                           main div.c:40
                            main div.c:39                           main div.c:39
      
      hot chain pair 3:
                   cycles: 9, hits: 4.48%                  cycles: 6, hits: 4.51%
              ---------------------------              --------------------------
                __random_r random_r.c:360               __random_r random_r.c:360
                __random_r random_r.c:388               __random_r random_r.c:388
                __random_r random_r.c:388               __random_r random_r.c:388
                __random_r random_r.c:380               __random_r random_r.c:380
      
      [ Hot streams in old perf data only ]
      
      hot chain 1:
                  cycles: 18, hits: 6.75%
               --------------------------
                __random_r random_r.c:360
                __random_r random_r.c:388
                __random_r random_r.c:388
                __random_r random_r.c:380
                __random_r random_r.c:357
                    __random random.c:293
                    __random random.c:293
                    __random random.c:291
                    __random random.c:291
                    __random random.c:291
                    __random random.c:288
                           rand rand.c:27
                           rand rand.c:26
                                 rand@plt
                                 rand@plt
                    compute_flag div.c:25
                    compute_flag div.c:22
                            main div.c:40
      
      hot chain 2:
                  cycles: 29, hits: 2.78%
               --------------------------
                    compute_flag div.c:22
                            main div.c:40
                            main div.c:40
                            main div.c:39
      
      [ Hot streams in new perf data only ]
      
      hot chain 1:
                                                           cycles: 4, hits: 4.54%
                                                       --------------------------
                                                                    main div.c:42
                                                            compute_flag div.c:28
      
      hot chain 2:
                                                           cycles: 5, hits: 3.51%
                                                       --------------------------
                                                                    main div.c:39
                                                                    main div.c:44
                                                                    main div.c:42
                                                            compute_flag div.c:28
      Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/r/20201009022845.13141-8-yao.jin@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      2a09a84c
    • J
      perf streams: Report hot streams · 5bbd6bad
      Jin Yao committed
      We show the streams separately. They are divided into different sections.
      
      1. "Matched hot streams"
      
      2. "Hot streams in old perf data only"
      
      3. "Hot streams in new perf data only".
      
      For each stream, we report the cycles and hot percent (hits%).
      
      For example,
      
           cycles: 2, hits: 4.08%
       --------------------------
                    main div.c:42
            compute_flag div.c:28
      Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/r/20201009022845.13141-7-yao.jin@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      5bbd6bad
    • J
      perf streams: Calculate the sum of total streams hits · 28904f4d
      Jin Yao committed
      We use callchain_node->hit to measure how hot one stream is.
      This patch calculates the sum of hits across all streams.
      
      Thus, in the next patch, we can use the following formula to report the
      hot percent for one stream.
      
      hot percent = callchain_node->hit / sum of total hits
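
The formula above can be illustrated with a small Python sketch. It is only a model of the arithmetic, not the C code in perf; the function name `hot_percent` and the hit counts are made up for illustration.

```python
def hot_percent(hit, total_hits):
    """Return one stream's share of all hits, as a percentage."""
    if total_hits == 0:
        # Guard added for the sketch: no samples means no hot stream.
        return 0.0
    return 100.0 * hit / total_hits


# Hypothetical hit counts for three streams (not taken from a perf run).
stream_hits = [50, 30, 20]
total_hits = sum(stream_hits)
percents = [hot_percent(h, total_hits) for h in stream_hits]

for hits, pct in zip(stream_hits, percents):
    print(f"hits: {hits}, hot percent: {pct:.2f}%")
```

Computing the total once and dividing per stream keeps the report step a single pass over the saved streams.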
      Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/r/20201009022845.13141-6-yao.jin@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      28904f4d
    • J
      perf streams: Link stream pair · fa79aa64
      Jin Yao committed
      In the previous patch, we created an evsel_streams for one event, and
      the top N hottest streams are saved in a stream array in evsel_streams.
      
      This patch compares all streams between two evsel_streams.
      
      Once two streams are fully matched, they will be linked as a pair. From
      the pair, we can know which streams are matched.
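
The linking step can be sketched in Python as follows. This is a simplified model under stated assumptions: the `Stream` class, its `chain` and `pair` attributes, and `link_pairs` are hypothetical names, and a "full match" is reduced here to equality of the callchain entries.

```python
class Stream:
    """Hypothetical stand-in for one saved hot stream."""

    def __init__(self, chain):
        self.chain = tuple(chain)  # callchain entries, e.g. "main div.c:39"
        self.pair = None           # set once a fully matching stream is found


def link_pairs(old_streams, new_streams):
    """Link each old stream to the first unpaired, fully matching new stream."""
    for a in old_streams:
        for b in new_streams:
            if b.pair is None and a.chain == b.chain:
                a.pair, b.pair = b, a
                break


old = [Stream(["main div.c:39", "main div.c:44"]),
       Stream(["compute_flag div.c:22"])]
new = [Stream(["main div.c:42"]),
       Stream(["main div.c:39", "main div.c:44"])]
link_pairs(old, new)

for s in old:
    print(s.chain, "-> paired" if s.pair else "-> old data only")
```

Streams whose `pair` stays unset are the ones that would end up in the "old data only" or "new data only" sections of a report.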
      Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/r/20201009022845.13141-5-yao.jin@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      fa79aa64
    • J
      perf streams: Compare two streams · 47ef8398
      Jin Yao committed
      A stream is the branch history aggregated from the branch records in
      perf samples. For now, only callchains are supported as streams.
      
      If the callchain entries of one stream fully match the callchain
      entries of another stream, we consider the two streams matched.
      
      For example,
      
         cycles: 1, hits: 26.80%                 cycles: 1, hits: 27.30%
         -----------------------                 -----------------------
                   main div.c:39                           main div.c:39
                   main div.c:44                           main div.c:44
      
      The two streams above are matched (we don't consider the case where the
      source code has changed).
      
      The matching logic is, compare the chain string first. If it's not
      matched, fallback to dso address comparison.
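
A minimal Python sketch of that two-step comparison follows. It is an illustration only: the `Entry` fields (`text`, `dso`, `addr`) and the function names are assumptions made for the example and do not mirror perf's C structures.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    text: str  # symbolized chain string, e.g. "main div.c:39"
    dso: str   # hypothetical: name of the object the address belongs to
    addr: int  # hypothetical: address inside that object


def entries_match(a, b):
    """Compare the chain string first, then fall back to dso and address."""
    if a.text and a.text == b.text:
        return True
    return a.dso == b.dso and a.addr == b.addr


def streams_match(x, y):
    """Two streams match only if every entry matches, position by position."""
    return len(x) == len(y) and all(entries_match(a, b) for a, b in zip(x, y))


old = [Entry("main div.c:39", "div", 0x1139), Entry("main div.c:44", "div", 0x1150)]
same_text = [Entry("main div.c:39", "div", 0x2139), Entry("main div.c:44", "div", 0x2150)]
same_addr = [Entry("", "div", 0x1139), Entry("", "div", 0x1150)]
shorter = old[:1]

print(streams_match(old, same_text))  # matched by chain string
print(streams_match(old, same_addr))  # matched by the dso/address fallback
print(streams_match(old, shorter))    # different length, not matched
```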
      Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/r/20201009022845.13141-4-yao.jin@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      47ef8398
    • J
      perf streams: Get the evsel_streams by evsel_idx · dd1d8418
      Jin Yao committed
      In the previous patch, we created the evsel_streams array.
      
      This patch returns the specified evsel_streams according to the
      evsel_idx.
      Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/r/20201009022845.13141-3-yao.jin@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      dd1d8418
    • J
      perf streams: Introduce branch history "streams" · 480accbb
      Jin Yao committed
      We define a stream as the branch history which is aggregated by the
      branch records from perf samples. For example, the callchains aggregated
      from the branch records are considered as streams.  By browsing the hot
      stream, we can understand the hot code path.
      
      For now, only callchains are supported as streams. To measure how hot a
      stream is, we use callchain_node->hit; higher is hotter.
      
      There may be many callchains sampled, so we only focus on the top N
      hottest callchains. N is a user-defined parameter or a predefined
      default value (nr_streams_max).
      
      This patch creates an evsel_streams array per event, and saves the top N
      hottest streams in a stream array.
      
      So now we can get the per-event top N hottest streams.
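
Selecting the top N hottest streams can be sketched in Python with a heap-based selection. This is a model of the idea only; the dictionary of hit counts and the name `top_n_streams` are hypothetical, and perf's C implementation may select the streams differently.

```python
import heapq


def top_n_streams(hits_by_chain, n):
    """Return the n (chain, hit) pairs with the largest hit counts, hottest first."""
    return heapq.nlargest(n, hits_by_chain.items(), key=lambda item: item[1])


# Hypothetical per-event hit counts, keyed by a callchain label.
hits_by_chain = {
    "main div.c:39 -> main div.c:44": 20,
    "compute_flag div.c:22 -> main div.c:40": 70,
    "main div.c:42 -> compute_flag div.c:28": 10,
}
hottest = top_n_streams(hits_by_chain, 2)

for chain, hit in hottest:
    print(f"{hit:4d}  {chain}")
```

`heapq.nlargest` avoids sorting every sampled callchain when only a handful of hot ones are kept.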
      Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/r/20201009022845.13141-2-yao.jin@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      480accbb
    • A
      perf intel-pt: Improve PT documentation slightly · 6556a75b
      Andi Kleen committed
      Document the higher level --insn-trace etc. perf script options.
      
      Include the howto on building xed in the manpage.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Link: http://lore.kernel.org/lkml/20201014035346.4772-1-andi@firstfloor.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      6556a75b
  2. 14 Oct, 2020 17 commits
    • A
      perf tools: Add support for exclusive groups/events · 0997a266
      Andi Kleen committed
      Peter suggested that using the exclusive mode in perf could avoid some
      problems with bad scheduling of groups. Exclusive is implemented in the
      kernel, but wasn't exposed by the perf tool, so it was hard to use
      without custom low-level API users.
      
      Add support for marking groups or events with :e for exclusive in the
      perf tool.  The implementation is basically the same as the existing
      pinned attribute.
      
      Committer testing:
      
        # perf test "parse event"
         6: Parse event definition strings                                  : Ok
        # perf test -v "parse event" |& grep :u*e
        running test 56 'instructions:uep'
        running test 57 '{cycles,cache-misses,branch-misses}:e'
        #
        #
        # grep "model name" -m1 /proc/cpuinfo
        model name	: AMD Ryzen 9 3900X 12-Core Processor
        #
        # perf stat -a -e '{cycles,cache-misses,branch-misses}:e' sleep 1
      
         Performance counter stats for 'system wide':
      
             <not counted>      cycles                                                        (0.00%)
             <not counted>      cache-misses                                                  (0.00%)
             <not counted>      branch-misses                                                 (0.00%)
      
               1.001269893 seconds time elapsed
      
        Some events weren't counted. Try disabling the NMI watchdog:
        	echo 0 > /proc/sys/kernel/nmi_watchdog
        	perf stat ...
        	echo 1 > /proc/sys/kernel/nmi_watchdog
        # echo 0 > /proc/sys/kernel/nmi_watchdog
        # perf stat -a -e '{cycles,cache-misses,branch-misses}:e' sleep 1
      
         Performance counter stats for 'system wide':
      
             1,298,663,141      cycles
                30,962,215      cache-misses
                 5,325,150      branch-misses
      
               1.001474934 seconds time elapsed
      
        #
        # The output for asking for precise events on AMD needs to improve, it
        # supposedly works only for system wide or per CPU
        #
        # perf stat -a -e '{cycles,cache-misses,branch-misses}:uep' sleep 1
        Error:
        The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cycles).
        /bin/dmesg | grep -i perf may provide additional information.
      
        # perf stat -a -e '{cycles,cache-misses,branch-misses}:ue' sleep 1
      
         Performance counter stats for 'system wide':
      
               746,363,126      cycles
                16,881,611      cache-misses
                 2,871,259      branch-misses
      
               1.001636066 seconds time elapsed
      
        #
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/20201014144255.22699-1-andi@firstfloor.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      0997a266
    • J
      perf test: Add build id shell test · 78b2c50c
      Jiri Olsa committed
      Add a test for the build id cache that adds a binary with sha1 and md5
      build ids and verifies it's added properly.
      
      The test updates build id cache with 'perf record' and 'perf buildid-cache -a'.
      
      Committer testing:
      
        # perf test "build id"
        82: build id cache operations                                       : Ok
        #
        # perf test -v "build id"
        82: build id cache operations                                       :
        --- start ---
        test child forked, pid 447218
        test binaries: /tmp/perf.ex.SHA1.B8I /tmp/perf.ex.MD5.7Nv
        Adding d1abc1eb7568358cf23c959566f23462461834d1 /tmp/perf.ex.SHA1.B8I: Ok
        build id: d1abc1eb7568358cf23c959566f23462461834d1
        link: /tmp/perf.debug.sS2/.build-id/d1/abc1eb7568358cf23c959566f23462461834d1
        file: /tmp/perf.debug.sS2/.build-id/d1/../../tmp/perf.ex.SHA1.B8I/d1abc1eb7568358cf23c959566f23462461834d1/elf
        OK for /tmp/perf.ex.SHA1.B8I
        Adding a50e350e97c43b4708d09bcd85ebfff7 /tmp/perf.ex.MD5.7Nv: Ok
        build id: a50e350e97c43b4708d09bcd85ebfff7
        link: /tmp/perf.debug.IuW/.build-id/a5/0e350e97c43b4708d09bcd85ebfff7
        file: /tmp/perf.debug.IuW/.build-id/a5/../../tmp/perf.ex.MD5.7Nv/a50e350e97c43b4708d09bcd85ebfff7/elf
        OK for /tmp/perf.ex.MD5.7Nv
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.034 MB /tmp/perf.data.xrH ]
        build id: d1abc1eb7568358cf23c959566f23462461834d1
        link: /tmp/perf.debug.eGR/.build-id/d1/abc1eb7568358cf23c959566f23462461834d1
        file: /tmp/perf.debug.eGR/.build-id/d1/../../tmp/perf.ex.SHA1.B8I/d1abc1eb7568358cf23c959566f23462461834d1/elf
        OK for /tmp/perf.ex.SHA1.B8I
        [ perf record: Woken up 2 times to write data ]
        [ perf record: Captured and wrote 0.034 MB /tmp/perf.data.cbE ]
        build id: a50e350e97c43b4708d09bcd85ebfff7
        link: /tmp/perf.debug.82t/.build-id/a5/0e350e97c43b4708d09bcd85ebfff7
        file: /tmp/perf.debug.82t/.build-id/a5/../../tmp/perf.ex.MD5.7Nv/a50e350e97c43b4708d09bcd85ebfff7/elf
        OK for /tmp/perf.ex.MD5.7Nv
        test child finished with 0
        ---- end ----
        build id cache operations: Ok
        #
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Ian Rogers <irogers@google.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Link: https://lore.kernel.org/r/20201013192441.1299447-10-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      78b2c50c
    • J
      perf tools: Align buildid list output for short build ids · e9ad9438
      Jiri Olsa committed
      With shorter md5 build ids we need to align their paths properly with
      other build ids:
      
        $ perf buildid-list
        17f4e448cc746582ea1881528deb549f7fdb3fd5 [kernel.kallsyms]
        a50e350e97c43b4708d09bcd85ebfff7         .../tools/perf/buildid-ex-md5
        1805c738c8f3ec0f47b7ea09080c28f34d18a82b /usr/lib64/ld-2.31.so
        $
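
The alignment shown above can be reproduced with a short Python sketch. It only models the formatting: the constant and function names are made up for the example, and the column width is assumed to be the 40 hex characters of a SHA-1 build id.

```python
SHA1_HEX_LEN = 40  # a 20-byte SHA-1 build id prints as 40 hex characters


def format_line(build_id_hex, path):
    """Left-justify the build id in a SHA-1 wide column, then print the path."""
    return f"{build_id_hex:<{SHA1_HEX_LEN}} {path}"


entries = [
    ("17f4e448cc746582ea1881528deb549f7fdb3fd5", "[kernel.kallsyms]"),
    ("a50e350e97c43b4708d09bcd85ebfff7", ".../tools/perf/buildid-ex-md5"),
    ("1805c738c8f3ec0f47b7ea09080c28f34d18a82b", "/usr/lib64/ld-2.31.so"),
]
lines = [format_line(build_id, path) for build_id, path in entries]

for line in lines:
    print(line)
```

The 32-character md5 id is padded with spaces, so every path starts in the same column.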
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20201013192441.1299447-9-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      e9ad9438
    • J
      perf tools: Add size to 'struct perf_record_header_build_id' · b0a323c7
      Jiri Olsa committed
      We do not store the size with build ids in perf data, but there's enough
      space to do it. Add the misc bit PERF_RECORD_MISC_BUILD_ID_SIZE to mark
      a build id event that carries the size.
      
      With this fix, a dso with an md5 build id will have correct build id
      data and will be usable for debuginfod processing if needed (coming in
      the following patches).
      
      Committer notes:
      
      Use %zu with size_t to fix this error on 32-bit arches:
      
        util/header.c: In function '__event_process_build_id':
        util/header.c:2105:3: error: format '%lu' expects argument of type 'long unsigned int', but argument 6 has type 'size_t' [-Werror=format=]
           pr_debug("build id event received for %s: %s [%lu]\n",
           ^
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20201013192441.1299447-8-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      b0a323c7
    • J
      perf tools: Pass build_id object to dso__build_id_equal() · 39be8d01
      Jiri Olsa committed
      Pass a build_id object to dso__build_id_equal(), so we can properly
      compare build ids whose size differs from that of sha1.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20201013192441.1299447-7-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      39be8d01
    • J
      perf tools: Pass build_id object to dso__set_build_id() · 8dfdf440
      Jiri Olsa committed
      Pass a build_id object to dso__set_build_id(), so it's easier
      to initialize the dso's build id object.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20201013192441.1299447-6-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      8dfdf440
    • J
      perf tools: Pass build_id object to build_id__sprintf() · bf541169
      Jiri Olsa committed
      Pass a build_id object to the build_id__sprintf() function, so it can
      operate with the proper size of the build id.
      
      This creates proper readable names for md5 build ids,
      like the following:
      
        a50e350e97c43b4708d09bcd85ebfff7
      
      instead of:
      
        a50e350e97c43b4708d09bcd85ebfff700000000
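
The difference between the two outputs can be modelled in Python. This is a sketch of the idea rather than perf's C code: `sprintf_fixed` and `sprintf_sized` are hypothetical names, and the fixed-size variant assumes a 20-byte buffer that is zero-filled past the md5 data.

```python
BUILD_ID_SIZE = 20  # bytes in a SHA-1 build id


def sprintf_fixed(data):
    """Model of the old behaviour: always format BUILD_ID_SIZE bytes."""
    return data.ljust(BUILD_ID_SIZE, b"\0")[:BUILD_ID_SIZE].hex()


def sprintf_sized(data, size):
    """Model of the new behaviour: format only the bytes the build id has."""
    return data[:size].hex()


md5_id = bytes.fromhex("a50e350e97c43b4708d09bcd85ebfff7")  # 16 bytes

print(sprintf_sized(md5_id, len(md5_id)))
print(sprintf_fixed(md5_id))
```

Carrying the size next to the bytes is what lets the formatter stop after 16 bytes instead of printing four padding bytes as zeros.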
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20201013192441.1299447-5-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      bf541169
    • J
      perf tools: Pass build id object to sysfs__read_build_id() · 3ff1b8c8
      Jiri Olsa committed
      Pass a build_id object to the sysfs__read_build_id() function, so it can
      populate the size of the build_id object.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20201013192441.1299447-4-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      3ff1b8c8
    • J
      perf tools: Pass build_id object to filename__read_build_id() · f766819c
      Jiri Olsa committed
      Pass a build_id object to the filename__read_build_id() function, so it
      can populate the size of the build_id object.
      
      Change filename__read_build_id() for both the ELF and non-ELF code paths.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20201013192441.1299447-3-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      f766819c
    • J
      perf tools: Use build_id object in dso · 0aba7f03
      Jiri Olsa committed
      Replace the build_id byte array with a struct build_id object and update
      all the code that references it.
      
      The objective is to carry the size together with the build id array, so
      it's better to keep both together.
      
      This is a preparatory change for the following patches, and there's no
      functional change.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20201013192441.1299447-2-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      0aba7f03
    • A
      perf config: Export the perf_config_from_file() function · 79bbbabd
      Arnaldo Carvalho de Melo committed
      We'll use it to ask for extra config files to be loaded, profile-like
      stuff that will be used first to make 'perf trace' mimic 'strace' output
      via a 'perf strace' command that just sets up 'perf trace' output.
      
      At some point it'll be used for regression tests, where we'll run some
      simple commands like:
      
        perf strace ls > perf-strace.output
        strace ls > strace.output
      
      And then use some diff-like tool that is aware of mutable syscall
      arguments to deal with arguments for things like mmap, which change at
      each execution, to be first ignored and then properly tracked when used
      across multiple syscalls.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      79bbbabd
    • J
      perf python: Autodetect python3 binary · 79373082
      James Clark committed
      Some distros don't come with python2 and only have python3 available.
      This causes the "'import perf' in python" self test to fail.
      
      This change adds python3 to the list of possible python versions
      that are autodetected but maintains the priorities for
      'python2' and 'python' detection. Python3 has the lowest priority.
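
The priority rule can be sketched in Python. This is an illustration, not the Makefile logic that perf uses: the function name and the injectable `which` parameter are made up for the example, and the relative order of 'python2' and 'python' is an assumption; the commit message only states that python3 comes last.

```python
import shutil

# Highest priority first. Only "python3 is lowest" comes from the commit
# message; the order of the first two entries is assumed for this sketch.
CANDIDATES = ["python2", "python", "python3"]


def autodetect_python(which=shutil.which):
    """Return the first candidate interpreter name that `which` can find."""
    for name in CANDIDATES:
        if which(name):
            return name
    return None


def fake_which(installed):
    """Build a lookup function for a pretend system with `installed` binaries."""
    return lambda name: "/usr/bin/" + name if name in installed else None


only_py3 = autodetect_python(fake_which({"python3"}))
py_and_py3 = autodetect_python(fake_which({"python", "python3"}))
nothing = autodetect_python(fake_which(set()))

print(only_py3, py_and_py3, nothing)
```

Passing the lookup function in makes the priority order easy to check without depending on what is installed on the machine.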
      
      Committer notes:
      
      On a fedora system without python2 packages the 'perf test python'
      continues to work:
      
        # python2
        bash: python2: command not found...
        Similar command is: 'python'
        # rpm -qa | grep python2
        #
      
      That "Similar command" gives the clue:
      
        # rpm -qf /usr/bin/python
        python-unversioned-command-3.8.5-5.fc32.noarch
        # rpm -ql python-unversioned-command
        /usr/bin/python
        /usr/share/man/man1/python.1.gz
        #
      
      With it in place the 'python' binary is found and perf builds the python
      binding using python3:
      
        # perf test -v python
        19: 'import perf' in python                                         :
        --- start ---
        test child forked, pid 379988
        python usage test: "echo "import sys ; sys.path.append('/tmp/build/perf/python'); import perf" | '/usr/bin/python' "
        test child finished with 0
        ---- end ----
        'import perf' in python: Ok
        #
      
      Looking at that path:
      
        # ls -la /tmp/build/perf/python
        total 1864
        drwxrwxr-x.  2 acme acme      60 Oct 13 16:20 .
        drwxrwxr-x. 18 acme acme    4420 Oct 13 16:28 ..
        -rwxrwxr-x.  1 acme acme 1907216 Oct 13 16:28 perf.cpython-38-x86_64-linux-gnu.so
        #
      
      And:
      
        # ldd ~/bin/perf | grep python
        	libpython3.8.so.1.0 => /lib64/libpython3.8.so.1.0 (0x00007f5471187000)
        #
      
      As soon as we remove it:
      
        # rpm -e python-unversioned-command-3.8.5-5.fc32.noarch
        # hash -r
        # python
        bash: python: command not found...
        Install package 'python-unversioned-command' to provide command 'python'? [N/y] n
        #
      
      And rebuilding perf now doesn't find python in the system:
      
        make: Entering directory '/home/acme/git/perf/tools/perf'
          BUILD:   Doing 'make -j24' parallel build
        <SNIP>
        Makefile.config:786: No python interpreter was found: disables Python support - please install python-devel/python-dev
        <SNIP>
      
      After this patch:
      
        $ rpm -qi python-unversioned-command
        package python-unversioned-command is not installed
        $
        $ python
        bash: python: command not found...
        Install package 'python-unversioned-command' to provide command 'python'? [N/y] ^C
        $
        $ m
        make: Entering directory '/home/acme/git/perf/tools/perf'
          BUILD:   Doing 'make -j24' parallel build
        <SNIP>
          CC       /tmp/build/perf/tests/attr.o
          CC       /tmp/build/perf/tests/python-use.o
          DESCEND  plugins
          GEN      /tmp/build/perf/python/perf.so
          INSTALL  trace_plugins
          LD       /tmp/build/perf/tests/perf-in.o
          LD       /tmp/build/perf/perf-in.o
          LINK     /tmp/build/perf/perf
        <SNIP>
        make: Leaving directory '/home/acme/git/perf/tools/perf'
        19: 'import perf' in python                                         : Ok
        $ ldd ~/bin/perf | grep python
        	libpython3.8.so.1.0 => /lib64/libpython3.8.so.1.0 (0x00007f2c8c708000)
        $ ls -la /tmp/build/perf/python
        total 1864
        drwxrwxr-x.  2 acme acme      60 Oct 13 16:20 .
        drwxrwxr-x. 18 acme acme    4420 Oct 13 16:31 ..
        -rwxrwxr-x.  1 acme acme 1907216 Oct 13 16:31 perf.cpython-38-x86_64-linux-gnu.so
        $
      Signed-off-by: James Clark <james.clark@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      LPU-Reference: 20201005080645.6588-1-james.clark@arm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      79373082
    • A
      perf tests: Show python test script in verbose mode · 0fd0f00f
      Arnaldo Carvalho de Melo committed
      To help figure out where it is getting the binding.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      0fd0f00f
    • V
      perf build: Allow nested externs to enable BUILD_BUG() usage · 6cf4ecf5
      Vasily Gorbik committed
      Currently the BUILD_BUG() macro is expanded to something like the following:
      
         do {
                 extern void __compiletime_assert_0(void)
                         __attribute__((error("BUILD_BUG failed")));
                 if (!(!(1)))
                         __compiletime_assert_0();
         } while (0);
      
      If used in a function body this obviously would produce build errors
      with -Wnested-externs and -Werror.
      
      To enable BUILD_BUG() usage in tools/arch/x86/lib/insn.c which perf
      includes in intel-pt-decoder, build perf without -Wnested-externs.
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Tested-by: Stephen Rothwell <sfr@canb.auug.org.au> # build tested
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lore.kernel.org/lkml/patch-1.thread-251403.git-2514037e9477.your-ad-here.call-01602244460-ext-7088@work.hours
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      6cf4ecf5
    • J
      perf trace: Fix off by ones in memset() after realloc() in arches using libaudit · f3013f7e
      Jiri Slaby committed
      'perf trace ls' started crashing after commit d21cb73a on
      !HAVE_SYSCALL_TABLE_SUPPORT configs (armv7l here) like this:
      
        0  strlen () at ../sysdeps/arm/armv6t2/strlen.S:126
        1  0xb6800780 in __vfprintf_internal (s=0xbeff9908, s@entry=0xbeff9900, format=0xa27160 "]: %s()", ap=..., mode_flags=<optimized out>) at vfprintf-internal.c:1688
        ...
        5  0x0056ecdc in fprintf (__fmt=0xa27160 "]: %s()", __stream=<optimized out>) at /usr/include/bits/stdio2.h:100
        6  trace__sys_exit (trace=trace@entry=0xbeffc710, evsel=evsel@entry=0xd968d0, event=<optimized out>, sample=sample@entry=0xbeffc3e8) at builtin-trace.c:2475
        7  0x00566d40 in trace__handle_event (sample=0xbeffc3e8, event=<optimized out>, trace=0xbeffc710) at builtin-trace.c:3122
        ...
        15 main (argc=2, argv=0xbefff6e8) at perf.c:538
      
      This is because the memset in trace__read_syscall_info zeroes the wrong
      memory:
      
      1) when initializing for the first time, it does not reset the last id.
      
      2) in other cases, it resets the last id of the previous buffer.
      
      ad 1) it causes the crash above, as sc->name used in the fprintf above
            contains garbage.
      
      ad 2) it sets nonexistent from true back to false for id 11 here. Not
            sure what the consequences are.
      
      So fix it by introducing a special case for the initial initialization
      and do the right +1 in both cases.
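
The index arithmetic behind the fix can be modelled in Python. This is a sketch of the idea only, not the C code in builtin-trace.c: `grow_table`, the `GARBAGE` sentinel (standing in for uninitialized memory returned by realloc()) and the sample ids are all hypothetical.

```python
GARBAGE = object()  # stands in for uninitialized memory returned by realloc()


def grow_table(table, max_id, new_id):
    """Grow `table` so that index `new_id` is valid; return (table, max_id).

    Assumes an existing table holds exactly max_id + 1 entries.
    """
    if table is not None and new_id <= max_id:
        return table, max_id
    old = table if table is not None else []
    grown = old + [GARBAGE] * (new_id + 1 - len(old))
    if table is None:
        # First initialization: clear all new_id + 1 entries from offset 0.
        start, count = 0, new_id + 1
    else:
        # Later growth: clear only the entries after the old last id.
        start, count = max_id + 1, new_id - max_id
    for i in range(start, start + count):
        grown[i] = None  # models a zeroed entry
    return grown, new_id


first, first_max = grow_table(None, -1, 3)
first[2] = "read"  # pretend entry 2 was filled in later
second, second_max = grow_table(first, first_max, 6)

print(first_max, second_max, second)
```

Clearing one entry too few leaves the last new slot as garbage (case 1 above), while starting one entry too early wipes the last slot of the previous buffer (case 2).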
      
      Fixes: d21cb73a ("perf trace: Grow the syscall table as needed when using libaudit")
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/20201001093419.15761-1-jslaby@suse.cz
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      f3013f7e
    • L
      perf c2c: Update usage for showing memory events · edac75a2
      Leo Yan committed
      Since commit b027cc6f ("perf c2c: Fix 'perf c2c record -e list' to
      show the default events used"), the "perf c2c" tool can show the memory
      events properly, so there is no reason to still suggest that users run
      "perf mem record -e list" to show events.
      
      This patch updates the usage text to show memory events with the command
      "perf c2c record -e list".
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Link: https://lore.kernel.org/r/20201011121022.22409-1-leo.yan@linaro.org
      edac75a2
    • A
      Merge branch 'perf/urgent' into perf/core · dbaa1b3d
      Arnaldo Carvalho de Melo committed
      To pick fixes that missed v5.9.
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      dbaa1b3d
  3. 13 Oct, 2020 9 commits
    • T
      tools lib traceevent: Hide non API functions · a41c3210
      Tzvetomir Stoyanov (VMware) committed
      There are internal library functions that are not declared static
      because they are used from different files inside the library. Hide
      them from library users, as they are not part of the API.
      These functions are made hidden and are renamed without the "tep_" prefix:
       tep_free_plugin_paths
       tep_peek_char
       tep_buffer_init
       tep_get_input_buf_ptr
       tep_get_input_buf
       tep_read_token
       tep_free_token
       tep_free_event
       tep_free_format_field
       __tep_parse_format
      
      Link: https://lore.kernel.org/linux-trace-devel/e4afdd82deb5e023d53231bb13e08dca78085fb0.camel@decadent.org.uk/
      Reported-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: linux-trace-devel@vger.kernel.org
      Link: http://lore.kernel.org/lkml/20200930110733.280534-1-tz.stoyanov@gmail.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      a41c3210
    • J
      perf sched: Show start of latency as well · dc000c45
      Joel Fernandes (Google) committed
      The 'perf sched latency' tool is really useful at showing worst-case
      latencies that a task encountered since wakeup. However, it shows only
      the end of the latency. Oftentimes the start of a latency is interesting,
      as it can show what else was going on at the time to cause the latency.
      I certainly find myself spending a lot of time backtracking to the start
      of the latency in "perf sched script".
      
      This patch therefore adds a new column "Max delay start". Considering
      this, also rename "Maximum delay at" to "Max delay end" as it's easier to
      understand.
      
      Example of the new output:
      
        ----------------------------------------------------------------------------------------------------------------------------------
         Task                  | Runtime ms  | Switches | Avg delay ms  | Max delay ms   | Max delay start         | Max delay end       |
        ----------------------------------------------------------------------------------------------------------------------------------
         MediaScannerSer:11936 |  651.296 ms |    67978 | avg: 0.113 ms | max: 77.250 ms | max start: 477.691360 s | max end: 477.768610 s
         audio@2.0-servi:(3)   |    0.000 ms |     3440 | avg: 0.034 ms | max: 72.267 ms | max start: 477.697051 s | max end: 477.769318 s
         AudioOut_1D:8112      |    0.000 ms |     2588 | avg: 0.083 ms | max: 64.020 ms | max start: 477.710740 s | max end: 477.774760 s
         Time-limited te:14973 | 7966.090 ms |    24807 | avg: 0.073 ms | max: 15.563 ms | max start: 477.162746 s | max end: 477.178309 s
         surfaceflinger:8049   |    9.680 ms |      603 | avg: 0.063 ms | max: 13.275 ms | max start: 476.931791 s | max end: 476.945067 s
         HeapTaskDaemon:(3)    | 1588.830 ms |     7040 | avg: 0.065 ms | max:  6.880 ms | max start: 473.666043 s | max end: 473.672922 s
         mount-passthrou:(3)   | 1370.809 ms |    68904 | avg: 0.011 ms | max:  6.524 ms | max start: 478.090630 s | max end: 478.097154 s
         ReferenceQueueD:(3)   |   11.794 ms |     1725 | avg: 0.014 ms | max:  6.521 ms | max start: 476.119782 s | max end: 476.126303 s
         writer:14077          |   18.410 ms |     1427 | avg: 0.036 ms | max:  6.131 ms | max start: 474.169675 s | max end: 474.175805 s
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20200925235634.4089867-1-joel@joelfernandes.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      dc000c45
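The relationship between the new columns can be checked with a small script. The following is only a sketch: the regular expression and the helper name `max_delay_window` are assumptions made for illustration and are not part of perf. It parses one row of the example output above and confirms that "Max delay end" minus "Max delay start" equals the reported "Max delay ms".

```python
import re

# Matches the three "max" columns of one 'perf sched latency' row.
# The pattern is an assumption based on the example output only.
ROW_RE = re.compile(
    r"max:\s*([\d.]+) ms \| max start:\s*([\d.]+) s \| max end:\s*([\d.]+) s"
)


def max_delay_window(row):
    """Return (start_s, end_s, max_delay_ms) parsed from one table row."""
    m = ROW_RE.search(row)
    if m is None:
        raise ValueError("not a latency row: %r" % row)
    max_ms, start_s, end_s = (float(g) for g in m.groups())
    return start_s, end_s, max_ms


# First data row of the example output in the commit message.
row = ("MediaScannerSer:11936 |  651.296 ms |    67978 | avg: 0.113 ms | "
       "max: 77.250 ms | max start: 477.691360 s | max end: 477.768610 s")

start_s, end_s, max_ms = max_delay_window(row)
window_ms = (end_s - start_s) * 1000.0  # width of the latency window in ms
print("window: %.3f ms, reported max: %.3f ms" % (window_ms, max_ms))
```

With the start timestamp available, the window can be used directly to narrow down the interesting time range when inspecting the same trace with "perf sched script".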
    • S
      perf vendor events: Fix typos in power8 PMU events · 70830f97
      Sandipan Das committed
      This replaces the incorrectly spelled word "localtion" with "location"
      in some power8 PMU event descriptions.
      
      Fixes: 2a81fa3b ("perf vendor events: Add power8 PMU events")
      Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
      Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Link: http://lore.kernel.org/lkml/20201012050205.328523-1-sandipan@linux.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      70830f97
    • N
      perf bench: Run inject-build-id with --buildid-all option too · bf7ef5dd
      Namhyung Kim committed
      For comparison, it now runs the benchmark twice: once for the regular -b
      and once for --buildid-all.
      
        $ perf bench internals inject-build-id
        # Running 'internals/inject-build-id' benchmark:
          Average build-id injection took: 21.002 msec (+- 0.172 msec)
          Average time per event: 2.059 usec (+- 0.017 usec)
          Average memory usage: 8169 KB (+- 0 KB)
          Average build-id-all injection took: 19.543 msec (+- 0.124 msec)
          Average time per event: 1.916 usec (+- 0.012 usec)
          Average memory usage: 7348 KB (+- 0 KB)
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Link: https://lore.kernel.org/r/20201012070214.2074921-7-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      bf7ef5dd
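To compare the two runs programmatically, the benchmark output can be parsed as text. This is a minimal sketch that assumes the output format shown in the commit message above; the helper name `parse_injection_times` and the regular expression are hypothetical, not part of perf.

```python
import re

# Pattern assumed from the sample output; it captures the mode name
# ("build-id" or "build-id-all") and the average time in msec.
BENCH_RE = re.compile(
    r"Average (build-id(?:-all)?) injection took: ([\d.]+) msec"
)


def parse_injection_times(output):
    """Map each injection mode to its average time in milliseconds."""
    return {m.group(1): float(m.group(2)) for m in BENCH_RE.finditer(output)}


# Output copied from the commit message above.
sample = """\
# Running 'internals/inject-build-id' benchmark:
  Average build-id injection took: 21.002 msec (+- 0.172 msec)
  Average time per event: 2.059 usec (+- 0.017 usec)
  Average memory usage: 8169 KB (+- 0 KB)
  Average build-id-all injection took: 19.543 msec (+- 0.124 msec)
  Average time per event: 1.916 usec (+- 0.012 usec)
  Average memory usage: 7348 KB (+- 0 KB)
"""

times = parse_injection_times(sample)
# Fraction of time saved by --buildid-all relative to the regular -b run.
saving = (times["build-id"] - times["build-id-all"]) / times["build-id"]
print("build-id-all saves %.1f%% of the injection time" % (saving * 100))
```

On the numbers quoted in the commit message, the --buildid-all run is roughly 7% faster than the regular run.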
    • N
      perf inject: Add --buildid-all option · 27c9c342
      Namhyung Kim committed
      Like 'perf record', we can speed up build-id processing even more by
      just using all DSOs.  Then we don't need to look at all the sample events
      anymore.  The following patch will update 'perf bench' to show the result
      of the --buildid-all option too.
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Original-patch-by: Stephane Eranian <eranian@google.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Link: https://lore.kernel.org/r/20201012070214.2074921-6-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      27c9c342
    • N
      perf inject: Do not load map/dso when injecting build-id · e7b60c5a
      Namhyung Kim committed
      No need to load symbols in a DSO when injecting build-id.  I guess the
      reason was to check whether the DSO is a special file like anon files.
      Use some helper functions in map.c to check them before reading the
      build-id.  Also pass the sample event's cpumode to the new build-id event.
      
      It brought a speedup in the benchmark of 25 -> 21 msec on my laptop.
      Also the memory usage (Max RSS) went down by ~200 KB.
      
        # Running 'internals/inject-build-id' benchmark:
          Average build-id injection took: 21.389 msec (+- 0.138 msec)
          Average time per event: 2.097 usec (+- 0.014 usec)
          Average memory usage: 8225 KB (+- 0 KB)
      
      Committer notes:
      
      Before:
      
        $ perf stat -r5 perf bench internals inject-build-id > /dev/null
      
         Performance counter stats for 'perf bench internals inject-build-id' (5 runs):
      
                  4,020.56 msec task-clock:u              #    1.271 CPUs utilized            ( +-  0.74% )
                         0      context-switches:u        #    0.000 K/sec
                         0      cpu-migrations:u          #    0.000 K/sec
                   123,354      page-faults:u             #    0.031 M/sec                    ( +-  0.81% )
             7,119,951,568      cycles:u                  #    1.771 GHz                      ( +-  1.74% )  (83.27%)
               230,086,969      stalled-cycles-frontend:u #    3.23% frontend cycles idle     ( +-  1.97% )  (83.41%)
             1,168,298,765      stalled-cycles-backend:u  #   16.41% backend cycles idle      ( +-  1.13% )  (83.44%)
            11,173,083,669      instructions:u            #    1.57  insn per cycle
                                                          #    0.10  stalled cycles per insn  ( +-  1.58% )  (83.31%)
             2,413,908,936      branches:u                #  600.392 M/sec                    ( +-  1.69% )  (83.26%)
                46,576,289      branch-misses:u           #    1.93% of all branches          ( +-  2.20% )  (83.31%)
      
                    3.1638 +- 0.0309 seconds time elapsed  ( +-  0.98% )
      
        $
      
      After:
      
        $ perf stat -r5 perf bench internals inject-build-id > /dev/null
      
         Performance counter stats for 'perf bench internals inject-build-id' (5 runs):
      
                  2,379.94 msec task-clock:u              #    1.473 CPUs utilized            ( +-  0.18% )
                         0      context-switches:u        #    0.000 K/sec
                         0      cpu-migrations:u          #    0.000 K/sec
                    62,584      page-faults:u             #    0.026 M/sec                    ( +-  0.07% )
             2,372,389,668      cycles:u                  #    0.997 GHz                      ( +-  0.29% )  (83.14%)
               106,937,862      stalled-cycles-frontend:u #    4.51% frontend cycles idle     ( +-  4.89% )  (83.20%)
               581,697,915      stalled-cycles-backend:u  #   24.52% backend cycles idle      ( +-  0.71% )  (83.47%)
             3,659,692,199      instructions:u            #    1.54  insn per cycle
                                                          #    0.16  stalled cycles per insn  ( +-  0.10% )  (83.63%)
               791,372,961      branches:u                #  332.518 M/sec                    ( +-  0.27% )  (83.39%)
                10,648,083      branch-misses:u           #    1.35% of all branches          ( +-  0.22% )  (83.16%)
      
                   1.61570 +- 0.00172 seconds time elapsed  ( +-  0.11% )
      
        $
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Original-patch-by: Stephane Eranian <eranian@google.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Link: https://lore.kernel.org/r/20201012070214.2074921-5-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      e7b60c5a
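The derived figures in the two 'perf stat' reports above can be reproduced from the raw counters. The sketch below only redoes the arithmetic on the numbers quoted in the committer notes; the dictionary layout and the helper name `insn_per_cycle` are assumptions made for illustration.

```python
def insn_per_cycle(instructions, cycles):
    """Instructions retired per CPU cycle, as shown in the 'perf stat' comment column."""
    return instructions / cycles


# Raw counters copied from the "Before" and "After" runs above.
before = {"instructions": 11_173_083_669, "cycles": 7_119_951_568,
          "elapsed_s": 3.1638}
after = {"instructions": 3_659_692_199, "cycles": 2_372_389_668,
         "elapsed_s": 1.61570}

ipc_before = insn_per_cycle(before["instructions"], before["cycles"])
ipc_after = insn_per_cycle(after["instructions"], after["cycles"])

# Ratio of wall-clock time after the patch to the time before it.
elapsed_ratio = after["elapsed_s"] / before["elapsed_s"]
# Ratio of instructions executed after the patch to those before it.
insn_ratio = after["instructions"] / before["instructions"]

print("IPC before/after: %.2f / %.2f" % (ipc_before, ipc_after))
print("elapsed time ratio: %.2f, instruction ratio: %.2f"
      % (elapsed_ratio, insn_ratio))
```

The instructions-per-cycle value barely changes between the runs, so the shorter elapsed time comes mostly from executing far fewer instructions, which is consistent with no longer loading symbols for each DSO.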
    • N
      perf inject: Enter namespace when reading build-id · 336c95b2
      Namhyung Kim committed
      It should be in the proper mnt namespace when accessing the file.
      
      I think this caused no problem so far because the build-id was actually
      read from map__load() -> dso__load() already.  But I'd like to change
      that in the following commit.
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20201012070214.2074921-4-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      336c95b2
    • N
      perf inject: Add missing callbacks in perf_tool · 2946eced
      Namhyung Kim committed
      I found that some events (like PERF_RECORD_CGROUP) are not copied by perf
      inject due to missing callbacks.  Let's add them.
      
      While at it, I've changed the order of the callbacks to match
      struct perf_tool so that we can compare them easily.
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20201012070214.2074921-3-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      2946eced
    • N
      perf bench: Add build-id injection benchmark · 0bf02a0d
      Namhyung Kim committed
      Sometimes I can see that 'perf record' piped with 'perf inject' takes a
      long time processing build-ids.
      
      So introduce an inject-build-id benchmark to the internals benchmark
      suite to measure its overhead regularly.
      
      It runs the 'perf inject' command internally and feeds the given number
      of synthesized events (MMAP2 + SAMPLE basically).
      
        Usage: perf bench internals inject-build-id <options>
      
          -i, --iterations <n>  Number of iterations used to compute average (default: 100)
          -m, --nr-mmaps <n>    Number of mmap events for each iteration (default: 100)
          -n, --nr-samples <n>  Number of sample events per mmap event (default: 100)
          -v, --verbose         be more verbose (show iteration count, DSO name, etc)
      
      By default, it measures average processing time of 100 MMAP2 events
      and 10000 SAMPLE events.  Below is a result on my laptop.
      
        $ perf bench internals inject-build-id
        # Running 'internals/inject-build-id' benchmark:
          Average build-id injection took: 25.789 msec (+- 0.202 msec)
          Average time per event: 2.528 usec (+- 0.020 usec)
          Average memory usage: 8411 KB (+- 7 KB)
      
      Committer testing:
      
        $ perf bench
        Usage:
        	perf bench [<common options>] <collection> <benchmark> [<options>]
      
                # List of all available benchmark collections:
      
                 sched: Scheduler and IPC benchmarks
               syscall: System call benchmarks
                   mem: Memory access benchmarks
                  numa: NUMA scheduling and MM benchmarks
                 futex: Futex stressing benchmarks
                 epoll: Epoll stressing benchmarks
             internals: Perf-internals benchmarks
                   all: All benchmarks
      
        $ perf bench internals
      
                # List of available benchmarks for collection 'internals':
      
            synthesize: Benchmark perf event synthesis
        kallsyms-parse: Benchmark kallsyms parsing
        inject-build-id: Benchmark build-id injection
      
        $ perf bench internals inject-build-id
        # Running 'internals/inject-build-id' benchmark:
          Average build-id injection took: 14.202 msec (+- 0.059 msec)
          Average time per event: 1.392 usec (+- 0.006 usec)
          Average memory usage: 12650 KB (+- 10 KB)
          Average build-id-all injection took: 12.831 msec (+- 0.071 msec)
          Average time per event: 1.258 usec (+- 0.007 usec)
          Average memory usage: 11895 KB (+- 10 KB)
        $
      
        $ perf stat -r5 perf bench internals inject-build-id
        # Running 'internals/inject-build-id' benchmark:
          Average build-id injection took: 14.380 msec (+- 0.056 msec)
          Average time per event: 1.410 usec (+- 0.006 usec)
          Average memory usage: 12608 KB (+- 11 KB)
          Average build-id-all injection took: 11.889 msec (+- 0.064 msec)
          Average time per event: 1.166 usec (+- 0.006 usec)
          Average memory usage: 11838 KB (+- 10 KB)
        # Running 'internals/inject-build-id' benchmark:
          Average build-id injection took: 14.246 msec (+- 0.065 msec)
          Average time per event: 1.397 usec (+- 0.006 usec)
          Average memory usage: 12744 KB (+- 10 KB)
          Average build-id-all injection took: 12.019 msec (+- 0.066 msec)
          Average time per event: 1.178 usec (+- 0.006 usec)
          Average memory usage: 11963 KB (+- 10 KB)
        # Running 'internals/inject-build-id' benchmark:
          Average build-id injection took: 14.321 msec (+- 0.067 msec)
          Average time per event: 1.404 usec (+- 0.007 usec)
          Average memory usage: 12690 KB (+- 10 KB)
          Average build-id-all injection took: 11.909 msec (+- 0.041 msec)
          Average time per event: 1.168 usec (+- 0.004 usec)
          Average memory usage: 11938 KB (+- 10 KB)
        # Running 'internals/inject-build-id' benchmark:
          Average build-id injection took: 14.287 msec (+- 0.059 msec)
          Average time per event: 1.401 usec (+- 0.006 usec)
          Average memory usage: 12864 KB (+- 10 KB)
          Average build-id-all injection took: 11.862 msec (+- 0.058 msec)
          Average time per event: 1.163 usec (+- 0.006 usec)
          Average memory usage: 12103 KB (+- 10 KB)
        # Running 'internals/inject-build-id' benchmark:
          Average build-id injection took: 14.402 msec (+- 0.053 msec)
          Average time per event: 1.412 usec (+- 0.005 usec)
          Average memory usage: 12876 KB (+- 10 KB)
          Average build-id-all injection took: 11.826 msec (+- 0.061 msec)
          Average time per event: 1.159 usec (+- 0.006 usec)
          Average memory usage: 12111 KB (+- 10 KB)
      
         Performance counter stats for 'perf bench internals inject-build-id' (5 runs):
      
                  4,267.48 msec task-clock:u              #    1.502 CPUs utilized            ( +-  0.14% )
                         0      context-switches:u        #    0.000 K/sec
                         0      cpu-migrations:u          #    0.000 K/sec
                   102,092      page-faults:u             #    0.024 M/sec                    ( +-  0.08% )
             3,894,589,578      cycles:u                  #    0.913 GHz                      ( +-  0.19% )  (83.49%)
               140,078,421      stalled-cycles-frontend:u #    3.60% frontend cycles idle     ( +-  0.77% )  (83.34%)
               948,581,189      stalled-cycles-backend:u  #   24.36% backend cycles idle      ( +-  0.46% )  (83.25%)
             5,835,587,719      instructions:u            #    1.50  insn per cycle
                                                          #    0.16  stalled cycles per insn  ( +-  0.21% )  (83.24%)
             1,267,423,636      branches:u                #  296.996 M/sec                    ( +-  0.22% )  (83.12%)
                17,484,290      branch-misses:u           #    1.38% of all branches          ( +-  0.12% )  (83.55%)
      
                   2.84176 +- 0.00222 seconds time elapsed  ( +-  0.08% )
      
        $
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20201012070214.2074921-2-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      0bf02a0d
  4. 07 Oct, 2020 1 commit
    • N
      perf stat: Fix out of bounds CPU map access when handling armv8_pmu events · bef69bd7
      Namhyung Kim committed
      It was reported that 'perf stat' crashed when used with armv8_pmu (CPU)
      events in task mode.  As 'perf stat' uses an empty cpu map for task
      mode but armv8_pmu has its own cpu mask, it got confused about which
      map to use when accessing file descriptors, which causes segfaults:
      
        (gdb) bt
        #0  0x0000000000603fc8 in perf_evsel__close_fd_cpu (evsel=<optimized out>,
            cpu=<optimized out>) at evsel.c:122
        #1  perf_evsel__close_cpu (evsel=evsel@entry=0x716e950, cpu=7) at evsel.c:156
        #2  0x00000000004d4718 in evlist__close (evlist=0x70a7cb0) at util/evlist.c:1242
        #3  0x0000000000453404 in __run_perf_stat (argc=3, argc@entry=1, argv=0x30,
            argv@entry=0xfffffaea2f90, run_idx=119, run_idx@entry=1701998435)
            at builtin-stat.c:929
        #4  0x0000000000455058 in run_perf_stat (run_idx=1701998435, argv=0xfffffaea2f90,
            argc=1) at builtin-stat.c:947
        #5  cmd_stat (argc=1, argv=0xfffffaea2f90) at builtin-stat.c:2357
        #6  0x00000000004bb888 in run_builtin (p=p@entry=0x9764b8 <commands+288>,
            argc=argc@entry=4, argv=argv@entry=0xfffffaea2f90) at perf.c:312
        #7  0x00000000004bbb54 in handle_internal_command (argc=argc@entry=4,
            argv=argv@entry=0xfffffaea2f90) at perf.c:364
        #8  0x0000000000435378 in run_argv (argcp=<synthetic pointer>,
            argv=<synthetic pointer>) at perf.c:408
        #9  main (argc=4, argv=0xfffffaea2f90) at perf.c:538
      
      To fix this, I simply used the given cpu map unless the evsel actually
      is not a system-wide event (like uncore events).
      
      Fixes: 7736627b ("perf stat: Use affinity for closing file descriptors")
      Reported-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Tested-by: Barry Song <song.bao.hua@hisilicon.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Link: http://lore.kernel.org/lkml/20201007081311.1831003-1-namhyung@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      bef69bd7
  5. 01 Oct, 2020 2 commits
    • J
      perf python scripting: Fix printable strings in python3 scripts · 6fcd5ddc
      Jiri Olsa committed
      Hagen reported broken strings in python3 tracepoint scripts:
      
        make PYTHON=python3
        perf record -e sched:sched_switch -a -- sleep 5
        perf script --gen-script py
        perf script -s ./perf-script.py
      
        [..]
        sched__sched_switch      7 563231.759525792        0 swapper   prev_comm=bytearray(b'swapper/7\x00\x00\x00\x00\x00\x00\x00'), prev_pid=0, prev_prio=120, prev_state=, next_comm=bytearray(b'mutex-thread-co\x00'),
      
      The problem is in the is_printable_array function, which does not take
      the zero byte into account and claims such strings are not printable, so
      the code creates a byte array instead of a string.
      
      Committer testing:
      
      After this fix:
      
      sched__sched_switch 3 484522.497072626  1158680 kworker/3:0-eve  prev_comm=kworker/3:0, prev_pid=1158680, prev_prio=120, prev_state=I, next_comm=swapper/3, next_pid=0, next_prio=120
      Sample: {addr=0, cpu=3, datasrc=84410401, datasrc_decode=N/A|SNP N/A|TLB N/A|LCK N/A, ip=18446744071841817196, period=1, phys_addr=0, pid=1158680, tid=1158680, time=484522497072626, transaction=0, values=[(0, 0)], weight=0}
      
      sched__sched_switch 4 484522.497085610  1225814 perf             prev_comm=perf, prev_pid=1225814, prev_prio=120, prev_state=, next_comm=migration/4, next_pid=30, next_prio=0
      Sample: {addr=0, cpu=4, datasrc=84410401, datasrc_decode=N/A|SNP N/A|TLB N/A|LCK N/A, ip=18446744071841817196, period=1, phys_addr=0, pid=1225814, tid=1225814, time=484522497085610, transaction=0, values=[(0, 0)], weight=0}
      
      Fixes: 249de6e0 ("perf script python: Fix string vs byte array resolving")
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Tested-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Petlan <mpetlan@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: http://lore.kernel.org/lkml/20200928201135.3633850-1-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      6fcd5ddc
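The behaviour described in this commit can be illustrated with a short sketch. It is not the perf implementation (which is written in C); the Python functions below, including the name `field_value` and the rule that a buffer must end in a NUL byte, are assumptions chosen to mirror the description: scanning stops at the first zero byte, so NUL padding no longer makes a field look unprintable.

```python
# ASCII whitespace bytes accepted in addition to printable characters.
WHITESPACE = b" \t\n\v\f\r"


def is_printable_array(buf):
    """Return True if buf is a NUL-terminated run of printable bytes.

    Assumption for this sketch: the buffer must end with a zero byte, and
    everything after the first zero byte is treated as padding.
    """
    if not buf or buf[-1] != 0:
        return False
    for c in buf[:-1]:
        if c == 0:
            break  # the first NUL ends the string; the rest is padding
        if not (0x20 <= c <= 0x7E) and c not in WHITESPACE:
            return False
    return True


def field_value(buf):
    """Hypothetical helper: pick a str or a bytearray for a tracepoint field."""
    if is_printable_array(buf):
        return buf.split(b"\x00", 1)[0].decode("ascii")
    return bytearray(buf)


# The padded comm field from the bug report now resolves to a plain string.
print(field_value(b"swapper/7\x00\x00\x00\x00\x00\x00\x00"))
print(field_value(b"mutex-thread-co\x00"))
```

Buffers that contain non-printable bytes before the first NUL still fall back to a byte array, which matches the intent of the original string versus byte array resolution.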
    • A
      perf trace: Use the autogenerated mmap 'prot' string/id table · 388968d8
      Arnaldo Carvalho de Melo committed
      No change in behaviour:
      
        # perf trace -e mmap sleep 1
             0.000 ( 0.009 ms): sleep/751870 mmap(len: 143317, prot: READ, flags: PRIVATE, fd: 3)                  = 0x7fa96d0f7000
             0.028 ( 0.004 ms): sleep/751870 mmap(len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS)           = 0x7fa96d0f5000
             0.037 ( 0.005 ms): sleep/751870 mmap(len: 1872744, prot: READ, flags: PRIVATE|DENYWRITE, fd: 3)       = 0x7fa96cf2b000
             0.044 ( 0.011 ms): sleep/751870 mmap(addr: 0x7fa96cf50000, len: 1376256, prot: READ|EXEC, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x25000) = 0x7fa96cf50000
             0.056 ( 0.007 ms): sleep/751870 mmap(addr: 0x7fa96d0a0000, len: 307200, prot: READ, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x175000) = 0x7fa96d0a0000
             0.064 ( 0.007 ms): sleep/751870 mmap(addr: 0x7fa96d0eb000, len: 24576, prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x1bf000) = 0x7fa96d0eb000
             0.075 ( 0.005 ms): sleep/751870 mmap(addr: 0x7fa96d0f1000, len: 13160, prot: READ|WRITE, flags: PRIVATE|FIXED|ANONYMOUS) = 0x7fa96d0f1000
             0.253 ( 0.005 ms): sleep/751870 mmap(len: 218049136, prot: READ, flags: PRIVATE, fd: 3)               = 0x7fa95ff38000
        #
        #
        # set -o vi
        # strace -e mmap sleep 1
        mmap(NULL, 143317, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f333bd83000
        mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f333bd81000
        mmap(NULL, 1872744, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f333bbb7000
        mmap(0x7f333bbdc000, 1376256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f333bbdc000
        mmap(0x7f333bd2c000, 307200, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x175000) = 0x7f333bd2c000
        mmap(0x7f333bd77000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bf000) = 0x7f333bd77000
        mmap(0x7f333bd7d000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f333bd7d000
        mmap(NULL, 218049136, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f332ebc4000
        +++ exited with 0 +++
        #
      
      You can also tweak the output of 'perf trace' to match strace's more
      closely:
      
        # perf config trace.show_arg_names=no
        # perf config trace.show_duration=no
        # perf config trace.show_prefix=yes
        # perf config trace.show_timestamp=no
        # perf config trace.show_zeros=yes
        # perf config trace.no_inherit=yes
        # perf trace -e mmap sleep 1
        mmap(NULL, 143317, PROT_READ, MAP_PRIVATE, 3, 0)                      = 0x7f0d287ca000
        mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS)     = 0x7f0d287c8000
        mmap(NULL, 1872744, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0)       = 0x7f0d285fe000
        mmap(0x7f0d28623000, 1376256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f0d28623000
        mmap(0x7f0d28773000, 307200, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x175000) = 0x7f0d28773000
        mmap(0x7f0d287be000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bf000) = 0x7f0d287be000
        mmap(0x7f0d287c4000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS) = 0x7f0d287c4000
        mmap(NULL, 218049136, PROT_READ, MAP_PRIVATE, 3, 0)                   = 0x7f0d1b60b000
        #
      
        # perf config | grep ^trace
        trace.show_arg_names=no
        trace.show_duration=no
        trace.show_prefix=yes
        trace.show_timestamp=no
        trace.show_zeros=yes
        trace.no_inherit=yes
        #
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      388968d8
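The table-driven decoding that produces strings such as "READ|WRITE" can be sketched as follows. This is an illustration only: in perf the table is autogenerated, whereas here it is hard-coded with the Linux values of PROT_READ, PROT_WRITE and PROT_EXEC, and the function name `prot_to_str` and its `show_prefix` parameter (loosely modelled on the trace.show_prefix setting shown above) are assumptions.

```python
# Hard-coded for this sketch; perf generates its table automatically.
PROT_TABLE = [
    (0x1, "READ"),   # PROT_READ
    (0x2, "WRITE"),  # PROT_WRITE
    (0x4, "EXEC"),   # PROT_EXEC
]


def prot_to_str(prot, show_prefix=False):
    """Render an mmap 'prot' bitmask as a '|'-separated list of names."""
    prefix = "PROT_" if show_prefix else ""
    if prot == 0:
        return prefix + "NONE"
    names = []
    for bit, name in PROT_TABLE:
        if prot & bit:
            names.append(prefix + name)
            prot &= ~bit  # clear the bit once it has been named
    if prot:
        # Bits without a table entry are shown as a raw hex value.
        names.append(hex(prot))
    return "|".join(names)


print(prot_to_str(0x3))                    # READ|WRITE
print(prot_to_str(0x5, show_prefix=True))  # PROT_READ|PROT_EXEC
```

Keeping the names in a table rather than in a chain of conditionals means that new flags only require a new table entry, which is what makes generating the table automatically practical.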