1. 30 April 2020, 2 commits
    • tools api: Add a lightweight buffered reading api · e95770af
      Authored by Ian Rogers
      The synthesize benchmark shows the majority of execution time going to
      fgets and sscanf, necessary to parse /proc/pid/maps. Add a new buffered
      reading library that will be used to replace these calls in a follow-up
      CL. Add tests for the library to perf test.
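
      A minimal sketch of the idea, for illustration only (the struct and
      function names here are hypothetical, not the interface actually
      added under tools/lib/api): pull bytes from a file descriptor through
      a caller-supplied buffer, so hot parsing loops avoid per-call stdio
      overhead:

        /* Hypothetical sketch of a buffered reader over a file descriptor. */
        #include <stdbool.h>
        #include <stddef.h>
        #include <unistd.h>

        struct buf_io {
                int fd;          /* file descriptor being read */
                char *buf;       /* caller-supplied buffer */
                size_t buf_len;  /* size of buf */
                char *cur, *end; /* read position and end of valid data */
                bool eof;        /* set once read() returns 0 or an error */
        };

        static void buf_io__init(struct buf_io *io, int fd, char *buf, size_t len)
        {
                io->fd = fd;
                io->buf = buf;
                io->buf_len = len;
                io->cur = io->end = buf;
                io->eof = false;
        }

        /* Next character, refilling the buffer with a single read() when
         * it runs dry; returns -1 on EOF or error. */
        static int buf_io__get_char(struct buf_io *io)
        {
                if (io->cur == io->end) {
                        ssize_t n = read(io->fd, io->buf, io->buf_len);

                        if (n <= 0) {
                                io->eof = true;
                                return -1;
                        }
                        io->cur = io->buf;
                        io->end = io->buf + n;
                }
                return *io->cur++;
        }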
      
      Committer tests:
      
        $ perf test api
        63: Test api io                                           : Ok
        $
      Signed-off-by: Ian Rogers <irogers@google.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrey Zhizhikin <andrey.z@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lore.kernel.org/lkml/20200415054050.31645-3-irogers@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf bench: Add a multi-threaded synthesize benchmark · 13edc237
      Authored by Ian Rogers
      By default this isn't run, as it reads /proc and may not have access.
      For consistency, modify the single-threaded benchmark to compute an
      average time per event.
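
      For reference, the per-iteration averaging has roughly this shape (an
      illustrative helper, not the benchmark's actual code):

        #include <stdio.h>

        /* Print averages in the format the benchmark output below uses. */
        static void report_average(double total_usec, double total_events,
                                   unsigned int iterations)
        {
                double avg_usec   = total_usec / iterations;
                double avg_events = total_events / iterations;

                printf("    Average synthesis took: %.3f usec\n", avg_usec);
                printf("    Average num. events: %.3f\n", avg_events);
                printf("    Average time per event %.3f usec\n",
                       avg_usec / avg_events);
        }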
      
      Committer testing:
      
        $ grep -m1 "model name" /proc/cpuinfo
        model name	: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
        $ grep "model name" /proc/cpuinfo  | wc -l
        8
        $
        $ perf bench internals synthesize -h
        # Running 'internals/synthesize' benchmark:
      
         Usage: perf bench internals synthesize <options>
      
            -I, --multi-iterations <n>
                                  Number of iterations used to compute multi-threaded average
            -i, --single-iterations <n>
                                  Number of iterations used to compute single-threaded average
            -M, --max-threads <n>
                                  Maximum number of threads in multithreaded bench
            -m, --min-threads <n>
                                  Minimum number of threads in multithreaded bench
            -s, --st              Run single threaded benchmark
            -t, --mt              Run multi-threaded benchmark
      
        $
        $ perf bench internals synthesize -t
        # Running 'internals/synthesize' benchmark:
        Computing performance of multi threaded perf event synthesis by
        synthesizing events on CPU 0:
          Number of synthesis threads: 1
            Average synthesis took: 65449.000 usec (+- 586.442 usec)
            Average num. events: 9405.400 (+- 0.306)
            Average time per event 6.959 usec
          Number of synthesis threads: 2
            Average synthesis took: 37838.300 usec (+- 130.259 usec)
            Average num. events: 9501.800 (+- 20.469)
            Average time per event 3.982 usec
          Number of synthesis threads: 3
            Average synthesis took: 48551.400 usec (+- 225.686 usec)
            Average num. events: 9544.000 (+- 0.000)
            Average time per event 5.087 usec
          Number of synthesis threads: 4
            Average synthesis took: 29632.500 usec (+- 50.808 usec)
            Average num. events: 9544.000 (+- 0.000)
            Average time per event 3.105 usec
          Number of synthesis threads: 5
            Average synthesis took: 33920.400 usec (+- 284.509 usec)
            Average num. events: 9544.000 (+- 0.000)
            Average time per event 3.554 usec
          Number of synthesis threads: 6
            Average synthesis took: 27604.100 usec (+- 72.344 usec)
            Average num. events: 9548.000 (+- 0.000)
            Average time per event 2.891 usec
          Number of synthesis threads: 7
            Average synthesis took: 25406.300 usec (+- 933.371 usec)
            Average num. events: 9545.500 (+- 0.167)
            Average time per event 2.662 usec
          Number of synthesis threads: 8
            Average synthesis took: 24110.400 usec (+- 73.229 usec)
            Average num. events: 9551.000 (+- 0.000)
            Average time per event 2.524 usec
        $
      Signed-off-by: Ian Rogers <irogers@google.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrey Zhizhikin <andrey.z@gmail.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lore.kernel.org/lkml/20200415054050.31645-2-irogers@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  2. 23 April 2020, 4 commits
    • perf record: Add num-synthesize-threads option · d99c22ea
      Authored by Stephane Eranian
      Add an option to control the degree of parallelism of the
      synthesize_mmap() code, which scans /proc/PID/task/PID/maps and can
      be time consuming. Mimic the way perf top handles the same option.
      If not specified, it defaults to 1 thread, i.e. the default behavior
      before this option existed.
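
      A standalone sketch of the option's semantics, using plain getopt
      rather than perf's parse-options library; the flag spelling below is
      illustrative:

        #include <getopt.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(int argc, char **argv)
        {
                /* Default of 1 preserves the pre-option behavior. */
                unsigned int nr_threads_synthesize = 1;
                static const struct option opts[] = {
                        { "num-synthesize-threads", required_argument, NULL, 'n' },
                        { NULL, 0, NULL, 0 },
                };
                int c;

                while ((c = getopt_long(argc, argv, "", opts, NULL)) != -1) {
                        if (c == 'n')
                                nr_threads_synthesize = strtoul(optarg, NULL, 0);
                }
                printf("synthesizing with %u thread(s)\n", nr_threads_synthesize);
                return 0;
        }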
      
      On a desktop computer the processing of /proc/PID/task/PID/maps isn't
      slow enough to warrant parallel processing, and thread creation has
      some cost, hence the default of 1. On a loaded server with >100 cores
      it is possible to see synthesis times on the order of seconds, and in
      that case having the option is desirable.
      
      As the processing is a synchronization point, it is legitimate to
      worry whether Amdahl's law will limit the gains from this patch.
      Profiling with this patch in place:
      https://lore.kernel.org/lkml/20200415054050.31645-4-irogers@google.com/
      shows:
      ...
            - 32.59% __perf_event__synthesize_threads
               - 32.54% __event__synthesize_thread
                  + 22.13% perf_event__synthesize_mmap_events
                  + 6.68% perf_event__get_comm_ids.constprop.0
                  + 1.49% process_synthesized_event
                  + 1.29% __GI___readdir64
                  + 0.60% __opendir
      ...
      That is, the serialized part, process_synthesized_event, is only
      1.49% of execution time, so there is plenty to make parallel. This is
      shown in the benchmark in this patch:
      
      https://lore.kernel.org/lkml/20200415054050.31645-2-irogers@google.com/
      
        Computing performance of multi threaded perf event synthesis by
        synthesizing events on CPU 0:
         Number of synthesis threads: 1
           Average synthesis took: 127729.000 usec (+- 3372.880 usec)
           Average num. events: 21548.600 (+- 0.306)
           Average time per event 5.927 usec
         Number of synthesis threads: 2
           Average synthesis took: 88863.500 usec (+- 385.168 usec)
           Average num. events: 21552.800 (+- 0.327)
           Average time per event 4.123 usec
         Number of synthesis threads: 3
           Average synthesis took: 83257.400 usec (+- 348.617 usec)
           Average num. events: 21553.200 (+- 0.327)
           Average time per event 3.863 usec
         Number of synthesis threads: 4
           Average synthesis took: 75093.000 usec (+- 422.978 usec)
           Average num. events: 21554.200 (+- 0.200)
           Average time per event 3.484 usec
         Number of synthesis threads: 5
           Average synthesis took: 64896.600 usec (+- 353.348 usec)
           Average num. events: 21558.000 (+- 0.000)
           Average time per event 3.010 usec
         Number of synthesis threads: 6
           Average synthesis took: 59210.200 usec (+- 342.890 usec)
           Average num. events: 21560.000 (+- 0.000)
           Average time per event 2.746 usec
         Number of synthesis threads: 7
           Average synthesis took: 54093.900 usec (+- 306.247 usec)
           Average num. events: 21562.000 (+- 0.000)
           Average time per event 2.509 usec
         Number of synthesis threads: 8
           Average synthesis took: 48938.700 usec (+- 341.732 usec)
           Average num. events: 21564.000 (+- 0.000)
           Average time per event 2.269 usec
      
      Average time per synthesized event goes from 5.927 usec with 1 thread
      to 2.269 usec with 8. This isn't a linear speed-up, as not all of the
      synthesis code has been made parallel. If the synthesis time were
      about 10 seconds, using 8 threads may bring it down to less than 4.
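
      As a back-of-the-envelope check (not part of the patch), inverting
      Amdahl's law S(n) = 1 / ((1 - p) + p/n) for the numbers above gives
      the parallel fraction p the measurements imply:

        #include <stdio.h>

        int main(void)
        {
                double t1 = 5.927, t8 = 2.269; /* usec/event at 1 and 8 threads */
                double s = t1 / t8;            /* observed speedup */
                /* From S(8) = s, solve for p: p = (1 - 1/s) / (1 - 1/8). */
                double p = (1.0 - 1.0 / s) / (1.0 - 1.0 / 8.0);

                /* Prints: speedup 2.61x, implied parallel fraction ~71% */
                printf("speedup %.2fx, implied parallel fraction ~%.0f%%\n",
                       s, p * 100.0);
                return 0;
        }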
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Reviewed-by: Ian Rogers <irogers@google.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tony Jones <tonyj@suse.de>
      Cc: yuzhoujian <yuzhoujian@didichuxing.com>
      Link: http://lore.kernel.org/lkml/20200422155038.9380-1-irogers@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf test session topology: Fix data path · dbd660e6
      Authored by Tommi Rantala
      Commit 2d4f2799 ("perf data: Add global path holder") missed the path
      conversion in tests/topology.c, causing the "Session topology"
      testcase to "hang" (it waits forever for input from stdin) when doing
      "ssh $VM perf test".
      
      The hang can be reproduced by running "cat | perf test topo", and
      turned into a crash by replacing cat with true:
      
        $ true | perf test -v topo
        40: Session topology                                      :
        --- start ---
        test child forked, pid 3638
        templ file: /tmp/perf-test-QPvAch
        incompatible file format
        incompatible file format (rerun with -v to learn more)
        free(): invalid pointer
        test child interrupted
        ---- end ----
        Session topology: FAILED!
      
      Committer testing:
      
      Reproduced the above result before the patch; after it, the test is
      back to working:
      
        # true | perf test -v topo
        41: Session topology                                      :
        --- start ---
        test child forked, pid 19374
        templ file: /tmp/perf-test-YOTEQg
        CPU 0, core 0, socket 0
        CPU 1, core 1, socket 0
        CPU 2, core 2, socket 0
        CPU 3, core 3, socket 0
        CPU 4, core 0, socket 0
        CPU 5, core 1, socket 0
        CPU 6, core 2, socket 0
        CPU 7, core 3, socket 0
        test child finished with 0
        ---- end ----
        Session topology: Ok
        #
      
      Fixes: 2d4f2799 ("perf data: Add global path holder")
      Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Link: http://lore.kernel.org/lkml/20200423115341.562782-1-tommi.t.rantala@nokia.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf stat: Improve runtime stat for interval mode · 197ba86f
      Authored by Jin Yao
      For interval mode, the metric is printed after the '#' character if
      it exists, but it is not calculated from the counts generated in that
      interval.
      
      See the following examples:
      
        root@kbl-ppc:~# perf stat -M CPI -I1000 --interval-count 2
        #           time             counts unit events
             1.000422803            764,809      inst_retired.any          #      2.9 CPI
             1.000422803          2,234,932      cycles
             2.001464585          1,960,061      inst_retired.any          #      1.6 CPI
             2.001464585          4,022,591      cycles
      
      The second CPI should not be 1.6 (4,022,591 / 1,960,061 is 2.1).
      
        root@kbl-ppc:~# perf stat -e cycles,instructions -I1000 --interval-count 2
        #           time             counts unit events
             1.000429493          2,869,311      cycles
             1.000429493            816,875      instructions              #    0.28  insn per cycle
             2.001516426          9,260,973      cycles
             2.001516426          5,250,634      instructions              #    0.87  insn per cycle
      
      The second 'insn per cycle' should not be 0.87 (5,250,634/9,260,973 is
      0.57).
      
      The current code uses a global variable, 'rt_stat', to track and
      update the standard deviation of the runtime stat. Unlike the counts,
      'rt_stat' is not reset per interval, while the counts are:
      
        perf_stat_process_counter()
        {
                if (config->interval)
                        init_stats(ps->res_stats);
        }
      
      So for interval mode, the 'rt_stat' variable should be reset too.

      This patch resets 'rt_stat' before read_counters(), so the runtime
      stat is calculated only from the counts generated within this
      interval.
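
      A self-contained model of the failure mode (the names are
      illustrative, not perf's internals, but it reproduces the numbers
      reported above):

        #include <stdio.h>

        struct stats { double sum; unsigned int n; };

        static void init_stats(struct stats *s) { s->sum = 0; s->n = 0; }
        static void update_stats(struct stats *s, double v) { s->sum += v; s->n++; }
        static double avg_stats(const struct stats *s) { return s->sum / s->n; }

        int main(void)
        {
                /* Counts from the second example above. */
                double cycles[2] = { 2869311, 9260973 };
                double insns[2]  = { 816875, 5250634 };
                struct stats rt_cycles; /* never reset per interval: the bug */
                int t;

                init_stats(&rt_cycles);
                for (t = 0; t < 2; t++) {
                        update_stats(&rt_cycles, cycles[t]);
                        printf("interval %d: printed IPC %.2f, correct IPC %.2f\n",
                               t + 1, insns[t] / avg_stats(&rt_cycles),
                               insns[t] / cycles[t]);
                }
                /* Prints 0.28 vs 0.28, then 0.87 vs 0.57; resetting
                 * rt_cycles at the start of each interval makes the two
                 * columns agree. */
                return 0;
        }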
      
      With this patch:
      
        root@kbl-ppc:~# perf stat -M CPI -I1000 --interval-count 2
        #           time             counts unit events
             1.000420924          2,408,818      inst_retired.any          #      2.1 CPI
             1.000420924          5,010,111      cycles
             2.001448579          2,798,407      inst_retired.any          #      1.6 CPI
             2.001448579          4,599,861      cycles
      
        root@kbl-ppc:~# perf stat -e cycles,instructions -I1000 --interval-count 2
        #           time             counts unit events
             1.000428555          2,769,714      cycles
             1.000428555            774,462      instructions              #    0.28  insn per cycle
             2.001471562          3,595,904      cycles
             2.001471562          1,243,703      instructions              #    0.35  insn per cycle
      
      Now the second 'insn per cycle' and CPI are calculated from the
      counts generated in that interval.
      Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Tested-by: Kajol Jain <kjain@linux.ibm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jin Yao <yao.jin@intel.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/20200420145417.6864-1-yao.jin@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf stat: Zero all the 'ena' and 'run' array slot stats for interval mode · 0e0bf1ea
      Authored by Jin Yao
      As the code comments in perf_stat_process_counter() say, we calculate
      the counter's data every interval, and the display code shows the
      ps->res_stats average value, so we need to zero the stats for
      interval mode.

      But the current code only zeros res_stats[0]; it doesn't zero
      res_stats[1] and res_stats[2], which hold the counter's 'ena' and
      'run' values.
      
      This patch zeros the whole res_stats[] for interval mode.
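
      The fix then has roughly this shape inside
      perf_stat_process_counter() (a sketch, not the verbatim patch):

        if (config->interval) {
                int i;

                /* Zero the value ([0]) as well as ena ([1]) and run ([2]). */
                for (i = 0; i < 3; i++)
                        init_stats(&ps->res_stats[i]);
        }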
      
      Fixes: 51fd2df1 ("perf stat: Fix interval output values")
      Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jin Yao <yao.jin@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/20200409070755.17261-1-yao.jin@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  3. 22 April 2020, 25 commits
  4. 21 April 2020, 8 commits
  5. 20 April 2020, 1 commit
    • vhost: disable for OABI · d085eb8c
      Authored by Michael S. Tsirkin
      vhost is currently broken on some ARM configs.
      
      The reason is that ring element addresses are passed between
      components with different alignment assumptions. Thus, if the guest
      selects a pointer and the host then gets and dereferences it, the
      alignment assumed by the host's compiler might be greater than the
      actual alignment of the pointer, and nothing stops the compiler on
      the host from assuming the pointer is aligned.
      
      This actually triggers on ARM with -mabi=apcs-gnu, which is a
      deprecated configuration. With this OABI, the compiler assumes that
      all structures are 4-byte aligned, which is stronger than what virtio
      guarantees for the available and used rings, which are merely 2-byte
      aligned. Thus a guest built without -mabi=apcs-gnu running on top of
      a host built with -mabi=apcs-gnu will be broken.
      
      The correct fix is to force alignment of structures - however
      that is an intrusive fix that's best deferred until the next release.
      
      We didn't previously support such ancient systems at all; this
      surfaced after vdpa support prompted removing the dependency of vhost
      on VIRTUALIZATION. So for now, let's just add something along the
      lines of
      
      	depends on !ARM || AEABI
      
      to the virtio Kconfig declaration, and add a comment that it has to do
      with struct member alignment.
      
      Note: we can't make VHOST and VHOST_RING themselves have the
      dependency, since these symbols are selected rather than enabled
      directly. Add a new symbol for that.
      
      We should be able to drop this dependency down the road.
      
      Fixes: 20c384f1 ("vhost: refine vhost and vringh kconfig")
      Suggested-by: Ard Biesheuvel <ardb@kernel.org>
      Suggested-by: Richard Earnshaw <Richard.Earnshaw@arm.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>