1. 23 4月, 2020 4 次提交
    • S
      perf record: Add num-synthesize-threads option · d99c22ea
      Stephane Eranian 提交于
      To control degree of parallelism of the synthesize_mmap() code which
      is scanning /proc/PID/task/PID/maps and can be time consuming.
      Mimic perf top way of handling the option.
      If not specified will default to 1 thread, i.e. default behavior before
      this option.
      
      On a desktop computer the processing of /proc/PID/task/PID/maps isn't
      slow enough to warrant parallel processing and the thread creation has
      some cost - hence the default of 1. On a loaded server with
      >100 cores it is possible to see synthesis times in the order of
      seconds and in this case having the option is desirable.
      
      As the processing is a synchronization point, it is legitimate to worry if
      Amdahl's law will apply to this patch. Profiling with this patch in
      place:
      https://lore.kernel.org/lkml/20200415054050.31645-4-irogers@google.com/
      shows:
      ...
            - 32.59% __perf_event__synthesize_threads
               - 32.54% __event__synthesize_thread
                  + 22.13% perf_event__synthesize_mmap_events
                  + 6.68% perf_event__get_comm_ids.constprop.0
                  + 1.49% process_synthesized_event
                  + 1.29% __GI___readdir64
                  + 0.60% __opendir
      ...
      That is the processing is 1.49% of execution time and there is plenty to
      make parallel. This is shown in the benchmark in this patch:
      
      https://lore.kernel.org/lkml/20200415054050.31645-2-irogers@google.com/
      
        Computing performance of multi threaded perf event synthesis by
        synthesizing events on CPU 0:
         Number of synthesis threads: 1
           Average synthesis took: 127729.000 usec (+- 3372.880 usec)
           Average num. events: 21548.600 (+- 0.306)
           Average time per event 5.927 usec
         Number of synthesis threads: 2
           Average synthesis took: 88863.500 usec (+- 385.168 usec)
           Average num. events: 21552.800 (+- 0.327)
           Average time per event 4.123 usec
         Number of synthesis threads: 3
           Average synthesis took: 83257.400 usec (+- 348.617 usec)
           Average num. events: 21553.200 (+- 0.327)
           Average time per event 3.863 usec
         Number of synthesis threads: 4
           Average synthesis took: 75093.000 usec (+- 422.978 usec)
           Average num. events: 21554.200 (+- 0.200)
           Average time per event 3.484 usec
         Number of synthesis threads: 5
           Average synthesis took: 64896.600 usec (+- 353.348 usec)
           Average num. events: 21558.000 (+- 0.000)
           Average time per event 3.010 usec
         Number of synthesis threads: 6
           Average synthesis took: 59210.200 usec (+- 342.890 usec)
           Average num. events: 21560.000 (+- 0.000)
           Average time per event 2.746 usec
         Number of synthesis threads: 7
           Average synthesis took: 54093.900 usec (+- 306.247 usec)
           Average num. events: 21562.000 (+- 0.000)
           Average time per event 2.509 usec
         Number of synthesis threads: 8
           Average synthesis took: 48938.700 usec (+- 341.732 usec)
           Average num. events: 21564.000 (+- 0.000)
           Average time per event 2.269 usec
      
      Where average time per synthesized event goes from 5.927 usec with 1
      thread to 2.269 usec with 8. This isn't a linear speed up as not all of
      synthesize code has been made parallel. If the synthesis time was about
      10 seconds then using 8 threads may bring this down to less than 4.
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Reviewed-by: NIan Rogers <irogers@google.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tony Jones <tonyj@suse.de>
      Cc: yuzhoujian <yuzhoujian@didichuxing.com>
      Link: http://lore.kernel.org/lkml/20200422155038.9380-1-irogers@google.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      d99c22ea
    • T
      perf test session topology: Fix data path · dbd660e6
      Tommi Rantala 提交于
      Commit 2d4f2799 ("perf data: Add global path holder") missed path
      conversion in tests/topology.c, causing the "Session topology" testcase
      to "hang" (waits forever for input from stdin) when doing "ssh $VM perf
      test".
      
      Can be reproduced by running "cat | perf test topo", and crashed by
      replacing cat with true:
      
        $ true | perf test -v topo
        40: Session topology                                      :
        --- start ---
        test child forked, pid 3638
        templ file: /tmp/perf-test-QPvAch
        incompatible file format
        incompatible file format (rerun with -v to learn more)
        free(): invalid pointer
        test child interrupted
        ---- end ----
        Session topology: FAILED!
      
      Committer testing:
      
      Reproduced the above result before the patch and after it is back
      working:
      
        # true | perf test -v topo
        41: Session topology                                      :
        --- start ---
        test child forked, pid 19374
        templ file: /tmp/perf-test-YOTEQg
        CPU 0, core 0, socket 0
        CPU 1, core 1, socket 0
        CPU 2, core 2, socket 0
        CPU 3, core 3, socket 0
        CPU 4, core 0, socket 0
        CPU 5, core 1, socket 0
        CPU 6, core 2, socket 0
        CPU 7, core 3, socket 0
        test child finished with 0
        ---- end ----
        Session topology: Ok
        #
      
      Fixes: 2d4f2799 ("perf data: Add global path holder")
      Signed-off-by: NTommi Rantala <tommi.t.rantala@nokia.com>
      Tested-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Link: http://lore.kernel.org/lkml/20200423115341.562782-1-tommi.t.rantala@nokia.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      dbd660e6
    • J
      perf stat: Improve runtime stat for interval mode · 197ba86f
      Jin Yao 提交于
      For interval mode, the metric is printed after the '#' character if it
      exists. But it's not calculated by the counts generated in this
      interval.
      
      See the following examples:
      
        root@kbl-ppc:~# perf stat -M CPI -I1000 --interval-count 2
        #           time             counts unit events
             1.000422803            764,809      inst_retired.any          #      2.9 CPI
             1.000422803          2,234,932      cycles
             2.001464585          1,960,061      inst_retired.any          #      1.6 CPI
             2.001464585          4,022,591      cycles
      
      The second CPI should not be 1.6 (4,022,591/1,960,061 is 2.1)
      
        root@kbl-ppc:~# perf stat -e cycles,instructions -I1000 --interval-count 2
        #           time             counts unit events
             1.000429493          2,869,311      cycles
             1.000429493            816,875      instructions              #    0.28  insn per cycle
             2.001516426          9,260,973      cycles
             2.001516426          5,250,634      instructions              #    0.87  insn per cycle
      
      The second 'insn per cycle' should not be 0.87 (5,250,634/9,260,973 is
      0.57).
      
      The current code uses a global variable 'rt_stat' for tracking and
      updating the std dev of runtime stat. Unlike the counts, 'rt_stat' is not
      reset for interval. While the counts are reset for interval.
      
        perf_stat_process_counter()
        {
                if (config->interval)
                        init_stats(ps->res_stats);
        }
      
      So for interval mode, the 'rt_stat' variable should be reset too.
      
      This patch resets 'rt_stat' before read_counters(), so the runtime stat
      is only calculated by the counts generated in this interval.
      
      With this patch:
      
        root@kbl-ppc:~# perf stat -M CPI -I1000 --interval-count 2
        #           time             counts unit events
             1.000420924          2,408,818      inst_retired.any          #      2.1 CPI
             1.000420924          5,010,111      cycles
             2.001448579          2,798,407      inst_retired.any          #      1.6 CPI
             2.001448579          4,599,861      cycles
      
        root@kbl-ppc:~# perf stat -e cycles,instructions -I1000 --interval-count 2
        #           time             counts unit events
             1.000428555          2,769,714      cycles
             1.000428555            774,462      instructions              #    0.28  insn per cycle
             2.001471562          3,595,904      cycles
             2.001471562          1,243,703      instructions              #    0.35  insn per cycle
      
      Now the second 'insn per cycle' and CPI are calculated by the counts
      generated in this interval.
      Signed-off-by: NJin Yao <yao.jin@linux.intel.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Tested-By: NKajol Jain <kjain@linux.ibm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jin Yao <yao.jin@intel.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/20200420145417.6864-1-yao.jin@linux.intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      197ba86f
    • J
      perf stat: Zero all the 'ena' and 'run' array slot stats for interval mode · 0e0bf1ea
      Jin Yao 提交于
      As the code comments in perf_stat_process_counter() say, we calculate
      counter's data every interval, and the display code shows ps->res_stats
      avg value. We need to zero the stats for interval mode.
      
      But the current code only zeros the res_stats[0], it doesn't zero the
      res_stats[1] and res_stats[2], which are for ena and run of counter.
      
      This patch zeros the whole res_stats[] for interval mode.
      
      Fixes: 51fd2df1 ("perf stat: Fix interval output values")
      Signed-off-by: NJin Yao <yao.jin@linux.intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jin Yao <yao.jin@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/20200409070755.17261-1-yao.jin@linux.intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      0e0bf1ea
  2. 22 4月, 2020 25 次提交
  3. 21 4月, 2020 8 次提交
  4. 20 4月, 2020 3 次提交
    • M
      vhost: disable for OABI · d085eb8c
      Michael S. Tsirkin 提交于
      vhost is currently broken on the some ARM configs.
      
      The reason is that the ring element addresses are passed between
      components with different alignments assumptions. Thus, if
      guest selects a pointer and host then gets and dereferences
      it, then alignment assumed by the host's compiler might be
      greater than the actual alignment of the pointer.
      compiler on the host from assuming pointer is aligned.
      
      This actually triggers on ARM with -mabi=apcs-gnu - which is a
      deprecated configuration. With this OABI, compiler assumes that
      all structures are 4 byte aligned - which is stronger than
      virtio guarantees for available and used rings, which are
      merely 2 bytes. Thus a guest without -mabi=apcs-gnu running
      on top of host with -mabi=apcs-gnu will be broken.
      
      The correct fix is to force alignment of structures - however
      that is an intrusive fix that's best deferred until the next release.
      
      We didn't previously support such ancient systems at all - this surfaced
      after vdpa support prompted removing dependency of vhost on
      VIRTULIZATION. So for now, let's just add something along the lines of
      
      	depends on !ARM || AEABI
      
      to the virtio Kconfig declaration, and add a comment that it has to do
      with struct member alignment.
      
      Note: we can't make VHOST and VHOST_RING themselves have
      a dependency since these are selected. Add a new symbol for that.
      
      We should be able to drop this dependency down the road.
      
      Fixes: 20c384f1 ("vhost: refine vhost and vringh kconfig")
      Suggested-by: NArd Biesheuvel <ardb@kernel.org>
      Suggested-by: NRichard Earnshaw <Richard.Earnshaw@arm.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      d085eb8c
    • C
    • E
      KVM: s390: Fix PV check in deliverable_irqs() · d47c4c45
      Eric Farman 提交于
      The diag 0x44 handler, which handles a directed yield, goes into a
      a codepath that does a kvm_for_each_vcpu() and ultimately
      deliverable_irqs().  The new check for kvm_s390_pv_cpu_is_protected()
      contains an assertion that the vcpu->mutex is held, which isn't going
      to be the case in this scenario.
      
      The result is a plethora of these messages if the lock debugging
      is enabled, and thus an implication that we have a problem.
      
        WARNING: CPU: 9 PID: 16167 at arch/s390/kvm/kvm-s390.h:239 deliverable_irqs+0x1c6/0x1d0 [kvm]
        ...snip...
        Call Trace:
         [<000003ff80429bf2>] deliverable_irqs+0x1ca/0x1d0 [kvm]
        ([<000003ff80429b34>] deliverable_irqs+0x10c/0x1d0 [kvm])
         [<000003ff8042ba82>] kvm_s390_vcpu_has_irq+0x2a/0xa8 [kvm]
         [<000003ff804101e2>] kvm_arch_dy_runnable+0x22/0x38 [kvm]
         [<000003ff80410284>] kvm_vcpu_on_spin+0x8c/0x1d0 [kvm]
         [<000003ff80436888>] kvm_s390_handle_diag+0x3b0/0x768 [kvm]
         [<000003ff80425af4>] kvm_handle_sie_intercept+0x1cc/0xcd0 [kvm]
         [<000003ff80422bb0>] __vcpu_run+0x7b8/0xfd0 [kvm]
         [<000003ff80423de6>] kvm_arch_vcpu_ioctl_run+0xee/0x3e0 [kvm]
         [<000003ff8040ccd8>] kvm_vcpu_ioctl+0x2c8/0x8d0 [kvm]
         [<00000001504ced06>] ksys_ioctl+0xae/0xe8
         [<00000001504cedaa>] __s390x_sys_ioctl+0x2a/0x38
         [<0000000150cb9034>] system_call+0xd8/0x2d8
        2 locks held by CPU 2/KVM/16167:
         #0: 00000001951980c0 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0x90/0x8d0 [kvm]
         #1: 000000019599c0f0 (&kvm->srcu){....}, at: __vcpu_run+0x4bc/0xfd0 [kvm]
        Last Breaking-Event-Address:
         [<000003ff80429b34>] deliverable_irqs+0x10c/0x1d0 [kvm]
        irq event stamp: 11967
        hardirqs last  enabled at (11975): [<00000001502992f2>] console_unlock+0x4ca/0x650
        hardirqs last disabled at (11982): [<0000000150298ee8>] console_unlock+0xc0/0x650
        softirqs last  enabled at (7940): [<0000000150cba6ca>] __do_softirq+0x422/0x4d8
        softirqs last disabled at (7929): [<00000001501cd688>] do_softirq_own_stack+0x70/0x80
      
      Considering what's being done here, let's fix this by removing the
      mutex assertion rather than acquiring the mutex for every other vcpu.
      
      Fixes: 201ae986 ("KVM: s390: protvirt: Implement interrupt injection")
      Signed-off-by: NEric Farman <farman@linux.ibm.com>
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: NCornelia Huck <cohuck@redhat.com>
      Link: https://lore.kernel.org/r/20200415190353.63625-1-farman@linux.ibm.comSigned-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      d47c4c45