1. 02 12月, 2014 1 次提交
    • A
      perf callchain: Support handling complete branch stacks as histograms · 8b7bad58
      Andi Kleen 提交于
      Currently branch stacks can be only shown as edge histograms for
      individual branches. I never found this display particularly useful.
      
      This implements an alternative mode that creates histograms over
      complete branch traces, instead of individual branches, similar to how
      normal callgraphs are handled. This is done by putting it in front of
      the normal callgraph and then using the normal callgraph histogram
      infrastructure to unify them.
      
      This way in complex functions we can understand the control flow that
      lead to a particular sample, and may even see some control flow in the
      caller for short functions.
      
      Example (simplified, of course for such simple code this is usually not
      needed), please run this after the whole patchkit is in, as at this
      point in the patch order there is no --branch-history, that will be
      added in a patch after this one:
      
      tcall.c:
      
      volatile a = 10000, b = 100000, c;
      
      __attribute__((noinline)) f2()
      {
      	c = a / b;
      }
      
      __attribute__((noinline)) f1()
      {
      	f2();
      	f2();
      }
      main()
      {
      	int i;
      	for (i = 0; i < 1000000; i++)
      		f1();
      }
      
      % perf record -b -g ./tsrc/tcall
      [ perf record: Woken up 1 times to write data ]
      [ perf record: Captured and wrote 0.044 MB perf.data (~1923 samples) ]
      % perf report --no-children --branch-history
      ...
          54.91%  tcall.c:6  [.] f2                      tcall
                  |
                  |--65.53%-- f2 tcall.c:5
                  |          |
                  |          |--70.83%-- f1 tcall.c:11
                  |          |          f1 tcall.c:10
                  |          |          main tcall.c:18
                  |          |          main tcall.c:18
                  |          |          main tcall.c:17
                  |          |          main tcall.c:17
                  |          |          f1 tcall.c:13
                  |          |          f1 tcall.c:13
                  |          |          f2 tcall.c:7
                  |          |          f2 tcall.c:5
                  |          |          f1 tcall.c:12
                  |          |          f1 tcall.c:12
                  |          |          f2 tcall.c:7
                  |          |          f2 tcall.c:5
                  |          |          f1 tcall.c:11
                  |          |
                  |           --29.17%-- f1 tcall.c:12
                  |                     f1 tcall.c:12
                  |                     f2 tcall.c:7
                  |                     f2 tcall.c:5
                  |                     f1 tcall.c:11
                  |                     f1 tcall.c:10
                  |                     main tcall.c:18
                  |                     main tcall.c:18
                  |                     main tcall.c:17
                  |                     main tcall.c:17
                  |                     f1 tcall.c:13
                  |                     f1 tcall.c:13
                  |                     f2 tcall.c:7
                  |                     f2 tcall.c:5
                  |                     f1 tcall.c:12
      
      The default output is unchanged.
      
      This is only implemented in perf report, no change to record or anywhere
      else.
      
      This adds the basic code to report:
      
      - add a new "branch" option to the -g option parser to enable this mode
      - when the flag is set include the LBR into the callstack in machine.c.
      
      The rest of the history code is unchanged and doesn't know the
      difference between LBR entry and normal call entry.
      
      - detect overlaps with the callchain
      - remove small loop duplicates in the LBR
      
      Current limitations:
      
      - The LBR flags (mispredict etc.) are not shown in the history
      and LBR entries have no special marker.
      - It would be nice if annotate marked the LBR entries somehow
      (e.g. with arrows)
      
      v2: Various fixes.
      v3: Merge further patches into this one. Fix white space.
      v4: Improve manpage. Address review feedback.
      v5: Rename functions. Better error message without -g. Fix crash without
          -b.
      v6: Rebase
      v7: Rebase. Use NO_ENTRY in memset.
      v8: Port to latest tip. Move add_callchain_ip to separate
          patch. Skip initial entries in callchain. Minor cleanups.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: http://lkml.kernel.org/r/1415844328-4884-3-git-send-email-andi@firstfloor.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      8b7bad58
  2. 25 11月, 2014 1 次提交
  3. 19 11月, 2014 2 次提交
  4. 05 11月, 2014 2 次提交
    • N
      perf tools: Make vmlinux short name more like kallsyms short name · 96d78059
      Namhyung Kim 提交于
      The previous patch changed kernel dso name from '[kernel.kallsyms]' to
      vmlinux.  However it might add confusion to old users accustomed to the
      old name.  So change the short name to '[kernel.vmlinux]' to reduce such
      confusion.
      
      Before:
        # Overhead  Command         Shared Object            Symbol
        # ........  ..............  .......................  ...............................
        #
             9.83%  swapper         vmlinux                  [k] intel_idle
             4.10%  awk             libc-2.20.so             [.] __strcmp_sse2
             1.86%  sed             libc-2.20.so             [.] __strcmp_sse2
             1.78%  netctl-auto     libc-2.20.so             [.] __strcmp_sse2
             1.23%  netctl-auto     libc-2.20.so             [.] __mbrtowc
             1.21%  firefox         libxul.so                [.] 0x00000000024b62bd
             1.20%  swapper         vmlinux                  [k] cpuidle_enter_state
             1.03%  sleep           vmlinux                  [k] copy_user_generic_unrolled
      
      After:
        # Overhead  Command         Shared Object            Symbol
        # ........  ..............  .......................  ...............................
        #
             9.83%  swapper         [kernel.vmlinux]         [k] intel_idle
             4.10%  awk             libc-2.20.so             [.] __strcmp_sse2
             1.86%  sed             libc-2.20.so             [.] __strcmp_sse2
             1.78%  netctl-auto     libc-2.20.so             [.] __strcmp_sse2
             1.23%  netctl-auto     libc-2.20.so             [.] __mbrtowc
             1.21%  firefox         libxul.so                [.] 0x00000000024b62bd
             1.20%  swapper         [kernel.vmlinux]         [k] cpuidle_enter_state
             1.03%  sleep           [kernel.vmlinux]         [k] copy_user_generic_unrolled
      Signed-off-by: NNamhyung Kim <namhyung@kernel.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Stephane Eranian <eranian@google.com>
      Link: http://lkml.kernel.org/r/1415063674-17206-9-git-send-email-namhyung@kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      96d78059
    • N
      perf tools: Fix build-id matching on vmlinux · b837a8bd
      Namhyung Kim 提交于
      There's a problem on finding correct kernel symbols when perf report
      runs on a different kernel.  Although a part of the problem was solved
      by the prior commit 0a7e6d1b ("perf tools: Check recorded kernel
      version when finding vmlinux"), there's a remaining problem still.
      
      When perf records samples, it synthesizes the kernel map using
      machine__mmap_name() and ref_reloc_sym like "[kernel.kallsyms]_text".
      You can easily see it using 'perf report -D' command.
      
      After finishing record, it goes through the recorded events to find
      maps/dsos actually used.  And then record build-id info of them.
      
      During this process, it needs to load symbols in a dso and it'd call
      dso__load_vmlinux_path() since the default value of the symbol_conf.
      try_vmlinux_path is true.  However it changes dso->long_name to a real
      path of the vmlinux file (e.g. /lib/modules/3.16.4/build/vmlinux) if one
      is running on a custom kernel.
      
      It resulted in that perf report reads the build-id of the vmlinux, but
      cannot use it since it only knows about the [kernel.kallsyms] map.  It
      then falls back to possible vmlinux paths by using the recorded kernel
      version (in case of a recent version) or a running kernel silently.
      
      Even with the recent tools, this still has a possibility of breaking
      the result.  As the build directory is a symbolic link, if one built a
      new kernel in the same directory with different source/config, the old
      link to vmlinux will point the new file.  So it's absolutely needed to
      use build-id when finding a kernel image.
      
      In this patch, it's now changed to try to search a kernel dso in the
      existing dso list which was constructed during build-id table parsing
      so it'll always have a build-id.  If not found, search "[kernel.kallsyms]".
      
      Before:
      
        $ perf report
        # Children      Self  Command  Shared Object      Symbol
        # ........  ........  .......  .................  ...............................
        #
            72.15%     0.00%  swapper  [kernel.kallsyms]  [k] set_curr_task_rt
            72.15%     0.00%  swapper  [kernel.kallsyms]  [k] native_calibrate_tsc
            72.15%     0.00%  swapper  [kernel.kallsyms]  [k] tsc_refine_calibration_work
            71.87%    71.87%  swapper  [kernel.kallsyms]  [k] module_finalize
         ...
      
      After (for the same perf.data):
      
            72.15%     0.00%  swapper  vmlinux  [k] cpu_startup_entry
            72.15%     0.00%  swapper  vmlinux  [k] arch_cpu_idle
            72.15%     0.00%  swapper  vmlinux  [k] default_idle
            71.87%    71.87%  swapper  vmlinux  [k] native_safe_halt
         ...
      Signed-off-by: NNamhyung Kim <namhyung@kernel.org>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Link: http://lkml.kernel.org/r/20140924073356.GB1962@gmail.com
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Stephane Eranian <eranian@google.com>
      Link: http://lkml.kernel.org/r/1415063674-17206-8-git-send-email-namhyung@kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      b837a8bd
  5. 04 11月, 2014 1 次提交
  6. 29 10月, 2014 4 次提交
  7. 15 10月, 2014 1 次提交
    • A
      perf machine: Add missing dsos->root rbtree root initialization · e167f995
      Arnaldo Carvalho de Melo 提交于
      A segfault happens on 'perf test hists_link' because we end up using a
      struct machines on the stack, and then machines__init() was not
      initializing the newly introduced rb_root, just the existing list_head.
      
      When we introduced struct dsos, to group the two ways to store dsos,
      i.e. the linked list and the rbtree, we didn't turned the initialization
      done in:
      
      	machines__init(machines->host) ->
      		machine__init() ->
      			INIT_LIST_HEAD
      
      into a dsos__init() to keep on initializing the list_head but _as well_
      initializing the rb_root, oops.
      
      All worked because outside perf-test we probably zalloc the whole thing
      which ends up initializing it in to NULL.
      
      So the problem looks contained to 'perf test' that uses it on stack,
      etc.
      Reported-by: NJiri Olsa <jolsa@redhat.com>
      Acked-by: Waiman Long <Waiman.Long@hp.com>,
      Cc: Adrian Hunter <adrian.hunter@intel.com>,
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Waiman Long <Waiman.Long@hp.com>,
      Link: http://lkml.kernel.org/r/20141014180353.GF3198@kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      e167f995
  8. 02 10月, 2014 1 次提交
    • W
      perf symbols: Improve DSO long names lookup speed with rbtree · 4598a0a6
      Waiman Long 提交于
      With workload that spawns and destroys many threads and processes, it
      was found that perf-mem could took a long time to post-process the perf
      data after the target workload had completed its operation.
      
      The performance bottleneck was found to be the lookup and insertion of
      the new DSO structures (thousands of them in this case).
      
      In a dual-socket Ivy-Bridge E7-4890 v2 machine (30-core, 60-thread), the
      perf profile below shows what perf was doing after the profiled AIM7
      shared workload completed:
      
      -     83.94%  perf  libc-2.11.3.so     [.] __strcmp_sse42
         - __strcmp_sse42
            - 99.82% map__new
                 machine__process_mmap_event
                 perf_session_deliver_event
                 perf_session__process_event
                 __perf_session__process_events
                 cmd_record
                 cmd_mem
                 run_builtin
                 main
                 __libc_start_main
      -     13.17%  perf  perf               [.] __dsos__findnew
           __dsos__findnew
           map__new
           machine__process_mmap_event
           perf_session_deliver_event
           perf_session__process_event
           __perf_session__process_events
           cmd_record
           cmd_mem
           run_builtin
           main
           __libc_start_main
      
      So about 97% of CPU times were spent in the map__new() function trying
      to insert new DSO entry into the DSO linked list. The whole
      post-processing step took about 9 minutes.
      
      The DSO structures are currently searched linearly. So the total
      processing time will be proportional to n^2.
      
      To overcome this performance problem, the DSO code is modified to also
      put the DSO structures in a RB tree sorted by its long name in
      additional to being in a simple linked list. With this change, the
      processing time will become proportional to n*log(n) which will be much
      quicker for large n. However, the short name will still be searched
      using the old linear searching method.  With that patch in place, the
      same perf-mem post-processing step took less than 30 seconds to
      complete.
      Signed-off-by: NWaiman Long <Waiman.Long@hp.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hp.com>
      Link: http://lkml.kernel.org/r/1412098575-27863-3-git-send-email-Waiman.Long@hp.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      4598a0a6
  9. 30 9月, 2014 1 次提交
    • W
      perf symbols: Encapsulate dsos list head into struct dsos · 8fa7d87f
      Waiman Long 提交于
      This is a precursor patch to enable long name searching of DSOs using
      a rbtree.
      
      In this patch, a new dsos structure is created which contains only a
      list head structure for the moment.
      
      The new dsos structure is used, in turn, in the machine structure for
      the user_dsos and kernel_dsos fields.
      
      Only the following 3 dsos functions are modified to accept the new dsos
      structure parameter instead of list_head:
      
       - dsos__add()
       - dsos__find()
       - __dsos__findnew()
      Signed-off-by: NWaiman Long <Waiman.Long@hp.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hp.com>
      Link: http://lkml.kernel.org/r/1412021249-19201-2-git-send-email-Waiman.Long@hp.com
      [ Move struct dsos to dso.h to reduce the dso methods depends on machine.h ]
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      8fa7d87f
  10. 23 8月, 2014 3 次提交
  11. 14 8月, 2014 2 次提交
  12. 24 7月, 2014 3 次提交
  13. 23 7月, 2014 1 次提交
  14. 17 7月, 2014 3 次提交
  15. 27 6月, 2014 1 次提交
    • S
      perf tools powerpc: Adjust callchain based on DWARF debug info · a60335ba
      Sukadev Bhattiprolu 提交于
      When saving the callchain on Power, the kernel conservatively saves excess
      entries in the callchain. A few of these entries are needed in some cases
      but not others. We should use the DWARF debug information to determine
      when the entries are  needed.
      
      Eg: the value in the link register (LR) is needed only when it holds the
      return address of a function. At other times it must be ignored.
      
      If the unnecessary entries are not ignored, we end up with duplicate arcs
      in the call-graphs.
      
      Use the DWARF debug information to determine if any callchain entries
      should be ignored when building call-graphs.
      
      Callgraph before the patch:
      
          14.67%          2234  sprintft  libc-2.18.so       [.] __random
                  |
                  --- __random
                     |
                     |--61.12%-- __random
                     |          |
                     |          |--97.15%-- rand
                     |          |          do_my_sprintf
                     |          |          main
                     |          |          generic_start_main.isra.0
                     |          |          __libc_start_main
                     |          |          0x0
                     |          |
                     |           --2.85%-- do_my_sprintf
                     |                     main
                     |                     generic_start_main.isra.0
                     |                     __libc_start_main
                     |                     0x0
                     |
                      --38.88%-- rand
                                |
                                |--94.01%-- rand
                                |          do_my_sprintf
                                |          main
                                |          generic_start_main.isra.0
                                |          __libc_start_main
                                |          0x0
                                |
                                 --5.99%-- do_my_sprintf
                                           main
                                           generic_start_main.isra.0
                                           __libc_start_main
                                           0x0
      
      Callgraph after the patch:
      
          14.67%          2234  sprintft  libc-2.18.so       [.] __random
                  |
                  --- __random
                     |
                     |--95.93%-- rand
                     |          do_my_sprintf
                     |          main
                     |          generic_start_main.isra.0
                     |          __libc_start_main
                     |          0x0
                     |
                      --4.07%-- do_my_sprintf
                                main
                                generic_start_main.isra.0
                                __libc_start_main
                                0x0
      
      TODO:	For split-debug info objects like glibc, we can only determine
      	the call-frame-address only when both .eh_frame and .debug_info
      	sections are available. We should be able to determin the CFA
      	even without the .eh_frame section.
      
      Fix suggested by Anton Blanchard.
      
      Thanks to valuable input on DWARF debug information from Ulrich Weigand.
      Reported-by: NMaynard Johnson <maynard@us.ibm.com>
      Tested-by: NMaynard Johnson <maynard@us.ibm.com>
      Signed-off-by: NSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/20140625154903.GA29607@us.ibm.comSigned-off-by: NJiri Olsa <jolsa@kernel.org>
      a60335ba
  16. 20 6月, 2014 1 次提交
  17. 09 6月, 2014 1 次提交
  18. 30 4月, 2014 1 次提交
  19. 28 4月, 2014 1 次提交
  20. 19 3月, 2014 2 次提交
  21. 15 3月, 2014 2 次提交
  22. 10 3月, 2014 1 次提交
  23. 18 2月, 2014 3 次提交
  24. 01 2月, 2014 1 次提交