1. 21 12月, 2011 4 次提交
  2. 06 12月, 2011 2 次提交
    • G
      perf, core: Rate limit perf_sched_events jump_label patching · b2029520
      Gleb Natapov 提交于
      jump_lable patching is very expensive operation that involves pausing all
      cpus. The patching of perf_sched_events jump_label is easily controllable
      from userspace by unprivileged user.
      
      When te user runs a loop like this:
      
        "while true; do perf stat -e cycles true; done"
      
      ... the performance of my test application that just increments a counter
      for one second drops by 4%.
      
      This is on a 16 cpu box with my test application using only one of
      them. An impact on a real server doing real work will be worse.
      
      Performance of KVM PMU drops nearly 50% due to jump_lable for "perf
      record" since KVM PMU implementation creates and destroys perf event
      frequently.
      
      This patch introduces a way to rate limit jump_label patching and uses
      it to fix the above problem.
      
      I believe that as jump_label use will spread the problem will become more
      common and thus solving it in a generic code is appropriate. Also fixing
      it in the perf code would result in moving jump_label accounting logic to
      perf code with all the ifdefs in case of JUMP_LABEL=n kernel. With this
      patch all details are nicely hidden inside jump_label code.
      Signed-off-by: NGleb Natapov <gleb@redhat.com>
      Acked-by: NJason Baron <jbaron@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20111127155909.GO2557@redhat.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
      b2029520
    • P
      perf: Avoid a useless pmu_disable() in the perf-tick · 0f5a2601
      Peter Zijlstra 提交于
      Gleb writes:
      
       > Currently pmu is disabled and re-enabled on each timer interrupt even
       > when no rotation or frequency adjustment is needed. On Intel CPU this
       > results in two writes into PERF_GLOBAL_CTRL MSR per tick. On bare metal
       > it does not cause significant slowdown, but when running perf in a virtual
       > machine it leads to 20% slowdown on my machine.
      
      Cure this by keeping a perf_event_context::nr_freq counter that counts the
      number of active events that require frequency adjustments and use this in a
      similar fashion to the already existing nr_events != nr_active test in
      perf_rotate_context().
      
      By being able to exclude both rotation and frequency adjustments a-priory for
      the common case we can avoid the otherwise superfluous PMU disable.
      Suggested-by: NGleb Natapov <gleb@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/n/tip-515yhoatehd3gza7we9fapaa@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>
      0f5a2601
  3. 05 12月, 2011 1 次提交
    • P
      perf: Fix loss of notification with multi-event · 10c6db11
      Peter Zijlstra 提交于
      When you do:
              $ perf record -e cycles,cycles,cycles noploop 10
      
      You expect about 10,000 samples for each event, i.e., 10s at
      1000samples/sec. However, this is not what's happening. You
      get much fewer samples, maybe 3700 samples/event:
      
      $ perf report -D | tail -15
      Aggregated stats:
                 TOTAL events:      10998
                  MMAP events:         66
                  COMM events:          2
                SAMPLE events:      10930
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      cycles stats:
                 TOTAL events:       3642
                SAMPLE events:       3642
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      
      On a Intel Nehalem or even AMD64, there are 4 counters capable
      of measuring cycles, so there is plenty of space to measure those
      events without multiplexing (even with the NMI watchdog active).
      And even with multiplexing, we'd expect roughly the same number
      of samples per event.
      
      The root of the problem was that when the event that caused the buffer
      to become full was not the first event passed on the cmdline, the user
      notification would get lost. The notification was sent to the file
      descriptor of the overflowed event but the perf tool was not polling
      on it.  The perf tool aggregates all samples into a single buffer,
      i.e., the buffer of the first event. Consequently, it assumes
      notifications for any event will come via that descriptor.
      
      The seemingly straight forward solution of moving the waitq into the
      ringbuffer object doesn't work because of life-time issues. One could
      perf_event_set_output() on a fd that you're also blocking on and cause
      the old rb object to be freed while its waitq would still be
      referenced by the blocked thread -> FAIL.
      
      Therefore link all events to the ringbuffer and broadcast the wakeup
      from the ringbuffer object to all possible events that could be waited
      upon. This is rather ugly, and we're open to better solutions but it
      works for now.
      Reported-by: NStephane Eranian <eranian@google.com>
      Finished-by: NStephane Eranian <eranian@google.com>
      Reviewed-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20111126014731.GA7030@quadSigned-off-by: NIngo Molnar <mingo@elte.hu>
      10c6db11
  4. 06 10月, 2011 1 次提交
  5. 29 8月, 2011 1 次提交
    • S
      perf events: Fix slow and broken cgroup context switch code · a8d757ef
      Stephane Eranian 提交于
      The current cgroup context switch code was incorrect leading
      to bogus counts. Furthermore, as soon as there was an active
      cgroup event on a CPU, the context switch cost on that CPU
      would increase by a significant amount as demonstrated by a
      simple ping/pong example:
      
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10684.51 ctxsw/s
      
      Now start a cgroup perf stat:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
      
      $ ./pong
       Both processes pinned to CPU1, running for 10s
       6674.61 ctxsw/s
      
      That's a 37% penalty.
      
      Note that pong is not even in the monitored cgroup.
      
      The results shown by perf stat are bogus:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
      
       Performance counter stats for 'sleep 100':
      
       CPU1 <not counted> cycles   test
       CPU1 16,984,189,138 cycles  #    0.000 GHz
      
      The second 'cycles' event should report a count @ CPU clock
      (here 2.4GHz) as it is counting across all cgroups.
      
      The patch below fixes the bogus accounting and bypasses any
      cgroup switches in case the outgoing and incoming tasks are
      in the same cgroup.
      
      With this patch the same test now yields:
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10775.30 ctxsw/s
      
      Start perf stat with cgroup:
      
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      
      Run pong outside the cgroup:
       $ /pong
       Both processes pinned to CPU1, running for 10s
       10687.80 ctxsw/s
      
      The penalty is now less than 2%.
      
      And the results for perf stat are correct:
      
      $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      
       Performance counter stats for 'sleep 10':
      
       CPU1 <not counted> cycles test #    0.000 GHz
       CPU1 23,933,981,448 cycles      #    0.000 GHz
      
      Now perf stat reports the correct counts for
      for the non cgroup event.
      
      If we run pong inside the cgroup, then we also get the
      correct counts:
      
      $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      
       Performance counter stats for 'sleep 10':
      
       CPU1 22,297,726,205 cycles test #    0.000 GHz
       CPU1 23,933,981,448 cycles      #    0.000 GHz
      
            10.001457237 seconds time elapsed
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20110825135803.GA4697@quadSigned-off-by: NIngo Molnar <mingo@elte.hu>
      a8d757ef
  6. 27 7月, 2011 1 次提交
  7. 01 7月, 2011 7 次提交
  8. 09 6月, 2011 1 次提交
  9. 04 6月, 2011 1 次提交
  10. 04 5月, 2011 1 次提交
  11. 29 4月, 2011 1 次提交
    • I
      perf events: Add generic front-end and back-end stalled cycle event definitions · 8f622422
      Ingo Molnar 提交于
      Add two generic hardware events: front-end and back-end stalled cycles.
      
      These events measure conditions when the CPU is executing code but its
      capabilities are not fully utilized. Understanding such situations and
      analyzing them is an important sub-task of code optimization workflows.
      
      Both events limit performance: most front end stalls tend to be caused
      by branch misprediction or instruction fetch cachemisses, backend
      stalls can be caused by various resource shortages or inefficient
      instruction scheduling.
      
      Front-end stalls are the more important ones: code cannot run fast
      if the instruction stream is not being kept up.
      
      An over-utilized back-end can cause front-end stalls and thus
      has to be kept an eye on as well.
      
      The exact composition is very program logic and instruction mix
      dependent.
      
      We use the terms 'stall', 'front-end' and 'back-end' loosely and
      try to use the best available events from specific CPUs that
      approximate these concepts.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/n/tip-7y40wib8n000io7hjpn1dsrm@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>
      8f622422
  12. 27 4月, 2011 1 次提交
  13. 05 4月, 2011 1 次提交
    • J
      jump label: Introduce static_branch() interface · d430d3d7
      Jason Baron 提交于
      Introduce:
      
      static __always_inline bool static_branch(struct jump_label_key *key);
      
      instead of the old JUMP_LABEL(key, label) macro.
      
      In this way, jump labels become really easy to use:
      
      Define:
      
              struct jump_label_key jump_key;
      
      Can be used as:
      
              if (static_branch(&jump_key))
                      do unlikely code
      
      enable/disale via:
      
              jump_label_inc(&jump_key);
              jump_label_dec(&jump_key);
      
      that's it!
      
      For the jump labels disabled case, the static_branch() becomes an
      atomic_read(), and jump_label_inc()/dec() are simply atomic_inc(),
      atomic_dec() operations. We show testing results for this change below.
      
      Thanks to H. Peter Anvin for suggesting the 'static_branch()' construct.
      
      Since we now require a 'struct jump_label_key *key', we can store a pointer into
      the jump table addresses. In this way, we can enable/disable jump labels, in
      basically constant time. This change allows us to completely remove the previous
      hashtable scheme. Thanks to Peter Zijlstra for this re-write.
      
      Testing:
      
      I ran a series of 'tbench 20' runs 5 times (with reboots) for 3
      configurations, where tracepoints were disabled.
      
      jump label configured in
      avg: 815.6
      
      jump label *not* configured in (using atomic reads)
      avg: 800.1
      
      jump label *not* configured in (regular reads)
      avg: 803.4
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110316212947.GA8792@redhat.com>
      Signed-off-by: NJason Baron <jbaron@redhat.com>
      Suggested-by: NH. Peter Anvin <hpa@linux.intel.com>
      Tested-by: NDavid Daney <ddaney@caviumnetworks.com>
      Acked-by: NRalf Baechle <ralf@linux-mips.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      d430d3d7
  14. 31 3月, 2011 2 次提交
    • L
      Fix common misspellings · 25985edc
      Lucas De Marchi 提交于
      Fixes generated by 'codespell' and manually reviewed.
      Signed-off-by: NLucas De Marchi <lucas.demarchi@profusion.mobi>
      25985edc
    • P
      perf: Fix task context scheduling · ab711fe0
      Peter Zijlstra 提交于
      Jiri reported:
      
       |
       | - once an event is created by sys_perf_event_open, task context
       |   is created and it stays even if the event is closed, until the
       |   task is finished ... thats what I see in code and I assume it's
       |   correct
       |
       | - when the task opens event, perf_sched_events jump label is
       |   incremented and following callbacks are started from scheduler
       |
       |         __perf_event_task_sched_in
       |         __perf_event_task_sched_out
       |
       |   These callback *in/out set/unset cpuctx->task_ctx value to the
       |   task context.
       |
       | - close is called on event on CPU 0:
       |         - the task is scheduled on CPU 0
       |         - __perf_event_task_sched_in is called
       |         - cpuctx->task_ctx is set
       |         - perf_sched_events jump label is decremented and == 0
       |         - __perf_event_task_sched_out is not called
       |         - cpuctx->task_ctx on CPU 0 stays set
       |
       | - exit is called on CPU 1:
       |         - the task is scheduled on CPU 1
       |         - perf_event_exit_task is called
       |         - task_ctx_sched_out unsets cpuctx->task_ctx on CPU 1
       |         - put_ctx destroys the context
       |
       | - another call of perf_rotate_context on CPU 0 will use invalid
       |   task_ctx pointer, and eventualy panic.
       |
      
      Cure this the simplest possibly way by partially reverting the
      jump_label optimization for the sched_out case.
      Reported-and-tested-by: NJiri Olsa <jolsa@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@kernel.org> # .37+
      LKML-Reference: <1301520405.4859.213.camel@twins>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ab711fe0
  15. 23 3月, 2011 1 次提交
    • S
      perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx() · 68cacd29
      Stephane Eranian 提交于
      This patch solves a stale pointer problem in
      update_cgrp_time_from_cpuctx(). The cpuctx->cgrp
      was not cleared on all possible event exit paths,
      including:
      
         close()
           perf_release()
             perf_release_kernel()
               list_del_event()
      
      This patch fixes list_del_event() to clear cpuctx->cgrp
      when there are no cgroup events left in the context.
      
      [ This second version makes the code compile when
        CONFIG_CGROUP_PERF is not enabled. We unconditionally define
        perf_cpu_context->cgrp. ]
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Cc: peterz@infradead.org
      Cc: perfmon2-devel@lists.sf.net
      Cc: paulus@samba.org
      Cc: davem@davemloft.net
      LKML-Reference: <20110323150306.GA1580@quad>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      68cacd29
  16. 16 3月, 2011 1 次提交
  17. 04 3月, 2011 1 次提交
    • A
      perf: Add support for supplementary event registers · a7e3ed1e
      Andi Kleen 提交于
      Change logs against Andi's original version:
      
      - Extends perf_event_attr:config to config{,1,2} (Peter Zijlstra)
      - Fixed a major event scheduling issue. There cannot be a ref++ on an
        event that has already done ref++ once and without calling
        put_constraint() in between. (Stephane Eranian)
      - Use thread_cpumask for percore allocation. (Lin Ming)
      - Use MSR names in the extra reg lists. (Lin Ming)
      - Remove redundant "c = NULL" in intel_percore_constraints
      - Fix comment of perf_event_attr::config1
      
      Intel Nehalem/Westmere have a special OFFCORE_RESPONSE event
      that can be used to monitor any offcore accesses from a core.
      This is a very useful event for various tunings, and it's
      also needed to implement the generic LLC-* events correctly.
      
      Unfortunately this event requires programming a mask in a separate
      register. And worse this separate register is per core, not per
      CPU thread.
      
      This patch:
      
      - Teaches perf_events that OFFCORE_RESPONSE needs extra parameters.
        The extra parameters are passed by user space in the
        perf_event_attr::config1 field.
      
      - Adds support to the Intel perf_event core to schedule per
        core resources. This adds fairly generic infrastructure that
        can be also used for other per core resources.
        The basic code has is patterned after the similar AMD northbridge
        constraints code.
      
      Thanks to Stephane Eranian who pointed out some problems
      in the original version and suggested improvements.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NLin Ming <ming.m.lin@intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1299119690-13991-2-git-send-email-ming.m.lin@intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a7e3ed1e
  18. 16 2月, 2011 2 次提交
    • P
      perf: Optimize throttling code · 163ec435
      Peter Zijlstra 提交于
      By pre-computing the maximum number of samples per tick we can avoid a
      multiplication and a conditional since MAX_INTERRUPTS >
      max_samples_per_tick.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      163ec435
    • S
      perf: Add cgroup support · e5d1367f
      Stephane Eranian 提交于
      This kernel patch adds the ability to filter monitoring based on
      container groups (cgroups). This is for use in per-cpu mode only.
      
      The cgroup to monitor is passed as a file descriptor in the pid
      argument to the syscall. The file descriptor must be opened to
      the cgroup name in the cgroup filesystem. For instance, if the
      cgroup name is foo and cgroupfs is mounted in /cgroup, then the
      file descriptor is opened to /cgroup/foo. Cgroup mode is
      activated by passing PERF_FLAG_PID_CGROUP in the flags argument
      to the syscall.
      
      For instance to measure in cgroup foo on CPU1 assuming
      cgroupfs is mounted under /cgroup:
      
      struct perf_event_attr attr;
      int cgroup_fd, fd;
      
      cgroup_fd = open("/cgroup/foo", O_RDONLY);
      fd = perf_event_open(&attr, cgroup_fd, 1, -1, PERF_FLAG_PID_CGROUP);
      close(cgroup_fd);
      Signed-off-by: NStephane Eranian <eranian@google.com>
      [ added perf_cgroup_{exit,attach} ]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <4d590250.114ddf0a.689e.4482@mx.google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e5d1367f
  19. 16 12月, 2010 2 次提交
    • P
      perf: Sysfs enumeration · abe43400
      Peter Zijlstra 提交于
      Simple sysfs emumeration of the PMUs.
      
      Use a "event_source" bus, and add PMU devices using their name.
      
      Each PMU device has a type attribute which contrains the value needed
      for perf_event_attr::type to identify this PMU.
      
      This is the minimal stub needed to start using this interface,
      we'll consider extending the sysfs usage later.
      
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Greg KH <gregkh@suse.de>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101117222056.316982569@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      abe43400
    • P
      perf: Dynamic pmu types · 2e80a82a
      Peter Zijlstra 提交于
      Extend the perf_pmu_register() interface to allow for named and
      dynamic pmu types.
      
      Because we need to support the existing static types we cannot use
      dynamic types for everything, hence provide a type argument.
      
      If we want to enumerate the PMUs they need a name, provide one.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101117222056.259707703@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2e80a82a
  20. 09 12月, 2010 1 次提交
  21. 05 12月, 2010 2 次提交
  22. 01 12月, 2010 1 次提交
  23. 26 11月, 2010 3 次提交
  24. 11 11月, 2010 1 次提交
    • S
      perf_events: Fix time tracking in samples · eed01528
      Stephane Eranian 提交于
      This patch corrects time tracking in samples. Without this patch
      both time_enabled and time_running are bogus when user asks for
      PERF_SAMPLE_READ.
      
      One uses PERF_SAMPLE_READ to sample the values of other counters
      in each sample. Because of multiplexing, it is necessary to know
      both time_enabled, time_running to be able to scale counts correctly.
      
      In this second version of the patch, we maintain a shadow
      copy of ctx->time which allows us to compute ctx->time without
      calling update_context_time() from NMI context. We avoid the
      issue that update_context_time() must always be called with
      ctx->lock held.
      
      We do not keep shadow copies of the other event timings
      because if the lead event is overflowing then it is active
      and thus it's been scheduled in via event_sched_in() in
      which case neither tstamp_stopped, tstamp_running can be modified.
      
      This timing logic only applies to samples when PERF_SAMPLE_READ
      is used.
      
      Note that this patch does not address timing issues related
      to sampling inheritance between tasks. This will be addressed
      in a future patch.
      
      With this patch, the libpfm4 example task_smpl now reports
      correct counts (shown on 2.4GHz Core 2):
      
      $ task_smpl -p 2400000000 -e unhalted_core_cycles:u,instructions_retired:u,baclears  noploop 5
      noploop for 5 seconds
      IIP:0x000000004006d6 PID:5596 TID:5596 TIME:466,210,211,430 STREAM_ID:33 PERIOD:2,400,000,000 ENA=1,010,157,814 RUN=1,010,157,814 NR=3
      	2,400,000,254 unhalted_core_cycles:u (33)
      	2,399,273,744 instructions_retired:u (34)
      	53,340 baclears (35)
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <4cc6e14b.1e07e30a.256e.5190@mx.google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      eed01528