1. 27 1月, 2012 1 次提交
    • S
      perf: Fix broken interrupt rate throttling · e050e3f0
      Stephane Eranian 提交于
      This patch fixes the sampling interrupt throttling mechanism.
      
      It was broken in v3.2. Events were not being unthrottled. The
      unthrottling mechanism required that events be checked at each
      timer tick.
      
      This patch solves this problem and also separates:
      
        - unthrottling
        - multiplexing
        - frequency-mode period adjustments
      
      Not all of them need to be executed at each timer tick.
      
      This third version of the patch is based on my original patch +
      PeterZ proposal (https://lkml.org/lkml/2012/1/7/87).
      
      At each timer tick, for each context:
      
        - if the current CPU has throttled events, we unthrottle events
      
        - if context has frequency-based events, we adjust sampling periods
      
        - if we have reached the jiffies interval, we multiplex (rotate)
      
      We decoupled rotation (multiplexing) from frequency-mode sampling
      period adjustments.  They should not necessarily happen at the same
      rate. Multiplexing is subject to jiffies_interval (currently at 1
      but could be higher once the tunable is exposed via sysfs).
      
      We have grouped frequency-mode adjustment and unthrottling into the
      same routine to minimize code duplication. When throttled while in
      frequency mode, we scan the events only once.
      
      We have fixed the threshold enforcement code in __perf_event_overflow().
      There was a bug whereby it would allow more than the authorized rate
      because an increment of hwc->interrupts was not executed at the right
      place.
      
      The patch was tested with low sampling limit (2000) and fixed periods,
      frequency mode, overcommitted PMU.
      
      On a 2.1GHz AMD CPU:
      
       $ cat /proc/sys/kernel/perf_event_max_sample_rate
       2000
      
      We set a rate of 3000 samples/sec (2.1GHz/3000 = 700000):
      
       $ perf record -e cycles,cycles -c 700000  noploop 10
       $ perf report -D | tail -21
      
       Aggregated stats:
                 TOTAL events:      80086
                  MMAP events:         88
                  COMM events:          2
                  EXIT events:          4
              THROTTLE events:      19996
            UNTHROTTLE events:      19996
                SAMPLE events:      40000
      
       cycles stats:
                 TOTAL events:      40006
                  MMAP events:          5
                  COMM events:          1
                  EXIT events:          4
              THROTTLE events:       9998
            UNTHROTTLE events:       9998
                SAMPLE events:      20000
      
       cycles stats:
                 TOTAL events:      39996
              THROTTLE events:       9998
            UNTHROTTLE events:       9998
                SAMPLE events:      20000
      
      For 10s, the cap is 2x2000x10 = 40000 samples.
      We get exactly that: 20000 samples/event.
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Cc: <stable@kernel.org> # v3.2+
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20120126160319.GA5655@quadSigned-off-by: NIngo Molnar <mingo@elte.hu>
      e050e3f0
  2. 21 12月, 2011 4 次提交
  3. 06 12月, 2011 2 次提交
    • G
      perf, core: Rate limit perf_sched_events jump_label patching · b2029520
      Gleb Natapov 提交于
      jump_lable patching is very expensive operation that involves pausing all
      cpus. The patching of perf_sched_events jump_label is easily controllable
      from userspace by unprivileged user.
      
      When te user runs a loop like this:
      
        "while true; do perf stat -e cycles true; done"
      
      ... the performance of my test application that just increments a counter
      for one second drops by 4%.
      
      This is on a 16 cpu box with my test application using only one of
      them. An impact on a real server doing real work will be worse.
      
      Performance of KVM PMU drops nearly 50% due to jump_lable for "perf
      record" since KVM PMU implementation creates and destroys perf event
      frequently.
      
      This patch introduces a way to rate limit jump_label patching and uses
      it to fix the above problem.
      
      I believe that as jump_label use will spread the problem will become more
      common and thus solving it in a generic code is appropriate. Also fixing
      it in the perf code would result in moving jump_label accounting logic to
      perf code with all the ifdefs in case of JUMP_LABEL=n kernel. With this
      patch all details are nicely hidden inside jump_label code.
      Signed-off-by: NGleb Natapov <gleb@redhat.com>
      Acked-by: NJason Baron <jbaron@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20111127155909.GO2557@redhat.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
      b2029520
    • P
      perf: Avoid a useless pmu_disable() in the perf-tick · 0f5a2601
      Peter Zijlstra 提交于
      Gleb writes:
      
       > Currently pmu is disabled and re-enabled on each timer interrupt even
       > when no rotation or frequency adjustment is needed. On Intel CPU this
       > results in two writes into PERF_GLOBAL_CTRL MSR per tick. On bare metal
       > it does not cause significant slowdown, but when running perf in a virtual
       > machine it leads to 20% slowdown on my machine.
      
      Cure this by keeping a perf_event_context::nr_freq counter that counts the
      number of active events that require frequency adjustments and use this in a
      similar fashion to the already existing nr_events != nr_active test in
      perf_rotate_context().
      
      By being able to exclude both rotation and frequency adjustments a-priory for
      the common case we can avoid the otherwise superfluous PMU disable.
      Suggested-by: NGleb Natapov <gleb@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/n/tip-515yhoatehd3gza7we9fapaa@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>
      0f5a2601
  4. 05 12月, 2011 1 次提交
    • P
      perf: Fix loss of notification with multi-event · 10c6db11
      Peter Zijlstra 提交于
      When you do:
              $ perf record -e cycles,cycles,cycles noploop 10
      
      You expect about 10,000 samples for each event, i.e., 10s at
      1000samples/sec. However, this is not what's happening. You
      get much fewer samples, maybe 3700 samples/event:
      
      $ perf report -D | tail -15
      Aggregated stats:
                 TOTAL events:      10998
                  MMAP events:         66
                  COMM events:          2
                SAMPLE events:      10930
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      cycles stats:
                 TOTAL events:       3642
                SAMPLE events:       3642
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      
      On a Intel Nehalem or even AMD64, there are 4 counters capable
      of measuring cycles, so there is plenty of space to measure those
      events without multiplexing (even with the NMI watchdog active).
      And even with multiplexing, we'd expect roughly the same number
      of samples per event.
      
      The root of the problem was that when the event that caused the buffer
      to become full was not the first event passed on the cmdline, the user
      notification would get lost. The notification was sent to the file
      descriptor of the overflowed event but the perf tool was not polling
      on it.  The perf tool aggregates all samples into a single buffer,
      i.e., the buffer of the first event. Consequently, it assumes
      notifications for any event will come via that descriptor.
      
      The seemingly straight forward solution of moving the waitq into the
      ringbuffer object doesn't work because of life-time issues. One could
      perf_event_set_output() on a fd that you're also blocking on and cause
      the old rb object to be freed while its waitq would still be
      referenced by the blocked thread -> FAIL.
      
      Therefore link all events to the ringbuffer and broadcast the wakeup
      from the ringbuffer object to all possible events that could be waited
      upon. This is rather ugly, and we're open to better solutions but it
      works for now.
      Reported-by: NStephane Eranian <eranian@google.com>
      Finished-by: NStephane Eranian <eranian@google.com>
      Reviewed-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20111126014731.GA7030@quadSigned-off-by: NIngo Molnar <mingo@elte.hu>
      10c6db11
  5. 06 10月, 2011 1 次提交
  6. 29 8月, 2011 1 次提交
    • S
      perf events: Fix slow and broken cgroup context switch code · a8d757ef
      Stephane Eranian 提交于
      The current cgroup context switch code was incorrect leading
      to bogus counts. Furthermore, as soon as there was an active
      cgroup event on a CPU, the context switch cost on that CPU
      would increase by a significant amount as demonstrated by a
      simple ping/pong example:
      
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10684.51 ctxsw/s
      
      Now start a cgroup perf stat:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
      
      $ ./pong
       Both processes pinned to CPU1, running for 10s
       6674.61 ctxsw/s
      
      That's a 37% penalty.
      
      Note that pong is not even in the monitored cgroup.
      
      The results shown by perf stat are bogus:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
      
       Performance counter stats for 'sleep 100':
      
       CPU1 <not counted> cycles   test
       CPU1 16,984,189,138 cycles  #    0.000 GHz
      
      The second 'cycles' event should report a count @ CPU clock
      (here 2.4GHz) as it is counting across all cgroups.
      
      The patch below fixes the bogus accounting and bypasses any
      cgroup switches in case the outgoing and incoming tasks are
      in the same cgroup.
      
      With this patch the same test now yields:
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10775.30 ctxsw/s
      
      Start perf stat with cgroup:
      
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      
      Run pong outside the cgroup:
       $ /pong
       Both processes pinned to CPU1, running for 10s
       10687.80 ctxsw/s
      
      The penalty is now less than 2%.
      
      And the results for perf stat are correct:
      
      $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      
       Performance counter stats for 'sleep 10':
      
       CPU1 <not counted> cycles test #    0.000 GHz
       CPU1 23,933,981,448 cycles      #    0.000 GHz
      
      Now perf stat reports the correct counts for
      for the non cgroup event.
      
      If we run pong inside the cgroup, then we also get the
      correct counts:
      
      $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      
       Performance counter stats for 'sleep 10':
      
       CPU1 22,297,726,205 cycles test #    0.000 GHz
       CPU1 23,933,981,448 cycles      #    0.000 GHz
      
            10.001457237 seconds time elapsed
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20110825135803.GA4697@quadSigned-off-by: NIngo Molnar <mingo@elte.hu>
      a8d757ef
  7. 27 7月, 2011 1 次提交
  8. 01 7月, 2011 7 次提交
  9. 09 6月, 2011 1 次提交
  10. 04 6月, 2011 1 次提交
  11. 04 5月, 2011 1 次提交
  12. 29 4月, 2011 1 次提交
    • I
      perf events: Add generic front-end and back-end stalled cycle event definitions · 8f622422
      Ingo Molnar 提交于
      Add two generic hardware events: front-end and back-end stalled cycles.
      
      These events measure conditions when the CPU is executing code but its
      capabilities are not fully utilized. Understanding such situations and
      analyzing them is an important sub-task of code optimization workflows.
      
      Both events limit performance: most front end stalls tend to be caused
      by branch misprediction or instruction fetch cachemisses, backend
      stalls can be caused by various resource shortages or inefficient
      instruction scheduling.
      
      Front-end stalls are the more important ones: code cannot run fast
      if the instruction stream is not being kept up.
      
      An over-utilized back-end can cause front-end stalls and thus
      has to be kept an eye on as well.
      
      The exact composition is very program logic and instruction mix
      dependent.
      
      We use the terms 'stall', 'front-end' and 'back-end' loosely and
      try to use the best available events from specific CPUs that
      approximate these concepts.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/n/tip-7y40wib8n000io7hjpn1dsrm@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>
      8f622422
  13. 27 4月, 2011 1 次提交
  14. 05 4月, 2011 1 次提交
    • J
      jump label: Introduce static_branch() interface · d430d3d7
      Jason Baron 提交于
      Introduce:
      
      static __always_inline bool static_branch(struct jump_label_key *key);
      
      instead of the old JUMP_LABEL(key, label) macro.
      
      In this way, jump labels become really easy to use:
      
      Define:
      
              struct jump_label_key jump_key;
      
      Can be used as:
      
              if (static_branch(&jump_key))
                      do unlikely code
      
      enable/disale via:
      
              jump_label_inc(&jump_key);
              jump_label_dec(&jump_key);
      
      that's it!
      
      For the jump labels disabled case, the static_branch() becomes an
      atomic_read(), and jump_label_inc()/dec() are simply atomic_inc(),
      atomic_dec() operations. We show testing results for this change below.
      
      Thanks to H. Peter Anvin for suggesting the 'static_branch()' construct.
      
      Since we now require a 'struct jump_label_key *key', we can store a pointer into
      the jump table addresses. In this way, we can enable/disable jump labels, in
      basically constant time. This change allows us to completely remove the previous
      hashtable scheme. Thanks to Peter Zijlstra for this re-write.
      
      Testing:
      
      I ran a series of 'tbench 20' runs 5 times (with reboots) for 3
      configurations, where tracepoints were disabled.
      
      jump label configured in
      avg: 815.6
      
      jump label *not* configured in (using atomic reads)
      avg: 800.1
      
      jump label *not* configured in (regular reads)
      avg: 803.4
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110316212947.GA8792@redhat.com>
      Signed-off-by: NJason Baron <jbaron@redhat.com>
      Suggested-by: NH. Peter Anvin <hpa@linux.intel.com>
      Tested-by: NDavid Daney <ddaney@caviumnetworks.com>
      Acked-by: NRalf Baechle <ralf@linux-mips.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      d430d3d7
  15. 31 3月, 2011 2 次提交
    • L
      Fix common misspellings · 25985edc
      Lucas De Marchi 提交于
      Fixes generated by 'codespell' and manually reviewed.
      Signed-off-by: NLucas De Marchi <lucas.demarchi@profusion.mobi>
      25985edc
    • P
      perf: Fix task context scheduling · ab711fe0
      Peter Zijlstra 提交于
      Jiri reported:
      
       |
       | - once an event is created by sys_perf_event_open, task context
       |   is created and it stays even if the event is closed, until the
       |   task is finished ... thats what I see in code and I assume it's
       |   correct
       |
       | - when the task opens event, perf_sched_events jump label is
       |   incremented and following callbacks are started from scheduler
       |
       |         __perf_event_task_sched_in
       |         __perf_event_task_sched_out
       |
       |   These callback *in/out set/unset cpuctx->task_ctx value to the
       |   task context.
       |
       | - close is called on event on CPU 0:
       |         - the task is scheduled on CPU 0
       |         - __perf_event_task_sched_in is called
       |         - cpuctx->task_ctx is set
       |         - perf_sched_events jump label is decremented and == 0
       |         - __perf_event_task_sched_out is not called
       |         - cpuctx->task_ctx on CPU 0 stays set
       |
       | - exit is called on CPU 1:
       |         - the task is scheduled on CPU 1
       |         - perf_event_exit_task is called
       |         - task_ctx_sched_out unsets cpuctx->task_ctx on CPU 1
       |         - put_ctx destroys the context
       |
       | - another call of perf_rotate_context on CPU 0 will use invalid
       |   task_ctx pointer, and eventualy panic.
       |
      
      Cure this the simplest possibly way by partially reverting the
      jump_label optimization for the sched_out case.
      Reported-and-tested-by: NJiri Olsa <jolsa@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@kernel.org> # .37+
      LKML-Reference: <1301520405.4859.213.camel@twins>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ab711fe0
  16. 23 3月, 2011 1 次提交
    • S
      perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx() · 68cacd29
      Stephane Eranian 提交于
      This patch solves a stale pointer problem in
      update_cgrp_time_from_cpuctx(). The cpuctx->cgrp
      was not cleared on all possible event exit paths,
      including:
      
         close()
           perf_release()
             perf_release_kernel()
               list_del_event()
      
      This patch fixes list_del_event() to clear cpuctx->cgrp
      when there are no cgroup events left in the context.
      
      [ This second version makes the code compile when
        CONFIG_CGROUP_PERF is not enabled. We unconditionally define
        perf_cpu_context->cgrp. ]
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Cc: peterz@infradead.org
      Cc: perfmon2-devel@lists.sf.net
      Cc: paulus@samba.org
      Cc: davem@davemloft.net
      LKML-Reference: <20110323150306.GA1580@quad>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      68cacd29
  17. 16 3月, 2011 1 次提交
  18. 04 3月, 2011 1 次提交
    • A
      perf: Add support for supplementary event registers · a7e3ed1e
      Andi Kleen 提交于
      Change logs against Andi's original version:
      
      - Extends perf_event_attr:config to config{,1,2} (Peter Zijlstra)
      - Fixed a major event scheduling issue. There cannot be a ref++ on an
        event that has already done ref++ once and without calling
        put_constraint() in between. (Stephane Eranian)
      - Use thread_cpumask for percore allocation. (Lin Ming)
      - Use MSR names in the extra reg lists. (Lin Ming)
      - Remove redundant "c = NULL" in intel_percore_constraints
      - Fix comment of perf_event_attr::config1
      
      Intel Nehalem/Westmere have a special OFFCORE_RESPONSE event
      that can be used to monitor any offcore accesses from a core.
      This is a very useful event for various tunings, and it's
      also needed to implement the generic LLC-* events correctly.
      
      Unfortunately this event requires programming a mask in a separate
      register. And worse this separate register is per core, not per
      CPU thread.
      
      This patch:
      
      - Teaches perf_events that OFFCORE_RESPONSE needs extra parameters.
        The extra parameters are passed by user space in the
        perf_event_attr::config1 field.
      
      - Adds support to the Intel perf_event core to schedule per
        core resources. This adds fairly generic infrastructure that
        can be also used for other per core resources.
        The basic code has is patterned after the similar AMD northbridge
        constraints code.
      
      Thanks to Stephane Eranian who pointed out some problems
      in the original version and suggested improvements.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NLin Ming <ming.m.lin@intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1299119690-13991-2-git-send-email-ming.m.lin@intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a7e3ed1e
  19. 16 2月, 2011 2 次提交
    • P
      perf: Optimize throttling code · 163ec435
      Peter Zijlstra 提交于
      By pre-computing the maximum number of samples per tick we can avoid a
      multiplication and a conditional since MAX_INTERRUPTS >
      max_samples_per_tick.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      163ec435
    • S
      perf: Add cgroup support · e5d1367f
      Stephane Eranian 提交于
      This kernel patch adds the ability to filter monitoring based on
      container groups (cgroups). This is for use in per-cpu mode only.
      
      The cgroup to monitor is passed as a file descriptor in the pid
      argument to the syscall. The file descriptor must be opened to
      the cgroup name in the cgroup filesystem. For instance, if the
      cgroup name is foo and cgroupfs is mounted in /cgroup, then the
      file descriptor is opened to /cgroup/foo. Cgroup mode is
      activated by passing PERF_FLAG_PID_CGROUP in the flags argument
      to the syscall.
      
      For instance to measure in cgroup foo on CPU1 assuming
      cgroupfs is mounted under /cgroup:
      
      struct perf_event_attr attr;
      int cgroup_fd, fd;
      
      cgroup_fd = open("/cgroup/foo", O_RDONLY);
      fd = perf_event_open(&attr, cgroup_fd, 1, -1, PERF_FLAG_PID_CGROUP);
      close(cgroup_fd);
      Signed-off-by: NStephane Eranian <eranian@google.com>
      [ added perf_cgroup_{exit,attach} ]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <4d590250.114ddf0a.689e.4482@mx.google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e5d1367f
  20. 16 12月, 2010 2 次提交
    • P
      perf: Sysfs enumeration · abe43400
      Peter Zijlstra 提交于
      Simple sysfs emumeration of the PMUs.
      
      Use a "event_source" bus, and add PMU devices using their name.
      
      Each PMU device has a type attribute which contrains the value needed
      for perf_event_attr::type to identify this PMU.
      
      This is the minimal stub needed to start using this interface,
      we'll consider extending the sysfs usage later.
      
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Greg KH <gregkh@suse.de>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101117222056.316982569@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      abe43400
    • P
      perf: Dynamic pmu types · 2e80a82a
      Peter Zijlstra 提交于
      Extend the perf_pmu_register() interface to allow for named and
      dynamic pmu types.
      
      Because we need to support the existing static types we cannot use
      dynamic types for everything, hence provide a type argument.
      
      If we want to enumerate the PMUs they need a name, provide one.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101117222056.259707703@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2e80a82a
  21. 09 12月, 2010 1 次提交
  22. 05 12月, 2010 2 次提交
  23. 01 12月, 2010 1 次提交
  24. 26 11月, 2010 3 次提交