1. 06 Dec, 2011 4 commits
  2. 05 Dec, 2011 1 commit
    • perf: Fix loss of notification with multi-event · 10c6db11
      Peter Zijlstra committed
      When you do:
              $ perf record -e cycles,cycles,cycles noploop 10
      
      You expect about 10,000 samples for each event, i.e., 10s at
      1000 samples/sec. However, this is not what's happening. You
      get far fewer samples, maybe 3700 samples/event:
      
      $ perf report -D | tail -15
      Aggregated stats:
                 TOTAL events:      10998
                  MMAP events:         66
                  COMM events:          2
                SAMPLE events:      10930
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      cycles stats:
                 TOTAL events:       3642
                SAMPLE events:       3642
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      
      On an Intel Nehalem or even AMD64, there are 4 counters capable
      of measuring cycles, so there is plenty of space to measure those
      events without multiplexing (even with the NMI watchdog active).
      And even with multiplexing, we'd expect roughly the same number
      of samples per event.
      
      The root of the problem was that when the event that caused the buffer
      to become full was not the first event passed on the cmdline, the user
      notification would get lost. The notification was sent to the file
      descriptor of the overflowed event but the perf tool was not polling
      on it.  The perf tool aggregates all samples into a single buffer,
      i.e., the buffer of the first event. Consequently, it assumes
      notifications for any event will come via that descriptor.
      
      The seemingly straightforward solution of moving the waitq into the
      ringbuffer object doesn't work because of life-time issues. One could
      call perf_event_set_output() on an fd that is also being blocked on
      and cause the old rb object to be freed while its waitq would still be
      referenced by the blocked thread -> FAIL.
      
      Therefore link all events to the ringbuffer and broadcast the wakeup
      from the ringbuffer object to all possible events that could be waited
      upon. This is rather ugly, and we're open to better solutions but it
      works for now.
      Reported-by: Stephane Eranian <eranian@google.com>
      Finished-by: Stephane Eranian <eranian@google.com>
      Reviewed-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20111126014731.GA7030@quad
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
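The broadcast wakeup described in "perf: Fix loss of notification with multi-event" can be illustrated with a small userspace model. This is only a sketch, not the kernel code: `struct event`, `struct ring_buffer`, `rb_attach()`, `wakeup_self_only()` and `wakeup_broadcast()` are hypothetical names, and a boolean flag stands in for each event's wait queue.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_EVENTS 8

struct ring_buffer;

/* Hypothetical model of a perf event; "woken" stands in for its waitq. */
struct event {
    bool woken;
    struct ring_buffer *rb;   /* buffer this event writes its samples to */
};

/* Hypothetical model of a shared buffer that knows its attached events. */
struct ring_buffer {
    struct event *events[MAX_EVENTS];
    size_t nr_events;
};

/* Link an event to a buffer, as redirecting its output would. */
bool rb_attach(struct ring_buffer *rb, struct event *ev)
{
    if (rb->nr_events == MAX_EVENTS)
        return false;
    rb->events[rb->nr_events++] = ev;
    ev->rb = rb;
    return true;
}

/* Old behaviour: only waiters on the overflowing event are notified. */
void wakeup_self_only(struct event *overflowed)
{
    overflowed->woken = true;
}

/* New behaviour: every event linked to the buffer is notified. */
void wakeup_broadcast(struct event *overflowed)
{
    struct ring_buffer *rb = overflowed->rb;

    for (size_t i = 0; i < rb->nr_events; i++)
        rb->events[i]->woken = true;
}
```

With three events attached to one buffer and a tool that only waits on the first, `wakeup_self_only()` on the third event leaves the first one asleep, whereas `wakeup_broadcast()` reaches it.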
  3. 14 Nov, 2011 3 commits
  4. 04 Nov, 2011 1 commit
    • oprofile, x86: Reimplement nmi timer mode using perf event · dcfce4a0
      Robert Richter committed
      The legacy x86 nmi watchdog code was removed with the implementation
      of the perf based nmi watchdog. This broke Oprofile's nmi timer
      mode. To run nmi timer mode we relied on a continuous ticking nmi
      source which the nmi watchdog provided. The nmi tick was no longer
      available and the current watchdog cannot be used since it runs
      with very long periods in the range of seconds. This patch
      reimplements the nmi timer mode using a perf counter nmi source.
      
      V2:
      * removing pr_info()
      * fix undefined reference to `__udivdi3' for 32 bit build
      * fix section mismatch of .cpuinit.data:nmi_timer_cpu_nb
      * removed nmi timer setup in arch/x86
      * implemented function stubs for op_nmi_init/exit()
      * made code more readable in oprofile_init()
      
      V3:
      * fix architectural initialization in oprofile_init()
      * fix CONFIG_OPROFILE_NMI_TIMER dependencies
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Robert Richter <robert.richter@amd.com>
  5. 03 Nov, 2011 1 commit
  6. 01 Nov, 2011 2 commits
    • mm: distinguish between mlocked and pinned pages · bc3e53f6
      Christoph Lameter committed
      Some kernel components (infiniband and perf) pin user space memory
      by increasing the page count and account that memory as "mlocked".
      
      The difference between mlocking and pinning is:
      
      A. mlocked pages are marked with PG_mlocked and are exempt from
         swapping. Page migration may move them around though.
         They are kept on a special LRU list.
      
      B. Pinned pages cannot be moved because something needs to
         directly access physical memory. They may not be on any
         LRU list.
      
      I recently saw an mlockalled process where mm->locked_vm became
      bigger than the virtual size of the process (!) because some
      memory was accounted for twice:
      
      Once when the page was mlocked and once when the Infiniband
      layer increased the refcount because it needed to pin the RDMA
      memory.
      
      This patch introduces a separate counter for pinned pages and
      accounts them separately.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Mike Marciniszyn <infinipath@qlogic.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
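The double accounting described in "mm: distinguish between mlocked and pinned pages" can be reproduced with a toy model. The sketch below is not the kernel code: `struct mm_model` and the three helpers are hypothetical, and the `pinned_vm` field name is an assumption modelled on the `mm->locked_vm` counter mentioned in the message. All values are page counts.

```c
/* Hypothetical per-process accounting model, in pages. */
struct mm_model {
    unsigned long total_vm;   /* virtual size of the process */
    unsigned long locked_vm;  /* pages accounted as mlocked */
    unsigned long pinned_vm;  /* pages accounted as pinned (assumed name) */
};

/* mlock accounting: charged to locked_vm in both schemes. */
void account_mlock(struct mm_model *mm, unsigned long pages)
{
    mm->locked_vm += pages;
}

/* Old scheme: pinning is charged to the same counter as mlock. */
void account_pin_old(struct mm_model *mm, unsigned long pages)
{
    mm->locked_vm += pages;
}

/* New scheme: pinning has its own counter. */
void account_pin_new(struct mm_model *mm, unsigned long pages)
{
    mm->pinned_vm += pages;
}
```

If a process of 100 pages locks all of them and a driver then pins 50 of those same pages, the old scheme reports `locked_vm == 150`, which exceeds the process size, while the new scheme reports `locked_vm == 100` and `pinned_vm == 50`.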
    • kernel: Fix files explicitly needing EXPORT_SYMBOL infrastructure · 6e5fdeed
      Paul Gortmaker committed
      These files were getting <linux/module.h> via an implicit non-obvious
      path, but we want to crush those out of existence since they cost
      time during compiles of processing thousands of lines of headers
      for no reason.  Give them the lightweight header that just contains
      the EXPORT_SYMBOL infrastructure.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  7. 31 Aug, 2011 2 commits
    • perf_event: Fix broken calc_timer_values() · 7f310a5d
      Eric B Munson committed
      We detected a serious issue with PERF_SAMPLE_READ and
      timing information when events were being multiplexed.
      
      Samples would have time_running > time_enabled. That
      was easy to reproduce with a libpfm4 example (run 3
      times to cause multiplexing on Core 2):
      
       $ syst_smpl -e uops_retired:freq=1 &
       $ syst_smpl -e uops_retired:freq=1 &
       $ syst_smpl -e uops_retired:freq=1 &
       IIP:0x0000000040062d ... PERIOD:2355332948 ENA=40144625315 RUN=60014875184
       syst_smpl: WARNING: time_running > time_enabled
      	63277537998 uops_retired:freq=1 , scaled
      
      The bug was not present in kernels up to (and including) 3.0. It turns
      out the bug was introduced by the following commit:
      
      commit c4794295
      
          events: Move lockless timer calculation into helper function
      
      The parameters of the function got reversed yet the call sites
      were not updated to reflect the change. That led to time_running
      and time_enabled being swapped. That had no effect when there was
      no multiplexing because in that case time_running = time_enabled
      but it would show up in any other scenario.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20110829124112.GA4828@quad
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
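The class of bug described in "perf_event: Fix broken calc_timer_values()" is easy to miss because both outputs have the same type, so a reversed call compiles cleanly. The sketch below is a simplified illustration, not the kernel function: `struct times`, `calc_times()`, `read_times_stale()` and `read_times_fixed()` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical snapshot of an event's timing state. */
struct times {
    uint64_t enabled;   /* time the event was enabled */
    uint64_t running;   /* time the event was actually on a counter */
};

/* Helper whose output parameters are ordered (enabled, running). */
void calc_times(const struct times *src, uint64_t *enabled, uint64_t *running)
{
    *enabled = src->enabled;
    *running = src->running;
}

/*
 * Stale call site that still assumes the order (running, enabled).
 * Both arguments are uint64_t pointers, so the compiler cannot object.
 */
void read_times_stale(const struct times *src,
                      uint64_t *enabled, uint64_t *running)
{
    calc_times(src, running, enabled);
}

/* Call site updated to match the helper's parameter order. */
void read_times_fixed(const struct times *src,
                      uint64_t *enabled, uint64_t *running)
{
    calc_times(src, enabled, running);
}
```

When the two values are equal (no multiplexing), the stale call site is indistinguishable from the fixed one. As soon as running time is smaller than enabled time, the stale call site reports time_running > time_enabled.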
    • perf: provide PMU when initing events · 5f12a761
      Mark Rutland committed
      Currently, an event's 'pmu' field is set after pmu::event_init() is
      called. This means that pmu::event_init() must figure out which struct
      pmu the event was initialised from. This makes it difficult to
      consolidate common event initialisation code for similar PMUs, and
      very difficult to implement drivers for PMUs which can have multiple
      instances (e.g. a USB controller PMU, a GPU PMU, etc).
      
      This patch sets the 'pmu' field before initialising the event, allowing
      event init code to identify the struct pmu instance easily. In the
      event of failure to initialise an event, the event is destroyed via
      kfree() without calling perf_event::destroy(), so this shouldn't
      result in bad behaviour even if the destroy field was set before
      failure to initialise was noted.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1313062280-19123-1-git-send-email-mark.rutland@arm.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
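The ordering change in "perf: provide PMU when initing events" can be sketched as follows. This is a reduced model under stated assumptions, not the kernel code: `struct pmu_model`, `struct event_model`, `shared_event_init()`, `init_event_old()` and `init_event_new()` are all hypothetical, and `instance_id` stands in for whatever per-instance state a multi-instance PMU driver would keep.

```c
#include <stddef.h>

struct event_model;

/* Hypothetical PMU descriptor; several instances may share one init hook. */
struct pmu_model {
    int (*event_init)(struct event_model *ev);
    int instance_id;
};

/* Hypothetical event; bound_instance is filled in by event_init. */
struct event_model {
    struct pmu_model *pmu;
    int bound_instance;
};

/* Init code shared by all instances: it relies on ev->pmu being set. */
int shared_event_init(struct event_model *ev)
{
    if (ev->pmu == NULL)
        return -1;   /* cannot tell which instance the event belongs to */
    ev->bound_instance = ev->pmu->instance_id;
    return 0;
}

/* Old ordering: call event_init first, assign the pmu field afterwards. */
int init_event_old(struct event_model *ev, struct pmu_model *pmu)
{
    int ret;

    ev->pmu = NULL;
    ret = pmu->event_init(ev);
    if (ret == 0)
        ev->pmu = pmu;
    return ret;
}

/* New ordering: assign the pmu field first, then call event_init. */
int init_event_new(struct event_model *ev, struct pmu_model *pmu)
{
    ev->pmu = pmu;
    return pmu->event_init(ev);
}
```

With the old ordering the shared init hook has no way to identify its instance and fails in this model. With the new ordering it reads the instance directly from `ev->pmu`.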
  8. 29 Aug, 2011 1 commit
    • perf events: Fix slow and broken cgroup context switch code · a8d757ef
      Stephane Eranian committed
      The current cgroup context switch code was incorrect, leading
      to bogus counts. Furthermore, as soon as there was an active
      cgroup event on a CPU, the context switch cost on that CPU
      would increase by a significant amount as demonstrated by a
      simple ping/pong example:
      
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10684.51 ctxsw/s
      
      Now start a cgroup perf stat:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
      
      $ ./pong
       Both processes pinned to CPU1, running for 10s
       6674.61 ctxsw/s
      
      That's a 37% penalty.
      
      Note that pong is not even in the monitored cgroup.
      
      The results shown by perf stat are bogus:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
      
       Performance counter stats for 'sleep 100':
      
       CPU1 <not counted> cycles   test
       CPU1 16,984,189,138 cycles  #    0.000 GHz
      
      The second 'cycles' event should report a count @ CPU clock
      (here 2.4GHz) as it is counting across all cgroups.
      
      The patch below fixes the bogus accounting and bypasses any
      cgroup switches in case the outgoing and incoming tasks are
      in the same cgroup.
      
      With this patch the same test now yields:
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10775.30 ctxsw/s
      
      Start perf stat with cgroup:
      
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      
      Run pong outside the cgroup:
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10687.80 ctxsw/s
      
      The penalty is now less than 2%.
      
      And the results for perf stat are correct:
      
      $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      
       Performance counter stats for 'sleep 10':
      
       CPU1 <not counted> cycles test #    0.000 GHz
       CPU1 23,933,981,448 cycles      #    0.000 GHz
      
      Now perf stat reports the correct counts
      for the non-cgroup event.
      
      If we run pong inside the cgroup, then we also get the
      correct counts:
      
      $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      
       Performance counter stats for 'sleep 10':
      
       CPU1 22,297,726,205 cycles test #    0.000 GHz
       CPU1 23,933,981,448 cycles      #    0.000 GHz
      
            10.001457237 seconds time elapsed
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
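Two ideas from "perf events: Fix slow and broken cgroup context switch code" can be shown in a few lines: the same-cgroup bypass, and how the quoted penalties follow from the ctxsw/s figures. This is a sketch only; `struct task_model`, `cgroup_switch_needed()` and `penalty_percent()` are hypothetical helpers, and comparing integer ids is an assumed stand-in for however the kernel compares cgroups.

```c
/* Hypothetical task model: cgroup_id identifies the task's cgroup. */
struct task_model {
    int cgroup_id;
};

/*
 * Returns 1 when cgroup events must be switched, 0 when the outgoing
 * and incoming tasks share a cgroup and the work can be bypassed.
 */
int cgroup_switch_needed(const struct task_model *prev,
                         const struct task_model *next)
{
    return prev->cgroup_id != next->cgroup_id;
}

/* Relative slowdown, in percent, of a measured rate against a baseline. */
double penalty_percent(double baseline, double measured)
{
    return (baseline - measured) / baseline * 100.0;
}
```

Applied to the figures in the message, `penalty_percent(10684.51, 6674.61)` is roughly 37.5, matching the reported 37% penalty, and `penalty_percent(10775.30, 10687.80)` is roughly 0.8, consistent with "less than 2%".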
  9. 14 Aug, 2011 2 commits
  10. 22 Jul, 2011 1 commit
  11. 01 Jul, 2011 8 commits
  12. 09 Jun, 2011 1 commit
  13. 07 Jun, 2011 1 commit
  14. 31 May, 2011 1 commit
  15. 29 May, 2011 9 commits
  16. 28 May, 2011 1 commit
  17. 04 May, 2011 1 commit