1. 06 April 2009 (30 commits)
    • perf_counter: make it possible for hw_perf_counter_init to return error codes · d5d2bc0d
      Committed by Paul Mackerras
      Impact: better error reporting
      
      At present, if hw_perf_counter_init encounters an error, all it can do
      is return NULL, which causes sys_perf_counter_open to return an EINVAL
      error to userspace.  This isn't very informative for userspace; it means
      that userspace can't tell the difference between "sorry, oprofile is
      already using the PMU" and "we don't support this CPU" and "this CPU
      doesn't support the requested generic hardware event".
      
      This commit uses the PTR_ERR/ERR_PTR/IS_ERR set of macros to let
      hw_perf_counter_init return an error code on error rather than just NULL
      if it wishes.  If it does so, that error code will be returned from
      sys_perf_counter_open to userspace.  If it returns NULL, an EINVAL
      error will be returned to userspace, as before.
      
      This also adapts the powerpc hw_perf_counter_init to make use of this
      to return ENXIO, EINVAL, EBUSY, or EOPNOTSUPP as appropriate.  It would
      be good to add extra error numbers in future to allow userspace to
      distinguish the various errors that are currently reported as EINVAL,
      i.e. irq_period < 0, too many events in a group, conflict between
      exclude_* settings in a group, and PMU resource conflict in a group.
      
      [ v2: fix a bug pointed out by Corey Ashford where error returns from
            hw_perf_counter_init were not handled correctly in the case of
            raw hardware events.]
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Orig-LKML-Reference: <20090330171023.682428180@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: executable mmap() information · 0a4a9391
      Committed by Peter Zijlstra
      Currently the profiling information returns userspace IPs but no way
      to correlate them with userspace code. Userspace could look into
      /proc/$pid/maps, but that might not be current, or even present any
      more, at the time the IPs are analyzed.
      
      Therefore provide means to track the mmap information and provide it
      in the output stream.
      
      XXX: only covers mmap()/munmap(), mremap() and mprotect() are missing.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Orig-LKML-Reference: <20090330171023.417259499@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: fix update_userpage() · 38ff667b
      Committed by Peter Zijlstra
      It just occurred to me that it is possible to have multiple contending
      updates of the userpage (mmap information vs overflow vs counter).
      This would break the seqlock logic.
      
      It appears the arch code uses this from NMI context, so we cannot
      possibly serialize its use; therefore separate the data_head update
      from it and let it return to its original use.
      
      The arch code needs to make sure there are no contending callers by
      disabling the counter before using it -- powerpc appears to do this
      nicely.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Orig-LKML-Reference: <20090330171023.241410660@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: unify and fix delayed counter wakeup · 925d519a
      Committed by Peter Zijlstra
      While going over the wakeup code I noticed delayed wakeups only work
      for hardware counters but basically all software counters rely on
      them.
      
      This patch unifies and generalizes the delayed wakeup to fix this
      issue.
      
      Since we're dealing with NMI context bits here, use a cmpxchg() based
      single link list implementation to track counters that have pending
      wakeups.
      
      [ This should really be generic code for delayed wakeups, but since we
        cannot use cmpxchg()/xchg() in generic code, I've let it live in the
        perf_counter code. -- Eric Dumazet could use it to aggregate the
        network wakeups. ]
      
      Furthermore, the x86 method of using TIF flags was flawed in that it
      is quite possible to end up setting the bit on the idle task, losing
      the wakeup.
      
      The powerpc method uses per-cpu storage and does appear to be
      sufficient.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Orig-LKML-Reference: <20090330171023.153932974@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: record time running and time enabled for each counter · 53cfbf59
      Committed by Paul Mackerras
      Impact: new functionality
      
      Currently, if there are more counters enabled than can fit on the CPU,
      the kernel will multiplex the counters on to the hardware using
      round-robin scheduling.  That isn't too bad for sampling counters, but
      for counting counters it means that the value read from a counter
      represents some unknown fraction of the true count of events that
      occurred while the counter was enabled.
      
      This remedies the situation by keeping track of how long each counter
      is enabled for, and how long it is actually on the cpu and counting
      events.  These times are recorded in nanoseconds using the task clock
      for per-task counters and the cpu clock for per-cpu counters.
      
      These values can be supplied to userspace on a read from the counter.
      Userspace requests that they be supplied after the counter value by
      setting the PERF_FORMAT_TOTAL_TIME_ENABLED and/or
      PERF_FORMAT_TOTAL_TIME_RUNNING bits in the hw_event.read_format field
      when creating the counter.  (There is no way to change the read format
      after the counter is created, though it would be possible to add some
      way to do that.)
      
      Using this information it is possible for userspace to scale the count
      it reads from the counter to get an estimate of the true count:
      
      true_count_estimate = count * total_time_enabled / total_time_running
      
      This also lets userspace detect the situation where the counter never
      got to go on the cpu: total_time_running == 0.
      
      This functionality has been requested by the PAPI developers, and will
      be generally needed for interpreting the count values from counting
      counters correctly.
      
      In the implementation, this keeps 5 time values (in nanoseconds) for
      each counter: total_time_enabled and total_time_running are used when
      the counter is in state OFF or ERROR and for reporting back to
      userspace.  When the counter is in state INACTIVE or ACTIVE, it is the
      tstamp_enabled, tstamp_running and tstamp_stopped values that are
      relevant, and total_time_enabled and total_time_running are determined
      from them.  (tstamp_stopped is only used in INACTIVE state.)  The
      reason for doing it like this is that it means that only counters
      being enabled or disabled at sched-in and sched-out time need to be
      updated.  There are no new loops that iterate over all counters to
      update total_time_enabled or total_time_running.
      
      This also keeps separate child_total_time_running and
      child_total_time_enabled fields that get added in when reporting the
      totals to userspace.  They are separate fields so that they can be
      atomic.  We don't want to use atomics for total_time_running,
      total_time_enabled etc., because then we would have to use atomic
      sequences to update them, which are slower than regular arithmetic and
      memory accesses.
      
      It is possible to measure total_time_running by adding a task_clock
      counter to each group of counters, and total_time_enabled can be
      measured approximately with a top-level task_clock counter (though
      inaccuracies will creep in if you need to disable and enable groups
      since it is not possible in general to disable/enable the top-level
      task_clock counter simultaneously with another group).  However, that
      adds extra overhead - I measured around 15% increase in the context
      switch latency reported by lat_ctx (from lmbench) when a task_clock
      counter was added to each of 2 groups, and around 25% increase when a
      task_clock counter was added to each of 4 groups.  (In both cases a
      top-level task-clock counter was also added.)
      
      In contrast, the code added in this commit gives better information
      with no overhead that I could measure (in fact in some cases I
      measured lower times with this code, but the differences were all less
      than one standard deviation).
      
      [ v2: address review comments by Andrew Morton. ]
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Orig-LKML-Reference: <18890.6578.728637.139402@cargo.ozlabs.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: allow and require one-page mmap on counting counters · 7730d865
      Committed by Peter Zijlstra
      A brainfart stopped single-page mmap()s from working. The rest of the
      code should be perfectly fine with not having any data pages.
      Reported-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Orig-LKML-Reference: <1237981712.7972.812.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: optionally provide the pid/tid of the sampled task · ea5d20cf
      Committed by Peter Zijlstra
      Allow CPU-wide counters to profile userspace by recording which
      process each sample belongs to.
      
      This raises the first issue with the output type: many of these
      options (group, tid, callchain, etc.) are non-exclusive and could be
      combined, suggesting a bitfield.
      
      However, things like the mmap() data stream don't fit in that.
      
      How to split the type field...
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Orig-LKML-Reference: <20090325113317.013775235@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: sanity check on the output API · 63e35b25
      Committed by Peter Zijlstra
      Ensure we never write more than we said we would.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Orig-LKML-Reference: <20090325113316.921433024@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: output objects · 5c148194
      Committed by Peter Zijlstra
      Provide a {type,size} header for each output entry.
      
      This should provide extensible output, and the ability to mix multiple streams.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Orig-LKML-Reference: <20090325113316.831607932@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: more elaborate write API · b9cacc7b
      Committed by Peter Zijlstra
      Provide a begin, copy, end interface to the output buffer.
      
      begin() reserves the space,
       copy() copies the data over, considering page boundaries,
        end() finalizes the event and does the wakeup.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Orig-LKML-Reference: <20090325113316.740550870@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: fix perf_poll() · c7138f37
      Committed by Peter Zijlstra
      Impact: fix kerneltop 100% CPU usage
      
      Only return a poll event when there has actually been one;
      poll_wait() doesn't actually wait for the waitq you pass it, it only
      enqueues you on it.
      
      Only once all FDs have been iterated and none of them returned a
      poll event will it schedule().
      
      Also make it return POLL_HUP when there's no mmap() area to read from.
      
      Further, fix a silly bug in the write code.
      Reported-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Orig-LKML-Reference: <1237897096.24918.181.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: new output ABI - part 1 · 7b732a75
      Committed by Peter Zijlstra
      Impact: Rework the perfcounter output ABI
      
      Use sys_read() only for instant data and provide mmap() output for all
      async overflow data.
      
      The first mmap() determines the size of the output buffer. The mmap()
      size must be PAGE_SIZE * (1 + pages), where pages must be a power
      of 2 or 0. Further mmap()s of the same fd must have the same
      size. Once all maps are gone, you can again mmap() with a new size.
      
      In case of 0 extra pages there is no data output and the first page
      only contains meta data.
      
      When there are data pages, a poll() event will be generated for each
      full page of data. Furthermore, the output is circular. This means
      that although 1 page is a valid configuration, it's useless, since
      we'll start overwriting it the instant we report a full page.
      
      Future work will focus on the output format (currently maintained),
      where we'll likely want each entry denoted by a header which includes
      a type and length.
      
      Further future work will allow splice() on the fd, also carrying the
      async overflow data -- splice() would be mutually exclusive with
      mmap() of the data.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Orig-LKML-Reference: <20090323172417.470536358@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: add an mmap method to allow userspace to read hardware counters · 37d81828
      Committed by Paul Mackerras
      Impact: new feature giving performance improvement
      
      This adds the ability for userspace to do an mmap on a hardware counter
      fd and get access to a read-only page that contains the information
      needed to translate a hardware counter value to the full 64-bit
      counter value that would be returned by a read on the fd.  This is
      useful on architectures that allow user programs to read the hardware
      counters, such as PowerPC.
      
      The mmap will only succeed if the counter is a hardware counter
      monitoring the current process.
      
      On my quad 2.5GHz PowerPC 970MP machine, userspace can read a counter
      and translate it to the full 64-bit value in about 30ns using the
      mmapped page, compared to about 830ns for the read syscall on the
      counter, so this does give a significant performance improvement.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Orig-LKML-Reference: <20090323172417.297057964@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: avoid recursion · 96f6d444
      Committed by Peter Zijlstra
      Tracepoint events like lock_acquire and software counters like
      pagefaults can recurse into the perf counter code again; avoid that.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Orig-LKML-Reference: <20090323172417.152096433@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: remove the event config bitfields · f4a2deb4
      Committed by Peter Zijlstra
      Since the bitfields turned into a bit of a mess, remove them and rely on
      good old masks.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Orig-LKML-Reference: <20090323172417.059499915@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: unify irq output code · 0322cd6e
      Committed by Peter Zijlstra
      Impact: cleanup
      
      Having 3 slightly different copies of the same code around does nobody
      any good. First step in revamping the output format.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Orig-LKML-Reference: <20090319194233.929962222@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: revamp syscall input ABI · b8e83514
      Committed by Peter Zijlstra
      Impact: modify ABI
      
      The hardware/software classification in hw_event->type became a little
      strained due to the addition of tracepoint tracing.
      
      Instead split up the field and provide a type field to explicitly specify
      the counter type, while using the event_id field to specify which event to
      use.
      
      Raw counters still work as before, only the raw config now goes into
      raw_event.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Orig-LKML-Reference: <20090319194233.836807573@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: hook up the tracepoint events · e077df4f
      Committed by Peter Zijlstra
      Impact: new perfcounters feature
      
      Enable usage of tracepoints as perf counter events.
      
      Tracepoint event ids can be found in /debug/tracing/event/*/*/id
      and (for now) are represented as -65536+id in the type field.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Orig-LKML-Reference: <20090319194233.744044174@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: fix up counter free paths · f1600952
      Committed by Peter Zijlstra
      Impact: fix crash during perfcounters use
      
      I found another counter free path; create a free_counter() call to
      accommodate generic tear-down.
      
      Fixes an RCU bug.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Orig-LKML-Reference: <20090319194233.652078652@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: generic context switch event · 4a0deca6
      Committed by Peter Zijlstra
      Impact: cleanup
      
      Use the generic software events for context switches.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Orig-LKML-Reference: <20090319194233.283522645@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: fix uninitialized usage of event_list · 01ef09d9
      Committed by Peter Zijlstra
      Impact: fix boot crash
      
      When doing the generic context switch event I ran into some early
      boot hangs, which were caused by infinite function recursion (event,
      fault, event, fault).
      
      I eventually tracked it down to event_list not being initialized
      at the time of the first event. Fix this.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Orig-LKML-Reference: <20090319194233.195392657@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: abstract wakeup flag setting in core to fix powerpc build · b6c5a71d
      Committed by Paul Mackerras
      Impact: build fix for powerpc
      
      Commit bd753921015e7905 ("perf_counter: software counter event
      infrastructure") introduced a use of TIF_PERF_COUNTERS into the core
      perfcounter code.  This breaks the build on powerpc because we use
      a flag in a per-cpu area to signal wakeups on powerpc rather than
      a thread_info flag, because the thread_info flags have to be
      manipulated with atomic operations and are thus slower than per-cpu
      flags.
      
      This fixes the build by changing the core to use an abstracted
      set_perf_counter_pending() function, which is defined on x86 to set
      the TIF_PERF_COUNTERS flag and on powerpc to set the per-cpu flag
      (paca->perf_counter_pending).  It changes the previous powerpc
      definition of set_perf_counter_pending to not take an argument and
      adds a clear_perf_counter_pending, so as to simplify the definition
      on x86.
      
      On x86, set_perf_counter_pending() is defined as a macro.  Defining
      it as a static inline in arch/x86/include/asm/perf_counters.h causes
      compile failures because <asm/perf_counters.h> gets included early in
      <linux/sched.h>, and the definitions of set_tsk_thread_flag etc. are
      therefore not available in <asm/perf_counters.h>.  (On powerpc this
      problem is avoided by defining set_perf_counter_pending etc. in
      <asm/hw_irq.h>.)
      Signed-off-by: Paul Mackerras <paulus@samba.org>
    • perf_counter: include missing header · 4e193bd4
      Committed by Tim Blechmann
      Impact: build fix
      
      In order to compile a kernel with performance counter patches,
      <asm/irq_regs.h> has to be included to provide the declaration of
      struct pt_regs *get_irq_regs(void);
      
      [ This bug was masked by unrelated x86 header file changes in the
        x86 tree, but occurs in the tip:perfcounters/core standalone
        tree. ]
      Signed-off-by: Tim Blechmann <tim@klingt.org>
      Orig-LKML-Reference: <20090314142925.49c29c17@thinkpad>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: fix hrtimer sampling · 039fc91e
      Committed by Peter Zijlstra
      Impact: fix deadlock with perfstat
      
      Fix for the perfstat fubar..
      
      We cannot unconditionally call hrtimer_cancel() without ever having done
      hrtimer_init() on the thing.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Orig-LKML-Reference: <1236959027.22447.149.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: add an event_list · 592903cd
      Committed by Peter Zijlstra
      I noticed that the counter_list only includes top-level counters, thus
      perf_swcounter_event() will miss sw-counters in groups.
      
      Since perf_swcounter_event() also wants an RCU-safe list, create a new
      event_list that includes all counters, uses RCU list ops, and uses
      call_rcu to free the counter structure.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: hrtimer based sampling for software time events · d6d020e9
      Committed by Peter Zijlstra
      Use hrtimers to drive timer-based sampling for the software time
      counters.
      
      This allows platforms without hardware counter support to still
      perform sample based profiling.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: provide major/minor page fault software events · ac17dc8e
      Committed by Peter Zijlstra
      Provide separate sw counters for major and minor page faults.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: provide pagefault software events · 7dd1fcc2
      Committed by Peter Zijlstra
      We use the generic software counter infrastructure to provide
      page fault events.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: software counter event infrastructure · 15dbf27c
      Committed by Peter Zijlstra
      Provide generic software counter infrastructure that supports
      software events.
      
      This will be used to allow sample-based profiling of software events
      such as pagefaults. The current infrastructure can only provide a
      count of such events, with no location information.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: use list_move_tail() · 75564232
      Committed by Peter Zijlstra
      Instead of del/add use a move list-op.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 04 March 2009 (1 commit)
    • perfcounters: provide expansion room in the ABI · 2743a5b0
      Committed by Paul Mackerras
      Impact: ABI change
      
      This expands several fields in the perf_counter_hw_event struct and adds
      a "flags" argument to the perf_counter_open system call, in order that
      features can be added in future without ABI changes.
      
      In particular the record_type field is expanded to 64 bits, and the
      space for flag bits has been expanded from 32 to 64 bits.
      
      This also adds some new fields:
      
      * read_format (64 bits) is intended to provide a way to specify what
        userspace wants to get back when it does a read() on a simple
        (non-interrupting) counter;
      
      * exclude_idle (1 bit) provides a way for userspace to ask that events
        that occur when the cpu is idle be excluded;
      
      * extra_config_len will provide a way for userspace to supply an
        arbitrary amount of extra machine-specific PMU configuration data
        immediately following the perf_counter_hw_event struct, to allow
        sophisticated users to program things such as instruction matching
        CAMs and address range registers;
      
      * __reserved_3 and __reserved_4 provide space for future expansion.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  3. 26 February 2009 (1 commit)
    • perfcounters: fix a few minor cleanliness issues · f3dfd265
      Committed by Paul Mackerras
      This fixes three issues noticed by Arnd Bergmann:
      
      - Add #ifdef __KERNEL__ and move some things around in perf_counter.h
        to make sure only the bits that userspace needs are exported to
        userspace.
      
      - Use __u64, __s64, __u32 types in the structs exported to userspace
        rather than u64, s64, u32.
      
      - Make the sys_perf_counter_open syscall available to the SPUs on
        Cell platforms.
      
      And one issue that I noticed in looking at the code again:
      
      - Wrap the perf_counter_open syscall with SYSCALL_DEFINE4 so we get
        the proper handling of int arguments on ppc64 (and some other 64-bit
        architectures).
      Reported-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  4. 13 February 2009 (1 commit)
    • perfcounters: make context switch and migration software counters work again · c07c99b6
      Committed by Paul Mackerras
      Jaswinder Singh Rajput reported that commit 23a185ca caused the
      context switch and migration software counters to report zero always.
      With that commit, the software counters only count events that occur
      between sched-in and sched-out for a task.  This is necessary for the
      counter enable/disable prctls and ioctls to work.  However, the
      context switch and migration counts are incremented after sched-out
      for one task and before sched-in for the next.  Since the increment
      doesn't occur while a task is scheduled in (as far as the software
      counters are concerned) it doesn't count towards any counter.
      
      Thus the context switch and migration counters need to count events
      that occur at any time, provided the counter is enabled, not just
      those that occur while the task is scheduled in (from the perf_counter
      subsystem's point of view).  The problem though is that the software
      counter code can't tell the difference between being enabled and being
      scheduled in, and between being disabled and being scheduled out,
      since we use the one pair of enable/disable entry points for both.
      That is, the high-level disable operation simply arranges for the
      counter to not be scheduled in any more, and the high-level enable
      operation arranges for it to be scheduled in again.
      
      One way to solve this would be to have sched_in/out operations in the
      hw_perf_counter_ops struct as well as enable/disable.  However, this
      takes a simpler approach: it adds a 'prev_state' field to the
      perf_counter struct that allows a counter's enable method to know
      whether the counter was previously disabled or just inactive
      (scheduled out), and therefore whether the enable method is being
      called as a result of a high-level enable or a schedule-in operation.
      
      This then allows the context switch, migration and page fault counters
      to reset their hw.prev_count value in their enable functions only if
      they are called as a result of a high-level enable operation.
      Although page faults would normally only occur while the counter is
      scheduled in, this changes the page fault counter code too in case
      there are ever circumstances where page faults get counted against a
      task while its counters are not scheduled in.
      Reported-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  5. 11 February 2009 (3 commits)
    • perfcounters: fix refcounting bug, take 2 · 4bcf349a
      Committed by Paul Mackerras
      Only free child_counter if it has a parent; if it doesn't, then it
      has a file pointing to it and we'll free it in perf_release.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perfcounters: fix use after free in perf_release() · 5af75917
      Committed by Mike Galbraith
      running...
      
        while true; do
          foo -d 1 -f 1 -c 100000 & sleep 1
          kerneltop -d 1 -f 1 -e 1 -c 25000 -p `pidof foo`
        done
      
        while true; do
          killall foo; killall kerneltop; sleep 2
        done
      
      ...in two shells with SLUB_DEBUG enabled produces a flood of:
      BUG task_struct: Poison overwritten.
      
      Fix the use-after-free bug in perf_release().
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      5af75917
    • P
      perf_counters: allow users to count user, kernel and/or hypervisor events · 0475f9ea
      Authored by Paul Mackerras
      Impact: new perf_counter feature
      
      This extends the perf_counter_hw_event struct with bits that specify
      that events in user, kernel and/or hypervisor mode should not be
      counted (i.e. should be excluded), and adds code to program the PMU
      mode selection bits accordingly on x86 and powerpc.
      
      For software counters, we don't currently have the infrastructure to
      distinguish which mode an event occurs in, so we currently fail the
      counter initialization if the setting of the hw_event.exclude_* bits
      would require us to distinguish.  Context switches and CPU migrations
      are currently considered to occur in kernel mode.
      
      On x86, this changes the previous policy that only root can count
      kernel events.  Now non-root users can count kernel events or exclude
      them.  Non-root users still can't use NMI events, though.  On x86 we
      don't appear to have any way to control whether hypervisor events are
      counted or not, so hw_event.exclude_hv is ignored.
      
      On powerpc, the selection of whether to count events in user, kernel
      and/or hypervisor mode is PMU-wide, not per-counter, so this adds a
      check that the hw_event.exclude_* settings are the same as other events
      on the PMU.  Counters being added to a group have to have the same
      settings as the other hardware counters in the group.  Counters and
      groups can only be enabled in hw_perf_group_sched_in or power_perf_enable
      if they have the same settings as any other counters already on the
      PMU.  If we are not running on a hypervisor, the exclude_hv setting
      is ignored (by forcing it to 0) since we can't ever get any
      hypervisor events.
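      A sketch of the exclude bits and the software-counter init check
      described above (the struct here is a simplified stand-in for
      perf_counter_hw_event, which carries many more fields, and
      sw_counter_init_ok() is a hypothetical helper):

      ```c
      #include <assert.h>

      /* Simplified stand-in for the exclude bits added to the
       * perf_counter_hw_event struct. */
      struct hw_event_flags {
          unsigned int exclude_user   : 1; /* don't count user-mode events */
          unsigned int exclude_kernel : 1; /* don't count kernel-mode events */
          unsigned int exclude_hv     : 1; /* don't count hypervisor events */
      };

      /* Software-counter init check in the spirit of the commit: without
       * the infrastructure to attribute events to a mode, refuse any
       * request that would need such attribution. */
      static int sw_counter_init_ok(struct hw_event_flags f)
      {
          if (f.exclude_user || f.exclude_kernel || f.exclude_hv)
              return 0; /* would require distinguishing modes: fail init */
          return 1;
      }

      int main(void)
      {
          struct hw_event_flags count_all = {0};
          struct hw_event_flags kernel_only = { .exclude_user = 1 };

          assert(sw_counter_init_ok(count_all));    /* no exclusions: ok   */
          assert(!sw_counter_init_ok(kernel_only)); /* needs mode info: no */
          return 0;
      }
      ```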
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      0475f9ea
  6. 09 Feb 2009, 1 commit
    • P
      perf_counters: make software counters work as per-cpu counters · 23a185ca
      Authored by Paul Mackerras
      Impact: kernel crash fix
      
      Yanmin Zhang reported that using a PERF_COUNT_TASK_CLOCK software
      counter as a per-cpu counter would reliably crash the system, because
      it calls __task_delta_exec with a null pointer.  The page fault,
      context switch and cpu migration counters also won't function
      correctly as per-cpu counters since they reference the current task.
      
      This fixes the problem by redirecting the task_clock counter to the
      cpu_clock counter when used as a per-cpu counter, and by implementing
      per-cpu page fault, context switch and cpu migration counters.
      
      Along the way, this:
      
      - Initializes counter->ctx earlier, in perf_counter_alloc, so that
        sw_perf_counter_init can use it
      - Adds code to kernel/sched.c to count task migrations into each
        cpu, in rq->nr_migrations_in
      - Exports the per-cpu context switch and task migration counts
        via new functions added to kernel/sched.c
      - Makes sure that if sw_perf_counter_init fails, we don't try to
        initialize the counter as a hardware counter.  Since the user has
        passed a negative, non-raw event type, they clearly don't intend
        for it to be interpreted as a hardware event.
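      The redirection described above can be sketched as follows (a
      simplification: here a per-cpu counter is modeled as cpu >= 0 with
      no task context, and pick_sw_event() is a hypothetical helper, not
      the kernel's actual dispatch):

      ```c
      #include <assert.h>

      enum event_type { COUNT_CPU_CLOCK, COUNT_TASK_CLOCK };

      /* For a per-cpu counter there is no task to charge, so a
       * task-clock request falls back to the cpu-clock implementation,
       * as the commit describes. */
      static enum event_type pick_sw_event(enum event_type requested, int cpu)
      {
          if (requested == COUNT_TASK_CLOCK && cpu >= 0)
              return COUNT_CPU_CLOCK; /* per-cpu: redirect to cpu clock */
          return requested;           /* per-task: keep task clock */
      }

      int main(void)
      {
          /* Per-cpu counter (bound to cpu 2): redirected. */
          assert(pick_sw_event(COUNT_TASK_CLOCK, 2) == COUNT_CPU_CLOCK);
          /* Per-task counter (cpu == -1): unchanged. */
          assert(pick_sw_event(COUNT_TASK_CLOCK, -1) == COUNT_TASK_CLOCK);
          return 0;
      }
      ```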
      Reported-by: "Zhang Yanmin" <yanmin_zhang@linux.intel.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      23a185ca
  7. 29 Jan 2009, 1 commit
  8. 17 Jan 2009, 1 commit
    • P
      perf_counter: Add counter enable/disable ioctls · d859e29f
      Authored by Paul Mackerras
      Impact: New perf_counter features
      
      This primarily adds a way for perf_counter users to enable and disable
      counters and groups.  Enabling or disabling a counter or group also
      enables or disables all of the child counters that have been cloned
      from it to monitor children of the task monitored by the top-level
      counter.  The userspace interface to enable/disable counters is via
      ioctl on the counter file descriptor.
      
      Along the way this extends the code that handles child counters to
      handle child counter groups properly.  A group with multiple counters
      will be cloned to child tasks if and only if the group leader has the
      hw_event.inherit bit set - if it is set the whole group is cloned as a
      group in the child task.
      
      In order to be able to enable or disable all child counters of a given
      top-level counter, we need a way to find them all.  Hence I have added
      a child_list field to struct perf_counter, which is the head of the
      list of children for a top-level counter, or the link in that list for
      a child counter.  That list is protected by the perf_counter.mutex
      field.
      
      This also adds a mutex to the perf_counter_context struct.  Previously
      the list of counters was protected just by the lock field in the
      context, which meant that perf_counter_init_task had to take that lock
      and then take whatever lock/mutex protects the top-level counter's
      child_list.  But the counter enable/disable functions need to take
      that lock in order to traverse the list, then for each counter take
      the lock in that counter's context in order to change the counter's
      state safely, which would lead to a deadlock.
      
      To solve this, we now have both a mutex and a spinlock in the context,
      and taking either is sufficient to ensure the list of counters can't
      change - you have to take both before changing the list.  Now
      perf_counter_init_task takes the mutex instead of the lock (which
      incidentally means that inherit_counter can use GFP_KERNEL instead of
      GFP_ATOMIC) and thus avoids the possible deadlock.  Similarly the new
      enable/disable functions can take the mutex while traversing the list
      of child counters without incurring a possible deadlock when the
      counter manipulation code locks the context for a child counter.
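      A toy model of the locking rule introduced above: the context holds
      both a mutex and a spinlock, and changing the counter list requires
      holding BOTH, so a traverser holding EITHER one sees a stable list
      (the struct and helper here are illustrative, not kernel code):

      ```c
      #include <assert.h>

      /* Toy context: flags model which of the two locks are held. */
      struct toy_ctx {
          int mutex_held; /* models ctx->mutex */
          int lock_held;  /* models ctx->lock (the spinlock) */
          int list_version;
      };

      /* List modification is only legal with both locks held. */
      static int try_modify_list(struct toy_ctx *ctx)
      {
          if (!(ctx->mutex_held && ctx->lock_held))
              return 0; /* a reader holding either lock blocks us */
          ctx->list_version++;
          return 1;
      }

      int main(void)
      {
          struct toy_ctx ctx = { 0, 0, 0 };

          ctx.mutex_held = 1;             /* traverser holds only the mutex */
          assert(!try_modify_list(&ctx)); /* ...so the list cannot change   */

          ctx.lock_held = 1;              /* a writer must take both */
          assert(try_modify_list(&ctx));
          assert(ctx.list_version == 1);
          return 0;
      }
      ```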
      
      We also had a misfeature that the first counter added to a context
      would possibly not go on until the next sched-in, because we were
      using ctx->nr_active to detect if the context was running on a CPU.
      But nr_active is the number of active counters, and if that was zero
      (because the context didn't have any counters yet) it would look like
      the context wasn't running on a cpu and so the retry code in
      __perf_install_in_context wouldn't retry.  So this adds an 'is_active'
      field that is set when the context is on a CPU, even if it has no
      counters.  The is_active field is only used for task contexts, not for
      per-cpu contexts.
      
      If we enable a subsidiary counter in a group that is active on a CPU,
      and the arch code can't enable the counter, then we have to pull the
      whole group off the CPU.  We do this with group_sched_out, which gets
      moved up in the file so it comes before all its callers.  This also
      adds similar logic to __perf_install_in_context so that the "all on,
      or none" invariant of groups is preserved when adding a new counter to
      a group.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      d859e29f
  9. 14 Jan 2009, 1 commit
    • P
      perf_counter: Add support for pinned and exclusive counter groups · 3b6f9e5c
      Authored by Paul Mackerras
      Impact: New perf_counter features
      
      A pinned counter group is one that the user wants to have on the CPU
      whenever possible, i.e. whenever the associated task is running, for
      a per-task group, or always for a per-cpu group.  If the system
      cannot satisfy that, it puts the group into an error state where
      it is not scheduled any more and reads from it return EOF (i.e. 0
      bytes read).  The group can be released from error state and made
      readable again using prctl(PR_TASK_PERF_COUNTERS_ENABLE).  When we
      have finer-grained enable/disable controls on counters we'll be able
      to reset the error state on individual groups.
      
      An exclusive group is one that the user wants to be the only group
      using the CPU performance monitor hardware whenever it is on.  The
      counter group scheduler will not schedule an exclusive group if there
      are already other groups on the CPU and will not schedule other groups
      onto the CPU if there is an exclusive group scheduled (that statement
      does not apply to groups containing only software counters, which can
      always go on and which do not prevent an exclusive group from going on).
      With an exclusive group, we will be able to let users program PMU
      registers at a low level without the concern that those settings will
      perturb other measurements.
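      The admission rule for exclusive groups can be sketched like this
      (group_can_go_on() is a hypothetical helper mirroring the logic
      described above; the kernel implements it inside its group
      scheduling code):

      ```c
      #include <assert.h>

      struct cpu_state {
          int active_hw_groups; /* hardware groups currently on the CPU */
          int exclusive;        /* an exclusive group currently on the CPU */
      };

      /* Software-only groups always go on; otherwise an exclusive group
       * must have the PMU to itself, in both directions. */
      static int group_can_go_on(const struct cpu_state *cpu,
                                 int group_is_sw, int group_is_exclusive)
      {
          if (group_is_sw)
              return 1;                /* software counters always fit */
          if (cpu->exclusive)
              return 0;                /* an exclusive group owns the PMU */
          if (group_is_exclusive && cpu->active_hw_groups > 0)
              return 0;                /* exclusive wants the PMU alone */
          return 1;
      }

      int main(void)
      {
          struct cpu_state cpu = { 0, 0 };

          assert(group_can_go_on(&cpu, 0, 1));  /* exclusive on idle PMU: ok   */
          cpu.active_hw_groups = 1;
          assert(!group_can_go_on(&cpu, 0, 1)); /* PMU busy: exclusive refused */
          cpu.exclusive = 1;
          assert(!group_can_go_on(&cpu, 0, 0)); /* exclusive present: hw refused */
          assert(group_can_go_on(&cpu, 1, 0));  /* software group still fits   */
          return 0;
      }
      ```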
      
      Along the way this reorganizes things a little:
      - is_software_counter() is moved to perf_counter.h.
      - cpuctx->active_oncpu now records the number of hardware counters on
        the CPU, i.e. it now excludes software counters.  Nothing was reading
        cpuctx->active_oncpu before, so this change is harmless.
      - A new cpuctx->exclusive field records whether we currently have an
        exclusive group on the CPU.
      - counter_sched_out moves higher up in perf_counter.c and gets called
        from __perf_counter_remove_from_context and __perf_counter_exit_task,
        where we used to have essentially the same code.
      - __perf_counter_sched_in now goes through the counter list twice, doing
        the pinned counters in the first loop and the non-pinned counters in
        the second loop, in order to give the pinned counters the best chance
        to be scheduled in.
      
      Note that only a group leader can be exclusive or pinned, and that
      attribute applies to the whole group.  This avoids some awkwardness in
      some corner cases (e.g. where a group leader is closed and the other
      group members get added to the context list).  If we want to relax that
      restriction later, we can, and it is easier to relax a restriction than
      to apply a new one.
      
      This doesn't yet handle the case where a pinned counter is inherited
      and goes into error state in the child - the error state is not
      propagated up to the parent when the child exits, and arguably it
      should.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      3b6f9e5c