提交 · 5b249b1b07c42c61d0c53b8ab4fad026e39a9be9 · openeuler / raspberrypi-kernel

04 9月, 2012 2 次提交

perf/hwpb: Invoke __perf_event_disable() if interrupts are already disabled · 500ad2d8

由 K.Prasad 提交于 8月 02, 2012

While debugging a warning message on PowerPC while using hardware
breakpoints, it was discovered that when perf_event_disable is invoked
through hw_breakpoint_handler function with interrupts disabled, a
subsequent IPI in the code path would trigger a WARN_ON_ONCE message in
smp_call_function_single function.

This patch calls __perf_event_disable() when interrupts are already
disabled, instead of perf_event_disable().
Reported-by: NEdjunior Barbosa Machado <emachado@linux.vnet.ibm.com>
Signed-off-by: NK.Prasad <Prasad.Krishnan@gmail.com>
[naveen.n.rao@linux.vnet.ibm.com: v3: Check to make sure we target current task]
Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120802081635.5811.17737.stgit@localhost.localdomain
[ Fixed build error on MIPS. ]
Signed-off-by: NIngo Molnar <mingo@kernel.org>

500ad2d8

perf_event: Switch to internal refcount, fix race with close() · a6fa941d

由 Al Viro 提交于 8月 20, 2012

Don't mess with file refcounts (or keep a reference to file, for
that matter) in perf_event.  Use explicit refcount of its own
instead.  Deal with the race between the final reference to event
going away and new children getting created for it by use of
atomic_long_inc_not_zero() in inherit_event(); just have the
latter free what it had allocated and return NULL, that works
out just fine (children of siblings of something doomed are
created as singletons, same as if the child of leader had been
created and immediately killed).
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120820135925.GG23464@ZenIV.linux.org.ukSigned-off-by: NIngo Molnar <mingo@kernel.org>

a6fa941d

31 7月, 2012 1 次提交

perf/trace: Add ability to set a target task for events · e6dab5ff

由 Andrew Vagin 提交于 7月 11, 2012

A few events are interesting not only for a current task.
For example, sched_stat_* events are interesting for a task
which wakes up. For this reason, it will be good if such
events will be delivered to a target task too.

Now a target task can be set by using __perf_task().

The original idea and a draft patch belongs to Peter Zijlstra.

I need these events for profiling sleep times. sched_switch is used for
getting callchains and sched_stat_* is used for getting time periods.
These events are combined in user space, then it can be analyzed by
perf tools.
Inspired-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: NAndrew Vagin <avagin@openvz.org>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1342016098-213063-1-git-send-email-avagin@openvz.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

e6dab5ff

18 6月, 2012 4 次提交

perf: Introduce perf_pmu_migrate_context() · 0cda4c02

由 Yan, Zheng 提交于 6月 15, 2012

Originally from Peter Zijlstra. The helper migrates perf events
from one cpu to another cpu.
Signed-off-by: NZheng Yan <zheng.z.yan@intel.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1339741902-8449-5-git-send-email-zheng.z.yan@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

0cda4c02

perf: Allow the PMU driver to choose the CPU on which to install events · e2d37cd2

由 Yan, Zheng 提交于 6月 15, 2012

Allow the pmu->event_init callback to change event->cpu, so the PMU driver
can choose the CPU on which to install events.
Signed-off-by: NZheng Yan <zheng.z.yan@intel.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1339741902-8449-4-git-send-email-zheng.z.yan@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

e2d37cd2

perf: Avoid race between cpu hotplug and installing event · fbfc623f

由 Yan, Zheng 提交于 6月 15, 2012

perf_event_open() requires the cpu on which to install event is online,
but the cpu can go offline after perf_event_open checks that. Add a
get_online_cpus()/put_online_cpus() pair to avoid the race.
Signed-off-by: NZheng Yan <zheng.z.yan@intel.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1339741902-8449-3-git-send-email-zheng.z.yan@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

fbfc623f

perf: Use css_tryget() to avoid propping up css refcount · 9c5da09d

由 Salman Qazi 提交于 6月 14, 2012

An rmdir pushes css's ref count to zero.  However, if the associated
directory is open at the time, the dentry ref count is non-zero.  If
the fd for this directory is then passed into perf_event_open, it
does a css_get().  This bounces the ref count back up from zero.  This
is a problem by itself.  But what makes it turn into a crash is the
fact that we end up doing an extra dput, since we perform a dput
when css_put sees the ref count go down to zero.

css_tryget() does not fall into that trap. So, we use that instead.

Reproduction test-case for the bug:

 #include <unistd.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <linux/unistd.h>
 #include <linux/perf_event.h>
 #include <string.h>
 #include <errno.h>
 #include <stdio.h>

 #define PERF_FLAG_PID_CGROUP    (1U << 2)

 int perf_event_open(struct perf_event_attr *hw_event_uptr,
                     pid_t pid, int cpu, int group_fd, unsigned long flags) {
         return syscall(__NR_perf_event_open,hw_event_uptr, pid, cpu,
                 group_fd, flags);
 }

 /*
  * Directly poke at the perf_event bug, since it's proving hard to repro
  * depending on where in the kernel tree.  what moved?
  */
 int main(int argc, char **argv)
 {
        int fd;
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.exclude_kernel = 1;
        attr.size = sizeof(attr);
        mkdir("/dev/cgroup/perf_event/blah", 0777);
        fd = open("/dev/cgroup/perf_event/blah", O_RDONLY);
        perror("open");
        rmdir("/dev/cgroup/perf_event/blah");
        sleep(2);
        perf_event_open(&attr, fd, 0, -1,  PERF_FLAG_PID_CGROUP);
        perror("perf_event_open");
        close(fd);
        return 0;
 }
Signed-off-by: NSalman Qazi <sqazi@google.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: NTejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20120614223108.1025.2503.stgit@dungbeetle.mtv.corp.google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

9c5da09d

01 6月, 2012 1 次提交

perf: Remove duplicate invocation on perf_event_for_each · cb7225fe

由 Namhyung Kim 提交于 5月 31, 2012

The @func callback was invoked twice for group leader when
perf_event_for_each() called. It seems the commit 75f937f2
("perf_counter: Fix ctx->mutex vs counter ->mutex inversion") made the
mistake during the change.
Signed-off-by: NNamhyung Kim <namhyung.kim@lge.com>
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338443506-25009-1-git-send-email-namhyung.kim@lge.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>

cb7225fe

23 5月, 2012 1 次提交

Revert "sched, perf: Use a single callback into the scheduler" · ab0cce56

由 Jiri Olsa 提交于 5月 23, 2012

This reverts commit cb04ff9a ("sched, perf: Use a single
callback into the scheduler").

Before this change was introduced, the process switch worked
like this (wrt. to perf event schedule):

     schedule (prev, next)
       - schedule out all perf events for prev
       - switch to next
       - schedule in all perf events for current (next)

After the commit, the process switch looks like:

     schedule (prev, next)
       - schedule out all perf events for prev
       - schedule in all perf events for (next)
       - switch to next

The problem is, that after we schedule perf events in, the pmu
is enabled and we can receive events even before we make the
switch to next - so "current" still being prev process (event
SAMPLE data are filled based on the value of the "current"
process).

Thats exactly what we see for test__PERF_RECORD test. We receive
SAMPLES with PID of the process that our tracee is scheduled
from.

Discussed with Peter Zijlstra:

 > Bah!, yeah I guess reverting is the right thing for now. Sad
 > though.
 >
 > So by having the two hooks we have a black-spot between them
 > where we receive no events at all, this black-spot covers the
 > hand-over of current and we thus don't receive the 'wrong'
 > events.
 >
 > I rather liked we could do away with both that black-spot and
 > clean up the code a little, but apparently people rely on it.
Signed-off-by: NJiri Olsa <jolsa@redhat.com>
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: acme@redhat.com
Cc: paulus@samba.org
Cc: cjashfor@linux.vnet.ibm.com
Cc: fweisbec@gmail.com
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/20120523111302.GC1638@m.brq.redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

ab0cce56

09 5月, 2012 2 次提交

sched, perf: Use a single callback into the scheduler · cb04ff9a

由 Peter Zijlstra 提交于 5月 08, 2012

We can easily use a single callback for both sched-in and sched-out. This
reduces the code footprint in the scheduler path as well as removes
the PMU black spot otherwise present between the out and in callback.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-o56ajxp1edwqg6x9d31wb805@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

cb04ff9a

perf: Pass last sampling period to perf_sample_data_init() · fd0d000b

由 Robert Richter 提交于 4月 02, 2012

We always need to pass the last sample period to
perf_sample_data_init(), otherwise the event distribution will be
wrong. Thus, modifiyng the function interface with the required period
as argument. So basically a pattern like this:

        perf_sample_data_init(&data, ~0ULL);
        data.period = event->hw.last_period;

will now be like that:

        perf_sample_data_init(&data, ~0ULL, event->hw.last_period);

Avoids unininitialized data.period and simplifies code.
Signed-off-by: NRobert Richter <robert.richter@amd.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-3-git-send-email-robert.richter@amd.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

fd0d000b

26 4月, 2012 2 次提交

perf: Use static variant of perf_event_overflow in core.c · 33b07b8b

由 Robert Richter 提交于 4月 05, 2012

No need to have an additional function layer.
Signed-off-by: NRobert Richter <robert.richter@amd.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333643084-26776-4-git-send-email-robert.richter@amd.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

33b07b8b

perf: Fix perf_event_for_each() to use sibling · 724b6daa

由 Michael Ellerman 提交于 4月 11, 2012

In perf_event_for_each() we call a function on an event, and then
iterate over the siblings of the event.

However we don't call the function on the siblings, we call it
repeatedly on the original event - it seems "obvious" that we should
be calling it with sibling as the argument.

It looks like this broke in commit 75f937f2 ("Fix ctx->mutex
vs counter->mutex inversion").

The only effect of the bug is that the PERF_IOC_FLAG_GROUP parameter
to the ioctls doesn't work.
Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1334109253-31329-1-git-send-email-michael@ellerman.id.auSigned-off-by: NIngo Molnar <mingo@kernel.org>

724b6daa

24 3月, 2012 1 次提交

perf: Move mmap page data_head offset assertion out of header · b01c3a00

由 Jiri Olsa 提交于 3月 23, 2012

Having the build time assertion in header is making the perf
build fail on x86 with:

  ../../include/linux/perf_event.h:411:32: error: variably modified \
		‘__assert_mmap_data_head_offset’ at file scope [-Werror]

I'm moving the build time validation out of the header, because
I think it's better than to lessen the perf build warn/error
check.
Signed-off-by: NJiri Olsa <jolsa@redhat.com>
Cc: acme@redhat.com
Cc: a.p.zijlstra@chello.nl
Cc: paulus@samba.org
Cc: cjashfor@linux.vnet.ibm.com
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/1332513680-7870-1-git-send-email-jolsa@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

b01c3a00

23 3月, 2012 1 次提交

perf: Fix mmap_page capabilities and docs · c7206205

由 Peter Zijlstra 提交于 3月 22, 2012

Complete the syscall-less self-profiling feature and address
all complaints, namely:

 - capabilities, so we can detect what is actually available at runtime

     Add a capabilities field to perf_event_mmap_page to indicate
     what is actually available for use.

 - on x86: RDPMC weirdness due to being 40/48 bits and not sign-extending
   properly.

 - ABI documentation as to how all this stuff works.

Also improve the documentation for the new features.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vweaver1@eecs.utk.edu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1332433596.2487.33.camel@twinsSigned-off-by: NIngo Molnar <mingo@kernel.org>

c7206205

05 3月, 2012 3 次提交

perf: Add callback to flush branch_stack on context switch · d010b332

由 Stephane Eranian 提交于 2月 09, 2012

With branch stack sampling, it is possible to filter by priv levels.

In system-wide mode, that means it is possible to capture only user
level branches. The builtin SW LBR filter needs to disassemble code
based on LBR captured addresses. For that, it needs to know the task
the addresses are associated with. Because of context switches, the
content of the branch stack buffer may contain addresses from
different tasks.

We need a callback on context switch to either flush the branch stack
or save it. This patch adds a new callback in struct pmu which is called
during context switches. The callback is called only when necessary.
That is when a system-wide context has, at least, one event which
uses PERF_SAMPLE_BRANCH_STACK. The callback is never called for
per-thread context.

In this version, the Intel x86 code simply flushes (resets) the LBR
on context switches (fills it with zeroes). Those zeroed branches are
then filtered out by the SW filter.
Signed-off-by: NStephane Eranian <eranian@google.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-11-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

d010b332

perf: Disable PERF_SAMPLE_BRANCH_* when not supported · 2481c5fa

由 Stephane Eranian 提交于 2月 09, 2012

PERF_SAMPLE_BRANCH_* is disabled for:

 - SW events (sw counters, tracepoints)
 - HW breakpoints
 - ALL but Intel x86 architecture
 - AMD64 processors
Signed-off-by: NStephane Eranian <eranian@google.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-10-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

2481c5fa

perf: Add generic taken branch sampling support · bce38cd5

由 Stephane Eranian 提交于 2月 09, 2012

This patch adds the ability to sample taken branches to the
perf_event interface.

The ability to capture taken branches is very useful for all
sorts of analysis. For instance, basic block profiling, call
counts, statistical call graph.

This new capability requires hardware assist and as such may
not be available on all HW platforms. On Intel x86 it is
implemented on top of the Last Branch Record (LBR) facility.

To enable taken branches sampling, the PERF_SAMPLE_BRANCH_STACK
bit must be set in attr->sample_type.

Sampled taken branches may be filtered by type and/or priv
levels.

The patch adds a new field, called branch_sample_type, to the
perf_event_attr structure. It contains a bitmask of filters
to apply to the sampled taken branches.

Filters may be implemented in HW. If the HW filter does not exist
or is not good enough, some arch may also implement a SW filter.

The following generic filters are currently defined:
- PERF_SAMPLE_USER
only branches whose targets are at the user level

- PERF_SAMPLE_KERNEL
only branches whose targets are at the kernel level

- PERF_SAMPLE_HV
only branches whose targets are at the hypervisor level

- PERF_SAMPLE_ANY
any type of branches (subject to priv levels filters)

- PERF_SAMPLE_ANY_CALL
any call branches (may incl. syscall on some arch)

- PERF_SAMPLE_ANY_RET
any return branches (may incl. syscall returns on some arch)

- PERF_SAMPLE_IND_CALL
indirect call branches

Obviously filter may be combined. The priv level bits are optional.
If not provided, the priv level of the associated event are used. It
is possible to collect branches at a priv level different from the
associated event. Use of kernel, hv priv levels is subject to permissions
and availability (hv).

The number of taken branch records present in each sample may vary based
on HW, the type of sampled branches, the executed code. Therefore
each sample contains the number of taken branches it contains.
Signed-off-by: NStephane Eranian <eranian@google.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-2-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

bce38cd5

24 2月, 2012 1 次提交

static keys: Introduce 'struct static_key', static_key_true()/false() and... · c5905afb

由 Ingo Molnar 提交于 2月 24, 2012

static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]()

So here's a boot tested patch on top of Jason's series that does
all the cleanups I talked about and turns jump labels into a
more intuitive to use facility. It should also address the
various misconceptions and confusions that surround jump labels.

Typical usage scenarios:

        #include <linux/static_key.h>

        struct static_key key = STATIC_KEY_INIT_TRUE;

        if (static_key_false(&key))
                do unlikely code
        else
                do likely code

Or:

        if (static_key_true(&key))
                do likely code
        else
                do unlikely code

The static key is modified via:

        static_key_slow_inc(&key);
        ...
        static_key_slow_dec(&key);

The 'slow' prefix makes it abundantly clear that this is an
expensive operation.

I've updated all in-kernel code to use this everywhere. Note
that I (intentionally) have not pushed through the rename
blindly through to the lowest levels: the actual jump-label
patching arch facility should be named like that, so we want to
decouple jump labels from the static-key facility a bit.

On non-jump-label enabled architectures static keys default to
likely()/unlikely() branches.
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Acked-by: NJason Baron <jbaron@redhat.com>
Acked-by: NSteven Rostedt <rostedt@goodmis.org>
Cc: a.p.zijlstra@chello.nl
Cc: mathieu.desnoyers@efficios.com
Cc: davem@davemloft.net
Cc: ddaney.cavm@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.huSigned-off-by: NIngo Molnar <mingo@elte.hu>

c5905afb

07 2月, 2012 1 次提交

perf: Fix double start/stop in x86_pmu_start() · f39d47ff

由 Stephane Eranian 提交于 2月 07, 2012

The following patch fixes a bug introduced by the following
commit:

        e050e3f0 ("perf: Fix broken interrupt rate throttling")

The patch caused the following warning to pop up depending on
the sampling frequency adjustments:

  ------------[ cut here ]------------
  WARNING: at arch/x86/kernel/cpu/perf_event.c:995 x86_pmu_start+0x79/0xd4()

It was caused by the following call sequence:

perf_adjust_freq_unthr_context.part() {
     stop()
     if (delta > 0) {
          perf_adjust_period() {
              if (period > 8*...) {
                  stop()
                  ...
                  start()
              }
          }
      }
      start()
}

Which caused a double start and a double stop, thus triggering
the assert in x86_pmu_start().

The patch fixes the problem by avoiding the double calls. We
pass a new argument to perf_adjust_period() to indicate whether
or not the event is already stopped. We can't just remove the
start/stop from that function because it's called from
__perf_event_overflow where the event needs to be reloaded via a
stop/start back-toback call.

The patch reintroduces the assertion in x86_pmu_start() which
was removed by commit:

	84f2b9b2 ("perf: Remove deprecated WARN_ON_ONCE()")

In this second version, we've added calls to disable/enable PMU
during unthrottling or frequency adjustment based on bug report
of spurious NMI interrupts from Eric Dumazet.
Reported-and-tested-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NStephane Eranian <eranian@google.com>
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: markus@trippelsdorf.de
Cc: paulus@samba.org
Link: http://lkml.kernel.org/r/20120207133956.GA4932@quad
[ Minor edits to the changelog and to the code ]
Signed-off-by: NIngo Molnar <mingo@elte.hu>

f39d47ff

03 2月, 2012 1 次提交

cgroup: remove cgroup_subsys argument from callbacks · 761b3ef5

由 Li Zefan 提交于 1月 31, 2012

The argument is not used at all, and it's not necessary, because
a specific callback handler of course knows which subsys it
belongs to.

Now only ->pupulate() takes this argument, because the handlers of
this callback always call cgroup_add_file()/cgroup_add_files().

So we reduce a few lines of code, though the shrinking of object size
is minimal.

 16 files changed, 113 insertions(+), 162 deletions(-)

   text    data     bss     dec     hex filename
5486240  656987 7039960 13183187         c928d3 vmlinux.o.orig
5486170  656987 7039960 13183117         c9288d vmlinux.o
Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

761b3ef5

27 1月, 2012 1 次提交

perf: Fix broken interrupt rate throttling · e050e3f0

由 Stephane Eranian 提交于 1月 26, 2012

This patch fixes the sampling interrupt throttling mechanism.

It was broken in v3.2. Events were not being unthrottled. The
unthrottling mechanism required that events be checked at each
timer tick.

This patch solves this problem and also separates:

  - unthrottling
  - multiplexing
  - frequency-mode period adjustments

Not all of them need to be executed at each timer tick.

This third version of the patch is based on my original patch +
PeterZ proposal (https://lkml.org/lkml/2012/1/7/87).

At each timer tick, for each context:

  - if the current CPU has throttled events, we unthrottle events

  - if context has frequency-based events, we adjust sampling periods

  - if we have reached the jiffies interval, we multiplex (rotate)

We decoupled rotation (multiplexing) from frequency-mode sampling
period adjustments.  They should not necessarily happen at the same
rate. Multiplexing is subject to jiffies_interval (currently at 1
but could be higher once the tunable is exposed via sysfs).

We have grouped frequency-mode adjustment and unthrottling into the
same routine to minimize code duplication. When throttled while in
frequency mode, we scan the events only once.

We have fixed the threshold enforcement code in __perf_event_overflow().
There was a bug whereby it would allow more than the authorized rate
because an increment of hwc->interrupts was not executed at the right
place.

The patch was tested with low sampling limit (2000) and fixed periods,
frequency mode, overcommitted PMU.

On a 2.1GHz AMD CPU:

 $ cat /proc/sys/kernel/perf_event_max_sample_rate
 2000

We set a rate of 3000 samples/sec (2.1GHz/3000 = 700000):

 $ perf record -e cycles,cycles -c 700000  noploop 10
 $ perf report -D | tail -21

 Aggregated stats:
           TOTAL events:      80086
            MMAP events:         88
            COMM events:          2
            EXIT events:          4
        THROTTLE events:      19996
      UNTHROTTLE events:      19996
          SAMPLE events:      40000

 cycles stats:
           TOTAL events:      40006
            MMAP events:          5
            COMM events:          1
            EXIT events:          4
        THROTTLE events:       9998
      UNTHROTTLE events:       9998
          SAMPLE events:      20000

 cycles stats:
           TOTAL events:      39996
        THROTTLE events:       9998
      UNTHROTTLE events:       9998
          SAMPLE events:      20000

For 10s, the cap is 2x2000x10 = 40000 samples.
We get exactly that: 20000 samples/event.
Signed-off-by: NStephane Eranian <eranian@google.com>
Cc: <stable@kernel.org> # v3.2+
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120126160319.GA5655@quadSigned-off-by: NIngo Molnar <mingo@elte.hu>

e050e3f0

21 1月, 2012 1 次提交

perf: Call perf_cgroup_event_time() directly · 46cd6a7f

由 Namhyung Kim 提交于 1月 20, 2012

The perf_event_time() will call perf_cgroup_event_time()
if @event is a cgroup event. Just do it directly and avoid
the extra check..
Signed-off-by: NNamhyung Kim <namhyung.kim@lge.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/1327021966-27688-2-git-send-email-namhyung.kim@lge.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

46cd6a7f

02 1月, 2012 1 次提交

misc latin1 to utf8 conversions · d36b6910

由 Al Viro 提交于 12月 29, 2011

Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJiri Kosina <jkosina@suse.cz>

d36b6910

21 12月, 2011 5 次提交

perf: Extend the mmap control page with time (TSC) fields · e3f3541c

由 Peter Zijlstra 提交于 11月 21, 2011

Extend the mmap control page with fields so that userspace can compute
time deltas relative to the provided time fields.

Currently only implemented for x86 with constant and nonstop TSC.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Link: http://lkml.kernel.org/n/tip-3u1jucza77j3wuvs0x2bic0f@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>

e3f3541c

perf, x86: Provide means for disabling userspace RDPMC · 0c9d42ed

由 Peter Zijlstra 提交于 11月 20, 2011

Allow the disabling of RDPMC via a pmu specific attribute:

  echo 0 > /sys/bus/event_source/devices/cpu/rdpmc
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Link: http://lkml.kernel.org/n/tip-pqeog465zo5hsimtkfz73f27@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>

0c9d42ed

perf: Fix mmap_page::offset computation · 365a4038

由 Peter Zijlstra 提交于 11月 21, 2011

There's multiple reason the counter might be unavailable, change the
condition to !->index since perf_event_index() should return 0 for all
those cases.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-1ixr3olci40w8rgv2evv2ldh@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>

365a4038

perf, arch: Rework perf_event_index() · 35edc2a5

由 Peter Zijlstra 提交于 11月 20, 2011

Put the logic to compute the event index into a per pmu method. This
is required because the x86 rules are weird and wonderful and don't
match the capabilities of the current scheme.

AFAIK only powerpc actually has a usable userspace read of the PMCs
but I'm not at all sure anybody actually used that.

ARM is restored to the default since it currently does not support
userspace access at all. And all software events are provided with a
method that reports their index as 0 (disabled).
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michael Cree <mcree@orcon.net.nz>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Link: http://lkml.kernel.org/n/tip-dfydxodki16lylkt3gl2j7cw@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>

35edc2a5

perf: Update the mmap control page on mmap() · 9a0f05cb

由 Peter Zijlstra 提交于 11月 21, 2011

Apparently we didn't update the mmap control page right after mmap(),
which leads to surprises when userspace wants to use it.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Link: http://lkml.kernel.org/n/tip-dcpi7164djsexmx6ya7lilrc@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>

9a0f05cb

14 12月, 2011 1 次提交

perf events: Fix ring_buffer_wakeup() brown paperbag bug · 44b7f4b9

由 Will Deacon 提交于 12月 13, 2011

Commit 10c6db11 ("perf: Fix loss of notification with multi-event")
seems to unconditionally dereference event->rb in the wakeup handler,
this is wrong, there might not be a buffer attached.
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111213152651.GP20297@mudshark.cambridge.arm.com
[ minor edits ]
Signed-off-by: NIngo Molnar <mingo@elte.hu>

44b7f4b9

13 12月, 2011 1 次提交

cgroup: don't use subsys->can_attach_task() or ->attach_task() · bb9d97b6

由 Tejun Heo 提交于 12月 12, 2011

Now that subsys->can_attach() and attach() take @tset instead of
@task, they can handle per-task operations.  Convert
->can_attach_task() and ->attach_task() users to use ->can_attach()
and attach() instead.  Most converions are straight-forward.
Noteworthy changes are,

* In cgroup_freezer, remove unnecessary NULL assignments to unused
  methods.  It's useless and very prone to get out of sync, which
  already happened.

* In cpuset, PF_THREAD_BOUND test is checked for each task.  This
  doesn't make any practical difference but is conceptually cleaner.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: NFrederic Weisbecker <fweisbec@gmail.com>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>

bb9d97b6

12 12月, 2011 1 次提交

events: Make events use the new is_idle_task() API · 77aeeebd

由 Paul E. McKenney 提交于 11月 10, 2011

Change from direct comparison of ->pid with zero to is_idle_task().
Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Reviewed-by: NJosh Triplett <josh@joshtriplett.org>

77aeeebd

07 12月, 2011 1 次提交

perf: Do no try to schedule task events if there are none · 86b47c25

由 Gleb Natapov 提交于 11月 22, 2011

perf_event_sched_in() shouldn't try to schedule task events if there
are none otherwise task's ctx->is_active will be set and will not be
cleared during sched_out. This will prevent newly added events from
being scheduled into the task context.

Fixes a boo-boo in commit 1d5f003f ("perf: Do not set task_ctx
pointer in cpuctx if there are no events in the context").
Signed-off-by: NGleb Natapov <gleb@redhat.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111122140821.GF2557@redhat.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

86b47c25

06 12月, 2011 4 次提交

perf, core: Rate limit perf_sched_events jump_label patching · b2029520

由 Gleb Natapov 提交于 11月 27, 2011

jump_lable patching is very expensive operation that involves pausing all
cpus. The patching of perf_sched_events jump_label is easily controllable
from userspace by unprivileged user.

When te user runs a loop like this:

  "while true; do perf stat -e cycles true; done"

... the performance of my test application that just increments a counter
for one second drops by 4%.

This is on a 16 cpu box with my test application using only one of
them. An impact on a real server doing real work will be worse.

Performance of KVM PMU drops nearly 50% due to jump_lable for "perf
record" since KVM PMU implementation creates and destroys perf event
frequently.

This patch introduces a way to rate limit jump_label patching and uses
it to fix the above problem.

I believe that as jump_label use will spread the problem will become more
common and thus solving it in a generic code is appropriate. Also fixing
it in the perf code would result in moving jump_label accounting logic to
perf code with all the ifdefs in case of JUMP_LABEL=n kernel. With this
patch all details are nicely hidden inside jump_label code.
Signed-off-by: NGleb Natapov <gleb@redhat.com>
Acked-by: NJason Baron <jbaron@redhat.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111127155909.GO2557@redhat.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

b2029520

perf: Fix enable_on_exec for sibling events · b79387ef

由 Peter Zijlstra 提交于 11月 22, 2011

Deng-Cheng Zhu reported that sibling events that were created disabled
with enable_on_exec would never get enabled. Iterate all events
instead of the group lists.
Reported-by: NDeng-Cheng Zhu <dczhu@mips.com>
Tested-by: NDeng-Cheng Zhu <dczhu@mips.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1322048382.14799.41.camel@twinsSigned-off-by: NIngo Molnar <mingo@elte.hu>

b79387ef

perf: Remove superfluous arguments · 1d9b482e

由 Peter Zijlstra 提交于 11月 23, 2011

Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-yv4o74vh90suyghccgykbnry@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>

1d9b482e

perf: Avoid a useless pmu_disable() in the perf-tick · 0f5a2601

由 Peter Zijlstra 提交于 11月 16, 2011

Gleb writes:

 > Currently pmu is disabled and re-enabled on each timer interrupt even
 > when no rotation or frequency adjustment is needed. On Intel CPU this
 > results in two writes into PERF_GLOBAL_CTRL MSR per tick. On bare metal
 > it does not cause significant slowdown, but when running perf in a virtual
 > machine it leads to 20% slowdown on my machine.

Cure this by keeping a perf_event_context::nr_freq counter that counts the
number of active events that require frequency adjustments and use this in a
similar fashion to the already existing nr_events != nr_active test in
perf_rotate_context().

By being able to exclude both rotation and frequency adjustments a-priory for
the common case we can avoid the otherwise superfluous PMU disable.
Suggested-by: NGleb Natapov <gleb@redhat.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-515yhoatehd3gza7we9fapaa@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>

0f5a2601

05 12月, 2011 1 次提交

perf: Fix loss of notification with multi-event · 10c6db11

由 Peter Zijlstra 提交于 11月 26, 2011

When you do:
$ perf record -e cycles,cycles,cycles noploop 10

You expect about 10,000 samples for each event, i.e., 10s at
1000samples/sec. However, this is not what's happening. You
get much fewer samples, maybe 3700 samples/event:

$ perf report -D | tail -15
Aggregated stats:
TOTAL events: 10998
MMAP events: 66
COMM events: 2
SAMPLE events: 10930
cycles stats:
TOTAL events: 3644
SAMPLE events: 3644
cycles stats:
TOTAL events: 3642
SAMPLE events: 3642
cycles stats:
TOTAL events: 3644
SAMPLE events: 3644

On a Intel Nehalem or even AMD64, there are 4 counters capable
of measuring cycles, so there is plenty of space to measure those
events without multiplexing (even with the NMI watchdog active).
And even with multiplexing, we'd expect roughly the same number
of samples per event.

The root of the problem was that when the event that caused the buffer
to become full was not the first event passed on the cmdline, the user
notification would get lost. The notification was sent to the file
descriptor of the overflowed event but the perf tool was not polling
on it. The perf tool aggregates all samples into a single buffer,
i.e., the buffer of the first event. Consequently, it assumes
notifications for any event will come via that descriptor.

The seemingly straight forward solution of moving the waitq into the
ringbuffer object doesn't work because of life-time issues. One could
perf_event_set_output() on a fd that you're also blocking on and cause
the old rb object to be freed while its waitq would still be
referenced by the blocked thread -> FAIL.

Therefore link all events to the ringbuffer and broadcast the wakeup
from the ringbuffer object to all possible events that could be waited
upon. This is rather ugly, and we're open to better solutions but it
works for now.
Reported-by: NStephane Eranian <eranian@google.com>
Finished-by: NStephane Eranian <eranian@google.com>
Reviewed-by: NStephane Eranian <eranian@google.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111126014731.GA7030@quadSigned-off-by: NIngo Molnar <mingo@elte.hu>

10c6db11

14 11月, 2011 2 次提交

events: Don't divide events if it has field period · 5d81e5cf

由 Andrew Vagin 提交于 11月 07, 2011

This patch solves the following problem:

Now some samples may be lost due to throttling. The number of samples is
restricted by sysctl_perf_event_sample_rate/HZ. A trace event is
divided on some samples according to event's period. I don't sure, that
we should generate more than one sample on each trace event. I think the
better way to use SAMPLE_PERIOD.

E.g.: I want to trace when a process sleeps. I created a process, which
sleeps for 1ms and for 4ms. perf got 100 events in both cases.

swapper 0 [000] 1141.371830: sched_stat_sleep: comm=foo pid=1801 delay=1386750 [ns]
swapper 0 [000] 1141.369444: sched_stat_sleep: comm=foo pid=1801 delay=4499585 [ns]

In the first case a kernel want to send 4499585 events and
in the second case it wants to send 1386750 events.
perf-reports shows that process sleeps in both places equal time. It's
bug.

With this patch kernel generates one event on each "sleep" and the time
slice is saved in the field "period". Perf knows how handle it.
Signed-off-by: NAndrew Vagin <avagin@openvz.org>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1320670457-2633428-3-git-send-email-avagin@openvz.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>

5d81e5cf

perf: Carve out callchain functionality · 9251f904

由 Borislav Petkov 提交于 10月 16, 2011

Split the callchain code from the perf events core into
a new kernel/events/callchain.c file.

This simplifies a bit the big core.c
Signed-off-by: NBorislav Petkov <borislav.petkov@amd.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
[keep ctx recursion handling inline and use internal headers]
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1318778104-17152-1-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

9251f904