1. 10 2月, 2017 2 次提交
  2. 30 1月, 2017 7 次提交
    • K
      perf/core: Try parent PMU first when initializing a child event · 40999312
      Kan Liang 提交于
      perf has additional overhead when monitoring the task which
      frequently generates child tasks.
      
      perf_init_event() is one of the hotspots for the additional overhead:
      
      Currently, to get the PMU, it tries to search the type in pmu_idr at
      first. But it is not always successful, especially for the widely used
      PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events. So it has to go to the
      slow path which go through the whole PMUs list.
      
      It will be a big performance issue, if the PMUs list is long (e.g. server
      with many uncore boxes) and the task frequently generates child tasks.
      
      The child event inherits its parent event. So the child event should
      try its parent PMU first.
      
      Here is some data from the overhead test on Broadwell server:
      
        perf record -e $TEST_EVENTS -- ./loop.sh 50000
      
        loop.sh
          start=$(date +%s%N)
          i=0
          while [ "$i" -le "$1" ]
          do
                  date > /dev/null
                  i=`expr $i + 1`
          done
          end=$(date +%s%N)
          elapsed=`expr $end - $start`
      
        Event#	Original elapsed time	Elapsed time with patch		delta
        1		196,573,192,397		189,162,029,998			-3.77%
        2		257,567,753,013		241,620,788,683			-6.19%
        4		398,730,726,971		370,518,938,714			-7.08%
        8		824,983,761,120		740,702,489,329			-10.22%
        16		1,883,411,923,498	1,672,027,508,355		-11.22%
      
      ... which shows a nice performance improvement.
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1484745662-15928-2-git-send-email-kan.liang@intel.com
      [ Tidied up the changelog and the code comment. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      40999312
    • A
      perf/core: Optimize event rescheduling on active contexts · 487f05e1
      Alexander Shishkin 提交于
      When new events are added to an active context, we go and reschedule
      all cpu groups and all task groups in order to preserve the priority
      (cpu pinned, task pinned, cpu flexible, task flexible), but in
      reality we only need to reschedule groups of the same priority as
      that of the events being added, and below.
      
      This patch changes the behavior so that only groups that need to be
      rescheduled are rescheduled.
      Reported-by: NAdrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20170119164330.22887-3-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      487f05e1
    • A
      perf/core: Don't re-schedule CPU flexible events needlessly · fe45bafb
      Alexander Shishkin 提交于
      In the sched-in path, we first remove a CPU's flexible events in order to
      give priority to the task's pinned events. However, this step can be safely
      skipped if the task doesn't have its own pinned events.
      
      This patch implements this skipping.
      Reported-by: NAdrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20170119164330.22887-2-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fe45bafb
    • D
      perf/core: Remove perf_cpu_context::unique_pmu · 1fd7e416
      David Carrillo-Cisneros 提交于
      cpuctx->unique_pmu was originally introduced as a way to identify cpuctxs
      with shared pmus in order to avoid visiting the same cpuctx more than once
      in a for_each_pmu loop.
      
      cpuctx->unique_pmu == cpuctx->pmu in non-software task contexts since they
      have only one pmu per cpuctx. Since perf_pmu_sched_task() is only called in
      hw contexts, this patch replaces cpuctx->unique_pmu by cpuctx->pmu in it.
      
      The change above, together with the previous patch in this series, removed
      the remaining uses of cpuctx->unique_pmu, so we remove it altogether.
      Signed-off-by: NDavid Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Cc: Vince Weaver <vince@deater.net>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/20170118192454.58008-3-davidcc@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1fd7e416
    • D
      perf/core: Make cgroup switch visit only cpuctxs with cgroup events · 058fe1c0
      David Carrillo-Cisneros 提交于
      This patch follows from a conversation in CQM/CMT's last series about
      speeding up the context switch for cgroup events:
      
        https://patchwork.kernel.org/patch/9478617/
      
      This is a low-hanging fruit optimization. It replaces the iteration over
      the "pmus" list in cgroup switch by an iteration over a new list that
      contains only cpuctxs with at least one cgroup event.
      
      This is necessary because the number of PMUs have increased over the years
      e.g modern x86 server systems have well above 50 PMUs.
      
      The iteration over the full PMU list is unneccessary and can be costly in
      heavy cache contention scenarios.
      
      Below are some instrumentation measurements with 10, 50 and 90 percentiles
      of the total cost of context switch before and after this optimization for
      a simple array read/write microbenchark.
      
        Contention
          Level    Nr events      Before (us)            After (us)       Median
        L2    L3     types      (10%, 50%, 90%)       (10%, 50%, 90%     Speedup
        --------------------------------------------------------------------------
        Low   Low       1       (1.72, 2.42, 5.85)    (1.35, 1.64, 5.46)     29%
        High  Low       1       (2.08, 4.56, 19.8)    (1720, 2.20, 13.7)     51%
        High  High      1       (2.86, 10.4, 12.7)    (2.54, 4.32, 12.1)     58%
      
        Low   Low       2       (1.98, 3.20, 6.89)    (1.68, 2.41, 8.89)     24%
        High  Low       2       (2.48, 5.28, 22.4)    (2150, 3.69, 14.6)     30%
        High  High      2       (3.32, 8.09, 13.9)    (2.80, 5.15, 13.7)     36%
      
      where:
      
        1 event type  = cycles
        2 event types = cycles,intel_cqm/llc_occupancy/
      
         Contention L2 Low: workset  <  L2 cache size.
                       High:  "     >>  L2   "     " .
         Contention L3 Low: workset of task on all sockets  <  L3 cache size.
                       High:   "     "   "   "   "    "    >>  L3   "     " .
      
         Median Speedup is (50%ile Before - 50%ile After) /  50%ile Before
      
      Unsurprisingly, the benefits of this optimization decrease with the number
      of cpuctxs with a cgroup events, yet, is never detrimental.
      Tested-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NDavid Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Cc: Vince Weaver <vince@deater.net>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/20170118192454.58008-2-davidcc@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      058fe1c0
    • P
      perf/core: Fix PERF_RECORD_MMAP2 prot/flags for anonymous memory · 0b3589be
      Peter Zijlstra 提交于
      Andres reported that MMAP2 records for anonymous memory always have
      their protection field 0.
      
      Turns out, someone daft put the prot/flags generation code in the file
      branch, leaving them unset for anonymous memory.
      Reported-by: NAndres Freund <andres@anarazel.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Don Zickus <dzickus@redhat.com
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@gmail.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: anton@ozlabs.org
      Cc: namhyung@kernel.org
      Cc: stable@vger.kernel.org # v3.16+
      Fixes: f972eb63 ("perf: Pass protection and flags bits through mmap2 interface")
      Link: http://lkml.kernel.org/r/20170126221508.GF6536@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0b3589be
    • P
      perf/core: Fix use-after-free bug · a76a82a3
      Peter Zijlstra 提交于
      Dmitry reported a KASAN use-after-free on event->group_leader.
      
      It turns out there's a hole in perf_remove_from_context() due to
      event_function_call() not calling its function when the task
      associated with the event is already dead.
      
      In this case the event will have been detached from the task, but the
      grouping will have been retained, such that group operations might
      still work properly while there are live child events etc.
      
      This does however mean that we can miss a perf_group_detach() call
      when the group decomposes, this in turn can then lead to
      use-after-free.
      
      Fix it by explicitly doing the group detach if its still required.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Tested-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org # v4.5+
      Cc: syzkaller <syzkaller@googlegroups.com>
      Fixes: 63b6da39 ("perf: Fix perf_event_exit_task() race")
      Link: http://lkml.kernel.org/r/20170126153955.GD6515@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a76a82a3
  3. 14 1月, 2017 3 次提交
    • J
      perf/x86/intel: Account interrupts for PEBS errors · 475113d9
      Jiri Olsa 提交于
      It's possible to set up PEBS events to get only errors and not
      any data, like on SNB-X (model 45) and IVB-EP (model 62)
      via 2 perf commands running simultaneously:
      
          taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10
      
      This leads to a soft lock up, because the error path of the
      intel_pmu_drain_pebs_nhm() does not account event->hw.interrupt
      for error PEBS interrupts, so in case you're getting ONLY
      errors you don't have a way to stop the event when it's over
      the max_samples_per_tick limit:
      
        NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
        ...
        RIP: 0010:[<ffffffff81159232>]  [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
        ...
        Call Trace:
         ? trace_hardirqs_on_caller+0xf5/0x1b0
         ? perf_cgroup_attach+0x70/0x70
         perf_install_in_context+0x199/0x1b0
         ? ctx_resched+0x90/0x90
         SYSC_perf_event_open+0x641/0xf90
         SyS_perf_event_open+0x9/0x10
         do_syscall_64+0x6c/0x1f0
         entry_SYSCALL64_slow_path+0x25/0x25
      
      Add perf_event_account_interrupt() which does the interrupt
      and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
      error path.
      
      We keep the pending_kill and pending_wakeup logic only in the
      __perf_event_overflow() path, because they make sense only if
      there's any data to deliver.
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vince@deater.net>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      475113d9
    • P
      perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race · 321027c1
      Peter Zijlstra 提交于
      Di Shen reported a race between two concurrent sys_perf_event_open()
      calls where both try and move the same pre-existing software group
      into a hardware context.
      
      The problem is exactly that described in commit:
      
        f63a8daa ("perf: Fix event->ctx locking")
      
      ... where, while we wait for a ctx->mutex acquisition, the event->ctx
      relation can have changed under us.
      
      That very same commit failed to recognise sys_perf_event_context() as an
      external access vector to the events and thereby didn't apply the
      established locking rules correctly.
      
      So while one sys_perf_event_open() call is stuck waiting on
      mutex_lock_double(), the other (which owns said locks) moves the group
      about. So by the time the former sys_perf_event_open() acquires the
      locks, the context we've acquired is stale (and possibly dead).
      
      Apply the established locking rules as per perf_event_ctx_lock_nested()
      to the mutex_lock_double() for the 'move_group' case. This obviously means
      we need to validate state after we acquire the locks.
      
      Reported-by: Di Shen (Keen Lab)
      Tested-by: NJohn Dias <joaodias@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Min Chong <mchong@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: f63a8daa ("perf: Fix event->ctx locking")
      Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      321027c1
    • P
      perf/core: Fix sys_perf_event_open() vs. hotplug · 63cae12b
      Peter Zijlstra 提交于
      There is problem with installing an event in a task that is 'stuck' on
      an offline CPU.
      
      Blocked tasks are not dis-assosciated from offlined CPUs, after all, a
      blocked task doesn't run and doesn't require a CPU etc.. Only on
      wakeup do we ammend the situation and place the task on a available
      CPU.
      
      If we hit such a task with perf_install_in_context() we'll loop until
      either that task wakes up or the CPU comes back online, if the task
      waking depends on the event being installed, we're stuck.
      
      While looking into this issue, I also spotted another problem, if we
      hit a task with perf_install_in_context() that is in the middle of
      being migrated, that is we observe the old CPU before sending the IPI,
      but run the IPI (on the old CPU) while the task is already running on
      the new CPU, things also go sideways.
      
      Rework things to rely on task_curr() -- outside of rq->lock -- which
      is rather tricky. Imagine the following scenario where we're trying to
      install the first event into our task 't':
      
      CPU0            CPU1            CPU2
      
                      (current == t)
      
      t->perf_event_ctxp[] = ctx;
      smp_mb();
      cpu = task_cpu(t);
      
                      switch(t, n);
                                      migrate(t, 2);
                                      switch(p, t);
      
                                      ctx = t->perf_event_ctxp[]; // must not be NULL
      
      smp_function_call(cpu, ..);
      
                      generic_exec_single()
                        func();
                          spin_lock(ctx->lock);
                          if (task_curr(t)) // false
      
                          add_event_to_ctx();
                          spin_unlock(ctx->lock);
      
                                      perf_event_context_sched_in();
                                        spin_lock(ctx->lock);
                                        // sees event
      
      So its CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
      Because if CPU2's load of that variable were to observe NULL, it would
      not try to schedule the ctx and we'd have a task running without its
      counter, which would be 'bad'.
      
      As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
      first and not see the event yet, then CPU0 must observe task_curr()
      and retry. If the install happens first, then we must see the event on
      sched-in and all is well.
      
      I think we can translate the first part (until the 'must not be NULL')
      of the scenario to a litmus test like:
      
        C C-peterz
      
        {
        }
      
        P0(int *x, int *y)
        {
                int r1;
      
                WRITE_ONCE(*x, 1);
                smp_mb();
                r1 = READ_ONCE(*y);
        }
      
        P1(int *y, int *z)
        {
                WRITE_ONCE(*y, 1);
                smp_store_release(z, 1);
        }
      
        P2(int *x, int *z)
        {
                int r1;
                int r2;
      
                r1 = smp_load_acquire(z);
      	  smp_mb();
                r2 = READ_ONCE(*x);
        }
      
        exists
        (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
      
      Where:
        x is perf_event_ctxp[],
        y is our tasks's CPU, and
        z is our task being placed on the rq of CPU2.
      
      The P0 smp_mb() is the one added by this patch, ordering the store to
      perf_event_ctxp[] from find_get_context() and the load of task_cpu()
      in task_function_call().
      
      The smp_store_release/smp_load_acquire model the RCpc locking of the
      rq->lock and the smp_mb() of P2 is the context switch switching from
      whatever CPU2 was running to our task 't'.
      
      This litmus test evaluates into:
      
        Test C-peterz Allowed
        States 7
        0:r1=0; 2:r1=0; 2:r2=0;
        0:r1=0; 2:r1=0; 2:r2=1;
        0:r1=0; 2:r1=1; 2:r2=1;
        0:r1=1; 2:r1=0; 2:r2=0;
        0:r1=1; 2:r1=0; 2:r2=1;
        0:r1=1; 2:r1=1; 2:r2=0;
        0:r1=1; 2:r1=1; 2:r2=1;
        No
        Witnesses
        Positive: 0 Negative: 7
        Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
        Observation C-peterz Never 0 7
        Hash=e427f41d9146b2a5445101d3e2fcaa34
      
      And the strong and weak model agree.
      Reported-by: NMark Rutland <mark.rutland@arm.com>
      Tested-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: jeremy.linton@arm.com
      Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      63cae12b
  4. 06 12月, 2016 1 次提交
  5. 05 12月, 2016 1 次提交
  6. 28 11月, 2016 1 次提交
  7. 21 11月, 2016 1 次提交
  8. 15 11月, 2016 1 次提交
    • D
      perf/core: Do not set cpuctx->cgrp for unscheduled cgroups · 864c2357
      David Carrillo-Cisneros 提交于
      Commit:
      
        db4a8356 ("perf/core: Set cgroup in CPU contexts for new cgroup events")
      
      failed to verify that event->cgrp is actually the scheduled cgroup
      in a CPU before setting cpuctx->cgrp. This patch fixes that.
      
      Now that there is a different path for scheduled and unscheduled
      cgroup, add a warning to catch when cpuctx->cgrp is still set after
      the last cgroup event has been unsheduled.
      
      To verify the bug:
      
        # Create 2 cgroups.
        mkdir /dev/cgroups/devices/g1
        mkdir /dev/cgroups/devices/g2
      
        # launch a task, bind it to a cpu and move it to g1
        CPU=2
        while :; do : ; done &
        P=$!
      
        taskset -pc $CPU $P
        echo $P > /dev/cgroups/devices/g1/tasks
      
        # monitor g2 (it runs no tasks) and observe output
        perf stat -e cycles -I 1000 -C $CPU -G g2
      
        #           time             counts unit events
           1.000091408          7,579,527      cycles                    g2
           2.000350111      <not counted>      cycles                    g2
           3.000589181      <not counted>      cycles                    g2
           4.000771428      <not counted>      cycles                    g2
      
        # note first line that displays that a task run in g2, despite
        # g2 having no tasks. This is because cpuctx->cgrp was wrongly
        # set when context of new event was installed.
        # After applying the fix we obtain the right output:
      
        perf stat -e cycles -I 1000 -C $CPU -G g2
        #           time             counts unit events
           1.000119615      <not counted>      cycles                    g2
           2.000389430      <not counted>      cycles                    g2
           3.000590962      <not counted>      cycles                    g2
      Signed-off-by: NDavid Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nilay Vaish <nilayvaish@gmail.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Link: http://lkml.kernel.org/r/1478026378-86083-1-git-send-email-davidcc@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      864c2357
  9. 28 10月, 2016 2 次提交
    • J
      perf/powerpc: Don't call perf_event_disable() from atomic context · 5aab90ce
      Jiri Olsa 提交于
      The trinity syscall fuzzer triggered following WARN() on powerpc:
      
        WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278
        ...
        NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0
        LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0
        Call Trace:
        [c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable)
        [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
        [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
        [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
        [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
        [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
      
      Followed by a lockdep warning:
      
        ===============================
        [ INFO: suspicious RCU usage. ]
        4.8.0-rc5+ #7 Tainted: G        W
        -------------------------------
        ./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 1, debug_locks = 0
        2 locks held by ls/2998:
         #0:  (rcu_read_lock){......}, at: [<c0000000000f6a00>] .__atomic_notifier_call_chain+0x0/0x1c0
         #1:  (rcu_read_lock){......}, at: [<c00000000093ac50>] .hw_breakpoint_handler+0x0/0x2b0
      
        stack backtrace:
        CPU: 9 PID: 2998 Comm: ls Tainted: G        W       4.8.0-rc5+ #7
        Call Trace:
        [c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable)
        [c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180
        [c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0
        [c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0
        [c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380
        [c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60
        [c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0
        [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
        [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
        [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
        [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
        [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
      
      While it looks like the first WARN() is probably valid, the other one is
      triggered by disabling event via perf_event_disable() from atomic context.
      
      The event is disabled here in case we were not able to emulate
      the instruction that hit the breakpoint. By disabling the event
      we unschedule the event and make sure it's not scheduled back.
      
      But we can't call perf_event_disable() from atomic context, instead
      we need to use the event's pending_disable irq_work method to disable it.
      Reported-by: NJan Stancek <jstancek@redhat.com>
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20161026094824.GA21397@kravaSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5aab90ce
    • J
      perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix... · 0933840a
      Jiri Olsa 提交于
      perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic
      
      CAI Qian reported a crash in the PMU uncore device removal code,
      enabled by the CONFIG_DEBUG_TEST_DRIVER_REMOVE=y option:
      
        https://marc.info/?l=linux-kernel&m=147688837328451
      
      The reason for the crash is that perf_pmu_unregister() tries to remove
      a PMU device which is not added at this point. We add PMU devices
      only after pmu_bus is registered, which happens in the
      perf_event_sysfs_init() call and sets the 'pmu_bus_running' flag.
      
      The fix is to get the 'pmu_bus_running' flag state at the point
      the PMU is taken out of the PMU list and remove the device
      later only if it's set.
      Reported-by: NCAI Qian <caiqian@redhat.com>
      Tested-by: NCAI Qian <caiqian@redhat.com>
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20161020111011.GA13361@kravaSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0933840a
  10. 22 9月, 2016 1 次提交
  11. 10 9月, 2016 1 次提交
    • A
      perf/core: Fix a race between mmap_close() and set_output() of AUX events · 767ae086
      Alexander Shishkin 提交于
      In the mmap_close() path we need to stop all the AUX events that are
      writing data to the AUX area that we are unmapping, before we can
      safely free the pages. To determine if an event needs to be stopped,
      we're comparing its ->rb against the one that's getting unmapped.
      However, a SET_OUTPUT ioctl may turn up inside an AUX transaction
      and swizzle event::rb to some other ring buffer, but the transaction
      will keep writing data to the old ring buffer until the event gets
      scheduled out. At this point, mmap_close() will skip over such an
      event and will proceed to free the AUX area, while it's still being
      used by this event, which will set off a warning in the mmap_close()
      path and cause a memory corruption.
      
      To avoid this, always stop an AUX event before its ->rb is updated;
      this will release the (potentially) last reference on the AUX area
      of the buffer. If the event gets restarted, its new ring buffer will
      be used. If another SET_OUTPUT comes and switches it back to the
      old ring buffer that's getting unmapped, it's also fine: this
      ring buffer's aux_mmap_count will be zero and AUX transactions won't
      start any more.
      Reported-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160906132353.19887-2-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      767ae086
  12. 07 9月, 2016 1 次提交
  13. 05 9月, 2016 2 次提交
  14. 03 9月, 2016 1 次提交
    • A
      perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs · aa6a5f3c
      Alexei Starovoitov 提交于
      Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events
      via overflow_handler mechanism.
      When program is attached the overflow_handlers become stacked.
      The program acts as a filter.
      Returning zero from the program means that the normal perf_event_output handler
      will not be called and sampling event won't be stored in the ring buffer.
      
      The overflow_handler_context==NULL is an additional safety check
      to make sure programs are not attached to hw breakpoints and watchdog
      in case other checks (that prevent that now anyway) get accidentally
      relaxed in the future.
      
      The program refcnt is incremented in case perf_events are inhereted
      when target task is forked.
      Similar to kprobe and tracepoint programs there is no ioctl to
      detach the program or swap already attached program. The user space
      expected to close(perf_event_fd) like it does right now for kprobe+bpf.
      That restriction simplifies the code quite a bit.
      
      The invocation of overflow_handler in __perf_event_overflow() is now
      done via READ_ONCE, since that pointer can be replaced when the program
      is attached while perf_event itself could have been active already.
      There is no need to do similar treatment for event->prog, since it's
      assigned only once before it's accessed.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aa6a5f3c
  15. 24 8月, 2016 1 次提交
    • W
      perf/core: Use this_cpu_ptr() when stopping AUX events · 8b6a3fe8
      Will Deacon 提交于
      When tearing down an AUX buf for an event via perf_mmap_close(),
      __perf_event_output_stop() is called on the event's CPU to ensure that
      trace generation is halted before the process of unmapping and
      freeing the buffer pages begins.
      
      The callback is performed via cpu_function_call(), which ensures that it
      runs with interrupts disabled and is therefore not preemptible.
      Unfortunately, the current code grabs the per-cpu context pointer using
      get_cpu_ptr(), which unnecessarily disables preemption and doesn't pair
      the call with put_cpu_ptr(), leading to a preempt_count() imbalance and
      a BUG when freeing the AUX buffer later on:
      
        WARNING: CPU: 1 PID: 2249 at kernel/events/ring_buffer.c:539 __rb_free_aux+0x10c/0x120
        Modules linked in:
        [...]
        Call Trace:
         [<ffffffff813379dd>] dump_stack+0x4f/0x72
         [<ffffffff81059ff6>] __warn+0xc6/0xe0
         [<ffffffff8105a0c8>] warn_slowpath_null+0x18/0x20
         [<ffffffff8112761c>] __rb_free_aux+0x10c/0x120
         [<ffffffff81128163>] rb_free_aux+0x13/0x20
         [<ffffffff8112515e>] perf_mmap_close+0x29e/0x2f0
         [<ffffffff8111da30>] ? perf_iterate_ctx+0xe0/0xe0
         [<ffffffff8115f685>] remove_vma+0x25/0x60
         [<ffffffff81161796>] exit_mmap+0x106/0x140
         [<ffffffff8105725c>] mmput+0x1c/0xd0
         [<ffffffff8105cac3>] do_exit+0x253/0xbf0
         [<ffffffff8105e32e>] do_group_exit+0x3e/0xb0
         [<ffffffff81068d49>] get_signal+0x249/0x640
         [<ffffffff8101c273>] do_signal+0x23/0x640
         [<ffffffff81905f42>] ? _raw_write_unlock_irq+0x12/0x30
         [<ffffffff81905f69>] ? _raw_spin_unlock_irq+0x9/0x10
         [<ffffffff81901896>] ? __schedule+0x2c6/0x710
         [<ffffffff810022a4>] exit_to_usermode_loop+0x74/0x90
         [<ffffffff81002a56>] prepare_exit_to_usermode+0x26/0x30
         [<ffffffff81906d1b>] retint_user+0x8/0x10
      
      This patch uses this_cpu_ptr() instead of get_cpu_ptr(), since preemption is
      already disabled by the caller.
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Reviewed-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 95ff4ca2 ("perf/core: Free AUX pages in unmap path")
      Link: http://lkml.kernel.org/r/20160824091905.GA16944@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8b6a3fe8
  16. 18 8月, 2016 8 次提交
    • D
      perf/core: Introduce PMU_EV_CAP_READ_ACTIVE_PKG · d6a2f903
      David Carrillo-Cisneros 提交于
      Introduce the flag PMU_EV_CAP_READ_ACTIVE_PKG, useful for uncore events,
      that allows a PMU to signal the generic perf code that an event is readable
      in the current CPU if the event is active in a CPU in the same package as
      the current CPU.
      
      This is an optimization that avoids a unnecessary IPI for the common case
      where uncore events are run and read in the same package but in
      different CPUs.
      
      As an example, the IPI removal speeds up perf_read() in my Haswell system
      as follows:
      
        - For event UNC_C_LLC_LOOKUP: From 260 us to 31 us.
        - For event RAPL's power/energy-cores/: From to 255 us to 27 us.
      
      For the optimization to work, all events in the group must have it
      (similarly to PERF_EV_CAP_SOFTWARE).
      Signed-off-by: NDavid Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1471467307-61171-4-git-send-email-davidcc@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d6a2f903
    • D
      perf/core: Generalize event->group_flags · 4ff6a8de
      David Carrillo-Cisneros 提交于
      Currently, PERF_GROUP_SOFTWARE is used in the group_flags field of a
      group's leader to indicate that is_software_event(event) is true for all
      events in a group. This is the only usage of event->group_flags.
      
      This pattern of setting a group level flags when all events in the group
      share a property is useful for the flag introduced in the next patch and
      for future CQM/CMT flags. So this patches generalizes group_flags to work
      as an aggregate of event level flags.
      
      PERF_GROUP_SOFTWARE denotes an inmutable event's property. All other flags
      that I intend to add are also determinable at event initialization.
      To better convey the above, this patch renames event's group_flags to
      group_caps and PERF_GROUP_SOFTWARE to PERF_EV_CAP_SOFTWARE.
      
      Individual event flags are stored in the new event->event_caps. Since the
      cap flags do not change after event initialization, there is no need to
      serialize event_caps. This new field is used when events are added to a
      context, similarly to how PERF_GROUP_SOFTWARE and is_software_event()
      worked.
      
      Lastly, for consistency, updates is_software_event() to rely in event_cap
      instead of the context index.
      Signed-off-by: NDavid Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1471467307-61171-3-git-send-email-davidcc@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      4ff6a8de
    • M
      bitmap.h, perf/core: Fix the mask in perf_output_sample_regs() · 29dd3288
      Madhavan Srinivasan 提交于
      When decoding the perf_regs mask in perf_output_sample_regs(),
      we loop through the mask using find_first_bit and find_next_bit functions.
      
      While the exisiting code works fine in most of the case, the logic
      is broken for big-endian 32-bit kernels.
      
      When reading a u64 mask using (u32 *)(&val)[0], find_*_bit() assumes
      that it gets the lower 32 bits of u64, but instead it gets the upper
      32 bits - which is wrong.
      
      The fix is to swap the words of the u64 to handle this case.
      This is _not_ a regular endianness swap.
      Suggested-by: NYury Norov <ynorov@caviumnetworks.com>
      Signed-off-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NYury Norov <ynorov@caviumnetworks.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/1471426568-31051-2-git-send-email-maddy@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      29dd3288
    • D
      perf/core: Check return value of the perf_event_read() IPI · 71e7bc2b
      David Carrillo-Cisneros 提交于
      The call to smp_call_function_single in perf_event_read() may fail if
      an invalid or not online CPU index is passed. Warn user if such bug is
      present and return error.
      Signed-off-by: NDavid Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1471467307-61171-2-git-send-email-davidcc@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      71e7bc2b
    • M
      perf/core: Enable mapping of the stop filters · 99f5bc9b
      Mathieu Poirier 提交于
      At this time the perf_addr_filter_needs_mmap() function will _not_
      return true on a user space 'stop' filter.  But stop filters need
      exactly the same kind of mapping that range and start filters get.
      Signed-off-by: NMathieu Poirier <mathieu.poirier@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1468860187-318-4-git-send-email-mathieu.poirier@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      99f5bc9b
    • M
      perf/core: Update filters only on executable mmap · 12b40a23
      Mathieu Poirier 提交于
      Function perf_event_mmap() is called by the MM subsystem each time
      part of a binary is loaded in memory.  There can be several mapping
      for a binary, many times unrelated to the code section.
      
      Each time a section of a binary is mapped address filters are
      updated, event when the map doesn't pertain to the code section.
      The end result is that filters are configured based on the last map
      event that was received rather than the last mapping of the code
      segment.
      
      For example if we have an executable 'main' that calls library
      'libcstest.so.1.0', and that we want to collect traces on code
      that is in that library.  The perf cmd line for this scenario
      would be:
      
        perf record -e cs_etm// --filter 'filter 0x72c/0x40@/opt/lib/libcstest.so.1.0' --per-thread ./main
      
      Resulting in binaries being mapped this way:
      
        root@linaro-nano:~# cat /proc/1950/maps
        00400000-00401000 r-xp 00000000 08:02 33169     /home/linaro/main
        00410000-00411000 r--p 00000000 08:02 33169     /home/linaro/main
        00411000-00412000 rw-p 00001000 08:02 33169     /home/linaro/main
        7fa2464000-7fa2474000 rw-p 00000000 00:00 0
        7fa2474000-7fa25a4000 r-xp 00000000 08:02 543   /lib/aarch64-linux-gnu/libc-2.21.so
        7fa25a4000-7fa25b3000 ---p 00130000 08:02 543   /lib/aarch64-linux-gnu/libc-2.21.so
        7fa25b3000-7fa25b7000 r--p 0012f000 08:02 543   /lib/aarch64-linux-gnu/libc-2.21.so
        7fa25b7000-7fa25b9000 rw-p 00133000 08:02 543   /lib/aarch64-linux-gnu/libc-2.21.so
        7fa25b9000-7fa25bd000 rw-p 00000000 00:00 0
        7fa25bd000-7fa25be000 r-xp 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
        7fa25be000-7fa25cd000 ---p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
        7fa25cd000-7fa25ce000 r--p 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
        7fa25ce000-7fa25cf000 rw-p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
        7fa25cf000-7fa25eb000 r-xp 00000000 08:02 574   /lib/aarch64-linux-gnu/ld-2.21.so
        7fa25ef000-7fa25f2000 rw-p 00000000 00:00 0
        7fa25f7000-7fa25f9000 rw-p 00000000 00:00 0
        7fa25f9000-7fa25fa000 r--p 00000000 00:00 0     [vvar]
        7fa25fa000-7fa25fb000 r-xp 00000000 00:00 0     [vdso]
        7fa25fb000-7fa25fc000 r--p 0001c000 08:02 574   /lib/aarch64-linux-gnu/ld-2.21.so
        7fa25fc000-7fa25fe000 rw-p 0001d000 08:02 574   /lib/aarch64-linux-gnu/ld-2.21.so
        7ff2ea8000-7ff2ec9000 rw-p 00000000 00:00 0     [stack]
        root@linaro-nano:~#
      
      Before 'main()' can execute 'libcstest.so.1.0' has to be loaded in
      memory.  Once that has been done perf_event_mmap() has been called
      4 times, with the last map starting at address 0x7fa25ce000 and
      the address filter configured to start filtering when the
      IP has passed over address 0x0x7fa25ce72c (0x7fa25ce000 + 0x72c).
      
      But that is wrong since the code segment for library 'libcstest.so.1.0'
      as been mapped at 0x7fa25bd000, resulting in traces not being
      collected.
      
      This patch corrects the situation by requesting that address
      filters be updated only if the mapped event is for a code
      segment.
      Signed-off-by: NMathieu Poirier <mathieu.poirier@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1468860187-318-3-git-send-email-mathieu.poirier@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      12b40a23
    • M
      perf/core: Fix file name handling for start/stop filters · 4059ffd0
      Mathieu Poirier 提交于
      Binary file names have to be supplied for both range and start/stop
      filters but the current code only processes the filename if an
      address range filter is specified.  This code adds processing of
      the filename for start/stop filters.
      Signed-off-by: NMathieu Poirier <mathieu.poirier@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1468860187-318-2-git-send-email-mathieu.poirier@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      4059ffd0
    • P
      perf/core: Fix event_function_local() · cca20946
      Peter Zijlstra 提交于
      Vincent reported triggering the WARN_ON_ONCE() in event_function_local().
      
      While thinking through cases I noticed that by using event_function()
      directly, we miss the inactive case usually handled by
      event_function_call().
      
      Therefore construct a blend of event_function_call() and
      event_function() that handles the cases relevant to
      event_function_local().
      Reported-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org # 4.5+
      Fixes: fae3fde6 ("perf: Collapse and fix event_function_call() users")
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      cca20946
  17. 10 8月, 2016 5 次提交
    • P
      perf/core: Optimize perf_pmu_sched_task() · e48c1788
      Peter Zijlstra 提交于
      For perf record -b, which requires the pmu::sched_task callback the
      current code is rather expensive:
      
           7.68%  sched-pipe  [kernel.vmlinux]    [k] perf_pmu_sched_task
           5.95%  sched-pipe  [kernel.vmlinux]    [k] __switch_to
           5.20%  sched-pipe  [kernel.vmlinux]    [k] __intel_pmu_disable_all
           3.95%  sched-pipe  perf                [.] worker_thread
      
      The problem is that it will iterate all registered PMUs, most of which
      will not have anything to do. Avoid this by keeping an explicit list
      of PMUs that have requested the callback.
      
      The perf_sched_cb_{inc,dec}() functions already takes the required pmu
      argument, and now that these functions are no longer called from NMI
      context we can use them to manage a list.
      
      With this patch applied the function doesn't show up in the top 4
      anymore (it dropped to 18th place).
      
           6.67%  sched-pipe  [kernel.vmlinux]    [k] __switch_to
           6.18%  sched-pipe  [kernel.vmlinux]    [k] __intel_pmu_disable_all
           3.92%  sched-pipe  [kernel.vmlinux]    [k] switch_mm_irqs_off
           3.71%  sched-pipe  perf                [.] worker_thread
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      e48c1788
    • P
      perf/x86/intel: Rework the large PEBS setup code · 09e61b4f
      Peter Zijlstra 提交于
      In order to allow optimizing perf_pmu_sched_task() we must ensure
      perf_sched_cb_{inc,dec}() are no longer called from NMI context; this
      means that pmu::{start,stop}() can no longer use them.
      
      Prepare for this by reworking the whole large PEBS setup code.
      
      The current code relied on the cpuc->pebs_enabled state, however since
      that reflects the current active state as per pmu::{start,stop}() we
      can no longer rely on this.
      
      Introduce two counters: cpuc->n_pebs and cpuc->n_large_pebs which
      count the total number of PEBS events and the number of PEBS events
      that have FREERUNNING set, resp.. With this we can tell if the current
      setup requires a single record interrupt threshold or can use a larger
      buffer.
      
      This also improves the code in that it re-enables the large threshold
      once the PEBS event that required single record gets removed.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      09e61b4f
    • M
      perf/core: Sched out groups atomically · 3f005e7d
      Mark Rutland 提交于
      Groups of events are supposed to be scheduled atomically, such that it
      is possible to derive meaningful ratios between their values.
      
      We take great pains to achieve this when scheduling event groups to a
      PMU in group_sched_in(), calling {start,commit}_txn() (which fall back
      to perf_pmu_{disable,enable}() if necessary) to provide this guarantee.
      However we don't mirror this in group_sched_out(), and in some cases
      events will not be scheduled out atomically.
      
      For example, if we disable an event group with PERF_EVENT_IOC_DISABLE,
      we'll cross-call __perf_event_disable() for the group leader, and will
      call group_sched_out() without having first disabled the relevant PMU.
      We will disable/enable the PMU around each pmu->del() call, but between
      each call the PMU will be enabled and events may count.
      
      Avoid this by explicitly disabling and enabling the PMU around event
      removal in group_sched_out(), mirroring what we do in group_sched_in().
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1469553141-28314-1-git-send-email-mark.rutland@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3f005e7d
    • D
      perf/core: Set cgroup in CPU contexts for new cgroup events · db4a8356
      David Carrillo-Cisneros 提交于
      There's a perf stat bug easy to observer on a machine with only one cgroup:
      
        $ perf stat -e cycles -I 1000 -C 0 -G /
        #          time             counts unit events
            1.000161699      <not counted>      cycles                    /
            2.000355591      <not counted>      cycles                    /
            3.000565154      <not counted>      cycles                    /
            4.000951350      <not counted>      cycles                    /
      
      We'd expect some output there.
      
      The underlying problem is that there is an optimization in
      perf_cgroup_sched_{in,out}() that skips the switch of cgroup events
      if the old and new cgroups in a task switch are the same.
      
      This optimization interacts with the current code in two ways
      that cause a CPU context's cgroup (cpuctx->cgrp) to be NULL even if a
      cgroup event matches the current task. These are:
      
        1. On creation of the first cgroup event in a CPU: In current code,
        cpuctx->cpu is only set in perf_cgroup_sched_in, but due to the
        aforesaid optimization, perf_cgroup_sched_in will run until the next
        cgroup switches in that CPU. This may happen late or never happen,
        depending on system's number of cgroups, CPU load, etc.
      
        2. On deletion of the last cgroup event in a cpuctx: In list_del_event,
        cpuctx->cgrp is set NULL. Any new cgroup event will not be sched in
        because cpuctx->cgrp == NULL until a cgroup switch occurs and
        perf_cgroup_sched_in is executed (updating cpuctx->cgrp).
      
      This patch fixes both problems by setting cpuctx->cgrp in list_add_event,
      mirroring what list_del_event does when removing a cgroup event from CPU
      context, as introduced in:
      
        commit 68cacd29 ("perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()")
      
      With this patch, cpuctx->cgrp is always set/clear when installing/removing
      the first/last cgroup event in/from the CPU context. With cpuctx->cgrp
      correctly set, event_filter_match works as intended when events are
      sched in/out.
      
      After the fix, the output is as expected:
      
        $ perf stat -e cycles -I 1000 -a -G /
        #         time             counts unit events
           1.004699159          627342882      cycles                    /
           2.007397156          615272690      cycles                    /
           3.010019057          616726074      cycles                    /
      Signed-off-by: NDavid Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1470124092-113192-1-git-send-email-davidcc@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      db4a8356
    • P
      perf/core: Fix sideband list-iteration vs. event ordering NULL pointer deference crash · 0b8f1e2e
      Peter Zijlstra 提交于
      Vegard Nossum reported that perf fuzzing generates a NULL
      pointer dereference crash:
      
      > Digging a bit deeper into this, it seems the event itself is getting
      > created by perf_event_open() and it gets added to the pmu_event_list
      > through:
      >
      > perf_event_open()
      >  - perf_event_alloc()
      >     - account_event()
      >        - account_pmu_sb_event()
      >           - attach_sb_event()
      >
      > so at this point the event is being attached but its ->ctx is still
      > NULL. It seems like ->ctx is set just a bit later in
      > perf_event_open(), though.
      >
      > But before that, __schedule() comes along and creates a stack trace
      > similar to the one above:
      >
      > __schedule()
      >  - __perf_event_task_sched_out()
      >    - perf_iterate_sb()
      >      - perf_iterate_sb_cpu()
      >         - event_filter_match()
      >           - perf_cgroup_match()
      >             - __get_cpu_context()
      >               - (dereference ctx which is NULL)
      >
      > So I guess the question is... should the event be attached (= put on
      > the list) before ->ctx gets set? Or should the cgroup code check for a
      > NULL ->ctx?
      
      The latter seems like the simplest solution. Moving the list-add later
      creates a bit of a mess.
      Reported-by: NVegard Nossum <vegard.nossum@gmail.com>
      Tested-by: NVegard Nossum <vegard.nossum@gmail.com>
      Tested-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: f2fb6bef ("perf/core: Optimize side-band event delivery")
      Link: http://lkml.kernel.org/r/20160804123724.GN6862@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0b8f1e2e
  18. 02 8月, 2016 1 次提交
    • D
      perf/core: Change log level for duration warning to KERN_INFO · 0d87d7ec
      David Ahern 提交于
      When the perf interrupt handler exceeds a threshold warning messages
      are displayed on console:
      
        [12739.31793] perf interrupt took too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
        [71340.165065] perf interrupt took too long (5005 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
      
      Many customers and users are confused by the message wondering if
      something is wrong or they need to take action to fix a problem.
      Since a user can not do anything to fix the issue, the message is really
      more informational than a warning. Adjust the log level accordingly.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1470084569-438-1-git-send-email-dsa@cumulusnetworks.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0d87d7ec