1. 18 5月, 2015 1 次提交
    • P
      sched,perf: Fix periodic timers · 4cfafd30
      Peter Zijlstra 提交于
      In the below two commits (see Fixes) we have periodic timers that can
      stop themselves when they're no longer required, but need to be
      (re)-started when their idle condition changes.
      
      Further complications is that we want the timer handler to always do
      the forward such that it will always correctly deal with the overruns,
      and we do not want to race such that the handler has already decided
      to stop, but the (external) restart sees the timer still active and we
      end up with a 'lost' timer.
      
      The problem with the current code is that the re-start can come before
      the callback does the forward, at which point the forward from the
      callback will WARN about forwarding an enqueued timer.
      
      Now, conceptually its easy to detect if you're before or after the fwd
      by comparing the expiration time against the current time. Of course,
      that's expensive (and racy) because we don't have the current time.
      
      Alternatively one could cache this state inside the timer, but then
      everybody pays the overhead of maintaining this extra state, and that
      is undesired.
      
      The only other option that I could see is the external timer_active
      variable, which I tried to kill before. I would love a nicer interface
      for this seemingly simple 'problem' but alas.
      
      Fixes: 272325c4 ("perf: Fix mux_interval hrtimer wreckage")
      Fixes: 77a4d1a1 ("sched: Cleanup bandwidth timers")
      Cc: pjt@google.com
      Cc: tglx@linutronix.de
      Cc: klamm@yandex-team.ru
      Cc: mingo@kernel.org
      Cc: bsegall@google.com
      Cc: hpa@zytor.com
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
      4cfafd30
  2. 04 5月, 2015 1 次提交
  3. 23 4月, 2015 1 次提交
  4. 22 4月, 2015 2 次提交
    • P
      perf: Fix mux_interval hrtimer wreckage · 272325c4
      Peter Zijlstra 提交于
      Thomas stumbled over the hrtimer_forward_now() in
      perf_event_mux_interval_ms_store() and noticed its broken-ness.
      
      You cannot just change the expiry time of an active timer, it will
      destroy the red-black tree order and cause havoc.
      
      Change it to (re)start the timer instead, (re)starting a timer will
      dequeue and enqueue a timer and therefore preserve rb-tree order.
      
      Since we cannot enqueue remotely, wrap the thing in
      cpu_function_call(), this however mandates that we restrict ourselves
      to online cpus. Also serialize the entire setting so we don't get
      multiple concurrent threads trying to update to different values.
      
      Also fix a problem in perf_mux_hrtimer_restart(), checking against
      hrtimer_active() can actually loose us the timer when timer->state ==
      HRTIMER_STATE_CALLBACK and the callback has already decided NORESTART.
      
      Furthermore it doesn't make any sense to test
      hrtimer_callback_running() when we already tested hrtimer_active(),
      but with the above change, we explicitly must call it when
      callback_running.
      
      Lastly, rename a few functions:
      
        s/perf_cpu_hrtimer_/perf_mux_hrtimer_/ -- because I could not find
                                                  the mux timer function
      
        s/\<hr\>/timer/ -- because that's the normal way of calling things.
      
      Fixes: 62b85639 ("perf: Add sysfs entry to adjust multiplexing interval per PMU")
      Reported-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20150415095011.863052571@infradead.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      272325c4
    • T
      perf: core: Use hrtimer_start() · 3497d206
      Thomas Gleixner 提交于
      hrtimer_start() does not longer defer already expired timers to the
      softirq. Get rid of the __hrtimer_start_range_ns() invocation.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/r/20150414203502.452104213@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      3497d206
  5. 02 4月, 2015 8 次提交
    • A
      perf: Add ITRACE_START record to indicate that tracing has started · ec0d7729
      Alexander Shishkin 提交于
      For counters that generate AUX data that is bound to the context of a
      running task, such as instruction tracing, the decoder needs to know
      exactly which task is running when the event is first scheduled in,
      before the first sched_switch. The decoder's need to know this stems
      from the fact that instruction flow trace decoding will almost always
      require program's object code in order to reconstruct said flow and
      for that we need at least its pid/tid in the perf stream.
      
      To single out such instruction tracing pmus, this patch introduces
      ITRACE PMU capability. The reason this is not part of RECORD_AUX
      record is that not all pmus capable of generating AUX data need this,
      and the opposite is *probably* also true.
      
      While sched_switch covers for most cases, there are two problems with it:
      the consumer will need to process events out of order (that is, having
      found RECORD_AUX, it will have to skip forward to the nearest sched_switch
      to figure out which task it was, then go back to the actual trace to
      decode it) and it completely misses the case when the tracing is enabled
      and disabled before sched_switch, for example, via PERF_EVENT_IOC_DISABLE.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-15-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ec0d7729
    • A
      perf: Add wakeup watermark control to the AUX area · 1a594131
      Alexander Shishkin 提交于
      When AUX area gets a certain amount of new data, we want to wake up
      userspace to collect it. This adds a new control to specify how much
      data will cause a wakeup. This is then passed down to pmu drivers via
      output handle's "wakeup" field, so that the driver can find the nearest
      point where it can generate an interrupt.
      
      We repurpose __reserved_2 in the event attribute for this, even though
      it was never checked to be zero before, aux_watermark will only matter
      for new AUX-aware code, so the old code should still be fine.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-10-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1a594131
    • A
      perf: Add API for PMUs to write to the AUX area · fdc26706
      Alexander Shishkin 提交于
      For pmus that wish to write data to ring buffer's AUX area, provide
      perf_aux_output_{begin,end}() calls to initiate/commit data writes,
      similarly to perf_output_{begin,end}. These also use the same output
      handle structure. Also, similarly to software counterparts, these
      will direct inherited events' output to parents' ring buffers.
      
      After the perf_aux_output_begin() returns successfully, handle->size
      is set to the maximum amount of data that can be written wrt aux_tail
      pointer, so that no data that the user hasn't seen will be overwritten,
      therefore this should always be called before hardware writing is
      enabled. On success, this will return the pointer to pmu driver's
      private structure allocated for this aux area by pmu::setup_aux. Same
      pointer can also be retrieved using perf_get_aux() while hardware
      writing is enabled.
      
      PMU driver should pass the actual amount of data written as a parameter
      to perf_aux_output_end(). All hardware writes should be completed and
      visible before this one is called.
      
      Additionally, perf_aux_output_skip() will adjust output handle and
      aux_head in case some part of the buffer has to be skipped over to
      maintain hardware's alignment constraints.
      
      Nested writers are forbidden and guards are in place to catch such
      attempts.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-8-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fdc26706
    • A
      perf: Add AUX record · 68db7e98
      Alexander Shishkin 提交于
      When there's new data in the AUX space, output a record indicating its
      offset and size and a set of flags, such as PERF_AUX_FLAG_TRUNCATED, to
      mean the described data was truncated to fit in the ring buffer.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-7-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      68db7e98
    • A
      perf: Add a pmu capability for "exclusive" events · bed5b25a
      Alexander Shishkin 提交于
      Usually, pmus that do, for example, instruction tracing, would only ever
      be able to have one event per task per cpu (or per perf_event_context). For
      such pmus it makes sense to disallow creating conflicting events early on,
      so as to provide consistent behavior for the user.
      
      This patch adds a pmu capability that indicates such constraint on event
      creation.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1422613866-113186-1-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      bed5b25a
    • P
      perf: Add AUX area to ring buffer for raw data streams · 45bfb2e5
      Peter Zijlstra 提交于
      This patch introduces "AUX space" in the perf mmap buffer, intended for
      exporting high bandwidth data streams to userspace, such as instruction
      flow traces.
      
      AUX space is a ring buffer, defined by aux_{offset,size} fields in the
      user_page structure, and read/write pointers aux_{head,tail}, which abide
      by the same rules as data_* counterparts of the main perf buffer.
      
      In order to allocate/mmap AUX, userspace needs to set up aux_offset to
      such an offset that will be greater than data_offset+data_size and
      aux_size to be the desired buffer size. Both need to be page aligned.
      Then, same aux_offset and aux_size should be passed to mmap() call and
      if everything adds up, you should have an AUX buffer as a result.
      
      Pages that are mapped into this buffer also come out of user's mlock
      rlimit plus perf_event_mlock_kb allowance.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-3-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      45bfb2e5
    • A
      perf: Add data_{offset,size} to user_page · e8c6deac
      Alexander Shishkin 提交于
      Currently, the actual perf ring buffer is one page into the mmap area,
      following the user page and the userspace follows this convention. This
      patch adds data_{offset,size} fields to user_page that can be used by
      userspace instead for locating perf data in the mmap area. This is also
      helpful when mapping existing or shared buffers if their size is not
      known in advance.
      
      Right now, it is made to follow the existing convention that
      
      	data_offset == PAGE_SIZE and
      	data_offset + data_size == mmap_size.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-2-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e8c6deac
    • A
      tracing, perf: Implement BPF programs attached to kprobes · 2541517c
      Alexei Starovoitov 提交于
      BPF programs, attached to kprobes, provide a safe way to execute
      user-defined BPF byte-code programs without being able to crash or
      hang the kernel in any way. The BPF engine makes sure that such
      programs have a finite execution time and that they cannot break
      out of their sandbox.
      
      The user interface is to attach to a kprobe via the perf syscall:
      
      	struct perf_event_attr attr = {
      		.type	= PERF_TYPE_TRACEPOINT,
      		.config	= event_id,
      		...
      	};
      
      	event_fd = perf_event_open(&attr,...);
      	ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
      
      'prog_fd' is a file descriptor associated with BPF program
      previously loaded.
      
      'event_id' is an ID of the kprobe created.
      
      Closing 'event_fd':
      
      	close(event_fd);
      
      ... automatically detaches BPF program from it.
      
      BPF programs can call in-kernel helper functions to:
      
        - lookup/update/delete elements in maps
      
        - probe_read - wraper of probe_kernel_read() used to access any
          kernel data structures
      
      BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
      architecture dependent) and return 0 to ignore the event and 1 to store
      kprobe event into the ring buffer.
      
      Note, kprobes are a fundamentally _not_ a stable kernel ABI,
      so BPF programs attached to kprobes must be recompiled for
      every kernel version and user must supply correct LINUX_VERSION_CODE
      in attr.kern_version during bpf_prog_load() call.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2541517c
  6. 27 3月, 2015 2 次提交
    • P
      perf: Add per event clockid support · 34f43927
      Peter Zijlstra 提交于
      While thinking on the whole clock discussion it occurred to me we have
      two distinct uses of time:
      
       1) the tracking of event/ctx/cgroup enabled/running/stopped times
          which includes the self-monitoring support in struct
          perf_event_mmap_page.
      
       2) the actual timestamps visible in the data records.
      
      And we've been conflating them.
      
      The first is all about tracking time deltas, nobody should really care
      in what time base that happens, its all relative information, as long
      as its internally consistent it works.
      
      The second however is what people are worried about when having to
      merge their data with external sources. And here we have the
      discussion on MONOTONIC vs MONOTONIC_RAW etc..
      
      Where MONOTONIC is good for correlating between machines (static
      offset), MONOTNIC_RAW is required for correlating against a fixed rate
      hardware clock.
      
      This means configurability; now 1) makes that hard because it needs to
      be internally consistent across groups of unrelated events; which is
      why we had to have a global perf_clock().
      
      However, for 2) it doesn't really matter, perf itself doesn't care
      what it writes into the buffer.
      
      The below patch makes the distinction between these two cases by
      adding perf_event_clock() which is used for the second case. It
      further makes this configurable on a per-event basis, but adds a few
      sanity checks such that we cannot combine events with different clocks
      in confusing ways.
      
      And since we then have per-event configurability we might as well
      retain the 'legacy' behaviour as a default.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      34f43927
    • P
      perf: Fix racy group access · ccd41c86
      Peter Zijlstra 提交于
      While looking at some fuzzer output I noticed that we do not hold any
      locks on leader->ctx and therefore the sibling_list iteration is
      unsafe.
      
      Acquire the relevant ctx->mutex before calling into the pmu specific
      code.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Link: http://lkml.kernel.org/r/20150225151639.GL5029@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ccd41c86
  7. 23 3月, 2015 2 次提交
    • P
      perf: Remove type specific target pointers · 50f16a8b
      Peter Zijlstra 提交于
      The only reason CQM had to use a hard-coded pmu type was so it could use
      cqm_target in hw_perf_event.
      
      Do away with the {tp,bp,cqm}_target pointers and provide a non type
      specific one.
      
      This allows us to do away with that silly pmu type as well.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Vince Weaver <vince@deater.net>
      Cc: acme@kernel.org
      Cc: acme@redhat.com
      Cc: hpa@zytor.com
      Cc: jolsa@redhat.com
      Cc: kanaka.d.juvva@intel.com
      Cc: matt.fleming@intel.com
      Cc: tglx@linutronix.de
      Cc: torvalds@linux-foundation.org
      Cc: vikas.shivappa@linux.intel.com
      Link: http://lkml.kernel.org/r/20150305211019.GU21418@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      50f16a8b
    • P
      perf: Fix irq_work 'tail' recursion · d525211f
      Peter Zijlstra 提交于
      Vince reported a watchdog lockup like:
      
      	[<ffffffff8115e114>] perf_tp_event+0xc4/0x210
      	[<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160
      	[<ffffffff810b7f10>] lock_release+0x130/0x260
      	[<ffffffff816c7474>] _raw_spin_unlock_irqrestore+0x24/0x40
      	[<ffffffff8107bb4d>] do_send_sig_info+0x5d/0x80
      	[<ffffffff811f69df>] send_sigio_to_task+0x12f/0x1a0
      	[<ffffffff811f71ce>] send_sigio+0xae/0x100
      	[<ffffffff811f72b7>] kill_fasync+0x97/0xf0
      	[<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0
      	[<ffffffff8115d103>] perf_pending_event+0x33/0x60
      	[<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80
      	[<ffffffff8114e448>] irq_work_run+0x18/0x40
      	[<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0
      	[<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80
      
      Which is caused by an irq_work generating new irq_work and therefore
      not allowing forward progress.
      
      This happens because processing the perf irq_work triggers another
      perf event (tracepoint stuff) which in turn generates an irq_work ad
      infinitum.
      
      Avoid this by raising the recursion counter in the irq_work -- which
      effectively disables all software events (including tracepoints) from
      actually triggering again.
      Reported-by: NVince Weaver <vincent.weaver@maine.edu>
      Tested-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20150219170311.GH21418@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d525211f
  8. 13 3月, 2015 1 次提交
  9. 03 3月, 2015 1 次提交
  10. 25 2月, 2015 4 次提交
    • M
      perf/x86/intel: Support task events with Intel CQM · bfe1fcd2
      Matt Fleming 提交于
      Add support for task events as well as system-wide events. This change
      has a big impact on the way that we gather LLC occupancy values in
      intel_cqm_event_read().
      
      Currently, for system-wide (per-cpu) events we defer processing to
      userspace which knows how to discard all but one cpu result per package.
      
      Things aren't so simple for task events because we need to do the value
      aggregation ourselves. To do this, we defer updating the LLC occupancy
      value in event->count from intel_cqm_event_read() and do an SMP
      cross-call to read values for all packages in intel_cqm_event_count().
      We need to ensure that we only do this for one task event per cache
      group, otherwise we'll report duplicate values.
      
      If we're a system-wide event we want to fallback to the default
      perf_event_count() implementation. Refactor this into a common function
      so that we don't duplicate the code.
      
      Also, introduce PERF_TYPE_INTEL_CQM, since we need a way to track an
      event's task (if the event isn't per-cpu) inside of the Intel CQM PMU
      driver.  This task information is only availble in the upper layers of
      the perf infrastructure.
      
      Other perf backends stash the target task in event->hw.*target so we
      need to do something similar. The task is used to determine whether
      events should share a cache group and an RMID.
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/1422038748-21397-8-git-send-email-matt@codeblueprint.co.ukSigned-off-by: NIngo Molnar <mingo@kernel.org>
      bfe1fcd2
    • M
      perf: Move cgroup init before PMU ->event_init() · 79dff51e
      Matt Fleming 提交于
      The Intel QoS PMU needs to know whether an event is part of a cgroup
      during ->event_init(), because tasks in the same cgroup share a
      monitoring ID.
      
      Move the cgroup initialisation before calling into the PMU driver.
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Link: http://lkml.kernel.org/r/1422038748-21397-4-git-send-email-matt@codeblueprint.co.ukSigned-off-by: NIngo Molnar <mingo@kernel.org>
      79dff51e
    • M
      perf: Add ->count() function to read per-package counters · eacd3ecc
      Matt Fleming 提交于
      For PMU drivers that record per-package counters, the ->count variable
      cannot be used to record an accurate aggregated value, since it's not
      possible to perform SMP cross-calls to cpus on other packages from the
      context in which we update ->count.
      
      Introduce a new optional ->count() accessor function that can be used to
      customize how values are collected. If a PMU driver doesn't provide a
      ->count() function, we fallback to the existing code.
      
      There is necessarily a window of staleness with this approach because
      the task that generated the counter value may not have been scheduled by
      the cpu recently.
      
      An alternative and more complex approach would be to use a hrtimer to
      periodically refresh the values from a more permissive scheduling
      context. So, we're trading off complexity for accuracy.
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Link: http://lkml.kernel.org/r/1422038748-21397-3-git-send-email-matt@codeblueprint.co.ukSigned-off-by: NIngo Molnar <mingo@kernel.org>
      eacd3ecc
    • M
      perf: Make perf_cgroup_from_task() global · 39bed6cb
      Matt Fleming 提交于
      Move perf_cgroup_from_task() from kernel/events/ to include/linux/ along
      with the necessary struct definitions, so that it can be used by the PMU
      code.
      
      When the upcoming Intel Cache Monitoring PMU driver assigns monitoring
      IDs to perf events, it needs to be able to check whether any two
      monitoring events overlap (say, a cgroup and task event), which means we
      need to be able to lookup the cgroup associated with a task (if any).
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Link: http://lkml.kernel.org/r/1422038748-21397-2-git-send-email-matt@codeblueprint.co.ukSigned-off-by: NIngo Molnar <mingo@kernel.org>
      39bed6cb
  11. 19 2月, 2015 7 次提交
  12. 13 2月, 2015 1 次提交
  13. 04 2月, 2015 8 次提交
  14. 02 2月, 2015 1 次提交