1. 18 Aug 2016, 1 commit
    • perf/core: Generalize event->group_flags · 4ff6a8de
      David Carrillo-Cisneros authored
      Currently, PERF_GROUP_SOFTWARE is used in the group_flags field of a
      group's leader to indicate that is_software_event(event) is true for all
      events in a group. This is the only usage of event->group_flags.
      
      This pattern of setting a group-level flag when all events in the group
      share a property is useful for the flag introduced in the next patch and
      for future CQM/CMT flags. So this patch generalizes group_flags to work
      as an aggregate of event-level flags.
      
      PERF_GROUP_SOFTWARE denotes an immutable property of an event. All other
      flags that I intend to add are also determinable at event initialization.
      To better convey the above, this patch renames the event's group_flags to
      group_caps and PERF_GROUP_SOFTWARE to PERF_EV_CAP_SOFTWARE.
      
      Individual event flags are stored in the new event->event_caps. Since the
      cap flags do not change after event initialization, there is no need to
      serialize event_caps. This new field is used when events are added to a
      context, similarly to how PERF_GROUP_SOFTWARE and is_software_event()
      worked.
      
      Lastly, for consistency, update is_software_event() to rely on
      event->event_caps instead of the context index.
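      As a rough illustration of the scheme described above, here is a minimal
      sketch in plain C (not the kernel code; the names follow the changelog and
      the group handling is simplified): each event carries its own immutable
      event_caps, and the leader's group_caps is the intersection of the caps of
      every member, so a group-level test such as "all software events" becomes
      a single flag check.

        /* Sketch only: simplified model of event/group capability flags. */
        #define PERF_EV_CAP_SOFTWARE  (1 << 0)  /* pure software event */

        struct event {
            unsigned int event_caps;  /* set once at event init, never changes */
            unsigned int group_caps;  /* leader: AND of all members' event_caps */
            struct event *leader;
        };

        /* When an event joins a group, the group keeps only the caps that
         * every member has, so group_caps stays an aggregate of event_caps. */
        static void group_attach(struct event *event, struct event *leader)
        {
            event->leader = leader;
            leader->group_caps &= event->event_caps;
        }

        static int group_is_software(const struct event *leader)
        {
            return leader->group_caps & PERF_EV_CAP_SOFTWARE;
        }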
      Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1471467307-61171-3-git-send-email-davidcc@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4ff6a8de
  2. 10 Aug 2016, 2 commits
    • perf/core: Optimize perf_pmu_sched_task() · e48c1788
      Peter Zijlstra authored
      For 'perf record -b', which requires the pmu::sched_task callback, the
      current code is rather expensive:
      
           7.68%  sched-pipe  [kernel.vmlinux]    [k] perf_pmu_sched_task
           5.95%  sched-pipe  [kernel.vmlinux]    [k] __switch_to
           5.20%  sched-pipe  [kernel.vmlinux]    [k] __intel_pmu_disable_all
           3.95%  sched-pipe  perf                [.] worker_thread
      
      The problem is that it will iterate all registered PMUs, most of which
      will not have anything to do. Avoid this by keeping an explicit list
      of PMUs that have requested the callback.
      
      The perf_sched_cb_{inc,dec}() functions already take the required pmu
      argument, and now that these functions are no longer called from NMI
      context we can use them to manage a list.
      
      With this patch applied, the function doesn't show up in the top 4
      anymore (it dropped to 18th place).
      
           6.67%  sched-pipe  [kernel.vmlinux]    [k] __switch_to
           6.18%  sched-pipe  [kernel.vmlinux]    [k] __intel_pmu_disable_all
           3.92%  sched-pipe  [kernel.vmlinux]    [k] switch_mm_irqs_off
           3.71%  sched-pipe  perf                [.] worker_thread
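      The bookkeeping can be pictured with a small sketch (plain C, heavily
      simplified and with invented helper names; the real kernel keeps this
      state per CPU and under the proper locking):

        /* Sketch: only PMUs that asked for the callback are on the list, so
         * the context-switch path walks just those, not every registered PMU. */
        struct pmu_sched_cb {
            struct pmu_sched_cb *next;
            void (*sched_task)(void *ctx, int sched_in);
            int users;
        };

        static struct pmu_sched_cb *sched_cb_list;   /* per-CPU in reality */

        static void sched_cb_inc(struct pmu_sched_cb *cb)
        {
            if (cb->users++ == 0) {          /* first user: add to the list */
                cb->next = sched_cb_list;
                sched_cb_list = cb;
            }
        }

        static void pmu_sched_task(void *ctx, int sched_in)
        {
            struct pmu_sched_cb *cb;

            for (cb = sched_cb_list; cb; cb = cb->next)
                cb->sched_task(ctx, sched_in);   /* only interested PMUs run */
        }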
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e48c1788
    • perf/core: Set cgroup in CPU contexts for new cgroup events · db4a8356
      David Carrillo-Cisneros authored
      There's a perf stat bug that is easy to observe on a machine with only one cgroup:
      
        $ perf stat -e cycles -I 1000 -C 0 -G /
        #          time             counts unit events
            1.000161699      <not counted>      cycles                    /
            2.000355591      <not counted>      cycles                    /
            3.000565154      <not counted>      cycles                    /
            4.000951350      <not counted>      cycles                    /
      
      We'd expect some output there.
      
      The underlying problem is that there is an optimization in
      perf_cgroup_sched_{in,out}() that skips the switch of cgroup events
      if the old and new cgroups in a task switch are the same.
      
      This optimization interacts with the current code in two ways
      that cause a CPU context's cgroup (cpuctx->cgrp) to be NULL even if a
      cgroup event matches the current task. These are:
      
        1. On creation of the first cgroup event in a CPU: In current code,
        cpuctx->cgrp is only set in perf_cgroup_sched_in, but due to the
        aforesaid optimization, perf_cgroup_sched_in will not run until the next
        cgroup switch in that CPU. This may happen late or never happen,
        depending on the system's number of cgroups, CPU load, etc.
      
        2. On deletion of the last cgroup event in a cpuctx: In list_del_event,
        cpuctx->cgrp is set to NULL. Any new cgroup event will not be scheduled in
        because cpuctx->cgrp == NULL until a cgroup switch occurs and
        perf_cgroup_sched_in is executed (updating cpuctx->cgrp).
      
      This patch fixes both problems by setting cpuctx->cgrp in list_add_event,
      mirroring what list_del_event does when removing a cgroup event from CPU
      context, as introduced in:
      
        commit 68cacd29 ("perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()")
      
      With this patch, cpuctx->cgrp is always set/cleared when installing/removing
      the first/last cgroup event in/from the CPU context. With cpuctx->cgrp
      correctly set, event_filter_match works as intended when events are
      scheduled in/out.
      
      After the fix, the output is as expected:
      
        $ perf stat -e cycles -I 1000 -a -G /
        #         time             counts unit events
           1.004699159          627342882      cycles                    /
           2.007397156          615272690      cycles                    /
           3.010019057          616726074      cycles                    /
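      The gist of the fix can be sketched as follows (plain C, simplified; the
      structure and function names only mirror the changelog, and the cgroup
      matching test and locking are elided):

        /* Sketch of the idea: set cpuctx->cgrp as soon as the first cgroup
         * event is added, instead of waiting for the next cgroup switch. */
        #include <stddef.h>

        struct cpu_ctx {
            int   nr_cgroup_events;
            void *cgrp;                /* cgroup of the currently running task */
        };

        static void list_add_cgroup_event(struct cpu_ctx *cpuctx,
                                          void *event_cgrp, void *task_cgrp)
        {
            if (cpuctx->nr_cgroup_events++ == 0 && event_cgrp == task_cgrp)
                cpuctx->cgrp = event_cgrp;   /* set now, don't wait for a switch */
        }

        static void list_del_cgroup_event(struct cpu_ctx *cpuctx)
        {
            if (--cpuctx->nr_cgroup_events == 0)
                cpuctx->cgrp = NULL;         /* mirrors the existing delete path */
        }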
      Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1470124092-113192-1-git-send-email-davidcc@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      db4a8356
  3. 26 Jul 2016, 1 commit
    • bpf, events: fix offset in skb copy handler · aa7145c1
      Daniel Borkmann authored
      This patch fixes the __output_custom() routine we currently use with
      bpf_skb_copy(). I missed that when len is larger than the size of the
      current handle, we can issue multiple invocations of copy_func, and
      __output_custom() advances destination but also source buffer by the
      written amount of bytes. When we have __output_custom(), this is actually
      wrong since in that case the source buffer points to a non-linear object,
      in our case an skb, which the copy_func helper is supposed to walk.
      Since the source is non-linear, we need to pass the offset into the
      helper, so that copy_func can use it for extracting the data from the
      source object.
      
      Therefore, adjust the callback signatures properly and pass offset
      into the skb_header_pointer() invoked from bpf_skb_copy() callback. The
      __DEFINE_OUTPUT_COPY_BODY() is adjusted to accommodate two things:
      i) to pass in whether we should advance the source buffer or not; this is
      a compile-time constant condition, ii) to pass in the offset for
      __output_custom(), which we do with the help of __VA_ARGS__, so everything
      can stay inlined as it is currently. Both changes allow for adapting the
      __output_* fast-path helpers w/o extra overhead.
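      For reference, the adjusted copy callback takes roughly the following
      shape (a hedged sketch, close to but not verbatim from net/core/filter.c;
      only skb_header_pointer() is the real kernel helper here):

        /* Kernel-context sketch (needs <linux/skbuff.h> and <linux/string.h>):
         * the offset is passed in so the helper can walk a non-linear skb
         * itself instead of trusting an advanced raw source pointer. */
        static unsigned long skb_copy_cb(void *dst, const void *src_skb,
                                         unsigned long off, unsigned long len)
        {
            const struct sk_buff *skb = src_skb;
            /* Either points into the linear area or fills dst from the frags. */
            void *ptr = skb_header_pointer(skb, off, len, dst);

            if (!ptr)
                return len;             /* could not extract the requested bytes */
            if (ptr != dst)
                memcpy(dst, ptr, len);  /* data was linear; copy it over */
            return 0;
        }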
      
      Fixes: 555c8a86 ("bpf: avoid stack copy and use skb ctx for event output")
      Fixes: 7e3f977e ("perf, events: add non-linear data support for raw records")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aa7145c1
  4. 16 Jul 2016, 1 commit
    • perf, events: add non-linear data support for raw records · 7e3f977e
      Daniel Borkmann authored
      This patch adds support for non-linear data on raw records. It
      extends raw records to have one or multiple fragments that will
      be written linearly into the ring slot, where each fragment can
      optionally have a custom callback handler to walk and extract
      complex, possibly non-linear data.
      
      If a callback handler is provided for a fragment, then the new
      __output_custom() will be used instead of __output_copy() for
      the perf_output_sample() part. perf_prepare_sample() does all
      the size calculation only once, so perf_output_sample() doesn't
      need to redo the same work anymore, meaning real_size and padding
      will be cached in the raw record. The raw record becomes 32 bytes
      in size without holes; to not increase it further and to avoid
      doing unnecessary recalculations in the fast path, we can reuse the
      next pointer of the last fragment. The idea here is borrowed from
      ZERO_OR_NULL_PTR(), which should keep the perf_output_sample()
      path for PERF_SAMPLE_RAW minimal.
      
      This facility is needed for BPF's event output helper as a first
      user that will, in a follow-up, add an additional perf_raw_frag
      to its perf_raw_record in order to be able to more efficiently
      dump skb context after a linear head meta data related to it.
      skbs can be non-linear and thus need a custom output function to
      dump buffers. Currently, the skb data needs to be copied twice;
      with the help of __output_custom() this work only needs to be
      done once. Future users could be things like XDP/BPF programs
      that work on a different context, though, and would thus also have
      a different callback function.
      
      The few users of raw records are adapted to initialize their frag
      data from the raw record itself, no change in behavior for them.
      The code is based upon a PoC diff provided by Peter Zijlstra [1].
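      The fragment layout described above can be sketched like this (close to
      the perf_raw_frag/perf_raw_record definitions the patch adds, but
      simplified here; treat field order and packing as illustrative):

        typedef unsigned int u32;
        typedef unsigned long (*perf_copy_f)(void *dst, const void *src,
                                             unsigned long off, unsigned long len);

        struct perf_raw_frag {
            union {
                struct perf_raw_frag *next;  /* next fragment in the chain */
                unsigned long         pad;   /* reused on the last fragment */
            };
            perf_copy_f copy;                /* optional walker for non-linear data */
            void       *data;                /* linear data when no copy callback */
            u32         size;                /* size of this fragment */
        };

        struct perf_raw_record {
            struct perf_raw_frag frag;       /* first fragment, embedded */
            u32                  size;       /* total payload size, computed once */
        };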
      
        [1] http://thread.gmane.org/gmane.linux.network/421294
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e3f977e
  5. 14 Jul 2016, 2 commits
  6. 03 Jun 2016, 2 commits
  7. 30 May 2016, 1 commit
    • perf core: Per event callchain limit · 97c79a38
      Arnaldo Carvalho de Melo authored
      In addition to being able to control the system wide maximum depth via
      /proc/sys/kernel/perf_event_max_stack, now we are able to ask for
      different depths per event, using perf_event_attr.sample_max_stack for
      that.
      
      This uses a u16 hole at the end of perf_event_attr: when
      perf_event_attr.sample_type has PERF_SAMPLE_CALLCHAIN set, a
      sample_max_stack of zero means use perf_event_max_stack; otherwise the
      value is bounds checked under callchain_mutex.
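      A userspace sketch of requesting a per-event depth (error handling
      trimmed; assumes headers and a kernel that already carry
      sample_max_stack):

        /* Ask for at most 32 callchain entries on a single cycles event. */
        #include <linux/perf_event.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        int open_cycles_with_callchain(void)
        {
            struct perf_event_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size             = sizeof(attr);
            attr.type             = PERF_TYPE_HARDWARE;
            attr.config           = PERF_COUNT_HW_CPU_CYCLES;
            attr.sample_period    = 100000;
            attr.sample_type      = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
            attr.sample_max_stack = 32;   /* 0 means: use perf_event_max_stack */

            return syscall(__NR_perf_event_open, &attr, 0 /* this task */,
                           -1 /* any CPU */, -1 /* no group */, 0);
        }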
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      97c79a38
  8. 17 May 2016, 4 commits
    • perf core: Separate accounting of contexts and real addresses in a stack trace · c85b0334
      Arnaldo Carvalho de Melo authored
      The perf_sample->ip_callchain->nr value includes all the entries in the
      ip_callchain->ip[] array, both real addresses and PERF_CONTEXT_{KERNEL,USER,etc},
      while what the user expects is that the limit in the kernel.perf_event_max_stack
      sysctl, or in the upcoming per-event perf_event_attr.sample_max_stack knob, be
      honoured in terms of real IP addresses in the stack trace.
      
      So allocate a bunch of extra entries for contexts, and do the accounting
      via perf_callchain_entry_ctx struct members.
      
      A new sysctl, kernel.perf_event_max_contexts_per_stack, is also
      introduced for investigating possible bugs in the callchain
      implementation of some architectures.
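      The split accounting can be sketched as below (plain C; the names mirror
      the changelog, and the array handling and limits are simplified):

        struct callchain_entry_ctx {
            unsigned long *ip;         /* the entry->ip[] storage              */
            unsigned int   nr;         /* real addresses stored so far         */
            unsigned int   max_stack;  /* per-event or sysctl address limit    */
            unsigned int   contexts;   /* PERF_CONTEXT_* markers stored so far */
            unsigned int   max_contexts;
        };

        static void store_address(struct callchain_entry_ctx *ctx, unsigned long addr)
        {
            if (ctx->nr < ctx->max_stack)             /* only real addresses count */
                ctx->ip[ctx->contexts + ctx->nr++] = addr;
        }

        static void store_context(struct callchain_entry_ctx *ctx, unsigned long mark)
        {
            if (ctx->contexts < ctx->max_contexts)    /* separate context budget */
                ctx->ip[ctx->contexts++ + ctx->nr] = mark;
        }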
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      c85b0334
    • perf core: Add perf_callchain_store_context() helper · 3e4de4ec
      Arnaldo Carvalho de Melo authored
      We need to have different helpers to account for how many contexts we have
      in the sample versus how many real addresses, so do it now as a prep patch,
      to ease review.
      
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-q964tnyuqrxw5gld18vizs3c@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      3e4de4ec
    • perf core: Add a 'nr' field to perf_event_callchain_context · 3b1fff08
      Arnaldo Carvalho de Melo authored
      We will use it to count how many addresses are in the entry->ip[] array,
      excluding PERF_CONTEXT_{KERNEL,USER,etc} entries, so that we can really
      return the number of entries specified by the user via the relevant
      sysctl, kernel.perf_event_max_stack, or via the per event
      perf_event_attr.sample_max_stack knob.
      
      This way we keep the perf_sample->ip_callchain->nr meaning, that is the
      number of entries, be it real addresses or PERF_CONTEXT_ entries, while
      honouring the max_stack knobs, i.e. the end result will be max_stack
      entries if we have at least that many entries in a given stack trace.
      
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-s8teto51tdqvlfhefndtat9r@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      3b1fff08
    • perf core: Pass max stack as a perf_callchain_entry context · cfbcf468
      Arnaldo Carvalho de Melo authored
      This makes perf_callchain_{user,kernel}() receive the max stack
      as context for the perf_callchain_entry, instead of accessing
      the global sysctl_perf_event_max_stack.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      cfbcf468
  9. 05 May 2016, 2 commits
    • perf/arm: Special-case hetereogeneous CPUs · 5101ef20
      Mark Rutland authored
      Commit:
      
        26657848 ("perf/core: Verify we have a single perf_hw_context PMU")
      
      forcefully prevents multiple PMUs from sharing perf_hw_context, as this
      generally doesn't make sense. It is a common bug for uncore PMUs to
      use perf_hw_context rather than perf_invalid_context, which this detects.
      
      However, systems exist with heterogeneous CPUs (and hence heterogeneous
      HW PMUs), for which sharing perf_hw_context is necessary, and possible
      in some limited cases.
      
      To make this work we have to perform some gymnastics, as we did in these
      commits:
      
        66eb579e ("perf: allow for PMU-specific event filtering")
        c904e32a ("arm: perf: filter unschedulable events")
      
      To allow those systems to work, we must allow PMUs for heterogeneous
      CPUs to share perf_hw_context, though we must still disallow sharing
      otherwise to detect the common misuse of perf_hw_context.
      
      This patch adds a new PERF_PMU_CAP_HETEROGENEOUS_CPUS for this, updates
      the core logic to account for this, and makes use of it in the arm_pmu
      code that is used for systems with heterogeneous CPUs. Comments are
      added to make the rationale clear and hopefully avoid accidental abuse.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/20160426103346.GA20836@leverpostej
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5101ef20
    • perf/core: Introduce address range filtering · 375637bc
      Alexander Shishkin authored
      Many instruction tracing PMUs out there support address range-based
      filtering, which would, for example, generate trace data only for a
      given range of instruction addresses, which is useful for tracing
      individual functions, modules or libraries. Other PMUs may also
      utilize this functionality to allow filtering to or filtering out
      code at certain address ranges.
      
      This patch introduces the interface for userspace to specify these
      filters and for the PMU drivers to apply these filters to hardware
      configuration.
      
      The user interface is an ASCII string that is passed via an ioctl() and
      specifies address ranges within certain object files or within the kernel.
      There is no special treatment for kernel modules yet, but it might be a
      worthy pursuit.
      
      The PMU driver interface basically adds two extra callbacks to the
      PMU driver structure, one of which validates the filter configuration
      proposed by the user against what the hardware is actually capable of
      doing and the other one translates hardware-independent filter
      configuration into something that can be programmed into the
      hardware.
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/1461771888-10409-6-git-send-email-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      375637bc
  10. 27 Apr 2016, 1 commit
    • perf core: Allow setting up max frame stack depth via sysctl · c5dfd78e
      Arnaldo Carvalho de Melo authored
      The default remains 127, which is good for most cases, and not even hit
      most of the time, but then for some cases, as reported by Brendan, 1024+
      deep frames are appearing on the radar for things like groovy, ruby.
      
      And in some workloads putting a _lower_ cap on this may make sense. A
      per-event limit still needs to be put in place, though.
      
      The new file is:
      
        # cat /proc/sys/kernel/perf_event_max_stack
        127
      
      Changing it:
      
        # echo 256 > /proc/sys/kernel/perf_event_max_stack
        # cat /proc/sys/kernel/perf_event_max_stack
        256
      
      But as soon as there is some event using callchains we get:
      
        # echo 512 > /proc/sys/kernel/perf_event_max_stack
        -bash: echo: write error: Device or resource busy
        #
      
      Because we only allocate the callchain percpu data structures when there
      is a user, which allows for changing the max easily, it's just a matter
      of having no callchain users at that point.
      Reported-and-Tested-by: Brendan Gregg <brendan.d.gregg@gmail.com>
      Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: David Ahern <dsahern@gmail.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      c5dfd78e
  11. 23 Apr 2016, 1 commit
    • perf/core: Add ::write_backward attribute to perf event · 9ecda41a
      Wang Nan authored
      This patch introduces a 'write_backward' bit to perf_event_attr, which
      controls the direction of a ring buffer. Once set, the corresponding
      ring buffer is written from end to beginning. This feature is designed to
      support reading from an overwritable ring buffer.
      
      A ring buffer can be created by mapping a perf event fd. The kernel puts
      event records into the ring buffer, and user tooling like perf fetches them
      from the address returned by mmap(). To prevent racing between the kernel
      and the tooling, they communicate with each other through the 'head' and
      'tail' pointers. The kernel maintains the 'head' pointer, pointing it to
      the next free area (the tail of the last record). The tooling maintains the
      'tail' pointer, pointing it to the tail of the last consumed record (a
      record that has already been fetched). The kernel determines the available
      space in the ring buffer using these two pointers, to avoid overwriting
      unfetched records.
      
      By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
      Unlike a normal ring buffer, the tooling is unable to maintain the 'tail'
      pointer because writing is forbidden. Therefore, for this type of ring
      buffer, the kernel overwrites old records unconditionally and works like a
      flight recorder. This feature would be useful if reading from an
      overwritable ring buffer were as easy as reading from a normal ring buffer.
      However, there's an obscure problem.
      
      The following figure demonstrates a full overwritable ring buffer. In
      this figure, the 'head' pointer points to the end of the last record, and a
      long record 'E' is pending. For a normal ring buffer, a 'tail' pointer
      would have pointed to position (X), so the kernel knows there's no more
      space in the ring buffer. However, for an overwritable ring buffer, the
      kernel ignores the 'tail' pointer.
      
         (X)                              head
          .                                |
          .                                V
          +------+-------+----------+------+---+
          |A....A|B.....B|C........C|D....D|   |
          +------+-------+----------+------+---+
      
      Record 'A' is overwritten by event 'E':
      
            head
             |
             V
          +--+---+-------+----------+------+---+
          |.E|..A|B.....B|C........C|D....D|E..|
          +--+---+-------+----------+------+---+
      
      Now the tooling decides to read from this ring buffer. However, neither of
      the two natural positions, 'head' and the start of this ring buffer, points
      to the head of a record. Even though the full ring buffer can be accessed
      by the tooling, it is unable to find a position from which to start decoding.
      
      The first attempt to solve this problem that I know of can be found at
      [1]. It makes the kernel maintain the 'tail' pointer, updating it when the
      ring buffer is half full. However, this approach introduces overhead to the
      fast path: test results show a 1% overhead [2]. In addition, this method
      utilizes no more than 50% of the records.
      
      Another attempt can be found at [3], which puts the size of
      an event at the end of each record. This approach allows the tooling to find
      records in a backward manner from the 'head' pointer by reading the size of
      a record from its tail. However, because of the alignment requirement, it
      needs 8 bytes to record the size of a record, which is a huge waste. Its
      performance is also not good, because more data needs to be written.
      This approach also introduces some extra branch instructions to the fast
      path.
      
      'write_backward' is a better solution to this problem.
      
      The following figure demonstrates the state of the overwritable ring buffer
      when 'write_backward' is set before overwriting:
      
             head
              |
              V
          +---+------+----------+-------+------+
          |   |D....D|C........C|B.....B|A....A|
          +---+------+----------+-------+------+
      
      and after overwriting:
                                           head
                                            |
                                            V
          +---+------+----------+-------+---+--+
          |..E|D....D|C........C|B.....B|A..|E.|
          +---+------+----------+-------+---+--+
      
      In each situation, 'head' points to the beginning of the newest record.
      From this record, the tooling can iterate over the full ring buffer and
      fetch records one by one.
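      A userspace sketch of setting this up (hedged: error handling trimmed, the
      event choice is arbitrary, and it assumes headers and a kernel that already
      have the write_backward attribute):

        /* Create a backward-written, overwritable ring buffer. */
        #include <linux/perf_event.h>
        #include <stddef.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        void *map_backward_ring(int *out_fd, size_t data_pages)
        {
            struct perf_event_attr attr;
            int fd;

            memset(&attr, 0, sizeof(attr));
            attr.size           = sizeof(attr);
            attr.type           = PERF_TYPE_SOFTWARE;
            attr.config         = PERF_COUNT_SW_DUMMY;
            attr.sample_period  = 1;
            attr.sample_type    = PERF_SAMPLE_TID;
            attr.write_backward = 1;   /* kernel writes from end to beginning */

            fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
            *out_fd = fd;

            /* No PROT_WRITE: the buffer is overwritable, user space never
             * updates data_tail; decoding starts at data_head (newest record). */
            return mmap(NULL, (data_pages + 1) * sysconf(_SC_PAGESIZE),
                        PROT_READ, MAP_SHARED, fd, 0);
        }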
      
      The only limitation that needs to be considered is back-to-back reading.
      Due to the non-deterministic nature of user programs, it is impossible to
      ensure the ring buffer stays stable during reading. Consider an extreme
      situation: the tooling is scheduled out after reading record 'D', then a
      burst of events comes and eats up the whole ring buffer (one or multiple
      rounds). When the tooling process comes back, reading after 'D' is
      incorrect.
      
      To prevent this problem, we need to find a way to ensure the ring buffer
      is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
      suggested because its overhead is lower than
      ioctl(PERF_EVENT_IOC_ENABLE).
      
      By carefully verifying the 'head' pointer, a reader can avoid pausing the
      ring buffer. For example:
      
          /* A union of all possible events */
          union perf_event event;
      
          p = head = perf_mmap__read_head();
          while (true) {
              /* copy header of next event */
              fetch(&event.header, p, sizeof(event.header));
      
              /* read 'head' pointer */
              head = perf_mmap__read_head();
      
              /* check overwritten: is the header good? */
              if (!verify(sizeof(event.header), p, head))
                  break;
      
              /* copy the whole event */
              fetch(&event, p, event.header.size);
      
              /* read 'head' pointer again */
              head = perf_mmap__read_head();
      
              /* is the whole event good? */
              if (!verify(event.header.size, p, head))
                  break;
              p += event.header.size;
          }
      
      However, the overhead is high because:
      
       a) In-place decoding is not safe.
          Copying-verifying-decoding is required.
       b) Fetching 'head' pointer requires additional synchronization.
      
      (From Alexei Starovoitov:
      
      Even when this trick works, pause is needed for more than stability of
      reading. When we collect the events into overwrite buffer we're waiting
      for some other trigger (like all cpu utilization spike or just one cpu
      running and all others are idle) and when it happens the buffer has
      valuable info from the past. At this point new events are no longer
      interesting and buffer should be paused, events read and unpaused until
      next trigger comes.)
      
      This patch utilizes the event's default overflow_handler introduced
      previously. perf_event_output_backward() is created as the default
      overflow handler for backward ring buffers. To avoid adding extra overhead
      to the fast path, the original perf_event_output() becomes
      __perf_event_output() and is marked '__always_inline'. In theory, there's
      no extra overhead introduced to the fast path.
      
      Performance testing:
      
      Call 'close(-1)' 3000000 times and use gettimeofday() to check the
      duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
      system calls. Results are in ns.
      
      Testing environment:
      
        CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
        Kernel : v4.5.0
                          MEAN         STDVAR
       BASE            800214.950    2853.083
       PRE1           2253846.700    9997.014
       PRE2           2257495.540    8516.293
       POST           2250896.100    8933.921
      
      Where 'BASE' is pure performance without capturing. 'PRE1' is test
      result of pure 'v4.5.0' kernel. 'PRE2' is test result before this
      patch. 'POST' is test result after this patch. See [4] for the detailed
      experimental setup.
      
      Considering the stdvar, this patch doesn't introduce performance
      overhead to the fast path.
      
       [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
       [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
       [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
       [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: <acme@kernel.org>
      Cc: <pi3orama@163.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1459865478-53413-1-git-send-email-wangnan0@huawei.com
      [ Fixed the changelog some more. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9ecda41a
  12. 08 Apr 2016, 2 commits
  13. 31 Mar 2016, 1 commit
    • perf/core: Set event's default ::overflow_handler() · 1879445d
      Wang Nan authored
      Set a default event->overflow_handler in perf_event_alloc() so we don't
      need to check event->overflow_handler in __perf_event_overflow().
      Following commits can set a different default overflow_handler.
      
      Initial idea comes from Peter:
      
        http://lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net
      
      Since the default value of event->overflow_handler is not NULL, existing
      'if (!overflow_handler)' checks need to be changed.
      
      is_default_overflow_handler() is introduced for this.
      
      No extra performance overhead is introduced into the hot path because in
      the original code we still needed to read this handler from memory anyway.
      A conditional branch is avoided, so we actually remove some instructions.
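      The shape of the new helper can be sketched as follows (plain C,
      simplified; the kernel compares against its real default output function(s)
      rather than the placeholder used here):

        typedef void (*overflow_handler_t)(void *event, void *data, void *regs);

        /* Stand-in for the kernel's default ring-buffer output function. */
        static void default_overflow_handler(void *event, void *data, void *regs)
        {
        }

        struct event {
            overflow_handler_t overflow_handler;   /* never NULL after alloc */
        };

        /* Replaces the old 'if (!event->overflow_handler)' checks. */
        static int is_default_overflow_handler(const struct event *event)
        {
            return event->overflow_handler == default_overflow_handler;
        }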
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <pi3orama@163.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1459147292-239310-3-git-send-email-wangnan0@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1879445d
  14. 21 Mar 2016, 2 commits
    • perf/x86/amd/power: Add AMD accumulated power reporting mechanism · c7ab62bf
      Huang Rui authored
      Introduce an AMD accumulated power reporting mechanism for the Family
      15h, Model 60h processor that can be used to calculate the average
      power consumed by a processor during a measurement interval. The
      feature support is indicated by CPUID Fn8000_0007_EDX[12].
      
      This feature will be implemented both in hwmon and perf. The current
      design provides one event to report per package/processor power
      consumption by counting each compute unit power value.
      
      Here are the gory details of how the computation is done:
      
      * Tsample: compute unit power accumulator sample period
      * Tref: the PTSC counter period (PTSC: performance timestamp counter)
      * N: the ratio of compute unit power accumulator sample period to the
        PTSC period
      
      * Jmax: max compute unit accumulated power which is indicated by
        MSR_C001007b[MaxCpuSwPwrAcc]
      
      * Jx/Jy: compute unit accumulated power which is indicated by
        MSR_C001007a[CpuSwPwrAcc]
      
      * Tx/Ty: the value of performance timestamp counter which is indicated
        by CU_PTSC MSR_C0010280[PTSC]
      * PwrCPUave: CPU average power
      
      i. Determine the ratio of Tsample to Tref by executing CPUID Fn8000_0007.
      	N = value of CPUID Fn8000_0007_ECX[CpuPwrSampleTimeRatio[15:0]].
      
      ii. Read the full range of the cumulative energy value from the new
          MSR MaxCpuSwPwrAcc.
      	Jmax = value returned.
      
      iii. At time x, software reads CpuSwPwrAcc and samples the PTSC.
      	Jx = value read from CpuSwPwrAcc and Tx = value read from PTSC.
      
      iv. At time y, software reads CpuSwPwrAcc and samples the PTSC.
      	Jy = value read from CpuSwPwrAcc and Ty = value read from PTSC.
      
      v. Calculate the average power consumption for a compute unit over
      time period (y-x). Unit of result is uWatt:
      
      	if (Jy < Jx) // Rollover has occurred
      		Jdelta = (Jy + Jmax) - Jx
      	else
      		Jdelta = Jy - Jx
      	PwrCPUave = N * Jdelta * 1000 / (Ty - Tx)
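      Written out as C, the per-interval computation in step v looks roughly
      like this (a sketch: the MSR/CPUID reads that produce Jx, Jy, Tx, Ty, Jmax
      and N are not shown):

        #include <stdint.h>

        /* Average power over the interval [x, y], in microwatts (uWatt). */
        static uint64_t cpu_avg_power_uw(uint64_t jx, uint64_t jy,
                                         uint64_t tx, uint64_t ty,
                                         uint64_t jmax, uint64_t n)
        {
            uint64_t jdelta;

            if (jy < jx)                    /* the accumulator rolled over */
                jdelta = (jy + jmax) - jx;
            else
                jdelta = jy - jx;

            return n * jdelta * 1000 / (ty - tx);   /* PwrCPUave */
        }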
      
      Simple example:
      
        root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' make -j4
          CHK     include/config/kernel.release
          CHK     include/generated/uapi/linux/version.h
          CHK     include/generated/utsrelease.h
          CHK     include/generated/timeconst.h
          CHK     include/generated/bounds.h
          CHK     include/generated/asm-offsets.h
          CALL    scripts/checksyscalls.sh
          CHK     include/generated/compile.h
          SKIPPED include/generated/compile.h
          Building modules, stage 2.
        Kernel: arch/x86/boot/bzImage is ready  (#40)
          MODPOST 4225 modules
      
         Performance counter stats for 'system wide':
      
                    183.44 mWatts power/power-pkg/
      
             341.837270111 seconds time elapsed
      
        root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' sleep 10
      
         Performance counter stats for 'system wide':
      
                      0.18 mWatts power/power-pkg/
      
              10.012551815 seconds time elapsed
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Suggested-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Huang Rui <ray.huang@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: jacob.w.shin@gmail.com
      Link: http://lkml.kernel.org/r/1457502306-2559-1-git-send-email-ray.huang@amd.com
      [ Fixed the modular build. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c7ab62bf
    • perf/x86/cqm: Fix CQM handling of grouping events into a cache_group · a223c1c7
      Vikas Shivappa authored
      Currently CQM (cache quality of service monitoring) is grouping all
      events belonging to the same PID to use one RMID. However, it's not
      counting all of these different events. Hence we end up with a count of
      zero for all events other than the group leader.
      
      The patch tries to address the issue by keeping a flag in
      perf_event.hw, which has other CQM related fields. The field is updated
      at event creation and during grouping.
      Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      [peterz: Changed hw_perf_event::is_group_event to an int]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/1457652732-4499-2-git-send-email-vikas.shivappa@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a223c1c7
  15. 02 Mar 2016, 1 commit
    • perf: Migrate perf to use new tick dependency mask model · 555e0c1e
      Frederic Weisbecker authored
      Instead of providing asynchronous checks for the nohz subsystem to verify
      perf event tick dependency, migrate perf to the new mask.
      
      Perf needs the tick for two situations:
      
      1) Freq events. We could set the tick dependency when those are
      installed on a CPU context. But setting a global dependency on top of
      the global freq events accounting is much easier. If people want that
      to be optimized, we can still refine it at the per-CPU tick dependency
      level. This patch doesn't change the current behaviour anyway.
      
      2) Throttled events: this is a per-cpu dependency.
      Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      555e0c1e
  16. 29 Feb 2016, 1 commit
    • perf: Allow storage of PMU private data in event · 54d751d4
      Thomas Gleixner authored
      For PMUs which are not per CPU, but e.g. per package/socket, we want to be
      able to store a reference to the underlying per package/socket facility in the
      event at init time so we can avoid magic storage constructs in the PMU driver.
      
      This allows us to get rid of the per CPU dance in the intel uncore and RAPL
      drivers and avoids a lookup of the per package data in the perf hotpath.
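      In effect the event grows an opaque pointer that the PMU driver fills in
      at event init time and dereferences in the hot path. A minimal sketch
      (plain C, with invented driver-side names; the real field lives in
      struct perf_event):

        struct event {
            void *pmu_private;              /* opaque, owned by the PMU driver */
        };

        struct pkg_state {                  /* e.g. per-package uncore/RAPL data */
            unsigned long long counter;
        };

        static int driver_event_init(struct event *event, struct pkg_state *pkg)
        {
            event->pmu_private = pkg;       /* resolved once, at init time */
            return 0;
        }

        static void driver_event_read(struct event *event)
        {
            struct pkg_state *pkg = event->pmu_private;  /* no per-package lookup */

            (void)pkg->counter;             /* read/accumulate the hardware value */
        }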
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Harish Chegondi <harish.chegondi@intel.com>
      Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20160222221011.364140369@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      54d751d4
  17. 25 Feb 2016, 2 commits
    • perf: Fix race between event install and jump_labels · 9107c89e
      Peter Zijlstra authored
      perf_install_in_context() relies upon the context switch hooks to have
      scheduled in events when the IPI misses its target -- after all, if
      the task has moved from the CPU (or wasn't running at all), it will
      have to context switch to run elsewhere.
      
      This however doesn't appear to be happening.
      
      It is possible for the IPI to not happen (task wasn't running) only to
      later observe the task running with an inactive context.
      
      The only possible explanation is that the context switch hooks are not
      called. Therefore put in a sync_sched() after toggling the jump_label
      to guarantee all CPUs will have them enabled before we install an
      event.
      
      A simple if (0->1) sync_sched() will not in fact work, because any
      further increment can race and complete before the sync_sched().
      Therefore we must jump through some hoops.
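      The enable-side pattern described above can be sketched as pseudo-kernel C
      (modelled on the changelog, hedged: names and structure are simplified and
      assume the usual kernel headers, not copied from the kernel):

        static DEFINE_STATIC_KEY_FALSE(perf_sched_events);
        static atomic_t perf_sched_count;
        static DEFINE_MUTEX(perf_sched_mutex);

        static void account_sched_event(void)
        {
            if (atomic_inc_not_zero(&perf_sched_count))
                return;                     /* key already enabled and synced */

            mutex_lock(&perf_sched_mutex);
            if (!atomic_read(&perf_sched_count)) {
                static_branch_enable(&perf_sched_events);
                /* Make sure every CPU sees the enabled context switch hooks
                 * before any event can be installed; otherwise an IPI miss
                 * has no fallback path. */
                synchronize_sched();
            }
            atomic_inc(&perf_sched_count);  /* under the mutex: no 0->1 race */
            mutex_unlock(&perf_sched_mutex);
        }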
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160224174947.980211985@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9107c89e
    • perf: Fix cloning · a69b0ca4
      Peter Zijlstra authored
      Alexander reported that when the 'original' context gets destroyed, no
      new clones happen.
      
      This can happen irrespective of the ctx switch optimization, any task
      can die, even the parent, and we want to continue monitoring the task
      hierarchy until we either close the event or no tasks are left in the
      hierarchy.
      
      perf_event_init_context() will attempt to pin the 'parent' context
      during clone(). At that point current is the parent, and since current
      cannot have exited while executing clone(), its context cannot have
      passed through perf_event_exit_task_context(). Therefore
      perf_pin_task_context() cannot observe ctx->task == TASK_TOMBSTONE.
      
      However, since inherit_event() does:
      
      	if (parent_event->parent)
      		parent_event = parent_event->parent;
      
      it looks at the 'original' event when it does is_orphaned_event().
      This can return true if the context that contains this event has
      passed through perf_event_exit_task_context(). And thus we'll fail to
      clone the perf context.
      
      Fix this by adding a new state: STATE_DEAD, which is set by
      perf_release() to indicate that the filedesc (or kernel reference) is
      dead and there are no observers for our data left.
      
      Only for STATE_DEAD will is_orphaned_event() be true and inhibit
      cloning.
      
      STATE_EXIT is otherwise preserved such that is_event_hup() remains
      functional and will report when the observed task hierarchy becomes
      empty.
      Reported-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Fixes: c6e5b732 ("perf: Synchronously clean up child events")
      Link: http://lkml.kernel.org/r/20160224174947.919845295@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a69b0ca4
  18. 20 Feb 2016, 1 commit
  19. 29 Jan 2016, 2 commits
  20. 22 Jan 2016, 1 commit
    • perf: Collapse and fix event_function_call() users · fae3fde6
      Peter Zijlstra authored
      There is one common bug left in all the event_function_call() users:
      between loading ctx->task and getting to the remote_function(),
      ctx->task can already have been changed.
      
      Therefore we need to double check and retry if ctx->task != current.
      
      Insert another trampoline specific to event_function_call() that
      checks for this and further validates state. This also allows getting
      rid of the active/inactive functions.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fae3fde6
  21. 23 Nov 2015, 1 commit
  22. 13 Sep 2015, 4 commits
  23. 10 Aug 2015, 1 commit
  24. 07 Jun 2015, 2 commits
    • perf/x86/intel: Introduce PERF_RECORD_LOST_SAMPLES · f38b0dbb
      Kan Liang authored
      After enlarging the PEBS interrupt threshold, there may be some mixed up
      PEBS samples which are discarded by the kernel.
      
      This patch makes the kernel emit a PERF_RECORD_LOST_SAMPLES record with
      the number of possible discarded records when it is impossible to demux
      the samples.
      
      It makes sure the user is not left in the dark about such discards.
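      As seen from user space, the new record carries just the count of
      discarded samples; its ring-buffer layout is roughly the following (a
      sketch based on the PERF_RECORD_LOST_SAMPLES description in the perf ABI
      header; the usual sample_id trailer follows when sample_id_all is set):

        #include <linux/perf_event.h>
        #include <stdint.h>

        struct lost_samples_event {
            struct perf_event_header header;  /* header.type == PERF_RECORD_LOST_SAMPLES */
            uint64_t                 lost;    /* number of discarded (PEBS) samples */
            /* followed by the sample_id fields when sample_id_all is set */
        };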
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1431285195-14269-8-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f38b0dbb
    • perf/x86/intel: Handle multiple records in the PEBS buffer · 21509084
      Yan, Zheng authored
      When the PEBS interrupt threshold is larger than one record and the
      machine supports multiple PEBS events, the records of these events are
      mixed up and we need to demultiplex them.
      
      Demuxing the records is hard because the hardware is deficient. The
      hardware has two issues that, when combined, create impossible
      scenarios to demux.
      
      The first issue is that the 'status' field of the PEBS record is a copy
      of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
      problem let us first describe the regular PEBS cycle:
      
      A) the CTRn value reaches 0:
        - the corresponding bit in GLOBAL_STATUS gets set
        - we start arming the hardware assist
        < some unspecified amount of time later -- this could cover multiple
          events of interest >
      
      B) the hardware assist is armed, any next event will trigger it
      
      C) a matching event happens:
        - the hardware assist triggers and generates a PEBS record
          this includes a copy of GLOBAL_STATUS at this moment
        - if we auto-reload we (re)set CTRn
        - we clear the relevant bit in GLOBAL_STATUS
      
      Now consider the following chain of events:
      
        A0, B0, A1, C0
      
      The event generated for counter 0 will include a status with counter 1
      set, even though it's not at all related to the record. A similar thing
      can happen with a !PEBS event if it just happens to overflow at the
      right moment.
      
      The second issue is that the hardware will only emit one record for two
      or more counters if the event that triggers the assist is 'close'. The
      'close' can be several cycles. In some cases even the complete assist,
      if the event is something that doesn't need retirement.
      
      For instance, consider this chain of events:
      
        A0, B0, A1, B1, C01
      
      Where C01 is an event that triggers both hardware assists, we will
      generate but a single record, but again with both counters listed in the
      status field.
      
      This time the record pertains to both events.
      
      Note that these two cases are different but indistinguishable with the
      data as generated. Therefore demuxing records with multiple PEBS bits
      (we can safely ignore status bits for !PEBS counters) is impossible.
      
      Furthermore we cannot emit the record to both events because that might
      cause a data leak -- the events might not have the same privileges -- so
      what this patch does is discard such events.
      
      The assumption/hope is that such discards will be rare.
      
      Here are some possible ways you may get a high discard rate:
      
        - when you count the same thing multiple times. But it is not a useful
          configuration.
        - you can be unfortunate if you measure with a userspace only PEBS
          event along with either a kernel or unrestricted PEBS event. Imagine
          the event triggering and setting the overflow flag right before
          entering the kernel. Then all kernel side events will end up with
          multiple bits set.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      [ Changelog improvements. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1430940834-8964-4-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      21509084
  25. 27 May 2015, 1 commit
    • perf: allow for PMU-specific event filtering · 66eb579e
      Mark Rutland authored
      In certain circumstances it may not be possible to schedule particular
      events due to constraints other than a lack of hardware counters (e.g.
      on big.LITTLE systems where CPUs support different events). The core
      perf event code does not distinguish these cases and pessimistically
      assumes that any failure to schedule an event means that it is not worth
      attempting to schedule later events, even if some hardware counters are
      still unused.
      
      When an event that a PMU cannot schedule exists in a flexible group list,
      it can unnecessarily prevent event groups following it in the list from
      being scheduled (until it is rotated to the end of the list). This means
      some events are scheduled for only a portion of the time they could be,
      and for short running programs no events may be scheduled if the list is
      initially sorted in an unfortunate order.
      
      This patch adds a new (optional) filter_match function pointer to struct
      pmu which a pmu driver can use to tell perf core when an event matches
      pmu-specific scheduling requirements. This plugs into the existing
      event_filter_match logic, and makes it possible to avoid the scheduling
      problem described above. When no filter is provided by the PMU, the
      existing behaviour is retained.
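      A sketch of the hook and how the core might consult it (simplified plain
      C; the struct and helper names here are illustrative, only the optional
      filter_match callback itself comes from the patch description):

        struct event;

        struct pmu_ops {
            /* Return nonzero if this PMU can schedule the event right now. */
            int (*filter_match)(struct event *event);
        };

        struct event {
            struct pmu_ops *pmu;
        };

        static int pmu_filter_match(struct event *event)
        {
            /* No filter provided: keep the existing behaviour, event matches. */
            if (!event->pmu->filter_match)
                return 1;

            return event->pmu->filter_match(event);
        }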
      
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      66eb579e