1. 27 3月, 2020 1 次提交
  2. 11 2月, 2020 1 次提交
    • K
      perf/core: Add new branch sample type for HW index of raw branch records · bbfd5e4f
      Kan Liang 提交于
      The low level index is the index in the underlying hardware buffer of
      the most recently captured taken branch which is always saved in
      branch_entries[0]. It is very useful for reconstructing the call stack.
      For example, in Intel LBR call stack mode, the depth of reconstructed
      LBR call stack limits to the number of LBR registers. With the low level
      index information, perf tool may stitch the stacks of two samples. The
      reconstructed LBR call stack can break the HW limitation.
      
      Add a new branch sample type to retrieve low level index of raw branch
      records. The low level index is between -1 (unknown) and max depth which
      can be retrieved in /sys/devices/cpu/caps/branches.
      
      Only when the new branch sample type is set, the low level index
      information is dumped into the PERF_SAMPLE_BRANCH_STACK output.
      Perf tool should check the attr.branch_sample_type, and apply the
      corresponding format for PERF_SAMPLE_BRANCH_STACK samples.
      Otherwise, some user case may be broken. For example, users may parse a
      perf.data, which include the new branch sample type, with an old version
      perf tool (without the check). Users probably get incorrect information
      without any warning.
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20200127165355.27495-2-kan.liang@linux.intel.com
      bbfd5e4f
  3. 13 11月, 2019 1 次提交
    • A
      perf/aux: Allow using AUX data in perf samples · a4faf00d
      Alexander Shishkin 提交于
      AUX data can be used to annotate perf events such as performance counters
      or tracepoints/breakpoints by including it in sample records when
      PERF_SAMPLE_AUX flag is set. Such samples would be instrumental in debugging
      and profiling by providing, for example, a history of instruction flow
      leading up to the event's overflow.
      
      The implementation makes use of grouping an AUX event with all the events
      that wish to take samples of the AUX data, such that the former is the
      group leader. The samplees should also specify the desired size of the AUX
      sample via attr.aux_sample_size.
      
      AUX capable PMUs need to explicitly add support for sampling, because it
      relies on a new callback to take a snapshot of the buffer without touching
      the event states.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: adrian.hunter@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: https://lkml.kernel.org/r/20191025140835.53665-2-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a4faf00d
  4. 28 8月, 2019 1 次提交
  5. 22 1月, 2019 2 次提交
  6. 21 1月, 2019 1 次提交
    • A
      perf/core: Remove unused perf_flags · ad07c8ce
      Andrew Murray 提交于
      Now that perf_flags is not used we remove it.
      Signed-off-by: NAndrew Murray <andrew.murray@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sascha Hauer <s.hauer@pengutronix.de>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: robin.murphy@arm.com
      Cc: suzuki.poulose@arm.com
      Link: https://lkml.kernel.org/r/1547128414-50693-13-git-send-email-andrew.murray@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ad07c8ce
  7. 31 10月, 2018 1 次提交
  8. 10 9月, 2018 1 次提交
  9. 25 7月, 2018 1 次提交
    • P
      perf/x86/intel: Fix unwind errors from PEBS entries (mk-II) · 6cbc304f
      Peter Zijlstra 提交于
      Vince reported the perf_fuzzer giving various unwinder warnings and
      Josh reported:
      
      > Deja vu.  Most of these are related to perf PEBS, similar to the
      > following issue:
      >
      >   b8000586 ("perf/x86/intel: Cure bogus unwind from PEBS entries")
      >
      > This is basically the ORC version of that.  setup_pebs_sample_data() is
      > assembling a franken-pt_regs which ORC isn't happy about.  RIP is
      > inconsistent with some of the other registers (like RSP and RBP).
      
      And where the previous unwinder only needed BP,SP ORC also requires
      IP. But we cannot spoof IP because then the sample will get displaced,
      entirely negating the point of PEBS.
      
      So cure the whole thing differently by doing the unwind early; this
      does however require a means to communicate we did the unwind early.
      We (ab)use an unused sample_type bit for this, which we set on events
      that fill out the data->callchain before the normal
      perf_prepare_sample().
      Debugged-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Reported-by: NVince Weaver <vincent.weaver@maine.edu>
      Tested-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Tested-by: NPrashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6cbc304f
  10. 17 4月, 2018 1 次提交
  11. 13 3月, 2018 1 次提交
    • M
      perf/core: Implement fast breakpoint modification via _IOC_MODIFY_ATTRIBUTES · 32ff77e8
      Milind Chabbi 提交于
      Problem and motivation: Once a breakpoint perf event (PERF_TYPE_BREAKPOINT)
      is created, there is no flexibility to change the breakpoint type
      (bp_type), breakpoint address (bp_addr), or breakpoint length (bp_len). The
      only option is to close the perf event and configure a new breakpoint
      event. This inflexibility has a significant performance overhead. For
      example, sampling-based, lightweight performance profilers (and also
      concurrency bug detection tools),  monitor different addresses for a short
      duration using PERF_TYPE_BREAKPOINT and change the address (bp_addr) to
      another address or change the kind of breakpoint (bp_type) from  "write" to
      a "read" or vice-versa or change the length (bp_len) of the address being
      monitored. The cost of these modifications is prohibitive since it involves
      unmapping the circular buffer associated with the perf event, closing the
      perf event, opening another perf event and mmaping another circular buffer.
      
      Solution: The new ioctl flag for perf events,
      PERF_EVENT_IOC_MODIFY_ATTRIBUTES, introduced in this patch takes a pointer
      to a struct perf_event_attr as an argument to update an old breakpoint
      event with new address, type, and size. This facility allows retaining a
      previous mmaped perf events ring buffer and avoids having to close and
      reopen another perf event.
      
      This patch supports only changing PERF_TYPE_BREAKPOINT event type; future
      implementations can extend this feature. The patch replicates some of its
      functionality of modify_user_hw_breakpoint() in
      kernel/events/hw_breakpoint.c. modify_user_hw_breakpoint cannot be called
      directly since perf_event_ctx_lock() is already held in _perf_ioctl().
      
      Evidence: Experiments show that the baseline (not able to modify an already
      created breakpoint) costs an order of magnitude (~10x) more than the
      suggested optimization (having the ability to dynamically modifying a
      configured breakpoint via ioctl). When the breakpoints typically do not
      trap, the speedup due to the suggested optimization is ~10x; even when the
      breakpoints always trap, the speedup is ~4x due to the suggested
      optimization.
      
      Testing: tests posted at
      https://github.com/linux-contrib/perf_event_modify_bp demonstrate the
      performance significance of this patch. Tests also check the functional
      correctness of the patch.
      Signed-off-by: NMilind Chabbi <chabbi.milind@gmail.com>
      [ Using modify_user_hw_breakpoint_check function. ]
      [ Reformated PERF_EVENT_IOC_*, so the values are all in one column. ]
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Cc: Jin Yao <yao.jin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Oleg Nesterov <onestero@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: http://lkml.kernel.org/r/20180312134548.31532-8-jolsa@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      32ff77e8
  12. 06 2月, 2018 1 次提交
  13. 08 1月, 2018 2 次提交
  14. 13 12月, 2017 1 次提交
    • Y
      bpf/tracing: allow user space to query prog array on the same tp · f371b304
      Yonghong Song 提交于
      Commit e87c6bc3 ("bpf: permit multiple bpf attachments
      for a single perf event") added support to attach multiple
      bpf programs to a single perf event.
      Although this provides flexibility, users may want to know
      what other bpf programs attached to the same tp interface.
      Besides getting visibility for the underlying bpf system,
      such information may also help consolidate multiple bpf programs,
      understand potential performance issues due to a large array,
      and debug (e.g., one bpf program which overwrites return code
      may impact subsequent program results).
      
      Commit 2541517c ("tracing, perf: Implement BPF programs
      attached to kprobes") utilized the existing perf ioctl
      interface and added the command PERF_EVENT_IOC_SET_BPF
      to attach a bpf program to a tracepoint. This patch adds a new
      ioctl command, given a perf event fd, to query the bpf program
      array attached to the same perf tracepoint event.
      
      The new uapi ioctl command:
        PERF_EVENT_IOC_QUERY_BPF
      
      The new uapi/linux/perf_event.h structure:
        struct perf_event_query_bpf {
             __u32	ids_len;
             __u32	prog_cnt;
             __u32	ids[0];
        };
      
      User space provides buffer "ids" for kernel to copy to.
      When returning from the kernel, the number of available
      programs in the array is set in "prog_cnt".
      
      The usage:
        struct perf_event_query_bpf *query =
          malloc(sizeof(*query) + sizeof(u32) * ids_len);
        query.ids_len = ids_len;
        err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
        if (err == 0) {
          /* query.prog_cnt is the number of available progs,
           * number of progs in ids: (ids_len == 0) ? 0 : query.prog_cnt
           */
        } else if (errno == ENOSPC) {
          /* query.ids_len number of progs copied,
           * query.prog_cnt is the number of available progs
           */
        } else {
            /* other errors */
        }
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f371b304
  15. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX license identifier to uapi header files with a license · e2be04c7
      Greg Kroah-Hartman 提交于
      Many user space API headers have licensing information, which is either
      incomplete, badly formatted or just a shorthand for referring to the
      license under which the file is supposed to be.  This makes it hard for
      compliance tools to determine the correct license.
      
      Update these files with an SPDX license identifier.  The identifier was
      chosen based on the license information in the file.
      
      GPL/LGPL licensed headers get the matching GPL/LGPL SPDX license
      identifier with the added 'WITH Linux-syscall-note' exception, which is
      the officially assigned exception identifier for the kernel syscall
      exception:
      
         NOTE! This copyright does *not* cover user programs that use kernel
         services by normal system calls - this is merely considered normal use
         of the kernel, and does *not* fall under the heading of "derived work".
      
      This exception makes it possible to include GPL headers into non GPL
      code, without confusing license compliance tools.
      
      Headers which have either explicit dual licensing or are just licensed
      under a non GPL license are updated with the corresponding SPDX
      identifier and the GPLv2 with syscall exception identifier.  The format
      is:
              ((GPL-2.0 WITH Linux-syscall-note) OR SPDX-ID-OF-OTHER-LICENSE)
      
      SPDX license identifiers are a legally binding shorthand, which can be
      used instead of the full boiler plate text.  The update does not remove
      existing license information as this has to be done on a case by case
      basis and the copyright holders might have to be consulted. This will
      happen in a separate step.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.  See the previous patch in this series for the
      methodology of how this patch was researched.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e2be04c7
  16. 18 10月, 2017 1 次提交
  17. 29 8月, 2017 1 次提交
    • K
      perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR · fc7ce9c7
      Kan Liang 提交于
      For understanding how the workload maps to memory channels and hardware
      behavior, it's very important to collect address maps with physical
      addresses. For example, 3D XPoint access can only be found by filtering
      the physical address.
      
      Add a new sample type for physical address.
      
      perf already has a facility to collect data virtual address. This patch
      introduces a function to convert the virtual address to physical address.
      The function is quite generic and can be extended to any architecture as
      long as a virtual address is provided.
      
       - For kernel direct mapping addresses, virt_to_phys is used to convert
         the virtual addresses to physical address.
      
       - For user virtual addresses, __get_user_pages_fast is used to walk the
         pages tables for user physical address.
      
       - This does not work for vmalloc addresses right now. These are not
         resolved, but code to do that could be added.
      
      The new sample type requires collecting the virtual address. The
      virtual address will not be output unless SAMPLE_ADDR is applied.
      
      For security, the physical address can only be exposed to root or
      privileged user.
      Tested-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: mpe@ellerman.id.au
      Link: http://lkml.kernel.org/r/1503967969-48278-1-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fc7ce9c7
  18. 25 8月, 2017 1 次提交
    • A
      perf/x86: Fix data source decoding for Skylake · 6ae5fa61
      Andi Kleen 提交于
      Skylake changed the encoding of the PEBS data source field.
      Some combinations are not available anymore, but some new cases
      e.g. for L4 cache hit are added.
      
      Fix up the conversion table for Skylake, similar as had been done
      for Nehalem.
      
      On Skylake server the encoding for L4 actually means persistent
      memory. Handle this case too.
      
      To properly describe it in the abstracted perf format I had to add
      some new fields. Since a hit can have only one level add a new
      field that is an enumeration, not a bit field to describe
      the level. It can describe any level. Some numbers are also
      used to describe PMEM and LFB.
      
      Also add a new generic remote flag that can be combined with
      the generic level to signify a remote cache.
      
      And there is an extension field for the snoop indication to handle
      the Forward state.
      
      I didn't add a generic flag for hops because it's not needed
      for Skylake.
      
      I changed the existing encodings for older CPUs to also fill in the
      new level and remote fields.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: http://lkml.kernel.org/r/20170816222156.19953-3-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6ae5fa61
  19. 19 7月, 2017 1 次提交
    • J
      perf/core: Define the common branch type classification · eb0baf8a
      Jin Yao 提交于
      It is often useful to know the branch types while analyzing branch data.
      For example, a call is very different from a conditional branch.
      
      Currently we have to look it up in binary while the binary may later not
      be available and even the binary is available but user has to take some
      time. It is very useful for user to check it directly in perf report.
      
      Perf already has support for disassembling the branch instruction to get
      the x86 branch type.
      
      To keep consistent on kernel and userspace and make the classification
      more common, the patch adds the common branch type classification
      in perf_event.h.
      
      The patch only defines a minimum but most common set of branch types.
      
      PERF_BR_UNKNOWN         : unknown
      PERF_BR_COND            :conditional
      PERF_BR_UNCOND          : unconditional
      PERF_BR_IND             : indirect
      PERF_BR_CALL            : function call
      PERF_BR_IND_CALL        : indirect function call
      PERF_BR_RET             : function return
      PERF_BR_SYSCALL         : syscall
      PERF_BR_SYSRET          : syscall return
      PERF_BR_COND_CALL       : conditional function call
      PERF_BR_COND_RET        : conditional function return
      
      The patch also adds a new field type (4 bits) in perf_branch_entry
      to record the branch type.
      
      Since the disassembling of branch instruction needs some overhead,
      a new PERF_SAMPLE_BRANCH_TYPE_SAVE is introduced to indicate if it
      needs to disassemble the branch instruction and record the branch
      type.
      
      Change log:
      
      v10: Not changed.
      
      v9: Not changed.
      
      v8: Change PERF_BR_NONE to PERF_BR_UNKNOWN.
          No other change.
      
      v7: Just keep the most common branch types.
          Others are removed.
      
      v6: Not changed.
      
      v5: Not changed. The v5 patch series just change the userspace.
      
      v4: Comparing to previous version, the major changes are:
      
      1. Remove the PERF_BR_JCC_FWD/PERF_BR_JCC_BWD, they will be
         computed later in userspace.
      
      2. Remove the "cross" field in perf_branch_entry. The cross page
         computing will be done later in userspace.
      Signed-off-by: NYao Jin <yao.jin@linux.intel.com>
      Acked-by: NJiri Olsa <jolsa@kernel.org>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Link: http://lkml.kernel.org/r/1500379995-6449-2-git-send-email-yao.jin@linux.intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      eb0baf8a
  20. 19 4月, 2017 1 次提交
    • S
      powerpc/perf: Define big-endian version of perf_mem_data_src · 8c5073db
      Sukadev Bhattiprolu 提交于
      perf_mem_data_src is a union that is initialized in the kernel via the ->val
      field and accessed by userspace via the mem_xxx bitfields. For this to work
      correctly on big endian platforms, we need a big-endian definition for the
      bitfields.
      
      Currently on a big endian system, if a user requests PERF_SAMPLE_DATA_SRC (perf
      report -d), they will get the default value from perf_sample_data_init(), which
      is PERF_MEM_NA. The value for PERF_MEM_NA is constructed using shifts:
      
        /* TLB access */
        #define PERF_MEM_TLB_NA		0x01 /* not available */
        ...
        #define PERF_MEM_TLB_SHIFT	26
      
        #define PERF_MEM_S(a, s) \
      	(((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
      
        #define PERF_MEM_NA (PERF_MEM_S(OP, NA)   |\
      		    PERF_MEM_S(LVL, NA)   |\
      		    PERF_MEM_S(SNOOP, NA) |\
      		    PERF_MEM_S(LOCK, NA)  |\
      		    PERF_MEM_S(TLB, NA))
      
      Which works out as:
      
        ((0x01 << 0) | (0x01 << 5) | (0x01 << 19) | (0x01 << 24) | (0x01 << 26))
      
      Which means the PERF_MEM_NA value comes out of the kernel as 0x5080021
      in CPU endian.
      
      But then in the perf tool, the code uses the bitfields to inspect the value, and
      currently the bitfields are defined using little endian ordering.
      
      So eg. in perf_mem__tlb_scnprintf() we see:
        data_src->val = 0x5080021
                   op = 0x0
                  lvl = 0x0
                snoop = 0x0
                 lock = 0x0
                 dtlb = 0x0
                 rsvd = 0x5080021
      
      Because of the way the perf tool code is written this is still displayed to the
      user as "N/A", so there is no bug visible at the UI level.
      
      Currently there are no big endian architectures which export a meaningful
      value (ie. other than PERF_MEM_NA), so the extent of the bug on big endian
      platforms is that the PERF_MEM_NA value is exported incorrectly as described
      above. Subsequent patches will add support on big endian powerpc for populating
      the data source value.
      
      This patch does a minimal fix of adding big endian definition of the bitfields
      to match the values that are already exported by the kernel on big endian. And
      it makes no change on little endian.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Signed-off-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8c5073db
  21. 16 3月, 2017 1 次提交
  22. 14 3月, 2017 1 次提交
    • H
      perf: Add PERF_RECORD_NAMESPACES to include namespaces related info · e4222673
      Hari Bathini 提交于
      With the advert of container technologies like docker, that depend on
      namespaces for isolation, there is a need for tracing support for
      namespaces. This patch introduces new PERF_RECORD_NAMESPACES event for
      recording namespaces related info. By recording info for every
      namespace, it is left to userspace to take a call on the definition of a
      container and trace containers by updating perf tool accordingly.
      
      Each namespace has a combination of device and inode numbers. Though
      every namespace has the same device number currently, that may change in
      future to avoid the need for a namespace of namespaces. Considering such
      possibility, record both device and inode numbers separately for each
      namespace.
      Signed-off-by: NHari Bathini <hbathini@linux.vnet.ibm.com>
      Acked-by: NJiri Olsa <jolsa@kernel.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      e4222673
  23. 30 5月, 2016 1 次提交
    • A
      perf core: Per event callchain limit · 97c79a38
      Arnaldo Carvalho de Melo 提交于
      Additionally to being able to control the system wide maximum depth via
      /proc/sys/kernel/perf_event_max_stack, now we are able to ask for
      different depths per event, using perf_event_attr.sample_max_stack for
      that.
      
      This uses an u16 hole at the end of perf_event_attr, that, when
      perf_event_attr.sample_type has the PERF_SAMPLE_CALLCHAIN, if
      sample_max_stack is zero, means use perf_event_max_stack, otherwise
      it'll be bounds checked under callchain_mutex.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      97c79a38
  24. 17 5月, 2016 1 次提交
    • A
      perf core: Separate accounting of contexts and real addresses in a stack trace · c85b0334
      Arnaldo Carvalho de Melo 提交于
      The perf_sample->ip_callchain->nr value includes all the entries in the
      ip_callchain->ip[] array, real addresses and PERF_CONTEXT_{KERNEL,USER,etc},
      while what the user expects is that what is in the kernel.perf_event_max_stack
      sysctl or in the upcoming per event perf_event_attr.sample_max_stack knob be
      honoured in terms of IP addresses in the stack trace.
      
      So allocate a bunch of extra entries for contexts, and do the accounting
      via perf_callchain_entry_ctx struct members.
      
      A new sysctl, kernel.perf_event_max_contexts_per_stack is also
      introduced for investigating possible bugs in the callchain
      implementation by some arch.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      c85b0334
  25. 23 4月, 2016 1 次提交
    • W
      perf/core: Add ::write_backward attribute to perf event · 9ecda41a
      Wang Nan 提交于
      This patch introduces 'write_backward' bit to perf_event_attr, which
      controls the direction of a ring buffer. After set, the corresponding
      ring buffer is written from end to beginning. This feature is design to
      support reading from overwritable ring buffer.
      
      Ring buffer can be created by mapping a perf event fd. Kernel puts event
      records into ring buffer, user tooling like perf fetch them from
      address returned by mmap(). To prevent racing between kernel and tooling,
      they communicate to each other through 'head' and 'tail' pointers.
      Kernel maintains 'head' pointer, points it to the next free area (tail
      of the last record). Tooling maintains 'tail' pointer, points it to the
      tail of last consumed record (record has already been fetched). Kernel
      determines the available space in a ring buffer using these two
      pointers to avoid overwrite unfetched records.
      
      By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
      Different from normal ring buffer, tooling is unable to maintain 'tail'
      pointer because writing is forbidden. Therefore, for this type of ring
      buffers, kernel overwrite old records unconditionally, works like flight
      recorder. This feature would be useful if reading from overwritable ring
      buffer were as easy as reading from normal ring buffer. However,
      there's an obscure problem.
      
      The following figure demonstrates a full overwritable ring buffer. In
      this figure, the 'head' pointer points to the end of last record, and a
      long record 'E' is pending. For a normal ring buffer, a 'tail' pointer
      would have pointed to position (X), so kernel knows there's no more
      space in the ring buffer. However, for an overwritable ring buffer,
      kernel ignore the 'tail' pointer.
      
         (X)                              head
          .                                |
          .                                V
          +------+-------+----------+------+---+
          |A....A|B.....B|C........C|D....D|   |
          +------+-------+----------+------+---+
      
      Record 'A' is overwritten by event 'E':
      
            head
             |
             V
          +--+---+-------+----------+------+---+
          |.E|..A|B.....B|C........C|D....D|E..|
          +--+---+-------+----------+------+---+
      
      Now tooling decides to read from this ring buffer. However, none of these
      two natural positions, 'head' and the start of this ring buffer, are
      pointing to the head of a record. Even the full ring buffer can be
      accessed by tooling, it is unable to find a position to start decoding.
      
      The first attempt tries to solve this problem AFAIK can be found from
      [1]. It makes kernel to maintain 'tail' pointer: updates it when ring
      buffer is half full. However, this approach introduces overhead to
      fast path. Test result shows a 1% overhead [2]. In addition, this method
      utilizes no more tham 50% records.
      
      Another attempt can be found from [3], which allows putting the size of
      an event at the end of each record. This approach allows tooling to find
      records in a backward manner from 'head' pointer by reading size of a
      record from its tail. However, because of alignment requirement, it
      needs 8 bytes to record the size of a record, which is a huge waste. Its
      performance is also not good, because more data need to be written.
      This approach also introduces some extra branch instructions to fast
      path.
      
      'write_backward' is a better solution to this problem.
      
      Following figure demonstrates the state of the overwritable ring buffer
      when 'write_backward' is set before overwriting:
      
             head
              |
              V
          +---+------+----------+-------+------+
          |   |D....D|C........C|B.....B|A....A|
          +---+------+----------+-------+------+
      
      and after overwriting:
                                           head
                                            |
                                            V
          +---+------+----------+-------+---+--+
          |..E|D....D|C........C|B.....B|A..|E.|
          +---+------+----------+-------+---+--+
      
      In each situation, 'head' points to the beginning of the newest record.
      From this record, tooling can iterate over the full ring buffer and fetch
      records one by one.
      
      The only limitation that needs to be considered is back-to-back reading.
      Due to the non-deterministic of user programs, it is impossible to ensure
      the ring buffer keeps stable during reading. Consider an extreme situation:
      tooling is scheduled out after reading record 'D', then a burst of events
      come, eat up the whole ring buffer (one or multiple rounds). When the
      tooling process comes back, reading after 'D' is incorrect now.
      
      To prevent this problem, we need to find a way to ensure the ring buffer
      is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
      suggested because its overhead is lower than
      ioctl(PERF_EVENT_IOC_ENABLE).
      
      By carefully verifying 'header' pointer, reader can avoid pausing the
      ring-buffer. For example:
      
          /* A union of all possible events */
          union perf_event event;
      
          p = head = perf_mmap__read_head();
          while (true) {
              /* copy header of next event */
              fetch(&event.header, p, sizeof(event.header));
      
              /* read 'head' pointer */
              head = perf_mmap__read_head();
      
              /* check overwritten: is the header good? */
              if (!verify(sizeof(event.header), p, head))
                  break;
      
              /* copy the whole event */
              fetch(&event, p, event.header.size);
      
              /* read 'head' pointer again */
              head = perf_mmap__read_head();
      
              /* is the whole event good? */
              if (!verify(event.header.size, p, head))
                  break;
              p += event.header.size;
          }
      
      However, the overhead is high because:
      
       a) In-place decoding is not safe.
          Copying-verifying-decoding is required.
       b) Fetching 'head' pointer requires additional synchronization.
      
      (From Alexei Starovoitov:
      
      Even when this trick works, pause is needed for more than stability of
      reading. When we collect the events into overwrite buffer we're waiting
      for some other trigger (like all cpu utilization spike or just one cpu
      running and all others are idle) and when it happens the buffer has
      valuable info from the past. At this point new events are no longer
      interesting and buffer should be paused, events read and unpaused until
      next trigger comes.)
      
      This patch utilizes event's default overflow_handler introduced
      previously. perf_event_output_backward() is created as the default
      overflow handler for backward ring buffers. To avoid extra overhead to
      fast path, original perf_event_output() becomes __perf_event_output()
      and marked '__always_inline'. In theory, there's no extra overhead
      introduced to fast path.
      
      Performance testing:
      
      Calling 3000000 times of 'close(-1)', use gettimeofday() to check
      duration.  Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
      system calls. In ns.
      
      Testing environment:
      
        CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
        Kernel : v4.5.0
                          MEAN         STDVAR
       BASE            800214.950    2853.083
       PRE1           2253846.700    9997.014
       PRE2           2257495.540    8516.293
       POST           2250896.100    8933.921
      
      Where 'BASE' is pure performance without capturing. 'PRE1' is test
      result of pure 'v4.5.0' kernel. 'PRE2' is test result before this
      patch. 'POST' is test result after this patch. See [4] for the detailed
      experimental setup.
      
      Considering the stdvar, this patch doesn't introduce performance
      overhead to the fast path.
      
       [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
       [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
       [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
       [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.comSigned-off-by: NWang Nan <wangnan0@huawei.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: <acme@kernel.org>
      Cc: <pi3orama@163.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1459865478-53413-1-git-send-email-wangnan0@huawei.com
      [ Fixed the changelog some more. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      9ecda41a
  26. 31 3月, 2016 1 次提交
    • W
      perf/ring_buffer: Introduce new ioctl options to pause and resume the ring-buffer · 86e7972f
      Wang Nan 提交于
      Add new ioctl() to pause/resume ring-buffer output.
      
      In some situations we want to read from the ring-buffer only when we
      ensure nothing can write to the ring-buffer during reading. Without
      this patch we have to turn off all events attached to this ring-buffer
      to achieve this.
      
      This patch is a prerequisite to enable overwrite support for the
      perf ring-buffer support. Following commits will introduce new methods
      support reading from overwrite ring buffer. Before reading, caller
      must ensure the ring buffer is frozen, or the reading is unreliable.
      Signed-off-by: NWang Nan <wangnan0@huawei.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <pi3orama@163.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1459147292-239310-2-git-send-email-wangnan0@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      86e7972f
  27. 23 11月, 2015 1 次提交
    • A
      perf/x86: Add option to disable reading branch flags/cycles · b16a5b52
      Andi Kleen 提交于
      With LBRv5 reading the extra LBR flags like mispredict, TSX, cycles is
      not free anymore, as it has moved to a separate MSR.
      
      For callstack mode we don't need any of this information; so we can
      avoid the unnecessary MSR read. Add flags to the perf interface where
      perf record can request not collecting this information.
      
      Add branch_sample_type flags for CYCLES and FLAGS. It's a bit unusual
      for branch_sample_types to be negative (disable), not positive (enable),
      but since the legacy ABI reported the flags we need some form of
      explicit disabling to avoid breaking the ABI.
      
      After we have the flags the x86 perf code can keep track if any users
      need the flags. If noone needs it the information is not collected.
      
      This cuts down the cost of LBR callstack on Skylake significantly.
      Profiling a kernel build with LBR call stack the average run time of
      the PMI handler drops by 43%.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: http://lkml.kernel.org/r/1445366797-30894-2-git-send-email-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b16a5b52
  28. 22 10月, 2015 1 次提交
    • A
      bpf: introduce bpf_perf_event_output() helper · a43eec30
      Alexei Starovoitov 提交于
      This helper is used to send raw data from eBPF program into
      special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event.
      User space needs to perf_event_open() it (either for one or all cpus) and
      store FD into perf_event_array (similar to bpf_perf_event_read() helper)
      before eBPF program can send data into it.
      
      Today the programs triggered by kprobe collect the data and either store
      it into the maps or print it via bpf_trace_printk() where latter is the debug
      facility and not suitable to stream the data. This new helper replaces
      such bpf_trace_printk() usage and allows programs to have dedicated
      channel into user space for post-processing of the raw data collected.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a43eec30
  29. 20 10月, 2015 2 次提交
    • S
      perf: Add PERF_SAMPLE_BRANCH_CALL · c229bf9d
      Stephane Eranian 提交于
      Add a new branch sample type to cover only call branches (function calls).
      The current ANY_CALL included direct, indirect calls and far jumps.
      
      We want to be able to differentiate indirect from direct calls. Therefore
      we introduce PERF_SAMPLE_BRANCH_CALL. The implementation is up to each
      architecture.
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: khandual@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/1444720151-10275-2-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c229bf9d
    • A
      perf/x86: Fix time_shift in perf_event_mmap_page · b9511cd7
      Adrian Hunter 提交于
      Commit:
      
        b20112ed ("perf/x86: Improve accuracy of perf/sched clock")
      
      allowed the time_shift value in perf_event_mmap_page to be as much
      as 32.  Unfortunately the documented algorithms for using time_shift
      have it shifting an integer, whereas to work correctly with the value
      32, the type must be u64.
      
      In the case of perf tools, Intel PT decodes correctly but the timestamps
      that are output (for example by perf script) have lost 32-bits of
      granularity so they look like they are not changing at all.
      
      Fix by limiting the shift to 31 and adjusting the multiplier accordingly.
      
      Also update the documentation of perf_event_mmap_page so that new code
      based on it will be more future-proof.
      Signed-off-by: NAdrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: b20112ed ("perf/x86: Improve accuracy of perf/sched clock")
      Link: http://lkml.kernel.org/r/1445001845-13688-2-git-send-email-adrian.hunter@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b9511cd7
  30. 04 8月, 2015 1 次提交
  31. 24 7月, 2015 1 次提交
    • A
      perf: Add PERF_RECORD_SWITCH to indicate context switches · 45ac1403
      Adrian Hunter 提交于
      There are already two events for context switches, namely the tracepoint
      sched:sched_switch and the software event context_switches.
      Unfortunately neither are suitable for use by non-privileged users for
      the purpose of synchronizing hardware trace data (e.g. Intel PT) to the
      context switch.
      
      Tracepoints are no good at all for non-privileged users because they
      need either CAP_SYS_ADMIN or /proc/sys/kernel/perf_event_paranoid <= -1.
      
      On the other hand, kernel software events need either CAP_SYS_ADMIN or
      /proc/sys/kernel/perf_event_paranoid <= 1.
      
      Now many distributions do default perf_event_paranoid to 1 making
      context_switches a contender, except it has another problem (which is
      also shared with sched:sched_switch) which is that it happens before
      perf schedules events out instead of after perf schedules events in.
      Whereas a privileged user can see all the events anyway, a
      non-privileged user only sees events for their own processes, in other
      words they see when their process was scheduled out not when it was
      scheduled in. That presents two problems to use the event:
      
      1. the information comes too late, so tools have to look ahead in the
         event stream to find out what the current state is
      
      2. if they are unlucky tracing might have stopped before the
         context-switches event is recorded.
      
      This new PERF_RECORD_SWITCH event does not have those problems
      and it also has a couple of other small advantages.
      
      It is easier to use because it is an auxiliary event (like mmap, comm
      and task events) which can be enabled by setting a single bit. It is
      smaller than sched:sched_switch and easier to parse.
      
      To make the event useful for privileged users also, if the
      context is cpu-wide then the event record will be
      PERF_RECORD_SWITCH_CPU_WIDE which is the same as
      PERF_RECORD_SWITCH except it also provides the next or
      previous pid/tid.
      Signed-off-by: NAdrian Hunter <adrian.hunter@intel.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NJiri Olsa <jolsa@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Pawel Moll <pawel.moll@arm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Link: http://lkml.kernel.org/r/1437471846-26995-2-git-send-email-adrian.hunter@intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      45ac1403
  32. 20 6月, 2015 1 次提交
  33. 07 6月, 2015 2 次提交
  34. 02 4月, 2015 3 次提交
    • A
      perf: Add ITRACE_START record to indicate that tracing has started · ec0d7729
      Alexander Shishkin 提交于
      For counters that generate AUX data that is bound to the context of a
      running task, such as instruction tracing, the decoder needs to know
      exactly which task is running when the event is first scheduled in,
      before the first sched_switch. The decoder's need to know this stems
      from the fact that instruction flow trace decoding will almost always
      require program's object code in order to reconstruct said flow and
      for that we need at least its pid/tid in the perf stream.
      
      To single out such instruction tracing pmus, this patch introduces
      ITRACE PMU capability. The reason this is not part of RECORD_AUX
      record is that not all pmus capable of generating AUX data need this,
      and the opposite is *probably* also true.
      
      While sched_switch covers for most cases, there are two problems with it:
      the consumer will need to process events out of order (that is, having
      found RECORD_AUX, it will have to skip forward to the nearest sched_switch
      to figure out which task it was, then go back to the actual trace to
      decode it) and it completely misses the case when the tracing is enabled
      and disabled before sched_switch, for example, via PERF_EVENT_IOC_DISABLE.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-15-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ec0d7729
    • A
      perf: Add wakeup watermark control to the AUX area · 1a594131
      Alexander Shishkin 提交于
      When AUX area gets a certain amount of new data, we want to wake up
      userspace to collect it. This adds a new control to specify how much
      data will cause a wakeup. This is then passed down to pmu drivers via
      output handle's "wakeup" field, so that the driver can find the nearest
      point where it can generate an interrupt.
      
      We repurpose __reserved_2 in the event attribute for this, even though
      it was never checked to be zero before, aux_watermark will only matter
      for new AUX-aware code, so the old code should still be fine.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-10-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1a594131
    • A
      perf: Support overwrite mode for the AUX area · 2023a0d2
      Alexander Shishkin 提交于
      This adds support for overwrite mode in the AUX area, which means "keep
      collecting data till you're stopped", turning AUX area into a circular
      buffer, where new data overwrites old data. It does not depend on data
      buffer's overwrite mode, so that it doesn't lose sideband data that is
      instrumental for processing AUX data.
      
      Overwrite mode is enabled at mapping AUX area read only. Even though
      aux_tail in the buffer's user page might be user writable, it will be
      ignored in this mode.
      
      A PERF_RECORD_AUX with PERF_AUX_FLAG_OVERWRITE set is written to the perf
      data stream every time an event writes new data to the AUX area. The pmu
      driver might not be able to infer the exact beginning of the new data in
      each snapshot, some drivers will only provide the tail, which is
      aux_offset + aux_size in the AUX record. Consumer has to be able to tell
      the new data from the old one, for example, by means of time stamps if
      such are provided in the trace.
      
      Consumer is also responsible for disabling any events that might write
      to the AUX area (thus potentially racing with the consumer) before
      collecting the data.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-9-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2023a0d2