1. 06 Apr 2021, 1 commit
    • perf: aux: Add flags for the buffer format · 547b6098
      Authored by Suzuki K Poulose
      Allocate a byte for advertising the PMU-specific format type of the
      given AUX record. A PMU could end up providing hardware trace data
      in multiple formats in a single session.
      
      E.g., the format of the hardware buffer produced by the CoreSight
      ETM PMU depends on the type of "sink" device used for collecting an
      event's trace (traditional TMC-ETR/ETBs with formatting, or TRBEs
      without any formatting).
      
       # Boring story of why this is needed. Go to The_End_of_Story to skip it.
      
      CoreSight ETM trace allows instruction-level tracing of Arm CPUs.
      The ETM generates the CPU execution trace and pumps it into the
      CoreSight AMBA Trace Bus, where it is collected by a different
      CoreSight component (traditionally a CoreSight TMC-ETR/ETB/ETF),
      called a "sink". It is important to note that there is no guarantee
      that every CPU has a dedicated sink. Thus multiple ETMs could pump
      their trace data into the same "sink", and the sink therefore
      applies additional formatting to the trace data so that the user
      can decode it properly and attribute it to the corresponding ETM.
      
      However, with the introduction of the Arm Trace Buffer Extension
      (TRBE), we now have a dedicated, per-CPU, architected sink for
      collecting the trace. Since the TRBE is always per-CPU, it doesn't
      apply any formatting to the trace. The support for this driver is
      under review [1].
      
      Now a system could have a per-CPU TRBE and one or more shared
      TMC-ETRs. A user could choose a "specific" sink for a perf session
      (e.g., a TMC-ETR), or the driver could automatically select the
      nearest sink for a given ETM. It is possible that some ETMs end up
      using a TMC-ETR (e.g., if the TRBE is not usable on the CPU) while
      others use the TRBE in a single perf session. Thus we now have
      "formatted" trace collected from the TMC-ETR and "unformatted"
      trace collected from the TRBE. However, we never get into a
      situation where a single event could end up using both TMC-ETR and
      TRBE; i.e., any AUX buffer is guaranteed to be either RAW or
      FORMATTED, but not a mix of both.
      
      As for perf decoding, we need to know the type of the data in the
      individual AUX buffers, so that the tool can set up the "OpenCSD"
      (library for decoding CoreSight trace) decoder instance
      appropriately. Thus the perf.data file must contain the hints for
      the tool to decode the data correctly.
      
      Since this is a runtime variable, and the perf tool has no control
      over which sink gets used (in case of automatic sink selection),
      we need this information made available from the PMU driver for
      each AUX record.
      
       # The_End_of_Story
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: alexander.shishkin@linux.intel.com
      Cc: mingo@redhat.com
      Cc: will@kernel.org
      Cc: mark.rutland@arm.com
      Cc: mike.leach@linaro.org
      Cc: acme@kernel.org
      Cc: jolsa@redhat.com
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Reviewed-by: Mike Leach <mike.leach@linaro.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Link: https://lore.kernel.org/r/20210405164307.1720226-2-suzuki.poulose@arm.com
      Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
      547b6098
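
      A minimal sketch of how a trace decoder might act on this per-record
      format byte when walking PERF_RECORD_AUX records. The mask value
      (bits 15:8) follows the "allocate a byte" description above but is
      defined locally here as an assumption, and the two decode helpers
      are hypothetical stand-ins for OpenCSD setup:

        #include <stdint.h>
        #include <stdio.h>

        /* Assumption: bits 15:8 of PERF_RECORD_AUX::flags carry the
         * PMU-specific format type; 0 means the default (formatted) data. */
        #define AUX_PMU_FORMAT_TYPE_MASK 0xff00ULL

        static void decode_formatted(uint64_t off, uint64_t len)   /* hypothetical */
        { printf("formatted AUX: off=%llu len=%llu\n",
                 (unsigned long long)off, (unsigned long long)len); }

        static void decode_raw(uint64_t off, uint64_t len)         /* hypothetical */
        { printf("raw AUX: off=%llu len=%llu\n",
                 (unsigned long long)off, (unsigned long long)len); }

        /* aux_offset/aux_size/flags come from the PERF_RECORD_AUX body. */
        static void handle_aux(uint64_t aux_offset, uint64_t aux_size, uint64_t flags)
        {
                if (flags & AUX_PMU_FORMAT_TYPE_MASK)
                        decode_raw(aux_offset, aux_size);       /* e.g. TRBE    */
                else
                        decode_formatted(aux_offset, aux_size); /* e.g. TMC-ETR */
        }
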
  2. 01 Feb 2021, 2 commits
    • perf/x86/intel: Add perf core PMU support for Sapphire Rapids · 61b985e3
      Authored by Kan Liang
      Add perf core PMU support for the Intel Sapphire Rapids server, which is
      the successor of the Intel Ice Lake server. The enabling code is based
      on Ice Lake, but there are several new features introduced.
      
      The event encoding is changed and simplified; e.g., the event codes
      below 0x90 are restricted to counters 0-3, while the event codes
      above 0x90 are likely to have no restrictions. The event
      constraints, extra_regs(), and hardware cache events table are
      changed accordingly.
      
      A new Precise Distribution (PDist) facility is introduced, which
      further minimizes the skid when a precise event is programmed on GP
      counter 0. Enable the Precise Distribution (PDist) facility with the
      :ppp event qualifier. For this facility to work, the period must be
      initialized with a value larger than 127. Add spr_limit_period() to
      apply the limit for :ppp events.
      
      Two new data source fields, data block & address block, are added in
      the PEBS Memory Info Record for the load latency event. To enable the
      feature:
      - An auxiliary event has to be enabled together with the load latency
        event on Sapphire Rapids. A new flag PMU_FL_MEM_LOADS_AUX is
        introduced to indicate the case. A new event, mem-loads-aux, is
        exposed to sysfs for the user tool.
        Add a check in hw_config(). If the auxiliary event is not detected,
        return a unique error, -ENODATA.
      - The union perf_mem_data_src is extended to support the new fields.
      - Ice Lake and earlier models do not support block information, but
        the fields may be set by HW on some machines. Add pebs_no_block to
        explicitly indicate the previous platforms which don't support the
        new block fields. Accesses to the new block fields are ignored on
        those platforms.
      
      A new Store Latency facility is introduced, which leverages the PEBS
      facility to provide additional information about sampled stores. The
      additional information includes the data address, memory auxiliary
      info (e.g. data source, STLB miss) and the latency of the store
      access. To enable the facility, the new event (0x02cd) has to be
      programmed on GP counter 0. A new flag PERF_X86_EVENT_PEBS_STLAT is
      introduced to indicate the event. store_latency_data() is introduced
      to parse the memory auxiliary info.
      
      The layout of the access latency field of the PEBS Memory Info Record
      has been changed. Two latencies, the instruction latency (bits 15:0)
      and the cache access latency (bits 47:32), are recorded.
      - The cache access latency is similar to the previous memory access
        latency. For loads, the latency starts at the actual cache access
        and lasts until the data is returned by the memory subsystem.
        For stores, the latency starts when the demand write accesses the L1
        data cache and lasts until the cacheline write is completed in the
        memory subsystem.
        The cache access latency is stored in the low 32 bits of the sample
        type PERF_SAMPLE_WEIGHT_STRUCT.
      - The instruction latency starts at the dispatch of the load operation
        for execution and lasts until completion of the instruction it
        belongs to.
        Add a new flag PMU_FL_INSTR_LATENCY to indicate instruction latency
        support. The instruction latency is stored in bits 47:32 of the
        sample type PERF_SAMPLE_WEIGHT_STRUCT.
      
      The PERF_METRICS MSR is extended to feature TMA method level 2
      metrics. The lower half of the register holds the TMA level 1 metrics
      (legacy). The upper half is also divided into four 8-bit fields for
      the new level 2 metrics. Expose all eight Topdown metrics events to
      user space.
      
      The full description for the SPR features can be found at Intel
      Architecture Instruction Set Extensions and Future Features
      Programming Reference, 319433-041.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1611873611-156687-5-git-send-email-kan.liang@linux.intel.com
      61b985e3
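
      A small sketch of how a consumer could split the two Sapphire Rapids
      latencies out of the 64-bit weight described above; only the bit
      positions come from the text, the function and variable names are
      illustrative:

        #include <stdint.h>
        #include <stdio.h>

        /* Cache access latency: low 32 bits; instruction latency: bits 47:32. */
        static void split_spr_weight(uint64_t weight)
        {
                uint32_t cache_lat = (uint32_t)(weight & 0xffffffffULL);
                uint16_t instr_lat = (uint16_t)((weight >> 32) & 0xffff);

                printf("cache access: %u cycles, instruction: %u cycles\n",
                       cache_lat, instr_lat);
        }
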
    • perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT · 2a6c6b7d
      Authored by Kan Liang
      The current PERF_SAMPLE_WEIGHT sample type is very useful for
      expressing the cost of an action represented by the sample. This
      allows the profiler to scale the samples to be more informative to
      the programmer. It can also help to locate a hotspot; e.g., when
      profiling by memory latencies, the expensive loads appear higher up
      in the histograms. But the current PERF_SAMPLE_WEIGHT sample type is
      determined by a single factor. This can be a problem if users want
      two or more factors to contribute to the weight. For example, the
      Golden Cove core PMU can provide both the instruction latency and
      the cache latency as factors for memory profiling.
      
      For current X86 platforms, although meminfo::latency is defined as a
      u64, only the lower 32 bits contain valid data in practice (no
      memory access could last longer than 4G cycles). The higher 32 bits
      can be used to store new factors.
      
      Add a new sample type, PERF_SAMPLE_WEIGHT_STRUCT, to indicate the new
      sample weight structure. It shares the same space as the
      PERF_SAMPLE_WEIGHT sample type.
      
      Users can apply either the PERF_SAMPLE_WEIGHT sample type or the
      PERF_SAMPLE_WEIGHT_STRUCT sample type to retrieve the sample weight, but
      they cannot apply both sample types simultaneously.
      
      Currently, only X86 and PowerPC use the PERF_SAMPLE_WEIGHT sample
      type.
      - For PowerPC, nothing changes for the PERF_SAMPLE_WEIGHT sample
        type, and there is no effect for the new PERF_SAMPLE_WEIGHT_STRUCT
        sample type. PowerPC can restructure the weight field similarly
        later.
      - For X86, the same value will be dumped for the PERF_SAMPLE_WEIGHT
        sample type and the PERF_SAMPLE_WEIGHT_STRUCT sample type for now.
        The following patches will apply the new factors for the
        PERF_SAMPLE_WEIGHT_STRUCT sample type.
      
      The field in the union perf_sample_weight should be shared among
      different architectures. A generic name is required, but it's hard
      to come up with a name that applies to all architectures. For
      example, on X86 the fields store various kinds of latency, while on
      PowerPC they store MMCRA[TECX/TECM], which is not a latency. So a
      general name prefix, 'var$NUM', is used here.
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1611873611-156687-2-git-send-email-kan.liang@linux.intel.com
      2a6c6b7d
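
      A condensed, little-endian-only sketch of the shared weight layout
      and how a sample parser might view it; it mirrors the var1/var2/var3
      naming from the text rather than quoting the kernel header verbatim:

        #include <stdint.h>
        #include <stdio.h>

        union sample_weight {
                uint64_t full;              /* legacy PERF_SAMPLE_WEIGHT value  */
                struct {                    /* PERF_SAMPLE_WEIGHT_STRUCT fields */
                        uint32_t var1_dw;
                        uint16_t var2_w;
                        uint16_t var3_w;
                };
        };

        static void parse_weight(uint64_t raw)
        {
                union sample_weight w = { .full = raw };

                /* Same 64 bits either way; which view is valid depends on whether
                 * the event asked for PERF_SAMPLE_WEIGHT or
                 * PERF_SAMPLE_WEIGHT_STRUCT (the two are mutually exclusive). */
                printf("full=%llu var1=%u var2=%u var3=%u\n",
                       (unsigned long long)w.full, w.var1_dw, w.var2_w, w.var3_w);
        }
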
  3. 15 Jan 2021, 1 commit
  4. 29 Oct 2020, 2 commits
    • perf/core: Add support for PERF_SAMPLE_CODE_PAGE_SIZE · 995f088e
      Authored by Stephane Eranian
      When studying code layout, it is useful to capture the page size of the
      sampled code address.
      
      Add a new sample type for code page size.
      The new sample type requires collecting the ip. The code page size can
      be calculated from the NMI-safe perf_get_page_size().
      
      For large PEBS, it's very unlikely that the mapping is gone for the
      earlier PEBS records, so enable the feature for large PEBS as well.
      The worst case is that a page size of 0 is returned.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201001135749.2804-5-kan.liang@linux.intel.com
      995f088e
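
      A minimal sketch of requesting the new sample type; since it needs
      the ip, PERF_SAMPLE_IP is set alongside it (the event and period
      chosen here are arbitrary):

        #include <linux/perf_event.h>
        #include <string.h>

        static void request_code_page_size(struct perf_event_attr *attr)
        {
                memset(attr, 0, sizeof(*attr));
                attr->size          = sizeof(*attr);
                attr->type          = PERF_TYPE_HARDWARE;
                attr->config        = PERF_COUNT_HW_CPU_CYCLES;
                attr->sample_period = 100000;
                attr->sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE;
        }
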
    • perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE · 8d97e718
      Authored by Kan Liang
      Current perf can report both virtual addresses and physical addresses,
      but not the MMU page size. Without the MMU page size information of the
      utilized page, users cannot decide whether to promote/demote large pages
      to optimize memory usage.
      
      Add a new sample type for the data MMU page size.
      
      Current perf already has a facility to collect data virtual
      addresses. A page walker is required to walk the page tables and
      calculate the MMU page size from a given virtual address.
      
      On some platforms, e.g., X86, the page walker is invoked in an NMI
      handler. So the page walker must be NMI-safe and low overhead.
      Besides, the page walker should work for both user and kernel
      virtual addresses. The existing generic page walker, e.g.,
      walk_page_range_novma(), is a little bit complex and doesn't
      guarantee NMI safety. follow_page() only works for user virtual
      addresses.
      
      Add a new function perf_get_page_size() to walk the page tables and
      calculate the MMU page size. In the function:
      - Interrupts have to be disabled to prevent any teardown of the page
        tables.
      - For user space threads, the current->mm is used for the page walker.
        For kernel threads and the like, the current->mm is NULL. The init_mm
        is used for the page walker. The active_mm is not used here, because
        it can be NULL.
        Quote from Peter Zijlstra,
        "context_switch() can set prev->active_mm to NULL when it transfers it
         to @next. It does this before @current is updated. So an NMI that
         comes in between this active_mm swizzling and updating @current will
         see !active_mm."
      - The MMU page size is calculated from the page table level.
      
      The method should work for all architectures, but it has only been
      verified on X86. Should there be some architectures, which support perf,
      where the method doesn't work, it can be fixed later separately.
      Reporting the wrong page size would not be fatal for the architecture.
      
      Some under discussion features may impact the method in the future.
      Quote from Dave Hansen,
        "There are lots of weird things folks are trying to do with the page
         tables, like Address Space Isolation.  For instance, if you get a
         perf NMI when running userspace, current->mm->pgd is *different* than
         the PGD that was in use when userspace was running. It's close enough
         today, but it might not stay that way."
      If the case happens later, lots of consecutive page walk errors will
      happen. The worst case is that lots of page-size '0' are returned, which
      would not be fatal.
      In the perf tool, a check is implemented to detect this case. Once it
      happens, a kernel patch could be implemented accordingly then.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201001135749.2804-2-kan.liang@linux.intel.com
      8d97e718
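
      For illustration only, a sketch of the "page size from page table
      level" idea with 4K base pages on x86-64; this is not the kernel's
      perf_get_page_size(), just the level-to-size mapping it relies on:

        #include <stdint.h>

        enum walk_level { WALK_PTE, WALK_PMD, WALK_PUD, WALK_FAILED };

        static uint64_t level_to_page_size(enum walk_level level)
        {
                switch (level) {
                case WALK_PTE: return 4ULL << 10;   /* 4 KiB leaf */
                case WALK_PMD: return 2ULL << 20;   /* 2 MiB leaf */
                case WALK_PUD: return 1ULL << 30;   /* 1 GiB leaf */
                default:       return 0;            /* failed walk: report 0 */
                }
        }
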
  5. 20 Oct 2020, 1 commit
  6. 20 Jul 2020, 1 commit
  7. 15 Jun 2020, 3 commits
  8. 27 Mar 2020, 2 commits
  9. 11 Feb 2020, 1 commit
    • perf/core: Add new branch sample type for HW index of raw branch records · bbfd5e4f
      Authored by Kan Liang
      The low level index is the index into the underlying hardware buffer
      of the most recently captured taken branch, which is always saved in
      branch_entries[0]. It is very useful for reconstructing the call
      stack. For example, in Intel LBR call stack mode, the depth of the
      reconstructed LBR call stack is limited to the number of LBR
      registers. With the low level index information, the perf tool may
      stitch the stacks of two samples. The reconstructed LBR call stack
      can then break the HW limitation.
      
      Add a new branch sample type to retrieve the low level index of raw
      branch records. The low level index is between -1 (unknown) and the
      maximum depth, which can be retrieved from
      /sys/devices/cpu/caps/branches.
      
      The low level index information is dumped into the
      PERF_SAMPLE_BRANCH_STACK output only when the new branch sample type
      is set. The perf tool should check attr.branch_sample_type and apply
      the corresponding format for PERF_SAMPLE_BRANCH_STACK samples.
      Otherwise, some use cases may break. For example, users may parse a
      perf.data file which includes the new branch sample type with an old
      version of the perf tool (without the check). Users would probably
      get incorrect information without any warning.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20200127165355.27495-2-kan.liang@linux.intel.com
      bbfd5e4f
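
      A sketch of the reader-side layout change: with
      PERF_SAMPLE_BRANCH_HW_INDEX set, a u64 index sits between the entry
      count and the entries. The helper below is illustrative, not the
      perf tool's actual parser:

        #include <linux/perf_event.h>
        #include <stdint.h>

        static const void *parse_branch_stack(const void *p, uint64_t branch_sample_type)
        {
                const uint64_t *q = p;
                uint64_t nr = *q++;
                int64_t hw_idx = -1;            /* -1 means "unknown" */

                if (branch_sample_type & PERF_SAMPLE_BRANCH_HW_INDEX)
                        hw_idx = (int64_t)*q++; /* field added by this change */

                const struct perf_branch_entry *ent = (const void *)q;
                /* ent[0] is the most recently taken branch; hw_idx is its
                 * position in the underlying hardware buffer. */
                (void)hw_idx;
                return ent + nr;
        }
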
  10. 13 Nov 2019, 1 commit
    • perf/aux: Allow using AUX data in perf samples · a4faf00d
      Authored by Alexander Shishkin
      AUX data can be used to annotate perf events such as performance
      counters or tracepoints/breakpoints by including it in sample
      records when the PERF_SAMPLE_AUX flag is set. Such samples are
      instrumental in debugging and profiling by providing, for example, a
      history of the instruction flow leading up to the event's overflow.
      
      The implementation makes use of grouping an AUX event with all the
      events that wish to take samples of the AUX data, such that the
      former is the group leader. The sampling events should also specify
      the desired size of the AUX sample via attr.aux_sample_size.
      
      AUX capable PMUs need to explicitly add support for sampling, because it
      relies on a new callback to take a snapshot of the buffer without touching
      the event states.
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: adrian.hunter@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: https://lkml.kernel.org/r/20191025140835.53665-2-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a4faf00d
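
      A sketch of the grouping described above: an AUX-capable leader plus
      a sampled event that requests an AUX snapshot per sample. The
      aux_pmu_type argument is a placeholder; real code would read it from
      /sys/bus/event_source/devices/<pmu>/type:

        #include <linux/perf_event.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static int open_aux_sampling_pair(int cpu, unsigned int aux_pmu_type)
        {
                struct perf_event_attr aux = { 0 }, ev = { 0 };
                int leader;

                aux.size = sizeof(aux);
                aux.type = aux_pmu_type;        /* the AUX-capable PMU (group leader) */
                leader = syscall(__NR_perf_event_open, &aux, -1, cpu, -1, 0);
                if (leader < 0)
                        return -1;

                ev.size            = sizeof(ev);
                ev.type            = PERF_TYPE_HARDWARE;
                ev.config          = PERF_COUNT_HW_CPU_CYCLES;
                ev.sample_period   = 100000;
                ev.sample_type     = PERF_SAMPLE_IP | PERF_SAMPLE_AUX;
                ev.aux_sample_size = 4096;      /* bytes of AUX data per sample */

                return syscall(__NR_perf_event_open, &ev, -1, cpu, leader, 0);
        }
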
  11. 28 Aug 2019, 1 commit
  12. 22 Jan 2019, 2 commits
  13. 21 Jan 2019, 1 commit
    • perf/core: Remove unused perf_flags · ad07c8ce
      Authored by Andrew Murray
      Now that perf_flags is not used, remove it.
      Signed-off-by: Andrew Murray <andrew.murray@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sascha Hauer <s.hauer@pengutronix.de>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: robin.murphy@arm.com
      Cc: suzuki.poulose@arm.com
      Link: https://lkml.kernel.org/r/1547128414-50693-13-git-send-email-andrew.murray@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ad07c8ce
  14. 31 Oct 2018, 1 commit
  15. 10 Sep 2018, 1 commit
  16. 25 Jul 2018, 1 commit
    • perf/x86/intel: Fix unwind errors from PEBS entries (mk-II) · 6cbc304f
      Authored by Peter Zijlstra
      Vince reported the perf_fuzzer giving various unwinder warnings and
      Josh reported:
      
      > Deja vu.  Most of these are related to perf PEBS, similar to the
      > following issue:
      >
      >   b8000586 ("perf/x86/intel: Cure bogus unwind from PEBS entries")
      >
      > This is basically the ORC version of that.  setup_pebs_sample_data() is
      > assembling a franken-pt_regs which ORC isn't happy about.  RIP is
      > inconsistent with some of the other registers (like RSP and RBP).
      
      And where the previous unwinder only needed BP and SP, ORC also
      requires IP. But we cannot spoof IP because then the sample will get
      displaced, entirely negating the point of PEBS.
      
      So cure the whole thing differently by doing the unwind early; this
      does however require a means to communicate that we did the unwind
      early. We (ab)use an unused sample_type bit for this, which we set
      on events that fill out data->callchain before the normal
      perf_prepare_sample().
      Debugged-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Tested-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Tested-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6cbc304f
  17. 17 Apr 2018, 1 commit
  18. 13 Mar 2018, 1 commit
    • perf/core: Implement fast breakpoint modification via _IOC_MODIFY_ATTRIBUTES · 32ff77e8
      Authored by Milind Chabbi
      Problem and motivation: Once a breakpoint perf event
      (PERF_TYPE_BREAKPOINT) is created, there is no flexibility to change
      the breakpoint type (bp_type), breakpoint address (bp_addr), or
      breakpoint length (bp_len). The only option is to close the perf
      event and configure a new breakpoint event. This inflexibility has a
      significant performance overhead. For example, sampling-based,
      lightweight performance profilers (and also concurrency bug
      detection tools) monitor different addresses for a short duration
      using PERF_TYPE_BREAKPOINT and change the address (bp_addr) to
      another address, change the kind of breakpoint (bp_type) from
      "write" to "read" or vice-versa, or change the length (bp_len) of
      the address being monitored. The cost of these modifications is
      prohibitive since it involves unmapping the circular buffer
      associated with the perf event, closing the perf event, opening
      another perf event and mmapping another circular buffer.
      
      Solution: The new ioctl flag for perf events,
      PERF_EVENT_IOC_MODIFY_ATTRIBUTES, introduced in this patch, takes a
      pointer to a struct perf_event_attr as an argument to update an old
      breakpoint event with a new address, type, and size. This facility
      allows retaining the previously mmapped perf event ring buffer and
      avoids having to close and reopen another perf event.
      
      This patch supports only changing the PERF_TYPE_BREAKPOINT event
      type; future implementations can extend this feature. The patch
      replicates some of the functionality of modify_user_hw_breakpoint()
      in kernel/events/hw_breakpoint.c. modify_user_hw_breakpoint() cannot
      be called directly since perf_event_ctx_lock() is already held in
      _perf_ioctl().
      
      Evidence: Experiments show that the baseline (not being able to
      modify an already created breakpoint) costs an order of magnitude
      (~10x) more than the suggested optimization (having the ability to
      dynamically modify a configured breakpoint via ioctl). When the
      breakpoints typically do not trap, the speedup due to the suggested
      optimization is ~10x; even when the breakpoints always trap, the
      speedup is ~4x.
      
      Testing: tests posted at
      https://github.com/linux-contrib/perf_event_modify_bp demonstrate the
      performance significance of this patch. Tests also check the functional
      correctness of the patch.
      Signed-off-by: Milind Chabbi <chabbi.milind@gmail.com>
      [ Using modify_user_hw_breakpoint_check function. ]
      [ Reformatted PERF_EVENT_IOC_*, so the values are all in one column. ]
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Cc: Jin Yao <yao.jin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Oleg Nesterov <onestero@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: http://lkml.kernel.org/r/20180312134548.31532-8-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      32ff77e8
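
      A sketch of retargeting an existing PERF_TYPE_BREAKPOINT event with
      the new ioctl instead of closing and reopening it; the address,
      length and type choices are illustrative:

        #include <linux/perf_event.h>
        #include <linux/hw_breakpoint.h>
        #include <sys/ioctl.h>
        #include <string.h>
        #include <stdint.h>

        static int retarget_breakpoint(int bp_fd, uint64_t new_addr, int watch_reads)
        {
                struct perf_event_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.size    = sizeof(attr);
                attr.type    = PERF_TYPE_BREAKPOINT;
                attr.bp_type = watch_reads ? HW_BREAKPOINT_RW : HW_BREAKPOINT_W;
                attr.bp_addr = new_addr;
                attr.bp_len  = HW_BREAKPOINT_LEN_4;

                /* Re-arms the same event; the mmapped ring buffer stays valid. */
                return ioctl(bp_fd, PERF_EVENT_IOC_MODIFY_ATTRIBUTES, &attr);
        }
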
  19. 06 Feb 2018, 1 commit
  20. 08 Jan 2018, 2 commits
  21. 13 Dec 2017, 1 commit
    • bpf/tracing: allow user space to query prog array on the same tp · f371b304
      Authored by Yonghong Song
      Commit e87c6bc3 ("bpf: permit multiple bpf attachments
      for a single perf event") added support for attaching multiple
      bpf programs to a single perf event.
      Although this provides flexibility, users may want to know
      what other bpf programs are attached to the same tp interface.
      Besides getting visibility into the underlying bpf system,
      such information may also help consolidate multiple bpf programs,
      understand potential performance issues due to a large array,
      and debug (e.g., one bpf program which overwrites the return code
      may impact subsequent program results).
      
      Commit 2541517c ("tracing, perf: Implement BPF programs
      attached to kprobes") utilized the existing perf ioctl
      interface and added the command PERF_EVENT_IOC_SET_BPF
      to attach a bpf program to a tracepoint. This patch adds a new
      ioctl command, given a perf event fd, to query the bpf program
      array attached to the same perf tracepoint event.
      
      The new uapi ioctl command:
        PERF_EVENT_IOC_QUERY_BPF
      
      The new uapi/linux/perf_event.h structure:
        struct perf_event_query_bpf {
             __u32	ids_len;
             __u32	prog_cnt;
             __u32	ids[0];
        };
      
      User space provides buffer "ids" for kernel to copy to.
      When returning from the kernel, the number of available
      programs in the array is set in "prog_cnt".
      
      The usage:
        struct perf_event_query_bpf *query =
          malloc(sizeof(*query) + sizeof(__u32) * ids_len);
        query->ids_len = ids_len;
        err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
        if (err == 0) {
          /* query->prog_cnt is the number of available progs,
           * number of progs in ids: (ids_len == 0) ? 0 : query->prog_cnt
           */
        } else if (errno == ENOSPC) {
          /* query->ids_len number of progs copied,
           * query->prog_cnt is the number of available progs
           */
        } else {
            /* other errors */
        }
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f371b304
  22. 02 Nov 2017, 1 commit
    • License cleanup: add SPDX license identifier to uapi header files with a license · e2be04c7
      Authored by Greg Kroah-Hartman
      Many user space API headers have licensing information, which is either
      incomplete, badly formatted or just a shorthand for referring to the
      license under which the file is supposed to be.  This makes it hard for
      compliance tools to determine the correct license.
      
      Update these files with an SPDX license identifier.  The identifier was
      chosen based on the license information in the file.
      
      GPL/LGPL licensed headers get the matching GPL/LGPL SPDX license
      identifier with the added 'WITH Linux-syscall-note' exception, which is
      the officially assigned exception identifier for the kernel syscall
      exception:
      
         NOTE! This copyright does *not* cover user programs that use kernel
         services by normal system calls - this is merely considered normal use
         of the kernel, and does *not* fall under the heading of "derived work".
      
      This exception makes it possible to include GPL headers into non GPL
      code, without confusing license compliance tools.
      
      Headers which have either explicit dual licensing or are just licensed
      under a non GPL license are updated with the corresponding SPDX
      identifier and the GPLv2 with syscall exception identifier.  The format
      is:
              ((GPL-2.0 WITH Linux-syscall-note) OR SPDX-ID-OF-OTHER-LICENSE)
      
      SPDX license identifiers are a legally binding shorthand, which can be
      used instead of the full boiler plate text.  The update does not remove
      existing license information as this has to be done on a case by case
      basis and the copyright holders might have to be consulted. This will
      happen in a separate step.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.  See the previous patch in this series for the
      methodology of how this patch was researched.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e2be04c7
  23. 18 Oct 2017, 1 commit
  24. 29 Aug 2017, 1 commit
    • perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR · fc7ce9c7
      Authored by Kan Liang
      For understanding how the workload maps to memory channels and hardware
      behavior, it's very important to collect address maps with physical
      addresses. For example, 3D XPoint access can only be found by filtering
      the physical address.
      
      Add a new sample type for physical address.
      
      perf already has a facility to collect data virtual addresses. This
      patch introduces a function to convert a virtual address to a
      physical address.
      The function is quite generic and can be extended to any architecture as
      long as a virtual address is provided.
      
       - For kernel direct mapping addresses, virt_to_phys is used to convert
         the virtual addresses to physical address.
      
       - For user virtual addresses, __get_user_pages_fast is used to walk
         the page tables for the user physical address.
      
       - This does not work for vmalloc addresses right now. These are not
         resolved, but code to do that could be added.
      
      The new sample type requires collecting the virtual address. The
      virtual address will not be output unless SAMPLE_ADDR is applied.
      
      For security, the physical address can only be exposed to root or
      privileged user.
      Tested-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: mpe@ellerman.id.au
      Link: http://lkml.kernel.org/r/1503967969-48278-1-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fc7ce9c7
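
      A sketch of requesting the new sample type; the event and period here
      are illustrative, the relevant parts are PERF_SAMPLE_ADDR plus
      PERF_SAMPLE_PHYS_ADDR and the privilege note above:

        #include <linux/perf_event.h>
        #include <sys/syscall.h>
        #include <sys/types.h>
        #include <unistd.h>
        #include <string.h>

        static int open_phys_addr_event(pid_t pid)
        {
                struct perf_event_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.size          = sizeof(attr);
                attr.type          = PERF_TYPE_HARDWARE;
                attr.config        = PERF_COUNT_HW_CACHE_MISSES;
                attr.sample_period = 10000;
                attr.precise_ip    = 2;
                attr.sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
                                     PERF_SAMPLE_PHYS_ADDR;

                /* Physical addresses are only exposed to root/privileged users. */
                return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
        }
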
  25. 25 Aug 2017, 1 commit
    • perf/x86: Fix data source decoding for Skylake · 6ae5fa61
      Authored by Andi Kleen
      Skylake changed the encoding of the PEBS data source field.
      Some combinations are not available anymore, but some new cases,
      e.g. for L4 cache hits, are added.
      
      Fix up the conversion table for Skylake, similar to what was done
      for Nehalem.
      
      On Skylake Server the encoding for L4 actually means persistent
      memory. Handle this case too.
      
      To properly describe it in the abstracted perf format I had to add
      some new fields. Since a hit can have only one level, add a new
      field that is an enumeration, not a bit field, to describe the
      level. It can describe any level. Some numbers are also used to
      describe PMEM and LFB.
      
      Also add a new generic remote flag that can be combined with
      the generic level to signify a remote cache.
      
      And there is an extension field for the snoop indication to handle
      the Forward state.
      
      I didn't add a generic flag for hops because it's not needed
      for Skylake.
      
      I changed the existing encodings for older CPUs to also fill in the
      new level and remote fields.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: http://lkml.kernel.org/r/20170816222156.19953-3-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6ae5fa61
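
      A sketch of inspecting the new level/remote/snoop-extension fields
      via the perf_mem_data_src union bitfields, assuming the uapi names
      PERF_MEM_LVLNUM_* and PERF_MEM_SNOOPX_FWD for the constants
      described above:

        #include <linux/perf_event.h>
        #include <stdio.h>

        static void print_mem_level(union perf_mem_data_src src)
        {
                if (src.mem_lvl_num == PERF_MEM_LVLNUM_PMEM)
                        printf("hit persistent memory\n");
                else if (src.mem_lvl_num == PERF_MEM_LVLNUM_LFB)
                        printf("hit line fill buffer\n");
                else
                        printf("hit level %u%s\n", (unsigned)src.mem_lvl_num,
                               src.mem_remote ? " (remote)" : "");

                if (src.mem_snoopx & PERF_MEM_SNOOPX_FWD)
                        printf("snoop: forward\n");
        }
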
  26. 19 Jul 2017, 1 commit
    • perf/core: Define the common branch type classification · eb0baf8a
      Authored by Jin Yao
      It is often useful to know the branch types while analyzing branch data.
      For example, a call is very different from a conditional branch.
      
      Currently we have to look it up in the binary, but the binary may
      not be available later, and even when it is available the user has
      to spend some time on it. It is very useful for the user to be able
      to check it directly in perf report.
      
      Perf already has support for disassembling the branch instruction to get
      the x86 branch type.
      
      To keep consistent on kernel and userspace and make the classification
      more common, the patch adds the common branch type classification
      in perf_event.h.
      
      The patch only defines a minimum but most common set of branch types.
      
      PERF_BR_UNKNOWN         : unknown
      PERF_BR_COND            : conditional
      PERF_BR_UNCOND          : unconditional
      PERF_BR_IND             : indirect
      PERF_BR_CALL            : function call
      PERF_BR_IND_CALL        : indirect function call
      PERF_BR_RET             : function return
      PERF_BR_SYSCALL         : syscall
      PERF_BR_SYSRET          : syscall return
      PERF_BR_COND_CALL       : conditional function call
      PERF_BR_COND_RET        : conditional function return
      
      The patch also adds a new field type (4 bits) in perf_branch_entry
      to record the branch type.
      
      Since the disassembling of branch instruction needs some overhead,
      a new PERF_SAMPLE_BRANCH_TYPE_SAVE is introduced to indicate if it
      needs to disassemble the branch instruction and record the branch
      type.
      
      Change log:
      
      v10: Not changed.
      
      v9: Not changed.
      
      v8: Change PERF_BR_NONE to PERF_BR_UNKNOWN.
          No other change.
      
      v7: Just keep the most common branch types.
          Others are removed.
      
      v6: Not changed.
      
      v5: Not changed. The v5 patch series just change the userspace.
      
      v4: Comparing to previous version, the major changes are:
      
      1. Remove the PERF_BR_JCC_FWD/PERF_BR_JCC_BWD, they will be
         computed later in userspace.
      
      2. Remove the "cross" field in perf_branch_entry. The cross page
         computing will be done later in userspace.
      Signed-off-by: Yao Jin <yao.jin@linux.intel.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Link: http://lkml.kernel.org/r/1500379995-6449-2-git-send-email-yao.jin@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      eb0baf8a
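
      A sketch of consuming the classification: the event requests
      PERF_SAMPLE_BRANCH_TYPE_SAVE in attr.branch_sample_type, and the
      reader maps each entry's type field to a label:

        #include <linux/perf_event.h>

        static const char *branch_type_name(const struct perf_branch_entry *e)
        {
                switch (e->type) {
                case PERF_BR_COND:      return "conditional";
                case PERF_BR_UNCOND:    return "unconditional";
                case PERF_BR_CALL:      return "call";
                case PERF_BR_IND_CALL:  return "indirect call";
                case PERF_BR_RET:       return "return";
                case PERF_BR_SYSCALL:   return "syscall";
                case PERF_BR_SYSRET:    return "syscall return";
                default:                return "unknown";
                }
        }
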
  27. 19 Apr 2017, 1 commit
    • powerpc/perf: Define big-endian version of perf_mem_data_src · 8c5073db
      Authored by Sukadev Bhattiprolu
      perf_mem_data_src is a union that is initialized in the kernel via the ->val
      field and accessed by userspace via the mem_xxx bitfields. For this to work
      correctly on big endian platforms, we need a big-endian definition for the
      bitfields.
      
      Currently on a big endian system, if a user requests PERF_SAMPLE_DATA_SRC (perf
      report -d), they will get the default value from perf_sample_data_init(), which
      is PERF_MEM_NA. The value for PERF_MEM_NA is constructed using shifts:
      
        /* TLB access */
        #define PERF_MEM_TLB_NA		0x01 /* not available */
        ...
        #define PERF_MEM_TLB_SHIFT	26
      
        #define PERF_MEM_S(a, s) \
      	(((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
      
        #define PERF_MEM_NA (PERF_MEM_S(OP, NA)   |\
      		    PERF_MEM_S(LVL, NA)   |\
      		    PERF_MEM_S(SNOOP, NA) |\
      		    PERF_MEM_S(LOCK, NA)  |\
      		    PERF_MEM_S(TLB, NA))
      
      Which works out as:
      
        ((0x01 << 0) | (0x01 << 5) | (0x01 << 19) | (0x01 << 24) | (0x01 << 26))
      
      Which means the PERF_MEM_NA value comes out of the kernel as 0x5080021
      in CPU endian.
      
      But then in the perf tool, the code uses the bitfields to inspect the value, and
      currently the bitfields are defined using little endian ordering.
      
      So eg. in perf_mem__tlb_scnprintf() we see:
        data_src->val = 0x5080021
                   op = 0x0
                  lvl = 0x0
                snoop = 0x0
                 lock = 0x0
                 dtlb = 0x0
                 rsvd = 0x5080021
      
      Because of the way the perf tool code is written this is still displayed to the
      user as "N/A", so there is no bug visible at the UI level.
      
      Currently there are no big endian architectures which export a meaningful
      value (ie. other than PERF_MEM_NA), so the extent of the bug on big endian
      platforms is that the PERF_MEM_NA value is exported incorrectly as described
      above. Subsequent patches will add support on big endian powerpc for populating
      the data source value.
      
      This patch does a minimal fix of adding big endian definition of the bitfields
      to match the values that are already exported by the kernel on big endian. And
      it makes no change on little endian.
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      8c5073db
  28. 16 Mar 2017, 1 commit
  29. 14 Mar 2017, 1 commit
    • perf: Add PERF_RECORD_NAMESPACES to include namespaces related info · e4222673
      Authored by Hari Bathini
      With the advent of container technologies like docker, which depend
      on namespaces for isolation, there is a need for tracing support for
      namespaces. This patch introduces a new PERF_RECORD_NAMESPACES event
      for recording namespace-related info. By recording info for every
      namespace, it is left to userspace to decide on the definition of a
      container and trace containers by updating the perf tool accordingly.
      
      Each namespace has a combination of device and inode numbers. Though
      every namespace has the same device number currently, that may change in
      future to avoid the need for a namespace of namespaces. Considering such
      possibility, record both device and inode numbers separately for each
      namespace.
      Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      e4222673
  30. 30 May 2016, 1 commit
    • perf core: Per event callchain limit · 97c79a38
      Authored by Arnaldo Carvalho de Melo
      In addition to being able to control the system-wide maximum depth
      via /proc/sys/kernel/perf_event_max_stack, we are now able to ask
      for different depths per event, using
      perf_event_attr.sample_max_stack for that.
      
      This uses a u16 hole at the end of perf_event_attr: when
      perf_event_attr.sample_type has PERF_SAMPLE_CALLCHAIN, a
      sample_max_stack of zero means use perf_event_max_stack; otherwise
      it'll be bounds-checked under callchain_mutex.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      97c79a38
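
      A minimal sketch of capping the callchain depth per event instead of
      relying on the global sysctl; the event and depth here are arbitrary:

        #include <linux/perf_event.h>
        #include <string.h>

        static void request_capped_callchain(struct perf_event_attr *attr)
        {
                memset(attr, 0, sizeof(*attr));
                attr->size             = sizeof(*attr);
                attr->type             = PERF_TYPE_HARDWARE;
                attr->config           = PERF_COUNT_HW_CPU_CYCLES;
                attr->sample_period    = 100000;
                attr->sample_type      = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
                attr->sample_max_stack = 64;   /* 0 would mean "use the sysctl" */
        }
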
  31. 17 May 2016, 1 commit
    • perf core: Separate accounting of contexts and real addresses in a stack trace · c85b0334
      Authored by Arnaldo Carvalho de Melo
      The perf_sample->ip_callchain->nr value includes all the entries in
      the ip_callchain->ip[] array, both real addresses and
      PERF_CONTEXT_{KERNEL,USER,etc}, while what the user expects is that
      the kernel.perf_event_max_stack sysctl, or the upcoming per-event
      perf_event_attr.sample_max_stack knob, be honoured in terms of IP
      addresses in the stack trace.
      
      So allocate a bunch of extra entries for contexts, and do the accounting
      via perf_callchain_entry_ctx struct members.
      
      A new sysctl, kernel.perf_event_max_contexts_per_stack is also
      introduced for investigating possible bugs in the callchain
      implementation by some arch.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      c85b0334
  32. 23 Apr 2016, 1 commit
    • perf/core: Add ::write_backward attribute to perf event · 9ecda41a
      Authored by Wang Nan
      This patch introduces a 'write_backward' bit to perf_event_attr,
      which controls the direction of a ring buffer. When set, the
      corresponding ring buffer is written from end to beginning. This
      feature is designed to support reading from an overwritable ring
      buffer.
      
      A ring buffer can be created by mapping a perf event fd. The kernel
      puts event records into the ring buffer, and user tooling like perf
      fetches them from the address returned by mmap(). To prevent racing
      between kernel and tooling, they communicate with each other through
      the 'head' and 'tail' pointers. The kernel maintains the 'head'
      pointer, pointing it at the next free area (the tail of the last
      record). Tooling maintains the 'tail' pointer, pointing it at the
      tail of the last consumed record (a record that has already been
      fetched). The kernel determines the available space in a ring buffer
      using these two pointers, to avoid overwriting unfetched records.
      
      By mapping without 'PROT_WRITE', an overwritable ring buffer is
      created. Unlike a normal ring buffer, tooling is unable to maintain
      the 'tail' pointer because writing is forbidden. Therefore, for this
      type of ring buffer, the kernel overwrites old records
      unconditionally, working like a flight recorder. This feature would
      be useful if reading from an overwritable ring buffer were as easy
      as reading from a normal ring buffer. However, there's an obscure
      problem.
      
      The following figure demonstrates a full overwritable ring buffer. In
      this figure, the 'head' pointer points to the end of the last record,
      and a long record 'E' is pending. For a normal ring buffer, a 'tail'
      pointer would have pointed to position (X), so the kernel would know
      there's no more space in the ring buffer. However, for an
      overwritable ring buffer, the kernel ignores the 'tail' pointer.
      
         (X)                              head
          .                                |
          .                                V
          +------+-------+----------+------+---+
          |A....A|B.....B|C........C|D....D|   |
          +------+-------+----------+------+---+
      
      Record 'A' is overwritten by event 'E':
      
            head
             |
             V
          +--+---+-------+----------+------+---+
          |.E|..A|B.....B|C........C|D....D|E..|
          +--+---+-------+----------+------+---+
      
      Now tooling decides to read from this ring buffer. However, neither
      of the two natural positions, 'head' and the start of this ring
      buffer, points to the head of a record. Even though the full ring
      buffer can be accessed by tooling, it is unable to find a position
      from which to start decoding.
      
      The first attempt to solve this problem that I am aware of can be
      found at [1]. It makes the kernel maintain the 'tail' pointer,
      updating it when the ring buffer is half full. However, this approach
      introduces overhead to the fast path. Test results show a 1% overhead
      [2]. In addition, this method utilizes no more than 50% of the
      records.
      
      Another attempt can be found at [3], which allows putting the size of
      an event at the end of each record. This approach allows tooling to
      find records in a backward manner from the 'head' pointer by reading
      the size of a record from its tail. However, because of alignment
      requirements, it needs 8 bytes to record the size of a record, which
      is a huge waste. Its performance is also not good, because more data
      needs to be written. This approach also introduces some extra branch
      instructions to the fast path.
      
      'write_backward' is a better solution to this problem.
      
      The following figure demonstrates the state of the overwritable ring
      buffer when 'write_backward' is set, before overwriting:
      
             head
              |
              V
          +---+------+----------+-------+------+
          |   |D....D|C........C|B.....B|A....A|
          +---+------+----------+-------+------+
      
      and after overwriting:
                                           head
                                            |
                                            V
          +---+------+----------+-------+---+--+
          |..E|D....D|C........C|B.....B|A..|E.|
          +---+------+----------+-------+---+--+
      
      In each situation, 'head' points to the beginning of the newest record.
      From this record, tooling can iterate over the full ring buffer and fetch
      records one by one.
      
      The only limitation that needs to be considered is back-to-back
      reading. Due to the non-deterministic nature of user programs, it is
      impossible to ensure the ring buffer stays stable during reading.
      Consider an extreme situation: tooling is scheduled out after reading
      record 'D', then a burst of events comes and eats up the whole ring
      buffer (one or multiple rounds). When the tooling process comes back,
      reading after 'D' is incorrect now.
      
      To prevent this problem, we need to find a way to ensure the ring buffer
      is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
      suggested because its overhead is lower than
      ioctl(PERF_EVENT_IOC_ENABLE).
      
      By carefully verifying against the 'head' pointer, the reader can
      avoid pausing the ring buffer. For example:
      
          /* A union of all possible events */
          union perf_event event;
      
          p = head = perf_mmap__read_head();
          while (true) {
              /* copy header of next event */
              fetch(&event.header, p, sizeof(event.header));
      
              /* read 'head' pointer */
              head = perf_mmap__read_head();
      
              /* check overwritten: is the header good? */
              if (!verify(sizeof(event.header), p, head))
                  break;
      
              /* copy the whole event */
              fetch(&event, p, event.header.size);
      
              /* read 'head' pointer again */
              head = perf_mmap__read_head();
      
              /* is the whole event good? */
              if (!verify(event.header.size, p, head))
                  break;
              p += event.header.size;
          }
      
      However, the overhead is high because:
      
       a) In-place decoding is not safe.
          Copying-verifying-decoding is required.
       b) Fetching 'head' pointer requires additional synchronization.
      
      (From Alexei Starovoitov:
      
      Even when this trick works, pause is needed for more than stability of
      reading. When we collect the events into overwrite buffer we're waiting
      for some other trigger (like all cpu utilization spike or just one cpu
      running and all others are idle) and when it happens the buffer has
      valuable info from the past. At this point new events are no longer
      interesting and buffer should be paused, events read and unpaused until
      next trigger comes.)
      
      This patch utilizes the event's default overflow_handler introduced
      previously. perf_event_output_backward() is created as the default
      overflow handler for backward ring buffers. To avoid adding extra
      overhead to the fast path, the original perf_event_output() becomes
      __perf_event_output() and is marked '__always_inline'. In theory,
      there's no extra overhead introduced to the fast path.
      
      Performance testing:
      
      Call 'close(-1)' 3000000 times, using gettimeofday() to check the
      duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
      the system calls. Times are in ns.
      
      Testing environment:
      
        CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
        Kernel : v4.5.0
                          MEAN         STDVAR
       BASE            800214.950    2853.083
       PRE1           2253846.700    9997.014
       PRE2           2257495.540    8516.293
       POST           2250896.100    8933.921
      
      Where 'BASE' is pure performance without capturing. 'PRE1' is test
      result of pure 'v4.5.0' kernel. 'PRE2' is test result before this
      patch. 'POST' is test result after this patch. See [4] for the detailed
      experimental setup.
      
      Considering the stdvar, this patch doesn't introduce performance
      overhead to the fast path.
      
       [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
       [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
       [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
       [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: <acme@kernel.org>
      Cc: <pi3orama@163.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1459865478-53413-1-git-send-email-wangnan0@huawei.com
      [ Fixed the changelog some more. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9ecda41a
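
      A sketch of setting up a backward-writing, overwritable buffer as
      described above; the event choice and buffer size are arbitrary:

        #include <linux/perf_event.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <string.h>

        static void *open_backward_buffer(int *fd_out)
        {
                struct perf_event_attr attr;
                long page = sysconf(_SC_PAGESIZE);
                int fd;

                memset(&attr, 0, sizeof(attr));
                attr.size           = sizeof(attr);
                attr.type           = PERF_TYPE_SOFTWARE;
                attr.config         = PERF_COUNT_SW_CONTEXT_SWITCHES;
                attr.sample_period  = 1;
                attr.sample_type    = PERF_SAMPLE_TID | PERF_SAMPLE_TIME;
                attr.write_backward = 1;        /* records go from end to start */

                fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
                *fd_out = fd;

                /* No PROT_WRITE: the kernel treats the buffer as overwritable. */
                return mmap(NULL, (1 + 8) * page, PROT_READ, MAP_SHARED, fd, 0);
        }
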
  33. 31 Mar 2016, 1 commit
    • perf/ring_buffer: Introduce new ioctl options to pause and resume the ring-buffer · 86e7972f
      Authored by Wang Nan
      Add a new ioctl() to pause/resume ring-buffer output.
      
      In some situations we want to read from the ring buffer only when we
      can ensure nothing can write to the ring buffer during reading.
      Without this patch we have to turn off all events attached to this
      ring buffer to achieve this.
      
      This patch is a prerequisite for enabling overwrite support for the
      perf ring buffer. Following commits will introduce new methods to
      support reading from the overwrite ring buffer. Before reading, the
      caller must ensure the ring buffer is frozen, or the reading is
      unreliable.
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <pi3orama@163.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1459147292-239310-2-git-send-email-wangnan0@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      86e7972f
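
      A sketch of the pause/read/resume pattern this ioctl enables;
      read_all_records() is a stub standing in for the caller's own reader:

        #include <linux/perf_event.h>
        #include <sys/ioctl.h>

        static void read_all_records(void *ring_base)
        {
                (void)ring_base;        /* walk the frozen ring buffer here */
        }

        static int drain_ring_buffer(int perf_fd, void *ring_base)
        {
                int err;

                /* Stop the kernel from writing new records. */
                err = ioctl(perf_fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 1);
                if (err)
                        return err;

                read_all_records(ring_base);

                /* Let output continue. */
                return ioctl(perf_fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 0);
        }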