- 06 4月, 2021 1 次提交
-
-
由 Suzuki K Poulose 提交于
Allocate a byte for advertising the PMU specific format type of the given AUX record. A PMU could end up providing hardware trace data in multiple format in a single session. e.g, The format of hardware buffer produced by CoreSight ETM PMU depends on the type of the "sink" device used for collection for an event (Traditional TMC-ETR/Bs with formatting or TRBEs without any formatting). # Boring story of why this is needed. Goto The_End_of_Story for skipping. CoreSight ETM trace allows instruction level tracing of Arm CPUs. The ETM generates the CPU excecution trace and pumps it into CoreSight AMBA Trace Bus and is collected by a different CoreSight component (traditionally CoreSight TMC-ETR /ETB/ETF), called "sink". Important to note that there is no guarantee that every CPU has a dedicated sink. Thus multiple ETMs could pump the trace data into the same "sink" and thus they apply additional formatting of the trace data for the user to decode it properly and attribute the trace data to the corresponding ETM. However, with the introduction of Arm Trace buffer Extensions (TRBE), we now have a dedicated per-CPU architected sink for collecting the trace. Since the TRBE is always per-CPU, it doesn't apply any formatting of the trace. The support for this driver is under review [1]. Now a system could have a per-cpu TRBE and one or more shared TMC-ETRs on the system. A user could choose a "specific" sink for a perf session (e.g, a TMC-ETR) or the driver could automatically select the nearest sink for a given ETM. It is possible that some ETMs could end up using TMC-ETR (e.g, if the TRBE is not usable on the CPU) while the others using TRBE in a single perf session. Thus we now have "formatted" trace collected from TMC-ETR and "unformatted" trace collected from TRBE. However, we don't get into a situation where a single event could end up using TMC-ETR & TRBE. i.e, any AUX buffer is guaranteed to be either RAW or FORMATTED, but not a mix of both. As for perf decoding, we need to know the type of the data in the individual AUX buffers, so that it can set up the "OpenCSD" (library for decoding CoreSight trace) decoder instance appropriately. Thus the perf.data file must conatin the hints for the tool to decode the data correctly. Since this is a runtime variable, and perf tool doesn't have a control on what sink gets used (in case of automatic sink selection), we need this information made available from the PMU driver for each AUX record. # The_End_of_Story Cc: Peter Ziljstra <peterz@infradead.org> Cc: alexander.shishkin@linux.intel.com Cc: mingo@redhat.com Cc: will@kernel.org Cc: mark.rutland@arm.com Cc: mike.leach@linaro.org Cc: acme@kernel.org Cc: jolsa@redhat.com Cc: Mathieu Poirier <mathieu.poirer@linaro.org> Reviewed by: Mike Leach <mike.leach@linaro.org> Acked-by: NPeter Ziljstra <peterz@infradead.org> Signed-off-by: NSuzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20210405164307.1720226-2-suzuki.poulose@arm.comSigned-off-by: NMathieu Poirier <mathieu.poirier@linaro.org>
-
- 01 2月, 2021 2 次提交
-
-
由 Kan Liang 提交于
Add perf core PMU support for the Intel Sapphire Rapids server, which is the successor of the Intel Ice Lake server. The enabling code is based on Ice Lake, but there are several new features introduced. The event encoding is changed and simplified, e.g., the event codes which are below 0x90 are restricted to counters 0-3. The event codes which above 0x90 are likely to have no restrictions. The event constraints, extra_regs(), and hardware cache events table are changed accordingly. A new Precise Distribution (PDist) facility is introduced, which further minimizes the skid when a precise event is programmed on the GP counter 0. Enable the Precise Distribution (PDist) facility with :ppp event. For this facility to work, the period must be initialized with a value larger than 127. Add spr_limit_period() to apply the limit for :ppp event. Two new data source fields, data block & address block, are added in the PEBS Memory Info Record for the load latency event. To enable the feature, - An auxiliary event has to be enabled together with the load latency event on Sapphire Rapids. A new flag PMU_FL_MEM_LOADS_AUX is introduced to indicate the case. A new event, mem-loads-aux, is exposed to sysfs for the user tool. Add a check in hw_config(). If the auxiliary event is not detected, return an unique error -ENODATA. - The union perf_mem_data_src is extended to support the new fields. - Ice Lake and earlier models do not support block information, but the fields may be set by HW on some machines. Add pebs_no_block to explicitly indicate the previous platforms which don't support the new block fields. Accessing the new block fields are ignored on those platforms. A new store Latency facility is introduced, which leverages the PEBS facility where it can provide additional information about sampled stores. The additional information includes the data address, memory auxiliary info (e.g. Data Source, STLB miss) and the latency of the store access. To enable the facility, the new event (0x02cd) has to be programed on the GP counter 0. A new flag PERF_X86_EVENT_PEBS_STLAT is introduced to indicate the event. The store_latency_data() is introduced to parse the memory auxiliary info. The layout of access latency field of PEBS Memory Info Record has been changed. Two latency, instruction latency (bit 15:0) and cache access latency (bit 47:32) are recorded. - The cache access latency is similar to previous memory access latency. For loads, the latency starts by the actual cache access until the data is returned by the memory subsystem. For stores, the latency starts when the demand write accesses the L1 data cache and lasts until the cacheline write is completed in the memory subsystem. The cache access latency is stored in low 32bits of the sample type PERF_SAMPLE_WEIGHT_STRUCT. - The instruction latency starts by the dispatch of the load operation for execution and lasts until completion of the instruction it belongs to. Add a new flag PMU_FL_INSTR_LATENCY to indicate the instruction latency support. The instruction latency is stored in the bit 47:32 of the sample type PERF_SAMPLE_WEIGHT_STRUCT. Extends the PERF_METRICS MSR to feature TMA method level 2 metrics. The lower half of the register is the TMA level 1 metrics (legacy). The upper half is also divided into four 8-bit fields for the new level 2 metrics. Expose all eight Topdown metrics events to user space. The full description for the SPR features can be found at Intel Architecture Instruction Set Extensions and Future Features Programming Reference, 319433-041. Signed-off-by: NKan Liang <kan.liang@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1611873611-156687-5-git-send-email-kan.liang@linux.intel.com
-
由 Kan Liang 提交于
Current PERF_SAMPLE_WEIGHT sample type is very useful to expresses the cost of an action represented by the sample. This allows the profiler to scale the samples to be more informative to the programmer. It could also help to locate a hotspot, e.g., when profiling by memory latencies, the expensive load appear higher up in the histograms. But current PERF_SAMPLE_WEIGHT sample type is solely determined by one factor. This could be a problem, if users want two or more factors to contribute to the weight. For example, Golden Cove core PMU can provide both the instruction latency and the cache Latency information as factors for the memory profiling. For current X86 platforms, although meminfo::latency is defined as a u64, only the lower 32 bits include the valid data in practice (No memory access could last than 4G cycles). The higher 32 bits can be used to store new factors. Add a new sample type, PERF_SAMPLE_WEIGHT_STRUCT, to indicate the new sample weight structure. It shares the same space as the PERF_SAMPLE_WEIGHT sample type. Users can apply either the PERF_SAMPLE_WEIGHT sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample type to retrieve the sample weight, but they cannot apply both sample types simultaneously. Currently, only X86 and PowerPC use the PERF_SAMPLE_WEIGHT sample type. - For PowerPC, there is nothing changed for the PERF_SAMPLE_WEIGHT sample type. There is no effect for the new PERF_SAMPLE_WEIGHT_STRUCT sample type. PowerPC can re-struct the weight field similarly later. - For X86, the same value will be dumped for the PERF_SAMPLE_WEIGHT sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample type for now. The following patches will apply the new factors for the PERF_SAMPLE_WEIGHT_STRUCT sample type. The field in the union perf_sample_weight should be shared among different architectures. A generic name is required, but it's hard to abstract a name that applies to all architectures. For example, on X86, the fields are to store all kinds of latency. While on PowerPC, it stores MMCRA[TECX/TECM], which should not be latency. So a general name prefix 'var$NUM' is used here. Suggested-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NKan Liang <kan.liang@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1611873611-156687-2-git-send-email-kan.liang@linux.intel.com
-
- 15 1月, 2021 1 次提交
-
-
由 Jiri Olsa 提交于
Adding support to carry build id data in mmap2 event. The build id data replaces maj/min/ino/ino_generation fields, which are also used to identify map's binary, so it's ok to replace them with build id data: union { struct { u32 maj; u32 min; u64 ino; u64 ino_generation; }; struct { u8 build_id_size; u8 __reserved_1; u16 __reserved_2; u8 build_id[20]; }; }; Replaced maj/min/ino/ino_generation fields give us size of 24 bytes. We use 20 bytes for build id data, 1 byte for size and rest is unused. There's new misc bit for mmap2 to signal there's build id data in it: #define PERF_RECORD_MISC_MMAP_BUILD_ID (1 << 14) Signed-off-by: NJiri Olsa <jolsa@kernel.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/bpf/20210114134044.1418404-4-jolsa@kernel.org
-
- 29 10月, 2020 2 次提交
-
-
由 Stephane Eranian 提交于
When studying code layout, it is useful to capture the page size of the sampled code address. Add a new sample type for code page size. The new sample type requires collecting the ip. The code page size can be calculated from the NMI-safe perf_get_page_size(). For large PEBS, it's very unlikely that the mapping is gone for the earlier PEBS records. Enable the feature for the large PEBS. The worst case is that page-size '0' is returned. Signed-off-by: NKan Liang <kan.liang@linux.intel.com> Signed-off-by: NStephane Eranian <eranian@google.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201001135749.2804-5-kan.liang@linux.intel.com
-
由 Kan Liang 提交于
Current perf can report both virtual addresses and physical addresses, but not the MMU page size. Without the MMU page size information of the utilized page, users cannot decide whether to promote/demote large pages to optimize memory usage. Add a new sample type for the data MMU page size. Current perf already has a facility to collect data virtual addresses. A page walker is required to walk the pages tables and calculate the MMU page size from a given virtual address. On some platforms, e.g., X86, the page walker is invoked in an NMI handler. So the page walker must be NMI-safe and low overhead. Besides, the page walker should work for both user and kernel virtual address. The existing generic page walker, e.g., walk_page_range_novma(), is a little bit complex and doesn't guarantee the NMI-safe. The follow_page() is only for user-virtual address. Add a new function perf_get_page_size() to walk the page tables and calculate the MMU page size. In the function: - Interrupts have to be disabled to prevent any teardown of the page tables. - For user space threads, the current->mm is used for the page walker. For kernel threads and the like, the current->mm is NULL. The init_mm is used for the page walker. The active_mm is not used here, because it can be NULL. Quote from Peter Zijlstra, "context_switch() can set prev->active_mm to NULL when it transfers it to @next. It does this before @current is updated. So an NMI that comes in between this active_mm swizzling and updating @current will see !active_mm." - The MMU page size is calculated from the page table level. The method should work for all architectures, but it has only been verified on X86. Should there be some architectures, which support perf, where the method doesn't work, it can be fixed later separately. Reporting the wrong page size would not be fatal for the architecture. Some under discussion features may impact the method in the future. Quote from Dave Hansen, "There are lots of weird things folks are trying to do with the page tables, like Address Space Isolation. For instance, if you get a perf NMI when running userspace, current->mm->pgd is *different* than the PGD that was in use when userspace was running. It's close enough today, but it might not stay that way." If the case happens later, lots of consecutive page walk errors will happen. The worst case is that lots of page-size '0' are returned, which would not be fatal. In the perf tool, a check is implemented to detect this case. Once it happens, a kernel patch could be implemented accordingly then. Suggested-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NKan Liang <kan.liang@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201001135749.2804-2-kan.liang@linux.intel.com
-
- 20 10月, 2020 1 次提交
-
-
由 Al Grant 提交于
perf_event.h has macros that define the field offsets in the data_src bitmask in perf records. The SNOOPX and REMOTE offsets were both 37. These are distinct fields, and the bitfield layout in perf_mem_data_src confirms that SNOOPX should be at offset 38. Fixes: 52839e65 ("perf tools: Add support for printing new mem_info encodings") Signed-off-by: NAl Grant <al.grant@foss.arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NAndi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/4ac9f5cc-4388-b34a-9999-418a4099415d@foss.arm.com
-
- 20 7月, 2020 1 次提交
-
-
由 Peter Zijlstra 提交于
In order to support short clock counters, provide an ABI extension. As a whole: u64 time, delta, cyc = read_cycle_counter(); + if (cap_user_time_short) + cyc = time_cycle + ((cyc - time_cycle) & time_mask); delta = mul_u64_u32_shr(cyc, time_mult, time_shift); if (cap_user_time_zero) time = time_zero + delta; delta += time_offset; Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NLeo Yan <leo.yan@linaro.org> Link: https://lore.kernel.org/r/20200716051130.4359-6-leo.yan@linaro.orgSigned-off-by: NWill Deacon <will@kernel.org>
-
- 15 6月, 2020 3 次提交
-
-
由 Adrian Hunter 提交于
Symbols are needed for tools to describe instruction addresses. Pages allocated for ftrace's purposes need symbols to be created for them. Add such symbols to be visible via perf ksymbol events. Signed-off-by: NAdrian Hunter <adrian.hunter@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200512121922.8997-8-adrian.hunter@intel.com
-
由 Adrian Hunter 提交于
Symbols are needed for tools to describe instruction addresses. Pages allocated for kprobe's purposes need symbols to be created for them. Add such symbols to be visible via perf ksymbol events. Signed-off-by: NAdrian Hunter <adrian.hunter@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NMasami Hiramatsu <mhiramat@kernel.org> Link: https://lkml.kernel.org/r/20200512121922.8997-5-adrian.hunter@intel.com
-
由 Adrian Hunter 提交于
Record (single instruction) changes to the kernel text (i.e. self-modifying code) in order to support tracers like Intel PT and ARM CoreSight. A copy of the running kernel code is needed as a reference point (e.g. from /proc/kcore). The text poke event records the old bytes and the new bytes so that the event can be processed forwards or backwards. The basic problem is recording the modified instruction in an unambiguous manner given SMP instruction cache (in)coherence. That is, when modifying an instruction concurrently any solution with one or multiple timestamps is not sufficient: CPU0 CPU1 0 1 write insn A 2 execute insn A 3 sync-I$ 4 Due to I$, CPU1 might execute either the old or new A. No matter where we record tracepoints on CPU0, one simply cannot tell what CPU1 will have observed, except that at 0 it must be the old one and at 4 it must be the new one. To solve this, take inspiration from x86 text poking, which has to solve this exact problem due to variable length instruction encoding and I-fetch windows. 1) overwrite the instruction with a breakpoint and sync I$ This guarantees that that code flow will never hit the target instruction anymore, on any CPU (or rather, it will cause an exception). 2) issue the TEXT_POKE event 3) overwrite the breakpoint with the new instruction and sync I$ Now we know that any execution after the TEXT_POKE event will either observe the breakpoint (and hit the exception) or the new instruction. So by guarding the TEXT_POKE event with an exception on either side; we can now tell, without doubt, which instruction another CPU will have observed. Signed-off-by: NAdrian Hunter <adrian.hunter@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200512121922.8997-2-adrian.hunter@intel.com
-
- 27 3月, 2020 2 次提交
-
-
由 Namhyung Kim 提交于
The PERF_SAMPLE_CGROUP bit is to save (perf_event) cgroup information in the sample. It will add a 64-bit id to identify current cgroup and it's the file handle in the cgroup file system. Userspace should use this information with PERF_RECORD_CGROUP event to match which cgroup it belongs. I put it before PERF_SAMPLE_AUX for simplicity since it just needs a 64-bit word. But if we want bigger samples, I can work on that direction too. Committer testing: $ pahole perf_sample_data | grep -w cgroup -B5 -A5 /* --- cacheline 4 boundary (256 bytes) was 56 bytes ago --- */ struct perf_regs regs_intr; /* 312 16 */ /* --- cacheline 5 boundary (320 bytes) was 8 bytes ago --- */ u64 stack_user_size; /* 328 8 */ u64 phys_addr; /* 336 8 */ u64 cgroup; /* 344 8 */ /* size: 384, cachelines: 6, members: 22 */ /* padding: 32 */ }; $ Signed-off-by: NNamhyung Kim <namhyung@kernel.org> Tested-by: NArnaldo Carvalho de Melo <acme@redhat.com> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NTejun Heo <tj@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Zefan Li <lizefan@huawei.com> Link: http://lore.kernel.org/lkml/20200325124536.2800725-3-namhyung@kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
-
由 Namhyung Kim 提交于
To support cgroup tracking, add CGROUP event to save a link between cgroup path and id number. This is needed since cgroups can go away when userspace tries to read the cgroup info (from the id) later. The attr.cgroup bit was also added to enable cgroup tracking from userspace. This event will be generated when a new cgroup becomes active. Userspace might need to synthesize those events for existing cgroups. Committer testing: From the resulting kernel, using /sys/kernel/btf/vmlinux: $ pahole perf_event_attr | grep -w cgroup -B5 -A1 __u64 write_backward:1; /* 40:27 8 */ __u64 namespaces:1; /* 40:28 8 */ __u64 ksymbol:1; /* 40:29 8 */ __u64 bpf_event:1; /* 40:30 8 */ __u64 aux_output:1; /* 40:31 8 */ __u64 cgroup:1; /* 40:32 8 */ __u64 __reserved_1:31; /* 40:33 8 */ $ Reported-by: Nkbuild test robot <lkp@intel.com> Signed-off-by: NNamhyung Kim <namhyung@kernel.org> Tested-by: NArnaldo Carvalho de Melo <acme@redhat.com> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NTejun Heo <tj@kernel.org> [staticize perf_event_cgroup function] Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Zefan Li <lizefan@huawei.com> Link: http://lore.kernel.org/lkml/20200325124536.2800725-2-namhyung@kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
-
- 11 2月, 2020 1 次提交
-
-
由 Kan Liang 提交于
The low level index is the index in the underlying hardware buffer of the most recently captured taken branch which is always saved in branch_entries[0]. It is very useful for reconstructing the call stack. For example, in Intel LBR call stack mode, the depth of reconstructed LBR call stack limits to the number of LBR registers. With the low level index information, perf tool may stitch the stacks of two samples. The reconstructed LBR call stack can break the HW limitation. Add a new branch sample type to retrieve low level index of raw branch records. The low level index is between -1 (unknown) and max depth which can be retrieved in /sys/devices/cpu/caps/branches. Only when the new branch sample type is set, the low level index information is dumped into the PERF_SAMPLE_BRANCH_STACK output. Perf tool should check the attr.branch_sample_type, and apply the corresponding format for PERF_SAMPLE_BRANCH_STACK samples. Otherwise, some user case may be broken. For example, users may parse a perf.data, which include the new branch sample type, with an old version perf tool (without the check). Users probably get incorrect information without any warning. Signed-off-by: NKan Liang <kan.liang@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NIngo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200127165355.27495-2-kan.liang@linux.intel.com
-
- 13 11月, 2019 1 次提交
-
-
由 Alexander Shishkin 提交于
AUX data can be used to annotate perf events such as performance counters or tracepoints/breakpoints by including it in sample records when PERF_SAMPLE_AUX flag is set. Such samples would be instrumental in debugging and profiling by providing, for example, a history of instruction flow leading up to the event's overflow. The implementation makes use of grouping an AUX event with all the events that wish to take samples of the AUX data, such that the former is the group leader. The samplees should also specify the desired size of the AUX sample via attr.aux_sample_size. AUX capable PMUs need to explicitly add support for sampling, because it relies on a new callback to take a snapshot of the buffer without touching the event states. Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: adrian.hunter@intel.com Cc: mathieu.poirier@linaro.org Link: https://lkml.kernel.org/r/20191025140835.53665-2-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 28 8月, 2019 1 次提交
-
-
由 Alexander Shishkin 提交于
In some cases, ordinary (non-AUX) events can generate data for AUX events. For example, PEBS events can come out as records in the Intel PT stream instead of their usual DS records, if configured to do so. One requirement for such events is to consistently schedule together, to ensure that the data from the "AUX output" events isn't lost while their corresponding AUX event is not scheduled. We use grouping to provide this guarantee: an "AUX output" event can be added to a group where an AUX event is a group leader, and provided that the former supports writing to the latter. Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: kan.liang@linux.intel.com Link: https://lkml.kernel.org/r/20190806084606.4021-2-alexander.shishkin@linux.intel.com
-
- 22 1月, 2019 2 次提交
-
-
由 Song Liu 提交于
For better performance analysis of BPF programs, this patch introduces PERF_RECORD_BPF_EVENT, a new perf_event_type that exposes BPF program load/unload information to user space. Each BPF program may contain up to BPF_MAX_SUBPROGS (256) sub programs. The following example shows kernel symbols for a BPF program with 7 sub programs: ffffffffa0257cf9 t bpf_prog_b07ccb89267cf242_F ffffffffa02592e1 t bpf_prog_2dcecc18072623fc_F ffffffffa025b0e9 t bpf_prog_bb7a405ebaec5d5c_F ffffffffa025dd2c t bpf_prog_a7540d4a39ec1fc7_F ffffffffa025fcca t bpf_prog_05762d4ade0e3737_F ffffffffa026108f t bpf_prog_db4bd11e35df90d4_F ffffffffa0263f00 t bpf_prog_89d64e4abf0f0126_F ffffffffa0257cf9 t bpf_prog_ae31629322c4b018__dummy_tracepoi When a bpf program is loaded, PERF_RECORD_KSYMBOL is generated for each of these sub programs. Therefore, PERF_RECORD_BPF_EVENT is not needed for simple profiling. For annotation, user space need to listen to PERF_RECORD_BPF_EVENT and gather more information about these (sub) programs via sys_bpf. Signed-off-by: NSong Liu <songliubraving@fb.com> Reviewed-by: NArnaldo Carvalho de Melo <acme@redhat.com> Acked-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NPeter Zijlstra (Intel) <peterz@infradeaed.org> Tested-by: NArnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kernel-team@fb.com Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/20190117161521.1341602-4-songliubraving@fb.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
-
由 Song Liu 提交于
For better performance analysis of dynamically JITed and loaded kernel functions, such as BPF programs, this patch introduces PERF_RECORD_KSYMBOL, a new perf_event_type that exposes kernel symbol register/unregister information to user space. The following data structure is used for PERF_RECORD_KSYMBOL. /* * struct { * struct perf_event_header header; * u64 addr; * u32 len; * u16 ksym_type; * u16 flags; * char name[]; * struct sample_id sample_id; * }; */ Signed-off-by: NSong Liu <songliubraving@fb.com> Reviewed-by: NArnaldo Carvalho de Melo <acme@redhat.com> Tested-by: NArnaldo Carvalho de Melo <acme@redhat.com> Acked-by: NPeter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kernel-team@fb.com Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/20190117161521.1341602-2-songliubraving@fb.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
-
- 21 1月, 2019 1 次提交
-
-
由 Andrew Murray 提交于
Now that perf_flags is not used we remove it. Signed-off-by: NAndrew Murray <andrew.murray@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Russell King <linux@armlinux.org.uk> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: robin.murphy@arm.com Cc: suzuki.poulose@arm.com Link: https://lkml.kernel.org/r/1547128414-50693-13-git-send-email-andrew.murray@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 31 10月, 2018 1 次提交
-
-
由 David Miller 提交于
When synthesizing FORK events, we are trying to create thread objects for the already running tasks on the machine. Normally, for a kernel FORK event, we want to clone the parent's maps because that is what the kernel just did. But when synthesizing, this should not be done. If we do, we end up with overlapping maps as we process the sythesized MMAP2 events that get delivered shortly thereafter. Use the FORK event misc flags in an internal way to signal this situation, so we can elide the map clone when appropriate. Signed-off-by: NDavid S. Miller <davem@davemloft.net> Cc: Don Zickus <dzickus@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Joe Mario <jmario@redhat.com> Link: http://lkml.kernel.org/r/20181030.222404.2085088822877051075.davem@davemloft.net [ Added comment about flag use in machine__process_fork_event(), use ternary op in thread__clone_map_groups() as suggested by Jiri ] Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
-
- 10 9月, 2018 1 次提交
-
-
由 Peter Zijlstra 提交于
Vince noted that commit: 6cbc304f ("perf/x86/intel: Fix unwind errors from PEBS entries (mk-II)") 'leaked' __PERF_SAMPLE_CALLCHAIN_EARLY into the UAPI namespace. And while sys_perf_event_open() will error out if you try to use it, it is exposed. Clearly mark it for internal use only to avoid any confusion. Requested-by: NVince Weaver <vincent.weaver@maine.edu> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 25 7月, 2018 1 次提交
-
-
由 Peter Zijlstra 提交于
Vince reported the perf_fuzzer giving various unwinder warnings and Josh reported: > Deja vu. Most of these are related to perf PEBS, similar to the > following issue: > > b8000586 ("perf/x86/intel: Cure bogus unwind from PEBS entries") > > This is basically the ORC version of that. setup_pebs_sample_data() is > assembling a franken-pt_regs which ORC isn't happy about. RIP is > inconsistent with some of the other registers (like RSP and RBP). And where the previous unwinder only needed BP,SP ORC also requires IP. But we cannot spoof IP because then the sample will get displaced, entirely negating the point of PEBS. So cure the whole thing differently by doing the unwind early; this does however require a means to communicate we did the unwind early. We (ab)use an unused sample_type bit for this, which we set on events that fill out the data->callchain before the normal perf_prepare_sample(). Debugged-by: NJosh Poimboeuf <jpoimboe@redhat.com> Reported-by: NVince Weaver <vincent.weaver@maine.edu> Tested-by: NJosh Poimboeuf <jpoimboe@redhat.com> Tested-by: NPrashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 17 4月, 2018 1 次提交
-
-
由 Alexey Budankov 提交于
Store preempting context switch out event into Perf trace as a part of PERF_RECORD_SWITCH[_CPU_WIDE] record. Percentage of preempting and non-preempting context switches help understanding the nature of workloads (CPU or IO bound) that are running on a machine; The event is treated as preemption one when task->state value of the thread being switched out is TASK_RUNNING. Event type encoding is implemented using PERF_RECORD_MISC_SWITCH_OUT_PREEMPT bit; Signed-off-by: NAlexey Budankov <alexey.budankov@linux.intel.com> Acked-by: NPeter Zijlstra <peterz@infradead.org> Tested-by: NArnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/9ff84e83-a0ca-dd82-a6d0-cb951689be74@linux.intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
-
- 13 3月, 2018 1 次提交
-
-
由 Milind Chabbi 提交于
Problem and motivation: Once a breakpoint perf event (PERF_TYPE_BREAKPOINT) is created, there is no flexibility to change the breakpoint type (bp_type), breakpoint address (bp_addr), or breakpoint length (bp_len). The only option is to close the perf event and configure a new breakpoint event. This inflexibility has a significant performance overhead. For example, sampling-based, lightweight performance profilers (and also concurrency bug detection tools), monitor different addresses for a short duration using PERF_TYPE_BREAKPOINT and change the address (bp_addr) to another address or change the kind of breakpoint (bp_type) from "write" to a "read" or vice-versa or change the length (bp_len) of the address being monitored. The cost of these modifications is prohibitive since it involves unmapping the circular buffer associated with the perf event, closing the perf event, opening another perf event and mmaping another circular buffer. Solution: The new ioctl flag for perf events, PERF_EVENT_IOC_MODIFY_ATTRIBUTES, introduced in this patch takes a pointer to a struct perf_event_attr as an argument to update an old breakpoint event with new address, type, and size. This facility allows retaining a previous mmaped perf events ring buffer and avoids having to close and reopen another perf event. This patch supports only changing PERF_TYPE_BREAKPOINT event type; future implementations can extend this feature. The patch replicates some of its functionality of modify_user_hw_breakpoint() in kernel/events/hw_breakpoint.c. modify_user_hw_breakpoint cannot be called directly since perf_event_ctx_lock() is already held in _perf_ioctl(). Evidence: Experiments show that the baseline (not able to modify an already created breakpoint) costs an order of magnitude (~10x) more than the suggested optimization (having the ability to dynamically modifying a configured breakpoint via ioctl). When the breakpoints typically do not trap, the speedup due to the suggested optimization is ~10x; even when the breakpoints always trap, the speedup is ~4x due to the suggested optimization. Testing: tests posted at https://github.com/linux-contrib/perf_event_modify_bp demonstrate the performance significance of this patch. Tests also check the functional correctness of the patch. Signed-off-by: NMilind Chabbi <chabbi.milind@gmail.com> [ Using modify_user_hw_breakpoint_check function. ] [ Reformated PERF_EVENT_IOC_*, so the values are all in one column. ] Signed-off-by: NJiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <onestero@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/20180312134548.31532-8-jolsa@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 06 2月, 2018 1 次提交
-
-
由 Song Liu 提交于
Two new perf types, perf_kprobe and perf_uprobe, will be added to allow creating [k,u]probe with perf_event_open. These [k,u]probe are associated with the file decriptor created by perf_event_open(), thus are easy to clean when the file descriptor is destroyed. kprobe_func and uprobe_path are added to union config1 for pointers to function name for kprobe or binary path for uprobe. kprobe_addr and probe_offset are added to union config2 for kernel address (when kprobe_func is NULL), or [k,u]probe offset. Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NYonghong Song <yhs@fb.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Acked-by: NAlexei Starovoitov <ast@kernel.org> Cc: <daniel@iogearbox.net> Cc: <davem@davemloft.net> Cc: <kernel-team@fb.com> Cc: <rostedt@goodmis.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171206224518.3598254-4-songliubraving@fb.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 08 1月, 2018 2 次提交
-
-
由 Jiri Olsa 提交于
The perf_event_header::misc bit 13 is shared on different events and next patch is adding yet another bit 13 user. Updating the comment to make it more structured and clear which events use bit 13. Suggested-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NJiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/20180107160356.28203-8-jolsa@kernel.org [ Update the tools/include/uapi/linux copy ] Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
-
由 Jiri Olsa 提交于
Adding missing sample_id line into PERF_RECORD_ITRACE_START event comment. Signed-off-by: NJiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180107160356.28203-5-jolsa@kernel.org [ Update the tools/include/uapi/linux copy ] Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
-
- 13 12月, 2017 1 次提交
-
-
由 Yonghong Song 提交于
Commit e87c6bc3 ("bpf: permit multiple bpf attachments for a single perf event") added support to attach multiple bpf programs to a single perf event. Although this provides flexibility, users may want to know what other bpf programs attached to the same tp interface. Besides getting visibility for the underlying bpf system, such information may also help consolidate multiple bpf programs, understand potential performance issues due to a large array, and debug (e.g., one bpf program which overwrites return code may impact subsequent program results). Commit 2541517c ("tracing, perf: Implement BPF programs attached to kprobes") utilized the existing perf ioctl interface and added the command PERF_EVENT_IOC_SET_BPF to attach a bpf program to a tracepoint. This patch adds a new ioctl command, given a perf event fd, to query the bpf program array attached to the same perf tracepoint event. The new uapi ioctl command: PERF_EVENT_IOC_QUERY_BPF The new uapi/linux/perf_event.h structure: struct perf_event_query_bpf { __u32 ids_len; __u32 prog_cnt; __u32 ids[0]; }; User space provides buffer "ids" for kernel to copy to. When returning from the kernel, the number of available programs in the array is set in "prog_cnt". The usage: struct perf_event_query_bpf *query = malloc(sizeof(*query) + sizeof(u32) * ids_len); query.ids_len = ids_len; err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query); if (err == 0) { /* query.prog_cnt is the number of available progs, * number of progs in ids: (ids_len == 0) ? 0 : query.prog_cnt */ } else if (errno == ENOSPC) { /* query.ids_len number of progs copied, * query.prog_cnt is the number of available progs */ } else { /* other errors */ } Signed-off-by: NYonghong Song <yhs@fb.com> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
- 02 11月, 2017 1 次提交
-
-
由 Greg Kroah-Hartman 提交于
Many user space API headers have licensing information, which is either incomplete, badly formatted or just a shorthand for referring to the license under which the file is supposed to be. This makes it hard for compliance tools to determine the correct license. Update these files with an SPDX license identifier. The identifier was chosen based on the license information in the file. GPL/LGPL licensed headers get the matching GPL/LGPL SPDX license identifier with the added 'WITH Linux-syscall-note' exception, which is the officially assigned exception identifier for the kernel syscall exception: NOTE! This copyright does *not* cover user programs that use kernel services by normal system calls - this is merely considered normal use of the kernel, and does *not* fall under the heading of "derived work". This exception makes it possible to include GPL headers into non GPL code, without confusing license compliance tools. Headers which have either explicit dual licensing or are just licensed under a non GPL license are updated with the corresponding SPDX identifier and the GPLv2 with syscall exception identifier. The format is: ((GPL-2.0 WITH Linux-syscall-note) OR SPDX-ID-OF-OTHER-LICENSE) SPDX license identifiers are a legally binding shorthand, which can be used instead of the full boiler plate text. The update does not remove existing license information as this has to be done on a case by case basis and the copyright holders might have to be consulted. This will happen in a separate step. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. See the previous patch in this series for the methodology of how this patch was researched. Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org> Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 18 10月, 2017 1 次提交
-
-
由 Will Deacon 提交于
The ARM SPE architecture permits an implementation to ignore a sample if the sample is due to be taken whilst another sample is already being produced. In this case, it is desirable to report the collision to userspace, as they may want to lower the sample period. This patch adds a PERF_AUX_FLAG_COLLISION flag, so that such events can be relayed to userspace. Acked-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NWill Deacon <will.deacon@arm.com>
-
- 29 8月, 2017 1 次提交
-
-
由 Kan Liang 提交于
For understanding how the workload maps to memory channels and hardware behavior, it's very important to collect address maps with physical addresses. For example, 3D XPoint access can only be found by filtering the physical address. Add a new sample type for physical address. perf already has a facility to collect data virtual address. This patch introduces a function to convert the virtual address to physical address. The function is quite generic and can be extended to any architecture as long as a virtual address is provided. - For kernel direct mapping addresses, virt_to_phys is used to convert the virtual addresses to physical address. - For user virtual addresses, __get_user_pages_fast is used to walk the pages tables for user physical address. - This does not work for vmalloc addresses right now. These are not resolved, but code to do that could be added. The new sample type requires collecting the virtual address. The virtual address will not be output unless SAMPLE_ADDR is applied. For security, the physical address can only be exposed to root or privileged user. Tested-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: NKan Liang <kan.liang@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Cc: mpe@ellerman.id.au Link: http://lkml.kernel.org/r/1503967969-48278-1-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 25 8月, 2017 1 次提交
-
-
由 Andi Kleen 提交于
Skylake changed the encoding of the PEBS data source field. Some combinations are not available anymore, but some new cases e.g. for L4 cache hit are added. Fix up the conversion table for Skylake, similar as had been done for Nehalem. On Skylake server the encoding for L4 actually means persistent memory. Handle this case too. To properly describe it in the abstracted perf format I had to add some new fields. Since a hit can have only one level add a new field that is an enumeration, not a bit field to describe the level. It can describe any level. Some numbers are also used to describe PMEM and LFB. Also add a new generic remote flag that can be combined with the generic level to signify a remote cache. And there is an extension field for the snoop indication to handle the Forward state. I didn't add a generic flag for hops because it's not needed for Skylake. I changed the existing encodings for older CPUs to also fill in the new level and remote fields. Signed-off-by: NAndi Kleen <ak@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: jolsa@kernel.org Link: http://lkml.kernel.org/r/20170816222156.19953-3-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 19 7月, 2017 1 次提交
-
-
由 Jin Yao 提交于
It is often useful to know the branch types while analyzing branch data. For example, a call is very different from a conditional branch. Currently we have to look it up in binary while the binary may later not be available and even the binary is available but user has to take some time. It is very useful for user to check it directly in perf report. Perf already has support for disassembling the branch instruction to get the x86 branch type. To keep consistent on kernel and userspace and make the classification more common, the patch adds the common branch type classification in perf_event.h. The patch only defines a minimum but most common set of branch types. PERF_BR_UNKNOWN : unknown PERF_BR_COND :conditional PERF_BR_UNCOND : unconditional PERF_BR_IND : indirect PERF_BR_CALL : function call PERF_BR_IND_CALL : indirect function call PERF_BR_RET : function return PERF_BR_SYSCALL : syscall PERF_BR_SYSRET : syscall return PERF_BR_COND_CALL : conditional function call PERF_BR_COND_RET : conditional function return The patch also adds a new field type (4 bits) in perf_branch_entry to record the branch type. Since the disassembling of branch instruction needs some overhead, a new PERF_SAMPLE_BRANCH_TYPE_SAVE is introduced to indicate if it needs to disassemble the branch instruction and record the branch type. Change log: v10: Not changed. v9: Not changed. v8: Change PERF_BR_NONE to PERF_BR_UNKNOWN. No other change. v7: Just keep the most common branch types. Others are removed. v6: Not changed. v5: Not changed. The v5 patch series just change the userspace. v4: Comparing to previous version, the major changes are: 1. Remove the PERF_BR_JCC_FWD/PERF_BR_JCC_BWD, they will be computed later in userspace. 2. Remove the "cross" field in perf_branch_entry. The cross page computing will be done later in userspace. Signed-off-by: NYao Jin <yao.jin@linux.intel.com> Acked-by: NJiri Olsa <jolsa@kernel.org> Acked-by: NMichael Ellerman <mpe@ellerman.id.au> Acked-by: NPeter Zijlstra <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Link: http://lkml.kernel.org/r/1500379995-6449-2-git-send-email-yao.jin@linux.intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
-
- 19 4月, 2017 1 次提交
-
-
由 Sukadev Bhattiprolu 提交于
perf_mem_data_src is a union that is initialized in the kernel via the ->val field and accessed by userspace via the mem_xxx bitfields. For this to work correctly on big endian platforms, we need a big-endian definition for the bitfields. Currently on a big endian system, if a user requests PERF_SAMPLE_DATA_SRC (perf report -d), they will get the default value from perf_sample_data_init(), which is PERF_MEM_NA. The value for PERF_MEM_NA is constructed using shifts: /* TLB access */ #define PERF_MEM_TLB_NA 0x01 /* not available */ ... #define PERF_MEM_TLB_SHIFT 26 #define PERF_MEM_S(a, s) \ (((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT) #define PERF_MEM_NA (PERF_MEM_S(OP, NA) |\ PERF_MEM_S(LVL, NA) |\ PERF_MEM_S(SNOOP, NA) |\ PERF_MEM_S(LOCK, NA) |\ PERF_MEM_S(TLB, NA)) Which works out as: ((0x01 << 0) | (0x01 << 5) | (0x01 << 19) | (0x01 << 24) | (0x01 << 26)) Which means the PERF_MEM_NA value comes out of the kernel as 0x5080021 in CPU endian. But then in the perf tool, the code uses the bitfields to inspect the value, and currently the bitfields are defined using little endian ordering. So eg. in perf_mem__tlb_scnprintf() we see: data_src->val = 0x5080021 op = 0x0 lvl = 0x0 snoop = 0x0 lock = 0x0 dtlb = 0x0 rsvd = 0x5080021 Because of the way the perf tool code is written this is still displayed to the user as "N/A", so there is no bug visible at the UI level. Currently there are no big endian architectures which export a meaningful value (ie. other than PERF_MEM_NA), so the extent of the bug on big endian platforms is that the PERF_MEM_NA value is exported incorrectly as described above. Subsequent patches will add support on big endian powerpc for populating the data source value. This patch does a minimal fix of adding big endian definition of the bitfields to match the values that are already exported by the kernel on big endian. And it makes no change on little endian. Signed-off-by: NSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 16 3月, 2017 1 次提交
-
-
由 Alexander Shishkin 提交于
The Intel PT driver needs to be able to communicate partial AUX transactions, that is, transactions with gaps in data for reasons other than no room left in the buffer (i.e. truncated transactions). Therefore, this condition does not imply a wakeup for the consumer. To this end, add a new "partial" AUX flag. Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: vince@deater.net Link: http://lkml.kernel.org/r/20170220133352.17995-4-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 14 3月, 2017 1 次提交
-
-
由 Hari Bathini 提交于
With the advert of container technologies like docker, that depend on namespaces for isolation, there is a need for tracing support for namespaces. This patch introduces new PERF_RECORD_NAMESPACES event for recording namespaces related info. By recording info for every namespace, it is left to userspace to take a call on the definition of a container and trace containers by updating perf tool accordingly. Each namespace has a combination of device and inode numbers. Though every namespace has the same device number currently, that may change in future to avoid the need for a namespace of namespaces. Considering such possibility, record both device and inode numbers separately for each namespace. Signed-off-by: NHari Bathini <hbathini@linux.vnet.ibm.com> Acked-by: NJiri Olsa <jolsa@kernel.org> Acked-by: NPeter Zijlstra <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Aravinda Prasad <aravinda@linux.vnet.ibm.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
-
- 30 5月, 2016 1 次提交
-
-
由 Arnaldo Carvalho de Melo 提交于
Additionally to being able to control the system wide maximum depth via /proc/sys/kernel/perf_event_max_stack, now we are able to ask for different depths per event, using perf_event_attr.sample_max_stack for that. This uses an u16 hole at the end of perf_event_attr, that, when perf_event_attr.sample_type has the PERF_SAMPLE_CALLCHAIN, if sample_max_stack is zero, means use perf_event_max_stack, otherwise it'll be bounds checked under callchain_mutex. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Milian Wolff <milian.wolff@kdab.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Wang Nan <wangnan0@huawei.com> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
-
- 17 5月, 2016 1 次提交
-
-
由 Arnaldo Carvalho de Melo 提交于
The perf_sample->ip_callchain->nr value includes all the entries in the ip_callchain->ip[] array, real addresses and PERF_CONTEXT_{KERNEL,USER,etc}, while what the user expects is that what is in the kernel.perf_event_max_stack sysctl or in the upcoming per event perf_event_attr.sample_max_stack knob be honoured in terms of IP addresses in the stack trace. So allocate a bunch of extra entries for contexts, and do the accounting via perf_callchain_entry_ctx struct members. A new sysctl, kernel.perf_event_max_contexts_per_stack is also introduced for investigating possible bugs in the callchain implementation by some arch. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Milian Wolff <milian.wolff@kdab.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Wang Nan <wangnan0@huawei.com> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
-
- 23 4月, 2016 1 次提交
-
-
由 Wang Nan 提交于
This patch introduces 'write_backward' bit to perf_event_attr, which controls the direction of a ring buffer. After set, the corresponding ring buffer is written from end to beginning. This feature is design to support reading from overwritable ring buffer. Ring buffer can be created by mapping a perf event fd. Kernel puts event records into ring buffer, user tooling like perf fetch them from address returned by mmap(). To prevent racing between kernel and tooling, they communicate to each other through 'head' and 'tail' pointers. Kernel maintains 'head' pointer, points it to the next free area (tail of the last record). Tooling maintains 'tail' pointer, points it to the tail of last consumed record (record has already been fetched). Kernel determines the available space in a ring buffer using these two pointers to avoid overwrite unfetched records. By mapping without 'PROT_WRITE', an overwritable ring buffer is created. Different from normal ring buffer, tooling is unable to maintain 'tail' pointer because writing is forbidden. Therefore, for this type of ring buffers, kernel overwrite old records unconditionally, works like flight recorder. This feature would be useful if reading from overwritable ring buffer were as easy as reading from normal ring buffer. However, there's an obscure problem. The following figure demonstrates a full overwritable ring buffer. In this figure, the 'head' pointer points to the end of last record, and a long record 'E' is pending. For a normal ring buffer, a 'tail' pointer would have pointed to position (X), so kernel knows there's no more space in the ring buffer. However, for an overwritable ring buffer, kernel ignore the 'tail' pointer. (X) head . | . V +------+-------+----------+------+---+ |A....A|B.....B|C........C|D....D| | +------+-------+----------+------+---+ Record 'A' is overwritten by event 'E': head | V +--+---+-------+----------+------+---+ |.E|..A|B.....B|C........C|D....D|E..| +--+---+-------+----------+------+---+ Now tooling decides to read from this ring buffer. However, none of these two natural positions, 'head' and the start of this ring buffer, are pointing to the head of a record. Even the full ring buffer can be accessed by tooling, it is unable to find a position to start decoding. The first attempt tries to solve this problem AFAIK can be found from [1]. It makes kernel to maintain 'tail' pointer: updates it when ring buffer is half full. However, this approach introduces overhead to fast path. Test result shows a 1% overhead [2]. In addition, this method utilizes no more tham 50% records. Another attempt can be found from [3], which allows putting the size of an event at the end of each record. This approach allows tooling to find records in a backward manner from 'head' pointer by reading size of a record from its tail. However, because of alignment requirement, it needs 8 bytes to record the size of a record, which is a huge waste. Its performance is also not good, because more data need to be written. This approach also introduces some extra branch instructions to fast path. 'write_backward' is a better solution to this problem. Following figure demonstrates the state of the overwritable ring buffer when 'write_backward' is set before overwriting: head | V +---+------+----------+-------+------+ | |D....D|C........C|B.....B|A....A| +---+------+----------+-------+------+ and after overwriting: head | V +---+------+----------+-------+---+--+ |..E|D....D|C........C|B.....B|A..|E.| +---+------+----------+-------+---+--+ In each situation, 'head' points to the beginning of the newest record. From this record, tooling can iterate over the full ring buffer and fetch records one by one. The only limitation that needs to be considered is back-to-back reading. Due to the non-deterministic of user programs, it is impossible to ensure the ring buffer keeps stable during reading. Consider an extreme situation: tooling is scheduled out after reading record 'D', then a burst of events come, eat up the whole ring buffer (one or multiple rounds). When the tooling process comes back, reading after 'D' is incorrect now. To prevent this problem, we need to find a way to ensure the ring buffer is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is suggested because its overhead is lower than ioctl(PERF_EVENT_IOC_ENABLE). By carefully verifying 'header' pointer, reader can avoid pausing the ring-buffer. For example: /* A union of all possible events */ union perf_event event; p = head = perf_mmap__read_head(); while (true) { /* copy header of next event */ fetch(&event.header, p, sizeof(event.header)); /* read 'head' pointer */ head = perf_mmap__read_head(); /* check overwritten: is the header good? */ if (!verify(sizeof(event.header), p, head)) break; /* copy the whole event */ fetch(&event, p, event.header.size); /* read 'head' pointer again */ head = perf_mmap__read_head(); /* is the whole event good? */ if (!verify(event.header.size, p, head)) break; p += event.header.size; } However, the overhead is high because: a) In-place decoding is not safe. Copying-verifying-decoding is required. b) Fetching 'head' pointer requires additional synchronization. (From Alexei Starovoitov: Even when this trick works, pause is needed for more than stability of reading. When we collect the events into overwrite buffer we're waiting for some other trigger (like all cpu utilization spike or just one cpu running and all others are idle) and when it happens the buffer has valuable info from the past. At this point new events are no longer interesting and buffer should be paused, events read and unpaused until next trigger comes.) This patch utilizes event's default overflow_handler introduced previously. perf_event_output_backward() is created as the default overflow handler for backward ring buffers. To avoid extra overhead to fast path, original perf_event_output() becomes __perf_event_output() and marked '__always_inline'. In theory, there's no extra overhead introduced to fast path. Performance testing: Calling 3000000 times of 'close(-1)', use gettimeofday() to check duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture system calls. In ns. Testing environment: CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz Kernel : v4.5.0 MEAN STDVAR BASE 800214.950 2853.083 PRE1 2253846.700 9997.014 PRE2 2257495.540 8516.293 POST 2250896.100 8933.921 Where 'BASE' is pure performance without capturing. 'PRE1' is test result of pure 'v4.5.0' kernel. 'PRE2' is test result before this patch. 'POST' is test result after this patch. See [4] for the detailed experimental setup. Considering the stdvar, this patch doesn't introduce performance overhead to the fast path. [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.comSigned-off-by: NWang Nan <wangnan0@huawei.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NAlexei Starovoitov <ast@kernel.org> Cc: <acme@kernel.org> Cc: <pi3orama@163.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/r/1459865478-53413-1-git-send-email-wangnan0@huawei.com [ Fixed the changelog some more. ] Signed-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 31 3月, 2016 1 次提交
-
-
由 Wang Nan 提交于
Add new ioctl() to pause/resume ring-buffer output. In some situations we want to read from the ring-buffer only when we ensure nothing can write to the ring-buffer during reading. Without this patch we have to turn off all events attached to this ring-buffer to achieve this. This patch is a prerequisite to enable overwrite support for the perf ring-buffer support. Following commits will introduce new methods support reading from overwrite ring buffer. Before reading, caller must ensure the ring buffer is frozen, or the reading is unreliable. Signed-off-by: NWang Nan <wangnan0@huawei.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <pi3orama@163.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/r/1459147292-239310-2-git-send-email-wangnan0@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-