1. 03 Jun 2021 (1 commit)
  2. 09 Apr 2021 (1 commit)
  3. 10 Nov 2020 (2 commits)
  4. 10 Sep 2020 (1 commit)
  5. 18 Aug 2020 (2 commits)
  6. 22 Jul 2020 (1 commit)
  7. 08 Jul 2020 (2 commits)
    • perf/x86: Remove task_ctx_size · 5a09928d
      Committed by Kan Liang
      A new kmem_cache-based method has replaced kzalloc() for allocating the
      PMU specific data, so task_ctx_size is not required anymore.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-19-git-send-email-kan.liang@linux.intel.com
    • perf/core: Use kmem_cache to allocate the PMU specific data · 217c2a63
      Committed by Kan Liang
      Currently, the PMU specific data task_ctx_data is allocated by the
      function kzalloc() in the perf generic code. When there is no specific
      alignment requirement for the task_ctx_data, the method works well for
      now. However, there will be a problem once a specific alignment
      requirement is introduced in future features, e.g., the Architecture LBR
      XSAVE feature requires 64-byte alignment. If the specific alignment
      requirement is not fulfilled, the XSAVE family of instructions will fail
      to save/restore the xstate to/from the task_ctx_data.
      
      The function kzalloc() itself only guarantees natural alignment. A
      new method to allocate the task_ctx_data has to be introduced, which
      has to meet the following requirements:
      - it must be a generic method that can be used by different
        architectures, because the allocation of the task_ctx_data is
        implemented in the perf generic code;
      - it must guarantee the required alignment (the alignment requirement
        does not change after boot);
      - it must be able to allocate/free a buffer (smaller than a page size)
        dynamically;
      - it should not cause extra CPU overhead or space overhead.
      
      Several options were considered:
      - One option is to allocate a larger buffer for task_ctx_data. E.g.,
          ptr = kmalloc(size + alignment, GFP_KERNEL);
          ptr = PTR_ALIGN(ptr, alignment);
        This option causes space overhead.
      - Another option is to allocate the task_ctx_data in the PMU specific
        code. To do so, several function pointers have to be added. As a
        result, both the generic structure and the PMU specific structure
        will become bigger. Besides, extra function calls are added when
        allocating/freeing the buffer. This option will increase both the
        space overhead and CPU overhead.
      - The third option is to use a kmem_cache to allocate a buffer for the
        task_ctx_data. The kmem_cache can be created with a specific alignment
        requirement by the PMU at boot time. A new pointer for kmem_cache has
        to be added in the generic struct pmu, which would be used to
        dynamically allocate a buffer for the task_ctx_data at run time.
        Although the new pointer is added to the struct pmu, the existing
        variable task_ctx_size is not required anymore. The size of the
        generic structure is kept the same.
      
      The third option which meets all the aforementioned requirements is used
      to replace kzalloc() for the PMU specific data allocation. A later patch
      will remove the kzalloc() method and the related variables.
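      
      A rough sketch of the chosen approach follows; the cache name, the
      example context structure and the 64-byte alignment are illustrative
      assumptions (mirroring the Arch LBR XSAVE example above), not the exact
      upstream code:
      
        #include <linux/errno.h>
        #include <linux/init.h>
        #include <linux/slab.h>
        #include <linux/types.h>
      
        struct example_task_ctx {
                u64 lbr_state[16];              /* placeholder payload */
        };
      
        static struct kmem_cache *task_ctx_cache;
      
        /* PMU side, boot time: create a cache with the required alignment. */
        static int __init example_pmu_init(void)
        {
                task_ctx_cache = kmem_cache_create("example_pmu_task_ctx",
                                                   sizeof(struct example_task_ctx),
                                                   64 /* alignment */, 0, NULL);
                return task_ctx_cache ? 0 : -ENOMEM;
        }
      
        /* Generic code side, run time: allocate/free from the PMU's cache. */
        static void *alloc_task_ctx_data(struct kmem_cache *cache)
        {
                return kmem_cache_zalloc(cache, GFP_KERNEL);    /* zeroed, aligned */
        }
      
        static void free_task_ctx_data(struct kmem_cache *cache, void *ctx_data)
        {
                kmem_cache_free(cache, ctx_data);
        }
      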
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-17-git-send-email-kan.liang@linux.intel.com
  8. 01 Jul 2020 (1 commit)
  9. 15 Jun 2020 (1 commit)
    • perf: Add perf text poke event · e17d43b9
      Committed by Adrian Hunter
      Record (single instruction) changes to the kernel text (i.e.
      self-modifying code) in order to support tracers like Intel PT and
      ARM CoreSight.
      
      A copy of the running kernel code is needed as a reference point (e.g.
      from /proc/kcore). The text poke event records the old bytes and the
      new bytes so that the event can be processed forwards or backwards.
      
      The basic problem is recording the modified instruction in an
      unambiguous manner given SMP instruction cache (in)coherence. That is,
      when modifying an instruction concurrently, no solution based on one or
      more timestamps is sufficient:
      
      	CPU0				CPU1
       0
       1	write insn A
       2					execute insn A
       3	sync-I$
       4
      
      Due to I$, CPU1 might execute either the old or new A. No matter where
      we record tracepoints on CPU0, one simply cannot tell what CPU1 will
      have observed, except that at 0 it must be the old one and at 4 it
      must be the new one.
      
      To solve this, take inspiration from x86 text poking, which has to
      solve this exact problem due to variable length instruction encoding
      and I-fetch windows.
      
       1) overwrite the instruction with a breakpoint and sync I$
      
      This guarantees that code flow will never hit the target
      instruction anymore, on any CPU (or rather, it will cause an
      exception).
      
       2) issue the TEXT_POKE event
      
       3) overwrite the breakpoint with the new instruction and sync I$
      
      Now we know that any execution after the TEXT_POKE event will either
      observe the breakpoint (and hit the exception) or the new instruction.
      
      So by guarding the TEXT_POKE event with an exception on either side,
      we can now tell, without doubt, which instruction another CPU will
      have observed.
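      
      A minimal sketch of that guarded sequence follows; write_insn() and
      sync_icache_all_cpus() are placeholder helpers (the real x86 code uses
      its text poking machinery), while perf_event_text_poke() is the event
      emission this patch introduces:
      
        #include <linux/perf_event.h>
        #include <linux/types.h>
      
        /* Sketch only: the two helpers below are placeholders, not kernel APIs. */
        static void poke_one_insn(void *addr, const void *old, const void *new, size_t len)
        {
                const u8 brk = 0xcc;                    /* x86 INT3 breakpoint byte */
      
                write_insn(addr, &brk, 1);              /* 1) breakpoint the target */
                sync_icache_all_cpus();                 /*    no CPU can run the old insn now */
      
                perf_event_text_poke(addr, old, len, new, len); /* 2) record old + new bytes */
      
                write_insn(addr, new, len);             /* 3) install the new instruction */
                sync_icache_all_cpus();
        }
      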
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200512121922.8997-2-adrian.hunter@intel.com
  10. 20 May 2020 (1 commit)
    • perf/core: Replace zero-length array with flexible-array · c50c75e9
      Committed by Gustavo A. R. Silva
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these ones is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent some kinds of undefined-behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
      
      sizeof(flexible-array-member) triggers a warning because flexible array
      members have incomplete type[1]. There are some instances of code in
      which the sizeof operator is being incorrectly/erroneously applied to
      zero-length arrays and the result is zero. Such instances may be hiding
      some bugs. So, this work (flexible-array member conversions) will also
      help to get completely rid of those sorts of issues.
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
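      
      As an illustration of the conversion (not code from this particular
      patch), a zero-length array becomes a flexible array member and
      allocations are typically sized with the struct_size() helper:
      
        #include <linux/overflow.h>     /* struct_size() */
        #include <linux/slab.h>
      
        struct boo { int x; };
      
        struct foo {
                int nr;
                struct boo array[];     /* flexible array member, must be last */
        };
      
        static struct foo *alloc_foo(int nr)
        {
                /* struct_size() computes sizeof(struct foo) + nr * sizeof(struct boo)
                 * with overflow checking. */
                struct foo *f = kzalloc(struct_size(f, array, nr), GFP_KERNEL);
      
                if (f)
                        f->nr = nr;
                return f;
        }
      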
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200511201227.GA14041@embeddedor
  11. 27 Apr 2020 (1 commit)
  12. 16 Apr 2020 (1 commit)
    • perf/core: Open access to the core for CAP_PERFMON privileged process · 18aa1856
      Committed by Alexey Budankov
      Open access to monitoring of kernel code, CPUs, tracepoints and
      namespaces data for a CAP_PERFMON privileged process. Providing the
      access under the CAP_PERFMON capability alone, without the rest of the
      CAP_SYS_ADMIN credentials, reduces the chance of credential misuse and
      makes the operation more secure.
      
      CAP_PERFMON implements the principle of least privilege for performance
      monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39,
      principle of least privilege: a security design principle that states
      that a process or program be granted only those privileges (e.g.,
      capabilities) necessary to accomplish its legitimate function, and only
      for the time that such privileges are actually required).
      
      For backward compatibility reasons, access to the perf_events subsystem
      remains open for CAP_SYS_ADMIN privileged processes, but using
      CAP_SYS_ADMIN for secure perf_events monitoring is discouraged in favor
      of the CAP_PERFMON capability.
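      
      A sketch of the backward-compatible privilege check described here; the
      helper name below is illustrative, though the series adds a kernel
      helper of essentially this shape:
      
        #include <linux/capability.h>
      
        /* CAP_PERFMON alone is sufficient; CAP_SYS_ADMIN still works but is
         * the discouraged legacy path. */
        static inline bool perfmon_capable_sketch(void)
        {
                return capable(CAP_PERFMON) || capable(CAP_SYS_ADMIN);
        }
      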
      Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Igor Lubashev <ilubashe@akamai.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: linux-man@vger.kernel.org
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: intel-gfx@lists.freedesktop.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-security-module@vger.kernel.org
      Cc: selinux@vger.kernel.org
      Link: http://lore.kernel.org/lkml/471acaef-bb8a-5ce2-923f-90606b78eef9@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  13. 27 Mar 2020 (1 commit)
  14. 06 Mar 2020 (1 commit)
  15. 11 Feb 2020 (1 commit)
    • perf/core: Add new branch sample type for HW index of raw branch records · bbfd5e4f
      Committed by Kan Liang
      The low level index is the index in the underlying hardware buffer of
      the most recently captured taken branch, which is always saved in
      branch_entries[0]. It is very useful for reconstructing the call stack.
      For example, in Intel LBR call stack mode, the depth of the reconstructed
      LBR call stack is limited to the number of LBR registers. With the low
      level index information, the perf tool may stitch the stacks of two
      samples, so the reconstructed LBR call stack can exceed the HW limitation.
      
      Add a new branch sample type to retrieve the low level index of raw
      branch records. The low level index is between -1 (unknown) and the
      maximum depth, which can be retrieved from /sys/devices/cpu/caps/branches.
      
      The low level index information is dumped into the PERF_SAMPLE_BRANCH_STACK
      output only when the new branch sample type is set. The perf tool should
      check attr.branch_sample_type and apply the corresponding format for
      PERF_SAMPLE_BRANCH_STACK samples. Otherwise, some use cases may break.
      For example, users may parse a perf.data file, which includes the new
      branch sample type, with an old perf tool (without the check) and would
      probably get incorrect information without any warning.
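      
      A user-space sketch of opting in to the new branch sample type;
      PERF_SAMPLE_BRANCH_HW_INDEX is the UAPI name this feature uses, and the
      surrounding attribute setup is illustrative:
      
        #include <linux/perf_event.h>
        #include <string.h>
      
        static void request_branch_hw_index(struct perf_event_attr *attr)
        {
                memset(attr, 0, sizeof(*attr));
                attr->size = sizeof(*attr);
                attr->type = PERF_TYPE_HARDWARE;
                attr->config = PERF_COUNT_HW_CPU_CYCLES;
                attr->sample_type = PERF_SAMPLE_BRANCH_STACK;
                /* also request the low level (HW) index of the raw branch records */
                attr->branch_sample_type = PERF_SAMPLE_BRANCH_ANY |
                                           PERF_SAMPLE_BRANCH_HW_INDEX;
        }
      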
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20200127165355.27495-2-kan.liang@linux.intel.com
  16. 29 Jan 2020 (1 commit)
  17. 14 Jan 2020 (1 commit)
  18. 15 Nov 2019 (2 commits)
  19. 13 Nov 2019 (1 commit)
    • perf/aux: Allow using AUX data in perf samples · a4faf00d
      Committed by Alexander Shishkin
      AUX data can be used to annotate perf events such as performance counters
      or tracepoints/breakpoints by including it in sample records when the
      PERF_SAMPLE_AUX flag is set. Such samples would be instrumental in debugging
      and profiling by providing, for example, a history of the instruction flow
      leading up to the event's overflow.
      
      The implementation makes use of grouping an AUX event with all the events
      that wish to take samples of the AUX data, such that the former is the
      group leader. The samplees should also specify the desired size of the AUX
      sample via attr.aux_sample_size.
      
      AUX capable PMUs need to explicitly add support for sampling, because it
      relies on a new callback to take a snapshot of the buffer without touching
      the event states.
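      
      A user-space sketch of the grouping described above; the AUX PMU type
      value is normally read from sysfs, and the period and snapshot sizes
      here are arbitrary examples:
      
        #include <linux/perf_event.h>
        #include <string.h>
      
        /* aux_type: e.g. the value of /sys/bus/event_source/devices/intel_pt/type */
        static void setup_aux_sampling(struct perf_event_attr *aux_leader,
                                       struct perf_event_attr *sampler,
                                       __u32 aux_type)
        {
                memset(aux_leader, 0, sizeof(*aux_leader));
                aux_leader->size = sizeof(*aux_leader);
                aux_leader->type = aux_type;            /* AUX-capable PMU, group leader */
      
                memset(sampler, 0, sizeof(*sampler));
                sampler->size = sizeof(*sampler);
                sampler->type = PERF_TYPE_HARDWARE;
                sampler->config = PERF_COUNT_HW_CPU_CYCLES;
                sampler->sample_period = 100000;
                sampler->sample_type = PERF_SAMPLE_AUX; /* include an AUX snapshot per sample */
                sampler->aux_sample_size = 4096;        /* desired bytes of AUX data */
        }
      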
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: adrian.hunter@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: https://lkml.kernel.org/r/20191025140835.53665-2-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  20. 28 Oct 2019 (2 commits)
  21. 18 Oct 2019 (1 commit)
    • perf_event: Add support for LSM and SELinux checks · da97e184
      Committed by Joel Fernandes (Google)
      In current mainline, the degree of access to the perf_event_open(2) system
      call depends on the perf_event_paranoid sysctl.  This has a number of
      limitations:
      
      1. The sysctl is only a single value. Many types of accesses are controlled
         based on that single value, making the control very limited and
         coarse grained.
      2. The sysctl is global, so if the sysctl is changed, then all processes
         get access to perf_event_open(2), opening the door to
         security issues.
      
      This patch adds LSM and SELinux access checking which will be used in
      Android to access perf_event_open(2) for the purposes of attaching BPF
      programs to tracepoints, perf profiling and other operations from
      userspace. These operations are intended for production systems.
      
      5 new LSM hooks are added:
      1. perf_event_open: This controls access during the perf_event_open(2)
         syscall itself. The hook is called from all the places where the
         perf_event_paranoid sysctl is checked, to keep it consistent with the
         sysctl. The hook gets passed a 'type' argument which controls CPU,
         kernel and tracepoint accesses (in this context, CPU, kernel and
         tracepoint have the same semantics as the perf_event_paranoid sysctl).
         Additionally, I added an 'open' type which is similar to the
         perf_event_paranoid sysctl == 3 patch carried in Android and several
         other distros but rejected in mainline [1] in 2016.
      
      2. perf_event_alloc: This allocates a new security object for the event
         which stores the current SID within the event. It will be useful when
         the perf event's FD is passed through IPC to another process which may
         try to read the FD. Appropriate security checks will limit access.
      
      3. perf_event_free: Called when the event is closed.
      
      4. perf_event_read: Called from the read(2) and mmap(2) syscalls for the event.
      
      5. perf_event_write: Called from the ioctl(2) syscalls for the event.
      
      [1] https://lwn.net/Articles/696240/
      
      Since Peter had suggested LSM hooks in 2016 [1], I am adding his
      Suggested-by tag below.
      
      To use this patch, we set the perf_event_paranoid sysctl to -1 and then
      apply SELinux checking as appropriate (default deny everything, and then
      add policy rules to give access to the domains that need it). In the future
      we can remove the perf_event_paranoid sysctl altogether.
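      
      A minimal sketch of an LSM wiring up the five hooks described above; the
      prototypes are simplified from the description, and a real module must
      match the prototypes in include/linux/lsm_hooks.h exactly:
      
        #include <linux/lsm_hooks.h>
        #include <linux/perf_event.h>
      
        static int example_perf_event_open(struct perf_event_attr *attr, int type)
        {
                return 0;       /* 0 allows; deny based on the CPU/kernel/tracepoint type */
        }
      
        static int example_perf_event_alloc(struct perf_event *event)
        {
                return 0;       /* record the opener's security context in the event */
        }
      
        static void example_perf_event_free(struct perf_event *event) { }
        static int example_perf_event_read(struct perf_event *event)  { return 0; }
        static int example_perf_event_write(struct perf_event *event) { return 0; }
      
        static struct security_hook_list example_hooks[] __lsm_ro_after_init = {
                LSM_HOOK_INIT(perf_event_open,  example_perf_event_open),
                LSM_HOOK_INIT(perf_event_alloc, example_perf_event_alloc),
                LSM_HOOK_INIT(perf_event_free,  example_perf_event_free),
                LSM_HOOK_INIT(perf_event_read,  example_perf_event_read),
                LSM_HOOK_INIT(perf_event_write, example_perf_event_write),
        };
      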
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Co-developed-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: James Morris <jmorris@namei.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: rostedt@goodmis.org
      Cc: Yonghong Song <yhs@fb.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: jeffv@google.com
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: primiano@google.com
      Cc: Song Liu <songliubraving@fb.com>
      Cc: rsavitski@google.com
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Matthew Garrett <matthewgarrett@google.com>
      Link: https://lkml.kernel.org/r/20191014170308.70668-1-joel@joelfernandes.org
  22. 28 Aug 2019 (1 commit)
  23. 13 Jul 2019 (1 commit)
    • perf/core: Fix exclusive events' grouping · 8a58ddae
      Committed by Alexander Shishkin
      So far, we tried to disallow grouping exclusive events for fear of the
      complications they would cause with moving between contexts. Specifically,
      moving a software group to a hardware context would violate the exclusivity
      rules if both groups contain matching exclusive events.
      
      This attempt was, however, unsuccessful: the check that we have in the
      perf_event_open() syscall is both wrong (looks at wrong PMU) and
      insufficient (group leader may still be exclusive), as can be illustrated
      by running:
      
        $ perf record -e '{intel_pt//,cycles}' uname
        $ perf record -e '{cycles,intel_pt//}' uname
      
      ultimately successfully.
      
      Furthermore, we are completely free to trigger the exclusivity violation
      by:
      
         perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'
      
      even though the helpful perf record will not allow that, the ABI will.
      
      The warning later in the perf_event_open() path will also not trigger, because
      it's also wrong.
      
      Fix all this by validating the original group before moving, getting rid
      of the broken safeguards and placing a useful one in perf_install_in_context().
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: mathieu.poirier@linaro.org
      Cc: will.deacon@arm.com
      Fixes: bed5b25a ("perf: Add a pmu capability for "exclusive" events")
      Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  24. 25 Jun 2019 (2 commits)
    • perf/cgroups: Don't rotate events for cgroups unnecessarily · fd7d5517
      Committed by Ian Rogers
      Currently perf_rotate_context assumes that if the context's nr_events !=
      nr_active, a rotation is necessary for perf event multiplexing. With
      cgroups, nr_events is the total count of events for all cgroups, and
      nr_active does not include events in a cgroup other than the current
      task's. This makes rotation appear necessary for cgroups when it is not.
      
      Add a perf_event_context flag that is set when rotation is necessary.
      Clear the flag during sched_out and set it when a flexible sched_in
      fails due to resources.
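      
      A sketch of the flag protocol this describes; the structure, field and
      function names below are illustrative, not the exact upstream identifiers:
      
        struct example_event_context {
                int rotate_necessary;   /* set only when multiplexing really needs a rotation */
                /* ... */
        };
      
        static void example_ctx_sched_out(struct example_event_context *ctx)
        {
                ctx->rotate_necessary = 0;      /* everything scheduled out: start clean */
        }
      
        static void example_flexible_sched_in(struct example_event_context *ctx, int fits)
        {
                if (!fits)                      /* an event could not get hardware resources */
                        ctx->rotate_necessary = 1;
        }
      
        static int example_should_rotate(struct example_event_context *ctx)
        {
                return ctx->rotate_necessary;   /* instead of nr_events != nr_active */
        }
      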
      Signed-off-by: Ian Rogers <irogers@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/20190601082722.44543-1-irogers@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Disable extended registers for non-supported PMUs · e321d02d
      Committed by Kan Liang
      The perf fuzzer caused a Skylake machine to crash:
      
      [ 9680.085831] Call Trace:
      [ 9680.088301]  <IRQ>
      [ 9680.090363]  perf_output_sample_regs+0x43/0xa0
      [ 9680.094928]  perf_output_sample+0x3aa/0x7a0
      [ 9680.099181]  perf_event_output_forward+0x53/0x80
      [ 9680.103917]  __perf_event_overflow+0x52/0xf0
      [ 9680.108266]  ? perf_trace_run_bpf_submit+0xc0/0xc0
      [ 9680.113108]  perf_swevent_hrtimer+0xe2/0x150
      [ 9680.117475]  ? check_preempt_wakeup+0x181/0x230
      [ 9680.122091]  ? check_preempt_curr+0x62/0x90
      [ 9680.126361]  ? ttwu_do_wakeup+0x19/0x140
      [ 9680.130355]  ? try_to_wake_up+0x54/0x460
      [ 9680.134366]  ? reweight_entity+0x15b/0x1a0
      [ 9680.138559]  ? __queue_work+0x103/0x3f0
      [ 9680.142472]  ? update_dl_rq_load_avg+0x1cd/0x270
      [ 9680.147194]  ? timerqueue_del+0x1e/0x40
      [ 9680.151092]  ? __remove_hrtimer+0x35/0x70
      [ 9680.155191]  __hrtimer_run_queues+0x100/0x280
      [ 9680.159658]  hrtimer_interrupt+0x100/0x220
      [ 9680.163835]  smp_apic_timer_interrupt+0x6a/0x140
      [ 9680.168555]  apic_timer_interrupt+0xf/0x20
      [ 9680.172756]  </IRQ>
      
      The XMM registers can only be collected by PEBS hardware events on
      platforms with PEBS baseline support, e.g. Icelake, not by software/probe
      events.
      
      Add the capability flag PERF_PMU_CAP_EXTENDED_REGS to indicate a PMU
      that supports extended registers. For x86, the extended registers are
      the XMM registers.
      
      Add has_extended_regs() to check whether extended registers are requested.
      
      The generic code defines the mask of extended registers as 0 if the arch
      headers haven't overridden it.
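      
      A sketch of that check; PERF_REG_EXTENDED_MASK falls back to 0 when an
      arch header does not override it, so events on such architectures can
      never request extended registers (the helper name here is illustrative):
      
        #include <linux/perf_event.h>
        #include <linux/perf_regs.h>
      
        static bool has_extended_regs_sketch(struct perf_event *event)
        {
                return (event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK) ||
                       (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK);
        }
      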
      Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 878068ea ("perf/x86: Support outputting XMM registers")
      Link: https://lkml.kernel.org/r/1559081314-9714-1-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  25. 03 Jun 2019 (1 commit)
    • perf/core: Add attr_groups_update into struct pmu · f3a3a825
      Committed by Jiri Olsa
      Add an attr_update attribute group into struct pmu, to allow
      having multiple attribute groups for the same group name.
      
      This will allow us to update the "events" or "format"
      directories with attributes that depend on various
      HW conditions.
      
      For example, a group_format_extra group that updates the
      "format" directory only if the PMU version is 2 or higher:
      
        static umode_t
        exra_is_visible(struct kobject *kobj, struct attribute *attr, int i)
        {
               return x86_pmu.version >= 2 ? attr->mode : 0;
        }
      
        static struct attribute_group group_format_extra = {
               .name       = "format",
               .is_visible = exra_is_visible,
        };
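      
      A follow-up usage sketch (assumed wiring, not part of the quoted patch):
      the extra groups are gathered in a NULL-terminated array and hung off
      the new attr_update pointer in struct pmu:
      
        static const struct attribute_group *extra_attr_update[] = {
               &group_format_extra,
               NULL,
        };
      
        static struct pmu sketch_pmu = {
               /* ... other fields and callbacks ... */
               .attr_update = extra_attr_update,
        };
      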
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190512155518.21468-3-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  26. 03 May 2019 (1 commit)
  27. 01 May 2019 (1 commit)
  28. 29 Apr 2019 (1 commit)
    • perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER · d15d3568
      Committed by Kairui Song
      Currently perf callchains don't work well with the ORC unwinder
      when sampling from a trace point. We get a useless in-kernel callchain
      like this:
      
      perf  6429 [000]    22.498450:             kmem:mm_page_alloc: page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
          ffffffffbe23e32e __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
      	7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
      	5651468729c1 [unknown] (/usr/bin/perf)
      	5651467ee82a main+0x69a (/usr/bin/perf)
      	7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
          5541f689495641d7 [unknown] ([unknown])
      
      The root cause is that, for trace point events, perf does not get a
      real snapshot of the hardware registers. Instead, it tries to fetch the
      caller's registers and compose a fake register snapshot that is supposed
      to contain enough information to start unwinding. However, without
      CONFIG_FRAME_POINTER, the caller's BP cannot be obtained as the frame
      pointer, so the current frame pointer is returned instead. The result is
      an invalid register combination that confuses the unwinder and ends the
      stacktrace early.
      
      So in this case, don't try to dump BP, and let the unwinder start
      directly when the register set is not a real snapshot. Use SP as the
      skip mark: the unwinder will skip all frames until it meets the frame
      of the trace point caller.
      
      Tested with both the frame pointer unwinder and the ORC unwinder; this
      makes the perf callchain report the full kernel-space stacktrace again,
      like this:
      
      perf  6503 [000]  1567.570191:             kmem:mm_page_alloc: page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
          ffffffffb523e2ae __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
          ffffffffb52383bd __get_free_pages+0xd (/lib/modules/5.1.0-rc3+/build/vmlinux)
          ffffffffb52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
          ffffffffb521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
          ffffffffb52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux)
          ffffffffb52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux)
          ffffffffb500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux)
          ffffffffb5a0008c entry_SYSCALL_64_after_hwframe+0x44 (/lib/modules/5.1.0-rc3+/build/vmlinux)
      	7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
      	55a22960d9c1 [unknown] (/usr/bin/perf)
      	55a22958982a main+0x69a (/usr/bin/perf)
      	7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
          5541f689495641d7 [unknown] ([unknown])
      Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Kairui Song <kasong@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190422162652.15483-1-kasong@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  29. 16 Apr 2019 (1 commit)
  30. 03 Apr 2019 (1 commit)
  31. 23 Feb 2019 (1 commit)
  32. 11 Feb 2019 (1 commit)
    • perf/x86: Add check_period PMU callback · 81ec3f3c
      Committed by Jiri Olsa
      Vince (and later on Ravi) reported crashes in the BTS code during
      fuzzing with the following backtrace:
      
        general protection fault: 0000 [#1] SMP PTI
        ...
        RIP: 0010:perf_prepare_sample+0x8f/0x510
        ...
        Call Trace:
         <IRQ>
         ? intel_pmu_drain_bts_buffer+0x194/0x230
         intel_pmu_drain_bts_buffer+0x160/0x230
         ? tick_nohz_irq_exit+0x31/0x40
         ? smp_call_function_single_interrupt+0x48/0xe0
         ? call_function_single_interrupt+0xf/0x20
         ? call_function_single_interrupt+0xa/0x20
         ? x86_schedule_events+0x1a0/0x2f0
         ? x86_pmu_commit_txn+0xb4/0x100
         ? find_busiest_group+0x47/0x5d0
         ? perf_event_set_state.part.42+0x12/0x50
         ? perf_mux_hrtimer_restart+0x40/0xb0
         intel_pmu_disable_event+0xae/0x100
         ? intel_pmu_disable_event+0xae/0x100
         x86_pmu_stop+0x7a/0xb0
         x86_pmu_del+0x57/0x120
         event_sched_out.isra.101+0x83/0x180
         group_sched_out.part.103+0x57/0xe0
         ctx_sched_out+0x188/0x240
         ctx_resched+0xa8/0xd0
         __perf_event_enable+0x193/0x1e0
         event_function+0x8e/0xc0
         remote_function+0x41/0x50
         flush_smp_call_function_queue+0x68/0x100
         generic_smp_call_function_single_interrupt+0x13/0x30
         smp_call_function_single_interrupt+0x3e/0xe0
         call_function_single_interrupt+0xf/0x20
         </IRQ>
      
      The reason is that while the event init code does several checks
      for BTS events and prevents several unwanted config bits for
      BTS events (like precise_ip), PERF_EVENT_IOC_PERIOD allows
      a BTS event to be created without those checks being done.
      
      The following sequence will cause the crash:
      
      If we create an 'almost' BTS event with precise_ip and callchains,
      and then turn it into a BTS event via PERF_EVENT_IOC_PERIOD, it will
      crash in perf_prepare_sample(), because precise_ip events are expected
      to come in with callchain data initialized, but that's not the case
      for the intel_pmu_drain_bts_buffer() caller.
      
      Add a check_period callback that is called before the period
      is changed via PERF_EVENT_IOC_PERIOD. It denies the change
      if the event would become a BTS event. Also add the limit_period
      check.
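      
      A sketch of a PMU-side implementation of that callback;
      would_become_bts_event() stands in for the BTS and limit_period checks
      and is not a real kernel function:
      
        #include <linux/perf_event.h>
      
        static int sketch_check_period(struct perf_event *event, u64 new_period)
        {
                if (would_become_bts_event(event, new_period))
                        return -EINVAL;         /* deny the PERF_EVENT_IOC_PERIOD change */
                return 0;
        }
      
        static struct pmu sketch_x86_pmu = {
                /* ... */
                .check_period = sketch_check_period,
        };
      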
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20190204123532.GA4794@krava
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  33. 08 Feb 2019 (1 commit)
  34. 06 Feb 2019 (1 commit)