1. 27 9月, 2022 1 次提交
  2. 14 10月, 2021 9 次提交
  3. 26 4月, 2021 1 次提交
    • K
      perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER · f20c4859
      Kairui Song 提交于
      mainline inclusion
      from mainline-5.2-rc1
      commit d15d3568
      category: bugfix
      bugzilla: 35738
      CVE: NA
      
      -------------------------------------------------
      
      Currently perf callchain doesn't work well with ORC unwinder
      when sampling from trace point. We'll get useless in kernel callchain
      like this:
      
      perf  6429 [000]    22.498450:             kmem:mm_page_alloc: page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
          ffffffffbe23e32e __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
      	7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
      	5651468729c1 [unknown] (/usr/bin/perf)
      	5651467ee82a main+0x69a (/usr/bin/perf)
      	7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
          5541f689495641d7 [unknown] ([unknown])
      
      The root cause is that, for trace point events, it doesn't provide a
      real snapshot of the hardware registers. Instead perf tries to get
      required caller's registers and compose a fake register snapshot
      which suppose to contain enough information for start a unwinding.
      However without CONFIG_FRAME_POINTER, if failed to get caller's BP as the
      frame pointer, so current frame pointer is returned instead. We get
      a invalid register combination which confuse the unwinder, and end the
      stacktrace early.
      
      So in such case just don't try dump BP, and let the unwinder start
      directly when the register is not a real snapshot. Use SP
      as the skip mark, unwinder will skip all the frames until it meet
      the frame of the trace point caller.
      
      Tested with frame pointer unwinder and ORC unwinder, this makes perf
      callchain get the full kernel space stacktrace again like this:
      
      perf  6503 [000]  1567.570191:             kmem:mm_page_alloc: page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
          ffffffffb523e2ae __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
          ffffffffb52383bd __get_free_pages+0xd (/lib/modules/5.1.0-rc3+/build/vmlinux)
          ffffffffb52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
          ffffffffb521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
          ffffffffb52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux)
          ffffffffb52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux)
          ffffffffb500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux)
          ffffffffb5a0008c entry_SYSCALL_64_after_hwframe+0x44 (/lib/modules/5.1.0-rc3+/build/vmlinux)
      	7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
      	55a22960d9c1 [unknown] (/usr/bin/perf)
      	55a22958982a main+0x69a (/usr/bin/perf)
      	7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
          5541f689495641d7 [unknown] ([unknown])
      Co-developed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: NKairui Song <kasong@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190422162652.15483-1-kasong@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NWei Li <liwei391@huawei.com>
      Reviewed-by: NJian Cheng <cj.chengjian@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      f20c4859
  4. 27 12月, 2019 4 次提交
    • A
      perf/core: Fix exclusive events' grouping · 9e413000
      Alexander Shishkin 提交于
      commit 8a58ddae upstream.
      
      So far, we tried to disallow grouping exclusive events for the fear of
      complications they would cause with moving between contexts. Specifically,
      moving a software group to a hardware context would violate the exclusivity
      rules if both groups contain matching exclusive events.
      
      This attempt was, however, unsuccessful: the check that we have in the
      perf_event_open() syscall is both wrong (looks at wrong PMU) and
      insufficient (group leader may still be exclusive), as can be illustrated
      by running:
      
        $ perf record -e '{intel_pt//,cycles}' uname
        $ perf record -e '{cycles,intel_pt//}' uname
      
      ultimately successfully.
      
      Furthermore, we are completely free to trigger the exclusivity violation
      by:
      
         perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'
      
      even though the helpful perf record will not allow that, the ABI will.
      
      The warning later in the perf_event_open() path will also not trigger, because
      it's also wrong.
      
      Fix all this by validating the original group before moving, getting rid
      of broken safeguards and placing a useful one to perf_install_in_context().
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: mathieu.poirier@linaro.org
      Cc: will.deacon@arm.com
      Fixes: bed5b25a ("perf: Add a pmu capability for "exclusive" events")
      Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      9e413000
    • M
      perf/aux: Make perf_event accessible to setup_aux() · c3dc528d
      Mathieu Poirier 提交于
      [ Upstream commit 84001866 ]
      
      When pmu::setup_aux() is called the coresight PMU needs to know which
      sink to use for the session by looking up the information in the
      event's attr::config2 field.
      
      As such simply replace the cpu information by the complete perf_event
      structure and change all affected customers.
      Signed-off-by: NMathieu Poirier <mathieu.poirier@linaro.org>
      Reviewed-by: NSuzuki Poulouse <suzuki.poulose@arm.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-s390@vger.kernel.org
      Link: http://lkml.kernel.org/r/20190131184714.20388-2-mathieu.poirier@linaro.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      c3dc528d
    • A
      perf, pt, coresight: Fix address filters for vmas with non-zero offset · 7091a192
      Alexander Shishkin 提交于
      mainline inclusion
      from mainline-5.1-rc1
      commit c60f83b8
      category: bugfix
      bugzilla: 11596
      CVE: NA
      
      -------------------------------------------------
      Currently, the address range calculation for file-based filters works as
      long as the vma that maps the matching part of the object file starts
      from offset zero into the file (vm_pgoff==0). Otherwise, the resulting
      filter range would be off by vm_pgoff pages. Another related problem is
      that in case of a partially matching vma, that is, a vma that matches
      part of a filter region, the filter range size wouldn't be adjusted.
      
      Fix the arithmetics around address filter range calculations, taking
      into account vma offset, so that the entire calculation is done before
      the filter configuration is passed to the PMU drivers instead of having
      those drivers do the final bit of arithmetics.
      
      Based on the patch by Adrian Hunter <adrian.hunter.intel.com>.
      Reported-by: NAdrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Tested-by: NMathieu Poirier <mathieu.poirier@linaro.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Fixes: 375637bc ("perf/core: Introduce address range filtering")
      Link: http://lkml.kernel.org/r/20190215115655.63469-3-alexander.shishkin@linux.intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      (cherry picked from commit c60f83b8)
      Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
      Reviewed-by: NCheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      7091a192
    • J
      perf/x86: Add check_period PMU callback · 745de283
      Jiri Olsa 提交于
      commit 81ec3f3c upstream.
      
      Vince (and later on Ravi) reported crashes in the BTS code during
      fuzzing with the following backtrace:
      
        general protection fault: 0000 [#1] SMP PTI
        ...
        RIP: 0010:perf_prepare_sample+0x8f/0x510
        ...
        Call Trace:
         <IRQ>
         ? intel_pmu_drain_bts_buffer+0x194/0x230
         intel_pmu_drain_bts_buffer+0x160/0x230
         ? tick_nohz_irq_exit+0x31/0x40
         ? smp_call_function_single_interrupt+0x48/0xe0
         ? call_function_single_interrupt+0xf/0x20
         ? call_function_single_interrupt+0xa/0x20
         ? x86_schedule_events+0x1a0/0x2f0
         ? x86_pmu_commit_txn+0xb4/0x100
         ? find_busiest_group+0x47/0x5d0
         ? perf_event_set_state.part.42+0x12/0x50
         ? perf_mux_hrtimer_restart+0x40/0xb0
         intel_pmu_disable_event+0xae/0x100
         ? intel_pmu_disable_event+0xae/0x100
         x86_pmu_stop+0x7a/0xb0
         x86_pmu_del+0x57/0x120
         event_sched_out.isra.101+0x83/0x180
         group_sched_out.part.103+0x57/0xe0
         ctx_sched_out+0x188/0x240
         ctx_resched+0xa8/0xd0
         __perf_event_enable+0x193/0x1e0
         event_function+0x8e/0xc0
         remote_function+0x41/0x50
         flush_smp_call_function_queue+0x68/0x100
         generic_smp_call_function_single_interrupt+0x13/0x30
         smp_call_function_single_interrupt+0x3e/0xe0
         call_function_single_interrupt+0xf/0x20
         </IRQ>
      
      The reason is that while event init code does several checks
      for BTS events and prevents several unwanted config bits for
      BTS event (like precise_ip), the PERF_EVENT_IOC_PERIOD allows
      to create BTS event without those checks being done.
      
      Following sequence will cause the crash:
      
      If we create an 'almost' BTS event with precise_ip and callchains,
      and it into a BTS event it will crash the perf_prepare_sample()
      function because precise_ip events are expected to come
      in with callchain data initialized, but that's not the
      case for intel_pmu_drain_bts_buffer() caller.
      
      Adding a check_period callback to be called before the period
      is changed via PERF_EVENT_IOC_PERIOD. It will deny the change
      if the event would become BTS. Plus adding also the limit_period
      check as well.
      Reported-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20190204123532.GA4794@kravaSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      745de283
  5. 25 7月, 2018 1 次提交
    • P
      perf/x86/intel: Fix unwind errors from PEBS entries (mk-II) · 6cbc304f
      Peter Zijlstra 提交于
      Vince reported the perf_fuzzer giving various unwinder warnings and
      Josh reported:
      
      > Deja vu.  Most of these are related to perf PEBS, similar to the
      > following issue:
      >
      >   b8000586 ("perf/x86/intel: Cure bogus unwind from PEBS entries")
      >
      > This is basically the ORC version of that.  setup_pebs_sample_data() is
      > assembling a franken-pt_regs which ORC isn't happy about.  RIP is
      > inconsistent with some of the other registers (like RSP and RBP).
      
      And where the previous unwinder only needed BP,SP ORC also requires
      IP. But we cannot spoof IP because then the sample will get displaced,
      entirely negating the point of PEBS.
      
      So cure the whole thing differently by doing the unwind early; this
      does however require a means to communicate we did the unwind early.
      We (ab)use an unused sample_type bit for this, which we set on events
      that fill out the data->callchain before the normal
      perf_prepare_sample().
      Debugged-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Reported-by: NVince Weaver <vincent.weaver@maine.edu>
      Tested-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Tested-by: NPrashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6cbc304f
  6. 16 7月, 2018 1 次提交
  7. 25 5月, 2018 3 次提交
    • S
      perf/core: Fix bad use of igrab() · 9511bce9
      Song Liu 提交于
      As Miklos reported and suggested:
      
       "This pattern repeats two times in trace_uprobe.c and in
        kernel/events/core.c as well:
      
            ret = kern_path(filename, LOOKUP_FOLLOW, &path);
            if (ret)
                goto fail_address_parse;
      
            inode = igrab(d_inode(path.dentry));
            path_put(&path);
      
        And it's wrong.  You can only hold a reference to the inode if you
        have an active ref to the superblock as well (which is normally
        through path.mnt) or holding s_umount.
      
        This way unmounting the containing filesystem while the tracepoint is
        active will give you the "VFS: Busy inodes after unmount..." message
        and a crash when the inode is finally put.
      
        Solution: store path instead of inode."
      
      This patch fixes the issue in kernel/event/core.c.
      Reviewed-and-tested-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Reported-by: NMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <kernel-team@fb.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 375637bc ("perf/core: Introduce address range filtering")
      Link: http://lkml.kernel.org/r/20180418062907.3210386-2-songliubraving@fb.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9511bce9
    • S
      perf/core: Fix group scheduling with mixed hw and sw events · a1150c20
      Song Liu 提交于
      When hw and sw events are mixed in the same group, they are all attached
      to the hw perf_event_context. This sometimes requires moving group of
      perf_event to a different context.
      
      We found a bug in how the kernel handles this, for example if we do:
      
         perf stat -e '{faults,ref-cycles,faults}'  -I 1000
      
           1.005591180              1,297      faults
           1.005591180        457,476,576      ref-cycles
           1.005591180    <not supported>      faults
      
      First, sw event "faults" is attached to the sw context, and becomes the
      group leader. Then, hw event "ref-cycles" is attached, so both events
      are moved to the hw context. Last, another sw "faults" tries to attach,
      but it fails because of mismatch between the new target ctx (from sw
      pmu) and the group_leader's ctx (hw context, same as ref-cycles).
      
      The broken condition is:
         group_leader is sw event;
         group_leader is on hw context;
         add a sw event to the group.
      
      Fix this scenario by checking group_leader's context (instead of just
      event type). If group_leader is on hw context, use the ->pmu of this
      context to look up context for the new event.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <kernel-team@fb.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: b04243ef ("perf: Complete software pmu grouping")
      Link: http://lkml.kernel.org/r/20180503194716.162815-1-songliubraving@fb.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a1150c20
    • Y
      perf/core: add perf_get_event() to return perf_event given a struct file · f8d959a5
      Yonghong Song 提交于
      A new extern function, perf_get_event(), is added to return a perf event
      given a struct file. This function will be used in later patches.
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f8d959a5
  8. 29 3月, 2018 1 次提交
  9. 17 3月, 2018 1 次提交
  10. 16 3月, 2018 1 次提交
  11. 12 3月, 2018 3 次提交
    • P
      perf/core: Optimize ctx_sched_out() · 6668128a
      Peter Zijlstra 提交于
      When an event group contains more events than can be scheduled on the
      hardware, iterating the full event group for ctx_sched_out is a waste
      of time.
      
      Keep track of the events that got programmed on the hardware, such
      that we can iterate this smaller list in order to schedule them out.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6668128a
    • P
      perf/core: Remove perf_event::group_entry · 8343aae6
      Peter Zijlstra 提交于
      Now that all the grouping is done with RB trees, we no longer need
      group_entry and can replace the whole thing with sibling_list.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      8343aae6
    • A
      perf/cor: Use RB trees for pinned/flexible groups · 8e1a2031
      Alexey Budankov 提交于
      Change event groups into RB trees sorted by CPU and then by a 64bit
      index, so that multiplexing hrtimer interrupt handler would be able
      skipping to the current CPU's list and ignore groups allocated for the
      other CPUs.
      
      New API for manipulating event groups in the trees is implemented as well
      as adoption on the API in the current implementation.
      
      pinned_group_sched_in() and flexible_group_sched_in() API are
      introduced to consolidate code enabling the whole group from pinned
      and flexible groups appropriately.
      Signed-off-by: NAlexey Budankov <alexey.budankov@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/372f9c8b-0cfe-4240-e44d-83d863d40813@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8e1a2031
  12. 05 12月, 2017 1 次提交
    • H
      bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type · c895f6f7
      Hendrik Brueckner 提交于
      Commit 0515e599 ("bpf: introduce BPF_PROG_TYPE_PERF_EVENT
      program type") introduced the bpf_perf_event_data structure which
      exports the pt_regs structure.  This is OK for multiple architectures
      but fail for s390 and arm64 which do not export pt_regs.  Programs
      using them, for example, the bpf selftest fail to compile on these
      architectures.
      
      For s390, exporting the pt_regs is not an option because s390 wants
      to allow changes to it.  For arm64, there is a user_pt_regs structure
      that covers parts of the pt_regs structure for use by user space.
      
      To solve the broken uapi for s390 and arm64, introduce an abstract
      type for pt_regs and add an asm/bpf_perf_event.h file that concretes
      the type.  An asm-generic header file covers the architectures that
      export pt_regs today.
      
      The arch-specific enablement for s390 and arm64 follows in separate
      commits.
      Reported-by: NThomas Richter <tmricht@linux.vnet.ibm.com>
      Fixes: 0515e599 ("bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type")
      Signed-off-by: NHendrik Brueckner <brueckner@linux.vnet.ibm.com>
      Reviewed-and-tested-by: NThomas Richter <tmricht@linux.vnet.ibm.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      c895f6f7
  13. 27 10月, 2017 3 次提交
    • P
      perf/core: Rewrite event timekeeping · 0d3d73aa
      Peter Zijlstra 提交于
      The current even timekeeping, which computes enabled and running
      times, uses 3 distinct timestamps to reflect the various event states:
      OFF (stopped), INACTIVE (enabled) and ACTIVE (running).
      
      Furthermore, the update rules are such that even INACTIVE events need
      their timestamps updated. This is undesirable because we'd like to not
      touch INACTIVE events if at all possible, this makes event scheduling
      (much) more expensive than needed.
      
      Rewrite the timekeeping to directly use event->state, this greatly
      simplifies the code and results in only having to update things when
      we change state, or an up-to-date value is requested (read).
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      0d3d73aa
    • P
      perf/core: Rename 'enum perf_event_active_state' · 8ca2bd41
      Peter Zijlstra 提交于
      Its a weird name, active is one of the states, it should not be part
      of the name, also, its too long.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      8ca2bd41
    • Y
      perf/bpf: Extend the perf_event_read_local() interface, a.k.a. "bpf: perf... · 7d9285e8
      Yonghong Song 提交于
      perf/bpf: Extend the perf_event_read_local() interface, a.k.a. "bpf: perf event change needed for subsequent bpf helpers"
      
      eBPF programs would like access to the (perf) event enabled and
      running times along with the event value, such that they can deal with
      event multiplexing (among other things).
      
      This patch extends the interface; a future eBPF patch will utilize
      the new functionality.
      
      [ Note, there's a same-content commit with a poor changelog and a meaningless
        title in the networking tree as well - but we need this change for subsequent
        perf work, so apply it here as well, with a proper changelog. Hopefully Git
        will be able to sort out this somewhat messy workflow, if there are no other,
        conflicting changes to these files. ]
      Signed-off-by: NYonghong Song <yhs@fb.com>
      [ Rewrote the changelog. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <ast@fb.com>
      Cc: <daniel@iogearbox.net>
      Cc: <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David S. Miller <davem@davemloft.net>
      Link: http://lkml.kernel.org/r/20171005161923.332790-2-yhs@fb.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      7d9285e8
  14. 17 10月, 2017 1 次提交
  15. 08 10月, 2017 1 次提交
  16. 29 8月, 2017 3 次提交
    • K
      perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR · fc7ce9c7
      Kan Liang 提交于
      For understanding how the workload maps to memory channels and hardware
      behavior, it's very important to collect address maps with physical
      addresses. For example, 3D XPoint access can only be found by filtering
      the physical address.
      
      Add a new sample type for physical address.
      
      perf already has a facility to collect data virtual address. This patch
      introduces a function to convert the virtual address to physical address.
      The function is quite generic and can be extended to any architecture as
      long as a virtual address is provided.
      
       - For kernel direct mapping addresses, virt_to_phys is used to convert
         the virtual addresses to physical address.
      
       - For user virtual addresses, __get_user_pages_fast is used to walk the
         pages tables for user physical address.
      
       - This does not work for vmalloc addresses right now. These are not
         resolved, but code to do that could be added.
      
      The new sample type requires collecting the virtual address. The
      virtual address will not be output unless SAMPLE_ADDR is applied.
      
      For security, the physical address can only be exposed to root or
      privileged user.
      Tested-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: mpe@ellerman.id.au
      Link: http://lkml.kernel.org/r/1503967969-48278-1-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fc7ce9c7
    • A
      perf/core, pt, bts: Get rid of itrace_started · 8d4e6c4c
      Alexander Shishkin 提交于
      I just noticed that hw.itrace_started and hw.config are aliased to the
      same location. Now, the PT driver happens to use both, which works out
      fine by sheer luck:
      
       - STORE(hw.itrace_start) is ordered before STORE(hw.config), in the
          program order, although there are no compiler barriers to ensure that,
      
       - to the perf_log_itrace_start() hw.itrace_start looks set at the same
         time as when it is intended to be set because both stores happen in the
         same path,
      
       - hw.config is never reset to zero in the PT driver.
      
      Now, the use of hw.config by the PT driver makes more sense (it being a
      HW PMU) than messing around with itrace_started, which is an awkward API
      to begin with.
      
      This patch replaces hw.itrace_started with an attach_state bit and an
      API call for the PMU drivers to use to communicate the condition.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20170330153956.25994-1-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8d4e6c4c
    • Z
      perf/ftrace: Fix double traces of perf on ftrace:function · 75e83876
      Zhou Chengming 提交于
      When running perf on the ftrace:function tracepoint, there is a bug
      which can be reproduced by:
      
        perf record -e ftrace:function -a sleep 20 &
        perf record -e ftrace:function ls
        perf script
      
                    ls 10304 [005]   171.853235: ftrace:function:
        perf_output_begin
                    ls 10304 [005]   171.853237: ftrace:function:
        perf_output_begin
                    ls 10304 [005]   171.853239: ftrace:function:
        task_tgid_nr_ns
                    ls 10304 [005]   171.853240: ftrace:function:
        task_tgid_nr_ns
                    ls 10304 [005]   171.853242: ftrace:function:
        __task_pid_nr_ns
                    ls 10304 [005]   171.853244: ftrace:function:
        __task_pid_nr_ns
      
      We can see that all the function traces are doubled.
      
      The problem is caused by the inconsistency of the register
      function perf_ftrace_event_register() with the probe function
      perf_ftrace_function_call(). The former registers one probe
      for every perf_event. And the latter handles all perf_events
      on the current cpu. So when two perf_events on the current cpu,
      the traces of them will be doubled.
      
      So this patch adds an extra parameter "event" for perf_tp_event,
      only send sample data to this event when it's not NULL.
      Signed-off-by: NZhou Chengming <zhouchengming1@huawei.com>
      Reviewed-by: NJiri Olsa <jolsa@kernel.org>
      Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: alexander.shishkin@linux.intel.com
      Cc: huawei.libin@huawei.com
      Link: http://lkml.kernel.org/r/1503668977-12526-1-git-send-email-zhouchengming1@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      75e83876
  17. 10 8月, 2017 1 次提交
  18. 02 8月, 2017 1 次提交
    • V
      x86/perf/cqm: Wipe out perf based cqm · c39a0e2c
      Vikas Shivappa 提交于
      'perf cqm' never worked due to the incompatibility between perf
      infrastructure and cqm hardware support.  The hardware uses RMIDs to
      track the llc occupancy of tasks and these RMIDs are per package. This
      makes monitoring a hierarchy like cgroup along with monitoring of tasks
      separately difficult and several patches sent to lkml to fix them were
      NACKed. Further more, the following issues in the current perf cqm make
      it almost unusable:
      
          1. No support to monitor the same group of tasks for which we do
          allocation using resctrl.
      
          2. It gives random and inaccurate data (mostly 0s) once we run out
          of RMIDs due to issues in Recycling.
      
          3. Recycling results in inaccuracy of data because we cannot
          guarantee that the RMID was stolen from a task when it was not
          pulling data into cache or even when it pulled the least data. Also
          for monitoring llc_occupancy, if we stop using an RMID_x and then
          start using an RMID_y after we reclaim an RMID from an other event,
          we miss accounting all the occupancy that was tagged to RMID_x at a
          later perf_count.
      
          2. Recycling code makes the monitoring code complex including
          scheduling because the event can lose RMID any time. Since MBM
          counters count bandwidth for a period of time by taking snap shot of
          total bytes at two different times, recycling complicates the way we
          count MBM in a hierarchy. Also we need a spin lock while we do the
          processing to account for MBM counter overflow. We also currently
          use a spin lock in scheduling to prevent the RMID from being taken
          away.
      
          4. Lack of support when we run different kind of event like task,
          system-wide and cgroup events together. Data mostly prints 0s. This
          is also because we can have only one RMID tied to a cpu as defined
          by the cqm hardware but a perf can at the same time tie multiple
          events during one sched_in.
      
          5. No support of monitoring a group of tasks. There is partial support
          for cgroup but it does not work once there is a hierarchy of cgroups
          or if we want to monitor a task in a cgroup and the cgroup itself.
      
          6. No support for monitoring tasks for the lifetime without perf
          overhead.
      
          7. It reported the aggregate cache occupancy or memory bandwidth over
          all sockets. But most cloud and VMM based use cases want to know the
          individual per-socket usage.
      Signed-off-by: NVikas Shivappa <vikas.shivappa@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: ravi.v.shankar@intel.com
      Cc: tony.luck@intel.com
      Cc: fenghua.yu@intel.com
      Cc: peterz@infradead.org
      Cc: eranian@google.com
      Cc: vikas.shivappa@intel.com
      Cc: ak@linux.intel.com
      Cc: davidcc@google.com
      Cc: reinette.chatre@intel.com
      Link: http://lkml.kernel.org/r/1501017287-28083-2-git-send-email-vikas.shivappa@linux.intel.com
      c39a0e2c
  19. 05 6月, 2017 1 次提交
  20. 26 5月, 2017 1 次提交
    • T
      perf/tracing/cpuhotplug: Fix locking order · a63fbed7
      Thomas Gleixner 提交于
      perf, tracing, kprobes and jump_labels have a gazillion of ways to create
      dependency lock chains. Some of those involve nested invocations of
      get_online_cpus().
      
      The conversion of the hotplug locking to a percpu rwsem requires to avoid
      such nested calls. sys_perf_event_open() protects most of the syscall logic
      against cpu hotplug. This causes nested calls and lock inversions versus
      ftrace and kprobes in various interesting ways.
      
      It's impossible to move the hotplug locking to the outer end of all call
      chains in the involved facilities, so the hotplug protection in
      sys_perf_event_open() needs to be solved differently.
      
      Introduce 'pmus_mutex' which protects a perf private online cpumask. This
      mutex is taken when the mask is updated in the cpu hotplug callbacks and
      can be taken in sys_perf_event_open() to protect the swhash setup/teardown
      code and when the final judgement about a valid event has to be made.
      
      [ tglx: Produced changelog and fixed the swhash interaction ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Link: http://lkml.kernel.org/r/20170524081548.930941109@linutronix.de
      a63fbed7
  21. 30 3月, 2017 1 次提交