1. 25 Jul, 2018 1 commit
    • perf/x86/intel: Fix unwind errors from PEBS entries (mk-II) · 6cbc304f
      Committed by Peter Zijlstra
      Vince reported the perf_fuzzer giving various unwinder warnings and
      Josh reported:
      
      > Deja vu.  Most of these are related to perf PEBS, similar to the
      > following issue:
      >
      >   b8000586 ("perf/x86/intel: Cure bogus unwind from PEBS entries")
      >
      > This is basically the ORC version of that.  setup_pebs_sample_data() is
      > assembling a franken-pt_regs which ORC isn't happy about.  RIP is
      > inconsistent with some of the other registers (like RSP and RBP).
      
      And where the previous unwinder only needed BP and SP, ORC also requires
      IP. But we cannot spoof IP, because then the sample would be displaced,
      entirely negating the point of PEBS.
      
      So cure the whole thing differently by doing the unwind early; this
      does however require a means to communicate we did the unwind early.
      We (ab)use an unused sample_type bit for this, which we set on events
      that fill out the data->callchain before the normal
      perf_prepare_sample().
      Debugged-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Tested-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Tested-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
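The "unwind early and flag it" idea in the commit above can be sketched in plain C. This is a minimal userspace model, not the kernel code: the flag values, `struct sample_data`, `do_unwind()`, `pebs_setup_sample()` and `prepare_sample()` are all hypothetical names chosen for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical flag values for illustration; not the kernel's definitions. */
#define SAMPLE_CALLCHAIN        (1ULL << 5)
#define SAMPLE_CALLCHAIN_EARLY  (1ULL << 63)  /* marker: callchain already filled */

struct sample_data {
    const char *callchain;  /* stands in for the real callchain buffer */
    int unwind_count;       /* how many times the unwinder ran */
};

/* Stand-in for the unwinder: records that it ran and returns a "chain". */
static const char *do_unwind(struct sample_data *data, const char *where)
{
    data->unwind_count++;
    return where;
}

/* Driver-side path: unwind while the interrupt registers are still consistent. */
static void pebs_setup_sample(uint64_t sample_type, struct sample_data *data)
{
    if (sample_type & SAMPLE_CALLCHAIN_EARLY)
        data->callchain = do_unwind(data, "early");
}

/* Generic path: unwind only if nobody did it earlier. */
static void prepare_sample(uint64_t sample_type, struct sample_data *data)
{
    if ((sample_type & SAMPLE_CALLCHAIN) &&
        !(sample_type & SAMPLE_CALLCHAIN_EARLY))
        data->callchain = do_unwind(data, "late");
}
```

With both bits set, the unwinder runs exactly once, in the early path, so the generic path never sees the mixed register set.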
  2. 27 Jun, 2018 1 commit
  3. 25 May, 2018 4 commits
    • perf/core: Wire up compat PERF_EVENT_IOC_QUERY_BPF, PERF_EVENT_IOC_MODIFY_ATTRIBUTES · 82489c5f
      Committed by Eugene Syromiatnikov
      Since pointer size is different in compat mode, and switching in _perf_ioctl()
      is done using exact ioctl numbers, all new ioctl numbers that take a pointer
      should be added to perf_compat_ioctl() for _IOC_SIZE fixup before passing
      to the perf_ioctl() routine (this wouldn't be needed if the semantics of the
      size argument of the _IO* macros were honored).
      Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/20180521123420.GA24291@asgard.redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
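The size fixup described in the commit above can be illustrated with a small self-contained model. It assumes the asm-generic ioctl number layout used on x86 (8-bit nr, 8-bit type, 14-bit size, 2-bit direction). The macro and function names here (`IOC_*`, `ioc_encode()`, `compat_fixup()`) are hypothetical stand-ins, not the kernel's helpers.

```c
#include <stdint.h>

/* Field layout modelled on the asm-generic ioctl encoding (assumption:
 * nr 8 bits, type 8 bits, size 14 bits, dir 2 bits). */
#define IOC_NR_SHIFT    0
#define IOC_TYPE_SHIFT  8
#define IOC_SIZE_SHIFT  16
#define IOC_DIR_SHIFT   30
#define IOC_SIZE_MASK   (0x3fffu << IOC_SIZE_SHIFT)
#define IOC_WRITE       1u

/* Build an ioctl number from its four fields. */
static uint32_t ioc_encode(uint32_t dir, uint32_t type, uint32_t nr, uint32_t size)
{
    return (dir << IOC_DIR_SHIFT) | (type << IOC_TYPE_SHIFT) |
           (nr << IOC_NR_SHIFT) | (size << IOC_SIZE_SHIFT);
}

/* Extract the size field from an ioctl number. */
static uint32_t ioc_size(uint32_t cmd)
{
    return (cmd & IOC_SIZE_MASK) >> IOC_SIZE_SHIFT;
}

/* If the command carries the compat pointer size, rewrite it to the native
 * size so an exact-match switch on the native number can succeed. */
static uint32_t compat_fixup(uint32_t cmd, uint32_t compat_ptr_size,
                             uint32_t native_ptr_size)
{
    if (ioc_size(cmd) == compat_ptr_size) {
        cmd &= ~IOC_SIZE_MASK;
        cmd |= native_ptr_size << IOC_SIZE_SHIFT;
    }
    return cmd;
}
```

A 32-bit caller encodes a 4-byte pointer size and a 64-bit kernel expects 8, so the two command numbers differ only in the size field. Rewriting that field makes them compare equal.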
    • perf/core: Fix bad use of igrab() · 9511bce9
      Committed by Song Liu
      As Miklos reported and suggested:
      
       "This pattern repeats two times in trace_uprobe.c and in
        kernel/events/core.c as well:
      
            ret = kern_path(filename, LOOKUP_FOLLOW, &path);
            if (ret)
                goto fail_address_parse;
      
            inode = igrab(d_inode(path.dentry));
            path_put(&path);
      
        And it's wrong.  You can only hold a reference to the inode if you
        have an active ref to the superblock as well (which is normally
        through path.mnt) or holding s_umount.
      
        This way unmounting the containing filesystem while the tracepoint is
        active will give you the "VFS: Busy inodes after unmount..." message
        and a crash when the inode is finally put.
      
        Solution: store path instead of inode."
      
      This patch fixes the issue in kernel/events/core.c.
      Reviewed-and-tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Reported-by: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <kernel-team@fb.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 375637bc ("perf/core: Introduce address range filtering")
      Link: http://lkml.kernel.org/r/20180418062907.3210386-2-songliubraving@fb.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix group scheduling with mixed hw and sw events · a1150c20
      Committed by Song Liu
      When hw and sw events are mixed in the same group, they are all attached
      to the hw perf_event_context. This sometimes requires moving a group of
      perf_events to a different context.
      
      We found a bug in how the kernel handles this, for example if we do:
      
         perf stat -e '{faults,ref-cycles,faults}'  -I 1000
      
           1.005591180              1,297      faults
           1.005591180        457,476,576      ref-cycles
           1.005591180    <not supported>      faults
      
      First, sw event "faults" is attached to the sw context, and becomes the
      group leader. Then, hw event "ref-cycles" is attached, so both events
      are moved to the hw context. Last, another sw "faults" tries to attach,
      but it fails because of mismatch between the new target ctx (from sw
      pmu) and the group_leader's ctx (hw context, same as ref-cycles).
      
      The broken condition is:
         group_leader is sw event;
         group_leader is on hw context;
         add a sw event to the group.
      
      Fix this scenario by checking group_leader's context (instead of just
      event type). If group_leader is on hw context, use the ->pmu of this
      context to look up context for the new event.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <kernel-team@fb.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: b04243ef ("perf: Complete software pmu grouping")
      Link: http://lkml.kernel.org/r/20180503194716.162815-1-songliubraving@fb.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
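The broken condition listed in the commit above (software leader, leader on the hardware context, another software event joining) can be modelled in a few lines. This is a simplified sketch, not the kernel logic: `struct event`, `enum ctx_kind` and both `pick_ctx_*()` functions are hypothetical.

```c
#include <stdbool.h>

enum ctx_kind { CTX_SW, CTX_HW };

struct event {
    bool is_software;    /* event type */
    enum ctx_kind ctx;   /* context the event currently lives on */
};

/* Simplified "before" rule: decide from the new event's type only. */
static enum ctx_kind pick_ctx_by_type(const struct event *leader,
                                      const struct event *new_event)
{
    (void)leader;  /* the leader's current context is ignored */
    return new_event->is_software ? CTX_SW : CTX_HW;
}

/* Simplified "after" rule: a software event joining a group whose leader
 * already sits on the hardware context follows the leader. */
static enum ctx_kind pick_ctx_by_leader_ctx(const struct event *leader,
                                            const struct event *new_event)
{
    if (new_event->is_software && leader->ctx == CTX_HW)
        return CTX_HW;
    return new_event->is_software ? CTX_SW : CTX_HW;
}
```

For a software leader that has been moved to the hardware context, the type-only rule picks the software context and mismatches the leader. The leader-context rule picks the hardware context and matches it.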
    • perf/core: add perf_get_event() to return perf_event given a struct file · f8d959a5
      Committed by Yonghong Song
      A new extern function, perf_get_event(), is added to return a perf event
      given a struct file. This function will be used in later patches.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  4. 17 Apr, 2018 2 commits
  5. 12 Apr, 2018 1 commit
  6. 10 Apr, 2018 1 commit
    • perf/core: Fix use-after-free in uprobe_perf_close() · 621b6d2e
      Committed by Prashant Bhole
      A use-after-free bug was caught by KASAN while running usdt related
      code (BCC project. bcc/tests/python/test_usdt2.py):
      
      	==================================================================
      	BUG: KASAN: use-after-free in uprobe_perf_close+0x222/0x3b0
      	Read of size 4 at addr ffff880384f9b4a4 by task test_usdt2.py/870
      
      	CPU: 4 PID: 870 Comm: test_usdt2.py Tainted: G        W         4.16.0-next-20180409 #215
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      	Call Trace:
      	 dump_stack+0xc7/0x15b
      	 ? show_regs_print_info+0x5/0x5
      	 ? printk+0x9c/0xc3
      	 ? kmsg_dump_rewind_nolock+0x6e/0x6e
      	 ? uprobe_perf_close+0x222/0x3b0
      	 print_address_description+0x83/0x3a0
      	 ? uprobe_perf_close+0x222/0x3b0
      	 kasan_report+0x1dd/0x460
      	 ? uprobe_perf_close+0x222/0x3b0
      	 uprobe_perf_close+0x222/0x3b0
      	 ? probes_open+0x180/0x180
      	 ? free_filters_list+0x290/0x290
      	 trace_uprobe_register+0x1bb/0x500
      	 ? perf_event_attach_bpf_prog+0x310/0x310
      	 ? probe_event_disable+0x4e0/0x4e0
      	 perf_uprobe_destroy+0x63/0xd0
      	 _free_event+0x2bc/0xbd0
      	 ? lockdep_rcu_suspicious+0x100/0x100
      	 ? ring_buffer_attach+0x550/0x550
      	 ? kvm_sched_clock_read+0x1a/0x30
      	 ? perf_event_release_kernel+0x3e4/0xc00
      	 ? __mutex_unlock_slowpath+0x12e/0x540
      	 ? wait_for_completion+0x430/0x430
      	 ? lock_downgrade+0x3c0/0x3c0
      	 ? lock_release+0x980/0x980
      	 ? do_raw_spin_trylock+0x118/0x150
      	 ? do_raw_spin_unlock+0x121/0x210
      	 ? do_raw_spin_trylock+0x150/0x150
      	 perf_event_release_kernel+0x5d4/0xc00
      	 ? put_event+0x30/0x30
      	 ? fsnotify+0xd2d/0xea0
      	 ? sched_clock_cpu+0x18/0x1a0
      	 ? __fsnotify_update_child_dentry_flags.part.0+0x1b0/0x1b0
      	 ? pvclock_clocksource_read+0x152/0x2b0
      	 ? pvclock_read_flags+0x80/0x80
      	 ? kvm_sched_clock_read+0x1a/0x30
      	 ? sched_clock_cpu+0x18/0x1a0
      	 ? pvclock_clocksource_read+0x152/0x2b0
      	 ? locks_remove_file+0xec/0x470
      	 ? pvclock_read_flags+0x80/0x80
      	 ? fcntl_setlk+0x880/0x880
      	 ? ima_file_free+0x8d/0x390
      	 ? lockdep_rcu_suspicious+0x100/0x100
      	 ? ima_file_check+0x110/0x110
      	 ? fsnotify+0xea0/0xea0
      	 ? kvm_sched_clock_read+0x1a/0x30
      	 ? rcu_note_context_switch+0x600/0x600
      	 perf_release+0x21/0x40
      	 __fput+0x264/0x620
      	 ? fput+0xf0/0xf0
      	 ? do_raw_spin_unlock+0x121/0x210
      	 ? do_raw_spin_trylock+0x150/0x150
      	 ? SyS_fchdir+0x100/0x100
      	 ? fsnotify+0xea0/0xea0
      	 task_work_run+0x14b/0x1e0
      	 ? task_work_cancel+0x1c0/0x1c0
      	 ? copy_fd_bitmaps+0x150/0x150
      	 ? vfs_read+0xe5/0x260
      	 exit_to_usermode_loop+0x17b/0x1b0
      	 ? trace_event_raw_event_sys_exit+0x1a0/0x1a0
      	 do_syscall_64+0x3f6/0x490
      	 ? syscall_return_slowpath+0x2c0/0x2c0
      	 ? lockdep_sys_exit+0x1f/0xaa
      	 ? syscall_return_slowpath+0x1a3/0x2c0
      	 ? lockdep_sys_exit+0x1f/0xaa
      	 ? prepare_exit_to_usermode+0x11c/0x1e0
      	 ? enter_from_user_mode+0x30/0x30
      	random: crng init done
      	 ? __put_user_4+0x1c/0x30
      	 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      	RIP: 0033:0x7f41d95f9340
      	RSP: 002b:00007fffe71e4268 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
      	RAX: 0000000000000000 RBX: 000000000000000d RCX: 00007f41d95f9340
      	RDX: 0000000000000000 RSI: 0000000000002401 RDI: 000000000000000d
      	RBP: 0000000000000000 R08: 00007f41ca8ff700 R09: 00007f41d996dd1f
      	R10: 00007fffe71e41e0 R11: 0000000000000246 R12: 00007fffe71e4330
      	R13: 0000000000000000 R14: fffffffffffffffc R15: 00007fffe71e4290
      
      	Allocated by task 870:
      	 kasan_kmalloc+0xa0/0xd0
      	 kmem_cache_alloc_node+0x11a/0x430
      	 copy_process.part.19+0x11a0/0x41c0
      	 _do_fork+0x1be/0xa20
      	 do_syscall_64+0x198/0x490
      	 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      	Freed by task 0:
      	 __kasan_slab_free+0x12e/0x180
      	 kmem_cache_free+0x102/0x4d0
      	 free_task+0xfe/0x160
      	 __put_task_struct+0x189/0x290
      	 delayed_put_task_struct+0x119/0x250
      	 rcu_process_callbacks+0xa6c/0x1b60
      	 __do_softirq+0x238/0x7ae
      
      	The buggy address belongs to the object at ffff880384f9b480
      	 which belongs to the cache task_struct of size 12928
      
      It occurs because the task_struct is freed before the perf_event that refers
      to the task, and the task flags are checked during teardown of the event.
      perf_event_alloc() assigns the task_struct to hw.target of the perf_event,
      but there is no reference counting for it.
      
      As a fix, we call get_task_struct() in perf_event_alloc() at the
      above-mentioned assignment and put_task_struct() in _free_event().
      Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 63b6da39 ("perf: Fix perf_event_exit_task() race")
      Link: http://lkml.kernel.org/r/20180409100346.6416-1-bhole_prashant_q7@lab.ntt.co.jp
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
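The get/put pairing that the fix above relies on can be shown with a toy reference count. This is a userspace sketch under stated assumptions: `struct task`, `struct event`, `task_get()`, `task_put()`, `event_init()` and `event_free()` are hypothetical stand-ins for the kernel objects and for get_task_struct()/put_task_struct().

```c
#include <stdlib.h>

struct task {
    int refcount;
    int flags;
};

struct event {
    struct task *target;  /* pinned by a reference while the event lives */
};

/* Allocate a task holding one reference (the "task itself"). */
static struct task *task_create(int flags)
{
    struct task *t = calloc(1, sizeof(*t));
    if (t) {
        t->refcount = 1;
        t->flags = flags;
    }
    return t;
}

static struct task *task_get(struct task *t)
{
    t->refcount++;
    return t;
}

/* Drop one reference; returns 1 if this freed the task, 0 otherwise. */
static int task_put(struct task *t)
{
    if (--t->refcount == 0) {
        free(t);
        return 1;
    }
    return 0;
}

/* Take a reference at assignment time, as the fix does at event allocation. */
static void event_init(struct event *e, struct task *t)
{
    e->target = task_get(t);
}

/* Teardown may still read the task; the reference is dropped only afterwards. */
static int event_free(struct event *e)
{
    int flags = e->target->flags;
    task_put(e->target);
    e->target = NULL;
    return flags;
}
```

Because the event holds its own reference, the task exiting first only drops the count to one, and the flags read during teardown touches live memory.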
  7. 29 Mar, 2018 1 commit
  8. 20 Mar, 2018 1 commit
    • perf/cgroup: Fix child event counting bug · c917e0f2
      Committed by Song Liu
      When a perf_event is attached to parent cgroup, it should count events
      for all children cgroups:
      
         parent_group   <---- perf_event
           \
            - child_group  <---- process(es)
      
      However, in our tests, we found this perf_event cannot report reliable
      results. Here is an example case:
      
        # create cgroups
        mkdir -p /sys/fs/cgroup/p/c
        # start perf for parent group
        perf stat -e instructions -G "p"
      
        # on another console, run test process in child cgroup:
        stressapptest -s 2 -M 1000 & echo $! > /sys/fs/cgroup/p/c/cgroup.procs
      
        # after the test process is done, stopping perf in the first console shows:
      
             <not counted>      instructions              p
      
      The instructions event should not be "not counted", as the process runs in
      the child cgroup.
      
      We found this is because perf_event->cgrp and cpuctx->cgrp are not
      identical, thus perf_event->cgrp is not updated properly.
      
      This patch fixes this by updating perf_cgroup properly for ancestor
      cgroup(s).
      Reported-by: Ephraim Park <ephiepark@fb.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <jolsa@redhat.com>
      Cc: <kernel-team@fb.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/20180312165943.1057894-1-songliubraving@fb.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
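The parent/child relationship in the commit above comes down to walking the cgroup hierarchy upwards. The sketch below is a toy model, not the kernel's perf_cgroup code: `struct cgroup`, the `timestamp` field and all function names are hypothetical. It contrasts an exact-match check with a hierarchical one and shows state being updated for a cgroup and every ancestor.

```c
#include <stdbool.h>
#include <stddef.h>

struct cgroup {
    struct cgroup *parent;    /* NULL for the root */
    unsigned long timestamp;  /* stand-in for per-cgroup perf state */
};

/* True if `cg` is `ancestor` itself or lies below it in the hierarchy. */
static bool cgroup_is_descendant_of(const struct cgroup *cg,
                                    const struct cgroup *ancestor)
{
    for (; cg != NULL; cg = cg->parent)
        if (cg == ancestor)
            return true;
    return false;
}

/* Exact match: misses tasks running in child groups. */
static bool event_matches_exact(const struct cgroup *event_cg,
                                const struct cgroup *running_cg)
{
    return event_cg == running_cg;
}

/* Hierarchical match: an event on a parent also covers its children. */
static bool event_matches_hierarchical(const struct cgroup *event_cg,
                                       const struct cgroup *running_cg)
{
    return cgroup_is_descendant_of(running_cg, event_cg);
}

/* Update state for the running cgroup and all of its ancestors. */
static void set_timestamp_hierarchy(struct cgroup *cg, unsigned long now)
{
    for (; cg != NULL; cg = cg->parent)
        cg->timestamp = now;
}
```

With the layout from the commit message (event on `p`, process in `p/c`), the exact check fails while the hierarchical one succeeds.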
  9. 17 Mar, 2018 2 commits
    • perf/core: Clear sibling list of detached events · 24868367
      Committed by Mark Rutland
      When perf_group_detach() is called on a group leader, it updates each
      sibling's group_leader field to point to that sibling, effectively
      upgrading each sibling to a group leader. After perf_group_detach() has
      completed, the caller may free the leader event.
      
      We only remove siblings from the group leader's sibling_list when the
      leader has a non-empty group_node. This was fine prior to commit:
      
        8343aae6 ("perf/core: Remove perf_event::group_entry")
      
      ... as the sibling's sibling_list would be empty. However, now that we
      use the sibling_list field as both the list head and the list entry,
      this leaves each sibling with a non-empty sibling list, including the
      stale leader event.
      
      If perf_group_detach() is subsequently called on a sibling, it will
      appear to be a group leader, and we'll walk the sibling_list,
      potentially dereferencing these stale events. In 0day testing, this has
      been observed to result in kernel panics.
      
      Let's avoid this by always removing siblings from the sibling list when
      we promote them to leaders.
      
      Fixes: 8343aae6 ("perf/core: Remove perf_event::group_entry")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: vincent.weaver@maine.edu
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: torvalds@linux-foundation.org
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: valery.cherepennikov@intel.com
      Cc: linux-tip-commits@vger.kernel.org
      Cc: eranian@google.com
      Cc: acme@redhat.com
      Cc: alexander.shishkin@linux.intel.com
      Cc: davidcc@google.com
      Cc: kan.liang@intel.com
      Cc: Dmitry.Prohorov@intel.com
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: https://lkml.kernel.org/r/20180316131741.3svgr64yibc6vsid@lakrids.cambridge.arm.com
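The fix described above (always unlinking a sibling when it is promoted to a leader) can be sketched with a minimal intrusive list. The list helpers below are a small reimplementation written for this example, and `struct event`, `group_attach()` and `group_detach_leader()` are hypothetical simplifications of the kernel structures.

```c
#include <stdbool.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
    return h->next == h;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink an entry and leave it as an empty, self-pointing list. */
static void list_del_init(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    list_init(e);
}

struct event {
    struct event *group_leader;
    struct list_head sibling_list;  /* head on a leader, entry on a sibling */
};

static void event_init(struct event *e)
{
    e->group_leader = e;
    list_init(&e->sibling_list);
}

static void group_attach(struct event *leader, struct event *sib)
{
    sib->group_leader = leader;
    list_add_tail(&sib->sibling_list, &leader->sibling_list);
}

/* Promote every sibling to its own leader and unlink it, so no sibling
 * keeps a pointer into the old leader's list. */
static void group_detach_leader(struct event *leader)
{
    while (!list_empty(&leader->sibling_list)) {
        struct list_head *pos = leader->sibling_list.next;
        struct event *sib = (struct event *)
            ((char *)pos - offsetof(struct event, sibling_list));

        sib->group_leader = sib;
        list_del_init(pos);
    }
}
```

After the detach, each former sibling has an empty `sibling_list`, so a later detach on it cannot walk into the stale leader.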
    • perf: Fix sibling iteration · edb39592
      Committed by Peter Zijlstra
      Mark noticed that the change to sibling_list changed some iteration
      semantics; because previously we used group_list as list entry,
      sibling events would always have an empty sibling_list.
      
      But because we now use sibling_list for both list head and list entry,
      siblings will report as having siblings.
      
      Fix this with a custom for_each_sibling_event() iterator.
      
      Fixes: 8343aae6 ("perf/core: Remove perf_event::group_entry")
      Reported-by: Mark Rutland <mark.rutland@arm.com>
      Suggested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: vincent.weaver@maine.edu
      Cc: alexander.shishkin@linux.intel.com
      Cc: torvalds@linux-foundation.org
      Cc: alexey.budankov@linux.intel.com
      Cc: valery.cherepennikov@intel.com
      Cc: eranian@google.com
      Cc: acme@redhat.com
      Cc: linux-tip-commits@vger.kernel.org
      Cc: davidcc@google.com
      Cc: kan.liang@intel.com
      Cc: Dmitry.Prohorov@intel.com
      Cc: jolsa@redhat.com
      Link: https://lkml.kernel.org/r/20180315170129.GX4043@hirez.programming.kicks-ass.net
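A guarded iterator of the kind the commit above describes might look like the following. This is an illustrative sketch only: the macro `for_each_sibling()`, the helper `event_of()`, `link_group()` and `count_siblings()` are hypothetical names, and the list type is a minimal reimplementation rather than the kernel's.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

struct event {
    struct event *group_leader;
    struct list_head sibling_list;  /* head on a leader, entry on a sibling */
};

/* Recover the event that embeds a given sibling_list node. */
#define event_of(ptr) \
    ((struct event *)((char *)(ptr) - offsetof(struct event, sibling_list)))

/* Iterate siblings only when `ev` is a group leader; for a plain sibling
 * the loop body never runs, even though its list node is linked. */
#define for_each_sibling(sib, ev)                                  \
    if ((ev)->group_leader == (ev))                                \
        for ((sib) = event_of((ev)->sibling_list.next);            \
             &(sib)->sibling_list != &(ev)->sibling_list;          \
             (sib) = event_of((sib)->sibling_list.next))

/* Link `n` siblings behind `leader` into one circular list. */
static void link_group(struct event *leader, struct event *sibs, size_t n)
{
    struct list_head *prev = &leader->sibling_list;

    leader->group_leader = leader;
    for (size_t i = 0; i < n; i++) {
        sibs[i].group_leader = leader;
        prev->next = &sibs[i].sibling_list;
        sibs[i].sibling_list.prev = prev;
        prev = &sibs[i].sibling_list;
    }
    prev->next = &leader->sibling_list;
    leader->sibling_list.prev = prev;
}

static int count_siblings(struct event *ev)
{
    struct event *sib;
    int n = 0;

    for_each_sibling(sib, ev)
        n++;
    return n;
}
```

Because the macro expands to an `if` followed by a `for`, it should not be followed directly by an `else`. Without the leader check, walking from a sibling's node would treat the leader and the other siblings as its own siblings.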
  10. 16 Mar, 2018 2 commits
    • perf/core: Clear sibling list of detached events · bbb68468
      Committed by Mark Rutland
      When perf_group_detach() is called on a group leader, it updates each
      sibling's group_leader field to point to that sibling, effectively
      upgrading each sibling to a group leader. After perf_group_detach() has
      completed, the caller may free the leader event.
      
      We only remove siblings from the group leader's sibling_list when the
      leader has a non-empty group_node. This was fine prior to commit:
      
        8343aae6 ("perf/core: Remove perf_event::group_entry")
      
      ... as the sibling's sibling_list would be empty. However, now that we
      use the sibling_list field as both the list head and the list entry,
      this leaves each sibling with a non-empty sibling list, including the
      stale leader event.
      
      If perf_group_detach() is subsequently called on a sibling, it will
      appear to be a group leader, and we'll walk the sibling_list,
      potentially dereferencing these stale events. In 0day testing, this has
      been observed to result in kernel panics.
      
      Let's avoid this by always removing siblings from the sibling list when
      we promote them to leaders.
      
      Fixes: 8343aae6 ("perf/core: Remove perf_event::group_entry")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: vincent.weaver@maine.edu
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: torvalds@linux-foundation.org
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: valery.cherepennikov@intel.com
      Cc: linux-tip-commits@vger.kernel.org
      Cc: eranian@google.com
      Cc: acme@redhat.com
      Cc: alexander.shishkin@linux.intel.com
      Cc: davidcc@google.com
      Cc: kan.liang@intel.com
      Cc: Dmitry.Prohorov@intel.com
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: https://lkml.kernel.org/r/20180316131741.3svgr64yibc6vsid@lakrids.cambridge.arm.com
    • perf: Fix sibling iteration · 7eb709f2
      Committed by Peter Zijlstra
      Mark noticed that the change to sibling_list changed some iteration
      semantics; because previously we used group_list as list entry,
      sibling events would always have an empty sibling_list.
      
      But because we now use sibling_list for both list head and list entry,
      siblings will report as having siblings.
      
      Fix this with a custom for_each_sibling_event() iterator.
      
      Fixes: 8343aae6 ("perf/core: Remove perf_event::group_entry")
      Reported-by: Mark Rutland <mark.rutland@arm.com>
      Suggested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: vincent.weaver@maine.edu
      Cc: alexander.shishkin@linux.intel.com
      Cc: torvalds@linux-foundation.org
      Cc: alexey.budankov@linux.intel.com
      Cc: valery.cherepennikov@intel.com
      Cc: eranian@google.com
      Cc: acme@redhat.com
      Cc: linux-tip-commits@vger.kernel.org
      Cc: davidcc@google.com
      Cc: kan.liang@intel.com
      Cc: Dmitry.Prohorov@intel.com
      Cc: jolsa@redhat.com
      Link: https://lkml.kernel.org/r/20180315170129.GX4043@hirez.programming.kicks-ass.net
  11. 13 Mar, 2018 2 commits
    • perf/core: Implement fast breakpoint modification via _IOC_MODIFY_ATTRIBUTES · 32ff77e8
      Committed by Milind Chabbi
      Problem and motivation: Once a breakpoint perf event (PERF_TYPE_BREAKPOINT)
      is created, there is no flexibility to change the breakpoint type
      (bp_type), breakpoint address (bp_addr), or breakpoint length (bp_len). The
      only option is to close the perf event and configure a new breakpoint
      event. This inflexibility has a significant performance overhead. For
      example, sampling-based, lightweight performance profilers (and also
      concurrency bug detection tools) monitor different addresses for a short
      duration using PERF_TYPE_BREAKPOINT and change the address (bp_addr) to
      another address, change the kind of breakpoint (bp_type) from "write" to
      "read" or vice versa, or change the length (bp_len) of the address being
      monitored. The cost of these modifications is prohibitive since it involves
      unmapping the circular buffer associated with the perf event, closing the
      perf event, opening another perf event and mmap()ing another circular buffer.
      
      Solution: The new ioctl command for perf events,
      PERF_EVENT_IOC_MODIFY_ATTRIBUTES, introduced in this patch takes a pointer
      to a struct perf_event_attr as an argument to update an old breakpoint
      event with a new address, type, and size. This facility allows retaining a
      previously mmap()ed perf event ring buffer and avoids having to close and
      reopen another perf event.
      
      This patch supports modifying only the PERF_TYPE_BREAKPOINT event type; future
      implementations can extend this feature. The patch replicates some of the
      functionality of modify_user_hw_breakpoint() in
      kernel/events/hw_breakpoint.c. modify_user_hw_breakpoint() cannot be called
      directly since perf_event_ctx_lock() is already held in _perf_ioctl().
      
      Evidence: Experiments show that the baseline (not able to modify an already
      created breakpoint) costs an order of magnitude (~10x) more than the
      suggested optimization (having the ability to dynamically modify a
      configured breakpoint via ioctl). When the breakpoints typically do not
      trap, the speedup due to the suggested optimization is ~10x; even when the
      breakpoints always trap, the speedup is ~4x.
      
      Testing: tests posted at
      https://github.com/linux-contrib/perf_event_modify_bp demonstrate the
      performance significance of this patch. Tests also check the functional
      correctness of the patch.
      Signed-off-by: Milind Chabbi <chabbi.milind@gmail.com>
      [ Using modify_user_hw_breakpoint_check function. ]
      [ Reformatted PERF_EVENT_IOC_*, so the values are all in one column. ]
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Cc: Jin Yao <yao.jin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Oleg Nesterov <onestero@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: http://lkml.kernel.org/r/20180312134548.31532-8-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
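From userspace, retargeting an open breakpoint event with the ioctl described above might look roughly like this. It is a hedged sketch: it assumes Linux UAPI headers recent enough to define PERF_EVENT_IOC_MODIFY_ATTRIBUTES, and `modify_breakpoint()` is a hypothetical helper name. Starting from a copy of the attr used at perf_event_open() time is a cautious choice here, on the assumption that the kernel may reject an attr that differs in non-breakpoint fields.

```c
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Retarget an already-open PERF_TYPE_BREAKPOINT event without closing it.
 * `orig` is the attr the event was opened with; only the breakpoint fields
 * are changed. Returns ioctl()'s result: 0 on success, -1 with errno set
 * on failure. */
static int modify_breakpoint(int perf_fd, const struct perf_event_attr *orig,
                             uint64_t addr, uint32_t type, uint64_t len)
{
    struct perf_event_attr attr = *orig;

    attr.bp_type = type;  /* e.g. HW_BREAKPOINT_W or HW_BREAKPOINT_R */
    attr.bp_addr = addr;  /* new address to watch */
    attr.bp_len = len;    /* e.g. HW_BREAKPOINT_LEN_4 */

    return ioctl(perf_fd, PERF_EVENT_IOC_MODIFY_ATTRIBUTES, &attr);
}
```

The mmap()ed ring buffer belonging to `perf_fd` is left untouched by this call, which is the point of the optimization described in the commit message.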
    • perf/core: Move perf_event_attr::sample_max_stack into perf_copy_attr() · 5f970521
      Committed by Jiri Olsa
      Move the sample_max_stack check and setup into perf_copy_attr(),
      so we have all perf_event_attr initial setup in one place
      and can easily compare attrs in the new ioctl introduced
      in the following change.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Cc: Jin Yao <yao.jin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Milind Chabbi <chabbi.milind@gmail.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Oleg Nesterov <onestero@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: http://lkml.kernel.org/r/20180312134548.31532-7-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 12 Mar, 2018 10 commits
    • perf/core: Fix installing cgroup events on CPU · 33801b94
      Committed by leilei.lin
      There are two problems when installing cgroup events on CPUs: firstly,
      list_update_cgroup_event() only tries to set cpuctx->cgrp for the
      first event; if that mismatches on @cgrp we'll not try again for later
      additions.
      
      Secondly, when we install a cgroup event into an active context, only
      issue an event reprogram when the event matches the current cgroup
      context. This avoids a pointless event reprogramming.
      Signed-off-by: leilei.lin <leilei.lin@alibaba-inc.com>
      [ Improved the changelog and comments. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: brendan.d.gregg@gmail.com
      Cc: eranian@gmail.com
      Cc: linux-kernel@vger.kernel.org
      Cc: yang_oliver@hotmail.com
      Link: http://lkml.kernel.org/r/20180306093637.28247-1-linxiulei@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Optimize perf_rotate_context() event scheduling · 8d5bce0c
      Committed by Peter Zijlstra
      The event schedule order (as per perf_event_sched_in()) is:
      
       - cpu  pinned
       - task pinned
       - cpu  flexible
       - task flexible
      
      But perf_rotate_context() will unschedule cpu-flexible even if it
      doesn't need a rotation.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix tree based event rotation · 8703a7cf
      Committed by Peter Zijlstra
      Similar to how first programming cpu=-1 and then cpu=# is wrong, so is
      rotating both. It was especially wrong when we were still programming
      the PMU in this same order, because in that scenario we might never
      actually end up running cpu=# events at all.
      
      Cure this by using the active_list to pick the rotation event; since
      at programming we already select the left-most event.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Simpify perf_event_groups_for_each() · 6e6804d2
      Committed by Peter Zijlstra
      The last argument is, and always must be, the same.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6e6804d2
    • P
      perf/core: Optimize ctx_sched_out() · 6668128a
      Peter Zijlstra committed
      When an event group contains more events than can be scheduled on the
      hardware, iterating the full event group for ctx_sched_out is a waste
      of time.
      
      Keep track of the events that got programmed on the hardware, such
      that we can iterate this smaller list in order to schedule them out.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6668128a
    • P
      perf/core: Remove perf_event::group_entry · 8343aae6
      Peter Zijlstra committed
      Now that all the grouping is done with RB trees, we no longer need
      group_entry and can replace the whole thing with sibling_list.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8343aae6
    • P
      perf/core: Fix event schedule order · 1cac7b1a
      Peter Zijlstra committed
      Scheduling in events with cpu=-1 before events with cpu=# changes
      semantics and is undesirable in that it would prioritize these events.
      
      Given that groups->index is across all groups we actually have an
      inter-group ordering, meaning we can merge-sort two groups, which is
      just what we need to preserve semantics.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1cac7b1a
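The merge step described in 1cac7b1a can be pictured with a small user-space sketch. It assumes two runs of groups, one for cpu=-1 and one for a specific CPU, each already ordered by a global index. The `struct group` type and `merge_by_index()` helper are hypothetical names for illustration and are not the kernel's data structures or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a schedulable group; not struct perf_event. */
struct group {
    int cpu;              /* -1 means "any CPU" */
    uint64_t group_index; /* assumed global, increasing insertion order */
};

/*
 * Merge two runs that are each sorted by group_index into out[], so the
 * combined visiting order follows the global index instead of handling
 * one run completely before the other.
 * Returns the number of entries written (na + nb).
 */
size_t merge_by_index(const struct group *a, size_t na,
                      const struct group *b, size_t nb,
                      struct group *out)
{
    size_t i = 0, j = 0, k = 0;

    assert(out != NULL);
    while (i < na && j < nb) {
        if (a[i].group_index <= b[j].group_index)
            out[k++] = a[i++];
        else
            out[k++] = b[j++];
    }
    while (i < na)          /* drain whatever is left of the first run */
        out[k++] = a[i++];
    while (j < nb)          /* drain whatever is left of the second run */
        out[k++] = b[j++];
    return k;
}
```

Because the index is shared across both runs, the merged order matches the order in which the groups were originally added, which is the semantics the commit message says must be preserved.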
    • P
      perf/core: Cleanup the rb-tree code · 161c85fa
      Peter Zijlstra committed
      Trivial comment and code fixups..
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      161c85fa
    • A
      perf/cor: Use RB trees for pinned/flexible groups · 8e1a2031
      Alexey Budankov committed
      Change event groups into RB trees sorted by CPU and then by a 64-bit
      index, so that the multiplexing hrtimer interrupt handler can skip
      to the current CPU's list and ignore groups allocated for the
      other CPUs.
      
      A new API for manipulating event groups in the trees is implemented, and
      the current implementation is adapted to use it.
      
      pinned_group_sched_in() and flexible_group_sched_in() are
      introduced to consolidate the code that enables whole groups from the
      pinned and flexible trees respectively.
      Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/372f9c8b-0cfe-4240-e44d-83d863d40813@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8e1a2031
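The ordering that 8e1a2031 describes (CPU first, then a 64-bit index) can be sketched in user space with a sorted array in place of an RB tree. This is a simplified illustration only: `struct group_key`, `group_key_cmp()` and `first_for_cpu()` are hypothetical names, and the array plus binary search stands in for the tree lookup.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sort key; the real kernel stores this inside the event. */
struct group_key {
    int cpu;              /* -1 means "any CPU" */
    uint64_t group_index; /* tie-breaker within one CPU */
};

/* Order by cpu first, then by group_index (usable with qsort()). */
int group_key_cmp(const void *pa, const void *pb)
{
    const struct group_key *a = pa;
    const struct group_key *b = pb;

    if (a->cpu != b->cpu)
        return a->cpu < b->cpu ? -1 : 1;
    if (a->group_index != b->group_index)
        return a->group_index < b->group_index ? -1 : 1;
    return 0;
}

/*
 * Return the position of the first key for @cpu in a sorted array,
 * or n if that CPU has no entries. Keys for other CPUs are skipped
 * by the binary search rather than visited one by one.
 */
size_t first_for_cpu(const struct group_key *keys, size_t n, int cpu)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (keys[mid].cpu < cpu)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < n && keys[lo].cpu == cpu) ? lo : n;
}
```

With keys ordered this way, all groups for one CPU are contiguous, so a handler interested in a single CPU only walks that CPU's run.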
    • P
      perf/core: Fix perf_output_read_group() · 9e5b127d
      Peter Zijlstra committed
      Mark reported his arm64 perf fuzzer runs sometimes splat like:
      
        armv8pmu_read_counter+0x1e8/0x2d8
        armpmu_event_update+0x8c/0x188
        armpmu_read+0xc/0x18
        perf_output_read+0x550/0x11e8
        perf_event_read_event+0x1d0/0x248
        perf_event_exit_task+0x468/0xbb8
        do_exit+0x690/0x1310
        do_group_exit+0xd0/0x2b0
        get_signal+0x2e8/0x17a8
        do_signal+0x144/0x4f8
        do_notify_resume+0x148/0x1e8
        work_pending+0x8/0x14
      
      which asserts that we only call pmu::read() on ACTIVE events.
      
      The above callchain does:
      
        perf_event_exit_task()
          perf_event_exit_task_context()
            task_ctx_sched_out() // INACTIVE
            perf_event_exit_event()
              perf_event_set_state(EXIT) // EXIT
              sync_child_event()
                perf_event_read_event()
                  perf_output_read()
                    perf_output_read_group()
                      leader->pmu->read()
      
      Which results in doing a pmu::read() on an !ACTIVE event.
      
      I _think_ this is 'new' since we added attr.inherit_stat, which added
      the perf_event_read_event() to the exit path, without that
      perf_event_read_output() would only trigger from samples and for
      @event to trigger a sample, its leader _must_ be ACTIVE too.
      
      Still, adding this check makes it consistent with the @sub case for
      the siblings.
      Reported-and-Tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9e5b127d
  13. 09 Mar, 2018 1 commit
    • S
      perf/core: Fix ctx_event_type in ctx_resched() · bd903afe
      Song Liu committed
      In ctx_resched(), EVENT_FLEXIBLE should be sched_out when EVENT_PINNED is
      added. However, ctx_resched() calculates ctx_event_type before checking
      this condition. As a result, pinned events will NOT get higher priority
      than flexible events.
      
      The following shows this issue on an Intel CPU (where ref-cycles can
      only use one hardware counter).
      
        1. First start:
             perf stat -C 0 -e ref-cycles  -I 1000
        2. Then, in the second console, run:
             perf stat -C 0 -e ref-cycles:D -I 1000
      
      The second perf uses pinned events, which are expected to have higher
      priority. However, because of the failure in ctx_resched(), it is
      never run.
      
      This patch fixes this by calculating ctx_event_type after re-evaluating
      event_type.
      Reported-by: Ephraim Park <ephiepark@fb.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <jolsa@redhat.com>
      Cc: <kernel-team@fb.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 487f05e1 ("perf/core: Optimize event rescheduling on active contexts")
      Link: http://lkml.kernel.org/r/20180306055504.3283731-1-songliubraving@fb.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bd903afe
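The ordering problem described in bd903afe can be shown with a reduced sketch. The flag values and the two helper functions below are assumptions made for illustration; they mirror the description in the commit message rather than reproduce the kernel's ctx_resched().

```c
#include <assert.h>

/* Assumed flag values for illustration only. */
enum event_type_t {
    EVENT_FLEXIBLE = 0x1,
    EVENT_PINNED   = 0x2,
    EVENT_ALL      = EVENT_FLEXIBLE | EVENT_PINNED,
};

/*
 * Ordering as described for the bug: the context mask is derived
 * before event_type is widened, so adding a pinned event does not
 * schedule out the flexible events in the context.
 */
unsigned int ctx_type_before_fix(unsigned int *event_type)
{
    unsigned int ctx_event_type = *event_type & EVENT_ALL;

    if (*event_type & EVENT_PINNED)
        *event_type |= EVENT_FLEXIBLE;
    return ctx_event_type;
}

/*
 * Ordering as described for the fix: widen event_type first, then
 * derive the context mask from the re-evaluated value.
 */
unsigned int ctx_type_after_fix(unsigned int *event_type)
{
    if (*event_type & EVENT_PINNED)
        *event_type |= EVENT_FLEXIBLE;
    return *event_type & EVENT_ALL;
}
```

Starting from `EVENT_PINNED` alone, the first helper yields a mask without `EVENT_FLEXIBLE` while the second yields `EVENT_ALL`, which is the difference the commit message attributes to the fix.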
  14. 12 Feb, 2018 1 commit
    • L
      vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds committed
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9a08845
  15. 06 Feb, 2018 2 commits
  16. 25 Jan, 2018 3 commits
    • P
      perf/core: Fix ctx::mutex deadlock · 0c7296ca
      Peter Zijlstra committed
      Lockdep noticed the following 3-way lockup scenario:
      
      	sys_perf_event_open()
      	  perf_event_alloc()
      	    perf_try_init_event()
       #0	      ctx = perf_event_ctx_lock_nested(1)
      	      perf_swevent_init()
      		swevent_hlist_get()
       #1		  mutex_lock(&pmus_lock)
      
      	perf_event_init_cpu()
       #1	  mutex_lock(&pmus_lock)
       #2	  mutex_lock(&ctx->mutex)
      
      	sys_perf_event_open()
      	  mutex_lock_double()
       #2	   mutex_lock()
       #0	   mutex_lock_nested()
      
      And while we need that perf_event_ctx_lock_nested() for HW PMUs such
      that they can iterate the sibling list, trying to match it to the
      available counters, the software PMUs need do no such thing. Exclude
      them.
      
      In particular the swevent triggers the above inversion, while the
      tpevent PMU triggers a more elaborate one through their event_mutex.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0c7296ca
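The 3-way scenario in 0c7296ca amounts to a cycle in the lock-ordering graph: #0 is held while taking #1, #1 while taking #2, and #2 while taking #0. The sketch below is not lockdep; it is a minimal depth-first cycle check over a hypothetical `struct lock_graph`, with lock classes reduced to small integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_LOCKS 8

/* edge[a][b] is true when lock b was taken while lock a was held. */
struct lock_graph {
    bool edge[MAX_LOCKS][MAX_LOCKS];
};

void lock_graph_init(struct lock_graph *g)
{
    memset(g, 0, sizeof(*g));
}

void lock_graph_add(struct lock_graph *g, int held, int acquired)
{
    g->edge[held][acquired] = true;
}

/* state: 0 = unvisited, 1 = on the current DFS path, 2 = finished. */
static bool visit(const struct lock_graph *g, int node, int *state)
{
    state[node] = 1;
    for (int next = 0; next < MAX_LOCKS; next++) {
        if (!g->edge[node][next])
            continue;
        if (state[next] == 1)
            return true;    /* back edge: ordering cycle found */
        if (state[next] == 0 && visit(g, next, state))
            return true;
    }
    state[node] = 2;
    return false;
}

/* Report whether any set of recorded orderings forms a cycle. */
bool lock_graph_has_cycle(const struct lock_graph *g)
{
    int state[MAX_LOCKS] = { 0 };

    for (int n = 0; n < MAX_LOCKS; n++)
        if (state[n] == 0 && visit(g, n, state))
            return true;
    return false;
}
```

Recording the three orderings from the commit message produces a cycle. Dropping the #0 to #1 edge, which corresponds to software PMUs no longer taking the context lock on that path, leaves the graph acyclic.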
    • P
      perf/core: Fix another perf,trace,cpuhp lock inversion · 43fa87f7
      Peter Zijlstra committed
      Lockdep noticed the following 3-way lockup race:
      
              perf_trace_init()
       #0       mutex_lock(&event_mutex)
                perf_trace_event_init()
                  perf_trace_event_reg()
                    tp_event->class->reg() := tracepoint_probe_register
       #1              mutex_lock(&tracepoints_mutex)
                        trace_point_add_func()
       #2                  static_key_enable()
      
       #2	do_cpu_up()
      	  perf_event_init_cpu()
       #3	    mutex_lock(&pmus_lock)
       #4	    mutex_lock(&ctx->mutex)
      
      	perf_ioctl()
       #4	  ctx = perf_event_ctx_lock()
      	  _perf_ioctl()
      	    ftrace_profile_set_filter()
       #0	      mutex_lock(&event_mutex)
      
      Fudge it for now by noting that the tracepoint state does not depend
      on the event <-> context relation. Ugly though :/
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      43fa87f7
    • P
      perf/core: Fix lock inversion between perf,trace,cpuhp · 82d94856
      Peter Zijlstra committed
      Lockdep gifted us with noticing the following 4-way lockup scenario:
      
              perf_trace_init()
       #0       mutex_lock(&event_mutex)
                perf_trace_event_init()
                  perf_trace_event_reg()
                    tp_event->class->reg() := tracepoint_probe_register
       #1             mutex_lock(&tracepoints_mutex)
                        trace_point_add_func()
       #2                 static_key_enable()
      
       #2     do_cpu_up()
                perf_event_init_cpu()
       #3         mutex_lock(&pmus_lock)
       #4         mutex_lock(&ctx->mutex)
      
              perf_event_task_disable()
                mutex_lock(&current->perf_event_mutex)
       #4       ctx = perf_event_ctx_lock()
       #5       perf_event_for_each_child()
      
              do_exit()
                task_work_run()
                  __fput()
                    perf_release()
                      perf_event_release_kernel()
       #4               mutex_lock(&ctx->mutex)
       #5               mutex_lock(&event->child_mutex)
                        free_event()
                          _free_event()
                            event->destroy() := perf_trace_destroy
       #0                     mutex_lock(&event_mutex);
      
      Fix that by moving the free_event() out from under the locks.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      82d94856
  17. 08 Jan, 2018 3 commits
  18. 03 Jan, 2018 1 commit
  19. 17 Dec, 2017 1 commit