1. 06 Jan 2016 (3 commits)
    • perf/core: Collapse more IPI loops · 7b648018
      Peter Zijlstra authored
      This patch collapses the two 'hard' cases, which are
      perf_event_{dis,en}able().
      
      I cannot seem to convince myself the current code is correct.
      
      So starting with perf_event_disable(): we don't strictly need to test
      for event->state == ACTIVE; ctx->is_active is enough. If the event is
      not scheduled while the ctx is, __perf_event_disable() still does the
      right thing.  It's a little less efficient to IPI in that case, but it
      is simpler overall.
      
      For perf_event_enable() the same goes, but I think it is actually
      broken in its current form. The current condition is: ctx->is_active
      && event->state == OFF, which means it does nothing when
      !ctx->is_active && event->state == OFF. This is wrong; it should still
      mark the event INACTIVE in that case, otherwise we will not try to
      schedule the event once the context becomes active again.
      
      This patch implements the two functions using the new
      event_function_call() helper and does away with the tricky
      event->state tests.
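      A simplified sketch of the resulting disable path (illustrative, not the
      verbatim patch; __perf_event_disable() and event_function_call() are the
      perf core helpers named above, and the helper's simplified signature is
      assumed here):

          static void _perf_event_disable(struct perf_event *event)
          {
              struct perf_event_context *ctx = event->ctx;

              raw_spin_lock_irq(&ctx->lock);
              if (event->state <= PERF_EVENT_STATE_OFF) {
                  /* Already off: nothing to do, no IPI needed. */
                  raw_spin_unlock_irq(&ctx->lock);
                  return;
              }
              raw_spin_unlock_irq(&ctx->lock);

              /*
               * No event->state == ACTIVE test: the helper IPIs the event's
               * CPU (or runs locally if the context is not active) and
               * __perf_event_disable() does the right thing either way.
               */
              event_function_call(event, __perf_event_disable, NULL);
          }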
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Alexander Shishkin <alexander.shishkin@intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7b648018
    • perf: Fix race in swevent hash · 12ca6ad2
      Peter Zijlstra authored
      There's a race on CPU unplug where we free the swevent hash array
      while it can still have events on it. This will result in a
      use-after-free, which is BAD.
      
      Simply do not free the hash array on unplug. This leaves the thing
      around and no use-after-free takes place.
      
      When the last swevent dies, we do a for_each_possible_cpu() iteration
      anyway to clean these up, at which time we'll free it, so no leakage
      will occur.
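      A sketch of that teardown path, assuming the perf core names
      swevent_htable, hlist_refcount and swevent_hlist_release (the wrapper
      name below is made up for illustration, and the real code also takes
      the per-CPU mutex while doing this):

          static void swevent_hlist_put_all(void)
          {
              int cpu;

              /*
               * The hash array is no longer freed from the CPU-unplug path;
               * instead, the last software event walks every possible CPU
               * and drops the reference there, freeing the hash only once
               * nobody can still be using it.
               */
              for_each_possible_cpu(cpu) {
                  struct swevent_htable *swhash = &per_cpu(swevent_htable, cpu);

                  if (!--swhash->hlist_refcount)
                      swevent_hlist_release(swhash); /* kfree_rcu() inside */
              }
          }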
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      12ca6ad2
    • perf: Fix race in perf_event_exec() · c1274499
      Peter Zijlstra authored
      I managed to tickle this warning:
      
        [ 2338.884942] ------------[ cut here ]------------
        [ 2338.890112] WARNING: CPU: 13 PID: 35162 at ../kernel/events/core.c:2702 task_ctx_sched_out+0x6b/0x80()
        [ 2338.900504] Modules linked in:
        [ 2338.903933] CPU: 13 PID: 35162 Comm: bash Not tainted 4.4.0-rc4-dirty #244
        [ 2338.911610] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
        [ 2338.923071]  ffffffff81f1468e ffff8807c6457cb8 ffffffff815c680c 0000000000000000
        [ 2338.931382]  ffff8807c6457cf0 ffffffff810c8a56 ffffe8ffff8c1bd0 ffff8808132ed400
        [ 2338.939678]  0000000000000286 ffff880813170380 ffff8808132ed400 ffff8807c6457d00
        [ 2338.947987] Call Trace:
        [ 2338.950726]  [<ffffffff815c680c>] dump_stack+0x4e/0x82
        [ 2338.956474]  [<ffffffff810c8a56>] warn_slowpath_common+0x86/0xc0
        [ 2338.963195]  [<ffffffff810c8b4a>] warn_slowpath_null+0x1a/0x20
        [ 2338.969720]  [<ffffffff811a49cb>] task_ctx_sched_out+0x6b/0x80
        [ 2338.976244]  [<ffffffff811a62d2>] perf_event_exec+0xe2/0x180
        [ 2338.982575]  [<ffffffff8121fb6f>] setup_new_exec+0x6f/0x1b0
        [ 2338.988810]  [<ffffffff8126de83>] load_elf_binary+0x393/0x1660
        [ 2338.995339]  [<ffffffff811dc772>] ? get_user_pages+0x52/0x60
        [ 2339.001669]  [<ffffffff8121e297>] search_binary_handler+0x97/0x200
        [ 2339.008581]  [<ffffffff8121f8b3>] do_execveat_common.isra.33+0x543/0x6e0
        [ 2339.016072]  [<ffffffff8121fcea>] SyS_execve+0x3a/0x50
        [ 2339.021819]  [<ffffffff819fc165>] stub_execve+0x5/0x5
        [ 2339.027469]  [<ffffffff819fbeb2>] ? entry_SYSCALL_64_fastpath+0x12/0x71
        [ 2339.034860] ---[ end trace ee1337c59a0ddeac ]---
      
      This is a WARN_ON_ONCE() indicating that cpuctx->task_ctx is not
      what we expected it to be.
      
      This is because context switches can swap the task_struct::perf_event_ctxp[]
      pointer around. Therefore you have to either disable preemption when looking
      at current, or hold ctx->lock.
      
      Fix perf_event_enable_on_exec(): it loads current->perf_event_ctxp[]
      before disabling interrupts, so a preemption in the right place can
      swap contexts around and we end up using the wrong one.
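      The fix boils down to sampling the context pointer only once interrupts
      are off, roughly as in this abridged sketch:

          static void perf_event_enable_on_exec(int ctxn)
          {
              struct perf_event_context *ctx;
              unsigned long flags;

              local_irq_save(flags);
              /*
               * Read the pointer with interrupts (and thus preemption)
               * disabled: a context switch can no longer swap
               * current->perf_event_ctxp[] underneath us.
               */
              ctx = current->perf_event_ctxp[ctxn];
              if (!ctx || !ctx->nr_events)
                  goto out;

              /* ... walk the events and enable the enable_on_exec ones ... */
          out:
              local_irq_restore(flags);
          }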
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Link: http://lkml.kernel.org/r/20151210195740.GG6357@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c1274499
  2. 06 Dec 2015 (2 commits)
    • perf/core: Collapse common IPI pattern · 0017960f
      Peter Zijlstra authored
      Various functions implement the same pattern to send IPIs to an
      event's CPU. Collapse the easy ones into a common helper function to
      reduce duplication.
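      The helper looks roughly like the sketch below (simplified: the upstream
      version threads separate callbacks for the active and inactive cases,
      which is omitted here):

          static void event_function_call(struct perf_event *event,
                                          int (*func)(void *), void *data)
          {
              struct perf_event_context *ctx = event->ctx;
              struct task_struct *task = ctx->task;

              if (!task) {
                  /* CPU-bound event: a single cross-call does it. */
                  cpu_function_call(event->cpu, func, data);
                  return;
              }

          again:
              if (!task_function_call(task, func, data))
                  return; /* func ran on the task's CPU */

              raw_spin_lock_irq(&ctx->lock);
              if (ctx->is_active) {
                  /*
                   * The task got scheduled in (or moved) in the meantime;
                   * reload the task pointer and retry the IPI.
                   */
                  task = ctx->task;
                  raw_spin_unlock_irq(&ctx->lock);
                  goto again;
              }
              /* The context is not active, so we can do the work locally. */
              func(data);
              raw_spin_unlock_irq(&ctx->lock);
          }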
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0017960f
    • perf: Do not send exit event twice · 4e93ad60
      Jiri Olsa authored
      When we monitor events system-wide, we get the EXIT event
      (when configured) twice for each task that exited.
      
      Note the doubled lines with the same pid/tid in the following example:
      
        $ sudo ./perf record -a
        ^C[ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.480 MB perf.data (2518 samples) ]
        $ sudo ./perf report -D | grep EXIT
      
        0 60290687567581 0x59910 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
        0 60290687568354 0x59948 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
        0 60290687988744 0x59ad8 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
        0 60290687989198 0x59b10 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
        1 60290692567895 0x62af0 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253)
        1 60290692568322 0x62b28 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253)
        2 60290692739276 0x69a18 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252)
        2 60290692739910 0x69a50 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252)
      
      The reason is that the cpu contexts are processed each time we call
      perf_event_task. I'm changing the perf_event_aux logic to serve
      task_ctx and cpu contexts separately, which ensures we don't get the
      EXIT event generated twice on the same cpu context.
      
      This does not affect other auxiliary events, as they don't
      use task_ctx at all.
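      A sketch of the reworked dispatch, assuming the perf core names
      perf_event_aux_ctx() and the global pmus list (abridged; locking details
      and the cgroup handling are left out):

          static void perf_event_aux(perf_event_aux_output_cb output, void *data,
                                     struct perf_event_context *task_ctx)
          {
              struct pmu *pmu;

              /*
               * When a task context is supplied (the exec/exit paths), emit
               * the record through it once and return; the per-CPU contexts
               * are intentionally not visited again, which is what used to
               * duplicate the EXIT records.
               */
              if (task_ctx) {
                  perf_event_aux_ctx(task_ctx, output, data);
                  return;
              }

              rcu_read_lock();
              list_for_each_entry_rcu(pmu, &pmus, entry) {
                  struct perf_cpu_context *cpuctx =
                      get_cpu_ptr(pmu->pmu_cpu_context);

                  perf_event_aux_ctx(&cpuctx->ctx, output, data);
                  put_cpu_ptr(pmu->pmu_cpu_context);
              }
              rcu_read_unlock();
          }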
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1446649205-5822-1-git-send-email-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4e93ad60
  3. 04 Dec 2015 (1 commit)
  4. 03 Dec 2015 (1 commit)
    • cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Tejun Heo authored
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B and A's processes should be moved to
      the former and B's processes to the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated, which is the parent of the
      target csses.  The pids controller depends on the migration methods to
      move charges, and this made the controller attribute charges to the
      wrong csses, often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing the @css parameter from the three
      migration methods, ->can_attach(), ->cancel_attach() and ->attach(),
      and by updating the cgroup_taskset iteration helpers to also return the
      destination css in addition to the task being migrated (see the sketch
      after the list below).  All controllers are updated accordingly.
      
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there's single source
        and destination and thus doesn't support v2 hierarchy already.  The
        only change made by this patchset is how that single destination css
        is obtained.
      
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accommodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
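      For illustration, a controller-side ->can_attach() after the change,
      modeled loosely on the pids controller (simplified; css_pids(),
      pids_charge() and pids_uncharge() are that controller's own helpers):

          static int pids_can_attach(struct cgroup_taskset *tset)
          {
              struct task_struct *task;
              struct cgroup_subsys_state *dst_css;

              /*
               * The iteration now hands back the destination css for each
               * task, so charges land on the css the task actually moves
               * to, even when tasks in one taskset fan out to different
               * child cgroups.
               */
              cgroup_taskset_for_each(task, dst_css, tset) {
                  struct pids_cgroup *pids = css_pids(dst_css);
                  struct pids_cgroup *old_pids =
                      css_pids(task_css(task, pids_cgrp_id));

                  pids_charge(pids, 1);
                  pids_uncharge(old_pids, 1);
              }
              return 0;
          }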
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      1f7dd3e5
  5. 23 Nov 2015 (3 commits)
  6. 09 Nov 2015 (2 commits)
  7. 22 Oct 2015 (1 commit)
  8. 16 Oct 2015 (1 commit)
    • cgroup: keep zombies associated with their original cgroups · 2e91fa7f
      Tejun Heo authored
      cgroup_exit() is called when a task exits; it disassociates the exiting
      task from its cgroups and half-attaches it to the root cgroup.  This is
      unnecessary and undesirable.
      
      No controller actually needs an exiting task to be disassociated from
      non-root cgroups.  Both cpu and perf_event controllers update the
      association to the root cgroup from their exit callbacks just to keep
      consistent with the cgroup core behavior.
      
      Also, this disassociation makes it difficult to track resources held
      by zombies or determine where the zombies came from.  Currently, pids
      controller is completely broken as it uncharges on exit and zombies
      always escape the resource restriction.  With cgroup association being
      reset on exit, fixing it is pretty painful.
      
      There's no reason to reset cgroup membership on exit.  The zombie can
      be removed from its css_set so that it doesn't show up on
      "cgroup.procs" and thus can't be migrated or interfere with cgroup
      removal.  It can still pin and point to the css_set so that its cgroup
      membership is maintained.  This patch makes cgroup core keep zombies
      associated with their cgroups at the time of exit.
      
      * Previous patches decoupled populated_cnt tracking from css_set
        lifetime, so a dying task can be simply unlinked from its css_set
        while pinning and pointing to the css_set.  This keeps css_set
        association from task side alive while hiding it from "cgroup.procs"
        and populated_cnt tracking.  The css_set reference is dropped when
        the task_struct is freed.
      
      * ->exit() callback no longer needs the css arguments as the
        associated css never changes once PF_EXITING is set.  Removed.
      
      * cpu and perf_events controllers no longer need ->exit() callbacks.
        There's no reason to explicitly switch away on exit.  The final
        schedule out is enough.  The callbacks are removed.
      
      * On traditional hierarchies, nothing changes.  "/proc/PID/cgroup"
        still reports "/" for all zombies.  On the default hierarchy,
        "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
        to at the time of exit.  If the cgroup gets removed before the task
        is reaped, " (deleted)" is appended.
      
      v2: Build breakage due to missing dummy cgroup_free() when
          !CONFIG_CGROUP fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      2e91fa7f
  9. 28 Sep 2015 (1 commit)
  10. 18 Sep 2015 (3 commits)
    • perf: Fix races in computing the header sizes · f73e22ab
      Peter Zijlstra authored
      There are two races with the current code:
      
       - Another event can join the group and compute a larger header_size
         concurrently; if the smaller store wins we'll have an incorrect
         header_size set.
      
       - We compute the header_size after the event becomes active,
         therefore it's possible to use the size before it's computed.
      
      Remedy the first by moving the computation inside the ctx::mutex lock,
      and the second by placing it _before_ perf_install_in_context().
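      Abridged sketch of the resulting ordering in the perf_event_open() path
      (error handling and the surrounding group-move logic are omitted):

          mutex_lock(&ctx->mutex);
          /* ... event list / group manipulation ... */

          /*
           * (1) Compute the sizes under ctx->mutex, so a sibling joining
           *     the group cannot race the store with a smaller value.
           */
          perf_event__header_size(event);
          perf_event__id_header_size(event);

          /*
           * (2) Only now install the event: it can never be observed
           *     active with a stale header_size.
           */
          perf_install_in_context(ctx, event, event->cpu);
          perf_unpin_context(ctx);
          mutex_unlock(&ctx->mutex);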
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f73e22ab
    • perf: Fix u16 overflows · a723968c
      Peter Zijlstra authored
      Vince reported that it's possible to overflow the various size fields
      and get weird stuff if you stick too many events in a group.
      
      Put a lid on this by requiring that the fixed record size not exceed
      16k. This is still a fair number of events (a silly number, really) and
      leaves plenty of room for callchains and stack dwarves while also
      avoiding overflowing the u16 variables.
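      The guard amounts to something like the check below (a sketch: the
      helper name matches the perf core, but the exact expression may differ):

          static bool perf_event_validate_size(struct perf_event *event)
          {
              /*
               * The fixed portions of a sample record (read_size,
               * header_size and id_header_size) are stored in u16s; cap
               * their sum at 16k so later additions (callchains, stack
               * data) cannot wrap them.
               */
              if (event->read_size + event->header_size +
                  event->id_header_size + sizeof(struct perf_event_header) >=
                  16 * 1024)
                  return false;

              return true;
          }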
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a723968c
    • perf: Restructure perf syscall point of no return · f55fc2a5
      Peter Zijlstra authored
      The exclusive_event_installable() stuff only works because it is
      exclusive with the grouping bits.
      
      Rework the code such that there is a sane place to error out before we
      go do things we cannot undo.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f55fc2a5
  11. 13 Sep 2015 (8 commits)
  12. 11 Sep 2015 (1 commit)
    • kexec: split kexec_load syscall from kexec core code · 2965faa5
      Dave Young authored
      There are two kexec load syscalls, kexec_load and kexec_file_load.
      kexec_file_load has already been split out into kernel/kexec_file.c.
      In this patch I split the kexec_load syscall code out into kernel/kexec.c.
      
      Also add a new kconfig option, KEXEC_CORE, so we can disable kexec_load
      and use kexec_file_load only, or vice versa.
      
      The original requirement is from Ted Ts'o: he wants the kexec kernel
      signature to be checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But
      kexec-tools using the kexec_load syscall can bypass that checking.
      
      Vivek Goyal proposed creating a common kconfig option so users can
      compile in only one syscall for loading the kexec kernel.  KEXEC and
      KEXEC_FILE select KEXEC_CORE so that old config files still work.
      
      Because there is general code that needs CONFIG_KEXEC_CORE, I updated
      all the architecture Kconfig files with the new KEXEC_CORE option and
      made KEXEC select KEXEC_CORE in the arch Kconfig.  General kernel code
      that refers to the kexec_load syscall is updated accordingly.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Dave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2965faa5
  13. 12 Aug 2015 (1 commit)
  14. 10 Aug 2015 (1 commit)
  15. 07 Aug 2015 (1 commit)
    • tracing, perf: Implement BPF programs attached to uprobes · 04a22fae
      Wang Nan authored
      By copying the BPF-related operations to the uprobe processing path,
      this patch allows users to attach BPF programs to uprobes the way they
      already can on kprobes.
      
      After this patch, users are allowed to use PERF_EVENT_IOC_SET_BPF on a
      uprobe perf event, which makes it possible to profile user space
      programs and kernel events together using BPF.
      
      Because of this patch, CONFIG_BPF_EVENTS should be selected by
      CONFIG_UPROBE_EVENT to ensure trace_call_bpf() is compiled even if
      KPROBE_EVENT is not set.
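      From user space the flow is roughly the following (a hedged sketch:
      error handling is trimmed; tracepoint_id is assumed to have been read
      from the id file of a uprobe event created via tracing/uprobe_events,
      e.g. /sys/kernel/debug/tracing/events/uprobes/<probe>/id, and
      bpf_prog_fd is an fd returned by bpf(BPF_PROG_LOAD, ...)):

          #include <linux/perf_event.h>
          #include <string.h>
          #include <sys/ioctl.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          static int attach_bpf_to_uprobe(int tracepoint_id, int bpf_prog_fd)
          {
              struct perf_event_attr attr;
              int fd;

              memset(&attr, 0, sizeof(attr));
              attr.type = PERF_TYPE_TRACEPOINT; /* the uprobe shows up as a tracepoint */
              attr.size = sizeof(attr);
              attr.config = tracepoint_id;
              attr.sample_period = 1;
              attr.wakeup_events = 1;

              fd = syscall(__NR_perf_event_open, &attr,
                           -1 /* pid: all */, 0 /* cpu */, -1 /* group fd */, 0);
              if (fd < 0)
                  return -1;

              /* This is the ioctl the patch enables for uprobe events. */
              if (ioctl(fd, PERF_EVENT_IOC_SET_BPF, bpf_prog_fd) < 0) {
                  close(fd);
                  return -1;
              }
              return fd;
          }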
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kaixu Xia <xiakaixu@huawei.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: pi3orama@163.com
      Link: http://lkml.kernel.org/r/1435716878-189507-3-git-send-email-wangnan0@huawei.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      04a22fae
  16. 04 Aug 2015 (2 commits)
    • perf/x86/intel/pt: Do not force sync packets on every schedule-in · 9a6694cf
      Alexander Shishkin authored
      Currently, the PT driver zeroes out the status register every time before
      starting the event. However, all the writable bits are already taken care
      of in the pt_handle_status() function, except the new PacketByteCnt field,
      which in new versions of PT contains the number of packet bytes written
      since the last sync (PSB) packet. Zeroing it out before enabling PT forces
      a sync packet to be written. This means that, with the existing code, a
      sync packet (PSB and PSBEND, 18 bytes in total) will be generated every
      time a PT event is scheduled in.
      
      To avoid these unnecessary syncs and save a WRMSR in the fast path, this
      patch changes the default behavior to not clear PacketByteCnt field, so
      that the sync packets will be generated with the period specified as
      "psb_period" attribute config field. This has little impact on the trace
      data as the other packets that are normally sent within PSB+ (between PSB
      and PSBEND) have their own generation scenarios which do not depend on the
      sync packets.
      
      One exception is when tracing starts: there we do need to force a PSB
      like this, so that the decoder has a clear sync point in the trace. For
      this purpose we already have the hw::itrace_started flag, which we are
      currently using to output PERF_RECORD_ITRACE_START. This patch moves
      the setting of itrace_started from the perf core to pmu::start, where
      it should still be 0 on the very first run.
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: hpa@zytor.com
      Link: http://lkml.kernel.org/r/1438264104-16189-1-git-send-email-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9a6694cf
    • perf: Fix fasync handling on inherited events · fed66e2c
      Peter Zijlstra authored
      Vince reported that the fasync signal stuff doesn't work properly for
      inherited events. So fix that.
      
      Installing fasync allocates memory and sets filp->f_flags |= FASYNC,
      which upon the demise of the file descriptor ensures the allocation is
      freed and state is updated.
      
      Now for perf, we can have the events stick around for a while after the
      original FD is dead because of references from child events. So we
      cannot copy the fasync pointer around. We can however consistently use
      the parent's fasync, as that will be updated.
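      The shape of the fix, as a sketch: a small helper that every
      signal/teardown path uses to resolve the fasync state through the
      parent (field names as in struct perf_event):

          static struct fasync_struct **perf_event_fasync(struct perf_event *event)
          {
              /*
               * Only the parent carries fasync state; child events created
               * by inheritance always defer to it, so the state stays valid
               * and up to date for the lifetime of the hierarchy.
               */
              if (event->parent)
                  event = event->parent;

              return &event->fasync;
          }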
      Reported-and-Tested-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1434011521.1495.71.camel@twins
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fed66e2c
  17. 27 Jul 2015 (1 commit)
  18. 24 Jul 2015 (1 commit)
    • perf: Add PERF_RECORD_SWITCH to indicate context switches · 45ac1403
      Adrian Hunter authored
      There are already two events for context switches, namely the tracepoint
      sched:sched_switch and the software event context_switches.
      Unfortunately neither are suitable for use by non-privileged users for
      the purpose of synchronizing hardware trace data (e.g. Intel PT) to the
      context switch.
      
      Tracepoints are no good at all for non-privileged users because they
      need either CAP_SYS_ADMIN or /proc/sys/kernel/perf_event_paranoid <= -1.
      
      On the other hand, kernel software events need either CAP_SYS_ADMIN or
      /proc/sys/kernel/perf_event_paranoid <= 1.
      
      Now many distributions do default perf_event_paranoid to 1 making
      context_switches a contender, except it has another problem (which is
      also shared with sched:sched_switch) which is that it happens before
      perf schedules events out instead of after perf schedules events in.
      Whereas a privileged user can see all the events anyway, a
      non-privileged user only sees events for their own processes, in other
      words they see when their process was scheduled out not when it was
      scheduled in. That presents two problems to use the event:
      
      1. the information comes too late, so tools have to look ahead in the
         event stream to find out what the current state is
      
      2. if they are unlucky tracing might have stopped before the
         context-switches event is recorded.
      
      This new PERF_RECORD_SWITCH event does not have those problems
      and it also has a couple of other small advantages.
      
      It is easier to use because it is an auxiliary event (like mmap, comm
      and task events) which can be enabled by setting a single bit. It is
      smaller than sched:sched_switch and easier to parse.
      
      To make the event useful for privileged users also, if the
      context is cpu-wide then the event record will be
      PERF_RECORD_SWITCH_CPU_WIDE which is the same as
      PERF_RECORD_SWITCH except it also provides the next or
      previous pid/tid.
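      For reference, a consumer opts in with a single attribute bit; a hedged
      sketch of the setup and of the cpu-wide record layout (abbreviated from
      the uapi description):

          #include <linux/perf_event.h>
          #include <string.h>

          /* Any event will do as a carrier; a software dummy event is typical. */
          static void init_switch_attr(struct perf_event_attr *attr)
          {
              memset(attr, 0, sizeof(*attr));
              attr->type = PERF_TYPE_SOFTWARE;
              attr->config = PERF_COUNT_SW_DUMMY;
              attr->size = sizeof(*attr);
              attr->sample_id_all = 1;
              attr->context_switch = 1; /* the single bit that enables the records */
          }

          /*
           * PERF_RECORD_SWITCH carries only the header plus sample_id.
           * PERF_RECORD_SWITCH_CPU_WIDE additionally carries the other task:
           *
           *     struct {
           *         struct perf_event_header header;
           *         u32 next_prev_pid;
           *         u32 next_prev_tid;
           *         struct sample_id sample_id;
           *     };
           *
           * header.misc & PERF_RECORD_MISC_SWITCH_OUT distinguishes
           * switch-out from switch-in records.
           */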
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Pawel Moll <pawel.moll@arm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Link: http://lkml.kernel.org/r/1437471846-26995-2-git-send-email-adrian.hunter@intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      45ac1403
  19. 06 Jul 2015 (1 commit)
  20. 24 Jun 2015 (1 commit)
  21. 19 Jun 2015 (1 commit)
    • perf: Fix ring_buffer_attach() RCU sync, again · 2f993cf0
      Oleg Nesterov authored
      While looking for other users of get_state/cond_sync, I found
      ring_buffer_attach() and it looks obviously buggy.
      
      Don't we need to ensure that we "synchronize" _between_ the
      list_del() and the list_add()?
      
      IOW, suppose that ring_buffer_attach() is preempted right after
      get_state_synchronize_rcu() and the grace period completes before
      spin_lock().
      
      In this case cond_synchronize_rcu() does nothing and we reuse
      ->rb_entry without waiting for a grace period in between.
      
      The patch also moves the ->rcu_pending check under "if (rb)", to make
      it more readable IMO.
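      The ordering the patch enforces, in sketch form (abridged; rb_entry,
      rcu_batches and rcu_pending are the existing fields of struct
      perf_event, and the waiter wake-up handling is omitted):

          static void ring_buffer_attach(struct perf_event *event,
                                         struct ring_buffer *rb)
          {
              struct ring_buffer *old_rb = event->rb;
              unsigned long flags;

              if (old_rb) {
                  spin_lock_irqsave(&old_rb->event_lock, flags);
                  list_del_rcu(&event->rb_entry);
                  spin_unlock_irqrestore(&old_rb->event_lock, flags);

                  /* Snapshot the grace-period state *after* the removal. */
                  event->rcu_batches = get_state_synchronize_rcu();
                  event->rcu_pending = 1;
              }

              if (rb) {
                  /*
                   * Wait (if needed) for a grace period since the removal
                   * *before* reusing rb_entry for the new buffer, so no
                   * reader can still see the entry on the old list.
                   */
                  if (event->rcu_pending) {
                      cond_synchronize_rcu(event->rcu_batches);
                      event->rcu_pending = 0;
                  }

                  spin_lock_irqsave(&rb->event_lock, flags);
                  list_add_rcu(&event->rb_entry, &rb->event_list);
                  spin_unlock_irqrestore(&rb->event_lock, flags);
              }

              rcu_assign_pointer(event->rb, rb);
          }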
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Cc: der.herr@hofr.at
      Cc: josh@joshtriplett.org
      Cc: tj@kernel.org
      Fixes: b69cf536 ("perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()")
      Link: http://lkml.kernel.org/r/20150530200425.GA15748@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2f993cf0
  22. 07 Jun 2015 (2 commits)
    • perf/x86/intel: Introduce PERF_RECORD_LOST_SAMPLES · f38b0dbb
      Kan Liang authored
      After enlarging the PEBS interrupt threshold, there may be some mixed up
      PEBS samples which are discarded by the kernel.
      
      This patch makes the kernel emit a PERF_RECORD_LOST_SAMPLES record with
      the number of possible discarded records when it is impossible to demux
      the samples.
      
      It makes sure the user is not left in the dark about such discards.
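      The new record is small; its layout follows the uapi description (the
      struct name below is illustrative, the uapi documents the layout in a
      comment rather than as a named type):

          struct perf_record_lost_samples {
              struct perf_event_header header; /* header.type == PERF_RECORD_LOST_SAMPLES */
              __u64 lost;                      /* number of potentially discarded records */
              /* struct sample_id sample_id; follows when sample_id_all is set */
          };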
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1431285195-14269-8-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f38b0dbb
    • perf/x86/intel: Handle multiple records in the PEBS buffer · 21509084
      Yan, Zheng authored
      When the PEBS interrupt threshold is larger than one record and the
      machine supports multiple PEBS events, the records of these events are
      mixed up and we need to demultiplex them.
      
      Demuxing the records is hard because the hardware is deficient. The
      hardware has two issues that, when combined, create impossible
      scenarios to demux.
      
      The first issue is that the 'status' field of the PEBS record is a copy
      of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
      problem let us first describe the regular PEBS cycle:
      
      A) the CTRn value reaches 0:
        - the corresponding bit in GLOBAL_STATUS gets set
        - we start arming the hardware assist
        < some unspecified amount of time later -- this could cover multiple
          events of interest >
      
      B) the hardware assist is armed, any next event will trigger it
      
      C) a matching event happens:
        - the hardware assist triggers and generates a PEBS record
          this includes a copy of GLOBAL_STATUS at this moment
        - if we auto-reload we (re)set CTRn
        - we clear the relevant bit in GLOBAL_STATUS
      
      Now consider the following chain of events:
      
        A0, B0, A1, C0
      
      The event generated for counter 0 will include a status with counter 1
      set, even though it's not at all related to the record. A similar thing
      can happen with a !PEBS event if it just happens to overflow at the
      right moment.
      
      The second issue is that the hardware will only emit one record for two
      or more counters if the events that trigger the assists are 'close'.
      'Close' can be several cycles; in some cases it can even span the
      complete assist, if the event is something that doesn't need retirement.
      
      For instance, consider this chain of events:
      
        A0, B0, A1, B1, C01
      
      Where C01 is an event that triggers both hardware assists, we will
      generate but a single record, but again with both counters listed in the
      status field.
      
      This time the record pertains to both events.
      
      Note that these two cases are different but indistinguishable with the
      data as generated. Therefore demuxing records with multiple PEBS bits
      (we can safely ignore status bits for !PEBS counters) is impossible.
      
      Furthermore we cannot emit the record to both events because that might
      cause a data leak -- the events might not have the same privileges -- so
      what this patch does is discard such events.
      
      The assumption/hope is that such discards will be rare.
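      Conceptually, the demux decision reduces to the following sketch (not
      the literal driver code; cpuc->pebs_enabled and cpuc->events[] are the
      existing x86 driver state):

          static struct perf_event *
          pebs_demux_sketch(struct cpu_hw_events *cpuc, u64 record_status, u64 *lost)
          {
              /* Only status bits of counters currently in PEBS mode matter. */
              u64 pebs_status = record_status & cpuc->pebs_enabled;

              if (hweight64(pebs_status) == 1) {
                  /* Exactly one candidate: attribute the record to it. */
                  return cpuc->events[__ffs(pebs_status)];
              }

              /*
               * Zero or multiple candidate counters: attribution would risk
               * leaking data across events with different privileges, so
               * the record is dropped and accounted as lost.
               */
              (*lost)++;
              return NULL;
          }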
      
      Here are some possible ways you may get a high discard rate:
      
        - when you count the same thing multiple times; but that is not a useful
          configuration.
        - you can be unfortunate if you measure with a userspace only PEBS
          event along with either a kernel or unrestricted PEBS event. Imagine
          the event triggering and setting the overflow flag right before
          entering the kernel. Then all kernel side events will end up with
          multiple bits set.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      [ Changelog improvements. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1430940834-8964-4-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      21509084
  23. 27 May 2015 (1 commit)
    • perf: allow for PMU-specific event filtering · 66eb579e
      Mark Rutland authored
      In certain circumstances it may not be possible to schedule particular
      events due to constraints other than a lack of hardware counters (e.g.
      on big.LITTLE systems where CPUs support different events). The core
      perf event code does not distinguish these cases and pessimistically
      assumes that any failure to schedule an event means that it is not worth
      attempting to schedule later events, even if some hardware counters are
      still unused.
      
      When an event that a PMU cannot schedule exists in a flexible group
      list, it can unnecessarily prevent the event groups following it in the
      list from being scheduled (until it is rotated to the end of the list).
      This means some events are scheduled for only a portion of the time
      they could be, and for short-running programs no events may be
      scheduled if the list is initially sorted in an unfortunate order.
      
      This patch adds a new (optional) filter_match function pointer to struct
      pmu which a pmu driver can use to tell perf core when an event matches
      pmu-specific scheduling requirements. This plugs into the existing
      event_filter_match logic, and makes it possible to avoid the scheduling
      problem described above. When no filter is provided by the PMU, the
      existing behaviour is retained.
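      A driver-side sketch of the hook, assuming a hypothetical big.LITTLE
      PMU driver that records which CPUs its events can count on (my_pmu,
      supported_cpus and the helpers below are made up for illustration):

          struct my_pmu {
              struct pmu pmu;
              cpumask_t  supported_cpus; /* CPUs this PMU instance covers */
          };

          #define to_my_pmu(p) container_of((p), struct my_pmu, pmu)

          /*
           * Called by the perf core through event_filter_match(): returning
           * 0 makes the core skip this event on the current CPU instead of
           * giving up on the rest of the flexible group list.
           */
          static int my_pmu_filter_match(struct perf_event *event)
          {
              struct my_pmu *mypmu = to_my_pmu(event->pmu);

              return cpumask_test_cpu(smp_processor_id(), &mypmu->supported_cpus);
          }

          /* Wired up at registration time: .filter_match = my_pmu_filter_match */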
      
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      66eb579e