1. 19 Nov, 2013 1 commit
  2. 13 Nov, 2013 1 commit
  3. 10 Nov, 2013 2 commits
    • uprobes: Fix the memory out of bound overwrite in copy_insn() · 2ded0980
      Committed by Oleg Nesterov
      1. copy_insn() doesn't look very nice, all calculations are
         confusing and it is not immediately clear why we read
         the 2nd page first.
      
      2. The usage of inode->i_size is wrong on 32-bit machines.
      
      3. "Instruction at end of binary" logic is simply wrong, it
         doesn't handle the case when uprobe->offset > inode->i_size.
      
         In this case "bytes" overflows, and __copy_insn() writes to
         the memory outside of uprobe->arch.insn.
      
         Yes, uprobe_register() checks i_size_read(), but this file
         can be truncated after that. All i_size checks are racy, we
         do this only to catch the obvious mistakes.
      
      Change copy_insn() to call __copy_insn() in a loop, simplify
      and fix the bytes/nbytes calculations.
      
      Note: we do not care if we read extra bytes after inode->i_size
      if we got the valid page. This is fine because the task gets the
      same page after page-fault, and arch_uprobe_analyze_insn() can't
      know how many bytes were actually read anyway.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
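The loop described in the copy_insn() fix above can be sketched in user space. This is a minimal illustration, not the kernel code: the file is modelled as a byte buffer padded to whole pages, and `SKETCH_PAGE_SIZE`, `SKETCH_INSN_BYTES`, `copy_from_page()` and `copy_insn_sketch()` are hypothetical names chosen for the example. Each iteration copies at most the bytes left in the current page, and the offset is checked against the file size before every copy, so an offset at or past the end never produces a negative (overflowing) length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE  16 /* tiny "page" so a page boundary is easy to hit */
#define SKETCH_INSN_BYTES 8  /* stands in for the size of uprobe->arch.insn */

/* Copy len bytes that are known to lie inside a single page. */
static int copy_from_page(const unsigned char *file, unsigned char *dst,
			  size_t len, size_t offs)
{
	memcpy(dst, file + offs, len);
	return 0;
}

/*
 * Returns 0 on success, -1 if the offset is at or past the end of the file.
 * The backing buffer must be padded to a multiple of SKETCH_PAGE_SIZE,
 * because bytes after file_size inside a valid page may be read.
 */
static int copy_insn_sketch(const unsigned char *file, size_t file_size,
			    size_t offs, unsigned char *insn)
{
	size_t size = SKETCH_INSN_BYTES;
	int err = -1;

	do {
		if (offs >= file_size)
			break;

		/* Never cross a page boundary within one copy. */
		size_t in_page = SKETCH_PAGE_SIZE - (offs % SKETCH_PAGE_SIZE);
		size_t len = size < in_page ? size : in_page;

		err = copy_from_page(file, insn, len, offs);
		if (err)
			break;

		insn += len;
		offs += len;
		size -= len;
	} while (size);

	return err;
}
```

With a 20-byte file in a 32-byte buffer, an offset of 12 is served by two copies of 4 bytes each, while an offset of 24 fails without touching the destination.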
    • uprobes: Fix the wrong usage of current->utask in uprobe_copy_process() · 70d7f987
      Committed by Oleg Nesterov
      Commit aa59c53f "uprobes: Change uprobe_copy_process() to dup
      xol_area" has a stupid typo, we need to setup t->utask->vaddr but
      the code wrongly uses current->utask.
      
      Even with this bug dup_xol_work() works "in practice", but only
      because get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE) likely
      returns the same address every time.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
  4. 07 Nov, 2013 3 commits
    • uprobes: Export write_opcode() as uprobe_write_opcode() · f72d41fa
      Committed by Oleg Nesterov
      set_swbp() and set_orig_insn() are __weak, but this is pointless
      because write_opcode() is static.
      
      Export write_opcode() as uprobe_write_opcode() for the upcoming
      arm port, this way it can actually override set_swbp() and use
      __opcode_to_mem_arm(bpinsn) instead of UPROBE_SWBP_INSN.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • uprobes: Introduce arch_uprobe->ixol · 8a8de66c
      Committed by Oleg Nesterov
      Currently xol_get_insn_slot() assumes that we should simply copy
      arch_uprobe->insn[] which is (ignoring arch_uprobe_analyze_insn)
      just the copy of the original insn.
      
      This is not true for arm which needs to create another insn to
      execute it out-of-line.
      
      So this patch simply adds the new member, ->ixol into the union.
      This doesn't make any difference for x86 and powerpc, but arm
      can divorce insn/ixol and initialize the correct xol insn in
      arch_uprobe_analyze_insn().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • uprobes: Kill module_init() and module_exit() · 736e89d9
      Committed by Oleg Nesterov
      Turn module_init() into __initcall() and kill module_exit().
      
      This code can't be compiled as a module so these module_*()
      calls only add the confusion, especially if arch-dependent
      code needs its own initialization hooks.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
  5. 06 Nov, 2013 8 commits
  6. 30 Oct, 2013 6 commits
    • uprobes: Teach uprobe_copy_process() to handle CLONE_VFORK · 3ab67966
      Committed by Oleg Nesterov
      uprobe_copy_process() does nothing if the child shares ->mm with
      the forking process, but there is a special case: CLONE_VFORK.
      In this case it would be more correct to do dup_utask() but avoid
      dup_xol(). This is not that important, the child should not unwind
      its stack too much, this can corrupt the parent's stack, but at
      least we need this to allow to ret-probe __vfork() itself.
      
      Note: in theory, it would be better to check task_pt_regs(p)->sp
      instead of CLONE_VFORK, we need to dup_utask() if and only if the
      child can return from the function called by the parent. But this
      needs the arch-dependent helper, and I think that nobody actually
      does clone(same_stack, CLONE_VM).
      Reported-by: Martin Cermak <mcermak@redhat.com>
      Reported-by: David Smith <dsmith@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • uprobes: Change uprobe_copy_process() to dup xol_area · aa59c53f
      Committed by Oleg Nesterov
      This finally fixes the serious bug in uretprobes: a forked child
      crashes if the parent called fork() with the pending ret probe.
      
      Trivial test-case:
      
      	# perf probe -x /lib/libc.so.6 __fork%return
      	# perf record -e probe_libc:__fork perl -le 'fork || print "OK"'
      
      (the child doesn't print "OK", it is killed by SIGSEGV)
      
      If the child returns from the probed function it actually returns
      to trampoline_vaddr, because it got the copy of parent's stack
      mangled by prepare_uretprobe() when the parent entered this func.
      
      It crashes because a) this address is not mapped and b) until the
      previous change it didn't have the proper ->return_instances info.
      
      This means that uprobe_copy_process() has to create xol_area which
      has the trampoline slot, and its vaddr should be equal to parent's
      xol_area->vaddr.
      
      Unfortunately, uprobe_copy_process() can not simply do
      __create_xol_area(child, xol_area->vaddr). This could actually work
      but perf_event_mmap() doesn't expect the usage of foreign ->mm. So
      we offload this to task_work_run(), and pass the argument via not
      yet used utask->vaddr.
      
      We know that this vaddr is fine for install_special_mapping(), the
      necessary hole was recently "created" by dup_mmap() which skips the
      parent's VM_DONTCOPY area, and nobody else could use the new mm.
      
      Unfortunately, this also means that we can not handle the errors
      properly, we obviously can not abort the already completed fork().
      So we simply print the warning if GFP_KERNEL allocation (the only
      possible reason) fails.
      Reported-by: Martin Cermak <mcermak@redhat.com>
      Reported-by: David Smith <dsmith@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    • uprobes: Change uprobe_copy_process() to dup return_instances · 248d3a7b
      Committed by Oleg Nesterov
      uprobe_copy_process() assumes that the new child doesn't need
      ->utask, it should be allocated by demand.
      
      But this is not true if the forking task has the pending ret-
      probes, the child should report them as well and thus it needs
      the copy of parent's ->return_instances chain. Otherwise the
      child crashes when it returns from the probed function.
      
      Alternatively we could cleanup the child's stack, but this needs
      per-arch changes and this is not what we want. At least systemtap
      expects a .return in the child too.
      
      Note: this change alone doesn't fix the problem, see the next
      change.
      Reported-by: Martin Cermak <mcermak@redhat.com>
      Reported-by: David Smith <dsmith@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    • uprobes: Teach __create_xol_area() to accept the predefined vaddr · af0d95af
      Committed by Oleg Nesterov
      Currently xol_add_vma() uses get_unmapped_area() for area->vaddr,
      but the next patches need to use the fixed address. So this patch
      adds the new "vaddr" argument to __create_xol_area() which should
      be used as area->vaddr if it is nonzero.
      
      xol_add_vma() doesn't bother to verify that the predefined addr is
      not used, insert_vm_struct() should fail if find_vma_links() detects
      the overlap with the existing vma.
      
      Also, __create_xol_area() doesn't need __GFP_ZERO to allocate area.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    • uprobes: Introduce __create_xol_area() · 6441ec8b
      Committed by Oleg Nesterov
      No functional changes, preparation.
      
      Extract the code which actually allocates/installs the new area
      into the new helper, __create_xol_area().
      
      While at it remove the unnecessary "ret = ENOMEM" and "ret = 0"
      in xol_add_vma(), they both have no effect.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    • uprobes: Change the callsite of uprobe_copy_process() · b68e0749
      Committed by Oleg Nesterov
      Preparation for the next patches.
      
      Move the callsite of uprobe_copy_process() in copy_process() down
      to the successful return. We do not care if copy_process() fails,
      uprobe_free_utask() won't be called in this case so the wrong
      ->utask != NULL doesn't matter.
      
      OTOH, with this change we know that copy_process() can't fail when
      uprobe_copy_process() is called, the new task should either return
      to user-mode or call do_exit(). This way uprobe_copy_process() can:
      
      	1. setup p->utask != NULL if necessary
      
      	2. setup uprobes_state.xol_area
      
      	3. use task_work_add(p)
      
      Also, move the definition of uprobe_copy_process() down so that it
      can see get_utask().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
  7. 29 Oct, 2013 6 commits
  8. 18 Oct, 2013 1 commit
    • perf: Disable PERF_RECORD_MMAP2 support · 3090ffb5
      Committed by Stephane Eranian
      For now, we disable the extended MMAP record support (MMAP2).
      
      We have identified cases where it would not report the correct mapping
      information: clone(CLONE_VM) but with separate pids.  We will revisit
      the support once we find a solution for this case.
      
      The patch changes the kernel to return EINVAL if attr->mmap2 is set. The
      patch also modifies the perf tool to use regular PERF_RECORD_MMAP for
      synthetic events and it also prevents the tool from requesting
      attr->mmap2 mode because the kernel would reject it.
      
      The support will be revisited once the kernel interface is updated.
      
      In V2, we reduce the patch to the strict minimum.
      
      In V3, we avoid calling perf_event_open() with mmap2 set because we know
      it will fail and require fallback retry.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20131017173215.GA8820@quad
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  9. 04 Oct, 2013 3 commits
    • perf: Add generic transaction flags · fdfbbd07
      Committed by Andi Kleen
      Add a generic qualifier for transaction events, as a new sample
      type that returns a flag word. This is particularly useful
      for qualifying aborts: to distinguish aborts which happen
      due to asynchronous events (like conflicts caused by another
      CPU) versus instructions that lead to an abort.
      
      The tuning strategies are very different for those cases,
      so it's important to distinguish them easily and early.
      
      Since it's inconvenient and inflexible to filter for this
      in the kernel we report all the events out and allow
      some post processing in user space.
      
      The flags are based on the Intel TSX events, but should be fairly
      generic and mostly applicable to other HTM architectures too. In addition
      to various flag words there's also reserved space to report a
      program-supplied abort code. For TSX this is used to distinguish specific
      classes of aborts, like a lock busy abort when doing lock elision.
      
      Flags:
      
      Elision and generic transactions 		   (ELISION vs TRANSACTION)
      (HLE vs RTM on TSX; IBM etc.  would likely only use TRANSACTION)
      Aborts caused by current thread vs aborts caused by others (SYNC vs ASYNC)
      Retryable transaction				   (RETRY)
      Conflicts with other threads			   (CONFLICT)
      Transaction write capacity overflow		   (CAPACITY WRITE)
      Transaction read capacity overflow		   (CAPACITY READ)
      
      Transactions implicitly aborted can also return an abort code.
      This can be used to signal specific events to the profiler. A common
      case is abort on lock busy in an RTM eliding library (code 0xff).
      To handle this case we include the TSX abort code.
      
      Common example aborts in TSX would be:
      
      - Data conflict with another thread on memory read.
                                            Flags: TRANSACTION|ASYNC|CONFLICT
      - executing a WRMSR in a transaction. Flags: TRANSACTION|SYNC
      - HLE transaction in user space is too large
                                            Flags: ELISION|SYNC|CAPACITY-WRITE
      
      The only flag that is somewhat TSX specific is ELISION.
      
      This adds the perf core glue needed for reporting the new flag word out.
      
      v2: Add MEM/MISC
      v3: Move transaction to the end
      v4: Separate capacity-read/write and remove misc
      v5: Remove _SAMPLE. Move abort flags to 32bit. Rename
          transaction to txn
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1379688044-14173-2-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
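The flag word described in the transaction-flags commit above can be decoded in user space along the following lines. This is a sketch under stated assumptions: the `TXN_*` names are local to the example, and the bit positions (flags in the low bits in the order the commit lists them, abort code in the upper 32 bits) are assumed for illustration. The authoritative values are the ones in the kernel's exported perf_event.h header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for illustration only; check the uapi header for real values. */
#define TXN_ELISION        (1ULL << 0) /* HLE-style lock elision */
#define TXN_TRANSACTION    (1ULL << 1) /* generic (RTM-style) transaction */
#define TXN_SYNC           (1ULL << 2) /* abort caused by the current thread */
#define TXN_ASYNC          (1ULL << 3) /* abort caused by others */
#define TXN_RETRY          (1ULL << 4) /* retryable transaction */
#define TXN_CONFLICT       (1ULL << 5) /* conflict with another thread */
#define TXN_CAPACITY_WRITE (1ULL << 6) /* write capacity overflow */
#define TXN_CAPACITY_READ  (1ULL << 7) /* read capacity overflow */

#define TXN_ABORT_SHIFT 32
#define TXN_ABORT_MASK  (0xffffffffULL << TXN_ABORT_SHIFT)

/* True for aborts caused by a conflicting access from another CPU. */
static int txn_is_async_conflict(uint64_t txn)
{
	return (txn & TXN_ASYNC) && (txn & TXN_CONFLICT);
}

/* Extract the program-supplied abort code from the upper half of the word. */
static uint32_t txn_abort_code(uint64_t txn)
{
	return (uint32_t)((txn & TXN_ABORT_MASK) >> TXN_ABORT_SHIFT);
}
```

A post-processing tool could use `txn_is_async_conflict()` to separate contention-driven aborts from instruction-driven ones, and compare `txn_abort_code()` against 0xff to spot lock-busy aborts from an eliding library.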
    • perf: Enforce 1 as lower limit for perf_event_max_sample_rate · 723478c8
      Committed by Knut Petersen
      /proc/sys/kernel/perf_event_max_sample_rate will accept
      negative values as well as 0.
      
      Negative values are unreasonable, and 0 causes a
      divide by zero exception in perf_proc_update_handler.
      
      This patch enforces a lower limit of 1.
      Signed-off-by: Knut Petersen <Knut_Petersen@t-online.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/5242DB0C.4070005@t-online.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
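The effect of the lower limit described above can be illustrated with a small user-space sketch. It is not the kernel handler: `set_max_sample_rate()`, `sample_rate` and `sample_period_ns` are hypothetical names, and the only point shown is that rejecting values below 1 keeps the rate out of the divisor when it is zero or negative.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_NSEC_PER_SEC 1000000000L

/* Hypothetical state standing in for the sysctl value and its derived period. */
static int sample_rate = 100000;
static long sample_period_ns = SKETCH_NSEC_PER_SEC / 100000;

/* Accept a new rate only if it is at least 1; otherwise leave state untouched. */
static int set_max_sample_rate(int value)
{
	if (value < 1)
		return -EINVAL;

	sample_rate = value;
	sample_period_ns = SKETCH_NSEC_PER_SEC / value; /* value >= 1, no div by 0 */
	return 0;
}
```

Validating before the division means a rejected write has no side effects, so the previously configured rate stays in force.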
    • perf: Fix perf_pmu_migrate_context · 9886167d
      Committed by Peter Zijlstra
      While auditing the list_entry usage due to a trinity bug I found that
      perf_pmu_migrate_context violates the rules for
      perf_event::event_entry.
      
      The problem is that perf_event::event_entry is a RCU list element, and
      hence we must wait for a full RCU grace period before re-using the
      element after deletion.
      
      Therefore the usage in perf_pmu_migrate_context() which re-uses the
      entry immediately is broken. For now introduce another list_head into
      perf_event for this specific usage.
      
      This doesn't actually fix the trinity report because that never goes
      through this code.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-mkj72lxagw1z8fvjm648iznw@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 27 Sep, 2013 1 commit
  11. 20 Sep, 2013 1 commit
    • perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page' · fa731587
      Committed by Peter Zijlstra
      Solve the problems around the broken definition of perf_event_mmap_page::
      cap_usr_time and cap_usr_rdpmc fields which used to overlap, partially
      fixed by:
      
        860f085b ("perf: Fix broken union in 'struct perf_event_mmap_page'")
      
      The problem with the fix (merged in v3.12-rc1 and not yet released
      officially), noticed by Vince Weaver is that the new behavior is
      not detectable by new user-space, and that due to the reuse of the
      field names it's easy to mis-compile a binary if old headers are used
      on a new kernel or new headers are used on an old kernel.
      
      To solve all that make this change explicit, detectable and self-contained,
      by iterating the ABI the following way:
      
       - Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not
         confuse old user-space binaries. RDPMC will be marked as unavailable
         to old binaries but that's within the ABI, this is a capability bit.
      
       - Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new
         libraries can reliably detect that bit 0 is deprecated and perma-zero
         without having to check the kernel version.
      
       - Use bits 2, 3, 4 for the newly defined, correct functionality:
      
      	cap_user_rdpmc		: 1, /* The RDPMC instruction can be used to read counts */
      	cap_user_time		: 1, /* The time_* fields are used */
      	cap_user_time_zero	: 1, /* The time_zero field is used */
      
       - Rename all the bitfield names in perf_event.h to be different from the
         old names, to make sure it's not possible to mis-compile it
         accidentally with old assumptions.
      
      The 'size' field can then be used in the future to add new fields and it
      will act as a natural ABI version indicator as well.
      
      Also adjust tools/perf/ userspace for the new definitions, noticed by
      Adrian Hunter.
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Also-Fixed-by: Adrian Hunter <adrian.hunter@intel.com>
      Link: http://lkml.kernel.org/n/tip-zr03yxjrpXesOzzupszqglbv@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
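The detection scheme described in the capabilities commit above can be sketched as follows. The example works on a plain 64-bit capabilities word with masks instead of the real bitfield, and assumes bit 0 is the least significant bit; the `CAP_*` macro names and `can_use_rdpmc()` are hypothetical. Treating an old-layout page as "RDPMC unavailable" is a conservative choice made for this sketch, not something the commit prescribes.

```c
#include <assert.h>
#include <stdint.h>

/* Bit numbers follow the commit text; the macro names are local to this sketch. */
#define CAP_BIT0               (1ULL << 0) /* always 0 on fixed kernels */
#define CAP_BIT0_IS_DEPRECATED (1ULL << 1) /* always 1 on fixed kernels */
#define CAP_USER_RDPMC         (1ULL << 2) /* RDPMC can be used to read counts */
#define CAP_USER_TIME          (1ULL << 3) /* the time_* fields are used */
#define CAP_USER_TIME_ZERO     (1ULL << 4) /* the time_zero field is used */

/* Decide whether user space may use RDPMC, given the capabilities word. */
static int can_use_rdpmc(uint64_t caps)
{
	if (caps & CAP_BIT0_IS_DEPRECATED)
		return (caps & CAP_USER_RDPMC) != 0; /* new, unambiguous layout */

	/*
	 * Old layout: bit 0 overlapped two capabilities, so its meaning is
	 * ambiguous. This sketch conservatively reports "not available".
	 */
	return 0;
}
```

Because bit 1 is always set by fixed kernels, a library can make this decision from the mapped page alone, without checking the kernel version.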
  12. 12 Sep, 2013 1 commit
  13. 11 Sep, 2013 1 commit
    • perf: Fix up MMAP2 buffer space reservation · d008d525
      Committed by Arnaldo Carvalho de Melo
      The ino_generation field was added in the PERF_RECORD_MMAP2 record in
      the 13d7a241 cset but no space for it was allocated, corrupting the
      PERF_FORMAT_{TIME,CPU,TID,etc} area (sample_type/sample_id_all), fix it.
      
      Detected with one of the regression tests done by 'perf test':
      
        [root@sandy ~]# perf test -v 7
         7: Validate PERF_RECORD_* events & perf_sample fields     :
        --- start ---
        61315294449606 0 PERF_RECORD_SAMPLE
        61315294453161 0 PERF_RECORD_SAMPLE
        61315294454441 0 PERF_RECORD_SAMPLE
        61315294455709 0 PERF_RECORD_SAMPLE
        61315295600899 0 PERF_RECORD_COMM: sleep:6500
        27917287430500 342521613 PERF_RECORD_MMAP2 6500/6500: [0x400000(0x7000) @ 0 00:1d 311442 9016]: /usr/bin/sleep
        MMAP2 going backwards in time, prev=61315295600899, curr=27917287430500
        MMAP2 with unexpected cpu, expected 0, got 342521613
        MMAP2 with unexpected pid, expected 6500, got 1701606191
        MMAP2 with unexpected tid, expected 6500, got 28773
        27917287430500 342561333 PERF_RECORD_MMAP2 6500/6500: [0x3b7e000000(0x223000) @ 0 00:1d 309186 9016]: /usr/lib64/ld-2.16.so
        MMAP2 with unexpected cpu, expected 0, got 342561333
        MMAP2 with unexpected pid, expected 6500, got 1932408369
        MMAP2 with unexpected tid, expected 6500, got 111
        27917287430500 342600095 PERF_RECORD_MMAP2 6500/6500: [0x7fffbd7dc000(0x1000) @ 0x7fffbd7dc000 00:00 0 0]: [vdso]
        MMAP2 with unexpected cpu, expected 0, got 342600095
        MMAP2 with unexpected pid, expected 6500, got 1935963739
        MMAP2 with unexpected tid, expected 6500, got 23919
        27917287430500 342882834 PERF_RECORD_MMAP2 6500/6500: [0x3b7e400000(0x3b8000) @ 0 00:1d 309187 9016]: /usr/lib64/libc-2.16.so
        MMAP2 with unexpected cpu, expected 0, got 342882834
        MMAP2 with unexpected pid, expected 6500, got 909192754
        MMAP2 with unexpected tid, expected 6500, got 7303982
        61316297195411 0 PERF_RECORD_EXIT(6500:6500):(6500:6500)
        ---- end ----
        Validate PERF_RECORD_* events & perf_sample fields: FAILED!
        [root@sandy ~]#
      
      After this patch:
      
        [root@sandy ~]# perf test 7
         7: Validate PERF_RECORD_* events & perf_sample fields     : Ok
        [root@sandy ~]#
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Stephane Eranian <eranian@google.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Link: http://lkml.kernel.org/n/tip-heeuv986b8ha7whqg4o3he7c@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  14. 02 Sep, 2013 2 commits
    • perf: Add attr->mmap2 attribute to an event · 13d7a241
      Committed by Stephane Eranian
      Adds a new PERF_RECORD_MMAP2 record type which is in essence
      an expanded version of PERF_RECORD_MMAP.
      
      Used to request mmap records with more information about
      the mapping, including device major, minor and the inode
      number and generation for mappings associated with files
      or shared memory segments. Works for code and data
      (with attr->mmap_data set).
      
      Existing PERF_RECORD_MMAP record is unmodified by this patch.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: http://lkml.kernel.org/r/1377079825-19057-2-git-send-email-eranian@google.com
      [ Added Al to the Cc:. Are the ino, maj/min exports of vma->vm_file OK? ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Prevent race in unthrottling code · ae23bff1
      Committed by Jiri Olsa
      The current throttling code triggers WARN below via following
      workload (only hit on AMD machine with 48 CPUs):
      
        # while [ 1 ]; do perf record perf bench sched messaging; done
      
        WARNING: at arch/x86/kernel/cpu/perf_event.c:1054 x86_pmu_start+0xc6/0x100()
        SNIP
        Call Trace:
         <IRQ>  [<ffffffff815f62d6>] dump_stack+0x19/0x1b
         [<ffffffff8105f531>] warn_slowpath_common+0x61/0x80
         [<ffffffff8105f60a>] warn_slowpath_null+0x1a/0x20
         [<ffffffff810213a6>] x86_pmu_start+0xc6/0x100
         [<ffffffff81129dd2>] perf_adjust_freq_unthr_context.part.75+0x182/0x1a0
         [<ffffffff8112a058>] perf_event_task_tick+0xc8/0xf0
         [<ffffffff81093221>] scheduler_tick+0xd1/0x140
         [<ffffffff81070176>] update_process_times+0x66/0x80
         [<ffffffff810b9565>] tick_sched_handle.isra.15+0x25/0x60
         [<ffffffff810b95e1>] tick_sched_timer+0x41/0x60
         [<ffffffff81087c24>] __run_hrtimer+0x74/0x1d0
         [<ffffffff810b95a0>] ? tick_sched_handle.isra.15+0x60/0x60
         [<ffffffff81088407>] hrtimer_interrupt+0xf7/0x240
         [<ffffffff81606829>] smp_apic_timer_interrupt+0x69/0x9c
         [<ffffffff8160569d>] apic_timer_interrupt+0x6d/0x80
         <EOI>  [<ffffffff81129f74>] ? __perf_event_task_sched_in+0x184/0x1a0
         [<ffffffff814dd937>] ? kfree_skbmem+0x37/0x90
         [<ffffffff815f2c47>] ? __slab_free+0x1ac/0x30f
         [<ffffffff8118143d>] ? kfree+0xfd/0x130
         [<ffffffff81181622>] kmem_cache_free+0x1b2/0x1d0
         [<ffffffff814dd937>] kfree_skbmem+0x37/0x90
         [<ffffffff814e03c4>] consume_skb+0x34/0x80
         [<ffffffff8158b057>] unix_stream_recvmsg+0x4e7/0x820
         [<ffffffff814d5546>] sock_aio_read.part.7+0x116/0x130
         [<ffffffff8112c10c>] ? __perf_sw_event+0x19c/0x1e0
         [<ffffffff814d5581>] sock_aio_read+0x21/0x30
         [<ffffffff8119a5d0>] do_sync_read+0x80/0xb0
         [<ffffffff8119ac85>] vfs_read+0x145/0x170
         [<ffffffff8119b699>] SyS_read+0x49/0xa0
         [<ffffffff810df516>] ? __audit_syscall_exit+0x1f6/0x2a0
         [<ffffffff81604a19>] system_call_fastpath+0x16/0x1b
        ---[ end trace 622b7e226c4a766a ]---
      
      The reason is a race in perf_event_task_tick() throttling code.
      The race flow (simplified code):
      
        - perf_throttled_count is per cpu variable and is
          CPU throttling flag, here starting with 0
      
        - perf_throttled_seq is sequence/domain for allowed
          count of interrupts within the tick, gets increased
          each tick
      
          on single CPU (CPU bounded event):
      
            ... workload
      
          perf_event_task_tick:
          |
          | T0    inc(perf_throttled_seq)
          | T1    needs_unthr = xchg(perf_throttled_count, 0) == 0
           tick gets interrupted:
      
                  ... event gets throttled under new seq ...
      
            T2    last NMI comes, event is throttled - inc(perf_throttled_count)
      
           back to tick:
          | perf_adjust_freq_unthr_context:
          |
          | T3    unthrottling is skipped for event (needs_unthr == 0)
          | T4    event is stopped and started via freq adjustment
          |
          tick ends
      
            ... workload
            ... no sample is hit for event ...
      
          perf_event_task_tick:
          |
          | T5    needs_unthr = xchg(perf_throttled_count, 0) != 0 (from T2)
          | T6    unthrottling is done on event (interrupts == MAX_INTERRUPTS)
          |       event is already started (from T4) -> WARN
      
      Fix this by no longer gating the check on needs_unthr, so that
      all events are checked for unthrottling.
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1377355554-8934-1-git-send-email-jolsa@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
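The T0 to T6 sequence in the unthrottling commit above can be replayed with a small single-threaded simulation. Everything here is a simplified model with hypothetical names (`sim_event`, `sim_race()` and so on): the per-CPU counter is an ordinary global, the NMI is a plain function call placed where the interrupt would land, and a "double start" counter stands in for the WARN. The `fixed` parameter switches between gating the unthrottle check on `needs_unthr` (old behaviour) and always checking the event (the behaviour the commit describes).

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_MAX_INTERRUPTS (~0u) /* marker value meaning "event is throttled" */

struct sim_event {
	unsigned int interrupts;
	bool started;
	int double_starts; /* starts of an already started event (the WARN) */
};

static int throttled_count; /* models the per-CPU perf_throttled_count */

static void sim_start(struct sim_event *e)
{
	if (e->started)
		e->double_starts++;
	e->started = true;
}

static void sim_stop(struct sim_event *e)
{
	e->started = false;
}

/* NMI path: throttle the event and flag the CPU. */
static void sim_throttle(struct sim_event *e)
{
	e->interrupts = SIM_MAX_INTERRUPTS;
	sim_stop(e);
	throttled_count++;
}

/* Second half of the tick: unthrottle if needed, then adjust frequency. */
static void sim_adjust_freq_unthr(struct sim_event *e, bool needs_unthr, bool fixed)
{
	if ((fixed || needs_unthr) && e->interrupts == SIM_MAX_INTERRUPTS) {
		e->interrupts = 0;
		sim_start(e);
	}
	sim_stop(e);  /* frequency adjustment restarts the event */
	sim_start(e);
}

/* Replays the two ticks from the commit message; returns the WARN count. */
static int sim_race(bool fixed)
{
	struct sim_event e = { 0, true, 0 };
	bool needs_unthr;

	throttled_count = 0;

	needs_unthr = throttled_count != 0; /* T1: flag sampled as 0 */
	throttled_count = 0;
	sim_throttle(&e);                   /* T2: NMI throttles the event */
	sim_adjust_freq_unthr(&e, needs_unthr, fixed); /* T3, T4 */

	needs_unthr = throttled_count != 0; /* T5: sees the count from T2 */
	throttled_count = 0;
	sim_adjust_freq_unthr(&e, needs_unthr, fixed); /* T6 */

	return e.double_starts;
}
```

In this model `sim_race(false)` reports one double start, matching the WARN at T6, while `sim_race(true)` reports none because the event is unthrottled in the same tick in which it was throttled.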
  15. 30 Aug, 2013 1 commit
    • perf: make events stream always parsable · ff3d527c
      Committed by Adrian Hunter
      The event stream is not always parsable because the format of a sample
      is dependent on the sample_type of the selected event.  When there is
      more than one selected event and the sample_types are not the same then
      parsing becomes problematic.  A sample can be matched to its selected
      event using the ID that is allocated when the event is opened.
      Unfortunately, to get the ID from the sample means first parsing it.
      
      This patch adds a new sample format bit PERF_SAMPLE_IDENTIFIER that puts
      the ID at a fixed position so that the ID can be retrieved without
      parsing the sample.  For sample events, that is the first position
      immediately after the header.  For non-sample events, that is the last
      position.
      
      In this respect parsing samples requires that the sample_type and ID
      values are recorded.  For example, perf tools records struct
      perf_event_attr and the IDs within the perf.data file.  Those must be
      read first before it is possible to parse samples found later in the
      perf.data file.
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Tested-by: Stephane Eranian <eranian@google.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Link: http://lkml.kernel.org/r/1377591794-30553-6-git-send-email-adrian.hunter@intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
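The fixed ID position described in the commit above makes it possible to find a record's event ID before knowing its sample_type. The sketch below assumes every event was opened with the identifier bit set and that non-sample records carry the trailing ID fields; `sketch_header` mirrors the layout of the perf record header, and `record_id()` is a hypothetical helper for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_RECORD_SAMPLE 9 /* record type value used for sample records */

/* Local mirror of the 8-byte perf record header. */
struct sketch_header {
	uint32_t type;
	uint16_t misc;
	uint16_t size; /* total record size in bytes, header included */
};

/*
 * Return the event ID of a raw record without parsing its body:
 * samples carry it immediately after the header, all other records
 * carry it in the last 8 bytes.
 */
static uint64_t record_id(const unsigned char *rec)
{
	struct sketch_header hdr;
	uint64_t id;

	memcpy(&hdr, rec, sizeof(hdr));
	if (hdr.type == SKETCH_RECORD_SAMPLE)
		memcpy(&id, rec + sizeof(hdr), sizeof(id));
	else
		memcpy(&id, rec + hdr.size - sizeof(id), sizeof(id));

	return id;
}
```

Once the ID is known, a parser can look up the matching event's sample_type and only then decode the rest of the record.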
  16. 27 Aug, 2013 1 commit
    • cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax · 35cf0836
      Committed by Tejun Heo
      cgroup_css_from_dir() will grow another user.  In preparation, make
      the following changes.
      
      * All css functions are prefixed with just "css_", rename it to
        css_from_dir().
      
      * Take dentry * instead of file * as dentry is what ultimately
        identifies a cgroup and file may not always be available.  Note that
        the function now checks whether @dentry->d_inode is NULL as the
        caller now may specify a negative dentry.
      
      * Make it take cgroup_subsys * instead of integer subsys_id.  This
        simplifies the function and allows specifying no subsystem for
        cgroup->dummy_css.
      
      * Make return section a bit less verbose.
      
      This patch doesn't introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
  17. 16 Aug, 2013 1 commit