1. 19 May, 2014 3 commits
    • P
      perf: Fix a race between ring_buffer_detach() and ring_buffer_attach() · b69cf536
      Committed by Peter Zijlstra
      Alexander noticed that we use RCU iteration on rb->event_list but do
      not use list_{add,del}_rcu() to add,remove entries to that list, nor
      do we observe proper grace periods when re-using the entries.
      
      Merge ring_buffer_detach() into ring_buffer_attach() such that
      attaching to the NULL buffer is detaching.
      
      Furthermore, ensure that between any 'detach' and 'attach' of the same
      event we observe the required grace period, but only when strictly
      required. In effect this means that only ioctl(.request =
      PERF_EVENT_IOC_SET_OUTPUT) will wait for a grace period, while the
      normal initial attach and final detach will not be delayed.
      
      This patch should, I think, do the right thing under all
      circumstances. The 'normal' cases should never see the extra grace
      period, but these two cases do:
      
       1) PERF_EVENT_IOC_SET_OUTPUT on an event which already has a
          ring_buffer set, will now observe the required grace period between
          removing itself from the old and attaching itself to the new buffer.
      
          This case is 'simple' in that both buffers are present in
          perf_event_set_output(); one could think an unconditional
          synchronize_rcu() would be sufficient; however...
      
       2) an event that has a buffer attached, the buffer is destroyed
          (munmap) and then the event is attached to a new/different buffer
          using PERF_EVENT_IOC_SET_OUTPUT.
      
          This case is more complex because the buffer destruction does:
            ring_buffer_attach(.rb = NULL)
          followed by the ioctl() doing:
            ring_buffer_attach(.rb = foo);
      
          and we still need to observe the grace period between these two
          calls due to us reusing the event->rb_entry list_head.
      
      In order to make case 2 work we use Paul's latest
      cond_synchronize_rcu() call.
      
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Reported-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140507123526.GD13658@twins.programming.kicks-ass.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
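The conditional grace period described in this commit can be illustrated with a small single-threaded model. This is only a sketch: every `model_*` name, the two bookkeeping fields and the counter-based notion of a "grace period" are assumptions made for illustration, not the kernel's RCU implementation or the code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for RCU state; the real kernel machinery is far richer. */
static unsigned long model_gp_completed; /* grace periods completed so far */
static unsigned long model_gp_waits;     /* times a caller had to block */

static unsigned long model_get_state(void)
{
	return model_gp_completed;
}

/* Block for a grace period only if none has elapsed since 'oldstate'. */
static void model_cond_synchronize(unsigned long oldstate)
{
	if (model_gp_completed == oldstate) {
		model_gp_completed++;
		model_gp_waits++;
	}
}

struct model_event {
	void *rb;                  /* currently attached buffer, or NULL */
	int rcu_pending;           /* a detach happened, grace period owed */
	unsigned long rcu_batches; /* snapshot taken at detach time */
};

/* Attaching NULL is detaching; attaching over a buffer detaches first. */
static void model_attach(struct model_event *e, void *rb)
{
	assert(e != NULL);

	if (e->rb) {
		/* Detach: remember when the list entry was removed. */
		e->rcu_batches = model_get_state();
		e->rcu_pending = 1;
	}
	if (e->rcu_pending && rb) {
		/* Re-using the entry: wait only if it is still required. */
		model_cond_synchronize(e->rcu_batches);
		e->rcu_pending = 0;
	}
	e->rb = rb;
}
```

In this model an initial attach and a final detach never wait, while a detach followed by an attach waits unless a grace period has already elapsed in between.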
    • J
      perf: Prevent false warning in perf_swevent_add · 39af6b16
      Committed by Jiri Olsa
      The perf cpu offline callback takes down all cpu context
      events and releases swhash->swevent_hlist.
      
      This can race with a task context software event that is being
      scheduled on this cpu via perf_swevent_add while the cpu hotplug
      code has already cleaned up the event's data.
      
      The race happens in the gap between the cpu notifier code
      and the cpu being actually taken down. Note that only cpu
      ctx events are terminated in the perf cpu hotplug code.
      
      It's easily reproduced with:
        $ perf record -e faults perf bench sched pipe
      
      while putting one of the cpus offline:
        # echo 0 > /sys/devices/system/cpu/cpu1/online
      
      The console emits the following warning:
        WARNING: CPU: 1 PID: 2845 at kernel/events/core.c:5672 perf_swevent_add+0x18d/0x1a0()
        Modules linked in:
        CPU: 1 PID: 2845 Comm: sched-pipe Tainted: G        W    3.14.0+ #256
        Hardware name: Intel Corporation Montevina platform/To be filled by O.E.M., BIOS AMVACRB1.86C.0066.B00.0805070703 05/07/2008
         0000000000000009 ffff880077233ab8 ffffffff81665a23 0000000000200005
         0000000000000000 ffff880077233af8 ffffffff8104732c 0000000000000046
         ffff88007467c800 0000000000000002 ffff88007a9cf2a0 0000000000000001
        Call Trace:
         [<ffffffff81665a23>] dump_stack+0x4f/0x7c
         [<ffffffff8104732c>] warn_slowpath_common+0x8c/0xc0
         [<ffffffff8104737a>] warn_slowpath_null+0x1a/0x20
         [<ffffffff8110fb3d>] perf_swevent_add+0x18d/0x1a0
         [<ffffffff811162ae>] event_sched_in.isra.75+0x9e/0x1f0
         [<ffffffff8111646a>] group_sched_in+0x6a/0x1f0
         [<ffffffff81083dd5>] ? sched_clock_local+0x25/0xa0
         [<ffffffff811167e6>] ctx_sched_in+0x1f6/0x450
         [<ffffffff8111757b>] perf_event_sched_in+0x6b/0xa0
         [<ffffffff81117a4b>] perf_event_context_sched_in+0x7b/0xc0
         [<ffffffff81117ece>] __perf_event_task_sched_in+0x43e/0x460
         [<ffffffff81096f1e>] ? put_lock_stats.isra.18+0xe/0x30
         [<ffffffff8107b3c8>] finish_task_switch+0xb8/0x100
         [<ffffffff8166a7de>] __schedule+0x30e/0xad0
         [<ffffffff81172dd2>] ? pipe_read+0x3e2/0x560
         [<ffffffff8166b45e>] ? preempt_schedule_irq+0x3e/0x70
         [<ffffffff8166b45e>] ? preempt_schedule_irq+0x3e/0x70
         [<ffffffff8166b464>] preempt_schedule_irq+0x44/0x70
         [<ffffffff816707f0>] retint_kernel+0x20/0x30
         [<ffffffff8109e60a>] ? lockdep_sys_exit+0x1a/0x90
         [<ffffffff812a4234>] lockdep_sys_exit_thunk+0x35/0x67
         [<ffffffff81679321>] ? sysret_check+0x5/0x56
      
      Fix this by tracking the cpu hotplug state and displaying
      the WARN only if the current cpu is initialized properly.
      
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: stable@vger.kernel.org
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1396861448-10097-1-git-send-email-jolsa@redhat.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
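The idea of warning only while the cpu is known to be online can be sketched as follows. The struct, the field names and the warning counter are hypothetical stand-ins for illustration; this is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_swhash {
	void *hlist;  /* released by the cpu offline callback */
	bool online;  /* tracked cpu hotplug state */
};

static int model_warn_count; /* stands in for the kernel WARN */

/* Returns 0 on success, -1 if the hash list is gone. */
static int model_swevent_add(struct model_swhash *swhash)
{
	assert(swhash != NULL);

	if (!swhash->hlist) {
		/* A missing list is only a bug while the cpu is online. */
		if (swhash->online)
			model_warn_count++;
		return -1;
	}
	return 0;
}
```

An add that loses the race against cpu offline still fails, but it no longer produces a warning.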
    • P
      perf: Limit perf_event_attr::sample_period to 63 bits · 0819b2e3
      Committed by Peter Zijlstra
      Vince reported that using a large sample_period (one with bit 63 set)
      results in wreckage: while the sample_period is fundamentally
      unsigned (negative periods don't make sense), the way we implement
      things very much relies on signed logic.
      
      So limit sample_period to 63 bits to avoid tripping over this.
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/n/tip-p25fhunibl4y3qi0zuqmyf4b@git.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
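A minimal user-space sketch of the 63-bit limit, assuming the check is a simple test of the top bit. The helper names are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: a period is acceptable only if bit 63 is clear. */
static bool model_sample_period_ok(uint64_t period)
{
	return (period & (UINT64_C(1) << 63)) == 0;
}

/* After validation the value fits in a signed 64-bit counter. */
static int64_t model_period_left(uint64_t period)
{
	assert(model_sample_period_ok(period));
	return (int64_t)period;
}
```

Any period that passes the check converts to a non-negative signed value, which is what signed arithmetic on the period needs.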
  2. 07 May, 2014 2 commits
  3. 27 Feb, 2014 4 commits
  4. 22 Feb, 2014 1 commit
  5. 13 Feb, 2014 1 commit
    • T
      cgroup: drop @skip_css from cgroup_taskset_for_each() · 924f0d9a
      Committed by Tejun Heo
      If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching
      css.  The intention of the interface is to make it easy to skip css's
      (cgroup_subsys_states) which already match the migration target;
      however, this is entirely unnecessary as migration taskset doesn't
      include tasks which are already in the target cgroup.  Drop @skip_css
      from cgroup_taskset_for_each().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
  6. 12 Feb, 2014 1 commit
    • T
      cgroup: improve css_from_dir() into css_tryget_from_dir() · 5a17f543
      Committed by Tejun Heo
      css_from_dir() returns the matching css (cgroup_subsys_state) given a
      dentry and subsystem.  The function doesn't pin the css before
      returning and requires the caller to be holding RCU read lock or
      cgroup_mutex and handling pinning on the caller side.
      
      Given that users of the function are likely to want to pin the
      returned css (both existing users do) and that getting and putting
      css's are very cheap, there's no reason for the interface to be tricky
      like this.
      
      Rename css_from_dir() to css_tryget_from_dir() and make it try to pin
      the found css and return it only if pinning succeeded.  The callers
      are updated so that they no longer do RCU locking and pinning around
      the function and just use the returned css.
      
      This will also ease converting cgroup to kernfs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
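The "look up and pin in one step" interface can be sketched with a plain reference counter. This is a single-threaded illustration with hypothetical names; it leaves out the RCU protection and the atomic reference counting that the real css code relies on.

```c
#include <assert.h>
#include <stddef.h>

struct model_css {
	int refcnt; /* 0 means the object is being torn down */
};

/* Take a reference only if the object is still live. */
static int model_css_tryget(struct model_css *css)
{
	assert(css != NULL);
	if (css->refcnt == 0)
		return 0;
	css->refcnt++;
	return 1;
}

/* Return the looked-up css only if pinning it succeeded. */
static struct model_css *model_css_tryget_from(struct model_css *found)
{
	if (!found || !model_css_tryget(found))
		return NULL;
	return found;
}

/* Drop a reference taken by a successful tryget. */
static void model_css_put(struct model_css *css)
{
	css->refcnt--;
}
```

A caller simply uses the returned pointer and calls the put helper when done; no locking or pinning is needed around the lookup itself.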
  7. 09 Feb, 2014 1 commit
  8. 08 Feb, 2014 1 commit
    • T
      cgroup: clean up cgroup_subsys names and initialization · 073219e9
      Committed by Tejun Heo
      cgroup_subsys is a bit messier than it needs to be.
      
      * The name of a subsys can be different from its internal identifier
        defined in cgroup_subsys.h.  Most subsystems use the matching name
        but three - cpu, memory and perf_event - use different ones.
      
      * cgroup_subsys_id enums are postfixed with _subsys_id and each
        cgroup_subsys is postfixed with _subsys.  cgroup.h is widely
        included throughout various subsystems, it doesn't and shouldn't
        have claim on such generic names which don't have any qualifier
        indicating that they belong to cgroup.
      
      * cgroup_subsys->subsys_id should always equal the matching
        cgroup_subsys_id enum; however, we require each controller to
        initialize it and then BUG if they don't match, which is a bit
        silly.
      
      This patch cleans up cgroup_subsys names and initialization by doing
      the following.
      
      * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
        cgroup_subsys with _cgrp_subsys.
      
      * With the above, renaming subsys identifiers to match the userland
        visible names doesn't cause any naming conflicts.  All non-matching
        identifiers are renamed to match the official names.
      
        cpu_cgroup -> cpu
        mem_cgroup -> memory
        perf -> perf_event
      
      * controllers no longer need to initialize ->subsys_id and ->name.
        They're generated in cgroup core and set automatically during boot.
      
      * Redundant cgroup_subsys declarations removed.
      
      * While updating BUG_ON()s in cgroup_init_early(), convert them to
        WARN()s.  BUGging that early during boot is stupid - the kernel
        can't print anything, even through serial console and the trap
        handler doesn't even link stack frame properly for back-tracing.
      
      This patch doesn't introduce any behavior changes.
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Aristeu Rozanski <aris@redhat.com>
      Acked-by: Ingo Molnar <mingo@redhat.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
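Generating both the identifiers and the names from a single list is commonly done with an X-macro. The sketch below shows that general technique; the `MODEL_*` macros and the three-entry list are assumptions for illustration, not the contents of cgroup_subsys.h.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical single list of controllers; each name appears once. */
#define MODEL_SUBSYS_LIST(X) X(cpu) X(memory) X(perf_event)

/* Generate cpu_cgrp_id, memory_cgrp_id and perf_event_cgrp_id. */
#define MODEL_ID(name) name##_cgrp_id,
enum model_subsys_id {
	MODEL_SUBSYS_LIST(MODEL_ID)
	MODEL_SUBSYS_COUNT
};

/* Generate the matching user-visible names from the same list. */
#define MODEL_NAME(name) [name##_cgrp_id] = #name,
static const char *const model_subsys_names[MODEL_SUBSYS_COUNT] = {
	MODEL_SUBSYS_LIST(MODEL_NAME)
};

static const char *model_subsys_name(enum model_subsys_id id)
{
	assert((int)id >= 0 && (int)id < MODEL_SUBSYS_COUNT);
	return model_subsys_names[id];
}
```

Because the id and the name come from the same list entry, they cannot drift apart, so there is nothing left for a controller to initialize or for the core to cross-check.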
  9. 12 Jan, 2014 2 commits
  10. 17 Dec, 2013 2 commits
  11. 27 Nov, 2013 1 commit
  12. 19 Nov, 2013 1 commit
  13. 13 Nov, 2013 1 commit
  14. 06 Nov, 2013 1 commit
  15. 29 Oct, 2013 5 commits
  16. 18 Oct, 2013 1 commit
    • S
      perf: Disable PERF_RECORD_MMAP2 support · 3090ffb5
      Committed by Stephane Eranian
      For now, we disable the extended MMAP record support (MMAP2).
      
      We have identified cases where it would not report the correct mapping
      information, such as clone(CLONE_VM) with separate pids.  We will
      revisit the support once we find a solution for this case.
      
      The patch changes the kernel to return EINVAL if attr->mmap2 is set. The
      patch also modifies the perf tool to use regular PERF_RECORD_MMAP for
      synthetic events and it also prevents the tool from requesting
      attr->mmap2 mode because the kernel would reject it.
      
      The support will be revisited once the kernel interface is updated.
      
      In V2, we reduce the patch to the strict minimum.
      
      In V3, we avoid calling perf_event_open() with mmap2 set because we know
      it will fail and require fallback retry.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20131017173215.GA8820@quad
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  17. 04 Oct, 2013 3 commits
    • A
      perf: Add generic transaction flags · fdfbbd07
      Committed by Andi Kleen
      Add a generic qualifier for transaction events, as a new sample
      type that returns a flag word. This is particularly useful
      for qualifying aborts: to distinguish aborts which happen
      due to asynchronous events (like conflicts caused by another
      CPU) versus instructions that lead to an abort.
      
      The tuning strategies are very different for those cases,
      so it's important to distinguish them easily and early.
      
      Since it's inconvenient and inflexible to filter for this
      in the kernel we report all the events out and allow
      some post processing in user space.
      
      The flags are based on the Intel TSX events, but should be fairly
      generic and mostly applicable to other HTM architectures too. In addition
      to various flag words there's also reserved space to report a
      program-supplied abort code. For TSX this is used to distinguish specific
      classes of aborts, like a lock busy abort when doing lock elision.
      
      Flags:
      
      Elision and generic transactions                            (ELISION vs TRANSACTION)
      (HLE vs RTM on TSX; IBM etc. would likely only use TRANSACTION)
      Aborts caused by current thread vs aborts caused by others  (SYNC vs ASYNC)
      Retryable transaction                                       (RETRY)
      Conflicts with other threads                                (CONFLICT)
      Transaction write capacity overflow                         (CAPACITY WRITE)
      Transaction read capacity overflow                          (CAPACITY READ)
      
      Transactions implicitly aborted can also return an abort code.
      This can be used to signal specific events to the profiler. A common
      case is abort on lock busy in an RTM eliding library (code 0xff).
      To handle this case we include the TSX abort code.
      
      Common example aborts in TSX would be:
      
      - Data conflict with another thread on memory read.
                                            Flags: TRANSACTION|ASYNC|CONFLICT
      - executing a WRMSR in a transaction. Flags: TRANSACTION|SYNC
      - HLE transaction in user space is too large
                                            Flags: ELISION|SYNC|CAPACITY-WRITE
      
      The only flag that is somewhat TSX specific is ELISION.
      
      This adds the perf core glue needed for reporting the new flag word out.
      
      v2: Add MEM/MISC
      v3: Move transaction to the end
      v4: Separate capacity-read/write and remove misc
      v5: Remove _SAMPLE. Move abort flags to 32bit. Rename
          transaction to txn
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1379688044-14173-2-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • K
      perf: Enforce 1 as lower limit for perf_event_max_sample_rate · 723478c8
      Committed by Knut Petersen
      /proc/sys/kernel/perf_event_max_sample_rate will accept
      negative values as well as 0.
      
      Negative values are unreasonable, and 0 causes a
      divide by zero exception in perf_proc_update_handler.
      
      This patch enforces a lower limit of 1.
      Signed-off-by: Knut Petersen <Knut_Petersen@t-online.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/5242DB0C.4070005@t-online.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
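A minimal sketch of why the lower limit matters, assuming the update handler derives a per-sample period by dividing a constant by the configured rate. All names and the starting value of 100000 are illustrative assumptions, not the kernel's sysctl code.

```c
#include <assert.h>

#define MODEL_NSEC_PER_SEC 1000000000L

static int model_max_sample_rate = 100000;
static long model_sample_period_ns = MODEL_NSEC_PER_SEC / 100000;

/* Returns 0 on success, -1 if the rate is rejected. */
static int model_set_max_sample_rate(int rate)
{
	if (rate < 1)
		return -1; /* enforce the lower limit before dividing */

	model_max_sample_rate = rate;
	model_sample_period_ns = MODEL_NSEC_PER_SEC / rate;
	assert(model_sample_period_ns >= 0);
	return 0;
}
```

Rejecting the value up front keeps the previous setting intact and guarantees that the division never sees zero or a negative divisor.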
    • P
      perf: Fix perf_pmu_migrate_context · 9886167d
      Committed by Peter Zijlstra
      While auditing the list_entry usage due to a trinity bug I found that
      perf_pmu_migrate_context violates the rules for
      perf_event::event_entry.
      
      The problem is that perf_event::event_entry is a RCU list element, and
      hence we must wait for a full RCU grace period before re-using the
      element after deletion.
      
      Therefore the usage in perf_pmu_migrate_context() which re-uses the
      entry immediately is broken. For now introduce another list_head into
      perf_event for this specific usage.
      
      This doesn't actually fix the trinity report because that never goes
      through this code.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-mkj72lxagw1z8fvjm648iznw@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  18. 27 Sep, 2013 1 commit
  19. 20 Sep, 2013 1 commit
    • P
      perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page' · fa731587
      Committed by Peter Zijlstra
      Solve the problems around the broken definition of perf_event_mmap_page::
      cap_usr_time and cap_usr_rdpmc fields which used to overlap, partially
      fixed by:
      
        860f085b ("perf: Fix broken union in 'struct perf_event_mmap_page'")
      
      The problem with the fix (merged in v3.12-rc1 and not yet released
      officially), noticed by Vince Weaver is that the new behavior is
      not detectable by new user-space, and that due to the reuse of the
      field names it's easy to mis-compile a binary if old headers are used
      on a new kernel or new headers are used on an old kernel.
      
      To solve all that make this change explicit, detectable and self-contained,
      by iterating the ABI the following way:
      
       - Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not
         confuse old user-space binaries. RDPMC will be marked as unavailable
         to old binaries but that's within the ABI, this is a capability bit.
      
       - Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new
         libraries can reliably detect that bit 0 is deprecated and perma-zero
         without having to check the kernel version.
      
       - Use bits 2, 3, 4 for the newly defined, correct functionality:
      
      	cap_user_rdpmc		: 1, /* The RDPMC instruction can be used to read counts */
      	cap_user_time		: 1, /* The time_* fields are used */
      	cap_user_time_zero	: 1, /* The time_zero field is used */
      
       - Rename all the bitfield names in perf_event.h to be different from the
         old names, to make sure it's not possible to mis-compile it
         accidentally with old assumptions.
      
      The 'size' field can then be used in the future to add new fields and it
      will act as a natural ABI version indicator as well.
      
      Also adjust tools/perf/ userspace for the new definitions, noticed by
      Adrian Hunter.
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Also-Fixed-by: Adrian Hunter <adrian.hunter@intel.com>
      Link: http://lkml.kernel.org/n/tip-zr03yxjrpXesOzzupszqglbv@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
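How a library might use the new detection bit can be sketched on a plain 64-bit capabilities word. The `MODEL_*` masks follow the bit numbers listed in this commit message, with bit 0 taken as the least significant bit; that numbering and the helper itself are assumptions for illustration, and real code would read the fields of struct perf_event_mmap_page instead.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as listed in the commit message (bit 0 = LSB, assumed). */
#define MODEL_CAP_BIT0               (UINT64_C(1) << 0)
#define MODEL_CAP_BIT0_IS_DEPRECATED (UINT64_C(1) << 1)
#define MODEL_CAP_USER_RDPMC         (UINT64_C(1) << 2)
#define MODEL_CAP_USER_TIME          (UINT64_C(1) << 3)
#define MODEL_CAP_USER_TIME_ZERO     (UINT64_C(1) << 4)

/* 1: RDPMC usable, 0: not usable, -1: old ABI, bits alone cannot tell. */
static int model_rdpmc_usable(uint64_t capabilities)
{
	if (!(capabilities & MODEL_CAP_BIT0_IS_DEPRECATED))
		return -1;

	/* Under the new ABI bit 0 is described as always clear. */
	assert(!(capabilities & MODEL_CAP_BIT0));
	return (capabilities & MODEL_CAP_USER_RDPMC) ? 1 : 0;
}
```

The point of the always-set marker bit is that the decision needs no kernel version check: if the marker is present, bits 2 to 4 carry the well-defined meaning.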
  20. 11 Sep, 2013 1 commit
    • A
      perf: Fix up MMAP2 buffer space reservation · d008d525
      Committed by Arnaldo Carvalho de Melo
      The ino_generation field was added in the PERF_RECORD_MMAP2 record in
      the 13d7a241 cset but no space for it was allocated, corrupting the
      PERF_FORMAT_{TIME,CPU,TID,etc} area (sample_type/sample_id_all), fix it.
      
      Detected with one of the regression tests done by 'perf test':
      
        [root@sandy ~]# perf test -v 7
         7: Validate PERF_RECORD_* events & perf_sample fields     :
        --- start ---
        61315294449606 0 PERF_RECORD_SAMPLE
        61315294453161 0 PERF_RECORD_SAMPLE
        61315294454441 0 PERF_RECORD_SAMPLE
        61315294455709 0 PERF_RECORD_SAMPLE
        61315295600899 0 PERF_RECORD_COMM: sleep:6500
        27917287430500 342521613 PERF_RECORD_MMAP2 6500/6500: [0x400000(0x7000) @ 0 00:1d 311442 9016]: /usr/bin/sleep
        MMAP2 going backwards in time, prev=61315295600899, curr=27917287430500
        MMAP2 with unexpected cpu, expected 0, got 342521613
        MMAP2 with unexpected pid, expected 6500, got 1701606191
        MMAP2 with unexpected tid, expected 6500, got 28773
        27917287430500 342561333 PERF_RECORD_MMAP2 6500/6500: [0x3b7e000000(0x223000) @ 0 00:1d 309186 9016]: /usr/lib64/ld-2.16.so
        MMAP2 with unexpected cpu, expected 0, got 342561333
        MMAP2 with unexpected pid, expected 6500, got 1932408369
        MMAP2 with unexpected tid, expected 6500, got 111
        27917287430500 342600095 PERF_RECORD_MMAP2 6500/6500: [0x7fffbd7dc000(0x1000) @ 0x7fffbd7dc000 00:00 0 0]: [vdso]
        MMAP2 with unexpected cpu, expected 0, got 342600095
        MMAP2 with unexpected pid, expected 6500, got 1935963739
        MMAP2 with unexpected tid, expected 6500, got 23919
        27917287430500 342882834 PERF_RECORD_MMAP2 6500/6500: [0x3b7e400000(0x3b8000) @ 0 00:1d 309187 9016]: /usr/lib64/libc-2.16.so
        MMAP2 with unexpected cpu, expected 0, got 342882834
        MMAP2 with unexpected pid, expected 6500, got 909192754
        MMAP2 with unexpected tid, expected 6500, got 7303982
        61316297195411 0 PERF_RECORD_EXIT(6500:6500):(6500:6500)
        ---- end ----
        Validate PERF_RECORD_* events & perf_sample fields: FAILED!
        [root@sandy ~]#
      
      After this patch:
      
        [root@sandy ~]# perf test 7
         7: Validate PERF_RECORD_* events & perf_sample fields     : Ok
        [root@sandy ~]#
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Stephane Eranian <eranian@google.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Link: http://lkml.kernel.org/n/tip-heeuv986b8ha7whqg4o3he7c@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  21. 02 Sep, 2013 2 commits
    • S
      perf: Add attr->mmap2 attribute to an event · 13d7a241
      Committed by Stephane Eranian
      Adds a new PERF_RECORD_MMAP2 record type which is in essence
      an expanded version of PERF_RECORD_MMAP.
      
      Used to request mmap records with more information about
      the mapping, including device major, minor and the inode
      number and generation for mappings associated with files
      or shared memory segments. Works for code and data
      (with attr->mmap_data set).
      
      Existing PERF_RECORD_MMAP record is unmodified by this patch.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: http://lkml.kernel.org/r/1377079825-19057-2-git-send-email-eranian@google.com
      [ Added Al to the Cc:. Are the ino, maj/min exports of vma->vm_file OK? ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • J
      perf: Prevent race in unthrottling code · ae23bff1
      Committed by Jiri Olsa
      The current throttling code triggers WARN below via following
      workload (only hit on AMD machine with 48 CPUs):
      
        # while [ 1 ]; do perf record perf bench sched messaging; done
      
        WARNING: at arch/x86/kernel/cpu/perf_event.c:1054 x86_pmu_start+0xc6/0x100()
        SNIP
        Call Trace:
         <IRQ>  [<ffffffff815f62d6>] dump_stack+0x19/0x1b
         [<ffffffff8105f531>] warn_slowpath_common+0x61/0x80
         [<ffffffff8105f60a>] warn_slowpath_null+0x1a/0x20
         [<ffffffff810213a6>] x86_pmu_start+0xc6/0x100
         [<ffffffff81129dd2>] perf_adjust_freq_unthr_context.part.75+0x182/0x1a0
         [<ffffffff8112a058>] perf_event_task_tick+0xc8/0xf0
         [<ffffffff81093221>] scheduler_tick+0xd1/0x140
         [<ffffffff81070176>] update_process_times+0x66/0x80
         [<ffffffff810b9565>] tick_sched_handle.isra.15+0x25/0x60
         [<ffffffff810b95e1>] tick_sched_timer+0x41/0x60
         [<ffffffff81087c24>] __run_hrtimer+0x74/0x1d0
         [<ffffffff810b95a0>] ? tick_sched_handle.isra.15+0x60/0x60
         [<ffffffff81088407>] hrtimer_interrupt+0xf7/0x240
         [<ffffffff81606829>] smp_apic_timer_interrupt+0x69/0x9c
         [<ffffffff8160569d>] apic_timer_interrupt+0x6d/0x80
         <EOI>  [<ffffffff81129f74>] ? __perf_event_task_sched_in+0x184/0x1a0
         [<ffffffff814dd937>] ? kfree_skbmem+0x37/0x90
         [<ffffffff815f2c47>] ? __slab_free+0x1ac/0x30f
         [<ffffffff8118143d>] ? kfree+0xfd/0x130
         [<ffffffff81181622>] kmem_cache_free+0x1b2/0x1d0
         [<ffffffff814dd937>] kfree_skbmem+0x37/0x90
         [<ffffffff814e03c4>] consume_skb+0x34/0x80
         [<ffffffff8158b057>] unix_stream_recvmsg+0x4e7/0x820
         [<ffffffff814d5546>] sock_aio_read.part.7+0x116/0x130
         [<ffffffff8112c10c>] ? __perf_sw_event+0x19c/0x1e0
         [<ffffffff814d5581>] sock_aio_read+0x21/0x30
         [<ffffffff8119a5d0>] do_sync_read+0x80/0xb0
         [<ffffffff8119ac85>] vfs_read+0x145/0x170
         [<ffffffff8119b699>] SyS_read+0x49/0xa0
         [<ffffffff810df516>] ? __audit_syscall_exit+0x1f6/0x2a0
         [<ffffffff81604a19>] system_call_fastpath+0x16/0x1b
        ---[ end trace 622b7e226c4a766a ]---
      
      The reason is a race in perf_event_task_tick() throttling code.
      The race flow (simplified code):
      
        - perf_throttled_count is per cpu variable and is
          CPU throttling flag, here starting with 0
      
        - perf_throttled_seq is sequence/domain for allowed
          count of interrupts within the tick, gets increased
          each tick
      
          on single CPU (CPU bounded event):
      
            ... workload
      
          perf_event_task_tick:
          |
          | T0    inc(perf_throttled_seq)
          | T1    needs_unthr = xchg(perf_throttled_count, 0) == 0
           tick gets interrupted:
      
                  ... event gets throttled under new seq ...
      
            T2    last NMI comes, event is throttled - inc(perf_throttled_count)
      
           back to tick:
          | perf_adjust_freq_unthr_context:
          |
          | T3    unthrottling is skipped for event (needs_unthr == 0)
          | T4    event is stopped and started via freq adjustment
          |
          tick ends
      
            ... workload
            ... no sample is hit for event ...
      
          perf_event_task_tick:
          |
          | T5    needs_unthr = xchg(perf_throttled_count, 0) != 0 (from T2)
          | T6    unthrottling is done on event (interrupts == MAX_INTERRUPTS)
          |       event is already started (from T4) -> WARN
      
      Fix this by not checking needs_unthr again, and thus
      checking all events for unthrottling.
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1377355554-8934-1-git-send-email-jolsa@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
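The T1 to T6 timeline in this commit message can be replayed in a small single-threaded model. Everything here is an illustrative assumption: the `model_*` names, the boolean "running" state and the `fixed` switch are not kernel code, they only mirror the steps as the message describes them.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_INTERRUPTS (~0UL)

struct model_event {
	unsigned long interrupts; /* MODEL_MAX_INTERRUPTS while throttled */
	bool running;
};

static int model_throttled_count; /* stands in for perf_throttled_count */
static int model_warnings;        /* counts "event already started" */

static void model_start(struct model_event *e)
{
	if (e->running)
		model_warnings++; /* starting a running event is the WARN */
	e->running = true;
}

static void model_stop(struct model_event *e)
{
	e->running = false;
}

/* T2: an NMI throttles the event. */
static void model_nmi_throttle(struct model_event *e)
{
	e->interrupts = MODEL_MAX_INTERRUPTS;
	model_throttled_count++;
	model_stop(e);
}

/* T1/T5: needs_unthr = xchg(perf_throttled_count, 0). */
static int model_tick_begin(void)
{
	int throttled = model_throttled_count;

	model_throttled_count = 0;
	return throttled;
}

/* T3/T4 and T6: unthrottle, then restart via frequency adjustment. */
static void model_tick_end(struct model_event *e, int needs_unthr, bool fixed)
{
	if ((fixed || needs_unthr) && e->interrupts == MODEL_MAX_INTERRUPTS) {
		e->interrupts = 0;
		model_start(e);
	}
	model_stop(e);
	model_start(e);
}

/* Replays the timeline and returns the number of warnings seen. */
static int model_run(bool fixed)
{
	struct model_event e = { 0, true };
	int needs_unthr;

	model_throttled_count = 0;
	model_warnings = 0;

	needs_unthr = model_tick_begin();       /* T1 */
	model_nmi_throttle(&e);                 /* T2 */
	model_tick_end(&e, needs_unthr, fixed); /* T3, T4 */
	needs_unthr = model_tick_begin();       /* T5 */
	model_tick_end(&e, needs_unthr, fixed); /* T6 */

	assert(e.running);
	return model_warnings;
}
```

With `fixed` set to false the stale count from T2 triggers a second start at T6; with `fixed` set to true the event is unthrottled at T3 regardless of the snapshot taken at T1, so no double start occurs.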
  22. 30 Aug, 2013 1 commit
    • A
      perf: make events stream always parsable · ff3d527c
      Committed by Adrian Hunter
      The event stream is not always parsable because the format of a sample
      is dependent on the sample_type of the selected event.  When there is
      more than one selected event and the sample_types are not the same then
      parsing becomes problematic.  A sample can be matched to its selected
      event using the ID that is allocated when the event is opened.
      Unfortunately, to get the ID from the sample means first parsing it.
      
      This patch adds a new sample format bit PERF_SAMPLE_IDENTIFIER that puts
      the ID at a fixed position so that the ID can be retrieved without
      parsing the sample.  For sample events, that is the first position
      immediately after the header.  For non-sample events, that is the last
      position.
      
      In this respect parsing samples requires that the sample_type and ID
      values are recorded.  For example, perf tools records struct
      perf_event_attr and the IDs within the perf.data file.  Those must be
      read first before it is possible to parse samples found later in the
      perf.data file.
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Tested-by: Stephane Eranian <eranian@google.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Link: http://lkml.kernel.org/r/1377591794-30553-6-git-send-email-adrian.hunter@intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
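Retrieving the ID from its fixed position, without parsing the rest of the record, could look roughly like this. The sketch assumes every record carries the ID as described above; `struct model_header` is a local mirror of the record header (type, misc, size) and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_RECORD_SAMPLE 9 /* value of PERF_RECORD_SAMPLE in the uapi header */

/* Local mirror of the record header: type, misc, total size in bytes. */
struct model_header {
	uint32_t type;
	uint16_t misc;
	uint16_t size;
};

/* Fetch the ID without parsing the rest of the record. */
static uint64_t model_record_id(const void *record)
{
	const unsigned char *p = record;
	struct model_header h;
	uint64_t id;

	memcpy(&h, p, sizeof(h));
	assert(h.size >= sizeof(h) + sizeof(id));

	if (h.type == MODEL_RECORD_SAMPLE)
		memcpy(&id, p + sizeof(h), sizeof(id));        /* first position */
	else
		memcpy(&id, p + h.size - sizeof(id), sizeof(id)); /* last position */
	return id;
}
```

Once the ID is known, a reader can look up the matching event and its sample_type and only then parse the remaining fields.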
  23. 27 Aug, 2013 1 commit
    • T
      cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax · 35cf0836
      Committed by Tejun Heo
      cgroup_css_from_dir() will grow another user.  In preparation, make
      the following changes.
      
      * All css functions are prefixed with just "css_", rename it to
        css_from_dir().
      
      * Take dentry * instead of file * as dentry is what ultimately
        identifies a cgroup and file may not always be available.  Note that
        the function now checks whether @dentry->d_inode is NULL as the
        caller now may specify a negative dentry.
      
      * Make it take cgroup_subsys * instead of integer subsys_id.  This
        simplifies the function and allows specifying no subsystem for
        cgroup->dummy_css.
      
      * Make return section a bit less verbose.
      
      This patch doesn't introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
  24. 16 Aug, 2013 2 commits