1. 09 2月, 2013 1 次提交
    • O
      perf: Introduce hw_perf_event->tp_target and ->tp_list · f22c1bb6
      Oleg Nesterov 提交于
      sys_perf_event_open()->perf_init_event(event) is called before
      find_get_context(event), this means that event->ctx == NULL when
      class->reg(TRACE_REG_PERF_REGISTER/OPEN) is called and thus it
      can't know if this event is per-task or system-wide.
      
      This patch adds hw_perf_event->tp_target for PERF_TYPE_TRACEPOINT,
      this is analogous to PERF_TYPE_BREAKPOINT/bp_target we already have.
      The patch also moves ->bp_target up so that it can overlap with the
      new member, this can help the compiler to generate the better code.
      
      trace_uprobe_register() will use it for prefiltering to avoid the
      unnecessary breakpoints in mm's we do not want to trace.
      
      ->tp_target doesn't have its own reference, but we can rely on the
      fact that either sys_perf_event_open() holds a reference, or it is
      equal to event->ctx->task. So this pointer is always valid until
      free_event().
      
      Also add the "struct list_head tp_list" into this union. It is not
      strictly necessary, but it can simplify the next changes and we can
      add it for free.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      f22c1bb6
  2. 20 11月, 2012 1 次提交
  3. 19 11月, 2012 1 次提交
    • E
      pidns: Use task_active_pid_ns where appropriate · 17cf22c3
      Eric W. Biederman 提交于
      The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
      aka ns_of_pid(task_pid(tsk)) should have the same number of
      cache line misses with the practical difference that
      ns_of_pid(task_pid(tsk)) is released later in a processes life.
      
      Furthermore by using task_active_pid_ns it becomes trivial
      to write an unshare implementation for the the pid namespace.
      
      So I have used task_active_pid_ns everywhere I can.
      
      In fork since the pid has not yet been attached to the
      process I use ns_of_pid, to achieve the same effect.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      17cf22c3
  4. 09 10月, 2012 1 次提交
    • K
      mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Konstantin Khlebnikov 提交于
      A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
      currently it lost original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes reserved_vm counter from mm_struct.  Seems like nobody
      cares about it, it does not exported into userspace directly, it only
      reduces total_vm showed in proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
      
      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      314e51b9
  5. 05 10月, 2012 2 次提交
  6. 27 9月, 2012 2 次提交
  7. 15 9月, 2012 1 次提交
    • T
      cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them · 8c7f6edb
      Tejun Heo 提交于
      Currently, cgroup hierarchy support is a mess.  cpu related subsystems
      behave correctly - configuration, accounting and control on a parent
      properly cover its children.  blkio and freezer completely ignore
      hierarchy and treat all cgroups as if they're directly under the root
      cgroup.  Others show yet different behaviors.
      
      These differing interpretations of cgroup hierarchy make using cgroup
      confusing and it impossible to co-mount controllers into the same
      hierarchy and obtain sane behavior.
      
      Eventually, we want full hierarchy support from all subsystems and
      probably a unified hierarchy.  Users using separate hierarchies
      expecting completely different behaviors depending on the mounted
      subsystem is deterimental to making any progress on this front.
      
      This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
      for controllers which are lacking in hierarchy support.  The goal of
      this patch is two-fold.
      
      * Move users away from using hierarchy on currently non-hierarchical
        subsystems, so that implementing proper hierarchy support on those
        doesn't surprise them.
      
      * Keep track of which controllers are broken how and nudge the
        subsystems to implement proper hierarchy support.
      
      For now, start with a single warning message.  We can whine louder
      later on.
      
      v2: Fixed a typo spotted by Michal. Warning message updated.
      
      v3: Updated memcg part so that it doesn't generate warning in the
          cases where .use_hierarchy=false doesn't make the behavior
          different from root.use_hierarchy=true.  Fixed a typo spotted by
          Glauber.
      
      v4: Check ->broken_hierarchy after cgroup creation is complete so that
          ->create() can affect the result per Michal.  Dropped unnecessary
          memcg root handling per Michal.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NSerge E. Hallyn <serue@us.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      8c7f6edb
  8. 04 9月, 2012 2 次提交
  9. 10 8月, 2012 2 次提交
    • J
      perf: Add ability to attach user stack dump to sample · c5ebcedb
      Jiri Olsa 提交于
      Introducing PERF_SAMPLE_STACK_USER sample type bit to trigger the dump
      of the user level stack on sample. The size of the dump is specified by
      sample_stack_user value.
      
      Being able to dump parts of the user stack, starting from the stack
      pointer, will be useful to make a post mortem dwarf CFI based stack
      unwinding.
      
      Added HAVE_PERF_USER_STACK_DUMP config option to determine if the
      architecture provides user stack dump on perf event samples.  This needs
      access to the user stack pointer which is not unified across
      architectures. Enabling this for x86 architecture.
      Signed-off-by: NJiri Olsa <jolsa@redhat.com>
      Original-patch-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: "Frank Ch. Eigler" <fche@redhat.com>
      Cc: Arun Sharma <asharma@fb.com>
      Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Frank Ch. Eigler <fche@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Tom Zanussi <tzanussi@gmail.com>
      Cc: Ulrich Drepper <drepper@gmail.com>
      Link: http://lkml.kernel.org/r/1344345647-11536-6-git-send-email-jolsa@redhat.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      c5ebcedb
    • J
      perf: Add ability to attach user level registers dump to sample · 4018994f
      Jiri Olsa 提交于
      Introducing PERF_SAMPLE_REGS_USER sample type bit to trigger the dump of
      user level registers on sample. Registers we want to dump are specified
      by sample_regs_user bitmask.
      
      Only user level registers are dumped at the moment. Meaning the register
      values of the user space context as it was before the user entered the
      kernel for whatever reason (syscall, irq, exception, or a PMI happening
      in userspace).
      
      The layout of the sample_regs_user bitmap is described in
      asm/perf_regs.h for archs that support register dump.
      
      This is going to be useful to bring Dwarf CFI based stack unwinding on
      top of samples.
      Original-patch-by: NFrederic Weisbecker <fweisbec@gmail.com>
      [ Dump registers ABI specification. ]
      Signed-off-by: NJiri Olsa <jolsa@redhat.com>
      Suggested-by: NStephane Eranian <eranian@google.com>
      Cc: "Frank Ch. Eigler" <fche@redhat.com>
      Cc: Arun Sharma <asharma@fb.com>
      Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Frank Ch. Eigler <fche@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Tom Zanussi <tzanussi@gmail.com>
      Cc: Ulrich Drepper <drepper@gmail.com>
      Link: http://lkml.kernel.org/r/1344345647-11536-3-git-send-email-jolsa@redhat.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      4018994f
  10. 31 7月, 2012 1 次提交
  11. 18 6月, 2012 4 次提交
  12. 01 6月, 2012 1 次提交
  13. 23 5月, 2012 1 次提交
    • J
      Revert "sched, perf: Use a single callback into the scheduler" · ab0cce56
      Jiri Olsa 提交于
      This reverts commit cb04ff9a ("sched, perf: Use a single
      callback into the scheduler").
      
      Before this change was introduced, the process switch worked
      like this (wrt. to perf event schedule):
      
           schedule (prev, next)
             - schedule out all perf events for prev
             - switch to next
             - schedule in all perf events for current (next)
      
      After the commit, the process switch looks like:
      
           schedule (prev, next)
             - schedule out all perf events for prev
             - schedule in all perf events for (next)
             - switch to next
      
      The problem is, that after we schedule perf events in, the pmu
      is enabled and we can receive events even before we make the
      switch to next - so "current" still being prev process (event
      SAMPLE data are filled based on the value of the "current"
      process).
      
      Thats exactly what we see for test__PERF_RECORD test. We receive
      SAMPLES with PID of the process that our tracee is scheduled
      from.
      
      Discussed with Peter Zijlstra:
      
       > Bah!, yeah I guess reverting is the right thing for now. Sad
       > though.
       >
       > So by having the two hooks we have a black-spot between them
       > where we receive no events at all, this black-spot covers the
       > hand-over of current and we thus don't receive the 'wrong'
       > events.
       >
       > I rather liked we could do away with both that black-spot and
       > clean up the code a little, but apparently people rely on it.
      Signed-off-by: NJiri Olsa <jolsa@redhat.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: acme@redhat.com
      Cc: paulus@samba.org
      Cc: cjashfor@linux.vnet.ibm.com
      Cc: fweisbec@gmail.com
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/20120523111302.GC1638@m.brq.redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ab0cce56
  14. 09 5月, 2012 2 次提交
  15. 26 4月, 2012 2 次提交
  16. 24 3月, 2012 1 次提交
  17. 23 3月, 2012 1 次提交
    • P
      perf: Fix mmap_page capabilities and docs · c7206205
      Peter Zijlstra 提交于
      Complete the syscall-less self-profiling feature and address
      all complaints, namely:
      
       - capabilities, so we can detect what is actually available at runtime
      
           Add a capabilities field to perf_event_mmap_page to indicate
           what is actually available for use.
      
       - on x86: RDPMC weirdness due to being 40/48 bits and not sign-extending
         properly.
      
       - ABI documentation as to how all this stuff works.
      
      Also improve the documentation for the new features.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vweaver1@eecs.utk.edu>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: http://lkml.kernel.org/r/1332433596.2487.33.camel@twinsSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c7206205
  18. 05 3月, 2012 3 次提交
    • S
      perf: Add callback to flush branch_stack on context switch · d010b332
      Stephane Eranian 提交于
      With branch stack sampling, it is possible to filter by priv levels.
      
      In system-wide mode, that means it is possible to capture only user
      level branches. The builtin SW LBR filter needs to disassemble code
      based on LBR captured addresses. For that, it needs to know the task
      the addresses are associated with. Because of context switches, the
      content of the branch stack buffer may contain addresses from
      different tasks.
      
      We need a callback on context switch to either flush the branch stack
      or save it. This patch adds a new callback in struct pmu which is called
      during context switches. The callback is called only when necessary.
      That is when a system-wide context has, at least, one event which
      uses PERF_SAMPLE_BRANCH_STACK. The callback is never called for
      per-thread context.
      
      In this version, the Intel x86 code simply flushes (resets) the LBR
      on context switches (fills it with zeroes). Those zeroed branches are
      then filtered out by the SW filter.
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1328826068-11713-11-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
      d010b332
    • S
      perf: Disable PERF_SAMPLE_BRANCH_* when not supported · 2481c5fa
      Stephane Eranian 提交于
      PERF_SAMPLE_BRANCH_* is disabled for:
      
       - SW events (sw counters, tracepoints)
       - HW breakpoints
       - ALL but Intel x86 architecture
       - AMD64 processors
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1328826068-11713-10-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
      2481c5fa
    • S
      perf: Add generic taken branch sampling support · bce38cd5
      Stephane Eranian 提交于
      This patch adds the ability to sample taken branches to the
      perf_event interface.
      
      The ability to capture taken branches is very useful for all
      sorts of analysis. For instance, basic block profiling, call
      counts, statistical call graph.
      
      This new capability requires hardware assist and as such may
      not be available on all HW platforms. On Intel x86 it is
      implemented on top of the Last Branch Record (LBR) facility.
      
      To enable taken branches sampling, the PERF_SAMPLE_BRANCH_STACK
      bit must be set in attr->sample_type.
      
      Sampled taken branches may be filtered by type and/or priv
      levels.
      
      The patch adds a new field, called branch_sample_type, to the
      perf_event_attr structure. It contains a bitmask of filters
      to apply to the sampled taken branches.
      
      Filters may be implemented in HW. If the HW filter does not exist
      or is not good enough, some arch may also implement a SW filter.
      
      The following generic filters are currently defined:
      - PERF_SAMPLE_USER
        only branches whose targets are at the user level
      
      - PERF_SAMPLE_KERNEL
        only branches whose targets are at the kernel level
      
      - PERF_SAMPLE_HV
        only branches whose targets are at the hypervisor level
      
      - PERF_SAMPLE_ANY
        any type of branches (subject to priv levels filters)
      
      - PERF_SAMPLE_ANY_CALL
        any call branches (may incl. syscall on some arch)
      
      - PERF_SAMPLE_ANY_RET
        any return branches (may incl. syscall returns on some arch)
      
      - PERF_SAMPLE_IND_CALL
        indirect call branches
      
      Obviously filter may be combined. The priv level bits are optional.
      If not provided, the priv level of the associated event are used. It
      is possible to collect branches at a priv level different from the
      associated event. Use of kernel, hv priv levels is subject to permissions
      and availability (hv).
      
      The number of taken branch records present in each sample may vary based
      on HW, the type of sampled branches, the executed code. Therefore
      each sample contains the number of taken branches it contains.
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1328826068-11713-2-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
      bce38cd5
  19. 24 2月, 2012 1 次提交
    • I
      static keys: Introduce 'struct static_key', static_key_true()/false() and... · c5905afb
      Ingo Molnar 提交于
      static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]()
      
      So here's a boot tested patch on top of Jason's series that does
      all the cleanups I talked about and turns jump labels into a
      more intuitive to use facility. It should also address the
      various misconceptions and confusions that surround jump labels.
      
      Typical usage scenarios:
      
              #include <linux/static_key.h>
      
              struct static_key key = STATIC_KEY_INIT_TRUE;
      
              if (static_key_false(&key))
                      do unlikely code
              else
                      do likely code
      
      Or:
      
              if (static_key_true(&key))
                      do likely code
              else
                      do unlikely code
      
      The static key is modified via:
      
              static_key_slow_inc(&key);
              ...
              static_key_slow_dec(&key);
      
      The 'slow' prefix makes it abundantly clear that this is an
      expensive operation.
      
      I've updated all in-kernel code to use this everywhere. Note
      that I (intentionally) have not pushed through the rename
      blindly through to the lowest levels: the actual jump-label
      patching arch facility should be named like that, so we want to
      decouple jump labels from the static-key facility a bit.
      
      On non-jump-label enabled architectures static keys default to
      likely()/unlikely() branches.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Acked-by: NJason Baron <jbaron@redhat.com>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: a.p.zijlstra@chello.nl
      Cc: mathieu.desnoyers@efficios.com
      Cc: davem@davemloft.net
      Cc: ddaney.cavm@gmail.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.huSigned-off-by: NIngo Molnar <mingo@elte.hu>
      c5905afb
  20. 07 2月, 2012 1 次提交
    • S
      perf: Fix double start/stop in x86_pmu_start() · f39d47ff
      Stephane Eranian 提交于
      The following patch fixes a bug introduced by the following
      commit:
      
              e050e3f0 ("perf: Fix broken interrupt rate throttling")
      
      The patch caused the following warning to pop up depending on
      the sampling frequency adjustments:
      
        ------------[ cut here ]------------
        WARNING: at arch/x86/kernel/cpu/perf_event.c:995 x86_pmu_start+0x79/0xd4()
      
      It was caused by the following call sequence:
      
      perf_adjust_freq_unthr_context.part() {
           stop()
           if (delta > 0) {
                perf_adjust_period() {
                    if (period > 8*...) {
                        stop()
                        ...
                        start()
                    }
                }
            }
            start()
      }
      
      Which caused a double start and a double stop, thus triggering
      the assert in x86_pmu_start().
      
      The patch fixes the problem by avoiding the double calls. We
      pass a new argument to perf_adjust_period() to indicate whether
      or not the event is already stopped. We can't just remove the
      start/stop from that function because it's called from
      __perf_event_overflow where the event needs to be reloaded via a
      stop/start back-toback call.
      
      The patch reintroduces the assertion in x86_pmu_start() which
      was removed by commit:
      
      	84f2b9b2 ("perf: Remove deprecated WARN_ON_ONCE()")
      
      In this second version, we've added calls to disable/enable PMU
      during unthrottling or frequency adjustment based on bug report
      of spurious NMI interrupts from Eric Dumazet.
      Reported-and-tested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: markus@trippelsdorf.de
      Cc: paulus@samba.org
      Link: http://lkml.kernel.org/r/20120207133956.GA4932@quad
      [ Minor edits to the changelog and to the code ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f39d47ff
  21. 03 2月, 2012 1 次提交
    • L
      cgroup: remove cgroup_subsys argument from callbacks · 761b3ef5
      Li Zefan 提交于
      The argument is not used at all, and it's not necessary, because
      a specific callback handler of course knows which subsys it
      belongs to.
      
      Now only ->pupulate() takes this argument, because the handlers of
      this callback always call cgroup_add_file()/cgroup_add_files().
      
      So we reduce a few lines of code, though the shrinking of object size
      is minimal.
      
       16 files changed, 113 insertions(+), 162 deletions(-)
      
         text    data     bss     dec     hex filename
      5486240  656987 7039960 13183187         c928d3 vmlinux.o.orig
      5486170  656987 7039960 13183117         c9288d vmlinux.o
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      761b3ef5
  22. 27 1月, 2012 1 次提交
    • S
      perf: Fix broken interrupt rate throttling · e050e3f0
      Stephane Eranian 提交于
      This patch fixes the sampling interrupt throttling mechanism.
      
      It was broken in v3.2. Events were not being unthrottled. The
      unthrottling mechanism required that events be checked at each
      timer tick.
      
      This patch solves this problem and also separates:
      
        - unthrottling
        - multiplexing
        - frequency-mode period adjustments
      
      Not all of them need to be executed at each timer tick.
      
      This third version of the patch is based on my original patch +
      PeterZ proposal (https://lkml.org/lkml/2012/1/7/87).
      
      At each timer tick, for each context:
      
        - if the current CPU has throttled events, we unthrottle events
      
        - if context has frequency-based events, we adjust sampling periods
      
        - if we have reached the jiffies interval, we multiplex (rotate)
      
      We decoupled rotation (multiplexing) from frequency-mode sampling
      period adjustments.  They should not necessarily happen at the same
      rate. Multiplexing is subject to jiffies_interval (currently at 1
      but could be higher once the tunable is exposed via sysfs).
      
      We have grouped frequency-mode adjustment and unthrottling into the
      same routine to minimize code duplication. When throttled while in
      frequency mode, we scan the events only once.
      
      We have fixed the threshold enforcement code in __perf_event_overflow().
      There was a bug whereby it would allow more than the authorized rate
      because an increment of hwc->interrupts was not executed at the right
      place.
      
      The patch was tested with low sampling limit (2000) and fixed periods,
      frequency mode, overcommitted PMU.
      
      On a 2.1GHz AMD CPU:
      
       $ cat /proc/sys/kernel/perf_event_max_sample_rate
       2000
      
      We set a rate of 3000 samples/sec (2.1GHz/3000 = 700000):
      
       $ perf record -e cycles,cycles -c 700000  noploop 10
       $ perf report -D | tail -21
      
       Aggregated stats:
                 TOTAL events:      80086
                  MMAP events:         88
                  COMM events:          2
                  EXIT events:          4
              THROTTLE events:      19996
            UNTHROTTLE events:      19996
                SAMPLE events:      40000
      
       cycles stats:
                 TOTAL events:      40006
                  MMAP events:          5
                  COMM events:          1
                  EXIT events:          4
              THROTTLE events:       9998
            UNTHROTTLE events:       9998
                SAMPLE events:      20000
      
       cycles stats:
                 TOTAL events:      39996
              THROTTLE events:       9998
            UNTHROTTLE events:       9998
                SAMPLE events:      20000
      
      For 10s, the cap is 2x2000x10 = 40000 samples.
      We get exactly that: 20000 samples/event.
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Cc: <stable@kernel.org> # v3.2+
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20120126160319.GA5655@quadSigned-off-by: NIngo Molnar <mingo@elte.hu>
      e050e3f0
  23. 21 1月, 2012 1 次提交
  24. 02 1月, 2012 1 次提交
  25. 21 12月, 2011 5 次提交