1. 07 Jun 2015, 5 commits
    • perf/x86/intel: Drain the PEBS buffer during context switches · 9c964efa
      Authored by Yan, Zheng
      Flush the PEBS buffer during context switches if the PEBS interrupt
      threshold is larger than one. This allows perf to supply the TID for
      sample outputs.
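
      A minimal sketch of the idea, assuming a hypothetical sched_task-style
      callback and drain helper (the names below are illustrative, not the
      kernel's actual symbols):

      /*
       * Illustrative sketch only: drain the PEBS buffer when the outgoing
       * task is scheduled out, so records written before the switch are
       * attributed to the correct TID.  Helper names are hypothetical.
       */
      static void pebs_sched_task_sketch(struct perf_event_context *ctx,
                                         bool sched_in)
      {
              /* Only needed when more than one record can sit in the buffer. */
              if (!sched_in && pebs_threshold_is_large_sketch())
                      drain_pebs_buffer_sketch();     /* flush pending records */
      }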
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1430940834-8964-6-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS interrupt threshold) · 3569c0d7
      Authored by Yan, Zheng
      PEBS always had the capability to log samples to its buffers without
      an interrupt. Traditionally perf has not used this but always set the
      PEBS threshold to one.
      
      For frequently occurring events (like cycles, branches or loads/stores)
      this in turn requires a relatively high sampling period to avoid
      overloading the system with PMIs, which in turn increases sampling
      error.
      
      For the common cases we still need to use the PMI because the PEBS
      hardware has various limitations. The biggest one is that it cannot
      supply a callgraph. It also requires setting a fixed period, as the
      hardware does not support an adaptive period. Another issue is that it
      cannot supply a timestamp and some other options. To supply a TID it
      requires flushing on context switch. It can, however, supply the IP, the
      load/store address, TSX information, registers, and some other things.
      
      So we can make PEBS work for some specific cases: basically, as long as
      you can do without a callgraph and can set a fixed period, you can use
      this new PEBS mode.
      
      The main benefit is the ability to support a much lower sampling period
      (down to -c 1000) without excessive overhead.
      
      One use case is, for example, to increase the resolution of the c2c
      tool. Another is double-checking when you suspect that standard sampling
      has too much sampling error.
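
      As a rough illustration of the constraints above, here is a hedged
      sketch of the kind of check that decides whether the large threshold can
      be used (the helper name and exact set of conditions are assumptions,
      not the kernel's actual code):

      /*
       * Illustrative sketch: the large (multi-record) threshold is only
       * usable when nothing in the requested sample forces per-sample PMI
       * processing.  This is not the kernel's exact check.
       */
      static bool can_use_large_pebs_sketch(struct perf_event *event)
      {
              u64 sample_type = event->attr.sample_type;

              if (event->attr.freq)                      /* needs an adaptive period */
                      return false;
              if (sample_type & PERF_SAMPLE_CALLCHAIN)   /* no callgraph in PEBS records */
                      return false;
              if (sample_type & PERF_SAMPLE_TIME)        /* no timestamp in PEBS records */
                      return false;
              return true;    /* fixed period, PEBS-supplied data only */
      }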
      
      Some numbers on the overhead, using cycle soak, comparing the elapsed
      time from "kernbench -M -H" between plain (threshold set to one) and
      multi (large threshold).
      
      The test command for plain:
        "perf record --time -e cycles:p -c $period -- kernbench -M -H"
      
      The test command for multi:
        "perf record --no-time -e cycles:p -c $period -- kernbench -M -H"
      
      ( The only difference between the multi and plain test commands is the
        timestamp option. Since timestamps are not supported with a large PEBS
        threshold, the option can be used as a flag to indicate whether the
        large threshold is enabled during the test. )
      
      	period    plain(Sec)  multi(Sec)  Delta
      	10003     32.7        16.5        16.2
      	20003     30.2        16.2        14.0
      	40003     18.6        14.1        4.5
      	80003     16.8        14.6        2.2
      	100003    16.9        14.1        2.8
      	800003    15.4        15.7        -0.3
      	1000003   15.3        15.2        0.2
      	2000003   15.3        15.1        0.1
      
      With periods below 100003, plain (threshold one) causes much more
      overhead. With a 10003 sampling period, multi is roughly 2x faster than
      plain in elapsed time.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1430940834-8964-5-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Handle multiple records in the PEBS buffer · 21509084
      Authored by Yan, Zheng
      When the PEBS interrupt threshold is larger than one record and the
      machine supports multiple PEBS events, the records of these events are
      mixed up and we need to demultiplex them.
      
      Demuxing the records is hard because the hardware is deficient. The
      hardware has two issues that, when combined, create scenarios that are
      impossible to demux.
      
      The first issue is that the 'status' field of the PEBS record is a copy
      of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
      problem let us first describe the regular PEBS cycle:
      
      A) the CTRn value reaches 0:
        - the corresponding bit in GLOBAL_STATUS gets set
        - we start arming the hardware assist
        < some unspecified amount of time later -- this could cover multiple
          events of interest >
      
      B) the hardware assist is armed, any next event will trigger it
      
      C) a matching event happens:
        - the hardware assist triggers and generates a PEBS record
          this includes a copy of GLOBAL_STATUS at this moment
        - if we auto-reload we (re)set CTRn
        - we clear the relevant bit in GLOBAL_STATUS
      
      Now consider the following chain of events:
      
        A0, B0, A1, C0
      
      The event generated for counter 0 will include a status with counter 1
      set, even though it's not at all related to the record. A similar thing
      can happen with a !PEBS event if it just happens to overflow at the
      right moment.
      
      The second issue is that the hardware will only emit one record for two
      or more counters if the events that trigger the assists are 'close' in
      time. 'Close' can be several cycles; in some cases it can even span the
      complete assist, if the event is something that doesn't need retirement.
      
      For instance, consider this chain of events:
      
        A0, B0, A1, B1, C01
      
      Where C01 is an event that triggers both hardware assists, we will
      generate only a single record, but again with both counters listed in
      the status field.
      
      This time the record pertains to both events.
      
      Note that these two cases are different but indistinguishable from the
      data as generated. Therefore demuxing records with multiple PEBS bits
      (we can safely ignore status bits for !PEBS counters) is impossible.
      
      Furthermore we cannot emit the record to both events because that might
      cause a data leak -- the events might not have the same privileges -- so
      what this patch does is discard such events.
      
      The assumption/hope is that such discards will be rare.
      
      Here are some possible ways you may end up with a high discard rate:

        - when you count the same thing multiple times; but that is not a
          useful configuration.
        - you can be unfortunate if you measure with a userspace only PEBS
          event along with either a kernel or unrestricted PEBS event. Imagine
          the event triggering and setting the overflow flag right before
          entering the kernel. Then all kernel side events will end up with
          multiple bits set.
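
      To make the discard policy described above concrete, here is a hedged
      sketch of how a drain loop might treat the status bits of each record
      (field and helper names are illustrative, not the exact kernel code):

      /*
       * Illustrative sketch of the demux/discard policy.  'status' is the
       * GLOBAL_STATUS copy stored in the PEBS record, 'pebs_enabled' is the
       * mask of counters currently using PEBS.
       */
      static void demux_pebs_record_sketch(u64 status, u64 pebs_enabled)
      {
              /* Status bits of !PEBS counters can be safely ignored. */
              u64 pebs_status = status & pebs_enabled;

              if (hweight64(pebs_status) == 0)
                      return;                 /* stale record, nothing to do */

              if (hweight64(pebs_status) > 1) {
                      /*
                       * Two or more PEBS counters claim this record and we
                       * cannot tell which one(s) it really belongs to.
                       * Delivering it to all of them could leak data across
                       * privilege boundaries, so discard it.
                       */
                      return;
              }

              /* Exactly one bit set: deliver the record to that event. */
              deliver_pebs_record_sketch(__ffs64(pebs_status));
      }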
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      [ Changelog improvements. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1430940834-8964-4-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Introduce setup_pebs_sample_data() · 43cf7631
      Authored by Yan, Zheng
      Move code that sets up the PEBS sample data to a separate function.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1430940834-8964-3-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Use the PEBS auto reload mechanism when possible · 851559e3
      Authored by Yan, Zheng
      When a fixed period is specified, this patch makes perf use the PEBS
      auto reload mechanism. This makes normal profiling faster, because
      it avoids one costly MSR write in the PMI handler.
      
      However, the reset value will be loaded by the hardware assist. There is
      a small delay compared to the previous non-auto-reload mechanism. The
      delay time is arbitrary, but very small. The assist cost is 400-800
      cycles, assuming common cases with everything cached. The minimum period
      the patch currently uses is 10000. In that extreme case the overhead can
      be ~10% if cycles are used.
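
      A hedged sketch of the enabling condition (the helper name and period
      limit below are placeholders, not the kernel's actual code):

      /* Illustrative sketch only; names and the period limit are placeholders. */
      #define PEBS_MAX_PERIOD_SKETCH  ((1ULL << 31) - 1)

      static bool pebs_can_auto_reload_sketch(struct perf_event *event)
      {
              /*
               * A fixed period that fits in the counter can be reloaded by
               * the hardware assist itself, saving one MSR write per PMI.
               */
              return !event->attr.freq &&
                     event->hw.sample_period &&
                     event->hw.sample_period <= PEBS_MAX_PERIOD_SKETCH;
      }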
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1430940834-8964-2-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 27 May 2015, 1 commit
    • perf/x86: Fix event/group validation · b371b594
      Authored by Peter Zijlstra
      Commit 43b45780 ("perf/x86: Reduce stack usage of
      x86_schedule_events()") violated the rule that 'fake' scheduling, as
      used for event/group validation, should not change the event state.
      
      This went mostly unnoticed because repeated calls to
      x86_pmu::get_event_constraints() would give the same result, and
      x86_pmu::put_event_constraints() would mostly not do anything.
      
      Commit e979121b ("perf/x86/intel: Implement cross-HT corruption
      bug workaround") made the situation much worse by actually setting the
      event->hw.constraint value to NULL, so when validation and actual
      scheduling interact we get NULL ptr derefs.
      
      Fix it by removing the constraint pointer from the event and moving it
      back to an array, this time in cpuc instead of on the stack.
      
      validate_group()
        x86_schedule_events()
          event->hw.constraint = c; # store
      
            <context switch>
              perf_task_event_sched_in()
                ...
                  x86_schedule_events();
                    event->hw.constraint = c2; # store
      
                    ...
      
                    put_event_constraints(event); # assume failure to schedule
                      intel_put_event_constraints()
                        event->hw.constraint = NULL;
      
            <context switch end>
      
          c = event->hw.constraint; # read -> NULL
      
          if (!test_bit(hwc->idx, c->idxmsk)) # <- *BOOM* NULL deref
      
      This is possible in particular when the event in question is a CPU-wide
      event and group leader, and validate_group() tries to add an event to
      the group.
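
      A hedged sketch of the shape of the fix; the structure and helper names
      are illustrative, with the real array living in the per-CPU
      cpu_hw_events scheduling state:

      /*
       * Illustrative sketch: keep the constraint in the scheduling context
       * (cpuc), indexed by the event's slot, instead of in event->hw where
       * 'fake' validation scheduling and real scheduling would clobber each
       * other.  Names are hypothetical.
       */
      struct cpuc_sketch {
              struct event_constraint *event_constraint[64]; /* one per slot */
              /* ... other per-CPU scheduling state ... */
      };

      static struct event_constraint *
      get_constraint_sketch(struct cpuc_sketch *cpuc, int idx,
                            struct perf_event *event)
      {
              /* Validation uses its own fake cpuc, so there is no cross-talk. */
              cpuc->event_constraint[idx] = lookup_constraint_sketch(event);
              return cpuc->event_constraint[idx];
      }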
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 43b45780 ("perf/x86: Reduce stack usage of x86_schedule_events()")
      Fixes: e979121b ("perf/x86/intel: Implement cross-HT corruption bug workaround")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 17 Apr 2015, 1 commit
  4. 02 Apr 2015, 2 commits
  5. 16 Jan 2015, 1 commit
  6. 18 Nov 2014, 1 commit
    • x86: Remove arbitrary instruction size limit in instruction decoder · 6ba48ff4
      Authored by Dave Hansen
      The current x86 instruction decoder steps along through the
      instruction stream but always ensures that it never steps farther
      than the largest possible instruction size (MAX_INSN_SIZE).
      
      The MPX code is now going to be doing some decoding of userspace
      instructions.  We copy those from userspace into the kernel and they're
      obviously completely untrusted coming from userspace.  In addition to
      the constraint that instructions can only be so long, we also have to be
      aware of how long the buffer that came in from userspace is.  This
      _looks_ to be similar to what the perf and kprobes code is doing, but
      it's unclear to me whether they are affected.
      
      The whole reason we need this is that it is perfectly valid to be
      executing an instruction within MAX_INSN_SIZE bytes of an
      unreadable page. We should be able to gracefully handle short
      reads in those cases.
      
      This adds support to the decoder to record how long the buffer
      being decoded is and to refuse to "validate" the instruction if
      we would have gone over the end of the buffer to decode it.
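
      A hedged usage sketch, assuming the decoder entry point now takes the
      number of bytes actually copied; the helper names are written from
      memory and may differ slightly from the real API:

      #include <asm/insn.h>   /* struct insn, insn_init(), insn_get_length() */

      /*
       * Illustrative sketch: decode one instruction from a bounded,
       * untrusted buffer copied in from userspace.  Passing the number of
       * bytes actually copied lets the decoder refuse to "validate" an
       * instruction that would run past the end of the buffer.
       */
      static int decode_bounded_sketch(const unsigned char *buf, int copied,
                                       int x86_64)
      {
              struct insn insn;

              insn_init(&insn, buf, copied, x86_64);  /* buffer length now passed in */
              insn_get_length(&insn);

              /* Only valid if the whole instruction fit inside 'copied' bytes. */
              if (!insn_complete(&insn))
                      return -EINVAL;

              return insn.length;
      }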
      
      The kprobes code probably needs to be looked at here a bit more
      carefully.  This patch still respects the MAX_INSN_SIZE limit
      there but the kprobes code does look like it might be able to
      be a bit more strict than it currently is.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Jim Keniston <jkenisto@us.ibm.com>
      Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: x86@kernel.org
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Link: http://lkml.kernel.org/r/20141114153957.E6B01535@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  7. 16 Nov 2014, 3 commits
  8. 09 Sep 2014, 1 commit
  9. 27 Aug 2014, 1 commit
    • x86: Replace __get_cpu_var uses · 89cbc767
      Authored by Christoph Lameter
      __get_cpu_var() is used for multiple purposes in the kernel source. One of
      them is address calculation via the form &__get_cpu_var(x).  This calculates
      the address for the instance of the percpu variable of the current processor
      based on an offset.
      
      Other use cases are for storing and retrieving data from the current
      processor's percpu area.  __get_cpu_var() can be used as an lvalue when
      writing data or on the right side of an assignment.
      
      __get_cpu_var() is defined as :
      
      #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
      
      __get_cpu_var() always only does an address determination. However, store
      and retrieve operations could use a segment prefix (or global register on
      other platforms) to avoid the address calculation.
      
      this_cpu_write() and this_cpu_read() can directly take an offset into a
      percpu area and use optimized assembly code to read and write per cpu
      variables.
      
      This patch converts __get_cpu_var into either an explicit address
      calculation using this_cpu_ptr() or into a use of this_cpu operations
      that use the offset.  Thereby address calculations are avoided and fewer
      registers are used when code is generated.
      
      Transformations done to __get_cpu_var()
      
      1. Determine the address of the percpu instance of the current processor.
      
      	DEFINE_PER_CPU(int, y);
      	int *x = &__get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(&y);
      
      2. Same as #1 but this time an array structure is involved.
      
      	DEFINE_PER_CPU(int, y[20]);
      	int *x = __get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(y);
      
      3. Retrieve the content of the current processor's instance of a per cpu
      variable.
      
      	DEFINE_PER_CPU(int, y);
      	int x = __get_cpu_var(y)
      
         Converts to
      
      	int x = __this_cpu_read(y);
      
      4. Retrieve the content of a percpu struct
      
      	DEFINE_PER_CPU(struct mystruct, y);
      	struct mystruct x = __get_cpu_var(y);
      
         Converts to
      
      	memcpy(&x, this_cpu_ptr(&y), sizeof(x));
      
      5. Assignment to a per cpu variable
      
      	DEFINE_PER_CPU(int, y)
      	__get_cpu_var(y) = x;
      
         Converts to
      
      	__this_cpu_write(y, x);
      
      6. Increment/Decrement etc of a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y)++
      
         Converts to
      
      	__this_cpu_inc(y)
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86@kernel.org
      Acked-by: H. Peter Anvin <hpa@linux.intel.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  10. 13 Aug 2014, 4 commits
    • perf/x86: Clean up __intel_pmu_pebs_event() code · c8aab2e0
      Authored by Stephane Eranian
      This patch makes the code more readable. It also renames
      precise_store_data_hsw() to precise_datala_hsw() because
      the function is called for both loads and stores on HSW.
      The patch also gets rid of the hardcoded store event codes in that same
      function.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1407785233-32193-5-git-send-email-eranian@google.com
      Cc: ak@linux.intel.com
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Fix data source encoding issues for load latency/precise store · 770eee1f
      Authored by Stephane Eranian
      This patch fixes issues introduced by Andi's previous 'Revamp PEBS'
      patch series.
      
      This patch fixes the following:
      
       - precise_store_data_hsw(): encode the mem op type whenever we can
       - precise_store_data_hsw(): set the default data source correctly

       - 0 is not a valid init value for the data source. Define PERF_MEM_NA
         as the default value (a sketch of such a default follows this list).
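
      A hedged sketch of such a default, built from the uapi PERF_MEM_S()
      helper; the in-kernel PERF_MEM_NA macro may be defined slightly
      differently:

      #include <uapi/linux/perf_event.h>

      /* Sketch of a "not available" default for data_src; not necessarily
       * identical to the kernel's PERF_MEM_NA definition. */
      #define PERF_MEM_NA_SKETCH      (PERF_MEM_S(OP, NA)    | \
                                       PERF_MEM_S(LVL, NA)   | \
                                       PERF_MEM_S(SNOOP, NA) | \
                                       PERF_MEM_S(LOCK, NA)  | \
                                       PERF_MEM_S(TLB, NA))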
      
      This bug was actually introduced by
      
          commit 722e76e6
          Author: Stephane Eranian <eranian@google.com>
          Date:   Thu May 15 17:56:44 2014 +0200
      
              fix Haswell precise store data source encoding
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1407785233-32193-4-git-send-email-eranian@google.com
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: ak@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Don't mark DataLA addresses as store · f3908b8c
      Authored by Andi Kleen
      Haswell supports reporting the data address for a range
      of PEBS events, including:
      
      	UOPS_RETIRED.ALL
      	MEM_UOPS_RETIRED.STLB_MISS_LOADS
      	MEM_UOPS_RETIRED.STLB_MISS_STORES
      	MEM_UOPS_RETIRED.LOCK_LOADS
      	MEM_UOPS_RETIRED.SPLIT_LOADS
      	MEM_UOPS_RETIRED.SPLIT_STORES
      	MEM_UOPS_RETIRED.ALL_LOADS
      	MEM_UOPS_RETIRED.ALL_STORES
      	MEM_LOAD_UOPS_RETIRED.L1_HIT
      	MEM_LOAD_UOPS_RETIRED.L2_HIT
      	MEM_LOAD_UOPS_RETIRED.L3_HIT
      	MEM_LOAD_UOPS_RETIRED.L1_MISS
      	MEM_LOAD_UOPS_RETIRED.L2_MISS
      	MEM_LOAD_UOPS_RETIRED.L3_MISS
      	MEM_LOAD_UOPS_RETIRED.HIT_LFB
      	MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS
      	MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT
      	MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM
      	MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_NONE
      	MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM
      
      This facility was already enabled earlier with the original Haswell
      perf changes.
      
      However these addresses were always reported as stores by perf, which is
      wrong, as they could be loads too.  The hardware does not distinguish
      loads and stores for these instructions, so there's no (cheap) way for
      the profiler to find out.
      
      Change the type to PERF_MEM_OP_NA instead.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Link: http://lkml.kernel.org/r/1407785233-32193-3-git-send-email-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Revamp PEBS event selection · 86a04461
      Authored by Andi Kleen
      The basic idea is that it does not make sense to list all PEBS events
      individually. The list is very long, sometimes outdated, and the
      hardware doesn't need it. If an event does not support PEBS it will
      simply not count; there is no security issue.
      
      We only need to list events that do something special, like supporting
      load or store addresses.
      
      This vastly simplifies the PEBS event selection. It also
      speeds up the scheduling because the scheduler doesn't
      have to walk as many constraints.
      
      Bugs fixed:
      
       - We do not allow setting forbidden flags with PEBS anymore
         (SDM 18.9.4), except for the special cycle event.
         This is done using a new constraint macro that also matches on the
         event flags (a rough sketch follows this list).
      
       - Correct DataLA and load/store/na flags reporting on Haswell
         [Requires a followon patch]
      
       - We did not allow all PEBS events on Haswell:
         We were missing some valid subevents in d1-d2 (MEM_LOAD_UOPS_RETIRED.*,
         MEM_LOAD_UOPS_RETIRED_L3_HIT_RETIRED.*)
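
      A rough, hedged sketch of what 'matching on the event flags' means for a
      constraint; the masks and names below are placeholders, not the kernel's
      actual constraint macros:

      /*
       * Illustrative sketch: a PEBS constraint normally matches only the
       * event code/umask.  Widening the compare mask to include the flag
       * bits (inv, cmask, edge, ...) lets the constraint reject flag
       * combinations that PEBS cannot handle.  Masks are placeholders.
       */
      #define EVSEL_EVENT_UMASK_MASK_SKETCH   0x0000ffffULL
      #define EVSEL_FLAGS_MASK_SKETCH         0x00ff0000ULL   /* edge/inv/cmask... */

      static bool pebs_config_matches_sketch(u64 config, u64 allowed)
      {
              u64 cmp_mask = EVSEL_EVENT_UMASK_MASK_SKETCH |
                             EVSEL_FLAGS_MASK_SKETCH;        /* match flags too */

              return (config & cmp_mask) == (allowed & cmp_mask);
      }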
      
      This includes the changes proposed by Stephane earlier and obsoletes
      his patchkit (except for some changes on pre-Sandy Bridge/Silvermont
      CPUs).
      
      I only did Sandy Bridge and Silvermont and later so far, mostly because
      these are the parts whose hardware behavior I could directly confirm
      with hardware architects. Also, I do not believe the older CPUs have any
      missing events in their PEBS list, so there's no pressing need to change
      them.
      
      I did not implement the flag proposed by Peter to allow setting
      forbidden flags. If really needed, this could be implemented on top of
      this patch.
      
      v2: Fix broken store events on SNB/IVB (Stephane Eranian)
      v3: More fixes. Rename some arguments (Stephane Eranian)
      v4: List most Haswell events individually again to report
      memory operation type correctly.
      Add new flags to describe load/store/na for datala.
      Update description.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1407785233-32193-2-git-send-email-eranian@google.com
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
      Cc: Mark Davies <junk@eslaf.co.uk>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 16 Jul 2014, 1 commit
  12. 19 May 2014, 1 commit
  13. 06 Nov 2013, 1 commit
  14. 16 Oct 2013, 1 commit
    • perf/x86: Optimize intel_pmu_pebs_fixup_ip() · 9536c8d2
      Authored by Peter Zijlstra
      There's been reports of high NMI handler overhead, highlighted by
      such kernel messages:
      
        [ 3697.380195] perf samples too long (10009 > 10000), lowering kernel.perf_event_max_sample_rate to 13000
        [ 3697.389509] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 9.331 msecs
      
      Don Zickus analyzed the source of the overhead and reported:
      
       > While there are a few places that are causing latencies, for now I focused on
       > the longest one first.  It seems to be 'copy_user_from_nmi'
       >
       > intel_pmu_handle_irq ->
       >	intel_pmu_drain_pebs_nhm ->
       >		__intel_pmu_drain_pebs_nhm ->
       >			__intel_pmu_pebs_event ->
       >				intel_pmu_pebs_fixup_ip ->
       >					copy_from_user_nmi
       >
       > In intel_pmu_pebs_fixup_ip(), if the while-loop goes over 50, the sum of
       > all the copy_from_user_nmi latencies seems to go over 1,000,000 cycles
       > (there are some cases where only 10 iterations are needed to go that high
       > too, but in generall over 50 or so).  At this point copy_user_from_nmi
       > seems to account for over 90% of the nmi latency.
      
      The solution to that is to avoid having to call copy_from_user_nmi() for
      every instruction.
      
      Since we already limit the max basic block size, we can easily
      pre-allocate a piece of memory to copy the entire thing into in one
      go.
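
      A hedged sketch of the idea, with a hypothetical copy helper standing in
      for the NMI-safe user copy; the real code keeps the buffer per CPU and
      uses the kernel's instruction decoder:

      /*
       * Illustrative sketch: copy the whole bounded range [start, ip) from
       * userspace once, then walk it with the instruction decoder, instead
       * of doing one user copy per instruction.  Names are hypothetical.
       */
      #define FIXUP_BUF_SIZE_SKETCH   1024    /* bounded basic-block window */

      static int fixup_ip_sketch(unsigned long start, unsigned long ip)
      {
              static u8 buf[FIXUP_BUF_SIZE_SKETCH];   /* per-CPU in real code */
              unsigned long size = ip - start, off = 0;
              struct insn insn;

              if (size > FIXUP_BUF_SIZE_SKETCH)
                      return 0;

              /* One user-memory copy for the whole range. */
              if (copy_code_from_user_sketch(buf, start, size))
                      return 0;

              /* Decode from the local buffer; no further user copies. */
              while (off < size) {
                      insn_init(&insn, buf + off, size - off, 1);
                      insn_get_length(&insn);
                      if (!insn.length)
                              break;
                      off += insn.length;
              }
              return off == size;     /* landed exactly on the sampled IP? */
      }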
      
      Don reported this test result:
      
       > Your patch made a huge difference in improvement.  The
       > copy_from_user_nmi() no longer hits the million of cycles.  I still
       > have a batch of 100,000-300,000 cycles.  My longest NMI paths used
       > to be dominated by copy_from_user_nmi, now it is not (I have to dig
       > up the new hot path).
      Reported-and-tested-by: Don Zickus <dzickus@redhat.com>
      Cc: jmario@redhat.com
      Cc: acme@infradead.org
      Cc: dave.hansen@linux.intel.com
      Cc: eranian@google.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20131016105755.GX10651@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  15. 04 Oct 2013, 1 commit
  16. 20 Sep 2013, 2 commits
  17. 14 Sep 2013, 1 commit
  18. 13 Sep 2013, 2 commits
  19. 02 Sep 2013, 2 commits
  20. 27 Jun 2013, 1 commit
  21. 19 Jun 2013, 3 commits
  22. 01 Apr 2013, 3 commits
  23. 21 Mar 2013, 1 commit
    • perf/x86: Fix uninitialized pt_regs in intel_pmu_drain_bts_buffer() · 0e48026a
      Authored by Stephane Eranian
      This patch fixes an uninitialized pt_regs struct in the BTS drain
      function. The pt_regs struct is propagated all the way to the
      code_get_segment() function from perf_instruction_pointer() and may
      contain garbage.
      
      We cannot simply inherit the actual pt_regs from the interrupt
      because BTS must be flushed on context-switch or when the
      associated event is disabled. And there we do not have a pt_regs
      handy.
      
      Setting pt_regs to all zeroes may not be the best option but it
      is not clear what else to do given where the drain_bts_buffer()
      is called from.
      
      In V2, we move the memset() later in the code to avoid doing it
      when we end up returning early without doing the actual BTS
      processing. Also dropped the reg.val initialization because it
      is redundant with the memset() as suggested by PeterZ.
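
      A hedged sketch of the V2 placement described above (simplified;
      function name and arguments are illustrative):

      /* Illustrative sketch only: zero the synthetic regs after the early outs. */
      static int drain_bts_sketch(void *base, void *top, struct perf_event *event)
      {
              struct pt_regs regs;

              if (!event || base >= top)
                      return 0;               /* return early, regs never used */

              memset(&regs, 0, sizeof(regs)); /* V2: done only when needed */

              /* ... walk the BTS records and emit samples using 'regs' ... */
              return 1;
      }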
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: peterz@infradead.org
      Cc: sqazi@google.com
      Cc: ak@linux.intel.com
      Cc: jolsa@redhat.com
      Link: http://lkml.kernel.org/r/20130319151038.GA25439@quad
      Signed-off-by: Ingo Molnar <mingo@kernel.org>