1. 11 12月, 2016 1 次提交
  2. 22 11月, 2016 1 次提交
    • P
      perf/x86/intel: Cure bogus unwind from PEBS entries · b8000586
      Peter Zijlstra 提交于
      Vince Weaver reported that perf_fuzzer + KASAN detects that PEBS event
      unwinds sometimes do 'weird' things. In particular, we seemed to be
      ending up unwinding from random places on the NMI stack.
      
      While it was somewhat expected that the event record BP,SP would not
      match the interrupt BP,SP in that the interrupt is strictly later than
      the record event, it was overlooked that it could be on an already
      overwritten stack.
      
      Therefore, don't copy the recorded BP,SP over the interrupted BP,SP
      when we need stack unwinds.
      
      Note that its still possible the unwind doesn't full match the actual
      event, as its entirely possible to have done an (I)RET between record
      and interrupt, but on average it should still point in the general
      direction of where the event came from. Also, it's the best we can do,
      considering.
      
      The particular scenario that triggered the bogus NMI stack unwind was
      a PEBS event with very short period, upon enabling the event at the
      tail of the PMI handler (FREEZE_ON_PMI is not used), it instantly
      triggers a record (while still on the NMI stack) which in turn
      triggers the next PMI. This then causes back-to-back NMIs and we'll
      try and unwind the stack-frame from the last NMI, which obviously is
      now overwritten by our own.
      Analyzed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Reported-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@gmail.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davej@codemonkey.org.uk <davej@codemonkey.org.uk>
      Cc: dvyukov@google.com <dvyukov@google.com>
      Cc: stable@vger.kernel.org
      Fixes: ca037701 ("perf, x86: Add PEBS infrastructure")
      Link: http://lkml.kernel.org/r/20161117171731.GV3157@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b8000586
  3. 10 8月, 2016 3 次提交
    • P
      perf/x86/intel: Clean up LBR state tracking · 3e2c1a67
      Peter Zijlstra 提交于
      The lbr_context logic confused me; it appears to me to try and do the
      same thing the pmu::sched_task() callback does now, but limited to
      per-task events.
      
      So rip it out. Afaict this should also improve performance, because I
      think the current code can end up doing lbr_reset() twice, once from
      the pmu::add() and then again from pmu::sched_task(), and MSR writes
      (all 3*16 of them) are expensive!!
      
      While thinking through the cases that need the reset it occured to me
      the first install of an event in an active context needs to reset the
      LBR (who knows what crap is in there), but detecting this case is
      somewhat hard.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      3e2c1a67
    • P
      perf/x86: Ensure perf_sched_cb_{inc,dec}() is only called from pmu::{add,del}() · 68f7082f
      Peter Zijlstra 提交于
      Currently perf_sched_cb_{inc,dec}() are called from
      pmu::{start,stop}(), which has the problem that this can happen from
      NMI context, this is making it hard to optimize perf_pmu_sched_task().
      
      Furthermore, we really only need this accounting on pmu::{add,del}(),
      so doing it from pmu::{start,stop}() is doing more work than we really
      need.
      
      Introduce x86_pmu::{add,del}() and wire up the LBR and PEBS.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      68f7082f
    • P
      perf/x86/intel: Rework the large PEBS setup code · 09e61b4f
      Peter Zijlstra 提交于
      In order to allow optimizing perf_pmu_sched_task() we must ensure
      perf_sched_cb_{inc,dec}() are no longer called from NMI context; this
      means that pmu::{start,stop}() can no longer use them.
      
      Prepare for this by reworking the whole large PEBS setup code.
      
      The current code relied on the cpuc->pebs_enabled state, however since
      that reflects the current active state as per pmu::{start,stop}() we
      can no longer rely on this.
      
      Introduce two counters: cpuc->n_pebs and cpuc->n_large_pebs which
      count the total number of PEBS events and the number of PEBS events
      that have FREERUNNING set, resp.. With this we can tell if the current
      setup requires a single record interrupt threshold or can use a larger
      buffer.
      
      This also improves the code in that it re-enables the large threshold
      once the PEBS event that required single record gets removed.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      09e61b4f
  4. 27 6月, 2016 1 次提交
    • D
      perf/x86/intel: Fix MSR_LAST_BRANCH_FROM_x bug when no TSX · 19fc9ddd
      David Carrillo-Cisneros 提交于
      Intel's SDM states that bits 61:62 in MSR_LAST_BRANCH_FROM_x are the
      TSX flags for formats with LBR_TSX flags (i.e. LBR_FORMAT_EIP_EFLAGS2).
      
      However, when the CPU has TSX support deactivated, bits 61:62 actually
      behave as follows:
      
        - For wrmsr(), bits 61:62 are considered part of the sign extension.
        - When capturing branches, the LBR hw will always clear bits 61:62.
          regardless of the sign extension.
      
      Therefore, if:
      
        1) LBR has TSX format.
        2) CPU has no TSX support enabled.
      
      ... then any value passed to wrmsr() must be sign extended to 63 bits
      and any value from rdmsr() must be converted to have a sign extension
      of 61 bits, ignoring the values at TSX flags.
      
      This bug was masked by the work-around to the Intel's CPU bug:
      BJ94. "LBR May Contain Incorrect Information When Using FREEZE_LBRS_ON_PMI"
      in Document Number: 324643-037US.
      
      The aforementioned work-around uses hw flags to filter out all kernel
      branches, limiting LBR callstack to user level execution only.
      
      Since user addresses are not sign extended, they do not trigger the wrmsr()
      bug in MSR_LAST_BRANCH_FROM_x when saved/restored at context switch.
      
      To verify the hw bug:
      
        $ perf record -b -e cycles sleep 1
        $ rdmsr -p 0 0x680
        0x1fffffffb0b9b0cc
        $ wrmsr -p 0 0x680 0x1fffffffb0b9b0cc
        write(): Input/output error
      
      The quirk for LBR_FROM_ MSRs is required before calls to wrmsrl() and
      after rdmsrl().
      
      This patch introduces it for wrmsrl()'s done for testing LBR support.
      
      Future patch in series adds the quirk for context switch, that would
      be required if LBR callstack is to be enabled for ring 0.
      Signed-off-by: NDavid Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NStephane Eranian <eranian@google.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1466533874-52003-3-git-send-email-davidcc@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      19fc9ddd
  5. 03 6月, 2016 1 次提交
  6. 05 5月, 2016 1 次提交
  7. 23 4月, 2016 2 次提交
  8. 29 3月, 2016 1 次提交
  9. 25 3月, 2016 1 次提交
  10. 08 3月, 2016 3 次提交
  11. 17 2月, 2016 1 次提交
  12. 06 1月, 2016 2 次提交
    • H
      perf/x86/intel: Add perf core PMU support for Intel Knights Landing · 1e7b9390
      Harish Chegondi 提交于
      Knights Landing core is based on Silvermont core with several differences.
      Like Silvermont, Knights Landing has 8 pairs of LBR MSRs. However, the
      LBR MSRs addresses match those of the Xeon cores' first 8 pairs of LBR MSRs
      Unlike Silvermont, Knights Landing supports hyperthreading. Knights Landing
      offcore response events config register mask is different from that of the
      Silvermont.
      
      This patch was developed based on a patch from Andi Kleen.
      
      For more details, please refer to the public document:
      
        https://software.intel.com/sites/default/files/managed/15/8d/IntelXeonPhi%E2%84%A2x200ProcessorPerformanceMonitoringReferenceManual_Volume1_Registers_v0%206.pdfSigned-off-by: NHarish Chegondi <harish.chegondi@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Harish Chegondi <harish.chegondi@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/d14593c7311f78c93c9cf6b006be843777c5ad5c.1449517401.git.harish.chegondi@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1e7b9390
    • A
      perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp · 72469764
      Andi Kleen 提交于
      Add a new 'three-p' precise level, that uses INST_RETIRED.PREC_DIST as
      base. The basic mechanism of abusing the inverse cmask to get all
      cycles works the same as before.
      
      PREC_DIST is available on Sandy Bridge or later. It had some problems
      on Sandy Bridge, so we only use it on IvyBridge and later. I tested it
      on Broadwell and Skylake.
      
      PREC_DIST has special support for avoiding shadow effects, which can
      give better results compare to UOPS_RETIRED. The drawback is that
      PREC_DIST can only schedule on counter 1, but that is ok for cycle
      sampling, as there is normally no need to do multiple cycle sampling
      runs in parallel. It is still possible to run perf top in parallel, as
      that doesn't use precise mode. Also of course the multiplexing can
      still allow parallel operation.
      
      :pp stays with the previous event.
      
      Example:
      
      Sample a loop with 10 sqrt with old cycles:pp
      
      	  0.14 │10:   sqrtps %xmm1,%xmm0     <--------------
      	  9.13 │      sqrtps %xmm1,%xmm0
      	 11.58 │      sqrtps %xmm1,%xmm0
      	 11.51 │      sqrtps %xmm1,%xmm0
      	  6.27 │      sqrtps %xmm1,%xmm0
      	 10.38 │      sqrtps %xmm1,%xmm0
      	 12.20 │      sqrtps %xmm1,%xmm0
      	 12.74 │      sqrtps %xmm1,%xmm0
      	  5.40 │      sqrtps %xmm1,%xmm0
      	 10.14 │      sqrtps %xmm1,%xmm0
      	 10.51 │    ↑ jmp    10
      
      We expect all 10 sqrt to get roughly the sample number of samples.
      
      But you can see that the instruction directly after the JMP is
      systematically underestimated in the result, due to sampling shadow
      effects.
      
      With the new PREC_DIST based sampling this problem is gone and all
      instructions show up roughly evenly:
      
      	  9.51 │10:   sqrtps %xmm1,%xmm0
      	 11.74 │      sqrtps %xmm1,%xmm0
      	 11.84 │      sqrtps %xmm1,%xmm0
      	  6.05 │      sqrtps %xmm1,%xmm0
      	 10.46 │      sqrtps %xmm1,%xmm0
      	 12.25 │      sqrtps %xmm1,%xmm0
      	 12.18 │      sqrtps %xmm1,%xmm0
      	  5.26 │      sqrtps %xmm1,%xmm0
      	 10.13 │      sqrtps %xmm1,%xmm0
      	 10.43 │      sqrtps %xmm1,%xmm0
      	  0.16 │    ↑ jmp    10
      
      Even with PREC_DIST there is still sampling skid and the result is not
      completely even, but systematic shadow effects are significantly
      reduced.
      
      The improvements are mainly expected to make a difference in high IPC
      code. With low IPC it should be similar.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: hpa@zytor.com
      Link: http://lkml.kernel.org/r/1448929689-13771-2-git-send-email-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      72469764
  13. 06 12月, 2015 2 次提交
  14. 23 11月, 2015 3 次提交
    • A
      perf/x86: Handle multiple umask bits for BDW CYCLE_ACTIVITY.* · b7883a1c
      Andi Kleen 提交于
      The earlier constraint fix for Broadwell CYCLE_ACTIVITY.*
      forced umask 8 to counter 2. For this it used UEVENT,
      to match the complete umask.
      
      The event list for Broadwell has an additional
      STALLS_L1D_PENDIND event that uses umask 8, but also
      sets other bits in the umask.  The earlier strict umask match
      didn't handle this case.
      
      Add a new UBIT_EVENT constraint macro that only matches
      the specified bits in the umask. Then use that macro
      to handle CYCLE_ACTIVITY.* on Broadwell.
      
      The documented event also uses cmask, but there's no
      need to let the event scheduler know about the cmask,
      as the scheduling restriction is only tied to the umask.
      Reported-by: NGrant Ayers <ayers@cs.stanford.edu>
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1447719667-9998-1-git-send-email-andi@firstfloor.org
      [ Filled in the missing email address of Grant Ayers - hopefully I got the right one. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      b7883a1c
    • P
      treewide: Remove old email address · 90eec103
      Peter Zijlstra 提交于
      There were still a number of references to my old Red Hat email
      address in the kernel source. Remove these while keeping the
      Red Hat copyright notices intact.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      90eec103
    • A
      perf/x86: Fix LBR call stack save/restore · b28ae956
      Andi Kleen 提交于
      This fixes a bug I added in the following commit:
      
        90405aa0 ("perf/x86/intel/lbr: Limit LBR accesses to TOS in callstack mode")
      
      The bug could lead to lost LBR call stacks. When restoring the LBR state
      we need to use the TOS of the previous context, not the current context.
      To do that we need to save/restore the TOS.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: http://lkml.kernel.org/r/1445366797-30894-1-git-send-email-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b28ae956
  15. 18 9月, 2015 1 次提交
  16. 13 9月, 2015 2 次提交
    • S
      perf/core: Drop PERF_EVENT_TXN · 8f3e5684
      Sukadev Bhattiprolu 提交于
      We currently use PERF_EVENT_TXN flag to determine if we are in the middle
      of a transaction. If in a transaction, we defer the schedulability checks
      from pmu->add() operation to the pmu->commit() operation.
      
      Now that we have "transaction types" (PERF_PMU_TXN_ADD, PERF_PMU_TXN_READ)
      we can use the type to determine if we are in a transaction and drop the
      PERF_EVENT_TXN flag.
      
      When PERF_EVENT_TXN is dropped, the cpuhw->group_flag on some architectures
      becomes unused, so drop that field as well.
      
      This is an extension of the Powerpc patch from Peter Zijlstra to s390,
      Sparc and x86 architectures.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1441336073-22750-11-git-send-email-sukadev@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8f3e5684
    • S
      perf/core: Add a 'flags' parameter to the PMU transactional interfaces · fbbe0701
      Sukadev Bhattiprolu 提交于
      Currently, the PMU interface allows reading only one counter at a time.
      But some PMUs like the 24x7 counters in Power, support reading several
      counters at once. To leveage this functionality, extend the transaction
      interface to support a "transaction type".
      
      The first type, PERF_PMU_TXN_ADD, refers to the existing transactions,
      i.e. used to _schedule_ all the events on the PMU as a group. A second
      transaction type, PERF_PMU_TXN_READ, will be used in a follow-on patch,
      by the 24x7 counters to read several counters at once.
      
      Extend the transaction interfaces to the PMU to accept a 'txn_flags'
      parameter and use this parameter to ignore any transactions that are
      not of type PERF_PMU_TXN_ADD.
      
      Thanks to Peter Zijlstra for his input.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      [peterz: s390 compile fix]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1441336073-22750-3-git-send-email-sukadev@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fbbe0701
  17. 04 8月, 2015 5 次提交
  18. 19 6月, 2015 1 次提交
  19. 07 6月, 2015 3 次提交
    • Y
      perf/x86/intel: Drain the PEBS buffer during context switches · 9c964efa
      Yan, Zheng 提交于
      Flush the PEBS buffer during context switches if PEBS interrupt threshold
      is larger than one. This allows perf to supply TID for sample outputs.
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1430940834-8964-6-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9c964efa
    • Y
      perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS interrupt threshold) · 3569c0d7
      Yan, Zheng 提交于
      PEBS always had the capability to log samples to its buffers without
      an interrupt. Traditionally perf has not used this but always set the
      PEBS threshold to one.
      
      For frequently occurring events (like cycles or branches or load/store)
      this in term requires using a relatively high sampling period to avoid
      overloading the system, by only processing PMIs. This in term increases
      sampling error.
      
      For the common cases we still need to use the PMI because the PEBS
      hardware has various limitations. The biggest one is that it can not
      supply a callgraph. It also requires setting a fixed period, as the
      hardware does not support adaptive period. Another issue is that it
      cannot supply a time stamp and some other options. To supply a TID it
      requires flushing on context switch. It can however supply the IP, the
      load/store address, TSX information, registers, and some other things.
      
      So we can make PEBS work for some specific cases, basically as long as
      you can do without a callgraph and can set the period you can use this
      new PEBS mode.
      
      The main benefit is the ability to support much lower sampling period
      (down to -c 1000) without extensive overhead.
      
      One use cases is for example to increase the resolution of the c2c tool.
      Another is double checking when you suspect the standard sampling has
      too much sampling error.
      
      Some numbers on the overhead, using cycle soak, comparing the elapsed
      time from "kernbench -M -H" between plain (threshold set to one) and
      multi (large threshold).
      
      The test command for plain:
        "perf record --time -e cycles:p -c $period -- kernbench -M -H"
      
      The test command for multi:
        "perf record --no-time -e cycles:p -c $period -- kernbench -M -H"
      
      ( The only difference of test command between multi and plain is time
        stamp options. Since time stamp is not supported by large PEBS
        threshold, it can be used as a flag to indicate if large threshold is
        enabled during the test. )
      
      	period    plain(Sec)  multi(Sec)  Delta
      	10003     32.7        16.5        16.2
      	20003     30.2        16.2        14.0
      	40003     18.6        14.1        4.5
      	80003     16.8        14.6        2.2
      	100003    16.9        14.1        2.8
      	800003    15.4        15.7        -0.3
      	1000003   15.3        15.2        0.2
      	2000003   15.3        15.1        0.1
      
      With periods below 100003, plain (threshold one) cause much more
      overhead. With 10003 sampling period, the Elapsed Time for multi is
      even 2X faster than plain.
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1430940834-8964-5-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3569c0d7
    • Y
      perf/x86/intel: Use the PEBS auto reload mechanism when possible · 851559e3
      Yan, Zheng 提交于
      When a fixed period is specified, this patch makes perf use the PEBS
      auto reload mechanism. This makes normal profiling faster, because
      it avoids one costly MSR write in the PMI handler.
      
      However, the reset value will be loaded by hardware assist. There is a
      small delay compared to the previous non-auto-reload mechanism. The
      delay time is arbitrary, but very small. The assist cost is 400-800
      cycles, assuming common cases with everything cached. The minimum period
      the patch currently uses is 10000. In that extreme case it can be ~10%
      if cycles are used.
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1430940834-8964-2-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      851559e3
  20. 27 5月, 2015 4 次提交
    • P
      perf/x86/intel: Remove intel_excl_states::init_state · 43ef205b
      Peter Zijlstra 提交于
      For some obscure reason intel_{start,stop}_scheduling() copy the HT
      state to an intermediate array. This would make sense if we ever were
      to make changes to it which we'd have to discard.
      
      Except we don't. By the time we call intel_commit_scheduling() we're;
      as the name implies; committed to them. We'll never back out.
      
      A further hint its pointless is that stop_scheduling() unconditionally
      publishes the state.
      
      So the intermediate array is pointless, modify the state in place and
      kill the extra array.
      
      And remove the pointless array initialization: INTEL_EXCL_UNUSED == 0.
      
      Note; all is serialized by intel_excl_cntr::lock.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      43ef205b
    • P
      perf/x86/intel: Clean up intel_commit_scheduling() placement · 0c41e756
      Peter Zijlstra 提交于
      Move the code of intel_commit_scheduling() to the right place, which is
      in between start() and stop().
      
      No change in functionality.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      0c41e756
    • P
      perf/x86: Improve HT workaround GP counter constraint · cc1790cf
      Peter Zijlstra 提交于
      The (SNB/IVB/HSW) HT bug only affects events that can be programmed
      onto GP counters, therefore we should only limit the number of GP
      counters that can be used per cpu -- iow we should not constrain the
      FP counters.
      
      Furthermore, we should only enfore such a limit when there are in fact
      exclusive events being scheduled on either sibling.
      Reported-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [ Fixed build fail for the !CONFIG_CPU_SUP_INTEL case. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      cc1790cf
    • P
      perf/x86: Fix event/group validation · b371b594
      Peter Zijlstra 提交于
      Commit 43b45780 ("perf/x86: Reduce stack usage of
      x86_schedule_events()") violated the rule that 'fake' scheduling; as
      used for event/group validation; should not change the event state.
      
      This went mostly un-noticed because repeated calls of
      x86_pmu::get_event_constraints() would give the same result. And
      x86_pmu::put_event_constraints() would mostly not do anything.
      
      Commit e979121b ("perf/x86/intel: Implement cross-HT corruption
      bug workaround") made the situation much worse by actually setting the
      event->hw.constraint value to NULL, so when validation and actual
      scheduling interact we get NULL ptr derefs.
      
      Fix it by removing the constraint pointer from the event and move it
      back to an array, this time in cpuc instead of on the stack.
      
      validate_group()
        x86_schedule_events()
          event->hw.constraint = c; # store
      
            <context switch>
              perf_task_event_sched_in()
                ...
                  x86_schedule_events();
                    event->hw.constraint = c2; # store
      
                    ...
      
                    put_event_constraints(event); # assume failure to schedule
                      intel_put_event_constraints()
                        event->hw.constraint = NULL;
      
            <context switch end>
      
          c = event->hw.constraint; # read -> NULL
      
          if (!test_bit(hwc->idx, c->idxmsk)) # <- *BOOM* NULL deref
      
      This in particular is possible when the event in question is a
      cpu-wide event and group-leader, where the validate_group() tries to
      add an event to the group.
      Reported-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 43b45780 ("perf/x86: Reduce stack usage of x86_schedule_events()")
      Fixes: e979121b ("perf/x86/intel: Implement cross-HT corruption bug workaround")
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      b371b594
  21. 17 4月, 2015 1 次提交