1. 17 Apr, 2015: 3 commits
  2. 12 Apr, 2015: 1 commit
  3. 02 Apr, 2015: 23 commits
  4. 27 Mar, 2015: 5 commits
    • perf: Add per event clockid support · 34f43927
      Committed by Peter Zijlstra
      While thinking about the whole clock discussion it occurred to me that
      we have two distinct uses of time:
      
       1) the tracking of event/ctx/cgroup enabled/running/stopped times
          which includes the self-monitoring support in struct
          perf_event_mmap_page.
      
       2) the actual timestamps visible in the data records.
      
      And we've been conflating them.
      
      The first is all about tracking time deltas; nobody should really care
      in what time base that happens, it's all relative information, and as
      long as it's internally consistent it works.
      
      The second, however, is what people are worried about when having to
      merge their data with external sources. And here we have the discussion
      on MONOTONIC vs. MONOTONIC_RAW, etc.
      
      MONOTONIC is good for correlating between machines (static offset),
      while MONOTONIC_RAW is required for correlating against a fixed-rate
      hardware clock.
      
      This means configurability; now 1) makes that hard because it needs to
      be internally consistent across groups of unrelated events, which is
      why we had to have a global perf_clock().
      
      However, for 2) it doesn't really matter; perf itself doesn't care
      what it writes into the buffer.
      
      The below patch makes the distinction between these two cases by
      adding perf_event_clock() which is used for the second case. It
      further makes this configurable on a per-event basis, but adds a few
      sanity checks such that we cannot combine events with different clocks
      in confusing ways.
      
      And since we then have per-event configurability we might as well
      retain the 'legacy' behaviour as a default. A minimal usage sketch
      follows this entry.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      34f43927
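      A minimal userspace sketch of the per-event clock selection described
      above, assuming kernel headers that already carry the new use_clockid
      and clockid fields of struct perf_event_attr (added by this patch):

      #include <linux/perf_event.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      #include <string.h>
      #include <stdio.h>
      #include <time.h>

      static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                  int cpu, int group_fd, unsigned long flags)
      {
              return syscall(__NR_perf_event_open, attr, pid, cpu,
                             group_fd, flags);
      }

      int main(void)
      {
              struct perf_event_attr attr;
              int fd;

              memset(&attr, 0, sizeof(attr));
              attr.size = sizeof(attr);
              attr.type = PERF_TYPE_HARDWARE;
              attr.config = PERF_COUNT_HW_CPU_CYCLES;
              attr.sample_period = 100000;
              attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TIME;

              /* Ask for MONOTONIC_RAW timestamps in the data records
               * instead of the legacy perf_clock() default. */
              attr.use_clockid = 1;
              attr.clockid = CLOCK_MONOTONIC_RAW;

              fd = perf_event_open(&attr, 0, -1, -1, 0);
              if (fd < 0) {
                      perror("perf_event_open");
                      return 1;
              }
              close(fd);
              return 0;
      }

      Records sampled from this event carry CLOCK_MONOTONIC_RAW timestamps,
      while the enabled/running bookkeeping stays on the internal clock.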
    • perf/x86: Remove redundant calls to perf_pmu_{dis|en}able() · 9332d250
      Committed by David Ahern
      perf_pmu_disable() is called before pmu->add() and perf_pmu_enable() is
      called afterwards, so there is no need to call them inside x86_pmu_add()
      as well; the toy model after this entry illustrates the nesting.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1424281543-67335-1-git-send-email-dsahern@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9332d250
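      A self-contained toy model of that nesting (not kernel code; the
      refcount mirrors how perf_pmu_disable()/perf_pmu_enable() nest, which
      is why the inner pair this patch removes was pure overhead):

      #include <stdio.h>

      static int disable_count;

      /* The PMU is only really touched when the count crosses zero. */
      static void pmu_disable(void) { if (disable_count++ == 0) printf("PMU off\n"); }
      static void pmu_enable(void)  { if (--disable_count == 0) printf("PMU on\n"); }

      static void x86_pmu_add(void)
      {
              /* Before the fix, x86_pmu_add() had its own disable/enable
               * pair here; with the outer pair held, it only moved the
               * refcount, never touching the hardware. */
              pmu_disable();
              printf("  program counter\n");
              pmu_enable();
      }

      int main(void)
      {
              /* The generic perf core already brackets pmu->add(): */
              pmu_disable();
              x86_pmu_add();
              pmu_enable();
              return 0;
      }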
    • perf/x86/intel: Add INST_RETIRED.ALL workarounds · 294fe0f5
      Committed by Andi Kleen
      On Broadwell, INST_RETIRED.ALL cannot be used with any period that
      doesn't have the lowest 6 bits cleared, and the period should not be
      smaller than 128.
      
      These are errata BDM11 and BDM55:
      
        http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/5th-gen-core-family-spec-update.pdf
      
      BDM11: When using a period < 100, we may get incorrect PEBS/PMI
      interrupts and/or an invalid counter state.
      BDM55: When bits 0-5 of the period are non-zero, we may get redundant
      PEBS records on overflow.
      
      Add a new callback to enforce this, and set it for Broadwell.
      
      How does this handle the case when an app requests a specific
      period with some of the bottom bits set?
      
      Short answer:
      
      Any useful instruction sampling period needs to be 4-6 orders of
      magnitude larger than 128, as a PMI every 128 instructions would
      instantly overwhelm the system and be throttled. So the ±64 error from
      this is really small compared to the period, much smaller than normal
      system jitter.
      
      Long answer (by Peterz):
      
      IFF we guarantee perf_event_attr::sample_period >= 128.
      
      Suppose we start out with sample_period=192; then we'll set period_left
      to 192, we'll end up with left = 128 (we truncate the lower bits). We
      get an interrupt, find that period_left = 64 (>0 so we return 0 and
      don't get an overflow handler), up that to 128. Then we trigger again,
      at n=256. Then we find period_left = -64 (<=0 so we return 1 and do get
      an overflow). We increment with sample_period so we get left = 128. We
      fire again, at n=384, period_left = 0 (<=0 so we return 1 and get an
      overflow). And on and on.
      
      So while the individual interrupts are 'wrong', we get them with
      intervals of 256 and 128 in exactly the right ratio to average out at
      192. And this works for everything >= 128.
      
      So the num_samples*fixed_period thing is still entirely correct ±127,
      which is good enough I'd say, as you already have that error anyhow.
      
      So no need to 'fix' the tools; all we need to do is refuse to create
      INST_RETIRED.ALL events with sample_period < 128. The simulation after
      this entry walks through the sample_period=192 example above.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      [ Updated comments and changelog a bit. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1424225886-18652-3-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      294fe0f5
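      To make the long answer concrete, here is a self-contained simulation
      of that arithmetic. It models only the mechanism; the limit_period
      helper name is illustrative, not the kernel's:

      #include <stdio.h>
      #include <stdint.h>

      /* Low 6 bits must be clear; minimum period is 128 (BDM11/BDM55). */
      static int64_t limit_period(int64_t left)
      {
              if (left < 128)
                      left = 128;
              return left & ~0x3fLL;
      }

      int main(void)
      {
              int64_t sample_period = 192, period_left = sample_period;
              int64_t n = 0;          /* instructions retired so far */
              int i;

              for (i = 0; i < 6; i++) {
                      int64_t hw = limit_period(period_left);

                      n += hw;        /* the counter overflows after hw insns */
                      period_left -= hw;
                      if (period_left <= 0) {
                              printf("overflow at n=%lld\n", (long long)n);
                              period_left += sample_period;
                      }
              }
              return 0;
      }

      This prints overflows at n=256, 384, 640, 768: intervals of 256 and 128
      in exactly the ratio that averages out at 192, as the walkthrough says.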
    • perf/x86/intel: Add Broadwell core support · 91f1b705
      Committed by Andi Kleen
      Add Broadwell core support to perf.
      
      The basic support is very similar to Haswell. We use the new cache
      event list added for Haswell earlier. The only differences are a few
      bits related to remote nodes; to avoid an extra, mostly identical
      table, these are patched up in the initialization code.
      
      The constraint list has one new event that needs to be handled beyond
      Haswell.
      
      Includes code and testing from Kan Liang.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1424225886-18652-2-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      91f1b705
    • perf/x86/intel: Add new cache events table for Haswell · 0f1b5ca2
      Committed by Andi Kleen
      Haswell offcore events are quite different from Sandy Bridge.
      Add a new table to handle Haswell properly.
      
      Note that the offcore bits listed in the SDM are not quite correct
      (this is currently being fixed). An up-to-date list of bits is in the
      patch.
      
      The basic setup is similar to Sandy Bridge. The prefetch columns have
      been removed, as prefetch counting is not very reliable on Haswell. One
      L1 event that is no longer in the event list has also been removed. The
      sketch after this entry shows the rough shape of such a table.
      
      - Data reads do not include code reads (comparable to earlier Sandy
        Bridge tables).
      - Data counts include speculative execution (except L1 write, dtlb,
        bpu).
      - Remote node access includes remote memory, remote cache, and remote
        MMIO.
      - Prefetches are not included in the counts, for consistency (different
        from Sandy Bridge, which includes prefetches in the remote node
        counts).
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      [ Removed the HSM30 comments; we don't have them for SNB/IVB either. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1424225886-18652-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0f1b5ca2
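      For orientation, this is the rough shape of such a cache events table,
      indexed by cache level, operation, and result. The 0x01b7 config
      selects the OFFCORE_RESPONSE event commonly used for last-level-cache
      entries; treat the concrete values as illustrative, not as the patch's
      actual Haswell encodings:

      #include <linux/perf_event.h>
      #include <linux/types.h>

      static const u64 hsw_hw_cache_event_ids
                              [PERF_COUNT_HW_CACHE_MAX]
                              [PERF_COUNT_HW_CACHE_OP_MAX]
                              [PERF_COUNT_HW_CACHE_RESULT_MAX] =
      {
              [PERF_COUNT_HW_CACHE_LL] = {
                      [PERF_COUNT_HW_CACHE_OP_READ] = {
                              /* OFFCORE_RESPONSE; the response-type bits
                               * live in a separate extra-regs table keyed
                               * by the same indices. */
                              [PERF_COUNT_HW_CACHE_RESULT_ACCESS] = 0x01b7,
                              [PERF_COUNT_HW_CACHE_RESULT_MISS]   = 0x01b7,
                      },
              },
      };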
  5. 23 Mar, 2015: 2 commits
  6. 28 Feb, 2015: 1 commit
  7. 25 Feb, 2015: 5 commits
    • perf/x86/intel: Enable conflicting event scheduling for CQM · 59bf7fd4
      Committed by Matt Fleming
      We can leverage the workqueue that we use for RMID rotation to support
      scheduling of conflicting monitoring events. Allowing events that
      monitor conflicting things is done at various other places in the perf
      subsystem, so there's precedent there.
      
      An example of two conflicting events would be monitoring a cgroup and
      simultaneously monitoring a task within that cgroup.
      
      This uses the cache_groups list as a queuing mechanism, where every
      event that reaches the front of the list gets the chance to be scheduled
      in, possibly descheduling any conflicting events that are running.
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Link: http://lkml.kernel.org/r/1422038748-21397-10-git-send-email-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      59bf7fd4
    • perf/x86/intel: Perform rotation on Intel CQM RMIDs · bff671db
      Committed by Matt Fleming
      There are many use cases where people will want to monitor more tasks
      than there are RMIDs in the hardware, meaning that we have to perform
      some kind of multiplexing.
      
      We do this by "rotating" the RMIDs in a workqueue, and assigning an RMID
      to a waiting event when the RMID becomes unused.
      
      This scheme reserves one RMID at all times for rotation. When we need
      to schedule a new event we give it the reserved RMID, pick a victim
      event from the front of the global CQM list, and wait for the victim's
      RMID to drop to zero occupancy before it becomes the new reserved RMID.
      
      We put the victim's RMID onto the limbo list, where it resides for a
      "minimum queue time", which is intended to save ourselves an expensive
      SMP IPI when the RMID is unlikely to have an occupancy value below
      __intel_cqm_threshold.
      
      If we fail to recycle an RMID even after waiting the minimum queue
      time, we need to increment __intel_cqm_threshold. There is an upper
      bound on this threshold, __intel_cqm_max_threshold, which is
      programmable from userland as
      /sys/devices/intel_cqm/max_recycling_threshold.
      
      The comments above __intel_cqm_rmid_rotate() have more details; a toy
      model of the recycling loop follows this entry.
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Link: http://lkml.kernel.org/r/1422038748-21397-9-git-send-email-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bff671db
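      A self-contained toy model of that recycling loop (illustrative only;
      the real driver drives this from a workqueue and reads occupancy from
      hardware):

      #include <stdio.h>
      #include <stdbool.h>

      #define MIN_QUEUE_TIME  2       /* ticks an RMID sits in limbo */

      struct rmid_entry {
              int rmid;
              int limbo_ticks;        /* time spent on the limbo list */
              int occupancy;          /* stale LLC data still tagged */
      };

      static int threshold;           /* __intel_cqm_threshold analogue */

      /* Recyclable once the stale occupancy has decayed below the
       * threshold and the minimum queue time has passed. */
      static bool rmid_recyclable(struct rmid_entry *e)
      {
              return e->limbo_ticks >= MIN_QUEUE_TIME &&
                     e->occupancy <= threshold;
      }

      int main(void)
      {
              struct rmid_entry victim = { .rmid = 1, .occupancy = 10 };
              int tick;

              for (tick = 0; tick < 8; tick++) {
                      victim.limbo_ticks++;
                      victim.occupancy /= 2;  /* occupancy decays over time */
                      if (rmid_recyclable(&victim)) {
                              printf("tick %d: RMID %d recycled\n",
                                     tick, victim.rmid);
                              break;
                      }
                      /* Still dirty after the minimum queue time: relax
                       * the threshold, as the driver does. */
                      if (victim.limbo_ticks >= MIN_QUEUE_TIME)
                              threshold++;
                      printf("tick %d: RMID %d still dirty (occupancy %d)\n",
                             tick, victim.rmid, victim.occupancy);
              }
              return 0;
      }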
    • perf/x86/intel: Support task events with Intel CQM · bfe1fcd2
      Committed by Matt Fleming
      Add support for task events as well as system-wide events. This change
      has a big impact on the way that we gather LLC occupancy values in
      intel_cqm_event_read().
      
      Currently, for system-wide (per-cpu) events we defer processing to
      userspace, which knows how to discard all but one cpu result per
      package.
      Things aren't so simple for task events because we need to do the value
      aggregation ourselves. To do this, we defer updating the LLC occupancy
      value in event->count from intel_cqm_event_read() and do an SMP
      cross-call to read values for all packages in intel_cqm_event_count().
      We need to ensure that we only do this for one task event per cache
      group, otherwise we'll report duplicate values.
      
      If we're a system-wide event we want to fall back to the default
      perf_event_count() implementation. Refactor this into a common function
      so that we don't duplicate the code.
      
      Also, introduce PERF_TYPE_INTEL_CQM, since we need a way to track an
      event's task (if the event isn't per-cpu) inside of the Intel CQM PMU
      driver. This task information is only available in the upper layers of
      the perf infrastructure.
      
      Other perf backends stash the target task in event->hw.*target so we
      need to do something similar. The task is used to determine whether
      events should share a cache group and an RMID. A sketch of the
      cross-call read follows this entry.
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/1422038748-21397-8-git-send-email-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bfe1fcd2
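      A sketch of the cross-call pattern described above, in heavily
      simplified kernel style. The read_llc_occupancy() and event_rmid()
      helpers and the exact signatures are assumptions for illustration;
      only the overall shape follows the commit message:

      #include <linux/perf_event.h>
      #include <linux/atomic.h>
      #include <linux/smp.h>

      /* One CPU per package, maintained elsewhere (e.g. at hotplug). */
      static cpumask_t cqm_cpumask;

      struct rmid_read {
              u32 rmid;
              atomic64_t value;
      };

      static void __rmid_read_on_cpu(void *info)
      {
              struct rmid_read *rr = info;

              /* One CPU per package reads its package's occupancy
               * counter for rr->rmid and accumulates it.
               * read_llc_occupancy() is a hypothetical helper. */
              atomic64_add(read_llc_occupancy(rr->rmid), &rr->value);
      }

      static u64 intel_cqm_event_count(struct perf_event *event)
      {
              /* event_rmid() is a hypothetical helper. */
              struct rmid_read rr = { .rmid = event_rmid(event) };

              /* Only one task event per cache group may do this,
               * otherwise we would report duplicate values. */
              smp_call_function_many(&cqm_cpumask, __rmid_read_on_cpu,
                                     &rr, true);
              return atomic64_read(&rr.value);
      }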
    • perf/x86/intel: Implement LRU monitoring ID allocation for CQM · 35298e55
      Committed by Matt Fleming
      It's possible to run into issues when re-using unused monitoring IDs,
      because there may be stale cachelines associated with an ID from a
      previous allocation. This can cause the LLC occupancy values to be
      inaccurate.
      
      To mitigate this problem we place the IDs on a least recently used
      list, essentially a FIFO: the longer the time between re-uses of an ID,
      the lower the probability that stale cachelines for it still exist in
      the cache. A sketch of this free list follows this entry.
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Link: http://lkml.kernel.org/r/1422038748-21397-7-git-send-email-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      35298e55
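      A sketch of that FIFO free list in kernel style (names are
      illustrative; the point is allocate-from-head, free-to-tail):

      #include <linux/list.h>
      #include <linux/types.h>

      struct cqm_rmid_entry {
              u32 rmid;
              struct list_head list;
      };

      static LIST_HEAD(cqm_rmid_free_lru);

      /* Allocate the least recently used RMID: take from the head. */
      static struct cqm_rmid_entry *__get_rmid(void)
      {
              struct cqm_rmid_entry *entry;

              if (list_empty(&cqm_rmid_free_lru))
                      return NULL;
              entry = list_first_entry(&cqm_rmid_free_lru,
                                       struct cqm_rmid_entry, list);
              list_del(&entry->list);
              return entry;
      }

      /* Free to the tail, maximizing the time before re-use so that any
       * stale cachelines have a chance to be evicted. */
      static void __put_rmid(struct cqm_rmid_entry *entry)
      {
              list_add_tail(&entry->list, &cqm_rmid_free_lru);
      }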
    • perf/x86/intel: Add Intel Cache QoS Monitoring support · 4afbb24c
      Committed by Matt Fleming
      Future Intel Xeon processors support a Cache QoS Monitoring feature
      that allows tracking of the LLC occupancy for a task or task group,
      i.e. the amount of data pulled into the LLC for the task (or task
      group).
      
      Currently the PMU only supports per-cpu events. We create an event for
      each cpu and read out all the LLC occupancy values.
      
      Because this results in duplicate values being written out to
      userspace, we also export a .per-pkg event file so that the perf tools
      only accumulate values for one cpu per package. A userspace sketch of
      locating this PMU follows this entry.
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Link: http://lkml.kernel.org/r/1422038748-21397-6-git-send-email-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4afbb24c
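      A userspace sketch of locating the dynamic PMU this driver registers.
      The "intel_cqm" name matches the sysfs path quoted earlier in this
      series; the "llc_occupancy" event name is an assumption here:

      #include <stdio.h>

      int main(void)
      {
              /* Dynamic PMUs appear under
               * /sys/bus/event_source/devices/<name>/. */
              FILE *f = fopen("/sys/bus/event_source/devices/intel_cqm/type",
                              "r");
              int type;

              if (!f) {
                      perror("intel_cqm PMU not present");
                      return 1;
              }
              if (fscanf(f, "%d", &type) != 1) {
                      fclose(f);
                      return 1;
              }
              fclose(f);
              printf("intel_cqm dynamic PMU type = %d\n", type);
              /* perf_event_open() with attr.type = type (and the config
               * read from .../events/llc_occupancy) then creates one event
               * per cpu; the .per-pkg file tells tooling to keep only one
               * value per package. */
              return 0;
      }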