1. 22 Feb 2014, 1 commit
  2. 09 Feb 2014, 2 commits
  3. 16 Jan 2014, 1 commit
    • perf/x86/amd/ibs: Fix waking up from S3 for AMD family 10h · bee09ed9
      Committed by Robert Richter
      On AMD family 10h we see the following error messages while waking up
      from S3 for all non-boot CPUs, leading to a failed IBS initialization:
      
       Enabling non-boot CPUs ...
       smpboot: Booting Node 0 Processor 1 APIC 0x1
       [Firmware Bug]: cpu 1, try to use APIC500 (LVT offset 0) for vector 0x400, but the register is already in use for vector 0xf9 on another cpu
       perf: IBS APIC setup failed on cpu #1
       process: Switch to broadcast mode on CPU1
       CPU1 is up
       ...
       ACPI: Waking up from system sleep state S3
      
      The reason for this is that the LVT offset for the IBS vector is lost
      during suspend and needs to be reinitialized on resume.
      
      The offset is read from the IBSCTL MSR. On family 10h the offset needs
      to be 1, as offset 0 is used for the MCE threshold interrupt, but the
      firmware assigns it to 0 for IBS as well. The kernel needs to reprogram
      the vector. The MSR is a read-only node MSR, but a new value can be
      written via a PCI config space access. The reinitialization is
      implemented for family 10h in setup_ibs_ctl(), which is forced during
      IBS setup.
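
      A minimal sketch of that reprogramming step, assuming the IBSCTL
      register sits at offset 0x1cc in the northbridge's PCI config space
      with a "valid" bit at bit 8 (the register offset, bit layout and the
      helper name are illustrative, not the exact kernel code):

        #include <linux/pci.h>

        #define IBSCTL_REG              0x1cc
        #define IBSCTL_LVT_OFFSET_VALID (1u << 8)
        #define IBSCTL_LVT_OFFSET_MASK  0x0f

        static int ibs_reprogram_lvt_offset(struct pci_dev *nb_dev, unsigned int offset)
        {
                u32 val;

                /* The node MSR is read-only; program the offset via PCI config space. */
                pci_write_config_dword(nb_dev, IBSCTL_REG,
                                       (offset & IBSCTL_LVT_OFFSET_MASK) |
                                       IBSCTL_LVT_OFFSET_VALID);

                /* Read back and check that the new offset actually stuck. */
                pci_read_config_dword(nb_dev, IBSCTL_REG, &val);
                if ((val & IBSCTL_LVT_OFFSET_MASK) != offset)
                        return -EINVAL;

                return 0;
        }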
      
      This patch fixes IBS setup after waking up from S3 by adding
      resume/suspend hooks for the boot CPU which perform the offset
      reinitialization.
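
      A sketch of how such a resume hook could be wired up, assuming a
      hypothetical perf_ibs_reinit() wrapper around the setup path sketched
      above (the hook names here are illustrative):

        #include <linux/init.h>
        #include <linux/syscore_ops.h>

        extern void perf_ibs_reinit(void);      /* hypothetical: re-runs setup_ibs_ctl() */

        static void perf_ibs_syscore_resume(void)
        {
                /* The LVT offset was lost across S3; program it again. */
                perf_ibs_reinit();
        }

        static struct syscore_ops perf_ibs_syscore_ops = {
                .resume = perf_ibs_syscore_resume,
        };

        static void __init perf_ibs_pm_init(void)
        {
                register_syscore_ops(&perf_ibs_syscore_ops);
        }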
      
      Marking it as stable to let distros pick up this fix.
      Signed-off-by: Robert Richter <rric@kernel.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org> v3.2..
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1389797849-5565-1-git-send-email-rric.net@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 15 Jan 2014, 2 commits
  5. 14 Jan 2014, 1 commit
  6. 13 Jan 2014, 2 commits
    • sched/clock, x86: Use a static_key for sched_clock_stable · 35af99e6
      Committed by Peter Zijlstra
      In order to avoid the runtime condition and variable load, turn
      sched_clock_stable into a static_key.
      
      Also provide a shorter implementation of local_clock() and
      cpu_clock(int) when sched_clock_stable==1.
      
                              MAINLINE   PRE       POST
      
          sched_clock_stable: 1          1         1
          (cold) sched_clock: 329841     221876    215295
          (cold) local_clock: 301773     234692    220773
          (warm) sched_clock: 38375      25602     25659
          (warm) local_clock: 100371     33265     27242
          (warm) rdtsc:       27340      24214     24208
          sched_clock_stable: 0          0         0
          (cold) sched_clock: 382634     235941    237019
          (cold) local_clock: 396890     297017    294819
          (warm) sched_clock: 38194      25233     25609
          (warm) local_clock: 143452     71234     71232
          (warm) rdtsc:       27345      24245     24243
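
      A minimal sketch of the idea, assuming sched_clock_cpu() as the slow
      path (the key and helper names here are illustrative, not the exact
      kernel code):

        #include <linux/jump_label.h>
        #include <linux/sched.h>
        #include <linux/smp.h>

        static struct static_key __sched_clock_stable = STATIC_KEY_INIT_FALSE;

        u64 local_clock(void)
        {
                /* Patched-in fast path when the TSC is known stable: */
                if (static_key_false(&__sched_clock_stable))
                        return sched_clock();

                /* Otherwise fall back to the per-CPU filtered clock. */
                return sched_clock_cpu(raw_smp_processor_id());
        }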
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-eummbdechzz37mwmpags1gjr@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs · 20d1c86a
      Committed by Peter Zijlstra
      Use a ring-buffer-like multi-version object structure which always
      allows having a coherent object; we use this to avoid having to
      disable IRQs while reading sched_clock(), and it avoids a problem when
      an NMI arrives while the cyc2ns data is being changed.
      
                              MAINLINE   PRE        POST
      
          sched_clock_stable: 1          1          1
          (cold) sched_clock: 329841     331312     257223
          (cold) local_clock: 301773     310296     309889
          (warm) sched_clock: 38375      38247      25280
          (warm) local_clock: 100371     102713     85268
          (warm) rdtsc:       27340      27289      24247
          sched_clock_stable: 0          0          0
          (cold) sched_clock: 382634     372706     301224
          (cold) local_clock: 396890     399275     399870
          (warm) sched_clock: 38194      38124      25630
          (warm) local_clock: 143452     148698     129629
          (warm) rdtsc:       27345      27365      24307
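
      A minimal sketch of the multi-version (latch) idea, with illustrative
      field names and the memory barriers left out for brevity; the real
      code also handles the NMI-during-update case more carefully:

        #include <linux/math64.h>

        struct cyc2ns_data {
                u32 cyc2ns_mul;
                u32 cyc2ns_shift;
                u64 cyc2ns_offset;
        };

        struct cyc2ns_latch {
                unsigned int            seq;    /* low bit selects the live slot */
                struct cyc2ns_data      data[2];
        };

        static u64 cycles_to_ns(struct cyc2ns_latch *l, u64 cyc)
        {
                struct cyc2ns_data snap;
                unsigned int seq;

                do {
                        seq = READ_ONCE(l->seq);
                        snap = l->data[seq & 1];        /* copy the published slot */
                } while (READ_ONCE(l->seq) != seq);     /* retry if a writer moved on */

                /* ns = (cyc * mul) >> shift, using a 64x32 widening multiply */
                return mul_u64_u32_shr(cyc, snap.cyc2ns_mul, snap.cyc2ns_shift) +
                       snap.cyc2ns_offset;
        }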
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-s567in1e5ekq2nlyhn8f987r@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 12 Jan 2014, 2 commits
  8. 07 Jan 2014, 1 commit
  9. 04 Jan 2014, 1 commit
  10. 21 Dec 2013, 1 commit
  11. 20 Dec 2013, 1 commit
  12. 17 Dec 2013, 1 commit
  13. 05 Dec 2013, 1 commit
  14. 30 Nov 2013, 1 commit
  15. 27 Nov 2013, 2 commits
    • perf/x86: Add RAPL hrtimer support · 65661f96
      Committed by Stephane Eranian
      The RAPL PMU counters do not interrupt on overflow.
      Therefore, the kernel needs to poll the counters
      to avoid missing an overflow. This patch adds
      the hrtimer code to do this.
      
      The timer interval is calculated at boot time
      based on the power unit used by the HW.
      
      There is one hrtimer per CPU to handle the case
      of multiple simultaneous users across cores on
      the same package, plus CPU hotplug.
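
      A minimal sketch of the polling idea: one hrtimer per CPU whose
      callback folds the active counters' MSR deltas into the event counts
      before they can wrap (struct rapl_pmu's fields and the
      rapl_event_update() helper are illustrative names):

        #include <linux/hrtimer.h>
        #include <linux/perf_event.h>
        #include <linux/spinlock.h>

        struct rapl_pmu {
                spinlock_t              lock;
                struct list_head        active_list;
                ktime_t                 timer_interval; /* derived from the HW energy unit */
                struct hrtimer          hrtimer;
        };

        static void rapl_event_update(struct perf_event *event);       /* illustrative */

        static enum hrtimer_restart rapl_hrtimer_handle(struct hrtimer *hrtimer)
        {
                struct rapl_pmu *pmu = container_of(hrtimer, struct rapl_pmu, hrtimer);
                struct perf_event *event;
                unsigned long flags;

                spin_lock_irqsave(&pmu->lock, flags);
                list_for_each_entry(event, &pmu->active_list, active_entry)
                        rapl_event_update(event);       /* read MSR, accumulate delta */
                spin_unlock_irqrestore(&pmu->lock, flags);

                /* re-arm for the interval computed at boot time */
                hrtimer_forward_now(hrtimer, pmu->timer_interval);
                return HRTIMER_RESTART;
        }

        static void rapl_hrtimer_init(struct rapl_pmu *pmu)
        {
                hrtimer_init(&pmu->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
                pmu->hrtimer.function = rapl_hrtimer_handle;
        }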
      
      Thanks to Maria Dimakopoulou for her contributions
      to this patch especially on the math aspects.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Reviewed-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      [ Applied 32-bit build fix. ]
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: acme@redhat.com
      Cc: jolsa@redhat.com
      Cc: zheng.z.yan@intel.com
      Cc: bp@alien8.de
      Cc: maria.n.dimakopoulou@gmail.com
      Link: http://lkml.kernel.org/r/1384275531-10892-5-git-send-email-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Add Intel RAPL PMU support · 4788e5b4
      Committed by Stephane Eranian
      This patch adds a new uncore PMU to expose the Intel
      RAPL energy consumption counters. Up to 3 counters,
      each counting a particular RAPL event, are exposed.
      
      The RAPL counters are available on Intel SandyBridge,
      IvyBridge and Haswell. The server SKUs add a 3rd counter.
      
      The following events are available and exposed in sysfs:
      
        - power/energy-cores: power consumption of all cores on socket
        - power/energy-pkg: power consumption of all cores + LLC cache
        - power/energy-dram: power consumption of DRAM (servers only)
      
      For each event both the unit (Joules) and scale (2^-32 J)
      are exposed in sysfs for use by perf stat and other tools.
      The files are:
      
      	/sys/devices/power/events/energy-*.unit
      	/sys/devices/power/events/energy-*.scale
      
      The RAPL PMU is uncore by nature and is implemented such
      that it only works in system-wide mode. Measuring only
      one CPU per socket is sufficient. The /sys/devices/power/cpumask
      file can be used by tools to figure out which CPUs to monitor
      by default. For instance, on a 2-socket system, 2 CPUs
      (one on each socket) will be shown.
      
      All the counters measure in the same unit (exposed via sysfs).
      The perf_events API exposes all RAPL counters as 64-bit integers
      counting in units of 1/2^32 Joules (about 0.23 nJ). User-level tools
      must convert the counts by multiplying them by 2^-32 to obtain
      Joules. The reason for this is that the kernel avoids
      doing floating point math whenever possible because it is
      expensive (user floating-point state must be saved). The method
      used avoids kernel floating-point usage. There is no loss of
      precision. Thanks to PeterZ for suggesting this approach.
      
      To convert a raw count C to Joules, compute ldexp(C, -32);
      the average power in Watts over a measurement interval is then:
         W = C * 2^-32 / time  (approximately C * 2.3 / (1e10 * time))
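
      For example, a user-space tool could apply the scale like this (a
      minimal sketch; the raw count would come from reading the RAPL perf
      event, and the program must be linked with -lm):

        #include <math.h>
        #include <stdint.h>

        /* Convert a raw RAPL count to Joules: count * 2^-32 J. */
        static double rapl_count_to_joules(uint64_t count)
        {
                return ldexp((double)count, -32);
        }

        /* Average power over the measurement interval, in Watts. */
        static double rapl_average_watts(uint64_t count, double seconds)
        {
                return rapl_count_to_joules(count) / seconds;
        }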
      
      RAPL PMU is a new standalone PMU which registers with the
      perf_event core subsystem. The PMU type (attr->type) is
      dynamically allocated and is available from /sys/devices/power/type.
      
      Sampling is not supported by the RAPL PMU. There is no
      privilege level filtering either.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Reviewed-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: acme@redhat.com
      Cc: jolsa@redhat.com
      Cc: zheng.z.yan@intel.com
      Cc: bp@alien8.de
      Link: http://lkml.kernel.org/r/1384275531-10892-4-git-send-email-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  16. 07 Nov 2013, 1 commit
  17. 06 Nov 2013, 4 commits
  18. 31 Oct 2013, 1 commit
  19. 29 Oct 2013, 1 commit
    • perf/x86: Fix NMI measurements · e8a923cc
      Committed by Peter Zijlstra
      OK, so what I'm actually seeing on my WSM is that sched/clock.c is
      'broken' for the purpose we're using it for.
      
      What triggered it is that my WSM-EP is broken :-(
      
        [    0.001000] tsc: Fast TSC calibration using PIT
        [    0.002000] tsc: Detected 2533.715 MHz processor
        [    0.500180] TSC synchronization [CPU#0 -> CPU#6]:
        [    0.505197] Measured 3 cycles TSC warp between CPUs, turning off TSC clock.
        [    0.004000] tsc: Marking TSC unstable due to check_tsc_sync_source failed
      
      For some reason it consistently detects TSC skew, even though NHM+
      should have a single clock domain for 'reasonable' systems.
      
      This marks sched_clock_stable=0, which means that we do fancy stuff to
      try and get a 'sane' clock. Part of this fancy stuff relies on the tick;
      clearly that's gone when NOHZ=y. So for idle CPUs time gets stuck until
      the CPU either wakes up or gets kicked by another CPU.
      
      While this is perfectly fine for the scheduler -- it only cares about
      actually running stuff, and when we're running stuff we're obviously not
      idle -- it does somewhat break down for perf, which can trigger events
      just fine on an otherwise idle CPU.
      
      So I've got NMIs that get 'measured' as taking ~1ms, but which actually
      don't last nearly that long:
      
                <idle>-0     [013] d.h.   886.311970: rcu_nmi_enter <-do_nmi
        ...
                <idle>-0     [013] d.h.   886.311997: perf_sample_event_took: HERE!!! : 1040990
      
      So ftrace (which uses sched_clock(), not the fancy bits) only sees
      ~27us, but we measure ~1ms !!
      
      Now since all this measurement stuff lives in x86 code, we can actually
      fix it.
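
      A minimal sketch of the direction of the fix: time the NMI handler
      with sched_clock() (plain TSC-based, sane on idle CPUs) rather than
      the NOHZ-affected local_clock(); the surrounding structure and the
      threshold variable are illustrative, not the exact nmi.c code:

        #include <linux/printk.h>
        #include <linux/sched.h>
        #include <linux/time.h>
        #include <asm/nmi.h>

        static u64 nmi_longest_ns = NSEC_PER_MSEC;      /* illustrative threshold */

        static int time_one_nmi_handler(struct nmiaction *a, unsigned int type,
                                        struct pt_regs *regs)
        {
                u64 before, delta;
                int handled;

                before  = sched_clock();        /* not local_clock() */
                handled = a->handler(type, regs);
                delta   = sched_clock() - before;

                if (delta > nmi_longest_ns)     /* warn about genuinely slow handlers */
                        pr_warn("NMI handler (%ps) took: %lld ns\n",
                                a->handler, delta);
                return handled;
        }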
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: mingo@kernel.org
      Cc: dave.hansen@linux.intel.com
      Cc: eranian@google.com
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: jmario@redhat.com
      Cc: acme@infradead.org
      Link: http://lkml.kernel.org/r/20131017133350.GG3364@laptop.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  20. 26 Oct 2013, 1 commit
    • x86/cpu: Track legacy CPU model data only on 32-bit kernels · 09dc68d9
      Committed by Jan Beulich
      struct cpu_dev's c_models is only ever set inside CONFIG_X86_32
      conditionals (or code that's being built for 32-bit only), so
      there's no use in reserving the (empty) space for the model
      names in a 64-bit kernel.
      
      Similarly, c_size_cache is only used in the #else of a
      CONFIG_X86_64 conditional, so reserving space for (and in one
      case even initializing) that field is pointless for 64-bit
      kernels too.
      
      While moving both fields to the end of the structure, I also
      noticed that:
      
       - the c_models array size was one too small, potentially causing
         table_lookup_model() to return garbage on Intel CPUs (intel.c's
         instance was lacking the sentinel with family being zero), so the
         patch bumps that by one,
      
       - c_models' vendor sub-field was unused (and anyway redundant
         with the base structure's c_x86_vendor field), so the patch deletes it.
      
      Also rename the legacy fields so that their legacy nature stands out
      and comment their declarations.
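
      A sketch of the resulting layout, assuming field names along the lines
      the commit describes (illustrative, not the exact declaration in the
      kernel header):

        struct cpu_dev {
                const char      *c_vendor;
                const char      *c_ident[2];            /* vendor ID strings */
                void            (*c_init)(struct cpuinfo_x86 *);
                int             c_x86_vendor;
        #ifdef CONFIG_X86_32
                /* Legacy, 32-bit only fields, kept at the end: */
                struct legacy_cpu_model_info {
                        int             family;
                        const char      *model_names[16];
                }               legacy_models[5];       /* one extra slot for the sentinel */
                unsigned int    (*legacy_cache_size)(struct cpuinfo_x86 *,
                                                     unsigned int);
        #endif
        };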
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Link: http://lkml.kernel.org/r/5265036802000078000FC4DB@nat28.tlf.novell.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  21. 24 Oct 2013, 1 commit
  22. 16 Oct 2013, 1 commit
    • perf/x86: Optimize intel_pmu_pebs_fixup_ip() · 9536c8d2
      Committed by Peter Zijlstra
      There have been reports of high NMI handler overhead, highlighted by
      kernel messages such as these:
      
        [ 3697.380195] perf samples too long (10009 > 10000), lowering kernel.perf_event_max_sample_rate to 13000
        [ 3697.389509] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 9.331 msecs
      
      Don Zickus analyzed the source of the overhead and reported:
      
       > While there are a few places that are causing latencies, for now I focused on
       > the longest one first.  It seems to be 'copy_user_from_nmi'
       >
       > intel_pmu_handle_irq ->
       >	intel_pmu_drain_pebs_nhm ->
       >		__intel_pmu_drain_pebs_nhm ->
       >			__intel_pmu_pebs_event ->
       >				intel_pmu_pebs_fixup_ip ->
       >					copy_from_user_nmi
       >
       > In intel_pmu_pebs_fixup_ip(), if the while-loop goes over 50, the sum of
       > all the copy_from_user_nmi latencies seems to go over 1,000,000 cycles
       > (there are some cases where only 10 iterations are needed to go that high
       > too, but in general over 50 or so).  At this point copy_user_from_nmi
       > seems to account for over 90% of the nmi latency.
      
      The solution to that is to avoid having to call copy_from_user_nmi() for
      every instruction.
      
      Since we already limit the max basic block size, we can easily
      pre-allocate a piece of memory to copy the entire thing into in one
      go.
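
      A minimal sketch of that idea (the buffer name and size are
      illustrative, and the return-value check assumes copy_from_user()-style
      semantics, i.e. non-zero means bytes were left uncopied):

        #include <linux/mm.h>
        #include <linux/percpu.h>
        #include <linux/uaccess.h>

        #define PEBS_FIXUP_SIZE PAGE_SIZE

        static DEFINE_PER_CPU(void *, insn_buf);        /* kmalloc'ed at PMU init */

        static void *pebs_copy_basic_block(unsigned long from, unsigned long to)
        {
                void *buf = this_cpu_read(insn_buf);
                unsigned long size = min(to - from, (unsigned long)PEBS_FIXUP_SIZE);

                /* One NMI-safe user copy for the whole block, instead of one
                 * copy_from_user_nmi() call per decoded instruction. */
                if (copy_from_user_nmi(buf, (void __user *)from, size))
                        return NULL;

                return buf;
        }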
      
      Don reported this test result:
      
       > Your patch made a huge difference in improvement.  The
       > copy_from_user_nmi() no longer hits the million of cycles.  I still
       > have a batch of 100,000-300,000 cycles.  My longest NMI paths used
       > to be dominated by copy_from_user_nmi, now it is not (I have to dig
       > up the new hot path).
      Reported-and-tested-by: Don Zickus <dzickus@redhat.com>
      Cc: jmario@redhat.com
      Cc: acme@infradead.org
      Cc: dave.hansen@linux.intel.com
      Cc: eranian@google.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20131016105755.GX10651@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  23. 14 Oct 2013, 2 commits
  24. 13 Oct 2013, 1 commit
  25. 11 Oct 2013, 2 commits
  26. 04 Oct 2013, 3 commits
  27. 28 Sep 2013, 1 commit
  28. 25 Sep 2013, 1 commit