1. 13 Jan 2014 (2 commits)
    • sched/clock, x86: Use a static_key for sched_clock_stable · 35af99e6
      Committed by Peter Zijlstra
      In order to avoid the runtime condition and variable load, turn
      sched_clock_stable into a static_key.
      
      Also provide a shorter implementation of local_clock() and
      cpu_clock(int) when sched_clock_stable==1.
      
                              MAINLINE   PRE       POST
      
          sched_clock_stable: 1          1         1
          (cold) sched_clock: 329841     221876    215295
          (cold) local_clock: 301773     234692    220773
          (warm) sched_clock: 38375      25602     25659
          (warm) local_clock: 100371     33265     27242
          (warm) rdtsc:       27340      24214     24208
          sched_clock_stable: 0          0         0
          (cold) sched_clock: 382634     235941    237019
          (cold) local_clock: 396890     297017    294819
          (warm) sched_clock: 38194      25233     25609
          (warm) local_clock: 143452     71234     71232
          (warm) rdtsc:       27345      24245     24243
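
      As a rough illustration of the technique (a minimal sketch, not the
      verbatim patch; names follow kernel/sched/clock.c), the stable-TSC
      fast path can be gated on a static_key, so the common case costs a
      patched-in NOP instead of a memory load plus conditional branch:

        #include <linux/jump_label.h>   /* struct static_key, static_key_false() */
        #include <linux/sched.h>        /* sched_clock(), sched_clock_cpu() */

        static struct static_key __sched_clock_stable = STATIC_KEY_INIT;

        static inline int sched_clock_stable(void)
        {
                return static_key_false(&__sched_clock_stable);
        }

        u64 local_clock(void)
        {
                u64 clock;

                /* Fast path: TSC is stable, sched_clock() alone is good enough. */
                if (static_key_false(&__sched_clock_stable))
                        return sched_clock();

                /* Slow path: per-cpu filtered clock, needs preemption disabled. */
                preempt_disable();
                clock = sched_clock_cpu(smp_processor_id());
                preempt_enable();

                return clock;
        }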
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-eummbdechzz37mwmpags1gjr@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs · 20d1c86a
      Committed by Peter Zijlstra
      Use a ring-buffer-like, multi-version object structure which ensures a
      reader always sees a coherent object; this lets us avoid disabling IRQs
      while reading sched_clock(), and it also avoids problems when an NMI
      arrives while the cyc2ns data is being changed.
      
                              MAINLINE   PRE        POST
      
          sched_clock_stable: 1          1          1
          (cold) sched_clock: 329841     331312     257223
          (cold) local_clock: 301773     310296     309889
          (warm) sched_clock: 38375      38247      25280
          (warm) local_clock: 100371     102713     85268
          (warm) rdtsc:       27340      27289      24247
          sched_clock_stable: 0          0          0
          (cold) sched_clock: 382634     372706     301224
          (cold) local_clock: 396890     399275     399870
          (warm) sched_clock: 38194      38124      25630
          (warm) local_clock: 143452     148698     129629
          (warm) rdtsc:       27345      27365      24307
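
      A minimal sketch of the multi-version idea (simplified; the structure
      and helpers below are illustrative, not the exact code added by this
      patch; it is essentially the pattern that later became the seqcount
      "latch"): two copies of the conversion parameters plus a sequence
      counter let an NMI-safe reader always find one fully written copy,
      so no IRQ disabling is needed:

        #include <linux/types.h>
        #include <linux/compiler.h>     /* ACCESS_ONCE() */
        #include <asm/barrier.h>        /* smp_wmb(), smp_rmb() */

        struct cyc2ns_data {
                u32 mul;
                u32 shift;
                u64 offset;
        };

        static struct cyc2ns_data cyc2ns_copy[2];
        static unsigned int       cyc2ns_seq;

        /* Writer: update both copies, flipping readers away from the copy
         * currently being written. Called when the cyc2ns factors change. */
        static void cyc2ns_update(const struct cyc2ns_data *new)
        {
                cyc2ns_seq++;                   /* readers now use copy[1] */
                smp_wmb();
                cyc2ns_copy[0] = *new;
                smp_wmb();
                cyc2ns_seq++;                   /* readers move back to copy[0] */
                smp_wmb();
                cyc2ns_copy[1] = *new;
        }

        /* Reader: lockless and NMI-safe; retries if the writer raced us. */
        static u64 cyc2ns_to_ns(u64 cyc)
        {
                struct cyc2ns_data snap;
                unsigned int seq;

                do {
                        seq = ACCESS_ONCE(cyc2ns_seq);
                        smp_rmb();
                        snap = cyc2ns_copy[seq & 1];
                        smp_rmb();
                } while (seq != ACCESS_ONCE(cyc2ns_seq));

                /* The real code uses a 128-bit multiply helper to avoid
                 * overflow; a plain multiply-and-shift keeps the sketch short. */
                return ((cyc * snap.mul) >> snap.shift) + snap.offset;
        }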
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-s567in1e5ekq2nlyhn8f987r@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 20 Dec 2013 (1 commit)
  3. 05 Dec 2013 (1 commit)
  4. 07 Nov 2013 (1 commit)
  5. 06 Nov 2013 (4 commits)
  6. 31 Oct 2013 (1 commit)
  7. 29 Oct 2013 (1 commit)
    • perf/x86: Fix NMI measurements · e8a923cc
      Committed by Peter Zijlstra
      OK, so what I'm actually seeing on my WSM is that sched/clock.c is
      'broken' for the purpose we're using it for.
      
      What triggered it is that my WSM-EP is broken :-(
      
        [    0.001000] tsc: Fast TSC calibration using PIT
        [    0.002000] tsc: Detected 2533.715 MHz processor
        [    0.500180] TSC synchronization [CPU#0 -> CPU#6]:
        [    0.505197] Measured 3 cycles TSC warp between CPUs, turning off TSC clock.
        [    0.004000] tsc: Marking TSC unstable due to check_tsc_sync_source failed
      
      For some reason it consistently detects TSC skew, even though NHM+
      should have a single clock domain for 'reasonable' systems.
      
      This marks sched_clock_stable=0, which means that we do fancy stuff to
      try and get a 'sane' clock. Part of this fancy stuff relies on the tick,
      which is clearly gone when NOHZ=y. So on idle cpus time gets stuck until
      they either wake up or get kicked by another cpu.
      
      While this is perfectly fine for the scheduler -- it only cares about
      actually running stuff, and when we're running stuff we're obviously not
      idle -- it does somewhat break down for perf, which can trigger events
      just fine on an otherwise idle cpu.
      
      So I've got NMIs that get 'measured' as taking ~1ms, even though they
      actually don't last nearly that long:
      
                <idle>-0     [013] d.h.   886.311970: rcu_nmi_enter <-do_nmi
        ...
                <idle>-0     [013] d.h.   886.311997: perf_sample_event_took: HERE!!! : 1040990
      
      So ftrace (which uses sched_clock(), not the fancy bits) only sees
      ~27us, but we measure ~1ms !!
      
      Now since all this measurement stuff lives in x86 code, we can actually
      fix it.
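
      A hedged sketch of the kind of fix described (simplified; names follow
      arch/x86/kernel/cpu/perf_event.c but this is not the verbatim patch):
      time the NMI handler with sched_clock(), the raw TSC-based clock on
      x86, rather than the fancy idle-sensitive clock:

        static int perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
        {
                u64 start_clock, finish_clock;
                int ret;

                start_clock = sched_clock();    /* raw clock, not local_clock() */
                ret = x86_pmu.handle_irq(regs);
                finish_clock = sched_clock();

                perf_sample_event_took(finish_clock - start_clock);

                return ret;
        }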
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: mingo@kernel.org
      Cc: dave.hansen@linux.intel.com
      Cc: eranian@google.com
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: jmario@redhat.com
      Cc: acme@infradead.org
      Link: http://lkml.kernel.org/r/20131017133350.GG3364@laptop.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 26 Oct 2013 (1 commit)
    • x86/cpu: Track legacy CPU model data only on 32-bit kernels · 09dc68d9
      Committed by Jan Beulich
      struct cpu_dev's c_models is only ever set inside CONFIG_X86_32
      conditionals (or code that's being built for 32-bit only), so
      there's no point in reserving the (empty) space for the model
      names in a 64-bit kernel.
      
      Similarly, c_size_cache is only used in the #else of a
      CONFIG_X86_64 conditional, so reserving space for (and in one
      case even initializing) that field is pointless for 64-bit
      kernels too.
      
      While moving both fields to the end of the structure, I also
      noticed that:
      
       - the c_models array size was one too small, potentially causing
         table_lookup_model() to return garbage on Intel CPUs (intel.c's
         instance was lacking the sentinel with family being zero), so the
         patch bumps that by one,
      
       - c_models' vendor sub-field was unused (and anyway redundant
         with the base structure's c_x86_vendor field), so the patch deletes it.
      
      Also rename the legacy fields so that their legacy nature stands out
      and comment their declarations.
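
      The rough shape of struct cpu_dev after such a change (illustrative;
      field names approximate arch/x86/kernel/cpu/cpu.h, and the extra
      sentinel slot reflects the size bump described above):

        struct legacy_cpu_model_info {
                int             family;
                const char      *model_names[16];
        };

        struct cpu_dev {
                const char      *c_vendor;
                const char      *c_ident[2];
                void            (*c_early_init)(struct cpuinfo_x86 *);
                void            (*c_init)(struct cpuinfo_x86 *);
                int             c_x86_vendor;

        #ifdef CONFIG_X86_32
                /* Optional vendor-specific routine to obtain the cache size. */
                unsigned int    (*legacy_cache_size)(struct cpuinfo_x86 *,
                                                     unsigned int);
                /* Family-based model-name table, including a zero sentinel. */
                struct legacy_cpu_model_info    legacy_models[5];
        #endif
        };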
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Link: http://lkml.kernel.org/r/5265036802000078000FC4DB@nat28.tlf.novell.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. 24 Oct 2013 (1 commit)
  10. 16 Oct 2013 (1 commit)
    • perf/x86: Optimize intel_pmu_pebs_fixup_ip() · 9536c8d2
      Committed by Peter Zijlstra
      There have been reports of high NMI handler overhead, highlighted by
      such kernel messages:
      
        [ 3697.380195] perf samples too long (10009 > 10000), lowering kernel.perf_event_max_sample_rate to 13000
        [ 3697.389509] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 9.331 msecs
      
      Don Zickus analyzed the source of the overhead and reported:
      
       > While there are a few places that are causing latencies, for now I focused on
       > the longest one first.  It seems to be 'copy_from_user_nmi'
       >
       > intel_pmu_handle_irq ->
       >	intel_pmu_drain_pebs_nhm ->
       >		__intel_pmu_drain_pebs_nhm ->
       >			__intel_pmu_pebs_event ->
       >				intel_pmu_pebs_fixup_ip ->
       >					copy_from_user_nmi
       >
       > In intel_pmu_pebs_fixup_ip(), if the while-loop goes over 50, the sum of
       > all the copy_from_user_nmi latencies seems to go over 1,000,000 cycles
       > (there are some cases where only 10 iterations are needed to go that high
       > too, but in general over 50 or so).  At this point copy_from_user_nmi
       > seems to account for over 90% of the nmi latency.
      
      The solution to that is to avoid having to call copy_from_user_nmi() for
      every instruction.
      
      Since we already limit the max basic block size, we can easily
      pre-allocate a piece of memory to copy the entire thing into in one
      go.
      
      Don reported this test result:
      
       > Your patch made a huge difference in improvement.  The
       > copy_from_user_nmi() no longer hits millions of cycles.  I still
       > have a batch of 100,000-300,000 cycles.  My longest NMI paths used
       > to be dominated by copy_from_user_nmi, now it is not (I have to dig
       > up the new hot path).
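
      A hedged sketch of the approach (simplified; names approximate
      arch/x86/kernel/cpu/perf_event_intel_ds.c, and the pebs_fetch_block()
      helper below is hypothetical, introduced only for illustration): copy
      the whole bounded basic block [to, ip) into a pre-allocated per-cpu
      buffer with a single copy_from_user_nmi() call, then let the
      instruction decoder walk kernel memory:

        #define PEBS_FIXUP_SIZE  PAGE_SIZE      /* max basic block we handle */

        static DEFINE_PER_CPU(void *, insn_buffer);

        static void *pebs_fetch_block(unsigned long to, unsigned long ip)
        {
                unsigned long size = ip - to;
                unsigned long bytes;
                void *buf = this_cpu_read(insn_buffer);

                if (kernel_ip(ip))
                        return (void *)to;      /* kernel text: read in place */

                if (size > PEBS_FIXUP_SIZE)
                        return NULL;

                /* One NMI-safe user copy instead of one per instruction. */
                bytes = copy_from_user_nmi(buf, (void __user *)to, size);
                if (bytes != size)              /* partial copy: skip the fixup */
                        return NULL;

                return buf;
        }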
      Reported-and-tested-by: Don Zickus <dzickus@redhat.com>
      Cc: jmario@redhat.com
      Cc: acme@infradead.org
      Cc: dave.hansen@linux.intel.com
      Cc: eranian@google.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20131016105755.GX10651@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 14 Oct 2013 (2 commits)
  12. 11 Oct 2013 (2 commits)
  13. 04 Oct 2013 (3 commits)
  14. 28 Sep 2013 (1 commit)
  15. 25 Sep 2013 (1 commit)
  16. 23 Sep 2013 (1 commit)
  17. 20 Sep 2013 (4 commits)
  18. 14 Sep 2013 (1 commit)
  19. 13 Sep 2013 (7 commits)
  20. 12 Sep 2013 (3 commits)
    • perf/x86: Fix uncore PCI fixed counter handling · dbc33f70
      Committed by Stephane Eranian
      There was a bug in the handling of SNB-EP/IVB-EP uncore PCI
      fixed counters, e.g., IMC.
      
      It would cause erratic values to be returned for the IMC
      clockticks event. This was due to a bogus hwc->config value
      which was then written to PCI config space.
      
      The erratic values can be seen via:
      
        $ perf stat -a -C 0 -e uncore_imc_0/clockticks/ -I 1000 sleep 10
      
      The fixed counter has most fields marked as reserved with
      hw reset values of 0. Yet the kernel was defaulting to a
      hwc->config = ~0 and that was causing the issues.
      
      This patch sets the hwc->config value for fixed uncore events to 0.
      Now the IMC clockticks values are correct.
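
      A minimal sketch of the idea (the helper below is hypothetical; the
      real change lives in uncore_pmu_event_init() in
      arch/x86/kernel/cpu/perf_event_intel_uncore.c):

        static u64 uncore_event_config(struct perf_event *event,
                                       struct intel_uncore_pmu *pmu)
        {
                /* Fixed counters: nearly every config bit is reserved,
                 * so program 0 instead of the old catch-all ~0ULL. */
                if (event->attr.config == UNCORE_FIXED_EVENT)
                        return 0ULL;

                return event->attr.config & pmu->type->event_mask;
        }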
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Cc: peterz@infradead.org
      Cc: zheng.z.yan@intel.com
      Link: http://lkml.kernel.org/r/20130909195350.GA17643@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Add constraint for IVB CYCLE_ACTIVITY:CYCLES_LDM_PENDING · 6113af14
      Committed by Stephane Eranian
      The IvyBridge event CYCLE_ACTIVITY:CYCLES_LDM_PENDING can only be
      measured on counters 0-3. The constraint matters when HT is off and
      all eight generic counters are available; when HT is on you only
      have counters 0-3 anyway.
      
      If you program it on all eight counters for 1s on a 3GHz IVB laptop
      running a noploop, you see:
      
                 2 747 527 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
                 2 747 527 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
                 2 747 527 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
                 2 747 527 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
             3 280 563 608 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
             3 280 563 608 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
             3 280 563 608 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
             3 280 563 608 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
      
      Clearly the last 4 values are bogus.
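
      A hedged sketch of what such a constraint entry looks like (the event
      code 0x08a3 and the 0xf counter mask, i.e. counters 0-3, are inferred
      from the description above rather than copied from the actual patch):

        static struct event_constraint intel_ivb_event_constraints[] __read_mostly = {
                /* ... existing IvyBridge constraints ... */
                INTEL_UEVENT_CONSTRAINT(0x08a3, 0xf), /* CYCLE_ACTIVITY.CYCLES_LDM_PENDING */
                EVENT_CONSTRAINT_END
        };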
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Cc: peterz@infradead.org
      Cc: ak@linux.intel.com
      Cc: zheng.z.yan@intel.com
      Cc: dhsharp@google.com
      Link: http://lkml.kernel.org/r/20130911152222.GA28761@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • mm: vmstats: track TLB flush stats on UP too · 6df46865
      Committed by Dave Hansen
      The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
      counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
      for SMP.
      
      UP systems do not do remote TLB flushes, so compile those counters out on
      UP.
      
      arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly.  This is
      probably an optimization since both the mtrr code and __flush_tlb() write
      cr4.  It would probably be safe to make that a flush_tlb_all() (and then
      get these statistics), but the mtrr code is ancient and I'm hesitant to
      touch it other than to just stick in the counters.
      
      [akpm@linux-foundation.org: tweak comments]
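
      A rough sketch of the UP side (hedged: helper names approximate
      arch/x86/include/asm/tlbflush.h, and count_vm_tlb_event() plus the
      NR_TLB_* counters come from the earlier "tlb flush counters" patch):
      wrap the low-level flush helpers so that !SMP builds, which never
      compile arch/x86/mm/tlb.c, still bump the local-flush statistics:

        #ifndef CONFIG_SMP

        static inline void flush_tlb(void)
        {
                count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
                __flush_tlb();
        }

        static inline void flush_tlb_all(void)
        {
                count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
                __flush_tlb_all();
        }

        static inline void flush_tlb_one(unsigned long addr)
        {
                count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
                __flush_tlb_single(addr);
        }

        #endif /* !CONFIG_SMP */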
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 02 Sep 2013 (1 commit)