1. 18 Sep, 2015 3 commits
  2. 13 Sep, 2015 1 commit
  3. 05 Sep, 2015 3 commits
  4. 12 Aug, 2015 1 commit
  5. 04 Aug, 2015 6 commits
  6. 19 Jun, 2015 2 commits
  7. 07 Jun, 2015 3 commits
    • perf/x86/intel: Drain the PEBS buffer during context switches · 9c964efa
      Committed by Yan, Zheng
      Flush the PEBS buffer during context switches if the PEBS interrupt
      threshold is larger than one. This allows perf to supply the TID for
      sample outputs.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1430940834-8964-6-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
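Why flushing on context switch lets perf attach the right TID can be illustrated with a toy model. This is only a sketch, not kernel code: `PebsBuffer`, `run` and the sample addresses are hypothetical names and values made up for illustration. It assumes that buffered records carry no task ID of their own, so each record is attributed to whichever task is current when the buffer is drained.

```python
from dataclasses import dataclass, field


@dataclass
class PebsBuffer:
    """Toy model (hypothetical) of a PEBS buffer whose records carry no TID."""
    threshold: int                                  # records buffered before a PMI
    records: list = field(default_factory=list)     # pending instruction pointers
    samples: list = field(default_factory=list)     # (tid, ip) pairs handed to perf

    def log(self, ip, current_tid):
        """Hardware logs a record; a PMI drains the buffer at the threshold."""
        self.records.append(ip)
        if len(self.records) >= self.threshold:
            self.drain(current_tid)

    def drain(self, current_tid):
        """Attribute every pending record to the task that is current now."""
        self.samples.extend((current_tid, ip) for ip in self.records)
        self.records.clear()

    def context_switch(self, prev_tid, flush):
        """Called while prev_tid is being switched out."""
        if flush and self.threshold > 1:
            self.drain(prev_tid)


def run(flush):
    """Two samples in task 1, a context switch, then two samples in task 2."""
    buf = PebsBuffer(threshold=4)
    buf.log(0x1000, current_tid=1)
    buf.log(0x1004, current_tid=1)
    buf.context_switch(prev_tid=1, flush=flush)
    buf.log(0x2000, current_tid=2)
    buf.log(0x2004, current_tid=2)
    buf.drain(current_tid=2)    # end of run: drain whatever is left
    return buf.samples
```

In this model, `run(flush=False)` attributes all four samples to task 2, while `run(flush=True)` keeps the first two with task 1.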
    • perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS interrupt threshold) · 3569c0d7
      Committed by Yan, Zheng
      PEBS always had the capability to log samples to its buffers without
      an interrupt. Traditionally perf has not used this but always set the
      PEBS threshold to one.
      
      For frequently occurring events (like cycles or branches or load/store)
      this in turn requires using a relatively high sampling period to avoid
      overloading the system, by only processing PMIs. This in turn increases
      sampling error.
      
      For the common cases we still need to use the PMI because the PEBS
      hardware has various limitations. The biggest one is that it can not
      supply a callgraph. It also requires setting a fixed period, as the
      hardware does not support adaptive period. Another issue is that it
      cannot supply a time stamp and some other options. To supply a TID it
      requires flushing on context switch. It can however supply the IP, the
      load/store address, TSX information, registers, and some other things.
      
      So we can make PEBS work for some specific cases, basically as long as
      you can do without a callgraph and can set the period you can use this
      new PEBS mode.
      
      The main benefit is the ability to support a much lower sampling period
      (down to -c 1000) without excessive overhead.

      One use case is, for example, to increase the resolution of the c2c tool.
      Another is double-checking when you suspect the standard sampling has
      too much sampling error.
      
      Some numbers on the overhead, using cycle soak, comparing the elapsed
      time from "kernbench -M -H" between plain (threshold set to one) and
      multi (large threshold).
      
      The test command for plain:
        "perf record --time -e cycles:p -c $period -- kernbench -M -H"
      
      The test command for multi:
        "perf record --no-time -e cycles:p -c $period -- kernbench -M -H"
      
      ( The only difference between the test commands for multi and plain is
        the time stamp option. Since time stamps are not supported with a
        large PEBS threshold, the option can be used as a flag to indicate
        whether the large threshold is enabled during the test. )
      
      	period    plain(Sec)  multi(Sec)  Delta
      	10003     32.7        16.5        16.2
      	20003     30.2        16.2        14.0
      	40003     18.6        14.1        4.5
      	80003     16.8        14.6        2.2
      	100003    16.9        14.1        2.8
      	800003    15.4        15.7        -0.3
      	1000003   15.3        15.2        0.2
      	2000003   15.3        15.1        0.1
      
      With periods below 100003, plain (threshold one) causes much more
      overhead. With a 10003 sampling period, the elapsed time for multi is
      about 2X faster than for plain.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1430940834-8964-5-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
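The "2X" remark can be checked from the table in the commit message above. The snippet below is a small sketch that recomputes the delta and the plain/multi ratio per period. The numbers are copied from that table; `RESULTS` and `speedup` are names introduced here for illustration. Since the published times are rounded to one decimal, a recomputed delta may differ from the published one by 0.1 in the last rows.

```python
# (period, plain_seconds, multi_seconds), copied from the commit message table
RESULTS = [
    (10003, 32.7, 16.5),
    (20003, 30.2, 16.2),
    (40003, 18.6, 14.1),
    (80003, 16.8, 14.6),
    (100003, 16.9, 14.1),
    (800003, 15.4, 15.7),
    (1000003, 15.3, 15.2),
    (2000003, 15.3, 15.1),
]


def speedup(plain, multi):
    """Ratio of elapsed times: values above 1 mean multi finished faster."""
    return plain / multi


for period, plain, multi in RESULTS:
    print(f"{period:>8}  delta={plain - multi:+5.1f}s  "
          f"speedup={speedup(plain, multi):.2f}x")
```

For the 10003 period the ratio is 32.7 / 16.5, just under 2, and the ratio falls towards 1 as the period grows.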
    • perf/x86/intel: Use the PEBS auto reload mechanism when possible · 851559e3
      Committed by Yan, Zheng
      When a fixed period is specified, this patch makes perf use the PEBS
      auto reload mechanism. This makes normal profiling faster, because
      it avoids one costly MSR write in the PMI handler.

      However, the reset value will be loaded by hardware assist. There is a
      small delay compared to the previous non-auto-reload mechanism. The
      delay time is arbitrary, but very small. The assist cost is 400-800
      cycles, assuming common cases with everything cached. The minimum period
      the patch currently uses is 10000. In that extreme case the delay can be
      ~10% of the period if cycles are used.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1430940834-8964-2-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
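The worst-case figure in the commit message above follows from simple arithmetic on the numbers it quotes. The sketch below divides the assist cost by the minimum period. `assist_overhead` and `MIN_PERIOD` are names introduced here; it assumes the event being sampled is cycles, so that the period and the assist cost are in the same unit.

```python
def assist_overhead(assist_cycles, period):
    """Fraction of one sampling period taken up by the hardware reload assist."""
    return assist_cycles / period


MIN_PERIOD = 10000              # minimum period quoted in the commit message
for cycles in (400, 800):       # assist cost range quoted in the commit message
    print(f"{cycles} cycles / {MIN_PERIOD} = "
          f"{assist_overhead(cycles, MIN_PERIOD):.1%}")
```

This gives 4% to 8% of the period, the same order as the "~10%" quoted for the extreme case.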
  8. 27 May, 2015 11 commits
    • sched/topology: Rename topology_thread_cpumask() to topology_sibling_cpumask() · 06931e62
      Committed by Bartosz Golaszewski
      Rename topology_thread_cpumask() to topology_sibling_cpumask()
      for more consistency with scheduler code.
      Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Benoit Cousson <bcousson@baylibre.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Jean Delvare <jdelvare@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/1432645896-12588-2-git-send-email-bgolaszewski@baylibre.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Simplify put_exclusive_constraints() · ba040653
      Committed by Peter Zijlstra
      Don't bother with taking locks if we're not actually going to do
      anything. Also, drop the _irqsave(); this is only ever called
      from IRQ-disabled context.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Remove intel_excl_states::init_state · 43ef205b
      Committed by Peter Zijlstra
      For some obscure reason intel_{start,stop}_scheduling() copy the HT
      state to an intermediate array. This would make sense if we ever were
      to make changes to it which we'd have to discard.

      Except we don't. By the time we call intel_commit_scheduling() we are,
      as the name implies, committed to them. We'll never back out.

      A further hint that it's pointless is that stop_scheduling()
      unconditionally publishes the state.

      So the intermediate array is pointless; modify the state in place and
      kill the extra array.

      And remove the pointless array initialization: INTEL_EXCL_UNUSED == 0.

      Note: all is serialized by intel_excl_cntr::lock.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Remove pointless tests · 1fe684e3
      Committed by Peter Zijlstra
      Both intel_commit_scheduling() and intel_get_excl_constraints() test
      for cntr < 0.

      The only way that can happen (aside from a bug) is through
      validate_event(); however, that is already captured by the
      cpuc->is_fake test.

      So remove these tests and simplify the code.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Clean up intel_commit_scheduling() placement · 0c41e756
      Committed by Peter Zijlstra
      Move the code of intel_commit_scheduling() to the right place, which is
      in between start() and stop().

      No change in functionality.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Make WARN()ings consistent · 17186ccd
      Committed by Peter Zijlstra
      The intel_commit_scheduling() callback is pointlessly different from
      the start and stop scheduling callbacks.

      Furthermore, the constraint should never be NULL, so remove that test.

      Even though we'll never get called (because we NULL the callbacks)
      when !is_ht_workaround_enabled(), put that test in.

      Collapse the (pointless) WARN_ON_ONCE() and bail on !cpuc->excl_cntrs --
      this is doubly pointless, because it's the same condition as
      is_ht_workaround_enabled(), which was already pointless because the
      whole method won't ever be called.

      Furthermore, make all the !excl_cntrs tests WARN_ON_ONCE(); they're all
      pointless because of the above: either the functions
      ({get,put}_excl_constraint) are already predicated on it existing, or
      the is_ht_workaround_enabled() thing is the same test.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Simplify the dynamic constraint code somewhat · aaf932e8
      Committed by Peter Zijlstra
      We have two 'struct event_constraint' local variables in
      intel_get_excl_constraints(): 'cx' and 'c'.

      Instead of using 'cx' after the dynamic allocation, put all 'cx' inside
      the dynamic allocation block and use 'c' outside of it.

      Also use direct assignment to copy the structure; let the compiler
      figure it out.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Add lockdep assert · b32ed7f5
      Committed by Peter Zijlstra
      Lockdep is very good at finding incorrect IRQ state while locking and
      is far better at telling us if we hold a lock than the _is_locked()
      API. It also generates less code for !DEBUG kernels.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Correct local vs remote sibling state · 1c565833
      Committed by Peter Zijlstra
      For some obscure reason the current code accounts the current SMT
      thread's state on the remote thread and reads the remote's state on
      the local SMT thread.

      While internally consistent, and 'correct', it's pointless confusion we
      can do without.

      Flip them the right way around.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Improve HT workaround GP counter constraint · cc1790cf
      Committed by Peter Zijlstra
      The (SNB/IVB/HSW) HT bug only affects events that can be programmed
      onto GP counters, therefore we should only limit the number of GP
      counters that can be used per cpu -- iow we should not constrain the
      FP counters.

      Furthermore, we should only enforce such a limit when there are in fact
      exclusive events being scheduled on either sibling.
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [ Fixed build fail for the !CONFIG_CPU_SUP_INTEL case. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Fix event/group validation · b371b594
      Committed by Peter Zijlstra
      Commit 43b45780 ("perf/x86: Reduce stack usage of
      x86_schedule_events()") violated the rule that 'fake' scheduling; as
      used for event/group validation; should not change the event state.
      
      This went mostly un-noticed because repeated calls of
      x86_pmu::get_event_constraints() would give the same result. And
      x86_pmu::put_event_constraints() would mostly not do anything.
      
      Commit e979121b ("perf/x86/intel: Implement cross-HT corruption
      bug workaround") made the situation much worse by actually setting the
      event->hw.constraint value to NULL, so when validation and actual
      scheduling interact we get NULL ptr derefs.
      
      Fix it by removing the constraint pointer from the event and move it
      back to an array, this time in cpuc instead of on the stack.
      
      validate_group()
        x86_schedule_events()
          event->hw.constraint = c; # store
      
            <context switch>
              perf_task_event_sched_in()
                ...
                  x86_schedule_events();
                    event->hw.constraint = c2; # store
      
                    ...
      
                    put_event_constraints(event); # assume failure to schedule
                      intel_put_event_constraints()
                        event->hw.constraint = NULL;
      
            <context switch end>
      
          c = event->hw.constraint; # read -> NULL
      
          if (!test_bit(hwc->idx, c->idxmsk)) # <- *BOOM* NULL deref
      
      This in particular is possible when the event in question is a
      cpu-wide event and group-leader, where the validate_group() tries to
      add an event to the group.
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 43b45780 ("perf/x86: Reduce stack usage of x86_schedule_events()")
      Fixes: e979121b ("perf/x86/intel: Implement cross-HT corruption bug workaround")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
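The interleaving traced in the commit message above can be replayed deterministically in a toy model. This is a sketch under assumptions, not the kernel code: `Event`, `Cpuc`, `buggy_sequence` and `fixed_sequence` are hypothetical names, and the per-cpuc storage is modelled as a dictionary keyed by event, where the commit message only says the constraint moves "back to an array ... in cpuc".

```python
class Event:
    """Stand-in for a perf event shared by validation and real scheduling."""
    def __init__(self, name):
        self.name = name
        self.hw_constraint = None   # pre-fix: constraint stored on the event


class Cpuc:
    """Stand-in for per-CPU scheduling state; is_fake marks a validation copy."""
    def __init__(self, is_fake=False):
        self.is_fake = is_fake
        self.event_constraint = {}  # post-fix: constraint stored per cpuc


def buggy_sequence(event):
    """Replay the trace with the constraint stored on the shared event."""
    event.hw_constraint = "c"       # validate_group(): store
    event.hw_constraint = "c2"      # context switch: real scheduling stores
    event.hw_constraint = None      # put_event_constraints() after a failure
    return event.hw_constraint      # validation resumes and reads None


def fixed_sequence(event):
    """Same interleaving, but each cpuc keeps its own copy of the constraint."""
    fake, real = Cpuc(is_fake=True), Cpuc()
    fake.event_constraint[event] = "c"      # validation writes its own state
    real.event_constraint[event] = "c2"     # real scheduling writes elsewhere
    real.event_constraint[event] = None     # ... and clears only its own copy
    return fake.event_constraint[event]     # validation still sees "c"
```

In the buggy variant validation reads back `None`, which is the NULL dereference in the trace. In the fixed variant the validation pass never observes the writes made during the context switch.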
  9. 08 May, 2015 1 commit
  10. 22 Apr, 2015 1 commit
    • perf/x86/intel: Add cpu_(prepare|starting|dying) for core_pmu · 3b6e0421
      Committed by Jiri Olsa
      The core_pmu does not define the cpu_* callbacks, which handle
      allocation of 'struct cpu_hw_events::shared_regs' data,
      initialization of debug store and PMU_FL_EXCL_CNTRS counters.

      While this probably won't happen on bare metal, a virtual CPU can
      define x86_pmu.extra_regs together with PMU version 1 and thus
      be using core_pmu -> using shared_regs data without it being
      allocated. That could lead to the following panic:
      
      	BUG: unable to handle kernel NULL pointer dereference at (null)
      	IP: [<ffffffff8152cd4f>] _spin_lock_irqsave+0x1f/0x40
      
      	SNIP
      
      	 [<ffffffff81024bd9>] __intel_shared_reg_get_constraints+0x69/0x1e0
      	 [<ffffffff81024deb>] intel_get_event_constraints+0x9b/0x180
      	 [<ffffffff8101e815>] x86_schedule_events+0x75/0x1d0
      	 [<ffffffff810586dc>] ? check_preempt_curr+0x7c/0x90
      	 [<ffffffff810649fe>] ? try_to_wake_up+0x24e/0x3e0
      	 [<ffffffff81064ba2>] ? default_wake_function+0x12/0x20
      	 [<ffffffff8109eb16>] ? autoremove_wake_function+0x16/0x40
      	 [<ffffffff810577e9>] ? __wake_up_common+0x59/0x90
      	 [<ffffffff811a9517>] ? __d_lookup+0xa7/0x150
      	 [<ffffffff8119db5f>] ? do_lookup+0x9f/0x230
      	 [<ffffffff811a993a>] ? dput+0x9a/0x150
      	 [<ffffffff8119c8f5>] ? path_to_nameidata+0x25/0x60
      	 [<ffffffff8119e90a>] ? __link_path_walk+0x7da/0x1000
      	 [<ffffffff8101d8f9>] ? x86_pmu_add+0xb9/0x170
      	 [<ffffffff8101d7a7>] x86_pmu_commit_txn+0x67/0xc0
      	 [<ffffffff811b07b0>] ? mntput_no_expire+0x30/0x110
      	 [<ffffffff8119c731>] ? path_put+0x31/0x40
      	 [<ffffffff8107c297>] ? current_fs_time+0x27/0x30
      	 [<ffffffff8117d170>] ? mem_cgroup_get_reclaim_stat_from_page+0x20/0x70
      	 [<ffffffff8111b7aa>] group_sched_in+0x13a/0x170
      	 [<ffffffff81014a29>] ? sched_clock+0x9/0x10
      	 [<ffffffff8111bac8>] ctx_sched_in+0x2e8/0x330
      	 [<ffffffff8111bb7b>] perf_event_sched_in+0x6b/0xb0
      	 [<ffffffff8111bc36>] perf_event_context_sched_in+0x76/0xc0
      	 [<ffffffff8111eb3b>] perf_event_comm+0x1bb/0x2e0
      	 [<ffffffff81195ee9>] set_task_comm+0x69/0x80
      	 [<ffffffff81195fe1>] setup_new_exec+0xe1/0x2e0
      	 [<ffffffff811ea68e>] load_elf_binary+0x3ce/0x1ab0
      
      Adding cpu_(prepare|starting|dying) for core_pmu to have
      shared_regs data allocated for core_pmu. AFAICS there's no harm
      to initialize debug store and PMU_FL_EXCL_CNTRS either for
      core_pmu.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Link: http://lkml.kernel.org/r/20150421152623.GC13169@krava.redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 17 Apr, 2015 1 commit
  12. 02 Apr, 2015 7 commits
    • perf/x86/intel: Streamline LBR MSR handling in PMI · 1a78d937
      Committed by Andi Kleen
      The perf PMI currently does unnecessary MSR accesses when
      LBRs are enabled. We use LBR freezing, or when in callstack
      mode force the LBRs to only filter on ring 3.
      
      So there is no need to disable the LBRs explicitly in the
      PMI handler.
      
      Also we always unnecessarily rewrite LBR_SELECT in the LBR
      handler, even though it can never change.
      
       5)               |  /* write_msr: MSR_LBR_SELECT(1c8), value 0 */
       5)               |  /* read_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */
       5)               |  /* write_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */
       5)               |  /* write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 70000000f */
       5)               |  /* write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0 */
       5)               |  /* write_msr: MSR_LBR_SELECT(1c8), value 0 */
       5)               |  /* read_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */
       5)               |  /* write_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */
      
      This patch:
      
        - Avoids disabling already frozen LBRs unnecessarily in the PMI
        - Avoids changing LBR_SELECT in the PMI
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1426871484-21285-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Reset more state in PMU reset · 8882edf7
      Committed by Andi Kleen
      The PMU reset code didn't quite keep up with newer PMU features.
      Improve it a bit to really reset a modern PMU:

        - Clear all overflow status
        - Clear LBRs and freezing state
        - Disable fixed counters too
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1425059312-18217-2-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Make the HT bug workaround conditional on HT enabled · b37609c3
      Committed by Stephane Eranian
      This patch disables the PMU HT bug workaround when Hyperthreading (HT)
      is disabled. We cannot do this test immediately when perf_events
      is initialized. We need to wait until the topology information
      is set up properly. As such, we register a later initcall, check
      the topology and potentially disable the workaround. To do this,
      we need to ensure there is no user of the PMU. At this point of
      the boot, the only user is the NMI watchdog, thus we disable
      it during the switch and re-enable it right after.

      Having the workaround disabled when it is not needed provides
      some benefits by limiting the overhead in time and space.
      The workaround still ensures correct scheduling of the corrupting
      memory events (0xd0, 0xd1, 0xd2) when HT is off. Those events
      can only be measured on counters 0-3. Something else the current
      kernel did not handle correctly.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: bp@alien8.de
      Cc: jolsa@redhat.com
      Cc: kan.liang@intel.com
      Cc: maria.n.dimakopoulou@gmail.com
      Link: http://lkml.kernel.org/r/1416251225-17721-13-git-send-email-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Limit to half counters when the HT workaround is enabled, to avoid exclusive mode starvation · c02cdbf6
      Committed by Stephane Eranian
      This patch limits the number of counters available to each CPU when
      the HT bug workaround is enabled.

      This is necessary to avoid situations of counter starvation. Such can
      arise from a configuration where one HT thread, HT0, is using all 4 counters
      with corrupting events which require exclusion of the sibling HT, HT1.

      In such a case, HT1 would not be able to schedule any event until HT0
      is done. To mitigate this problem, this patch artificially limits
      the number of counters to 2.

      That way, we can guarantee that at least 2 counters are not in exclusive
      mode and therefore allow the sibling thread to schedule events of the
      same type (system vs. per-thread). The 2 counters are not determined
      in advance. We simply set the limit to two events per HT.

      This helps mitigate starvation in case of events with specific counter
      constraints such as PREC_DIST.

      Note that this does not eliminate the starvation in all cases. But
      it is better than not having it.

      (Solution suggested by Peter Zijlstra.)
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: bp@alien8.de
      Cc: jolsa@redhat.com
      Cc: kan.liang@intel.com
      Cc: maria.n.dimakopoulou@gmail.com
      Link: http://lkml.kernel.org/r/1416251225-17721-11-git-send-email-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
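The starvation argument in the commit message above comes down to counting. The sketch below, assuming the 4 counters per hyper-thread used in the commit's example, shows how many counters remain usable by HT1 while HT0 holds counters in exclusive mode, with and without the per-thread limit. `NUM_GP_COUNTERS` and `free_for_sibling` are names introduced here for illustration only.

```python
NUM_GP_COUNTERS = 4     # counters per hyper-thread in the commit's example


def free_for_sibling(exclusive_events_on_ht0, limit=None):
    """Counters HT1 can still use while HT0 holds some in exclusive mode.

    limit is the per-thread cap on counters in use (None means no cap).
    """
    usable = NUM_GP_COUNTERS if limit is None else min(limit, NUM_GP_COUNTERS)
    held = min(exclusive_events_on_ht0, usable)
    return NUM_GP_COUNTERS - held


print(free_for_sibling(4))            # no cap: HT0 can take every counter
print(free_for_sibling(4, limit=2))   # cap of 2: HT1 keeps two counters
```

With the cap, HT0's remaining corrupting events have to be multiplexed rather than occupying all counters at once. The model ignores per-event counter constraints, which is why the commit message notes that starvation is mitigated rather than eliminated.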
    • perf/x86/intel: Fix intel_get_event_constraints() for dynamic constraints · a90738c2
      Committed by Stephane Eranian
      With dynamic constraints, we need to restart from the static
      constraints each time intel_get_event_constraints() is called.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
      Cc: bp@alien8.de
      Cc: jolsa@redhat.com
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/1416251225-17721-10-git-send-email-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Enforce HT bug workaround for SNB/IVB/HSW · 93fcf72c
      Committed by Maria Dimakopoulou
      This patch activates the HT bug workaround for the
      SNB/IVB/HSW processors. This covers non-PEBS mode.
      Activation is done through the constraint tables.

      Both client and server processors need this workaround.
      Signed-off-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Stephane Eranian <eranian@google.com>
      Cc: bp@alien8.de
      Cc: jolsa@redhat.com
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/1416251225-17721-8-git-send-email-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Implement cross-HT corruption bug workaround · e979121b
      Committed by Maria Dimakopoulou
      This patch implements a software workaround for a HW erratum
      on Intel SandyBridge, IvyBridge and Haswell processors
      with Hyperthreading enabled. The errata are documented for
      each processor in their respective specification update
      documents:
      
        - SandyBridge: BJ122
        - IvyBridge: BV98
        - Haswell: HSD29
      
      The bug causes silent counter corruption across hyperthreads only
      when measuring certain memory events (0xd0, 0xd1, 0xd2, 0xd3).
      Counters measuring those events may leak counts to the sibling
      counter. For instance, counter 0, thread 0 measuring event 0xd0,
      may leak to counter 0, thread 1, regardless of the event measured
      there. The size of the leak is not predictable. It all depends on
      the workload and the state of each sibling hyper-thread. The
      corrupting events do undercount as a consequence of the leak. The
      leak is compensated automatically only when the sibling counter measures
      the exact same corrupting event AND the workload on the two threads
      is the same. Given that there is no way to guarantee this, a workaround
      is necessary. Furthermore, there is a serious problem if the leaked count
      is added to a low-occurrence event. In that case the corruption on
      the low-occurrence event can be very large, e.g., orders of magnitude.
      
      There is no HW or FW workaround for this problem.
      
      The bug is very easy to reproduce on a loaded system.
      Here is an example on a Haswell client, where CPU0, CPU4
      are siblings. We load the CPUs with a simple triad app
      streaming large floating-point vector. We use 0x81d0
      corrupting event (MEM_UOPS_RETIRED:ALL_LOADS) and
      0x20cc (ROB_MISC_EVENTS:LBR_INSERTS). Given we are not
      using the LBR, the 0x20cc event should be zero.
      
        $ taskset -c 0 triad &
        $ taskset -c 4 triad &
        $ perf stat -a -C 0 -e r81d0 sleep 100 &
        $ perf stat -a -C 4 -e r20cc sleep 10
        Performance counter stats for 'system wide':
              139 277 291      r20cc
             10,000969126 seconds time elapsed

      In this example, 0x81d0 and 0x20cc are using sibling counters
      on CPU0 and CPU4. 0x81d0 leaks into 0x20cc and corrupts it
      from 0 to 139 million occurrences.
      
      This patch provides a software workaround to this problem by modifying the
      way events are scheduled onto counters by the kernel. The patch forces
      cross-thread mutual exclusion between counters in case a corrupting event
      is measured by one of the hyper-threads. If thread 0, counter 0 is measuring
      event 0xd0, then nothing can be measured on counter 0, thread 1. If no corrupting
      event is measured on any hyper-thread, event scheduling proceeds as before.
      
      The same example run with the workaround enabled yields the correct answer:

        $ taskset -c 0 triad &
        $ taskset -c 4 triad &
        $ perf stat -a -C 0 -e r81d0 sleep 100 &
        $ perf stat -a -C 4 -e r20cc sleep 10
        Performance counter stats for 'system wide':
              0 r20cc
             10,000969126 seconds time elapsed

      The patch does provide correctness for all non-corrupting events. It does not
      "repatriate" the leaked counts back to the leaking counter. This is planned
      for a second patch series. This patch series makes this repatriation
      easier by guaranteeing the sibling counter is not measuring any useful event.
      
      The patch introduces dynamic constraints for events. That means that events which
      did not have constraints, i.e., could be measured on any counter, may now be
      constrained to a subset of the counters depending on what is going on on the sibling
      thread. The algorithm is similar to a cache coherency protocol. We call it XSU
      in reference to Exclusive, Shared, Unused, the 3 possible states of a PMU
      counter.

      As a consequence of the workaround, users may see an increased amount of event
      multiplexing, even in situations where there are fewer events than counters
      measured on a CPU.

      Patch has been tested on all three impacted processors. Note that when
      HT is off, there is no corruption. However, the workaround is still enabled,
      yet not costing too much. Adding dynamic detection of whether HT is on turned
      out to be complex and to require too much code to be justified.
      
      This patch addresses the issue when PEBS is not used. A subsequent patch
      fixes the problem when PEBS is used.
      Signed-off-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
      [spinlock_t -> raw_spinlock_t]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Stephane Eranian <eranian@google.com>
      Cc: bp@alien8.de
      Cc: jolsa@redhat.com
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/1416251225-17721-7-git-send-email-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
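The XSU idea described in the commit message above can be sketched as a filter over an event's static counter mask, driven by the sibling thread's per-counter state. This is an illustrative model, not the patch itself: `State`, `allowed_counters` and the example states are names and values chosen here. It assumes that a counter the sibling holds Exclusive is unusable for any event, and that a corrupting (exclusive) event additionally cannot use a counter the sibling holds Shared.

```python
from enum import Enum


class State(Enum):
    """The three per-counter states named in the commit message."""
    UNUSED = 0
    SHARED = 1
    EXCLUSIVE = 2


def allowed_counters(static_mask, sibling_states, is_exclusive_event):
    """Return the counters an event may use given the sibling's counter states."""
    allowed = set()
    for idx in static_mask:
        sib = sibling_states[idx]
        if sib is State.EXCLUSIVE:
            continue    # sibling measures a corrupting event on this counter
        if is_exclusive_event and sib is State.SHARED:
            continue    # our corrupting event would leak into the sibling's count
        allowed.add(idx)
    return allowed


# Hypothetical sibling state: counter 0 exclusive, counter 1 shared, 2 and 3 unused.
sibling = [State.EXCLUSIVE, State.SHARED, State.UNUSED, State.UNUSED]
print(allowed_counters({0, 1, 2, 3}, sibling, is_exclusive_event=False))
print(allowed_counters({0, 1, 2, 3}, sibling, is_exclusive_event=True))
```

An event with no static constraint thus ends up limited to a subset of the counters, which is the source of the extra multiplexing the commit message mentions.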