1. 27 May, 2015 4 commits
    • perf/x86/intel: Add lockdep assert · b32ed7f5
      Committed by Peter Zijlstra
      Lockdep is very good at finding incorrect IRQ state while locking and
      is far better at telling us if we hold a lock than the _is_locked()
      API. It also generates less code for !DEBUG kernels.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
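The distinction drawn in the changelog above, between "somebody holds the lock" and "we hold the lock", can be illustrated with a small userspace sketch. This is not kernel code: `struct demo_lock`, `demo_acquire()`, `demo_release()`, `demo_is_locked()` and `demo_held_by()` are hypothetical stand-ins, and an integer task id replaces the per-task tracking that lockdep does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock model: records whether it is taken and by which task. */
struct demo_lock {
	bool locked;
	int owner;	/* task id of the holder, -1 when free */
};

static void demo_acquire(struct demo_lock *l, int task)
{
	assert(!l->locked);	/* single-threaded model: no contention */
	l->locked = true;
	l->owner = task;
}

static void demo_release(struct demo_lock *l, int task)
{
	assert(l->locked && l->owner == task);
	l->locked = false;
	l->owner = -1;
}

/* Analogue of an _is_locked() query: true if anybody holds the lock. */
static bool demo_is_locked(const struct demo_lock *l)
{
	return l->locked;
}

/* Analogue of a "held by me" assertion: true only for the actual holder. */
static bool demo_held_by(const struct demo_lock *l, int task)
{
	return l->locked && l->owner == task;
}
```

If task 1 takes the lock, `demo_is_locked()` is true from the point of view of task 2 as well, while `demo_held_by()` is true only for task 1, which is the property a caller usually wants to assert.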
    • perf/x86/intel: Correct local vs remote sibling state · 1c565833
      Committed by Peter Zijlstra
      For some obscure reason the current code accounts the current SMT
      thread's state on the remote thread and reads the remote's state on
      the local SMT thread.
      
      While internally consistent, and 'correct', it's pointless confusion we
      can do without.
      
      Flip them the right way around.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Improve HT workaround GP counter constraint · cc1790cf
      Committed by Peter Zijlstra
      The (SNB/IVB/HSW) HT bug only affects events that can be programmed
      onto GP counters, therefore we should only limit the number of GP
      counters that can be used per cpu -- in other words, we should not
      constrain the fixed-purpose (FP) counters.
      
      Furthermore, we should only enforce such a limit when there are in fact
      exclusive events being scheduled on either sibling.
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [ Fixed build fail for the !CONFIG_CPU_SUP_INTEL case. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Fix event/group validation · b371b594
      Committed by Peter Zijlstra
      Commit 43b45780 ("perf/x86: Reduce stack usage of
      x86_schedule_events()") violated the rule that 'fake' scheduling, as
      used for event/group validation, should not change the event state.
      
      This went mostly unnoticed because repeated calls of
      x86_pmu::get_event_constraints() would give the same result, and
      x86_pmu::put_event_constraints() would mostly not do anything.
      
      Commit e979121b ("perf/x86/intel: Implement cross-HT corruption
      bug workaround") made the situation much worse by actually setting the
      event->hw.constraint value to NULL, so when validation and actual
      scheduling interact we get NULL ptr derefs.
      
      Fix it by removing the constraint pointer from the event and moving it
      back to an array, this time in cpuc instead of on the stack.
      
      validate_group()
        x86_schedule_events()
          event->hw.constraint = c; # store
      
            <context switch>
              perf_task_event_sched_in()
                ...
                  x86_schedule_events();
                    event->hw.constraint = c2; # store
      
                    ...
      
                    put_event_constraints(event); # assume failure to schedule
                      intel_put_event_constraints()
                        event->hw.constraint = NULL;
      
            <context switch end>
      
          c = event->hw.constraint; # read -> NULL
      
          if (!test_bit(hwc->idx, c->idxmsk)) # <- *BOOM* NULL deref
      
      This is possible in particular when the event in question is a
      cpu-wide event and group-leader, where validate_group() tries to
      add an event to the group.
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 43b45780 ("perf/x86: Reduce stack usage of x86_schedule_events()")
      Fixes: e979121b ("perf/x86/intel: Implement cross-HT corruption bug workaround")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
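The interleaving traced in the changelog above can be replayed in a small userspace model. This is only a sketch under stated assumptions: `struct event`, `struct cpu_ctx`, `c_validate`, `c_real` and both functions are hypothetical simplifications, not the kernel's data structures. The first function keeps the constraint slot in the event, so the real scheduling pass clobbers what validation stored; the second keeps one array per scheduling context, so the 'fake' context used for validation is left untouched.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 4

struct constraint { unsigned long idxmsk; };

/* Old layout (hypothetical model): the slot lives in the event itself. */
struct event { struct constraint *constraint; };

/* New layout (hypothetical model): one slot per event, owned by the context. */
struct cpu_ctx { struct constraint *event_constraint[MAX_EVENTS]; };

static struct constraint c_validate = { 0x3 };
static struct constraint c_real = { 0xc };

/* Returns what validation reads back when the slot is shared via the event. */
static struct constraint *validate_with_event_field(void)
{
	struct event ev = { NULL };

	ev.constraint = &c_validate;	/* validate_group(): store */
	/* <context switch>: real scheduling touches the same event */
	ev.constraint = &c_real;	/* x86_schedule_events(): store */
	ev.constraint = NULL;		/* put_event_constraints() after failure */
	/* <context switch end> */
	return ev.constraint;		/* read -> NULL */
}

/* Same interleaving, but each context owns its own constraint array. */
static struct constraint *validate_with_cpuc_array(int slot)
{
	struct cpu_ctx fake = { { NULL } };	/* used only for validation */
	struct cpu_ctx real = { { NULL } };	/* used for actual scheduling */

	assert(slot >= 0 && slot < MAX_EVENTS);
	fake.event_constraint[slot] = &c_validate;	/* validation: store */
	real.event_constraint[slot] = &c_real;		/* scheduling: store */
	real.event_constraint[slot] = NULL;		/* scheduling: put */
	assert(real.event_constraint[slot] == NULL);
	return fake.event_constraint[slot];		/* still &c_validate */
}
```

In this model the first function returns NULL, the value that is later dereferenced in the trace, while the second still returns the constraint that validation stored.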
  2. 08 May, 2015 1 commit
  3. 22 April, 2015 1 commit
    • perf/x86/intel: Add cpu_(prepare|starting|dying) for core_pmu · 3b6e0421
      Committed by Jiri Olsa
      The core_pmu does not define the cpu_* callbacks, which handle
      allocation of 'struct cpu_hw_events::shared_regs' data,
      initialization of debug store and PMU_FL_EXCL_CNTRS counters.
      
      While this probably won't happen on bare metal, a virtual CPU can
      define x86_pmu.extra_regs together with PMU version 1 and thus
      be using core_pmu -> using shared_regs data without it being
      allocated. That could lead to the following panic:
      
      	BUG: unable to handle kernel NULL pointer dereference at (null)
      	IP: [<ffffffff8152cd4f>] _spin_lock_irqsave+0x1f/0x40
      
      	SNIP
      
      	 [<ffffffff81024bd9>] __intel_shared_reg_get_constraints+0x69/0x1e0
      	 [<ffffffff81024deb>] intel_get_event_constraints+0x9b/0x180
      	 [<ffffffff8101e815>] x86_schedule_events+0x75/0x1d0
      	 [<ffffffff810586dc>] ? check_preempt_curr+0x7c/0x90
      	 [<ffffffff810649fe>] ? try_to_wake_up+0x24e/0x3e0
      	 [<ffffffff81064ba2>] ? default_wake_function+0x12/0x20
      	 [<ffffffff8109eb16>] ? autoremove_wake_function+0x16/0x40
      	 [<ffffffff810577e9>] ? __wake_up_common+0x59/0x90
      	 [<ffffffff811a9517>] ? __d_lookup+0xa7/0x150
      	 [<ffffffff8119db5f>] ? do_lookup+0x9f/0x230
      	 [<ffffffff811a993a>] ? dput+0x9a/0x150
      	 [<ffffffff8119c8f5>] ? path_to_nameidata+0x25/0x60
      	 [<ffffffff8119e90a>] ? __link_path_walk+0x7da/0x1000
      	 [<ffffffff8101d8f9>] ? x86_pmu_add+0xb9/0x170
      	 [<ffffffff8101d7a7>] x86_pmu_commit_txn+0x67/0xc0
      	 [<ffffffff811b07b0>] ? mntput_no_expire+0x30/0x110
      	 [<ffffffff8119c731>] ? path_put+0x31/0x40
      	 [<ffffffff8107c297>] ? current_fs_time+0x27/0x30
      	 [<ffffffff8117d170>] ? mem_cgroup_get_reclaim_stat_from_page+0x20/0x70
      	 [<ffffffff8111b7aa>] group_sched_in+0x13a/0x170
      	 [<ffffffff81014a29>] ? sched_clock+0x9/0x10
      	 [<ffffffff8111bac8>] ctx_sched_in+0x2e8/0x330
      	 [<ffffffff8111bb7b>] perf_event_sched_in+0x6b/0xb0
      	 [<ffffffff8111bc36>] perf_event_context_sched_in+0x76/0xc0
      	 [<ffffffff8111eb3b>] perf_event_comm+0x1bb/0x2e0
      	 [<ffffffff81195ee9>] set_task_comm+0x69/0x80
      	 [<ffffffff81195fe1>] setup_new_exec+0xe1/0x2e0
      	 [<ffffffff811ea68e>] load_elf_binary+0x3ce/0x1ab0
      
      Add cpu_(prepare|starting|dying) for core_pmu so that the
      shared_regs data is allocated for core_pmu. AFAICS there's no harm
      in initializing debug store and PMU_FL_EXCL_CNTRS for core_pmu
      either.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Link: http://lkml.kernel.org/r/20150421152623.GC13169@krava.redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 17 April, 2015 1 commit
  5. 02 April, 2015 16 commits
  6. 27 March, 2015 3 commits
    • perf/x86/intel: Add INST_RETIRED.ALL workarounds · 294fe0f5
      Committed by Andi Kleen
      On Broadwell INST_RETIRED.ALL cannot be used with any period
      that doesn't have the lowest 6 bits cleared. And the period
      should not be smaller than 128.
      
      This is erratum BDM11 and BDM55:
      
        http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/5th-gen-core-family-spec-update.pdf
      
      BDM11: When using a period < 100, we may get incorrect PEBS/PMI
      interrupts and/or an invalid counter state.
      BDM55: When bit0-5 of the period are !0 we may get redundant PEBS
      records on overflow.
      
      Add a new callback to enforce this, and set it for Broadwell.
      
      How does this handle the case when an app requests a specific
      period with some of the bottom bits set?
      
      Short answer:
      
      Any useful instruction sampling period needs to be 4-6 orders
      of magnitude larger than 128, as a PMI every 128 instructions
      would instantly overwhelm the system and be throttled.
      So the +-64 error from this is really small compared to the
      period, much smaller than normal system jitter.
      
      Long answer (by Peterz):
      
      IFF we guarantee perf_event_attr::sample_period >= 128.
      
      Suppose we start out with sample_period=192; then we'll set period_left
      to 192, we'll end up with left = 128 (we truncate the lower bits). We
      get an interrupt, find that period_left = 64 (>0 so we return 0 and
      don't get an overflow handler), up that to 128. Then we trigger again,
      at n=256. Then we find period_left = -64 (<=0 so we return 1 and do get
      an overflow). We increment with sample_period so we get left = 128. We
      fire again, at n=384, period_left = 0 (<=0 so we return 1 and get an
      overflow). And on and on.
      
      So while the individual interrupts are 'wrong' we get them with
      interval=256,128 in exactly the right ratio to average out at 192. And
      this works for everything >=128.
      
      So the num_samples*fixed_period thing is still entirely correct +- 127,
      which is good enough I'd say, as you already have that error anyhow.
      
      So no need to 'fix' the tools, all we need to do is refuse to create
      INST_RETIRED.ALL events with sample_period < 128.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      [ Updated comments and changelog a bit. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1424225886-18652-3-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Add Broadwell core support · 91f1b705
      Committed by Andi Kleen
      Add Broadwell support to perf.
      
      The basic support is very similar to Haswell. We use the new cache
      event list added for Haswell earlier. The only differences
      are a few bits related to remote nodes. To avoid an extra,
      mostly identical, table these are patched up in the initialization code.
      
      The constraint list has one new event that needs to be handled over Haswell.
      
      Includes code and testing from Kan Liang.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1424225886-18652-2-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Add new cache events table for Haswell · 0f1b5ca2
      Committed by Andi Kleen
      Haswell offcore events are quite different from Sandy Bridge.
      Add a new table to handle Haswell properly.
      
      Note that the offcore bits listed in the SDM are not quite correct
      (this is currently being fixed). An up-to-date list of bits is
      in the patch.
      
      The basic setup is similar to Sandy Bridge. The prefetch columns
      have been removed, as prefetch counting is not very reliable
      on Haswell. One L1 event that is no longer in the event list
      has also been removed.
      
      - data reads do not include code reads (comparable to earlier Sandy Bridge tables)
      - data counts include speculative execution (except L1 write, dtlb, bpu)
      - remote node access includes remote memory, remote cache and remote mmio.
      - prefetches are not included in the counts for consistency
        (different from Sandy Bridge, which includes prefetches in the remote node)
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      [ Removed the HSM30 comments; we don't have them for SNB/IVB either. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1424225886-18652-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 19 February, 2015 3 commits
  8. 28 January, 2015 1 commit
  9. 29 October, 2014 1 commit
    • perf/x86/intel: Revert incomplete and undocumented Broadwell client support · 1776b106
      Committed by Ingo Molnar
      These patches:
      
        86a349a2 ("perf/x86/intel: Add Broadwell core support")
        c46e665f ("perf/x86: Add INST_RETIRED.ALL workarounds")
        fdda3c4a ("perf/x86/intel: Use Broadwell cache event list for Haswell")
      
      introduced magic constants and unexplained changes:
      
        https://lkml.org/lkml/2014/10/28/1128
        https://lkml.org/lkml/2014/10/27/325
        https://lkml.org/lkml/2014/8/27/546
        https://lkml.org/lkml/2014/10/28/546
      
      Peter Zijlstra has attempted to help out, to clean up the mess:
      
        https://lkml.org/lkml/2014/10/28/543
      
      But he has not received helpful and constructive replies, which makes
      me doubt whether it can all be finished in time before v3.18 is
      released.
      
      Despite various review feedback the author (Andi Kleen) has answered
      only a few of the review questions and has generally been uncooperative,
      only giving replies when prompted repeatedly, and only giving minimal
      answers instead of constructively explaining and helping along the effort.
      
      That kind of behavior is not acceptable.
      
      There's also a boot crash on Intel E5-1630 v3 CPUs reported for another
      commit from Andi Kleen:
      
        e735b9db ("perf/x86/intel/uncore: Add Haswell-EP uncore support")
      
        https://lkml.org/lkml/2014/10/22/730
      
      Which is not yet resolved. The uncore driver is independent in theory,
      but the crash makes me worry about how well all these patches were
      tested and makes me uneasy about the level of intermingling that the
      Broadwell and Haswell code has received by the commits above.
      
      As a first step to resolve the mess revert the Broadwell client commits
      back to the v3.17 version, before we run out of time and problematic
      code hits a stable upstream kernel.
      
      ( If the Haswell-EP crash is not resolved via a simple fix then we'll have
        to revert the Haswell-EP uncore driver as well. )
      
      The Broadwell client series has to be submitted in a clean fashion, with
      single, well documented changes per patch. If they are submitted in time
      and are accepted during review then they can possibly go into v3.19 but
      will need additional scrutiny due to the rocky history of this patch set.
      
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: eranian@google.com
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1409683455-29168-3-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 24 September, 2014 5 commits
  11. 27 August, 2014 1 commit
    • x86: Replace __get_cpu_var uses · 89cbc767
      Committed by Christoph Lameter
      __get_cpu_var() is used for multiple purposes in the kernel source. One of
      them is address calculation via the form &__get_cpu_var(x).  This calculates
      the address for the instance of the percpu variable of the current processor
      based on an offset.
      
      Other use cases are for storing and retrieving data from the current
      processor's percpu area.  __get_cpu_var() can be used as an lvalue when
      writing data or on the right side of an assignment.
      
      __get_cpu_var() is defined as:
      
      #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
      
      __get_cpu_var() always only does an address determination. However, store
      and retrieve operations could use a segment prefix (or global register on
      other platforms) to avoid the address calculation.
      
      this_cpu_write() and this_cpu_read() can directly take an offset into a
      percpu area and use optimized assembly code to read and write per cpu
      variables.
      
      This patch converts __get_cpu_var into either an explicit address
      calculation using this_cpu_ptr() or into a use of this_cpu operations that
      use the offset.  Thereby address calculations are avoided and fewer registers
      are used when code is generated.
      
      Transformations done to __get_cpu_var()
      
      1. Determine the address of the percpu instance of the current processor.
      
      	DEFINE_PER_CPU(int, y);
      	int *x = &__get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(&y);
      
      2. Same as #1 but this time an array structure is involved.
      
      	DEFINE_PER_CPU(int, y[20]);
      	int *x = __get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(y);
      
      3. Retrieve the content of the current processor's instance of a per cpu
      variable.
      
      	DEFINE_PER_CPU(int, y);
      	int x = __get_cpu_var(y);
      
         Converts to
      
      	int x = __this_cpu_read(y);
      
      4. Retrieve the content of a percpu struct
      
      	DEFINE_PER_CPU(struct mystruct, y);
      	struct mystruct x = __get_cpu_var(y);
      
         Converts to
      
      	memcpy(&x, this_cpu_ptr(&y), sizeof(x));
      
      5. Assignment to a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y) = x;
      
         Converts to
      
      	__this_cpu_write(y, x);
      
      6. Increment/Decrement etc of a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y)++;
      
         Converts to
      
      	__this_cpu_inc(y);
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86@kernel.org
      Acked-by: H. Peter Anvin <hpa@linux.intel.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
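The shape of the conversions listed above (cases 1, 3, 5 and 6) can be tried out in userspace with stand-in macros. Everything below is an assumption made for illustration: the `MOCK_`/`mock_` macros, `current_cpu`, `set_cpu()` and the `hits` variable are hypothetical, and a plain array indexed by a fake CPU number replaces the kernel's real percpu areas and segment-prefixed accesses.

```c
#include <assert.h>

#define NR_CPUS 4

/* Stand-in for "the CPU we are currently running on" (hypothetical). */
static int current_cpu;

static void set_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	current_cpu = cpu;
}

/* Userspace mock-ups: one array slot per CPU instead of a percpu area. */
#define MOCK_DEFINE_PER_CPU(type, name)	type name[NR_CPUS]
#define mock_this_cpu_ptr(ptr)		(&(*(ptr))[current_cpu])
#define mock_this_cpu_read(var)		((var)[current_cpu])
#define mock_this_cpu_write(var, val)	((var)[current_cpu] = (val))
#define mock_this_cpu_inc(var)		((var)[current_cpu]++)

static MOCK_DEFINE_PER_CPU(int, hits);

/* Case 1: address of this CPU's instance (was &__get_cpu_var(hits)). */
static int *hits_address(void)
{
	return mock_this_cpu_ptr(&hits);
}

/* Case 3: read this CPU's instance (was x = __get_cpu_var(hits)). */
static int hits_read(void)
{
	return mock_this_cpu_read(hits);
}

/* Case 5: assign to this CPU's instance (was __get_cpu_var(hits) = v). */
static void hits_write(int v)
{
	mock_this_cpu_write(hits, v);
}

/* Case 6: increment this CPU's instance (was __get_cpu_var(hits)++). */
static void hits_inc(void)
{
	mock_this_cpu_inc(hits);
}
```

After `set_cpu(1)`, a `hits_write(5)` followed by `hits_inc()` leaves 6 in CPU 1's slot, while the slot seen after `set_cpu(2)` is still 0. Only the pointer form needs an explicit address; the read, write and increment forms name the variable directly, which is what lets the real operations avoid the address calculation.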
  12. 13 August, 2014 2 commits
  13. 16 July, 2014 1 commit
    • perf/x86/intel: Protect LBR and extra_regs against KVM lying · 338b522c
      Committed by Kan Liang
      With -cpu host, KVM reports LBR and extra_regs support if the host has
      support.
      
      When the guest perf driver tries to access an LBR or extra_regs MSR,
      every such access takes a #GP, since KVM doesn't handle LBR and
      extra_regs support. So check the access rights of the related MSRs once
      at initialization time to avoid the faulting accesses at runtime.
      
      To reproduce the issue, build the host kernel with CONFIG_KVM_INTEL=y
      and the guest kernel with CONFIG_PARAVIRT=n and CONFIG_KVM_GUEST=n.
      Start the guest with -cpu host.
      Run perf record with --branch-any or --branch-filter in the guest to
      trigger the LBR #GP.
      Run perf stat with offcore events (e.g. LLC-loads/LLC-load-misses ...)
      in the guest to trigger the offcore_rsp #GP.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
      Cc: Mark Davies <junk@eslaf.co.uk>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Yan, Zheng <zheng.z.yan@intel.com>
      Link: http://lkml.kernel.org/r/1405365957-20202-1-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>