1. 02 Sep, 2020 (2 commits)
    • perf/x86/amd: Add support for Large Increment per Cycle Events · 52ef78f4
      Committed by Kim Phillips
      fix #29035100
      
      commit 5738891229a25e9e678122a843cbf0466a456d0c upstream
      
      Description of hardware operation
      ---------------------------------
      
      The core AMD PMU has a 4-bit wide per-cycle increment for each
      performance monitor counter.  That works for most events, but
      now with AMD Family 17h and above processors, some events can
      occur more than 15 times in a cycle.  Those events are called
      "Large Increment per Cycle" events. In order to count these
      events, two adjacent h/w PMCs get their count signals merged
      to form 8 bits per cycle total.  In addition, the PERF_CTR count
      registers are merged to be able to count up to 64 bits.
      
      Normally, events like instructions retired get programmed on a single
      counter like so:
      
      PERF_CTL0 (MSR 0xc0010200) 0x000000000053ff0c # event 0x0c, umask 0xff
      PERF_CTR0 (MSR 0xc0010201) 0x0000800000000001 # r/w 48-bit count
      
      The next counter at MSRs 0xc0010202-3 remains unused, or can be used
      independently to count something else.
      
      When counting Large Increment per Cycle events, such as FLOPs,
      however, we now have to reserve the next counter and program the
      PERF_CTL (config) register with the Merge event (0xFFF), like so:
      
      PERF_CTL0 (msr 0xc0010200) 0x000000000053ff03 # FLOPs event, umask 0xff
      PERF_CTR0 (msr 0xc0010201) 0x0000800000000001 # rd 64-bit cnt, wr lo 48b
      PERF_CTL1 (msr 0xc0010202) 0x0000000f004000ff # Merge event, enable bit
      PERF_CTR1 (msr 0xc0010203) 0x0000000000000000 # wr hi 16-bits count
      
      The count is widened from the normal 48 bits to 64 bits by having the
      second counter carry the upper 16 bits of the count in the lower 16
      bits of its counter register.
      
      The odd counter, e.g., PERF_CTL1, is programmed with the enabled Merge
      event before the even counter, PERF_CTL0.
      
      The Large Increment feature is available starting with Family 17h.
      For more details, search any Family 17h PPR for the "Large Increment
      per Cycle Events" section, e.g., section 2.1.15.3 on p. 173 in this
      version:
      
      https://www.amd.com/system/files/TechDocs/56176_ppr_Family_17h_Model_71h_B0_pub_Rev_3.06.zip
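
      As a rough sketch (not the kernel's actual enable path; the MSR
      addresses and register values are taken from the example above, and
      wrmsrl() is the usual kernel MSR-write helper), a programming
      sequence honoring the odd-before-even ordering described above might
      look like:

          #include <asm/msr.h>

          #define PERF_CTL0  0xc0010200  /* even counter control */
          #define PERF_CTR0  0xc0010201  /* even counter count */
          #define PERF_CTL1  0xc0010202  /* odd counter control */
          #define PERF_CTR1  0xc0010203  /* odd counter count */

          static void program_flops_pair(void)
          {
                  /* Odd (Merge) counter first: holds the upper 16 bits */
                  wrmsrl(PERF_CTR1, 0x0ULL);
                  wrmsrl(PERF_CTL1, 0x0000000f004000ffULL); /* Merge event 0xFFF + enable */

                  /* Then the even counter: event select + low 48 bits */
                  wrmsrl(PERF_CTR0, 0x0000800000000001ULL);
                  wrmsrl(PERF_CTL0, 0x000000000053ff03ULL); /* FLOPs event, umask 0xff */
          }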
      
      Description of software operation
      ---------------------------------
      
      The following steps are taken in order to support reserving and
      enabling the extra counter for Large Increment per Cycle events:
      
      1. In the main x86 scheduler, we reduce the number of available
      counters by the number of Large Increment per Cycle events being
      scheduled, tracked by a new cpuc variable 'n_pair' and a new
      amd_put_event_constraints_f17h().  This improves the counter
      scheduler success rate.
      
      2. In perf_assign_events(), if a counter is assigned to a Large
      Increment event, we increment the current counter variable, so the
      counter used for the Merge event is removed from assignment
      consideration by upcoming event assignments.
      
      3. In find_counter(), if a counter has been found for the Large
      Increment event, we set the next counter as used, to prevent other
      events from using it.
      
      4. We perform steps 2 & 3 also in the x86 scheduler fastpath, i.e.,
      we add Merge event accounting to the existing used_mask logic.
      
      5. Finally, we add on the programming of Merge event to the
      neighbouring PMC counters in the counter enable/disable{_all}
      code paths.
      
      Currently, software does not support a single PMU with mixed 48- and
      64-bit counting, so Large increment event counts are limited to 48
      bits.  In set_period, we zero-out the upper 16 bits of the count, so
      the hardware doesn't copy them to the even counter's higher bits.
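
      A minimal sketch of that set_period handling, assuming hwc is the
      event's struct hw_perf_event and is_counter_pair() is a helper that
      tests whether the event occupies a counter pair:

          /*
           * Large Increment event counts are limited to 48 bits, so
           * clear the odd counter's low 16 bits (the upper 16 bits of
           * the merged count) to keep stale high bits from being
           * copied into the even counter's higher bits.
           */
          if (is_counter_pair(hwc))
                  wrmsrl(x86_pmu_event_addr(hwc->idx + 1), 0);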
      
      Simple invocation example showing counting 8 FLOPs per 256-bit/%ymm
      vaddps instruction executed in a loop 100 million times:
      
      perf stat -e cpu/fp_ret_sse_avx_ops.all/,cpu/instructions/ <workload>
      
       Performance counter stats for '<workload>':
      
             800,000,000      cpu/fp_ret_sse_avx_ops.all/u
             300,042,101      cpu/instructions/u
      
      Prior to this patch, the reported SSE/AVX FLOPs retired count would
      be wrong.
      
      [peterz: lots of renames and edits to the code]
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
      Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
    • perf/x86/amd: Constrain Large Increment per Cycle events · e29ecc26
      Committed by Kim Phillips
      fix #29035100
      
      commit 471af006a747f1c535c8a8c6c0973c320fe01b22 upstream
      
      AMD Family 17h processors and above gain support for Large Increment
      per Cycle events.  Unfortunately there is no CPUID or equivalent bit
      that indicates whether the feature exists or not, so we continue to
      determine eligibility based on a CPU family number comparison.
      
      For Large Increment per Cycle events, we add an f17h-and-compatibles
      get_event_constraints_f17h() that returns an even counter bitmask:
      Large Increment per Cycle events can only be placed on PMCs 0, 2,
      and 4 out of the currently available 0-5.  The only currently
      public event that requires this feature to report valid counts
      is PMCx003 "Retired SSE/AVX Operations".
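
      A minimal sketch of such a constraint, assuming 6 counters so the
      even-counter bitmask is 0x15 (0b010101, i.e. PMCs 0, 2 and 4), and
      with amd_is_pair_event_code() as an assumed predicate:

          static struct event_constraint pair_constraint =
                  EVENT_CONSTRAINT(0, 0x15, 0);

          static struct event_constraint *
          amd_get_event_constraints_f17h(struct cpu_hw_events *cpuc, int idx,
                                         struct perf_event *event)
          {
                  /* pair events may only start on an even counter */
                  if (amd_is_pair_event_code(&event->hw))
                          return &pair_constraint;
                  return &unconstrained;
          }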
      
      Note that the CPU family logic in amd_core_pmu_init() is changed
      so as to be able to selectively add initialization for features
      available in ranges of backward-compatible CPU families.  This
      Large Increment per Cycle feature is expected to be retained
      in future families.
      
      A side-effect of assigning a new get_constraints function for f17h is
      that the old (prior to f15h) amd_get_event_constraints implementation,
      left enabled by commit e40ed154 ("perf/x86: Add perf support for AMD
      family-17h processors"), is no longer called.  That is no longer
      necessary anyway, since those North Bridge event codes are obsolete.
      
      Also fix a spelling mistake whilst in the area (calulating ->
      calculating).
      
      Fixes: e40ed154 ("perf/x86: Add perf support for AMD family-17h processors")
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20191114183720.19887-2-kim.phillips@amd.com
      Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
      Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
  2. 17 Jan, 2020 (1 commit)
  3. 06 Nov, 2019 (1 commit)
    • perf/x86/amd: Change/fix NMI latency mitigation to use a timestamp · a1112c46
      Committed by Tom Lendacky
      [ Upstream commit df4d29732fdad43a51284f826bec3e6ded177540 ]
      
      It turns out that the NMI latency workaround from commit:
      
        6d3edaae16c6 ("x86/perf/amd: Resolve NMI latency issues for active PMCs")
      
      ends up being too conservative and results in the perf NMI handler claiming
      NMIs too easily on AMD hardware when the NMI watchdog is active.
      
      This has an impact, for example, on the hpwdt (HPE watchdog timer) module.
      This module can produce an NMI that is used to reset the system. It
      registers an NMI handler for the NMI_UNKNOWN type and relies on the fact
      that nothing has claimed an NMI so that its handler will be invoked when
      the watchdog device produces an NMI. After the referenced commit, the
      hpwdt module is unable to process its generated NMI if the NMI watchdog is
      active, because the current NMI latency mitigation results in the NMI
      being claimed by the perf NMI handler.
      
      Update the AMD perf NMI latency mitigation workaround to, instead, use a
      window of time. Whenever a PMC is handled in the perf NMI handler, set a
      timestamp which will act as a perf NMI window. Any NMIs arriving within
      that window will be claimed by perf. Anything outside that window will
      not be claimed by perf. The value for the NMI window is set to 100 msecs.
      This is a conservative value that easily covers any NMI latency in the
      hardware. While this still results in a window in which the hpwdt module
      will not receive its NMI, the window is now much, much smaller.
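
      A minimal sketch of the described mitigation (perf_nmi_tstamp and
      perf_nmi_window are illustrative names; the window would be set to
      msecs_to_jiffies(100) at init time):

          static DEFINE_PER_CPU(unsigned long, perf_nmi_tstamp);
          static unsigned long perf_nmi_window;

          /* whenever a PMC is handled in the perf NMI handler: */
          this_cpu_write(perf_nmi_tstamp, jiffies + perf_nmi_window);

          /* when no overflowed PMC is found: */
          if (time_after(jiffies, this_cpu_read(perf_nmi_tstamp)))
                  return NMI_DONE;        /* outside the window: not ours */
          return NMI_HANDLED;             /* inside the window: claim it */
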
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jerry Hoemann <jerry.hoemann@hpe.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 6d3edaae16c6 ("x86/perf/amd: Resolve NMI latency issues for active PMCs")
      Link: https://lkml.kernel.org/r/Message-ID:
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  4. 08 May, 2019 (1 commit)
    • perf/x86/amd: Update generic hardware cache events for Family 17h · 3f8497cf
      Committed by Kim Phillips
      commit 0e3b74e26280f2cf8753717a950b97d424da6046 upstream.
      
      Add a new amd_hw_cache_event_ids_f17h assignment structure set
      for AMD families 17h and above, since a lot has changed.  Specifically:
      
      L1 Data Cache
      
      The data cache access counter remains the same on Family 17h.
      
      For DC misses, PMCx041's definition changes with Family 17h,
      so instead we use the L2 cache accesses from L1 data cache
      misses counter (PMCx060,umask=0xc8).
      
      For DC hardware prefetch events, Family 17h breaks compatibility
      for PMCx067 "Data Prefetcher", so instead, we use PMCx05a "Hardware
      Prefetch DC Fills."
      
      L1 Instruction Cache
      
      PMCs 0x80 and 0x81 (32-byte IC fetches and misses) are backward
      compatible on Family 17h.
      
      For prefetches, we remove the erroneous PMCx04B assignment which
      counts how many software data cache prefetch load instructions were
      dispatched.
      
      LL - Last Level Cache
      
      Removing PMCs 7D, 7E, and 7F assignments, as they do not exist
      on Family 17h, where the last level cache is L3.  L3 counters
      can be accessed using the existing AMD Uncore driver.
      
      Data TLB
      
      On Intel machines, data TLB accesses ("dTLB-loads") are assigned
      to counters that count load/store instructions retired.  This
      is inconsistent with instruction TLB accesses, where Intel
      implementations report iTLB misses that hit in the STLB.
      
      Ideally, dTLB-loads would count higher level dTLB misses that hit
      in lower level TLBs, and dTLB-load-misses would report those
      that also missed in those lower-level TLBs, therefore causing
      a page table walk.  That would be consistent with instruction
      TLB operation, remove the redundancy between dTLB-loads and
      L1-dcache-loads, and prevent perf from producing artificially
      low percentage ratios, i.e. the "0.01%" below:
      
              42,550,869      L1-dcache-loads
              41,591,860      dTLB-loads
                   4,802      dTLB-load-misses          #    0.01% of all dTLB cache hits
               7,283,682      L1-dcache-stores
               7,912,392      dTLB-stores
                     310      dTLB-store-misses
      
      On AMD Families prior to 17h, the "Data Cache Accesses" counter is
      used, which is slightly better than load/store instructions retired,
      but still counts in terms of individual load/store operations
      instead of TLB operations.
      
      So, for AMD Families 17h and higher, this patch assigns "dTLB-loads"
      to a counter for L1 dTLB misses that hit in the L2 dTLB, and
      "dTLB-load-misses" to a counter for L1 DTLB misses that caused
      L2 DTLB misses and therefore also caused page table walks.  This
      results in a much more accurate view of data TLB performance:
      
              60,961,781      L1-dcache-loads
                   4,601      dTLB-loads
                     963      dTLB-load-misses          #   20.93% of all dTLB cache hits
      
      Note that for all AMD families, data loads and stores are combined
      in a single accesses counter, so no 'L1-dcache-stores' are reported
      separately, and stores are counted with loads in 'L1-dcache-loads'.
      
      Also note that the "% of all dTLB cache hits" string is misleading
      because (a) "dTLB cache": although TLBs can be considered caches for
      page tables, in this context, it can be misinterpreted as data cache
      hits because the figures are similar (at least on Intel), and (b) not
      all those loads (technically accesses) technically "hit" at that
      hardware level.  "% of all dTLB accesses" would be more clear/accurate.
      
      Instruction TLB
      
      On Intel machines, 'iTLB-loads' measure iTLB misses that hit in the
      STLB, and 'iTLB-load-misses' measure iTLB misses that also missed in
      the STLB and completed a page table walk.
      
      For AMD Family 17h and above, for 'iTLB-loads' we replace the
      erroneous instruction cache fetches counter with PMCx084
      "L1 ITLB Miss, L2 ITLB Hit".
      
      For 'iTLB-load-misses' we still use PMCx085 "L1 ITLB Miss,
      L2 ITLB Miss", but set a 0xff umask because without it the event
      does not get counted.
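
      As a sketch, the corresponding entries in the new
      amd_hw_cache_event_ids_f17h table might look as follows (encoding
      assumed to be umask in bits 15:8 and event number in bits 7:0, as in
      the events quoted above):

          [C(L1D)][C(OP_READ)][C(RESULT_MISS)]    = 0xc860, /* L2$ access from DC miss */
          [C(ITLB)][C(OP_READ)][C(RESULT_ACCESS)] = 0x0084, /* L1 ITLB miss, L2 ITLB hit */
          [C(ITLB)][C(OP_READ)][C(RESULT_MISS)]   = 0xff85, /* L1 ITLB miss, L2 ITLB miss */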
      
      Branch Predictor (BPU)
      
      PMCs 0xc2 and 0xc3 continue to be valid across all AMD Families.
      
      Node Level Events
      
      Family 17h does not have a PMCx0e9 counter, and corresponding counters
      have not been made available publicly, so for now, we mark them as
      unsupported for Families 17h and above.
      
      Reference:
      
        "Open-Source Register Reference For AMD Family 17h Processors Models 00h-2Fh"
        Released 7/17/2018, Publication #56255, Revision 3.03:
        https://www.amd.com/system/files/TechDocs/56255_OSRR.pdf
      
      [ mingo: tidied up the line breaks. ]
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Cc: <stable@vger.kernel.org> # v4.9+
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Liška <mliska@suse.cz>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-perf-users@vger.kernel.org
      Fixes: e40ed154 ("perf/x86: Add perf support for AMD family-17h processors")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 27 Apr, 2019 (1 commit)
    • perf/x86/amd: Add event map for AMD Family 17h · f45829e6
      Committed by Kim Phillips
      commit 3fe3331bb285700ab2253dbb07f8e478fcea2f1b upstream.
      
      Family 17h differs from prior families in that it:
      
       - does not support an L2 cache miss event
       - has re-enumerated PMC counters for:
         - L2 cache references
         - front & back end stalled cycles
      
      So we add a new amd_f17h_perfmon_event_map[] so that the generic
      perf event names will resolve to the correct h/w events on
      family 17h and above processors.
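
      A sketch of the new map's shape (the re-enumerated event encodings
      shown here are illustrative and should be checked against the PPR
      referenced below):

          static const u64 amd_f17h_perfmon_event_map[PERF_COUNT_HW_MAX] =
          {
                  [PERF_COUNT_HW_CPU_CYCLES]              = 0x0076,
                  [PERF_COUNT_HW_INSTRUCTIONS]            = 0x00c0,
                  [PERF_COUNT_HW_CACHE_REFERENCES]        = 0xff60, /* re-enumerated */
                  [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = 0x0287, /* re-enumerated */
                  [PERF_COUNT_HW_STALLED_CYCLES_BACKEND]  = 0x0387, /* re-enumerated */
                  /* no L2 cache miss entry on family 17h */
          };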
      
      Reference sections 2.1.13.3.3 (stalls) and 2.1.13.3.6 (L2):
      
  https://www.amd.com/system/files/TechDocs/54945_PPR_Family_17h_Models_00h-0Fh.pdf
      
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Cc: <stable@vger.kernel.org> # v4.9+
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Liška <mliska@suse.cz>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Fixes: e40ed154 ("perf/x86: Add perf support for AMD family-17h processors")
      [ Improved the formatting a bit. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 17 Apr, 2019 (3 commits)
    • x86/perf/amd: Remove need to check "running" bit in NMI handler · 05626465
      Committed by Thomas Lendacky
      commit 3966c3feca3fd10b2935caa0b4a08c7dd59469e5 upstream.
      
      Spurious interrupt support was added to perf in the following commit, almost
      a decade ago:
      
        63e6be6d ("perf, x86: Catch spurious interrupts after disabling counters")
      
      The two previous patches (resolving the race condition when disabling a
      PMC and NMI latency mitigation) allow for the removal of this older
      spurious interrupt support.
      
      Currently in x86_pmu_stop(), the bit for the PMC in the active_mask bitmap
      is cleared before disabling the PMC, which sets up a race condition. This
      race condition was mitigated by introducing the running bitmap. That race
      condition can be eliminated by first disabling the PMC, waiting for PMC
      reset on overflow and then clearing the bit for the PMC in the active_mask
      bitmap. The NMI handler will not re-enable a disabled counter.
      
      If x86_pmu_stop() is called from the perf NMI handler, the NMI latency
      mitigation support will guard against any unhandled NMI messages.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org> # 4.14.x-
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/Message-ID:
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/perf/amd: Resolve NMI latency issues for active PMCs · 23d39b0a
      Committed by Thomas Lendacky
      commit 6d3edaae16c6c7d238360f2841212c2b26774d5e upstream.
      
      On AMD processors, the detection of an overflowed PMC counter in the NMI
      handler relies on the current value of the PMC. So, for example, to check
      for overflow on a 48-bit counter, bit 47 is checked to see if it is 1 (not
      overflowed) or 0 (overflowed).
      
      When the perf NMI handler executes it does not know in advance which PMC
      counters have overflowed. As such, the NMI handler will process all active
      PMC counters that have overflowed. NMI latency in newer AMD processors can
      result in multiple overflowed PMC counters being processed in one NMI and
      then a subsequent NMI, that does not appear to be a back-to-back NMI, not
      finding any PMC counters that have overflowed. This may appear to be an
      unhandled NMI resulting in either a panic or a series of messages,
      depending on how the kernel was configured.
      
      To mitigate this issue, add an AMD handle_irq callback function,
      amd_pmu_handle_irq(), that will invoke the common x86_pmu_handle_irq()
      function and upon return perform some additional processing that will
      indicate if the NMI has been handled or would have been handled had an
      earlier NMI not handled the overflowed PMC. Using a per-CPU variable,
      whenever a PMC is active, the smaller of the number of active PMCs
      and 2 is recorded; this indicates the possible number of NMIs that
      can still occur. The value of 2 covers the case where an NMI does not
      arrive at the LAPIC in time to be collapsed into an already pending
      NMI. Each time the function is called without having handled an
      overflowed counter, the per-CPU value is checked. If the value is
      non-zero, it is decremented and the handler reports the NMI as
      handled. If the value is zero, the handler reports the NMI as not
      handled.
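
      A minimal sketch of the described callback (perf_nmi_counter is an
      illustrative name for the per-CPU variable):

          static DEFINE_PER_CPU(unsigned int, perf_nmi_counter);

          static int amd_pmu_handle_irq(struct pt_regs *regs)
          {
                  struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
                  int active, handled;

                  active = bitmap_weight(cpuc->active_mask, X86_PMC_IDX_MAX);
                  handled = x86_pmu_handle_irq(regs);

                  if (handled) {
                          /* allow for up to min(active, 2) latent NMIs */
                          this_cpu_write(perf_nmi_counter, min(active, 2));
                          return handled;
                  }

                  /* nothing overflowed: absorb one expected latent NMI */
                  if (!this_cpu_read(perf_nmi_counter))
                          return NMI_DONE;

                  this_cpu_dec(perf_nmi_counter);
                  return NMI_HANDLED;
          }
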
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org> # 4.14.x-
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/Message-ID:
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/perf/amd: Resolve race condition when disabling PMC · e5a791b4
      Committed by Thomas Lendacky
      commit 914123fa39042e651d79eaf86bbf63a1b938dddf upstream.
      
      On AMD processors, the detection of an overflowed counter in the NMI
      handler relies on the current value of the counter. So, for example, to
      check for overflow on a 48 bit counter, bit 47 is checked to see if it
      is 1 (not overflowed) or 0 (overflowed).
      
      There is currently a race condition present when disabling and then
      updating the PMC. Increased NMI latency in newer AMD processors makes this
      race condition more pronounced. If the counter value has overflowed, it is
      possible to update the PMC value before the NMI handler can run. The
      updated PMC value is not an overflowed value, so when the perf NMI handler
      does run, it will not find an overflowed counter. This may appear as an
      unknown NMI resulting in either a panic or a series of messages, depending
      on how the kernel is configured.
      
      To eliminate this race condition, the PMC value must be checked after
      disabling the counter. Add an AMD function, amd_pmu_disable_all(), that
      will wait for the NMI handler to reset any active and overflowed counter
      after calling x86_pmu_disable_all().
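
      A minimal sketch of that wait (the retry bound and udelay()
      granularity are illustrative; bit 47 being set means the 48-bit
      counter has not overflowed):

          static void amd_pmu_wait_on_overflow(int idx)
          {
                  unsigned int i;
                  u64 counter;

                  /* give the NMI handler up to ~50us to reset the counter */
                  for (i = 0; i < 50; i++) {
                          rdmsrl(x86_pmu_event_addr(idx), counter);
                          if (counter & (1ULL << 47))
                                  break;  /* not (or no longer) overflowed */
                          udelay(1);
                  }
          }
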
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org> # 4.14.x-
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/Message-ID:
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  7. 01 Mar, 2017 (1 commit)
  8. 18 Nov, 2016 (1 commit)
  9. 16 Sep, 2016 (1 commit)
  10. 14 Jul, 2016 (1 commit)
  11. 28 Apr, 2016 (1 commit)
    • perf/x86/amd: Set the size of event map array to PERF_COUNT_HW_MAX · 0a25556f
      Committed by Adam Borowski
      The entry for PERF_COUNT_HW_REF_CPU_CYCLES is not used on AMD, but is
      referenced by filter_events() which expects undefined events to have a
      value of 0.
      
      Found via UBSAN:
      
        UBSAN: Undefined behaviour in arch/x86/events/amd/core.c:132:30
        index 9 is out of range for type 'u64 [9]'
        UBSAN: Undefined behaviour in arch/x86/events/amd/core.c:132:9
        load of address ffffffff81c021c8 with insufficient space for an object of type 'const u64'
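
      A minimal sketch of the fix: sizing the map with PERF_COUNT_HW_MAX
      zero-fills the entries AMD does not define (such as index 9,
      PERF_COUNT_HW_REF_CPU_CYCLES), so filter_events() reads 0 instead of
      indexing past the end of a u64[9] array (entries shown are
      illustrative):

          static const u64 amd_perfmon_event_map[PERF_COUNT_HW_MAX] =
          {
                  [PERF_COUNT_HW_CPU_CYCLES]   = 0x0076,
                  [PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0,
                  /* remaining defined events elided; the rest stay 0 */
          };
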
      Signed-off-by: Adam Borowski <kilobyte@angband.pl>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1461749731-30979-1-git-send-email-kilobyte@angband.pl
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 29 Mar, 2016 (1 commit)
  13. 17 Feb, 2016 (1 commit)
  14. 09 Feb, 2016 (1 commit)
  15. 06 Jan, 2016 (1 commit)
  16. 19 Dec, 2015 (1 commit)
  17. 02 Apr, 2015 (2 commits)
  18. 27 Aug, 2014 (1 commit)
    • x86: Replace __get_cpu_var uses · 89cbc767
      Committed by Christoph Lameter
      __get_cpu_var() is used for multiple purposes in the kernel source. One of
      them is address calculation via the form &__get_cpu_var(x).  This calculates
      the address for the instance of the percpu variable of the current processor
      based on an offset.
      
      Other use cases are for storing and retrieving data from the current
      processor's percpu area.  __get_cpu_var() can be used as an lvalue
      when writing data or on the right side of an assignment.
      
      __get_cpu_var() is defined as :
      
      #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
      
      __get_cpu_var() only ever does an address determination. However,
      store and retrieve operations could use a segment prefix (or a global
      register on other platforms) to avoid the address calculation.
      
      this_cpu_write() and this_cpu_read() can directly take an offset into a
      percpu area and use optimized assembly code to read and write per cpu
      variables.
      
      This patch converts __get_cpu_var into either an explicit address
      calculation using this_cpu_ptr() or into a use of this_cpu operations that
      use the offset.  Thereby address calculations are avoided and fewer
      registers are used when code is generated.
      
      Transformations done to __get_cpu_var()
      
      1. Determine the address of the percpu instance of the current processor.
      
      	DEFINE_PER_CPU(int, y);
      	int *x = &__get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(&y);
      
      2. Same as #1 but this time an array structure is involved.
      
      	DEFINE_PER_CPU(int, y[20]);
      	int *x = __get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(y);
      
      3. Retrieve the content of the current processor's instance of a per cpu
      variable.
      
      	DEFINE_PER_CPU(int, y);
      	int x = __get_cpu_var(y)
      
         Converts to
      
      	int x = __this_cpu_read(y);
      
      4. Retrieve the content of a percpu struct
      
      	DEFINE_PER_CPU(struct mystruct, y);
      	struct mystruct x = __get_cpu_var(y);
      
         Converts to
      
      	memcpy(&x, this_cpu_ptr(&y), sizeof(x));
      
      5. Assignment to a per cpu variable
      
      	DEFINE_PER_CPU(int, y)
      	__get_cpu_var(y) = x;
      
         Converts to
      
      	__this_cpu_write(y, x);
      
      6. Increment/Decrement etc of a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y)++
      
         Converts to
      
      	__this_cpu_inc(y)
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86@kernel.org
      Acked-by: H. Peter Anvin <hpa@linux.intel.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  19. 02 Sep, 2013 (1 commit)
  20. 28 May, 2013 (1 commit)
  21. 21 Apr, 2013 (1 commit)
  22. 16 Feb, 2013 (1 commit)
  23. 07 Feb, 2013 (5 commits)
  24. 24 Oct, 2012 (1 commit)
  25. 06 Jul, 2012 (1 commit)
  26. 18 May, 2012 (1 commit)
  27. 09 May, 2012 (1 commit)
    • perf/x86-ibs: Precise event sampling with IBS for AMD CPUs · 450bbd49
      Committed by Robert Richter
      This patch adds support for precise event sampling with IBS. There are
      two counting modes to count either cycles or micro-ops. If the
      corresponding performance counter events (hw events) are setup with
      the precise flag set, the request is redirected to the ibs pmu:
      
       perf record -a -e cpu-cycles:p ...    # use ibs op counting cycle count
       perf record -a -e r076:p ...          # same as -e cpu-cycles:p
       perf record -a -e r0C1:p ...          # use ibs op counting micro-ops
      
      Each ibs sample contains a linear address that points to the
      instruction that was causing the sample to trigger. With ibs we have
      skid 0. Thus, ibs supports precise levels 1 and 2. Samples are marked
      with the PERF_EFLAGS_EXACT flag set. In rare cases the rip is invalid
      when IBS was not able to record the rip correctly. Then the
      PERF_EFLAGS_EXACT flag is cleared and the rip is taken from pt_regs.
      
      V2:
      * don't drop samples in precise level 2 if rip is invalid, instead
        support the PERF_EFLAGS_EXACT flag
      Signed-off-by: Robert Richter <robert.richter@amd.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20120502103309.GP18810@erda.amd.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  28. 26 Apr, 2012 (1 commit)
  29. 17 Mar, 2012 (1 commit)
  30. 05 Mar, 2012 (1 commit)
  31. 02 Mar, 2012 (1 commit)
  32. 06 Dec, 2011 (1 commit)
    • perf, x86: Fix event scheduler for constraints with overlapping counters · bc1738f6
      Committed by Robert Richter
      The current x86 event scheduler fails to resolve scheduling problems
      of certain combinations of events and constraints. This happens if the
      counter mask of such an event is not a subset of any other counter
      mask of a constraint with an equal or higher weight, e.g. constraints
      of the AMD family 15h pmu:
      
                              counter mask    weight
      
       amd_f15_PMC30          0x09            2  <--- overlapping counters
       amd_f15_PMC20          0x07            3
       amd_f15_PMC53          0x38            3
      
      The scheduler then fails to find a valid assignment, even though one
      exists. Here is an example:
      
       event code     counter         failure         possible solution
      
       0x02E          PMC[3,0]        0               3
       0x043          PMC[2:0]        1               0
       0x045          PMC[2:0]        2               1
       0x046          PMC[2:0]        FAIL            2
      
      The event scheduler may not select the correct counter in the first
      cycle because it would need to know which subsequent events will be
      scheduled; it may then fail to schedule the events at all.
      
      To solve this, we now save the scheduler state of events with
      overlapping counter constraints.  If we fail to schedule the events,
      we roll back to those states and try to use another free counter.
      
      Constraints with overlapping counters are marked with a newly
      introduced overlap flag. We set the overlap flag for such constraints
      to give the scheduler a hint which events to select for counter
      rescheduling. The EVENT_CONSTRAINT_OVERLAP() macro can be used for
      this.
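
      For example, the family 15h PMC[3,0] constraint from the table above
      (counter mask 0x09) would be marked roughly like this (the
      event-select mask argument is illustrative):

          static struct event_constraint amd_f15_PMC30 =
                  EVENT_CONSTRAINT_OVERLAP(0, 0x09, AMD64_EVENTSEL_EVENT);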
      
      Care must be taken, as the rescheduling algorithm is O(n!), which
      will increase scheduling cycles dramatically on an over-committed
      system. The number of such EVENT_CONSTRAINT_OVERLAP() macros and
      their counter masks must be kept to a minimum. Thus, the current
      stack is limited to 2 states, to limit the number of loops the
      algorithm takes in the worst case.
      
      On systems with no overlapping-counter constraints, this
      implementation does not increase the loop count compared to the
      previous algorithm.
      
      V2:
      * Renamed redo -> overlap.
      * Reimplementation using perf scheduling helper functions.
      
      V3:
      * Added WARN_ON_ONCE() if out of save states.
      * Changed function interface of perf_sched_restore_state() to use bool
        as return value.
      Signed-off-by: Robert Richter <robert.richter@amd.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Stephane Eranian <eranian@google.com>
      Link: http://lkml.kernel.org/r/1321616122-1533-3-git-send-email-robert.richter@amd.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>