1. 14 November 2016 (1 commit)
    • powerpc/64s: Reduce exception alignment · f4329f2e
      Authored by Nicholas Piggin
      Exception handlers are aligned to 128 bytes (the L1 cache line size) on 64s,
      which is overkill. This alignment can reduce the icache footprint of any
      individual exception path, but taken as a whole, the expansion in icache
      footprint seems likely to be counter-productive and cause more total misses.
      
      Create IFETCH_ALIGN_SHIFT/BYTES, which should give optimal ifetch alignment
      with a much more modest alignment requirement. This saves 1792 bytes from
      head_64.o text with an allmodconfig build.
      
      Other subarchitectures should define appropriate IFETCH_ALIGN_SHIFT
      values if this becomes more widely used.
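      
      As a rough illustration of the approach, a minimal sketch of such definitions
      is shown below; the exact shift value, header placement and usage are
      assumptions for illustration, not taken from this log.
      
          /* e.g. in arch/powerpc/include/asm/cache.h (placement assumed) */
          #define IFETCH_ALIGN_SHIFT   4                         /* assumed ifetch block: 16 bytes */
          #define IFETCH_ALIGN_BYTES   (1 << IFETCH_ALIGN_SHIFT)
          
          /* exception entry code could then be aligned with, e.g.:  .align IFETCH_ALIGN_SHIFT */
      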
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f4329f2e
  2. 12 March 2016 (1 commit)
  3. 21 October 2015 (1 commit)
    • powerpc: Revert "Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8" · 23316316
      Authored by Paul Mackerras
      This reverts commit 9678cdaa ("Use the POWER8 Micro Partition
      Prefetch Engine in KVM HV on POWER8") because the original commit had
      multiple, partly self-cancelling bugs that could cause occasional
      memory corruption.
      
      In fact the logmpp instruction was incorrectly using register r0 as the
      source of the buffer address and operation code, and depending on what
      was in r0, it would either do nothing or corrupt the 64k page pointed to
      by r0.
      
      The logmpp instruction encoding and the operation code definitions could
      be corrected, but then there is the problem that there is no clearly
      defined way to know when the hardware has finished writing to the
      buffer.
      
      The original commit attempted to work around this by aborting the
      write-out before starting the prefetch, but this is ineffective in the
      case where the virtual core is now executing on a different physical
      core from the one where the write-out was initiated.
      
      These problems, plus advice from the hardware designers not to use the
      feature (since the measured performance improvement from using it was
      actually mostly negative), mean that reverting the code is the best
      option.
      
      Fixes: 9678cdaa ("Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8")
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      23316316
  4. 17 March 2015 (1 commit)
  5. 28 July 2014 (1 commit)
    • Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8 · 9678cdaa
      Authored by Stewart Smith
      The POWER8 processor has a Micro Partition Prefetch Engine, which is
      a fancy way of saying "it has a way to store and load the contents of the
      L2, or the L2 plus the MRU way of the L3 cache". We initiate the storing of
      the log (a list of addresses) using the logmpp instruction and start the
      restore by writing to an SPR.
      
      The logmpp instruction takes its parameters in a single 64-bit register:
      - starting address of the table in which to store the log of L2/L2+L3 cache contents
        - 32kB for L2
        - 128kB for L2+L3
        - Aligned relative to the maximum size of the table (32kB or 128kB)
      - Log control (no-op, L2 only, L2 and L3, abort logout)
      
      We should abort any ongoing logging before initiating a new one.
      
      To initiate the restore, we write to the MPPR SPR. The format of what to write
      to the SPR is similar to the logmpp instruction parameter (a rough sketch of
      both encodings follows this list):
      - starting address of the table to read from (same alignment requirements)
      - table size (no data, until end of table)
      - prefetch rate (from fastest possible to slower: about every 8, 16, 24 or
        32 cycles)
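      
      A compilable sketch of how these two 64-bit values might be assembled is shown
      below. The macro names, bit assignments and the SPRN_MPPR symbol are assumptions
      made for illustration; they are not taken from this log and may not match the
      actual patch.
      
          #include <stdint.h>
          
          /* Illustrative field layout: aligned buffer address plus low control bits. */
          #define MPP_ADDR_MASK        (~0x1ffffULL)   /* assumes alignment to the 128kB maximum table size */
          #define MPP_LOG_ABORT        0x3ULL          /* abort any logout in progress (assumed encoding) */
          #define MPP_LOG_L2           0x2ULL          /* log L2 contents only (assumed encoding) */
          #define MPP_PREFETCH_WHOLE   0x1ULL          /* MPPR: prefetch until end of table (assumed encoding) */
          
          /* Build the single 64-bit operand passed to logmpp, or written to the MPPR SPR. */
          static inline uint64_t mpp_encode(uint64_t table_phys, uint64_t ctrl)
          {
                  return (table_phys & MPP_ADDR_MASK) | ctrl;
          }
          
          /*
           * Intended usage in the hypervisor (shown as comments, not real calls here):
           *   logmpp(mpp_encode(buf, MPP_LOG_ABORT));                 abort any previous logout
           *   logmpp(mpp_encode(buf, MPP_LOG_L2));                    start logging L2 to the buffer
           *   mtspr(SPRN_MPPR, mpp_encode(buf, MPP_PREFETCH_WHOLE));  start the restore
           */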
      
      The idea behind loading and storing the contents of L2/L3 cache is to
      reduce memory latency in a system that is frequently swapping vcores on
      a physical CPU.
      
      The best-case scenario for doing this is when some vcores are running very
      cache-heavy workloads. The worst case is when they get essentially no cache
      hits, so we just generate needless memory operations.
      
      This implementation just does L2 store/load. In my benchmarks this proves
      to be useful.
      
      Benchmark 1:
       - 16-core POWER8
       - 3x Ubuntu 14.04 LTS guests (LE) with 8 VCPUs each
       - No split core/SMT
       - two guests running sysbench memory test.
         sysbench --test=memory --num-threads=8 run
       - one guest running apache bench (of default HTML page)
         ab -n 490000 -c 400 http://localhost/
      
      This benchmark aims to measure the performance of a real-world application
      (apache) while the other guests are cache-hot with their own workloads. The
      sysbench memory benchmark does pointer-sized writes to a (small) memory
      buffer in a loop.
      
      In this benchmark, with this patch, I can see an improvement both in requests
      per second (~5%) and in mean and median response times (again, about 5%).
      The spread of minimum and maximum response times was largely unchanged.
      
      Benchmark 2:
       - Same VM config as benchmark 1
       - all three guests running sysbench memory benchmark
      
      This benchmark aims to see whether this patch has a positive or negative effect
      on a cache-heavy workload. Due to the nature of the benchmark (stores), we may
      not see a difference in raw performance, but rather, hopefully, an improvement
      in the consistency of performance (when a vcore is switched back in, it does
      not have to wait as often for cachelines to be pulled in).
      
      The results of this benchmark are improvements in consistency of performance
      rather than performance itself. With this patch, the few outliers in duration
      go away and we get more consistent performance in each guest.
      
      Benchmark 3:
       - same 3 guests and CPU configuration as benchmark 1 and 2.
       - two idle guests
       - 1 guest running STREAM benchmark
      
      This scenario also saw a performance improvement with this patch. On the Copy
      and Scale workloads from STREAM, I got a 5-6% improvement with this patch; for
      Add and Triad, it was around 10% (or more).
      
      Benchmark 4:
       - same 3 guests as the previous benchmarks
       - two guests running sysbench --memory, a distinctly different cache-heavy
         workload
       - one guest running STREAM benchmark.
      
      Similar improvements to benchmark 3.
      
      Benchmark 5:
       - 1 guest, 8 VCPUs, Ubuntu 14.04
       - Host configured with split core (SMT8, subcores-per-core=4)
       - STREAM benchmark
      
      In this benchmark we see a 10-20% performance improvement across the board
      in the STREAM benchmark results with this patch.
      
      Based on preliminary investigation and microbenchmarks
      by Prerna Saxena <prerna@linux.vnet.ibm.com>
      Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      9678cdaa
  6. 02 December 2013 (1 commit)
  7. 29 March 2012 (1 commit)
  8. 05 May 2010 (1 commit)
  9. 03 March 2010 (1 commit)
  10. 04 August 2008 (1 commit)
  11. 19 June 2008 (1 commit)
    • powerpc/booke: Add support for new e500mc core · 3dfa8773
      Authored by Kumar Gala
      The new e500mc core from Freescale is based on the e500v2 but with the
      following changes:
      
      * Supports only the Enhanced Debug Architecture (DSRR0/1, etc)
      * Floating Point
      * No SPE
      * Supports lwsync
      * Doorbell Exceptions
      * Hypervisor
      * Cache line size is now 64 bytes (e500v1/v2 have a 32-byte cache line); a
        sketch of how this might look in the cache header follows below
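      
      A minimal sketch of how the larger cache line might be expressed in the
      kernel's cache header is shown here; the CONFIG_PPC_E500MC symbol and the
      exact surrounding structure are assumptions for illustration.
      
          /* e.g. in arch/powerpc/include/asm/cache.h (structure assumed) */
          #ifdef CONFIG_PPC_E500MC
          #define L1_CACHE_SHIFT      6               /* 64-byte cache lines on e500mc */
          #else
          #define L1_CACHE_SHIFT      5               /* 32-byte cache lines on e500v1/v2 */
          #endif
          #define L1_CACHE_BYTES      (1 << L1_CACHE_SHIFT)
      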
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
      3dfa8773
  12. 10 July 2007 (1 commit)
  13. 26 April 2006 (1 commit)
  14. 09 January 2006 (1 commit)
  15. 10 November 2005 (1 commit)
    • [PATCH] powerpc: Merge cacheflush.h and cache.h · 26ef5c09
      Authored by David Gibson
      The ppc32 and ppc64 versions of cacheflush.h were almost identical.
      The two versions of cache.h are fairly similar, except for a bunch of
      register definitions in the ppc32 version which probably belong
      elsewhere.  This patch, therefore, merges both headers.  Notable
      points:
      	- There are several functions in cacheflush.h which exist only
      on ppc32 or only on ppc64.  These are handled by #ifdef for now, but
      they should probably be consolidated, along with the actual code
      behind them, later.
      	- Confusingly, both ppc32 and ppc64 have a
      flush_dcache_range(), but they're subtly different: it uses dcbf on
      ppc32 and dcbst on ppc64; ppc64 also has a flush_inval_dcache_range(),
      which uses dcbf.  These too should be merged and consolidated later.
      	- Also, flush_dcache_range() was defined in cacheflush.h on
      ppc64, and in cache.h on ppc32.  In the merged version it is in
      cacheflush.h.
      	- On ppc32, flush_icache_range() is a normal function from
      misc.S.  On ppc64, it was a wrapper, testing a feature bit before
      calling __flush_icache_range(), which does the actual flush.  This
      patch takes the ppc64 approach (see the sketch after this list), which
      amounts to no change on ppc32, since CPU_FTR_COHERENT_ICACHE will
      never be set there, but does mean renaming flush_icache_range() to
      __flush_icache_range() in arch/ppc/kernel/misc.S and
      arch/powerpc/kernel/misc_32.S.
      	- The PReP register info from asm-ppc/cache.h has moved to
      arch/ppc/platforms/prep_setup.c
      	- The 8xx register info from asm-ppc/cache.h has moved to a
      new asm-powerpc/reg_8xx.h, included from reg.h
      	- flush_dcache_all() was defined on ppc32 (only), but was
      never called (although it was exported).  Thus this patch removes it
      from cacheflush.h and from ARCH=powerpc (misc_32.S) entirely.  It's
      left in ARCH=ppc for now, with the prototype moved to ppc_ksyms.c.
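      
      A rough sketch of the wrapper approach described above is shown here; it
      assumes the kernel's cpu_has_feature() helper and the CPU_FTR_COHERENT_ICACHE
      bit mentioned in this log, and should be read as illustrative of the pattern
      rather than the exact merged header.
      
          /* __flush_icache_range() does the real work in assembly (misc.S). */
          extern void __flush_icache_range(unsigned long start, unsigned long stop);
          
          static inline void flush_icache_range(unsigned long start, unsigned long stop)
          {
                  /* On CPUs with a coherent icache there is nothing to flush. */
                  if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
                          __flush_icache_range(start, stop);
          }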
      
      Built for Walnut (ARCH=ppc), 32-bit multiplatform (pmac, CHRP and PReP
      ARCH=ppc, pmac and CHRP ARCH=powerpc).  Built and booted on POWER5
      LPAR (ARCH=powerpc and ARCH=ppc64).
      
      Built for 32-bit powermac (ARCH=ppc and ARCH=powerpc).  Built and
      booted on POWER5 LPAR (ARCH=powerpc and ARCH=ppc64).  Built and booted
      on G5 (ARCH=powerpc)
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      26ef5c09