1. 28 Jul, 2014 1 commit
    • Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8 · 9678cdaa
      Committed by Stewart Smith
      The POWER8 processor has a Micro Partition Prefetch Engine, which is
      a fancy way of saying "has a way to store and load contents of L2 or
      L2+MRU way of L3 cache". We initiate the storing of the log (list of
      addresses) using the logmpp instruction and start restore by writing
      to a SPR.
      
      The logmpp instruction takes parameters in a single 64-bit register:
      - starting address of the table to store log of L2/L2+L3 cache contents
        - 32kb for L2
        - 128kb for L2+L3
        - Aligned relative to maximum size of the table (32kb or 128kb)
      - Log control (no-op, L2 only, L2 and L3, abort logout)
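The alignment rule above can be sketched as follows. This is a minimal illustration, assuming that "aligned relative to maximum size of the table" means the table address is a multiple of the table size; the function and constant names are hypothetical and the real register bit layout is not modelled here.

```python
# Table sizes are taken from the commit message; all names are illustrative.
L2_LOG_SIZE = 32 * 1024        # table size for an L2-only log
L2_L3_LOG_SIZE = 128 * 1024    # table size for an L2+L3 log


def is_valid_log_table(addr: int, table_size: int) -> bool:
    """Return True if addr is aligned to table_size (assumed alignment rule)."""
    if table_size not in (L2_LOG_SIZE, L2_L3_LOG_SIZE):
        raise ValueError("table size must be 32 KB or 128 KB")
    return addr % table_size == 0


def align_up(addr: int, table_size: int) -> int:
    """Round addr up to the next multiple of table_size."""
    return (addr + table_size - 1) // table_size * table_size
```

A buffer that is valid for an L2 log is not necessarily valid for an L2+L3 log, so an allocator would pick the alignment from the larger of the sizes it intends to use.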
      
      We should abort any ongoing logging before initiating one.
      
      To initiate restore, we write to the MPPR SPR. The format of what to write
      to the SPR is similar to the logmpp instruction parameter:
      - starting address of the table to read from (same alignment requirements)
      - table size (no data, until end of table)
      - prefetch rate (from fastest possible down to slower rates: about every
        8, 16, 24 or 32 cycles)
      
      The idea behind loading and storing the contents of L2/L3 cache is to
      reduce memory latency in a system that is frequently swapping vcores on
      a physical CPU.
      
      The best case scenario for doing this is when some vcores are doing very
      cache heavy workloads. The worst case is when they have about 0 cache hits,
      so we just generate needless memory operations.
      
      This implementation just does L2 store/load. In my benchmarks this proves
      to be useful.
      
      Benchmark 1:
       - 16 core POWER8
       - 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each
       - No split core/SMT
       - two guests running sysbench memory test.
         sysbench --test=memory --num-threads=8 run
       - one guest running apache bench (of default HTML page)
         ab -n 490000 -c 400 http://localhost/
      
      This benchmark aims to measure performance of a real-world application (Apache)
      where other guests are cache hot with their own workloads. The sysbench memory
      benchmark does pointer sized writes to a (small) memory buffer in a loop.
      
      In this benchmark with this patch I can see an improvement both in requests
      per second (~5%) and in mean and median response times (again, about 5%).
      The spread of minimum and maximum response times was largely unchanged.
      
      Benchmark 2:
       - Same VM config as benchmark 1
       - all three guests running sysbench memory benchmark
      
      This benchmark aims to see whether there is a positive or negative effect on
      this cache-heavy benchmark. Due to the nature of the benchmark (stores), we
      may not see a difference in performance, but hopefully an improvement in
      consistency of performance (when a vcore is switched in, it doesn't have to
      wait many times for cachelines to be pulled in).
      
      The results of this benchmark are improvements in consistency of performance
      rather than performance itself. With this patch, the few outliers in duration
      go away and we get more consistent performance in each guest.
      
      Benchmark 3:
       - same 3 guests and CPU configuration as benchmark 1 and 2.
       - two idle guests
       - 1 guest running STREAM benchmark
      
      This scenario also saw performance improvement with this patch. On Copy and
      Scale workloads from STREAM, I got 5-6% improvement with this patch. For
      Add and Triad, it was around 10% (or more).
      
      Benchmark 4:
       - same 3 guests as previous benchmarks
       - two guests running sysbench --memory, distinctly different cache heavy
         workload
       - one guest running STREAM benchmark.
      
      Similar improvements to benchmark 3.
      
      Benchmark 5:
       - 1 guest, 8 VCPUs, Ubuntu 14.04
       - Host configured with split core (SMT8, subcores-per-core=4)
       - STREAM benchmark
      
      In this benchmark, we see a 10-20% performance improvement across the board
      of STREAM benchmark results with this patch.
      
      Based on preliminary investigation and microbenchmarks
      by Prerna Saxena <prerna@linux.vnet.ibm.com>
      Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
  2. 31 Oct, 2013 2 commits
  3. 17 Oct, 2013 1 commit
  4. 11 Oct, 2013 1 commit
  5. 31 Jul, 2013 1 commit
  6. 06 May, 2013 1 commit
  7. 26 Apr, 2013 1 commit
    • powerpc/perf: Add new BHRB related instructions for POWER8 · 95213959
      Committed by Anshuman Khandual
      This patch adds new POWER8 instruction encoding for reading
      and clearing Branch History Rolling Buffer entries. The new
      instruction 'mfbhrbe' (move from branch history rolling buffer
      entry) is used to read BHRB buffer entries and instruction
      'clrbhrb' (clear branch history rolling buffer) is used to
      clear the entire buffer. The instruction 'clrbhrb' has a
      straightforward encoding. The instruction for reading BHRB
      entries, however, has the form 'mfbhrbe RT, BHRBE': it takes two
      arguments, i.e. the index of the BHRB buffer entry to read and
      a general purpose register to receive the value read from that
      buffer entry.
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  8. 15 Feb, 2013 1 commit
    • powerpc: Add new instructions for transactional memory · 14c39a4c
      Committed by Michael Neuling
      Here we define the new instructions we need for transactional memory in the
      kernel.  This is so we can support compiling with binutils that don't support
      the new transactional memory instructions.
      
      Transactional memory results in two sets of architected state (GPRs/VSRs
      etc).
      
      treclaim allows us to read the checkpointed state (from the tbegin) so that we
      can store it away on a context switch.  It does this by overwriting the existing
      architected state, so you have to save that away before you treclaim.  treclaim
      will also abort a transaction, so you can give a register value which contains
      an abort reason.
      
      trecheckpoint allows us to inject into the checkpointed state as if it were at
      the tbegin.  It does this by copying the current architected state into the
      checkpointed state.
      Signed-off-by: Matt Evans <matt@ozlabs.org>
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  9. 10 Jan, 2013 1 commit
  10. 18 Nov, 2012 1 commit
  11. 15 Nov, 2012 2 commits
  12. 17 Sep, 2012 1 commit
  13. 10 Jul, 2012 9 commits
  14. 05 Mar, 2012 1 commit
    • KVM: PPC: Implement MMIO emulation support for Book3S HV guests · 697d3899
      Committed by Paul Mackerras
      This provides the low-level support for MMIO emulation in Book3S HV
      guests.  When the guest tries to map a page which is not covered by
      any memslot, that page is taken to be an MMIO emulation page.  Instead
      of inserting a valid HPTE, we insert an HPTE that has the valid bit
      clear but another hypervisor software-use bit set, which we call
      HPTE_V_ABSENT, to indicate that this is an absent page.  An
      absent page is treated much like a valid page as far as guest hcalls
      (H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
      an absent HPTE doesn't need to be invalidated with tlbie since it
      was never valid as far as the hardware is concerned.
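The distinction between a valid and an absent HPTE can be sketched with two flag bits. This is a toy model only: the bit values below are assumptions chosen for illustration, and the helper names are hypothetical rather than kernel functions.

```python
# Assumed bit values, for illustration only; the kernel's definitions may differ.
HPTE_V_VALID = 0x1     # hardware valid bit
HPTE_V_ABSENT = 0x20   # hypervisor software-use bit marking an absent page


def make_mmio_hpte(hpte_v: int) -> int:
    """Turn a would-be valid entry into an absent one: valid clear, absent set."""
    return (hpte_v & ~HPTE_V_VALID) | HPTE_V_ABSENT


def visible_to_hcalls(hpte_v: int) -> bool:
    """Guest hcalls treat an absent entry much like a valid one."""
    return bool(hpte_v & (HPTE_V_VALID | HPTE_V_ABSENT))


def needs_tlbie(hpte_v: int) -> bool:
    """Only entries the hardware could have seen as valid need invalidating."""
    return bool(hpte_v & HPTE_V_VALID)
```

Because the hardware never sees the valid bit set on an absent entry, any guest access through it faults, which is what lets the hypervisor intercept the access.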
      
      When the guest accesses a page for which there is an absent HPTE, it
      will take a hypervisor data storage interrupt (HDSI) since we now set
      the VPM1 bit in the LPCR.  Our HDSI handler for HPTE-not-present faults
      looks up the hash table and if it finds an absent HPTE mapping the
      requested virtual address, will switch to kernel mode and handle the
      fault in kvmppc_book3s_hv_page_fault(), which at present just calls
      kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
      
      This is based on an earlier patch by Benjamin Herrenschmidt, but has since
      been heavily reworked.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Avi Kivity <avi@redhat.com>
  15. 22 Jul, 2011 1 commit
    • net: filter: BPF 'JIT' compiler for PPC64 · 0ca87f05
      Committed by Matt Evans
      An implementation of a code generator for BPF programs to speed up packet
      filtering on PPC64, inspired by Eric Dumazet's x86-64 version.
      
      Filter code is generated as an ABI-compliant function in module_alloc()'d mem
      with stackframe & prologue/epilogue generated if required (simple filters don't
      need anything more than an li/blr).  The filter's local variables, M[], live in
      registers.  Supports all BPF opcodes, although "complicated" loads from negative
      packet offsets (e.g. SKF_LL_OFF) are not yet supported.
      
      There are a couple of further optimisations left for future work; many-pass
      assembly with branch-reach reduction and a register allocator to push M[]
      variables into volatile registers would improve the code quality further.
      
      This currently supports big-endian 64-bit PowerPC only (but is fairly simple
      to port to PPC32 or LE!).
      
      Enabled in the same way as x86-64:
      
      	echo 1 > /proc/sys/net/core/bpf_jit_enable
      
      Or, enabled with extra debug output:
      
      	echo 2 > /proc/sys/net/core/bpf_jit_enable
      Signed-off-by: Matt Evans <matt@ozlabs.org>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 27 Apr, 2011 1 commit
    • powerpc: Per process DSCR + some fixes (try#4) · efcac658
      Committed by Alexey Kardashevskiy
      The DSCR (aka Data Stream Control Register) is supported on some
      server PowerPC chips and allows some control over the prefetch
      of data streams.
      
      This patch allows the value to be specified per thread by emulating
      the corresponding mfspr and mtspr instructions. Children of such
      threads inherit the value. Other threads use a default value that
      can be specified in sysfs - /sys/devices/system/cpu/dscr_default.
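The inheritance rule in this paragraph can be modelled roughly as below. This is a toy model, not the kernel's data structures: the `Thread` class and its methods are hypothetical, and the corner case described in the following paragraph is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thread:
    # None means the thread never wrote DSCR itself and follows the system default.
    own_dscr: Optional[int] = None

    def set_dscr(self, value: int) -> None:
        """Model of an emulated mtspr to DSCR: the thread now carries its own value."""
        self.own_dscr = value

    def effective_dscr(self, system_default: int) -> int:
        """Model of an emulated mfspr from DSCR."""
        return system_default if self.own_dscr is None else self.own_dscr

    def fork(self) -> "Thread":
        """A child inherits the parent's explicitly set value, if there is one."""
        return Thread(own_dscr=self.own_dscr)
```

In this model `system_default` stands in for the value read from /sys/devices/system/cpu/dscr_default.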
      
      If a thread starts with a non-default value in the sysfs entry,
      all child threads inherit this non-default value even if
      the sysfs value is changed later.
      Signed-off-by: Alexey Kardashevskiy <aik@au1.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  17. 20 Apr, 2011 2 commits
  18. 09 Dec, 2010 1 commit
  19. 22 Jun, 2010 1 commit
    • powerpc: Emulate most Book I instructions in emulate_step() · 0016a4cf
      Committed by Paul Mackerras
      This extends the emulate_step() function to handle a large proportion
      of the Book I instructions implemented on current 64-bit server
      processors.  The aim is to handle all the load and store instructions
      used in the kernel, plus all of the instructions that appear between
      l[wd]arx and st[wd]cx., so this handles the Altivec/VMX lvx and stvx
      and the VSX lxv2dx and stxv2dx instructions (implemented in POWER7).
      
      The new code can emulate user mode instructions, and checks the
      effective address for a load or store if the saved state is for
      user mode.  It doesn't handle little-endian mode at present.
      
      For floating-point, Altivec/VMX and VSX instructions, it checks
      that the saved MSR has the enable bit for the relevant facility
      set, and if so, assumes that the FP/VMX/VSX registers contain
      valid state, and does loads or stores directly to/from the
      FP/VMX/VSX registers, using assembly helpers in ldstfp.S.
      
      Instructions supported now include:
      * Loads and stores, including some but not all VMX and VSX instructions,
        and lmw/stmw
      * Atomic loads and stores (l[dw]arx, st[dw]cx.)
      * Arithmetic instructions (add, subtract, multiply, divide, etc.)
      * Compare instructions
      * Rotate and mask instructions
      * Shift instructions
      * Logical instructions (and, or, xor, etc.)
      * Condition register logical instructions
      * mtcrf, cntlz[wd], exts[bhw]
      * isync, sync, lwsync, ptesync, eieio
      * Cache operations (dcbf, dcbst, dcbt, dcbtst)
      
      The overflow-checking arithmetic instructions are not included, but
      they appear never to be used in C code.
      
      This uses decimal values for the minor opcodes in the switch statements
      because that is what appears in the Power ISA specification, thus it is
      easier to check that they are correct if they are in decimal.
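To illustrate why decimal opcodes are convenient, here is a small sketch (not the kernel's emulate_step() code) that splits an instruction word into its primary and extended opcode and matches the load-and-reserve and store-conditional instructions by the decimal numbers used in the Power ISA. The function names are hypothetical.

```python
def primary_opcode(instr: int) -> int:
    """Top 6 bits of the 32-bit instruction word."""
    return (instr >> 26) & 0x3F


def extended_opcode(instr: int) -> int:
    """10-bit extended (minor) opcode of an X-form instruction."""
    return (instr >> 1) & 0x3FF


# Decimal extended opcodes under primary opcode 31.
LARX_STCX = {20: "lwarx", 84: "ldarx", 150: "stwcx.", 214: "stdcx."}


def classify_larx_stcx(instr: int):
    """Return the mnemonic for a larx/stcx. instruction word, or None."""
    if primary_opcode(instr) != 31:
        return None
    return LARX_STCX.get(extended_opcode(instr))
```

For example, the word 0x7D201828 (lwarx r9,0,r3) has primary opcode 31 and extended opcode 20, which can be compared directly against the decimal values in the table.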
      
      If this is used to single-step an instruction where a data breakpoint
      interrupt occurred, then there is the possibility that the instruction
      is a lwarx or ldarx.  In that case we have to be careful not to lose the
      reservation until we get to the matching st[wd]cx., or we'll never make
      forward progress.  One alternative is to try to arrange that we can
      return from interrupts and handle data breakpoint interrupts without
      losing the reservation, which means not using any spinlocks, mutexes,
      or atomic ops (including bitops).  That seems rather fragile.  The
      other alternative is to emulate the larx/stcx and all the instructions
      in between.  This is why this commit adds support for a wide range
      of integer instructions.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  20. 17 Mar, 2010 1 commit
  21. 17 Feb, 2010 2 commits
    • powerpc: Use lwarx/ldarx hint in bit locks · 864b9e6f
      Committed by Anton Blanchard
      This patch implements the lwarx/ldarx hint bit for bit locks.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Use lwarx hint in spinlocks · 4e14a4d1
      Committed by Anton Blanchard
      Recent versions of the PowerPC architecture added a hint bit to the larx
      instructions to differentiate between an atomic operation and a lock operation:
      
      > 0 Other programs might attempt to modify the word in storage addressed by EA
      > even if the subsequent Store Conditional succeeds.
      >
      > 1 Other programs will not attempt to modify the word in storage addressed by
      > EA until the program that has acquired the lock performs a subsequent store
      > releasing the lock.
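As a rough sketch of how such a hint can be emitted without assembler support, the instruction word can be assembled by hand with the hint in its least significant bit. The helper below is hypothetical and written in Python for illustration; it is not the kernel's macro.

```python
# Base encoding of lwarx with all register fields zero:
# primary opcode 31, extended opcode 20.
PPC_INST_LWARX = 0x7C000028


def encode_lwarx(rt: int, ra: int, rb: int, eh: int = 0) -> int:
    """Assemble 'lwarx rt,ra,rb,eh' into a 32-bit instruction word."""
    for reg in (rt, ra, rb):
        if not 0 <= reg < 32:
            raise ValueError("register numbers must be in 0..31")
    return (PPC_INST_LWARX
            | (rt << 21)    # RT field
            | (ra << 16)    # RA field
            | (rb << 11)    # RB field
            | (eh & 0x1))   # EH hint in the least significant bit
```

With eh=0 the result is the plain lwarx encoding, so an assembler that predates the hint is only needed to accept a raw 32-bit word.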
      
      To avoid a binutils dependency this patch creates macros for the extended lwarx
      format and uses them in the spinlock code. To test this change I used a simple
      test case that acquires and releases a global pthread mutex:
      
      	pthread_mutex_lock(&mutex);
      	pthread_mutex_unlock(&mutex);
      
      On a 32 core POWER6, running 32 test threads we spend almost all our time in
      the futex spinlock code:
      
          94.37%     perf  [kernel]                     [k] ._raw_spin_lock
                     |
                     |--99.95%-- ._raw_spin_lock
                     |          |
                     |          |--63.29%-- .futex_wake
                     |          |
                     |          |--36.64%-- .futex_wait_setup
      
      Which is a good test for this patch. The results (in lock/unlock operations per
      second) are:
      
      before: 1538203 ops/sec
      after:  2189219 ops/sec
      
      An improvement of 42%
      
      A 32 core POWER7 improves even more:
      
      before: 1279529 ops/sec
      after:  2282076 ops/sec
      
      An improvement of 78%
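The quoted percentages follow directly from the before and after throughput figures. A short check, using only the numbers above (the function name is illustrative):

```python
def improvement_percent(before: float, after: float) -> float:
    """Relative throughput gain of 'after' over 'before', in percent."""
    return (after - before) / before * 100.0


power6 = improvement_percent(1538203, 2189219)
power7 = improvement_percent(1279529, 2282076)

print(f"POWER6: {power6:.0f}%")  # POWER6: 42%
print(f"POWER7: {power7:.0f}%")  # POWER7: 78%
```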
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  22. 20 Aug, 2009 1 commit
  23. 21 May, 2009 3 commits
  24. 23 Apr, 2009 1 commit
  25. 07 Apr, 2009 2 commits