1. 11 Jun, 2015 1 commit
  2. 28 Jan, 2015 2 commits
  3. 23 Jan, 2015 1 commit
    • powerpc: Add 64bit optimised memcmp · 15c2d45d
      Anton Blanchard committed
      I noticed ksm spending quite a lot of time in memcmp on a large
      KVM box. The current memcmp loop is very unoptimised - byte at a
      time compares with no loop unrolling. We can do much much better.
      
      Optimise the loop in a few ways:
      
      - Unroll the byte at a time loop
      
      - For large (at least 32 byte) comparisons that are also 8 byte
        aligned, use an unrolled modulo scheduled loop using 8 byte
        loads. This is similar to our glibc memcmp.
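
The two optimisations listed above can be illustrated in portable C. This is only a rough sketch, not the assembly routine from the patch: `memcmp_sketch` and `cmp_bytes` are hypothetical names, the 4x unrolling factor is an assumption, and the modulo scheduling of the real loop is not reproduced. On a mismatch it drops back to a byte loop so that the result does not depend on endianness.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time compare; returns <0, 0 or >0 like memcmp. */
static int cmp_bytes(const unsigned char *a, const unsigned char *b, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (a[i] != b[i])
			return (int)a[i] - (int)b[i];
	}
	return 0;
}

/* Hypothetical illustration of the fast path described in the commit. */
int memcmp_sketch(const void *s1, const void *s2, size_t n)
{
	const unsigned char *a = s1, *b = s2;

	/* Fast path: at least 32 bytes and both pointers 8-byte aligned. */
	if (n >= 32 && (((uintptr_t)a | (uintptr_t)b) & 7) == 0) {
		while (n >= 32) {
			uint64_t x[4], y[4];

			/* Four 8-byte loads per side: the loop unrolled 4x. */
			memcpy(x, a, sizeof(x));
			memcpy(y, b, sizeof(y));
			if (x[0] != y[0] || x[1] != y[1] ||
			    x[2] != y[2] || x[3] != y[3])
				break;	/* the byte loop below finds the difference */
			a += 32;
			b += 32;
			n -= 32;
		}
	}
	/* Tail, unaligned input, or the 32-byte chunk that differed. */
	return cmp_bytes(a, b, n);
}
```

Calling `memcmp_sketch(buf1, buf2, 8192)` on two 8-byte aligned buffers exercises the unrolled path; short or unaligned inputs go straight to the byte loop.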
      
      A simple microbenchmark testing 10000000 iterations of an 8192 byte
      memcmp was used to measure the performance:
      
      baseline:	29.93 s
      
      modified:	 1.70 s
      
      Just over 17x faster.
      
      v2: Incorporated some suggestions from Segher:
      
      - Use andi. instead of rldicl.
      
      - Convert bdnzt eq, to bdnz. It's just duplicating the earlier compare
        and was a relic from a previous version.
      
      - Don't use cr5, we have plans to use that CR field for fast local
        atomics.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  4. 29 Dec, 2014 1 commit
  5. 10 Nov, 2014 1 commit
  6. 25 Sep, 2014 1 commit
  7. 30 Apr, 2014 1 commit
    • powerpc: memcpy optimization for 64bit LE · 00f554fa
      Philippe Bergheaud committed
      Unaligned stores take alignment exceptions on POWER7 running in little-endian.
      This is a dumb little-endian base memcpy that prevents unaligned stores.
      Once booted the feature fixup code switches over to the VMX copy loops
      (which are already endian safe).
      
      The question is what we do before that switch over. The base 64bit
      memcpy takes alignment exceptions on POWER7 so we can't use it as is.
      Fixing the causes of alignment exception would slow it down, because
      we'd need to ensure all loads and stores are aligned either through
      rotate tricks or bytewise loads and stores. Either would be bad for
      all other 64bit platforms.
      
      [ I simplified the loop a bit - Anton ]
      Signed-off-by: Philippe Bergheaud <felix@linux.vnet.ibm.com>
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  8. 30 Oct, 2013 1 commit
    • powerpc: Add VMX optimised xor for RAID5 · ef1313de
      Anton Blanchard committed
      Add a VMX optimised xor, used primarily for RAID5. On a POWER7 blade
      this is a decent win:
      
         32regs    : 17932.800 MB/sec
         altivec   : 19724.800 MB/sec
      
      The bigger gain is when the same test is run in SMT4 mode, as it
      would if there was a lot of work going on:
      
         8regs     :  8377.600 MB/sec
         altivec   : 15801.600 MB/sec
      
      I tested this against an array created without the patch, and also
      verified it worked as expected on a little endian kernel.
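
For readers unfamiliar with the operation being benchmarked, the following is a plain C sketch of a two-buffer xor, unrolled four times. It is not the VMX code from the patch; `xor_2_sketch` is a hypothetical name, and the requirement that the length be a multiple of four machine words is an assumption made to keep the loop simple.

```c
#include <stddef.h>

/*
 * Hypothetical scalar illustration: XOR src into dst one unsigned long
 * at a time, unrolled four times. Assumes bytes is a multiple of
 * 4 * sizeof(unsigned long); any remainder is ignored.
 */
void xor_2_sketch(size_t bytes, unsigned long *dst, const unsigned long *src)
{
	size_t words = bytes / sizeof(unsigned long);

	for (size_t i = 0; i + 4 <= words; i += 4) {
		dst[i]     ^= src[i];
		dst[i + 1] ^= src[i + 1];
		dst[i + 2] ^= src[i + 2];
		dst[i + 3] ^= src[i + 3];
	}
}
```

Because xor is its own inverse, applying the function twice with the same source restores the destination, which is the property RAID5 relies on to rebuild a missing block from parity.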
      
      [ Fix !CONFIG_ALTIVEC build -- BenH ]
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  9. 11 Oct, 2013 2 commits
  10. 29 Jan, 2013 1 commit
  11. 10 Jan, 2013 1 commit
    • powerpc: Build kernel with -mcmodel=medium · 1fbe9cf2
      Anton Blanchard committed
      Finally remove the two level TOC and build with -mcmodel=medium.
      
      Unfortunately we can't build modules with -mcmodel=medium due to
      the tricks the kernel module loader plays with percpu data:
      
      # -mcmodel=medium breaks modules because it uses 32bit offsets from
      # the TOC pointer to create pointers where possible. Pointers into the
      # percpu data area are created by this method.
      #
      # The kernel module loader relocates the percpu data section from the
      # original location (starting with 0xd...) to somewhere in the base
      # kernel percpu data space (starting with 0xc...). We need a full
      # 64bit relocation for this to work, hence -mcmodel=large.
      
      On older kernels we fall back to the two level TOC (-mminimal-toc)
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  12. 03 Jul, 2012 4 commits
  13. 19 Dec, 2011 1 commit
    • powerpc: POWER7 optimised copy_to_user/copy_from_user using VMX · a66086b8
      Anton Blanchard committed
      Implement a POWER7 optimised copy_to_user/copy_from_user using VMX.
      For large aligned copies this new loop is over 10% faster, and for
      large unaligned copies it is over 200% faster.
      
      If we take a fault we fall back to the old version, this keeps
      things relatively simple and easy to verify.
      
      On POWER7 unaligned stores rarely slow down - they only flush when
      a store crosses a 4KB page boundary. Furthermore this flush is
      handled completely in hardware and should be 20-30 cycles.
      
      Unaligned loads on the other hand flush much more often - whenever
      crossing a 128 byte cache line, or a 32 byte sector if either sector
      is an L1 miss.
      
      Considering this information we really want to get the loads aligned
      and not worry about the alignment of the stores. Microbenchmarks
      confirm that this approach is much faster than the current unaligned
      copy loop that uses shifts and rotates to ensure both loads and
      stores are aligned.
      
      We also want to try and do the stores in cacheline aligned, cacheline
      sized chunks. If the store queue is unable to merge an entire
      cacheline of stores then the L2 cache will have to do a
      read/modify/write. Even worse, we will serialise this with the stores
      in the next iteration of the copy loop since both iterations hit
      the same cacheline.
      
      Based on this, the new loop does the following things:
      
      1 - 127 bytes
      Get the source 8 byte aligned and use 8 byte loads and stores. Pretty
      boring and similar to how the current loop works.
      
      128 - 4095 bytes
      Get the source 8 byte aligned and use 8 byte loads and stores,
      1 cacheline at a time. We aren't doing the stores in cacheline
      aligned chunks so we will potentially serialise once per cacheline.
      Even so it is much better than the loop we have today.
      
      4096 bytes and above
      If both source and destination have the same alignment get them both
      16 byte aligned, then get the destination cacheline aligned. Do
      cacheline sized loads and stores using VMX.
      
      If source and destination do not have the same alignment, we get the
      destination cacheline aligned, and use permute to do aligned loads.
      
      In both cases the VMX loop should be optimal - we always do aligned
      loads and stores and are always doing stores in cacheline aligned,
      cacheline sized chunks.
      
      To be able to use VMX we must be careful about interrupts and
      sleeping. We don't use the VMX loop when in an interrupt (which should
      be rare anyway) and we wrap the VMX loop in disable/enable_pagefault
      and fall back to the existing copy_tofrom_user loop if we do need to
      sleep.
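
The size classes and the interrupt restriction described above amount to a small dispatch decision, sketched here in C. The enum, the macro names and `choose_copy_path` are hypothetical and do not come from the patch. Falling back to the cacheline loop when in an interrupt is an assumption; the commit message only says the VMX loop is not used there.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the three loops described in the commit. */
enum copy_path {
	COPY_SMALL,	/* 1 - 127 bytes: 8-byte loads and stores */
	COPY_CACHELINE,	/* 128 - 4095 bytes: one cacheline per iteration */
	COPY_VMX,	/* 4096 bytes and above: VMX loop */
};

#define CACHELINE_BYTES	128	/* POWER7 cacheline size per the text */
#define VMX_BREAKPOINT	4096	/* breakpoint chosen in the commit */

enum copy_path choose_copy_path(size_t len, bool in_interrupt)
{
	if (len < CACHELINE_BYTES)
		return COPY_SMALL;
	/* Assumption: large copies in interrupt context use the non-VMX loop. */
	if (len < VMX_BREAKPOINT || in_interrupt)
		return COPY_CACHELINE;
	return COPY_VMX;
}
```

For example, `choose_copy_path(8192, false)` selects the VMX loop, while the same length in interrupt context does not.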
      
      The VMX breakpoint of 4096 bytes was chosen using this microbenchmark:
      
      http://ozlabs.org/~anton/junkcode/copy_to_user.c
      
      Since we are using VMX and there is a cost to saving and restoring
      the user VMX state there are two broad cases we need to benchmark:
      
      - Best case - userspace never uses VMX
      
      - Worst case - userspace always uses VMX
      
      In reality a userspace process will sit somewhere between these two
      extremes. Since we need to test both aligned and unaligned copies we
      end up with 4 combinations. The point at which the VMX loop begins to
      win is:
      
      0% VMX
      aligned		2048 bytes
      unaligned	2048 bytes
      
      100% VMX
      aligned		16384 bytes
      unaligned	8192 bytes
      
      Considering this is a microbenchmark, the data is hot in cache and
      the VMX loop has better store queue merging properties we set the
      breakpoint to 4096 bytes, a little below the unaligned breakpoints.
      
      Some future optimisations we can look at:
      
      - Looking at the perf data, a significant part of the cost when a
        task is always using VMX is the extra exception we take to restore
        the VMX state. As such we should do something similar to the x86
        optimisation that restores FPU state for heavy users. ie:
      
              /*
               * If the task has used fpu the last 5 timeslices, just do a full
               * restore of the math state immediately to avoid the trap; the
               * chances of needing FPU soon are obviously high now
               */
              preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;
      
        and
      
              /*
               * fpu_counter contains the number of consecutive context switches
               * that the FPU is used. If this is over a threshold, the lazy fpu
               * saving becomes unlazy to save the trap. This is an unsigned char
               * so that after 256 times the counter wraps and the behavior turns
               * lazy again; this to deal with bursty apps that only use FPU for
               * a short time
               */
      
      - We could create a paca bit to mirror the VMX enabled MSR bit and check
        that first, avoiding multiple calls to enable_kernel_altivec.
        That should help with iovec based system calls like readv.
      
      - We could have two VMX breakpoints, one for when we know the user VMX
        state is loaded into the registers and one when it isn't. This could
        be a second bit in the paca so we can calculate the break points quickly.
      
      - One suggestion from Ben was to save and restore the VSX registers
        we use inline instead of using enable_kernel_altivec.
      
      [BenH: Fixed a problem with preempt and fixed build without CONFIG_ALTIVEC]
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  14. 29 Nov, 2010 1 commit
  15. 13 Oct, 2010 1 commit
  16. 02 Sep, 2010 1 commit
    • powerpc: Optimise 64bit csum_partial_copy_generic and add csum_and_copy_from_user · fdd374b6
      Anton Blanchard committed
      We use the same core loop as the new csum_partial, adding in the
      stores and exception handling code. To keep things simple we do all the
      exception fixup in csum_and_copy_from_user. This wrapper function is
      modelled on the generic checksum code and is careful to always calculate
      a complete checksum even if we only copied part of the data to userspace.
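
The idea of checksumming in the same loop that does the stores, and still producing a checksum over the full length after a partial copy, can be sketched in userspace C. Everything here is hypothetical: `csum_and_copy_sketch` and `csum_fold_sketch` are illustrative names, the `readable` parameter merely simulates a fault part-way through the source, and zero-filling the uncopied tail is an assumption about how the checksum is kept well defined. The sum is the 16-bit ones' complement sum over big-endian 16-bit words, left uninverted.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a wide accumulator down to 16 bits with end-around carry. */
static uint16_t csum_fold_sketch(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Hypothetical illustration: copy len bytes and checksum them in the
 * same loop. Only the first `readable` source bytes are accessible
 * (simulating a fault); the rest of dst is zero-filled so the checksum
 * still covers all len bytes of the destination.
 */
uint16_t csum_and_copy_sketch(const uint8_t *src, uint8_t *dst,
			      size_t len, size_t readable)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < len; i++) {
		uint8_t byte = (i < readable) ? src[i] : 0;

		dst[i] = byte;				/* the store */
		sum += (i & 1) ? byte : (uint64_t)byte << 8;	/* the checksum */
	}
	return csum_fold_sketch(sum);
}
```

The kernel code works on much wider units than a byte; the byte loop is used here only to keep the end-around-carry arithmetic easy to follow.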
      
      To test this I forced checksumming on over loopback and ran socklib (a
      simple TCP benchmark). On a POWER6 575 throughput improved by 19% with
      this patch. If I forced both the sender and receiver onto the same cpu
      (with the hope of shifting the benchmark from being cache bandwidth limited
      to cpu limited), adding this patch improved performance by 55%
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  17. 08 Jul, 2010 1 commit
  18. 22 Jun, 2010 2 commits
    • powerpc, hw_breakpoints: Implement hw_breakpoints for 64-bit server processors · 5aae8a53
      K.Prasad committed
      Implement perf-events based hw-breakpoint interfaces for PowerPC
      64-bit server (Book III S) processors.  This allows access to a
      given location to be used as an event that can be counted or
      profiled by the perf_events subsystem.
      
      This is done using the DABR (data breakpoint register), which can
      also be used for process debugging via ptrace.  When perf_event
      hw_breakpoint support is configured in, the perf_event subsystem
      manages the DABR and arbitrates access to it, and ptrace then
      creates a perf_event when it is requested to set a data breakpoint.
      
      [Adopted suggestions from Paul Mackerras <paulus@samba.org> to
      - emulate_step() all system-wide breakpoints and single-step only the
        per-task breakpoints
      - perform arch-specific cleanup before unregistration through
        arch_unregister_hw_breakpoint()
      ]
      Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
    • powerpc: Emulate most Book I instructions in emulate_step() · 0016a4cf
      Paul Mackerras committed
      This extends the emulate_step() function to handle a large proportion
      of the Book I instructions implemented on current 64-bit server
      processors.  The aim is to handle all the load and store instructions
      used in the kernel, plus all of the instructions that appear between
      l[wd]arx and st[wd]cx., so this handles the Altivec/VMX lvx and stvx
      and the VSX lxv2dx and stxv2dx instructions (implemented in POWER7).
      
      The new code can emulate user mode instructions, and checks the
      effective address for a load or store if the saved state is for
      user mode.  It doesn't handle little-endian mode at present.
      
      For floating-point, Altivec/VMX and VSX instructions, it checks
      that the saved MSR has the enable bit for the relevant facility
      set, and if so, assumes that the FP/VMX/VSX registers contain
      valid state, and does loads or stores directly to/from the
      FP/VMX/VSX registers, using assembly helpers in ldstfp.S.
      
      Instructions supported now include:
      * Loads and stores, including some but not all VMX and VSX instructions,
        and lmw/stmw
      * Atomic loads and stores (l[dw]arx, st[dw]cx.)
      * Arithmetic instructions (add, subtract, multiply, divide, etc.)
      * Compare instructions
      * Rotate and mask instructions
      * Shift instructions
      * Logical instructions (and, or, xor, etc.)
      * Condition register logical instructions
      * mtcrf, cntlz[wd], exts[bhw]
      * isync, sync, lwsync, ptesync, eieio
      * Cache operations (dcbf, dcbst, dcbt, dcbtst)
      
      The overflow-checking arithmetic instructions are not included, but
      they appear not to be ever used in C code.
      
      This uses decimal values for the minor opcodes in the switch statements
      because that is what appears in the Power ISA specification, thus it is
      easier to check that they are correct if they are in decimal.
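
A small C sketch of what a decimal minor-opcode switch looks like follows. It is not the code from emulate_step(); `decode_sketch` is a hypothetical name and only a handful of primary opcode 31 instructions are listed. The extended opcode values (for example 266 for add and 444 for or) are the decimal values given in the Power ISA.

```c
#include <stdint.h>

/*
 * Hypothetical illustration: name a few primary-opcode-31 instructions
 * by their extended (minor) opcode, written in decimal as in the ISA.
 */
const char *decode_sketch(uint32_t instr)
{
	unsigned int opcode = instr >> 26;		/* primary opcode */
	unsigned int minor = (instr >> 1) & 0x3ff;	/* extended opcode */

	if (opcode != 31)
		return "other";

	switch (minor) {
	case 20:	return "lwarx";
	case 28:	return "and";
	case 40:	return "subf";
	case 84:	return "ldarx";
	case 150:	return "stwcx.";
	case 214:	return "stdcx.";
	case 266:	return "add";
	case 316:	return "xor";
	case 444:	return "or";
	default:	return "unknown";
	}
}
```

For instance, `add r3,r4,r5` encodes as 0x7C642A14, whose extended opcode field is 266. The overflow-checking form sets the OE bit, which this 10-bit mask includes, so it would land in the default case, in line with those instructions not being handled.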
      
      If this is used to single-step an instruction where a data breakpoint
      interrupt occurred, then there is the possibility that the instruction
      is a lwarx or ldarx.  In that case we have to be careful not to lose the
      reservation until we get to the matching st[wd]cx., or we'll never make
      forward progress.  One alternative is to try to arrange that we can
      return from interrupts and handle data breakpoint interrupts without
      losing the reservation, which means not using any spinlocks, mutexes,
      or atomic ops (including bitops).  That seems rather fragile.  The
      other alternative is to emulate the larx/stcx and all the instructions
      in between.  This is why this commit adds support for a wide range
      of integer instructions.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  19. 16 Jun, 2009 1 commit
    • powerpc: Add configurable -Werror for arch/powerpc · ba55bd74
      Michael Ellerman committed
      Add the option to build the code under arch/powerpc with -Werror.
      
      The intention is to make it harder for people to inadvertently introduce
      warnings in the arch/powerpc code. It needs to be configurable so that
      if a warning is introduced, people can easily work around it while it's
      being fixed.
      
      The option is a negative, ie. don't enable -Werror, so that it will be
      turned on for allyes and allmodconfig builds.
      
      The default is n, in the hope that developers will build with -Werror.
      That will probably lead to some build breaks; I am prepared to be flamed.
      
      It's not enabled for math-emu, which is a steaming pile of warnings.
      Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  20. 27 May, 2009 1 commit
  21. 28 Nov, 2008 1 commit
    • powerpc/ppc32: static ftrace fixes for PPC32 · f1eecf0e
      Steven Rostedt committed
      Impact: fix for PowerPC 32 code
      
      Some early init code was not safe with static ftrace enabled when
      booting my PowerBook. This code must only use relative
      addressing, and static mcount performs a compare of the
      ftrace_trace_function pointer, and gets that with an absolute address.
      In the early init boot up code, this will cause a fault.
      
      This patch removes tracing from the files containing the offending
      functions.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  22. 04 Aug, 2008 1 commit
  23. 01 Jul, 2008 3 commits
  24. 16 Jun, 2008 1 commit
  25. 12 May, 2008 1 commit
    • [POWERPC] ppc: More compile fixes · 0d4b6b90
      Paul Mackerras committed
      This fixes a few more miscellaneous compile problems with ARCH=ppc.
      
      1. Don't compile devres.c on ARCH=ppc, it doesn't have ioremap_flags.
      2. Include <asm/irq.h> in setup.c for the __DO_IRQ_CANON definition.
      3. Include <linux/proc_fs.h> in residual.c for the
         definition of create_proc_read_entry.
      4. Fix xchg_ptr to be a static inline to eliminate a compiler warning.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  26. 05 May, 2008 1 commit
  27. 17 Oct, 2007 1 commit
  28. 03 Oct, 2007 1 commit
  29. 19 Sep, 2007 1 commit
    • [POWERPC] Fix section mismatch in PCI code · 7b2c3c5b
      Stephen Rothwell committed
      Create a helper function (alloc_maybe_bootmem) that is marked __init_refok
      to limit the chances of mistakenly referring to other __init routines.
      
      WARNING: vmlinux.o(.text+0x2a9c4): Section mismatch: reference to .init.text:.__alloc_bootmem (between '.update_dn_pci_info' and '.pci_dn_reconfig_notifier')
      WARNING: vmlinux.o(.text+0x36430): Section mismatch: reference to .init.text:.__alloc_bootmem (between '.mpic_msi_init_allocator' and '.find_ht_magic_addr')
      WARNING: vmlinux.o(.text+0x5e804): Section mismatch: reference to .init.text:.__alloc_bootmem (between '.celleb_setup_phb' and '.celleb_fake_pci_write_config')
      WARNING: vmlinux.o(.text+0x5e8e8): Section mismatch: reference to .init.text:.__alloc_bootmem (between '.celleb_setup_phb' and '.celleb_fake_pci_write_config')
      WARNING: vmlinux.o(.text+0x5e968): Section mismatch: reference to .init.text:.__alloc_bootmem (between '.celleb_setup_phb' and '.celleb_fake_pci_write_config')
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  30. 10 May, 2007 1 commit
  31. 26 Apr, 2007 1 commit
  32. 07 Feb, 2007 1 commit