1. 03 Jul 2012, 5 commits
  2. 28 May 2012, 1 commit
  3. 30 Apr 2012, 1 commit
  4. 29 Mar 2012, 1 commit
  5. 21 Mar 2012, 1 commit
  6. 19 Dec 2011, 1 commit
    • powerpc: POWER7 optimised copy_to_user/copy_from_user using VMX · a66086b8
      Authored by Anton Blanchard
      Implement a POWER7 optimised copy_to_user/copy_from_user using VMX.
      For large aligned copies this new loop is over 10% faster, and for
      large unaligned copies it is over 200% faster.
      
      If we take a fault we fall back to the old version; this keeps
      things relatively simple and easy to verify.
      
      On POWER7 unaligned stores rarely slow down - they only flush when
      a store crosses a 4KB page boundary. Furthermore this flush is
      handled completely in hardware and should be 20-30 cycles.
      
      Unaligned loads on the other hand flush much more often - whenever
      crossing a 128 byte cache line, or a 32 byte sector if either sector
      is an L1 miss.
      
      Considering this information we really want to get the loads aligned
      and not worry about the alignment of the stores. Microbenchmarks
      confirm that this approach is much faster than the current unaligned
      copy loop that uses shifts and rotates to ensure both loads and
      stores are aligned.
      
      We also want to try to do the stores in cacheline aligned, cacheline
      sized chunks. If the store queue is unable to merge an entire
      cacheline of stores then the L2 cache will have to do a
      read/modify/write. Even worse, we will serialise this with the stores
      in the next iteration of the copy loop since both iterations hit
      the same cacheline.
      
      Based on this, the new loop does the following things:
      
      1 - 127 bytes
      Get the source 8 byte aligned and use 8 byte loads and stores. Pretty
      boring and similar to how the current loop works.
      
      128 - 4095 bytes
      Get the source 8 byte aligned and use 8 byte loads and stores,
      1 cacheline at a time. We aren't doing the stores in cacheline
      aligned chunks so we will potentially serialise once per cacheline.
      Even so it is much better than the loop we have today.
      
      4096 bytes and up
      If both source and destination have the same alignment get them both
      16 byte aligned, then get the destination cacheline aligned. Do
      cacheline sized loads and stores using VMX.
      
      If source and destination do not have the same alignment, we get the
      destination cacheline aligned, and use permute to do aligned loads.
      
      In both cases the VMX loop should be optimal - we always do aligned
      loads and stores and are always doing stores in cacheline aligned,
      cacheline sized chunks.
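
      The three ranges above boil down to a simple size and alignment check.
      A standalone C sketch of that selection logic is below; the enum and
      function names are made up for illustration, and the real routine is
      hand-written assembly, not C.

      #include <stdio.h>

      /* Illustrative only: mirrors the strategy selection described above. */
      enum copy_strategy {
      	COPY_8BYTE,		/* 1 - 127 bytes: 8 byte loads and stores */
      	COPY_8BYTE_CACHELINE,	/* 128 - 4095 bytes: 8 byte ops, a cacheline at a time */
      	COPY_VMX_ALIGNED,	/* >= 4096, same alignment: aligned VMX cacheline copies */
      	COPY_VMX_PERMUTE,	/* >= 4096, different alignment: vperm keeps the loads aligned */
      };

      static enum copy_strategy pick_strategy(unsigned long dst, unsigned long src,
      					unsigned long n)
      {
      	if (n < 128)
      		return COPY_8BYTE;
      	if (n < 4096)
      		return COPY_8BYTE_CACHELINE;
      	/* If the low 4 bits differ, source and destination can never be
      	 * 16 byte aligned at the same time. */
      	return ((dst ^ src) & 0xf) ? COPY_VMX_PERMUTE : COPY_VMX_ALIGNED;
      }

      int main(void)
      {
      	printf("%d\n", pick_strategy(0x1000, 0x2008, 8192)); /* prints 3: COPY_VMX_PERMUTE */
      	return 0;
      }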
      
      To be able to use VMX we must be careful about interrupts and
      sleeping. We don't use the VMX loop when in an interrupt (which should
      be rare anyway) and we wrap the VMX loop in disable/enable_pagefault
      and fall back to the existing copy_tofrom_user loop if we do need to
      sleep.
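
      One way to picture that wrapping is the sketch below. It is kernel-style
      pseudo-code rather than the actual powerpc implementation: copy_fallback
      and copy_vmx_loop are hypothetical names, while in_interrupt(),
      pagefault_disable()/pagefault_enable(), preempt_disable()/preempt_enable()
      and enable_kernel_altivec() are the standard kernel primitives.

      #include <linux/hardirq.h>	/* in_interrupt() */
      #include <linux/uaccess.h>	/* pagefault_disable()/pagefault_enable() */
      #include <asm/switch_to.h>	/* enable_kernel_altivec() (header varies by kernel version) */

      unsigned long copy_fallback(void *to, const void *from, unsigned long n);
      unsigned long copy_vmx_loop(void *to, const void *from, unsigned long n);

      /* Sketch: returns the number of bytes NOT copied, like copy_tofrom_user. */
      static unsigned long copy_with_vmx(void *to, const void *from, unsigned long n)
      {
      	unsigned long left;

      	/* In interrupt context, or below the breakpoint where the VMX
      	 * save/restore cost cannot be amortised, use the plain loop. */
      	if (in_interrupt() || n < 4096)
      		return copy_fallback(to, from, n);

      	preempt_disable();
      	pagefault_disable();		/* faults return instead of sleeping */
      	enable_kernel_altivec();	/* make VMX usable in the kernel */

      	left = copy_vmx_loop(to, from, n);

      	pagefault_enable();
      	preempt_enable();

      	/* If the VMX loop took a fault, finish the remainder with the old
      	 * loop, which is allowed to sleep and fault pages in. */
      	if (left)
      		left = copy_fallback(to + (n - left), from + (n - left), left);

      	return left;
      }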
      
      The VMX breakpoint of 4096 bytes was chosen using this microbenchmark:
      
      http://ozlabs.org/~anton/junkcode/copy_to_user.c
      
      Since we are using VMX and there is a cost to saving and restoring
      the user VMX state, there are two broad cases we need to benchmark:
      
      - Best case - userspace never uses VMX
      
      - Worst case - userspace always uses VMX
      
      In reality a userspace process will sit somewhere between these two
      extremes. Since we need to test both aligned and unaligned copies we
      end up with 4 combinations. The point at which the VMX loop begins to
      win is:
      
      0% VMX
      aligned		2048 bytes
      unaligned	2048 bytes
      
      100% VMX
      aligned		16384 bytes
      unaligned	8192 bytes
      
      Considering that this is a microbenchmark, that the data is hot in cache
      and that the VMX loop has better store queue merging properties, we set
      the breakpoint to 4096 bytes, a little below the unaligned breakpoints.
      
      Some future optimisations we can look at:
      
      - Looking at the perf data, a significant part of the cost when a
        task is always using VMX is the extra exception we take to restore
        the VMX state. As such we should do something similar to the x86
        optimisation that restores FPU state for heavy users, i.e.:
      
              /*
               * If the task has used fpu the last 5 timeslices, just do a full
               * restore of the math state immediately to avoid the trap; the
               * chances of needing FPU soon are obviously high now
               */
              preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;
      
        and
      
              /*
               * fpu_counter contains the number of consecutive context switches
               * that the FPU is used. If this is over a threshold, the lazy fpu
               * saving becomes unlazy to save the trap. This is an unsigned char
               * so that after 256 times the counter wraps and the behavior turns
               * lazy again; this to deal with bursty apps that only use FPU for
               * a short time
               */
      
      - We could create a paca bit to mirror the VMX enabled MSR bit and check
        that first, avoiding multiple calls to enable_kernel_altivec. That
        should help with iovec based system calls like readv.
      
      - We could have two VMX breakpoints, one for when we know the user VMX
        state is loaded into the registers and one when it isn't. This could
        be a second bit in the paca so we can calculate the breakpoints quickly.
      
      - One suggestion from Ben was to save and restore the VSX registers
        we use inline instead of using enable_kernel_altivec.
      
      [BenH: Fixed a problem with preempt and fixed build without CONFIG_ALTIVEC]
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  7. 16 Nov 2011, 1 commit
  8. 01 Nov 2011, 1 commit
  9. 21 May 2011, 1 commit
    • sanitize <linux/prefetch.h> usage · 268bb0ce
      Authored by Linus Torvalds
      Commit e66eed65 ("list: remove prefetching from regular list
      iterators") removed the include of prefetch.h from list.h, which
      uncovered several cases that had apparently relied on that rather
      obscure header file dependency.
      
      So this fixes things up a bit, using
      
         grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
         grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
      
      to guide us in finding files that either need <linux/prefetch.h>
      inclusion, or have it despite not needing it.
      
      There are more of them around (mostly network drivers), but this gets
      many core ones.
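
      For example, a file that calls prefetch()/prefetchw() directly now needs
      something like the following (an illustrative fragment, not taken from a
      particular driver):

      #include <linux/prefetch.h>	/* no longer pulled in via list.h */

      static void warm_up(const void *p)
      {
      	prefetch(p);	/* hint: bring the line in for a read */
      	prefetchw(p);	/* hint: bring the line in for a write */
      }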
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 19 May 2011, 3 commits
  11. 27 Apr 2011, 1 commit
  12. 21 Jan 2011, 1 commit
    • powerpc: Ensure the else case of feature sections will fit · c0337288
      Authored by Michael Ellerman
      When we create an alternative feature section, the else case must be the
      same size or smaller than the body. This is because when we patch the
      else case in we just overwrite the body, so there must be room.
      
      Up to now we just did this by inspection, but it's quite easy to enforce
      it in the assembler, so we should.
      
      The only change is to add the ifgt block, but that affects the alignment
      of the tabs and so the whole macro is modified.
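
      The same idea, expressed as a C analogy rather than the actual assembler
      ifgt check (the byte counts here are invented, and C11 static_assert
      stands in for the assembler-level error):

      #include <assert.h>

      #define FTR_BODY_BYTES	16	/* hypothetical size of the default body */
      #define FTR_ELSE_BYTES	12	/* hypothetical size of the else case */

      /* The else case is patched over the body, so it must not be bigger. */
      static_assert(FTR_ELSE_BYTES <= FTR_BODY_BYTES,
      	      "feature section else case larger than body");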
      
      Also add a test, but #if 0 it because we don't want to break the build.
      Anyone who's modifying the feature macros should enable the test.
      Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  13. 09 Dec 2010, 1 commit
  14. 29 Nov 2010, 1 commit
  15. 13 Oct 2010, 1 commit
  16. 02 Sep 2010, 6 commits
  17. 08 Jul 2010, 2 commits
    • powerpc: Fix feature-fixup tests for gcc 4.5 · 3880ecb0
      Authored by Stephen Rothwell
      The feature-fixup tests declare some extern void variables and then take
      their addresses.  Fix this by declaring them as extern u8 instead.
      
      Fixes these warnings (treated as errors):
      
        CC      arch/powerpc/lib/feature-fixups.o
      cc1: warnings being treated as errors
      arch/powerpc/lib/feature-fixups.c: In function 'test_cpu_macros':
      arch/powerpc/lib/feature-fixups.c:293:23: error: taking address of expression of type 'void'
      arch/powerpc/lib/feature-fixups.c:294:9: error: taking address of expression of type 'void'
      arch/powerpc/lib/feature-fixups.c:297:2: error: taking address of expression of type 'void'
      arch/powerpc/lib/feature-fixups.c:297:2: error: taking address of expression of type 'void'
      arch/powerpc/lib/feature-fixups.c: In function 'test_fw_macros':
      arch/powerpc/lib/feature-fixups.c:306:23: error: taking address of expression of type 'void'
      arch/powerpc/lib/feature-fixups.c:307:9: error: taking address of expression of type 'void'
      arch/powerpc/lib/feature-fixups.c:310:2: error: taking address of expression of type 'void'
      arch/powerpc/lib/feature-fixups.c:310:2: error: taking address of expression of type 'void'
      arch/powerpc/lib/feature-fixups.c: In function 'test_lwsync_macros':
      arch/powerpc/lib/feature-fixups.c:321:23: error: taking address of expression of type 'void'
      arch/powerpc/lib/feature-fixups.c:322:9: error: taking address of expression of type 'void'
      arch/powerpc/lib/feature-fixups.c:326:3: error: taking address of expression of type 'void'
      arch/powerpc/lib/feature-fixups.c:326:3: error: taking address of expression of type 'void'
      arch/powerpc/lib/feature-fixups.c:329:3: error: taking address of expression of type 'void'
      arch/powerpc/lib/feature-fixups.c:329:3: error: taking address of expression of type 'void'
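
      In C terms the change amounts to the following pattern (the symbol and
      helper names here are made up, not the ones in feature-fixups.c):

      #include <linux/types.h>	/* u8; kernel build assumed */

      /*
       * Before (rejected by gcc 4.5):
       *
       *	extern void ftr_fixup_sym;
       *	addr = (unsigned long)&ftr_fixup_sym;	// "taking address of expression of type 'void'"
       *
       * After: the symbol is only ever used for its address, so any complete
       * type will do.
       */
      extern u8 ftr_fixup_sym;

      static inline unsigned long ftr_fixup_addr(void)
      {
      	return (unsigned long)&ftr_fixup_sym;
      }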
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Fix module building for gcc 4.5 and 64 bit · 7fca5dc8
      Authored by Stephen Rothwell
      GCC 4.5 now generates out-of-line register save and restore calls in
      the function prologue and epilogue when we use -Os.
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  18. 22 Jun 2010, 2 commits
    • powerpc, hw_breakpoints: Implement hw_breakpoints for 64-bit server processors · 5aae8a53
      Authored by K.Prasad
      Implement perf-events based hw-breakpoint interfaces for PowerPC
      64-bit server (Book III S) processors.  This allows access to a
      given location to be used as an event that can be counted or
      profiled by the perf_events subsystem.
      
      This is done using the DABR (data breakpoint register), which can
      also be used for process debugging via ptrace.  When perf_event
      hw_breakpoint support is configured in, the perf_event subsystem
      manages the DABR and arbitrates access to it, and ptrace then
      creates a perf_event when it is requested to set a data breakpoint.
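
      From userspace the result looks like any other perf event. A minimal
      sketch (error handling trimmed, and assuming a kernel built with the new
      hw_breakpoint support) that counts writes to a variable:

      #include <linux/hw_breakpoint.h>
      #include <linux/perf_event.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      static volatile long watched;

      int main(void)
      {
      	struct perf_event_attr attr;
      	long long count = -1;
      	int fd, i;

      	memset(&attr, 0, sizeof(attr));
      	attr.size = sizeof(attr);
      	attr.type = PERF_TYPE_BREAKPOINT;
      	attr.bp_type = HW_BREAKPOINT_W;			/* trigger on writes */
      	attr.bp_addr = (unsigned long)&watched;		/* the watched address */
      	attr.bp_len = HW_BREAKPOINT_LEN_8;

      	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
      	if (fd < 0) {
      		perror("perf_event_open");
      		return 1;
      	}

      	for (i = 0; i < 5; i++)
      		watched = i;				/* each store should count */

      	if (read(fd, &count, sizeof(count)) != sizeof(count))
      		count = -1;
      	printf("breakpoint hits: %lld\n", count);
      	close(fd);
      	return 0;
      }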
      
      [Adopted suggestions from Paul Mackerras <paulus@samba.org> to
      - emulate_step() all system-wide breakpoints and single-step only the
        per-task breakpoints
      - perform arch-specific cleanup before unregistration through
        arch_unregister_hw_breakpoint()
      ]
      Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
    • powerpc: Emulate most Book I instructions in emulate_step() · 0016a4cf
      Authored by Paul Mackerras
      This extends the emulate_step() function to handle a large proportion
      of the Book I instructions implemented on current 64-bit server
      processors.  The aim is to handle all the load and store instructions
      used in the kernel, plus all of the instructions that appear between
      l[wd]arx and st[wd]cx., so this handles the Altivec/VMX lvx and stvx
      and the VSX lxvd2x and stxvd2x instructions (implemented in POWER7).
      
      The new code can emulate user mode instructions, and checks the
      effective address for a load or store if the saved state is for
      user mode.  It doesn't handle little-endian mode at present.
      
      For floating-point, Altivec/VMX and VSX instructions, it checks
      that the saved MSR has the enable bit for the relevant facility
      set, and if so, assumes that the FP/VMX/VSX registers contain
      valid state, and does loads or stores directly to/from the
      FP/VMX/VSX registers, using assembly helpers in ldstfp.S.
      
      Instructions supported now include:
      * Loads and stores, including some but not all VMX and VSX instructions,
        and lmw/stmw
      * Atomic loads and stores (l[dw]arx, st[dw]cx.)
      * Arithmetic instructions (add, subtract, multiply, divide, etc.)
      * Compare instructions
      * Rotate and mask instructions
      * Shift instructions
      * Logical instructions (and, or, xor, etc.)
      * Condition register logical instructions
      * mtcrf, cntlz[wd], exts[bhw]
      * isync, sync, lwsync, ptesync, eieio
      * Cache operations (dcbf, dcbst, dcbt, dcbtst)
      
      The overflow-checking arithmetic instructions are not included, but
      they do not appear to ever be used in C code.
      
      This uses decimal values for the minor opcodes in the switch statements
      because that is what appears in the Power ISA specification, and it is
      thus easier to check that they are correct if they are in decimal.
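
      As a toy illustration of that decode style (this is not the kernel's
      emulate_step(), just the opcode field extraction with a few decimal
      minor opcodes):

      #include <stdint.h>
      #include <stdio.h>

      static const char *decode(uint32_t instr)
      {
      	unsigned int op = instr >> 26;			/* primary opcode */
      	unsigned int xop = (instr >> 1) & 0x3ff;	/* minor opcode, instruction bits 21-30 */

      	if (op != 31)
      		return "not an opcode-31 instruction";

      	switch (xop) {
      	case 20:  return "lwarx";
      	case 150: return "stwcx.";
      	case 266: return "add";
      	case 444: return "or";
      	default:  return "unhandled";
      	}
      }

      int main(void)
      {
      	printf("%s\n", decode(0x7c221a14));	/* add r1,r2,r3 -> minor opcode 266 */
      	return 0;
      }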
      
      If this is used to single-step an instruction where a data breakpoint
      interrupt occurred, then there is the possibility that the instruction
      is a lwarx or ldarx.  In that case we have to be careful not to lose the
      reservation until we get to the matching st[wd]cx., or we'll never make
      forward progress.  One alternative is to try to arrange that we can
      return from interrupts and handle data breakpoint interrupts without
      losing the reservation, which means not using any spinlocks, mutexes,
      or atomic ops (including bitops).  That seems rather fragile.  The
      other alternative is to emulate the larx/stcx and all the instructions
      in between.  This is why this commit adds support for a wide range
      of integer instructions.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  19. 21 May 2010, 1 commit
  20. 07 Apr 2010, 1 commit
    • powerpc: Fix handling of strncmp with zero len · 637a9902
      Authored by Jeff Mahoney
      Commit 0119536c, which added the assembly version of strncmp to
      powerpc, mentions that it adds two instructions to the version from
      boot/string.S to allow it to handle len=0. Unfortunately, it doesn't
      always return 0 when that is the case. The length is passed in r5, but
      the return value is passed back in r3. In certain cases, this will
      happen to work. Otherwise it will pass back the address of the first
      string as the return value.
      
      This patch lifts the len <= 0 handling code from memcpy to handle that
      case.
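
      For reference, the required semantics at the C level are simply these
      (a portable reference implementation, not the powerpc assembly being
      patched):

      #include <assert.h>
      #include <stddef.h>

      /* Reference semantics only: with len == 0 the result must be 0, no matter
       * what r3 (the first-string pointer) happens to hold on entry. */
      int strncmp_ref(const char *s1, const char *s2, size_t len)
      {
      	for (; len > 0; len--, s1++, s2++) {
      		if (*s1 != *s2)
      			return (unsigned char)*s1 - (unsigned char)*s2;
      		if (*s1 == '\0')
      			break;
      	}
      	return 0;
      }

      int main(void)
      {
      	assert(strncmp_ref("abc", "xyz", 0) == 0);	/* the case the patch fixes */
      	assert(strncmp_ref("abc", "abd", 3) < 0);
      	return 0;
      }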
      
      Reported by: Christian_Sellars@symantec.com
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      CC: <stable@kernel.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  21. 30 Mar 2010, 1 commit
    • include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h · 5a0e3ad6
      Authored by Tejun Heo
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include those
      headers directly instead of assuming their availability.  As this
      conversion needs to touch a large number of source files, the following
      script is used as the basis of the conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following:
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there, i.e. if only gfp is used,
        gfp.h, and if slab is used, slab.h (see the short example after
        this list).
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surroundings.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
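
      For instance, a file that only uses the slab allocator ends up with an
      explicit slab.h include rather than relying on percpu.h to provide it
      (an illustrative fragment, not from any specific file):

      #include <linux/slab.h>	/* kmalloc()/kfree(); also pulls in gfp.h */

      static int *alloc_counter(void)
      {
      	return kmalloc(sizeof(int), GFP_KERNEL);
      }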
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
  22. 26 Feb 2010, 1 commit
    • powerpc: Fix lwsync feature fixup vs. modules on 64-bit · 3d98ffbf
      Authored by Benjamin Herrenschmidt
      Anton's commit enabling the use of the lwsync fixup mechanism on 64-bit
      breaks modules. The lwsync fixup section uses .long instead of the
      FTR_ENTRY_OFFSET macro used by other fixups sections, and thus will
      generate 32-bit relocations that our module loader cannot resolve.
      
      This changes it to use the same type as other feature sections.
      
      Note however that we might want to consider using 32-bit for all the
      feature fixup offsets and add support for R_PPC_REL32 to module_64.c
      instead as that would reduce the size of the kernel image. I'll leave
      that as an exercise for the reader for now...
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  23. 17 Feb 2010, 3 commits
    • powerpc: Improve 64bit copy_tofrom_user · 789c299c
      Authored by Anton Blanchard
      Here is a patch from Paul Mackerras that improves the ppc64 copy_tofrom_user.
      The loop now does 32 bytes at a time, as well as pairing loads and stores.
      
      A quick test case that reads 8kB over and over shows the improvement:
      
      POWER6: 53% faster
      POWER7: 51% faster
      
      #define _XOPEN_SOURCE 500
      #include <stdlib.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      
      #define BUFSIZE (8 * 1024)
      #define ITERATIONS 10000000
      
      int main()
      {
      	char tmpfile[] = "/tmp/copy_to_user_testXXXXXX";
      	int fd;
      	char buf[BUFSIZE];
      	unsigned long i;
      
      	fd = mkstemp(tmpfile);
      	if (fd < 0) {
      		perror("open");
      		exit(1);
      	}
      
      	if (write(fd, buf, BUFSIZE) != BUFSIZE) {
      		perror("open");
      		exit(1);
      	}
      
      	for (i = 0; i < ITERATIONS; i++) {
      		if (pread(fd, buf, BUFSIZE, 0) != BUFSIZE) {
      			perror("pread");
      			exit(1);
      		}
      	}
      
      	unlink(tmpfile);
      
      	return 0;
      }
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Pair loads and stores in copy_4k_page · 63e6c5b8
      Authored by Anton Blanchard
      A number of our chips like loads and stores to be paired. A small kernel
      module testcase shows the improvement of pairing loads and stores in
      copy_4k_page:
      
      POWER6: +9%
      POWER7: +1.5%
      
      #include <linux/module.h>
      #include <linux/mm.h>
      
      #define ITERATIONS 10000000
      
      static int __init copypage_init(void)
      {
      	struct timespec before, after;
      	unsigned long i;
      	struct page *destpage, *srcpage;
      	char *dest, *src;
      
      	destpage = alloc_page(GFP_KERNEL);
      	srcpage = alloc_page(GFP_KERNEL);
      
      	dest = page_address(destpage);
      	src = page_address(srcpage);
      
      	getnstimeofday(&before);
      
      	for (i = 0; i < ITERATIONS; i++)
      		copy_4K_page(dest, src);
      
      	getnstimeofday(&after);
      
      	free_page((unsigned long)dest);
      	free_page((unsigned long)src);
      
      	printk(KERN_DEBUG "copy_4K_page loop took %lu ns\n",
      		(after.tv_sec - before.tv_sec) * NSEC_PER_SEC +
      		(after.tv_nsec - before.tv_nsec));
      
      	return 0;
      }
      
      static void __exit copypage_exit(void)
      {
      }
      
      module_init(copypage_init)
      module_exit(copypage_exit)
      MODULE_LICENSE("GPL");
      MODULE_AUTHOR("Anton Blanchard");
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Fix lwsync patching code on 64bit · 53eae228
      Authored by Anton Blanchard
      do_lwsync_fixups doesn't work on 64bit; we end up writing lwsyncs to the
      wrong addresses:
      
      0:mon> di c0000001000bfacc
      c0000001000bfacc  7c2004ac      lwsync
      
      Since the lwsync section has negative offsets, we need to use a signed int
      pointer so that the value is sign extended.
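
      A standalone illustration of the bug class (the values are made up; this
      is not the kernel fixup code itself):

      /* Reading a negative 32-bit section offset through an unsigned pointer
       * zero-extends it, producing a bogus 64-bit address. */
      #include <stdio.h>

      int main(void)
      {
      	unsigned long base = 0xc0000000000a0000UL;
      	int offset = -0x10000;	/* 32-bit offset stored in the fixup section */

      	unsigned int *uview = (unsigned int *)&offset;
      	int *sview = &offset;

      	printf("zero-extended: 0x%lx\n", base + *uview);	/* wrong address */
      	printf("sign-extended: 0x%lx\n", base + *sview);	/* intended target */
      	return 0;
      }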
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  24. 15 Dec 2009, 2 commits