1. 01 Jun, 2013 1 commit
  2. 29 Jan, 2013 1 commit
  3. 10 Jan, 2013 1 commit
    • powerpc: Build kernel with -mcmodel=medium · 1fbe9cf2
      Anton Blanchard committed
      Finally remove the two level TOC and build with -mcmodel=medium.
      
      Unfortunately we can't build modules with -mcmodel=medium due to
      the tricks the kernel module loader plays with percpu data:
      
      # -mcmodel=medium breaks modules because it uses 32bit offsets from
      # the TOC pointer to create pointers where possible. Pointers into the
      # percpu data area are created by this method.
      #
      # The kernel module loader relocates the percpu data section from the
      # original location (starting with 0xd...) to somewhere in the base
      # kernel percpu data space (starting with 0xc...). We need a full
      # 64bit relocation for this to work, hence -mcmodel=large.
      
      On older compilers we fall back to the two level TOC (-mminimal-toc)
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      1fbe9cf2
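
The reach argument in the commit message above can be checked with a little arithmetic. The following is a minimal sketch, not kernel code: the helper name `fits_s32_offset` and the concrete addresses are hypothetical, chosen only to resemble the 0xd... (module) and 0xc... (base kernel) regions the message mentions. It tests whether a target can be expressed as a signed 32-bit offset from a TOC pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Can `target` be reached from `toc` with a signed 32-bit offset,
 * i.e. is (target - toc) within [-2^31, 2^31 - 1]? */
bool fits_s32_offset(uint64_t toc, uint64_t target)
{
    if (target >= toc)
        return target - toc <= (uint64_t)INT32_MAX;
    return toc - target <= (uint64_t)INT32_MAX + 1;
}

void example_toc_reach(void)
{
    /* Illustrative addresses only (assumptions, not taken from a real kernel). */
    uint64_t module_toc    = 0xd000000000008000ULL;
    uint64_t module_percpu = 0xd000000000010000ULL; /* before relocation */
    uint64_t kernel_percpu = 0xc000000001500000ULL; /* after relocation */

    /* A nearby address in the module's own region is reachable... */
    assert(fits_s32_offset(module_toc, module_percpu));
    /* ...but the relocated percpu area is far outside a 32-bit offset. */
    assert(!fits_s32_offset(module_toc, kernel_percpu));
}
```

Under these assumed addresses the relocated target is roughly 2^60 bytes away from the TOC pointer, which is why a TOC-relative 32-bit offset cannot describe it and a full 64-bit relocation is needed.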
  4. 04 Oct, 2012 1 commit
    • powerpc: Fix VMX fix for memcpy case · c8adfecc
      Nishanth Aravamudan committed
      In 2fae7cdb ("powerpc: Fix VMX in
      interrupt check in POWER7 copy loops"), Anton inadvertently
      introduced a regression for memcpy on POWER7 machines. copyuser and
      memcpy diverge slightly in their use of cr1 (copyuser doesn't use it,
      but memcpy does) and you end up clobbering that register with your fix.
      That results in (taken from an FC18 kernel):
      
      [   18.824604] Unrecoverable VMX/Altivec Unavailable Exception f20 at c000000000052f40
      [   18.824618] Oops: Unrecoverable VMX/Altivec Unavailable Exception, sig: 6 [#1]
      [   18.824623] SMP NR_CPUS=1024 NUMA pSeries
      [   18.824633] Modules linked in: tg3(+) be2net(+) cxgb4(+) ipr(+) sunrpc xts lrw gf128mul dm_crypt dm_round_robin dm_multipath linear raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua squashfs cramfs
      [   18.824705] NIP: c000000000052f40 LR: c00000000020b874 CTR: 0000000000000512
      [   18.824709] REGS: c000001f1fef7790 TRAP: 0f20   Not tainted  (3.6.0-0.rc6.git0.2.fc18.ppc64)
      [   18.824713] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 4802802e  XER: 20000010
      [   18.824726] SOFTE: 0
      [   18.824728] CFAR: 0000000000000f20
      [   18.824731] TASK = c000000fa7128400[0] 'swapper/24' THREAD: c000000fa7480000 CPU: 24
      GPR00: 00000000ffffffc0 c000001f1fef7a10 c00000000164edc0 c000000f9b9a8120
      GPR04: c000000f9b9a8124 0000000000001438 0000000000000060 03ffffff064657ee
      GPR08: 0000000080000000 0000000000000010 0000000000000020 0000000000000030
      GPR12: 0000000028028022 c00000000ff25400 0000000000000001 0000000000000000
      GPR16: 0000000000000000 7fffffffffffffff c0000000016b2180 c00000000156a500
      GPR20: c000000f968c7a90 c0000000131c31d8 c000001f1fef4000 c000000001561d00
      GPR24: 000000000000000a 0000000000000000 0000000000000001 0000000000000012
      GPR28: c000000fa5c04f80 00000000000008bc c0000000015c0a28 000000000000022e
      [   18.824792] NIP [c000000000052f40] .memcpy_power7+0x5a0/0x7c4
      [   18.824797] LR [c00000000020b874] .pcpu_free_area+0x174/0x2d0
      [   18.824800] Call Trace:
      [   18.824803] [c000001f1fef7a10] [c000000000052c14] .memcpy_power7+0x274/0x7c4 (unreliable)
      [   18.824809] [c000001f1fef7b10] [c00000000020b874] .pcpu_free_area+0x174/0x2d0
      [   18.824813] [c000001f1fef7bb0] [c00000000020ba88] .free_percpu+0xb8/0x1b0
      [   18.824819] [c000001f1fef7c50] [c00000000043d144] .throtl_pd_exit+0x94/0xd0
      [   18.824824] [c000001f1fef7cf0] [c00000000043acf8] .blkg_free+0x88/0xe0
      [   18.824829] [c000001f1fef7d90] [c00000000018c048] .rcu_process_callbacks+0x2e8/0x8a0
      [   18.824835] [c000001f1fef7e90] [c0000000000a8ce8] .__do_softirq+0x158/0x4d0
      [   18.824840] [c000001f1fef7f90] [c000000000025ecc] .call_do_softirq+0x14/0x24
      [   18.824845] [c000000fa7483650] [c000000000010e80] .do_softirq+0x160/0x1a0
      [   18.824850] [c000000fa74836f0] [c0000000000a94a4] .irq_exit+0xf4/0x120
      [   18.824854] [c000000fa7483780] [c000000000020c44] .timer_interrupt+0x154/0x4d0
      [   18.824859] [c000000fa7483830] [c000000000003be0] decrementer_common+0x160/0x180
      [   18.824866] --- Exception: 901 at .plpar_hcall_norets+0x84/0xd4
      [   18.824866]     LR = .check_and_cede_processor+0x48/0x80
      [   18.824871] [c000000fa7483b20] [c00000000007f018] .check_and_cede_processor+0x18/0x80 (unreliable)
      [   18.824877] [c000000fa7483b90] [c00000000007f104] .dedicated_cede_loop+0x84/0x150
      [   18.824883] [c000000fa7483c50] [c0000000006bc030] .cpuidle_enter+0x30/0x50
      [   18.824887] [c000000fa7483cc0] [c0000000006bc9f4] .cpuidle_idle_call+0x104/0x720
      [   18.824892] [c000000fa7483d80] [c000000000070af8] .pSeries_idle+0x18/0x40
      [   18.824897] [c000000fa7483df0] [c000000000019084] .cpu_idle+0x1a4/0x380
      [   18.824902] [c000000fa7483ec0] [c0000000008a4c18] .start_secondary+0x520/0x528
      [   18.824907] [c000000fa7483f90] [c0000000000093f0] .start_secondary_prolog+0x10/0x14
      [   18.824911] Instruction dump:
      [   18.824914] 38840008 90030000 90e30004 38630008 7ca62850 7cc300d0 78c7e102 7cf01120
      [   18.824923] 78c60660 39200010 39400020 39600030 <7e00200c> 7c0020ce 38840010 409f001c
      [   18.824935] ---[ end trace 0bb95124affaaa45 ]---
      [   18.825046] Unrecoverable VMX/Altivec Unavailable Exception f20 at c000000000052d08
      
      I believe the right fix is to make memcpy match usercopy and not use
      cr1.
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: <stable@kernel.org> [v3.6]
      c8adfecc
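
As a reading aid for the oops above, here is a small sketch that decodes an MSR value into the flag names printed on the `MSR:` line. The bit positions follow the 64-bit PowerPC MSR layout as I understand it (SF=63, VEC=25, EE=15, PR=14, FP=13, ME=12, IR=5, DR=4, RI=1); the table is deliberately partial, and the names `msr_bits` and `msr_decode` are hypothetical, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct msr_bit {
    unsigned shift;
    const char *name;
};

/* Partial table, highest bit first (assumed bit positions, see lead-in). */
static const struct msr_bit msr_bits[] = {
    {63, "SF"}, {25, "VEC"}, {15, "EE"}, {14, "PR"}, {13, "FP"},
    {12, "ME"}, {5, "IR"},   {4, "DR"},  {1, "RI"},
};

/* Write a comma-separated list of the set flags into buf. */
void msr_decode(uint64_t msr, char *buf, size_t buflen)
{
    size_t used = 0;

    assert(buf != NULL && buflen > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(msr_bits) / sizeof(msr_bits[0]); i++) {
        if (!((msr >> msr_bits[i].shift) & 1))
            continue;
        int n = snprintf(buf + used, buflen - used, "%s%s",
                         used ? "," : "", msr_bits[i].name);
        if (n < 0 || (size_t)n >= buflen - used) {
            buf[used] = '\0'; /* drop the partially written name */
            break;
        }
        used += (size_t)n;
    }
}

void example_msr_decode(void)
{
    char buf[64];

    /* MSR value copied from the oops above. */
    msr_decode(0x8000000000009032ULL, buf, sizeof(buf));
    assert(strcmp(buf, "SF,EE,ME,IR,DR,RI") == 0);
    /* The VEC bit is clear in that value. */
    assert(strstr(buf, "VEC") == NULL);
}
```

With the VEC bit clear, a VMX instruction in the copy loop raises the "VMX/Altivec Unavailable" exception seen in the trace, which is consistent with the VMX path being entered after the interrupt check went wrong.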
  5. 18 Sep, 2012 1 commit
  6. 05 Sep, 2012 1 commit
  7. 24 Aug, 2012 2 commits
  8. 11 Jul, 2012 1 commit
  9. 10 Jul, 2012 4 commits
  10. 03 Jul, 2012 8 commits
  11. 28 May, 2012 1 commit
  12. 30 Apr, 2012 1 commit
  13. 29 Mar, 2012 1 commit
  14. 21 Mar, 2012 1 commit
  15. 19 Dec, 2011 1 commit
    • powerpc: POWER7 optimised copy_to_user/copy_from_user using VMX · a66086b8
      Anton Blanchard committed
      Implement a POWER7 optimised copy_to_user/copy_from_user using VMX.
      For large aligned copies this new loop is over 10% faster, and for
      large unaligned copies it is over 200% faster.
      
      If we take a fault we fall back to the old version, this keeps
      things relatively simple and easy to verify.
      
      On POWER7 unaligned stores rarely slow down - they only flush when
      a store crosses a 4KB page boundary. Furthermore this flush is
      handled completely in hardware and should be 20-30 cycles.
      
      Unaligned loads on the other hand flush much more often - whenever
      crossing a 128 byte cache line, or a 32 byte sector if either sector
      is an L1 miss.
      
      Considering this information we really want to get the loads aligned
      and not worry about the alignment of the stores. Microbenchmarks
      confirm that this approach is much faster than the current unaligned
      copy loop that uses shifts and rotates to ensure both loads and
      stores are aligned.
      
      We also want to try and do the stores in cacheline aligned, cacheline
      sized chunks. If the store queue is unable to merge an entire
      cacheline of stores then the L2 cache will have to do a
      read/modify/write. Even worse, we will serialise this with the stores
      in the next iteration of the copy loop since both iterations hit
      the same cacheline.
      
      Based on this, the new loop does the following things:
      
      1 - 127 bytes
      Get the source 8 byte aligned and use 8 byte loads and stores. Pretty
      boring and similar to how the current loop works.
      
      128 - 4095 bytes
      Get the source 8 byte aligned and use 8 byte loads and stores,
      1 cacheline at a time. We aren't doing the stores in cacheline
      aligned chunks so we will potentially serialise once per cacheline.
      Even so it is much better than the loop we have today.
      
      4096 bytes and above
      If both source and destination have the same alignment get them both
      16 byte aligned, then get the destination cacheline aligned. Do
      cacheline sized loads and stores using VMX.
      
      If source and destination do not have the same alignment, we get the
      destination cacheline aligned, and use permute to do aligned loads.
      
      In both cases the VMX loop should be optimal - we always do aligned
      loads and stores and are always doing stores in cacheline aligned,
      cacheline sized chunks.
      
      To be able to use VMX we must be careful about interrupts and
      sleeping. We don't use the VMX loop when in an interrupt (which should
      be rare anyway) and we wrap the VMX loop in disable/enable_pagefault
      and fall back to the existing copy_tofrom_user loop if we do need to
      sleep.
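
The size tiers and the interrupt rule described so far can be summarised as a small dispatch function. This is only a sketch in C of the decision logic; the real loops are written in assembly, and the names `copy_strategy`, `choose_copy_strategy`, `CACHELINE_BYTES` and `VMX_BREAKPOINT` are hypothetical. It assumes that a large copy made in interrupt context uses the non-VMX cacheline loop, and it does not model the fallback to the existing loop when a page fault would need to sleep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum copy_strategy {
    COPY_SMALL,     /* 1 - 127 bytes: 8 byte loads and stores */
    COPY_CACHELINE, /* 128 - 4095 bytes: 8 byte accesses, one cacheline at a time */
    COPY_VMX,       /* 4096 bytes and above: cacheline sized VMX loads and stores */
};

#define CACHELINE_BYTES 128
#define VMX_BREAKPOINT  4096

/* Pick a copy loop from the length and the calling context. */
enum copy_strategy choose_copy_strategy(size_t len, bool in_interrupt)
{
    if (len < CACHELINE_BYTES)
        return COPY_SMALL;
    /* Assumption: in interrupt context we stay on the non-VMX loop. */
    if (len < VMX_BREAKPOINT || in_interrupt)
        return COPY_CACHELINE;
    return COPY_VMX;
}

void example_choose_copy_strategy(void)
{
    assert(choose_copy_strategy(127, false) == COPY_SMALL);
    assert(choose_copy_strategy(128, false) == COPY_CACHELINE);
    assert(choose_copy_strategy(4095, false) == COPY_CACHELINE);
    assert(choose_copy_strategy(4096, false) == COPY_VMX);
    assert(choose_copy_strategy(65536, true) == COPY_CACHELINE);
}
```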
      
      The VMX breakpoint of 4096 bytes was chosen using this microbenchmark:
      
      http://ozlabs.org/~anton/junkcode/copy_to_user.c
      
      Since we are using VMX and there is a cost to saving and restoring
      the user VMX state there are two broad cases we need to benchmark:
      
      - Best case - userspace never uses VMX
      
      - Worst case - userspace always uses VMX
      
      In reality a userspace process will sit somewhere between these two
      extremes. Since we need to test both aligned and unaligned copies we
      end up with 4 combinations. The point at which the VMX loop begins to
      win is:
      
      0% VMX
      aligned		2048 bytes
      unaligned	2048 bytes
      
      100% VMX
      aligned		16384 bytes
      unaligned	8192 bytes
      
      Considering this is a microbenchmark, the data is hot in cache and
      the VMX loop has better store queue merging properties we set the
      breakpoint to 4096 bytes, a little below the unaligned breakpoints.
      
      Some future optimisations we can look at:
      
      - Looking at the perf data, a significant part of the cost when a
        task is always using VMX is the extra exception we take to restore
        the VMX state. As such we should do something similar to the x86
        optimisation that restores FPU state for heavy users. ie:
      
              /*
               * If the task has used fpu the last 5 timeslices, just do a full
               * restore of the math state immediately to avoid the trap; the
               * chances of needing FPU soon are obviously high now
               */
              preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;
      
        and
      
              /*
               * fpu_counter contains the number of consecutive context switches
               * that the FPU is used. If this is over a threshold, the lazy fpu
               * saving becomes unlazy to save the trap. This is an unsigned char
               * so that after 256 times the counter wraps and the behavior turns
               * lazy again; this to deal with bursty apps that only use FPU for
               * a short time
               */
      
      - We could create a paca bit to mirror the VMX enabled MSR bit and check
        that first, avoiding multiple calls to enable_kernel_altivec.
        That should help with iovec based system calls like readv.
      
      - We could have two VMX breakpoints, one for when we know the user VMX
        state is loaded into the registers and one when it isn't. This could
        be a second bit in the paca so we can calculate the break points quickly.
      
      - One suggestion from Ben was to save and restore the VSX registers
        we use inline instead of using enable_kernel_altivec.
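
The counter heuristic quoted in the first bullet above (restore eagerly once a task has used the unit in more than 5 consecutive context switches, with an 8-bit counter that wraps back to lazy behaviour) can be sketched as follows. This is a standalone illustration, not the x86 or powerpc implementation; `task_sketch`, `note_unit_used` and `should_preload` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PRELOAD_THRESHOLD 5

struct task_sketch {
    bool used_math;      /* the task has used the unit at least once */
    uint8_t fpu_counter; /* consecutive switches with use; wraps 255 -> 0 */
};

/* Record one more context switch during which the task used the unit. */
void note_unit_used(struct task_sketch *t)
{
    t->used_math = true;
    t->fpu_counter++; /* 8-bit counter: the 256th increment wraps to 0 */
}

/* Should the state be restored eagerly at the next switch-in? */
bool should_preload(const struct task_sketch *t)
{
    return t->used_math && t->fpu_counter > PRELOAD_THRESHOLD;
}

void example_preload(void)
{
    struct task_sketch t = { false, 0 };

    for (int i = 0; i < 5; i++)
        note_unit_used(&t);
    assert(!should_preload(&t)); /* counter == 5, still lazy */

    note_unit_used(&t);
    assert(should_preload(&t));  /* counter == 6, restore eagerly */

    for (int i = 6; i < 256; i++)
        note_unit_used(&t);
    assert(t.fpu_counter == 0);  /* wrapped after 256 increments */
    assert(!should_preload(&t)); /* back to lazy behaviour */
}
```

The wrap is what lets a bursty task drift back to the lazy path without any extra bookkeeping.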
      
      [BenH: Fixed a problem with preempt and fixed build without CONFIG_ALTIVEC]
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      a66086b8
  16. 16 Nov, 2011 1 commit
  17. 01 Nov, 2011 1 commit
  18. 21 May, 2011 1 commit
    • sanitize <linux/prefetch.h> usage · 268bb0ce
      Linus Torvalds committed
      Commit e66eed65 ("list: remove prefetching from regular list
      iterators") removed the include of prefetch.h from list.h, which
      uncovered several cases that had apparently relied on that rather
      obscure header file dependency.
      
      So this fixes things up a bit, using
      
         grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
         grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
      
      to guide us in finding files that either need <linux/prefetch.h>
      inclusion, or have it despite not needing it.
      
      There are more of them around (mostly network drivers), but this gets
      many core ones.
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      268bb0ce
  19. 19 May, 2011 3 commits
  20. 27 Apr, 2011 1 commit
  21. 21 Jan, 2011 1 commit
    • powerpc: Ensure the else case of feature sections will fit · c0337288
      Michael Ellerman committed
      When we create an alternative feature section, the else case must be the
      same size or smaller than the body. This is because when we patch the
      else case in we just overwrite the body, so there must be room.
      
      Up to now we just did this by inspection, but it's quite easy to enforce
      it in the assembler, so we should.
      
      The only change is to add the ifgt block, but that affects the alignment
      of the tabs and so the whole macro is modified.
      
      Also add a test, but #if 0 it because we don't want to break the build.
      Anyone who's modifying the feature macros should enable the test.
      Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      c0337288
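
The size constraint from the commit above can be illustrated with a runtime sketch. The commit itself enforces the rule at build time in the assembler; the C below only shows why the rule exists. The function name `patch_else_case` is hypothetical, and padding the leftover space with nops (`ori 0,0,0`, encoded as 0x60000000) is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PPC_NOP 0x60000000u /* ori 0,0,0 */

/* Overwrite `body` with the else case `alt`, padding the rest with nops.
 * Returns 0 on success, -1 if the else case does not fit. */
int patch_else_case(uint32_t *body, size_t body_insns,
                    const uint32_t *alt, size_t alt_insns)
{
    assert(body != NULL && (alt != NULL || alt_insns == 0));

    if (alt_insns > body_insns)
        return -1; /* no room: this is the case the assembler check rejects */

    if (alt_insns > 0)
        memcpy(body, alt, alt_insns * sizeof(*body));
    for (size_t i = alt_insns; i < body_insns; i++)
        body[i] = PPC_NOP;
    return 0;
}

void example_patch_else_case(void)
{
    /* Placeholder instruction words, not real encodings. */
    uint32_t body[4] = { 0x11111111u, 0x22222222u, 0x33333333u, 0x44444444u };
    const uint32_t fits[2] = { 0xaaaaaaaau, 0xbbbbbbbbu };
    const uint32_t too_big[5] = { 1, 2, 3, 4, 5 };

    assert(patch_else_case(body, 4, too_big, 5) == -1);
    assert(body[0] == 0x11111111u); /* body untouched on failure */

    assert(patch_else_case(body, 4, fits, 2) == 0);
    assert(body[0] == 0xaaaaaaaau && body[1] == 0xbbbbbbbbu);
    assert(body[2] == PPC_NOP && body[3] == PPC_NOP);
}
```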
  22. 09 Dec, 2010 1 commit
  23. 29 Nov, 2010 1 commit
  24. 13 Oct, 2010 1 commit
  25. 02 Sep, 2010 3 commits