1. 09 Jul, 2010 (1 commit)
    • powerpc: Optimise per cpu accesses on 64bit · ae01f84b
      Committed by Anton Blanchard
      Now that we dynamically allocate the paca array, it takes an extra load
      whenever we want to access another cpu's paca. One place we do that a lot
      is per cpu variables. A simple example:
      
      DEFINE_PER_CPU(unsigned long, vara);
      unsigned long test4(int cpu)
      {
      	return per_cpu(vara, cpu);
      }
      
      This takes 4 loads, 5 if you include the actual load of the per cpu variable:
      
          ld r11,-32760(r30)  # load address of paca pointer
          ld r9,-32768(r30)   # load link address of percpu variable
          sldi r3,r29,9       # get offset into paca (each entry is 512 bytes)
          ld r0,0(r11)        # load paca pointer
          add r3,r0,r3        # paca + offset
          ld r11,64(r3)       # load paca[cpu].data_offset
      
          ldx r3,r9,r11       # load per cpu variable
      
      If we remove the ppc64 specific per_cpu_offset(), we get the generic one
      which indexes into a statically allocated array. This removes one load and
      one add:
      
          ld r11,-32760(r30)  # load address of __per_cpu_offset
          ld r9,-32768(r30)   # load link address of percpu variable
          sldi r3,r29,3       # get offset into __per_cpu_offset (each entry 8 bytes)
          ldx r11,r11,r3      # load __per_cpu_offset[cpu]
      
          ldx r3,r9,r11       # load per cpu variable
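      The generic per_cpu_offset() we fall back to is essentially just an
      array lookup; a sketch along the lines of asm-generic/percpu.h (exact
      spelling varies by kernel version):

          /* one flat array of offsets, indexed by cpu number */
          extern unsigned long __per_cpu_offset[NR_CPUS];
          #define per_cpu_offset(x) (__per_cpu_offset[x])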
      
      Having all the offsets in one array also helps when iterating over a per cpu
      variable across a number of cpus, such as in the scheduler. Before we would
      need to load one paca cacheline when calculating each per cpu offset. Now we
      have 16 (128 / sizeof(long)) per cpu offsets in each cacheline.
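      For example, summing a per cpu counter across cpus, as the scheduler
      does, now walks that dense offset array (a hypothetical test5() using
      vara from the example above):

          unsigned long test5(void)
          {
          	unsigned long total = 0;
          	int i;

          	/* 16 offsets per cacheline, vs one paca cacheline per cpu */
          	for_each_possible_cpu(i)
          		total += per_cpu(vara, i);
          	return total;
          }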
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  2. 04 Aug, 2008 (1 commit)
  3. 24 Feb, 2008 (1 commit)
    • percpu: fix DEBUG_PREEMPT per_cpu checking · 1e835278
      Committed by Hugh Dickins
      2.6.25-rc1 percpu changes broke CONFIG_DEBUG_PREEMPT's per_cpu checking
      on several architectures.  On s390, sparc64 and x86 it's been weakened to
      not checking at all; whereas on powerpc64 it's become too strict, issuing
      warnings from __raw_get_cpu_var in io_schedule and init_timer for example.
      
      Fix this by weakening powerpc's __my_cpu_offset to use the non-checking
      local_paca instead of get_paca (which itself contains such a check);
      and strengthening the generic my_cpu_offset to go the old slow way via
      smp_processor_id when CONFIG_DEBUG_PREEMPT (debug_smp_processor_id is
      where all the knowledge of what's correct when lives).
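      In outline, the change amounts to something like this (a sketch of the
      two definitions, not the literal patch):

          /* powerpc: non-checking form, safe for __raw_get_cpu_var */
          #define __my_cpu_offset() local_paca->data_offset

          /* generic: the checking form goes via debug_smp_processor_id */
          #ifdef CONFIG_DEBUG_PREEMPT
          #define my_cpu_offset per_cpu_offset(smp_processor_id())
          #else
          #define my_cpu_offset __my_cpu_offset
          #endif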
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Reviewed-by: Mike Travis <travis@sgi.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 31 Jan, 2008 (2 commits)
  5. 30 Jan, 2008 (1 commit)
  6. 03 Oct, 2007 (1 commit)
    • [POWERPC] ppc64: support CONFIG_DEBUG_PREEMPT · 048c8bc9
      Committed by Hugh Dickins
      Add CONFIG_DEBUG_PREEMPT support to ppc64: it was useful for testing
      get_paca() preemption.  Cheat a little, just use debug_smp_processor_id()
      in the debug version of get_paca(): it contains all the right checks and
      reporting, though get_paca() doesn't really use smp_processor_id().
      
      Use local_paca for what might have been called __raw_get_paca().
      Silence harmless warnings from io.h and lparcfg.c with local_paca -
      it is okay for iseries_lparcfg_data to be referencing shared_proc
      with preemption enabled: all cpus should show the same value for
      shared_proc.
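      The cheat described above boils down to roughly this (a sketch of the
      shape of the change):

          /* debug: run the preemption checks, then use local_paca anyway */
          #ifdef CONFIG_DEBUG_PREEMPT
          #define get_paca()	((void) debug_smp_processor_id(), local_paca)
          #else
          #define get_paca()	local_paca
          #endif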
      
      Why do other architectures need TRACE_IRQFLAGS_SUPPORT for DEBUG_PREEMPT?
      I don't know, ppc64 appears to get along fine without it.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  7. 20 Jul, 2007 (1 commit)
    • define new percpu interface for shared data · 5fb7dc37
      Committed by Fenghua Yu
      per cpu data section contains two types of data.  One set which is
      exclusively accessed by the local cpu and the other set which is per cpu,
      but also shared by remote cpus.  In the current kernel, these two sets are
      not clearly separated out.  This can potentially cause the same
      cacheline to be shared between the two sets of data, which will result in
      unnecessary bouncing of the cacheline between cpus.
      
      One way to fix the problem is to cacheline align the remotely accessed per
      cpu data, both at the beginning and at the end.  Because of the padding at
      both ends, this will likely cause some memory wastage, and the interface
      to achieve it is not clean.
      
      This patch:
      
      Moves the remotely accessed per cpu data (which is currently marked
      as ____cacheline_aligned_in_smp) into a different section, where all the data
      elements are cacheline aligned.  As such, it cleanly separates the local-only
      data from the remotely accessed data.
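      The new interface is a DEFINE_PER_CPU variant that drops the variable
      into its own cacheline-aligned section; roughly (a sketch, using the
      per_cpu__ name prefix and section name of this era):

          #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name)		\
          	__attribute__((__section__(".data.percpu.shared_aligned"))) \
          	__typeof__(type) per_cpu__##name			\
          	____cacheline_aligned_in_smp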
      Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
      Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: <linux-arch@vger.kernel.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 04 Jul, 2006 (1 commit)
  9. 26 Jun, 2006 (1 commit)
  10. 29 Mar, 2006 (1 commit)
  11. 23 Mar, 2006 (1 commit)
    • [PATCH] more for_each_cpu() conversions · 394e3902
      Committed by Andrew Morton
      When we stop allocating percpu memory for not-possible CPUs we must not touch
      the percpu data for not-possible CPUs at all.  The correct way of doing this
      is to test cpu_possible() or to use for_each_cpu().
      
      This patch is a kernel-wide sweep of all instances of NR_CPUS.  I found very
      few instances of this bug, if any.  But the patch converts lots of open-coded
      tests to use the preferred helper macros.
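      A typical conversion looks like this (hypothetical per cpu counter;
      for_each_cpu() of this era was later renamed for_each_possible_cpu()):

          /* before: open-coded loop touches data of not-possible cpus */
          for (i = 0; i < NR_CPUS; i++)
          	total += per_cpu(counter, i);

          /* after: the helper iterates over possible cpus only */
          for_each_cpu(i)
          	total += per_cpu(counter, i);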
      
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Acked-by: Kyle McMartin <kyle@parisc-linux.org>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Christian Zankel <chris@zankel.net>
      Cc: Philippe Elie <phil.el@wanadoo.fr>
      Cc: Nathan Scott <nathans@sgi.com>
      Cc: Jens Axboe <axboe@suse.de>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  12. 11 Jan, 2006 (1 commit)
    • [PATCH] powerpc/64: per cpu data optimisations · 7a0268fa
      Committed by Anton Blanchard
      The current ppc64 per cpu data implementation is quite slow, e.g.:
      
              lhz 11,18(13)           /* smp_processor_id() */
              ld 9,.LC63-.LCTOC1(30)  /* per_cpu__variable_name */
              ld 8,.LC61-.LCTOC1(30)  /* __per_cpu_offset */
              sldi 11,11,3            /* form index into __per_cpu_offset */
              mr 10,9
              ldx 9,11,8              /* __per_cpu_offset[smp_processor_id()] */
              ldx 0,10,9              /* load per cpu data */
      
      5 loads for something that is supposed to be fast, pretty awful. One
      reason for the large number of loads is that we have to synthesize 2
      64bit constants (per_cpu__variable_name and __per_cpu_offset).
      
      By putting __per_cpu_offset into the paca we can avoid the 2 loads
      associated with it:
      
              ld 11,56(13)            /* paca->data_offset */
              ld 9,.LC59-.LCTOC1(30)  /* per_cpu__variable_name */
              ldx 0,9,11              /* load per cpu data */
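      The powerpc per cpu offset helpers are redefined in terms of the paca
      to get there; in sketch form (this is the ppc64-specific
      per_cpu_offset() that the 2010 commit at the top of this page later
      removes):

          /* the paca is always reachable through r13: no extra constant */
          #define __my_cpu_offset() get_paca()->data_offset
          #define per_cpu_offset(x) (paca[x].data_offset)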
      
      Longer term we should be able to do even better than 3 loads.
      If per_cpu__variable_name wasn't a 64bit constant and paca->data_offset
      was in a register we could cut it down to one load. A suggestion from
      Rusty is to use gcc's __thread extension here. In order to do this we
      would need to free up r13 (the __thread register and where the paca
      currently is). So far I've had a few unsuccessful attempts at doing that :)
      
      The patch also allocates per cpu memory node local on NUMA machines.
      This patch from Rusty has been sitting in my queue _forever_ but stalled
      when I hit the compiler bug. Sorry about that.
      
      Finally I also only allocate per cpu data for possible cpus, which comes
      straight out of the x86-64 port. On a pseries kernel (with NR_CPUS == 128)
      and 4 possible cpus we see some nice gains:
      
      Before:
                   total       used       free     shared    buffers cached
      Mem:       4012228     212860    3799368          0          0 162424

      After:
                   total       used       free     shared    buffers cached
      Mem:       4016200     212984    3803216          0          0 162424
      
      A saving of 3.75MB. Quite nice for smaller machines. Note: we now have
      to be careful of per cpu users that touch data for !possible cpus.
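      The allocation side, in outline (a simplified sketch of the
      setup_per_cpu_areas() the patch adds; error handling and the
      module-space sizing are omitted):

          void __init setup_per_cpu_areas(void)
          {
          	unsigned long size;
          	char *ptr;
          	int i;

          	/* one page-aligned copy of .data.percpu per possible cpu */
          	size = ALIGN(__per_cpu_end - __per_cpu_start, PAGE_SIZE);

          	for_each_cpu(i) {
          		/* allocate on the cpu's home node */
          		ptr = alloc_bootmem_pages_node(NODE_DATA(cpu_to_node(i)),
          					       size);
          		paca[i].data_offset = ptr - __per_cpu_start;
          		memcpy(ptr, __per_cpu_start,
          		       __per_cpu_end - __per_cpu_start);
          	}
          }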
      
      At this stage it might be worth making the NUMA and possible cpu
      optimisations generic, but per cpu init is done so early we have to be
      careful that all architectures have their possible map set up correctly.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  13. 30 Aug, 2005 (1 commit)