1. 20 Jul, 2007 1 commit
    • define new percpu interface for shared data · 5fb7dc37
      Fenghua Yu committed
      The per cpu data section contains two types of data: one set that is
      accessed exclusively by the local cpu, and another set that is per cpu
      but also accessed by remote cpus.  In the current kernel, these two sets
      are not clearly separated.  As a result, data from both sets can end up
      in the same cacheline, which causes unnecessary bouncing of that
      cacheline between cpus.
      
      One way to fix the problem is to cacheline align the remotely accessed per
      cpu data, both at the beginning and at the end.  Because of the padding at
      both ends, this is likely to waste some memory, and the interface needed
      to achieve it is not clean.
      
      This patch:
      
      Moves the remotely accessed per cpu data (which is currently marked
      as ____cacheline_aligned_in_smp) into a different section, where all the data
      elements are cacheline aligned.  This cleanly separates the local-only
      data from the remotely accessed data.
      Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
      Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: <linux-arch@vger.kernel.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
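
The separation described in this commit can be illustrated with a small userspace sketch. This is not the kernel's implementation: the struct names, the 64-byte cacheline size, and the helper functions are all assumptions made for illustration, and the GCC/Clang `aligned` attribute stands in for the kernel's own alignment macros. It shows local-only fields packed together and the remotely accessed elements each aligned to their own cacheline, so the two kinds of data never share a line.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE_SIZE 64 /* assumption: 64-byte cachelines */

/* Remotely accessed element: aligned (and therefore padded) to a cacheline. */
struct shared_elem {
    unsigned long value;
} __attribute__((aligned(CACHELINE_SIZE)));

/* Hypothetical layout of one cpu's area: local-only data first, then a
 * separate region that holds only cacheline-aligned shared elements. */
struct percpu_layout {
    unsigned long local_a;       /* touched only by the owning cpu */
    unsigned long local_b;       /* touched only by the owning cpu */
    struct shared_elem shared_x; /* also accessed by remote cpus */
    struct shared_elem shared_y; /* also accessed by remote cpus */
};

/* Index of the cacheline, relative to the area start, holding an offset. */
static size_t cacheline_of(size_t offset)
{
    return offset / CACHELINE_SIZE;
}

/* Returns 1 if the last local field shares a cacheline with shared data. */
int local_and_shared_overlap(void)
{
    size_t last_local = cacheline_of(offsetof(struct percpu_layout, local_b)
                                     + sizeof(unsigned long) - 1);
    size_t first_shared =
        cacheline_of(offsetof(struct percpu_layout, shared_x));

    return last_local >= first_shared;
}

/* Aborts if the layout would allow false sharing between the two sets. */
void assert_no_false_sharing(void)
{
    assert(offsetof(struct percpu_layout, shared_x) % CACHELINE_SIZE == 0);
    assert(!local_and_shared_overlap());
}
```

Calling `assert_no_false_sharing()` from a test program checks the layout at run time. Because every shared element is padded to a full cacheline, only the shared region pays the padding cost, which is the trade-off the commit message describes.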
  2. 04 Jul, 2006 1 commit
  3. 26 Jun, 2006 1 commit
  4. 29 Mar, 2006 1 commit
  5. 23 Mar, 2006 1 commit
    • [PATCH] more for_each_cpu() conversions · 394e3902
      Andrew Morton committed
      When we stop allocating percpu memory for not-possible CPUs we must not touch
      the percpu data for not-possible CPUs at all.  The correct way of doing this
      is to test cpu_possible() or to use for_each_cpu().
      
      This patch is a kernel-wide sweep of all instances of NR_CPUS.  I found very
      few instances of this bug, if any.  But the patch converts lots of open-coded
      tests to use the preferred helper macros.
      
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Acked-by: Kyle McMartin <kyle@parisc-linux.org>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Christian Zankel <chris@zankel.net>
      Cc: Philippe Elie <phil.el@wanadoo.fr>
      Cc: Nathan Scott <nathans@sgi.com>
      Cc: Jens Axboe <axboe@suse.de>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
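
The rule stated in this commit (never touch percpu data for not-possible CPUs) can be sketched in userspace C. Everything here is a stand-in invented for illustration: `possible_map`, `cpu_possible_sim()`, `for_each_possible_sim()`, and the `NR_CPUS` value of 8 only mimic the shape of the kernel helpers and are not the kernel's definitions. The point is that a loop bounded by `NR_CPUS` alone would dereference a NULL slot, while the helper macro skips it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 8 /* illustrative compile-time maximum */

/* Hypothetical possible-cpu map: only cpus 0, 1 and 3 can ever exist. */
static const bool possible_map[NR_CPUS] = {
    true, true, false, true, false, false, false, false
};

static bool cpu_possible_sim(int cpu)
{
    return possible_map[cpu];
}

/* Stand-in for the kernel's iteration helper: visits possible cpus only. */
#define for_each_possible_sim(cpu)                  \
    for ((cpu) = 0; (cpu) < NR_CPUS; (cpu)++)       \
        if (!cpu_possible_sim(cpu))                 \
            continue;                               \
        else

static long backing[NR_CPUS];         /* storage used by possible cpus */
static long *percpu_counter[NR_CPUS]; /* NULL for not-possible cpus */

/* Allocate (here: point at) per-cpu storage for possible cpus only. */
void percpu_init_sim(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        percpu_counter[cpu] = cpu_possible_sim(cpu) ? &backing[cpu] : NULL;
}

/* Sum all counters without ever touching a not-possible cpu's slot. */
long sum_counters(void)
{
    long total = 0;
    int cpu;

    for_each_possible_sim(cpu) {
        assert(percpu_counter[cpu] != NULL);
        total += *percpu_counter[cpu];
    }
    return total;
}
```

An open-coded `for (cpu = 0; cpu < NR_CPUS; cpu++)` loop in `sum_counters()` would need its own `cpu_possible_sim()` test to stay safe, which is the kind of pattern the commit replaces with helper macros.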
  6. 11 Jan, 2006 1 commit
    • [PATCH] powerpc/64: per cpu data optimisations · 7a0268fa
      Anton Blanchard committed
      The current ppc64 per cpu data implementation is quite slow, e.g.:
      
              lhz 11,18(13)           /* smp_processor_id() */
              ld 9,.LC63-.LCTOC1(30)  /* per_cpu__variable_name */
              ld 8,.LC61-.LCTOC1(30)  /* __per_cpu_offset */
              sldi 11,11,3            /* form index into __per_cpu_offset */
              mr 10,9
              ldx 9,11,8              /* __per_cpu_offset[smp_processor_id()] */
              ldx 0,10,9              /* load per cpu data */
      
      5 loads for something that is supposed to be fast, pretty awful. One
      reason for the large number of loads is that we have to synthesize 2
      64bit constants (per_cpu__variable_name and __per_cpu_offset).
      
      By putting __per_cpu_offset into the paca we can avoid the 2 loads
      associated with it:
      
              ld 11,56(13)            /* paca->data_offset */
              ld 9,.LC59-.LCTOC1(30)  /* per_cpu__variable_name */
              ldx 0,9,11              /* load per cpu data */
      
      Longer term we should be able to do even better than 3 loads.
      If per_cpu__variable_name wasn't a 64bit constant and paca->data_offset
      was in a register we could cut it down to one load. A suggestion from
      Rusty is to use gcc's __thread extension here. In order to do this we
      would need to free up r13 (the __thread register and where the paca
      currently is). So far I've had a few unsuccessful attempts at doing that :)
      
      The patch also allocates per cpu memory node local on NUMA machines.
      This patch from Rusty has been sitting in my queue _forever_ but stalled
      when I hit the compiler bug. Sorry about that.
      
      Finally I also only allocate per cpu data for possible cpus, which comes
      straight out of the x86-64 port. On a pseries kernel (with NR_CPUS == 128)
      and 4 possible cpus we see some nice gains:
      
                   total       used       free     shared    buffers     cached
      Mem:       4012228     212860    3799368          0          0     162424

                   total       used       free     shared    buffers     cached
      Mem:       4016200     212984    3803216          0          0     162424
      
      A saving of 3.75MB. Quite nice for smaller machines. Note: we now have
      to be careful of per cpu users that touch data for !possible cpus.
      
      At this stage it might be worth making the NUMA and possible cpu
      optimisations generic, but per cpu init is done so early that we have to
      be careful that all architectures have their possible map set up correctly.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
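
The difference between the two lookup paths in this commit can be sketched in portable C. This is a simplified model, not the ppc64 code: `struct paca_sim`, `per_cpu_offset_sim`, `VAR_SLOT`, and the slot-based offsets are hypothetical names and simplifications chosen for illustration. The table-based path reads the cpu id and then indexes an offset array, while the paca-based path reads a cached offset directly from the per-cpu structure and so needs one fewer dependent load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS_SIM   4 /* illustrative cpu count */
#define SLOTS_PER_CPU 8 /* illustrative size of one cpu's area, in longs */
#define VAR_SLOT      2 /* hypothetical position of one per-cpu variable */

/* Simplified stand-in for the per-cpu structure the commit calls the paca. */
struct paca_sim {
    uint16_t cpu_id;    /* what smp_processor_id() would read */
    size_t data_offset; /* cached copy of this cpu's area offset */
};

static long percpu_area[NR_CPUS_SIM * SLOTS_PER_CPU]; /* one area per cpu */
static size_t per_cpu_offset_sim[NR_CPUS_SIM];        /* global table */
static struct paca_sim pacas[NR_CPUS_SIM];

/* Fill the global offset table and mirror each entry into the paca. */
void setup_sim(void)
{
    for (int cpu = 0; cpu < NR_CPUS_SIM; cpu++) {
        per_cpu_offset_sim[cpu] = (size_t)cpu * SLOTS_PER_CPU;
        pacas[cpu].cpu_id = (uint16_t)cpu;
        pacas[cpu].data_offset = per_cpu_offset_sim[cpu];
    }
}

/* Old path: read the cpu id, index the table, then reach the variable. */
long *lookup_via_table(const struct paca_sim *paca)
{
    assert(paca->cpu_id < NR_CPUS_SIM);
    return &percpu_area[per_cpu_offset_sim[paca->cpu_id] + VAR_SLOT];
}

/* New path: the offset comes straight from the paca, no table lookup. */
long *lookup_via_paca(const struct paca_sim *paca)
{
    return &percpu_area[paca->data_offset + VAR_SLOT];
}
```

After `setup_sim()`, both functions return the same address for every cpu; the second simply avoids the indexed table access that the commit message counts among the extra loads.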
  7. 30 Aug, 2005 1 commit