1. 20 8月, 2011 1 次提交
    • C
      slub: per cpu cache for partial pages · 49e22585
      Christoph Lameter 提交于
      Allow filling out the rest of the kmem_cache_cpu cacheline with pointers to
      partial pages. The partial page list is used in slab_free() to avoid
      per node lock taking.
      
      In __slab_alloc() we can then take multiple partial pages off the per
      node partial list in one go reducing node lock pressure.
      
      We can also use the per cpu partial list in slab_alloc() to avoid scanning
      partial lists for pages with free objects.
      
      The main effect of a per cpu partial list is that the per node list_lock
      is taken for batches of partial pages instead of individual ones.
      
      Potential future enhancements:
      
      1. The pickup from the partial list could be perhaps be done without disabling
         interrupts with some work. The free path already puts the page into the
         per cpu partial list without disabling interrupts.
      
      2. __slab_free() may have some code paths that could use optimization.
      
      Performance:
      
      				Before		After
      ./hackbench 100 process 200000
      				Time: 1953.047	1564.614
      ./hackbench 100 process 20000
      				Time: 207.176   156.940
      ./hackbench 100 process 20000
      				Time: 204.468	156.940
      ./hackbench 100 process 20000
      				Time: 204.879	158.772
      ./hackbench 10 process 20000
      				Time: 20.153	15.853
      ./hackbench 10 process 20000
      				Time: 20.153	15.986
      ./hackbench 10 process 20000
      				Time: 19.363	16.111
      ./hackbench 1 process 20000
      				Time: 2.518	2.307
      ./hackbench 1 process 20000
      				Time: 2.258	2.339
      ./hackbench 1 process 20000
      				Time: 2.864	2.163
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      49e22585
  2. 08 7月, 2011 1 次提交
  3. 02 7月, 2011 3 次提交
  4. 17 6月, 2011 1 次提交
    • C
      slab, slub, slob: Unify alignment definition · 3192b920
      Christoph Lameter 提交于
      Every slab has its on alignment definition in include/linux/sl?b_def.h. Extract those
      and define a common set in include/linux/slab.h.
      
      SLOB: As notes sometimes we need double word alignment on 32 bit. This gives all
      structures allocated by SLOB a unsigned long long alignment like the others do.
      
      SLAB: If ARCH_SLAB_MINALIGN is not set SLAB would set ARCH_SLAB_MINALIGN to
      zero meaning no alignment at all. Give it the default unsigned long long alignment.
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      3192b920
  5. 21 5月, 2011 1 次提交
  6. 08 5月, 2011 1 次提交
    • C
      slub: Remove CONFIG_CMPXCHG_LOCAL ifdeffery · 1759415e
      Christoph Lameter 提交于
      Remove the #ifdefs. This means that the irqsafe_cpu_cmpxchg_double() is used
      everywhere.
      
      There may be performance implications since:
      
      A. We now have to manage a transaction ID for all arches
      
      B. The interrupt holdoff for arches not supporting CONFIG_CMPXCHG_LOCAL is reduced
      to a very short irqoff section.
      
      There are no multiple irqoff/irqon sequences as a result of this change. Even in the fallback
      case we only have to do one disable and enable like before.
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      1759415e
  7. 23 3月, 2011 1 次提交
  8. 12 3月, 2011 1 次提交
  9. 11 3月, 2011 2 次提交
    • C
      Lockless (and preemptless) fastpaths for slub · 8a5ec0ba
      Christoph Lameter 提交于
      Use the this_cpu_cmpxchg_double functionality to implement a lockless
      allocation algorithm on arches that support fast this_cpu_ops.
      
      Each of the per cpu pointers is paired with a transaction id that ensures
      that updates of the per cpu information can only occur in sequence on
      a certain cpu.
      
      A transaction id is a "long" integer that is comprised of an event number
      and the cpu number. The event number is incremented for every change to the
      per cpu state. This means that the cmpxchg instruction can verify for an
      update that nothing interfered and that we are updating the percpu structure
      for the processor where we picked up the information and that we are also
      currently on that processor when we update the information.
      
      This results in a significant decrease of the overhead in the fastpaths. It
      also makes it easy to adopt the fast path for realtime kernels since this
      is lockless and does not require the use of the current per cpu area
      over the critical section. It is only important that the per cpu area is
      current at the beginning of the critical section and at the end.
      
      So there is no need even to disable preemption.
      
      Test results show that the fastpath cycle count is reduced by up to ~ 40%
      (alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree
      adds a few cycles.
      
      Sadly this does nothing for the slowpath which is where the main issues with
      performance in slub are but the best case performance rises significantly.
      (For that see the more complex slub patches that require cmpxchg_double)
      
      Kmalloc: alloc/free test
      
      Before:
      
      10000 times kmalloc(8)/kfree -> 134 cycles
      10000 times kmalloc(16)/kfree -> 152 cycles
      10000 times kmalloc(32)/kfree -> 144 cycles
      10000 times kmalloc(64)/kfree -> 142 cycles
      10000 times kmalloc(128)/kfree -> 142 cycles
      10000 times kmalloc(256)/kfree -> 132 cycles
      10000 times kmalloc(512)/kfree -> 132 cycles
      10000 times kmalloc(1024)/kfree -> 135 cycles
      10000 times kmalloc(2048)/kfree -> 135 cycles
      10000 times kmalloc(4096)/kfree -> 135 cycles
      10000 times kmalloc(8192)/kfree -> 144 cycles
      10000 times kmalloc(16384)/kfree -> 754 cycles
      
      After:
      
      10000 times kmalloc(8)/kfree -> 78 cycles
      10000 times kmalloc(16)/kfree -> 78 cycles
      10000 times kmalloc(32)/kfree -> 82 cycles
      10000 times kmalloc(64)/kfree -> 88 cycles
      10000 times kmalloc(128)/kfree -> 79 cycles
      10000 times kmalloc(256)/kfree -> 79 cycles
      10000 times kmalloc(512)/kfree -> 85 cycles
      10000 times kmalloc(1024)/kfree -> 82 cycles
      10000 times kmalloc(2048)/kfree -> 82 cycles
      10000 times kmalloc(4096)/kfree -> 85 cycles
      10000 times kmalloc(8192)/kfree -> 82 cycles
      10000 times kmalloc(16384)/kfree -> 706 cycles
      
      Kmalloc: Repeatedly allocate then free test
      
      Before:
      
      10000 times kmalloc(8) -> 211 cycles kfree -> 113 cycles
      10000 times kmalloc(16) -> 174 cycles kfree -> 115 cycles
      10000 times kmalloc(32) -> 235 cycles kfree -> 129 cycles
      10000 times kmalloc(64) -> 222 cycles kfree -> 120 cycles
      10000 times kmalloc(128) -> 343 cycles kfree -> 139 cycles
      10000 times kmalloc(256) -> 827 cycles kfree -> 147 cycles
      10000 times kmalloc(512) -> 1048 cycles kfree -> 272 cycles
      10000 times kmalloc(1024) -> 2043 cycles kfree -> 528 cycles
      10000 times kmalloc(2048) -> 4002 cycles kfree -> 571 cycles
      10000 times kmalloc(4096) -> 7740 cycles kfree -> 628 cycles
      10000 times kmalloc(8192) -> 8062 cycles kfree -> 850 cycles
      10000 times kmalloc(16384) -> 8895 cycles kfree -> 1249 cycles
      
      After:
      
      10000 times kmalloc(8) -> 190 cycles kfree -> 129 cycles
      10000 times kmalloc(16) -> 76 cycles kfree -> 123 cycles
      10000 times kmalloc(32) -> 126 cycles kfree -> 124 cycles
      10000 times kmalloc(64) -> 181 cycles kfree -> 128 cycles
      10000 times kmalloc(128) -> 310 cycles kfree -> 140 cycles
      10000 times kmalloc(256) -> 809 cycles kfree -> 165 cycles
      10000 times kmalloc(512) -> 1005 cycles kfree -> 269 cycles
      10000 times kmalloc(1024) -> 1999 cycles kfree -> 527 cycles
      10000 times kmalloc(2048) -> 3967 cycles kfree -> 570 cycles
      10000 times kmalloc(4096) -> 7658 cycles kfree -> 637 cycles
      10000 times kmalloc(8192) -> 8111 cycles kfree -> 859 cycles
      10000 times kmalloc(16384) -> 8791 cycles kfree -> 1173 cycles
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      8a5ec0ba
    • C
      slub: min_partial needs to be in first cacheline · 1a757fe5
      Christoph Lameter 提交于
      It is used in unfreeze_slab() which is a performance critical
      function.
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      1a757fe5
  10. 06 11月, 2010 1 次提交
    • R
      slub tracing: move trace calls out of always inlined functions to reduce kernel code size · 4a92379b
      Richard Kennedy 提交于
      Having the trace calls defined in the always inlined kmalloc functions
      in include/linux/slub_def.h causes a lot of code duplication as the
      trace functions get instantiated for each kamalloc call site. This can
      simply be removed by pushing the trace calls down into the functions in
      slub.c.
      
      On my x86_64 built this patch shrinks the code size of the kernel by
      approx 36K and also shrinks the code size of many modules -- too many to
      list here ;)
      
      size vmlinux (2.6.36) reports
             text        data     bss     dec     hex filename
          5410611	 743172	 828928	6982711	 6a8c37	vmlinux
          5373738	 744244	 828928	6946910	 6a005e	vmlinux + patch
      
      The resulting kernel has had some testing & kmalloc trace still seems to
      work.
      
      This patch
      - moves trace_kmalloc out of the inlined kmalloc() and pushes it down
      into kmem_cache_alloc_trace() so this it only get instantiated once.
      
      - rename kmem_cache_alloc_notrace()  to kmem_cache_alloc_trace() to
      indicate that now is does have tracing. (maybe this would better being
      called something like kmalloc_kmem_cache ?)
      
      - adds a new function kmalloc_order() to handle allocation and tracing
      of large allocations of page order.
      
      - removes tracing from the inlined kmalloc_large() replacing them with a
      call to kmalloc_order();
      
      - move tracing out of inlined kmalloc_node() and pushing it down into
      kmem_cache_alloc_node_trace
      
      - rename kmem_cache_alloc_node_notrace() to
      kmem_cache_alloc_node_trace()
      
      - removes the include of trace/events/kmem.h from slub_def.h.
      
      v2
      - keep kmalloc_order_trace inline when !CONFIG_TRACE
      Signed-off-by: NRichard Kennedy <richard@rsk.demon.co.uk>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      4a92379b
  11. 06 10月, 2010 1 次提交
  12. 02 10月, 2010 2 次提交
  13. 11 8月, 2010 1 次提交
    • F
      dma-mapping: rename ARCH_KMALLOC_MINALIGN to ARCH_DMA_MINALIGN · a6eb9fe1
      FUJITA Tomonori 提交于
      Now each architecture has the own dma_get_cache_alignment implementation.
      
      dma_get_cache_alignment returns the minimum DMA alignment.  Architectures
      define it as ARCH_KMALLOC_MINALIGN (it's used to make sure that malloc'ed
      buffer is DMA-safe; the buffer doesn't share a cache with the others).  So
      we can unify dma_get_cache_alignment implementations.
      
      This patch:
      
      dma_get_cache_alignment() needs to know if an architecture defines
      ARCH_KMALLOC_MINALIGN or not (needs to know if architecture has DMA
      alignment restriction).  However, slab.h define ARCH_KMALLOC_MINALIGN if
      architectures doesn't define it.
      
      Let's rename ARCH_KMALLOC_MINALIGN to ARCH_DMA_MINALIGN.
      ARCH_KMALLOC_MINALIGN is used only in the internals of slab/slob/slub
      (except for crypto).
      Signed-off-by: NFUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6eb9fe1
  14. 09 8月, 2010 1 次提交
  15. 09 6月, 2010 1 次提交
  16. 30 5月, 2010 1 次提交
  17. 25 5月, 2010 1 次提交
  18. 20 5月, 2010 1 次提交
  19. 20 12月, 2009 3 次提交
    • C
      SLUB: this_cpu: Remove slub kmem_cache fields · ff12059e
      Christoph Lameter 提交于
      Remove the fields in struct kmem_cache_cpu that were used to cache data from
      struct kmem_cache when they were in different cachelines. The cacheline that
      holds the per cpu array pointer now also holds these values. We can cut down
      the struct kmem_cache_cpu size to almost half.
      
      The get_freepointer() and set_freepointer() functions that used to be only
      intended for the slow path now are also useful for the hot path since access
      to the size field does not require accessing an additional cacheline anymore.
      This results in consistent use of functions for setting the freepointer of
      objects throughout SLUB.
      
      Also we initialize all possible kmem_cache_cpu structures when a slab is
      created. No need to initialize them when a processor or node comes online.
      Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
      ff12059e
    • C
      SLUB: Get rid of dynamic DMA kmalloc cache allocation · 756dee75
      Christoph Lameter 提交于
      Dynamic DMA kmalloc cache allocation is troublesome since the
      new percpu allocator does not support allocations in atomic contexts.
      Reserve some statically allocated kmalloc_cpu structures instead.
      Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
      756dee75
    • C
      SLUB: Use this_cpu operations in slub · 9dfc6e68
      Christoph Lameter 提交于
      Using per cpu allocations removes the needs for the per cpu arrays in the
      kmem_cache struct. These could get quite big if we have to support systems
      with thousands of cpus. The use of this_cpu_xx operations results in:
      
      1. The size of kmem_cache for SMP configuration shrinks since we will only
         need 1 pointer instead of NR_CPUS. The same pointer can be used by all
         processors. Reduces cache footprint of the allocator.
      
      2. We can dynamically size kmem_cache according to the actual nodes in the
         system meaning less memory overhead for configurations that may potentially
         support up to 1k NUMA nodes / 4k cpus.
      
      3. We can remove the diddle widdle with allocating and releasing of
         kmem_cache_cpu structures when bringing up and shutting down cpus. The cpu
         alloc logic will do it all for us. Removes some portions of the cpu hotplug
         functionality.
      
      4. Fastpath performance increases since per cpu pointer lookups and
         address calculations are avoided.
      
      V7-V8
      - Convert missed get_cpu_slab() under CONFIG_SLUB_STATS
      Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
      9dfc6e68
  20. 11 12月, 2009 1 次提交
  21. 30 8月, 2009 1 次提交
  22. 06 8月, 2009 1 次提交
  23. 08 7月, 2009 1 次提交
  24. 12 6月, 2009 1 次提交
    • P
      slab,slub: don't enable interrupts during early boot · 7e85ee0c
      Pekka Enberg 提交于
      As explained by Benjamin Herrenschmidt:
      
        Oh and btw, your patch alone doesn't fix powerpc, because it's missing
        a whole bunch of GFP_KERNEL's in the arch code... You would have to
        grep the entire kernel for things that check slab_is_available() and
        even then you'll be missing some.
      
        For example, slab_is_available() didn't always exist, and so in the
        early days on powerpc, we used a mem_init_done global that is set form
        mem_init() (not perfect but works in practice). And we still have code
        using that to do the test.
      
      Therefore, mask out __GFP_WAIT, __GFP_IO, and __GFP_FS in the slab allocators
      in early boot code to avoid enabling interrupts.
      Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
      7e85ee0c
  25. 12 4月, 2009 1 次提交
  26. 03 4月, 2009 1 次提交
  27. 23 2月, 2009 1 次提交
  28. 20 2月, 2009 3 次提交
  29. 30 12月, 2008 1 次提交
    • F
      tracing/kmemtrace: normalize the raw tracer event to the unified tracing API · 36994e58
      Frederic Weisbecker 提交于
      Impact: new tracer plugin
      
      This patch adapts kmemtrace raw events tracing to the unified tracing API.
      
      To enable and use this tracer, just do the following:
      
       echo kmemtrace > /debugfs/tracing/current_tracer
       cat /debugfs/tracing/trace
      
      You will have the following output:
      
       # tracer: kmemtrace
       #
       #
       # ALLOC  TYPE  REQ   GIVEN  FLAGS           POINTER         NODE    CALLER
       # FREE   |      |     |       |              |   |            |        |
       # |
      
      type_id 1 call_site 18446744071565527833 ptr 18446612134395152256
      type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
      type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
      type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
      type_id 0 call_site 18446744071565636711 ptr 18446612134345164672 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1
      type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
      type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
      type_id 0 call_site 18446744071565636711 ptr 18446612134345164912 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1
      type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
      type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
      type_id 0 call_site 18446744071565636711 ptr 18446612134345165152 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1
      type_id 0 call_site 18446744071566144042 ptr 18446612134346191680 bytes_req 1304 bytes_alloc 1312 gfp_flags 208 node -1
      type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
      type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
      type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
      
      That was to stay backward compatible with the format output produced in
      inux/tracepoint.h.
      
      This is the default ouput, but note that I tried something else.
      
      If you change an option:
      
      echo kmem_minimalistic > /debugfs/trace_options
      
      and then cat /debugfs/trace, you will have the following output:
      
       # tracer: kmemtrace
       #
       #
       # ALLOC  TYPE  REQ   GIVEN  FLAGS           POINTER         NODE    CALLER
       # FREE   |      |     |       |              |   |            |        |
       # |
      
         -      C                            0xffff88007c088780          file_free_rcu
         +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
         -      C                            0xffff88007cad6000          putname
         +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
         +      K    240    240   000000d0   0xffff8800790dc780     -1   d_alloc
         -      C                            0xffff88007cad6000          putname
         +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
         +      K    240    240   000000d0   0xffff8800790dc870     -1   d_alloc
         -      C                            0xffff88007cad6000          putname
         +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
         +      K    240    240   000000d0   0xffff8800790dc960     -1   d_alloc
         +      K   1304   1312   000000d0   0xffff8800791d7340     -1   reiserfs_alloc_inode
         -      C                            0xffff88007cad6000          putname
         +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
         -      C                            0xffff88007cad6000          putname
         +      K    992   1000   000000d0   0xffff880079045b58     -1   alloc_inode
         +      K    768   1024   000080d0   0xffff88007c096400     -1   alloc_pipe_info
         +      K    240    240   000000d0   0xffff8800790dca50     -1   d_alloc
         +      K    272    320   000080d0   0xffff88007c088780     -1   get_empty_filp
         +      K    272    320   000080d0   0xffff88007c088000     -1   get_empty_filp
      
      Yeah I shall confess kmem_minimalistic should be: kmem_alternative.
      
      Whatever, I find it more readable but this a personal opinion of course.
      We can drop it if you want.
      
      On the ALLOC/FREE column, + means an allocation and - a free.
      
      On the type column, you have K = kmalloc, C = cache, P = page
      
      I would like the flags to be GFP_* strings but that would not be easy to not
      break the column with strings....
      
      About the node...it seems to always be -1. I don't know why but that shouldn't
      be difficult to find.
      
      I moved linux/tracepoint.h to trace/tracepoint.h as well. I think that would
      be more easy to find the tracer headers if they are all in their common
      directory.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      36994e58
  30. 29 12月, 2008 1 次提交
  31. 05 8月, 2008 1 次提交
    • P
      SLUB: dynamic per-cache MIN_PARTIAL · 5595cffc
      Pekka Enberg 提交于
      This patch changes the static MIN_PARTIAL to a dynamic per-cache ->min_partial
      value that is calculated from object size. The bigger the object size, the more
      pages we keep on the partial list.
      
      I tested SLAB, SLUB, and SLUB with this patch on Jens Axboe's 'netio' example
      script of the fio benchmarking tool. The script stresses the networking
      subsystem which should also give a fairly good beating of kmalloc() et al.
      
      To run the test yourself, first clone the fio repository:
      
        git clone git://git.kernel.dk/fio.git
      
      and then run the following command n times on your machine:
      
        time ./fio examples/netio
      
      The results on my 2-way 64-bit x86 machine are as follows:
      
        [ the minimum, maximum, and average are captured from 50 individual runs ]
      
                       real time (seconds)
                       min      max      avg      sd
        SLAB           22.76    23.38    22.98    0.17
        SLUB           22.80    25.78    23.46    0.72
        SLUB (dynamic) 22.74    23.54    23.00    0.20
      
                       sys time (seconds)
                       min      max      avg      sd
        SLAB           6.90     8.28     7.70     0.28
        SLUB           7.42     16.95    8.89     2.28
        SLUB (dynamic) 7.17     8.64     7.73     0.29
      
                       user time (seconds)
                       min      max      avg      sd
        SLAB           36.89    38.11    37.50    0.29
        SLUB           30.85    37.99    37.06    1.67
        SLUB (dynamic) 36.75    38.07    37.59    0.32
      
      As you can see from the above numbers, this patch brings SLUB to the same level
      as SLAB for this particular workload fixing a ~2% regression. I'd expect this
      change to help similar workloads that allocate a lot of objects that are close
      to the size of a page.
      
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: NChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
      5595cffc
  32. 27 7月, 2008 1 次提交