1. 11 Mar, 2011 2 commits
    • Lockless (and preemptless) fastpaths for slub · 8a5ec0ba
      Committed by Christoph Lameter
      Use the this_cpu_cmpxchg_double functionality to implement a lockless
      allocation algorithm on arches that support fast this_cpu_ops.
      
      Each of the per cpu pointers is paired with a transaction id that ensures
      that updates of the per cpu information can only occur in sequence on
      a certain cpu.
      
      A transaction id is a "long" integer composed of an event number and the
      cpu number. The event number is incremented for every change to the per
      cpu state. The cmpxchg instruction can therefore verify three things at
      update time: that nothing interfered, that we are updating the percpu
      structure of the processor where we picked up the information, and that
      we are still on that processor when we update the information.
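
The transaction id scheme described above can be sketched in userspace C. This is a minimal model under stated assumptions, not the kernel code: TID_STEP is set to an illustrative 4, the struct and the *_sketch / *_model names are hypothetical, and cmpxchg_double_model() only models the compare-and-exchange semantics sequentially (it is not atomic).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: the CPU count rounded up to a power of two; 4 is illustrative. */
#define TID_STEP 4UL

typedef unsigned long tid_t;

/* A tid starts out as the cpu number, i.e. event number 0 on that cpu. */
static inline tid_t init_tid(unsigned int cpu)
{
	assert(cpu < TID_STEP);
	return cpu;
}

/* Every change to the per cpu state bumps the event number by one. */
static inline tid_t next_tid(tid_t tid) { return tid + TID_STEP; }
static inline unsigned int tid_to_cpu(tid_t tid) { return (unsigned int)(tid % TID_STEP); }
static inline unsigned long tid_to_event(tid_t tid) { return tid / TID_STEP; }

/* Hypothetical stand-in for the per cpu slab state: freelist paired with a tid. */
struct cpu_slab_sketch {
	void *freelist;
	tid_t tid;
};

/* Models a double-word compare-and-exchange. NOT atomic; semantics only. */
static inline bool cmpxchg_double_model(struct cpu_slab_sketch *c,
					void *old_free, tid_t old_tid,
					void *new_free, tid_t new_tid)
{
	if (c->freelist != old_free || c->tid != old_tid)
		return false;
	c->freelist = new_free;
	c->tid = new_tid;
	return true;
}

/* Fastpath: pop the first free object; retry if the (freelist, tid) pair moved. */
static inline void *alloc_fastpath_sketch(struct cpu_slab_sketch *c)
{
	for (;;) {
		tid_t tid = c->tid;
		void *object = c->freelist;
		void *next;

		if (!object)
			return NULL;	/* a real allocator would take the slowpath here */
		next = *(void **)object;	/* first word of a free object links to the next */
		if (cmpxchg_double_model(c, object, tid, next, next_tid(tid)))
			return object;
	}
}
```

Because the cpu number is folded into the tid, a task that migrated to another cpu between reading the pair and updating it would see a mismatching tid and retry, which is why the sketch needs neither a lock nor a preemption-disabled section.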
      
      This results in a significant decrease of the overhead in the fastpaths. It
      also makes it easy to adopt the fast path for realtime kernels since this
      is lockless and does not require the use of the current per cpu area
      over the critical section. It is only important that the per cpu area is
      current at the beginning of the critical section and at the end.
      
      So there is no need even to disable preemption.
      
      Test results show that the fastpath cycle count is reduced by up to ~40%
      (alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree
      adds a few cycles.
      
      Sadly this does nothing for the slowpath, which is where the main
      performance issues in slub are, but the best case performance improves
      significantly.
      (For that see the more complex slub patches that require cmpxchg_double)
      
      Kmalloc: alloc/free test
      
      Before:
      
      10000 times kmalloc(8)/kfree -> 134 cycles
      10000 times kmalloc(16)/kfree -> 152 cycles
      10000 times kmalloc(32)/kfree -> 144 cycles
      10000 times kmalloc(64)/kfree -> 142 cycles
      10000 times kmalloc(128)/kfree -> 142 cycles
      10000 times kmalloc(256)/kfree -> 132 cycles
      10000 times kmalloc(512)/kfree -> 132 cycles
      10000 times kmalloc(1024)/kfree -> 135 cycles
      10000 times kmalloc(2048)/kfree -> 135 cycles
      10000 times kmalloc(4096)/kfree -> 135 cycles
      10000 times kmalloc(8192)/kfree -> 144 cycles
      10000 times kmalloc(16384)/kfree -> 754 cycles
      
      After:
      
      10000 times kmalloc(8)/kfree -> 78 cycles
      10000 times kmalloc(16)/kfree -> 78 cycles
      10000 times kmalloc(32)/kfree -> 82 cycles
      10000 times kmalloc(64)/kfree -> 88 cycles
      10000 times kmalloc(128)/kfree -> 79 cycles
      10000 times kmalloc(256)/kfree -> 79 cycles
      10000 times kmalloc(512)/kfree -> 85 cycles
      10000 times kmalloc(1024)/kfree -> 82 cycles
      10000 times kmalloc(2048)/kfree -> 82 cycles
      10000 times kmalloc(4096)/kfree -> 85 cycles
      10000 times kmalloc(8192)/kfree -> 82 cycles
      10000 times kmalloc(16384)/kfree -> 706 cycles
      
      Kmalloc: Repeatedly allocate then free test
      
      Before:
      
      10000 times kmalloc(8) -> 211 cycles kfree -> 113 cycles
      10000 times kmalloc(16) -> 174 cycles kfree -> 115 cycles
      10000 times kmalloc(32) -> 235 cycles kfree -> 129 cycles
      10000 times kmalloc(64) -> 222 cycles kfree -> 120 cycles
      10000 times kmalloc(128) -> 343 cycles kfree -> 139 cycles
      10000 times kmalloc(256) -> 827 cycles kfree -> 147 cycles
      10000 times kmalloc(512) -> 1048 cycles kfree -> 272 cycles
      10000 times kmalloc(1024) -> 2043 cycles kfree -> 528 cycles
      10000 times kmalloc(2048) -> 4002 cycles kfree -> 571 cycles
      10000 times kmalloc(4096) -> 7740 cycles kfree -> 628 cycles
      10000 times kmalloc(8192) -> 8062 cycles kfree -> 850 cycles
      10000 times kmalloc(16384) -> 8895 cycles kfree -> 1249 cycles
      
      After:
      
      10000 times kmalloc(8) -> 190 cycles kfree -> 129 cycles
      10000 times kmalloc(16) -> 76 cycles kfree -> 123 cycles
      10000 times kmalloc(32) -> 126 cycles kfree -> 124 cycles
      10000 times kmalloc(64) -> 181 cycles kfree -> 128 cycles
      10000 times kmalloc(128) -> 310 cycles kfree -> 140 cycles
      10000 times kmalloc(256) -> 809 cycles kfree -> 165 cycles
      10000 times kmalloc(512) -> 1005 cycles kfree -> 269 cycles
      10000 times kmalloc(1024) -> 1999 cycles kfree -> 527 cycles
      10000 times kmalloc(2048) -> 3967 cycles kfree -> 570 cycles
      10000 times kmalloc(4096) -> 7658 cycles kfree -> 637 cycles
      10000 times kmalloc(8192) -> 8111 cycles kfree -> 859 cycles
      10000 times kmalloc(16384) -> 8791 cycles kfree -> 1173 cycles
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • slub: min_partial needs to be in first cacheline · 1a757fe5
      Committed by Christoph Lameter
      It is used in unfreeze_slab() which is a performance critical
      function.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
  2. 06 Nov, 2010 1 commit
    • slub tracing: move trace calls out of always inlined functions to reduce kernel code size · 4a92379b
      Committed by Richard Kennedy
      Having the trace calls defined in the always inlined kmalloc functions
      in include/linux/slub_def.h causes a lot of code duplication as the
      trace functions get instantiated for each kmalloc call site. This can
      simply be removed by pushing the trace calls down into the functions in
      slub.c.
      
      On my x86_64 build this patch shrinks the code size of the kernel by
      approx 36K and also shrinks the code size of many modules -- too many to
      list here ;)
      
      size vmlinux (2.6.36) reports
             text     data     bss      dec     hex  filename
          5410611   743172  828928  6982711  6a8c37  vmlinux
          5373738   744244  828928  6946910  6a005e  vmlinux + patch
      
      The resulting kernel has had some testing & kmalloc trace still seems to
      work.
      
      This patch
      - moves trace_kmalloc out of the inlined kmalloc() and pushes it down
        into kmem_cache_alloc_trace() so that it only gets instantiated once.

      - renames kmem_cache_alloc_notrace() to kmem_cache_alloc_trace() to
        indicate that it now does have tracing. (Maybe this would be better
        called something like kmalloc_kmem_cache?)

      - adds a new function kmalloc_order() to handle allocation and tracing
        of large allocations of page order.

      - removes tracing from the inlined kmalloc_large(), replacing it with a
        call to kmalloc_order().

      - moves tracing out of the inlined kmalloc_node() and pushes it down
        into kmem_cache_alloc_node_trace().

      - renames kmem_cache_alloc_node_notrace() to
        kmem_cache_alloc_node_trace().

      - removes the include of trace/events/kmem.h from slub_def.h.
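
The code size effect can be illustrated with a small userspace sketch. It assumes a hypothetical counter standing in for the trace_kmalloc() tracepoint and hypothetical *_sketch names; in the kernel the out-of-line function lives in slub.c, a separate translation unit, so it cannot be inlined into every caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical counter standing in for the trace_kmalloc() tracepoint. */
static unsigned long trace_events_sketch;

/*
 * Out-of-line allocation function: the tracing code exists exactly once
 * here, no matter how many call sites use the inline wrapper below.
 */
static void *cache_alloc_trace_sketch(size_t size)
{
	void *ret;

	assert(size > 0);
	ret = malloc(size);
	trace_events_sketch++;	/* the single instance of the trace call */
	return ret;
}

/*
 * Always-inlined wrapper: it only forwards the call, so each call site
 * expands to a plain function call instead of a full copy of the tracing code.
 */
static inline void *kmalloc_sketch(size_t size)
{
	return cache_alloc_trace_sketch(size);
}
```

Before the patch, the equivalent of the counter increment sat inside the inline wrapper and was therefore duplicated at every call site.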
      
      v2
      - keep kmalloc_order_trace inline when !CONFIG_TRACE
      Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
  3. 06 Oct, 2010 1 commit
  4. 02 Oct, 2010 2 commits
  5. 11 Aug, 2010 1 commit
    • dma-mapping: rename ARCH_KMALLOC_MINALIGN to ARCH_DMA_MINALIGN · a6eb9fe1
      Committed by FUJITA Tomonori
      Now each architecture has its own dma_get_cache_alignment implementation.
      
      dma_get_cache_alignment returns the minimum DMA alignment.  Architectures
      define it as ARCH_KMALLOC_MINALIGN (it's used to make sure that a malloc'ed
      buffer is DMA-safe; the buffer doesn't share a cache line with the others).
      So we can unify the dma_get_cache_alignment implementations.
      
      This patch:
      
      dma_get_cache_alignment() needs to know whether an architecture defines
      ARCH_KMALLOC_MINALIGN or not (that is, whether the architecture has a DMA
      alignment restriction).  However, slab.h defines ARCH_KMALLOC_MINALIGN if
      the architecture doesn't define it.
      
      Let's rename ARCH_KMALLOC_MINALIGN to ARCH_DMA_MINALIGN.
      ARCH_KMALLOC_MINALIGN is used only in the internals of slab/slob/slub
      (except for crypto).
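
A minimal sketch of how the two macros can be layered after the rename, assuming the *_SKETCH names (hypothetical) and a default alignment of unsigned long long; it is an illustration of the idea, not the contents of slab.h.

```c
#include <assert.h>

/*
 * An architecture with a DMA alignment restriction would define
 * ARCH_DMA_MINALIGN itself; this sketch deliberately leaves it undefined.
 */
#ifdef ARCH_DMA_MINALIGN
#define ARCH_KMALLOC_MINALIGN_SKETCH ARCH_DMA_MINALIGN
#else
/* Assumed allocator-internal default when there is no DMA restriction. */
#define ARCH_KMALLOC_MINALIGN_SKETCH _Alignof(unsigned long long)
#endif

_Static_assert(ARCH_KMALLOC_MINALIGN_SKETCH >= 1, "alignment must be positive");

/*
 * Because the allocator default no longer hides the answer, a unified
 * helper can test the architecture macro directly.
 */
static inline int dma_get_cache_alignment_sketch(void)
{
#ifdef ARCH_DMA_MINALIGN
	return ARCH_DMA_MINALIGN;
#else
	return 1;	/* no DMA alignment restriction */
#endif
}
```
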
      Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 09 Aug, 2010 1 commit
  7. 09 Jun, 2010 1 commit
  8. 30 May, 2010 1 commit
  9. 25 May, 2010 1 commit
  10. 20 May, 2010 1 commit
  11. 20 Dec, 2009 3 commits
    • SLUB: this_cpu: Remove slub kmem_cache fields · ff12059e
      Committed by Christoph Lameter
      Remove the fields in struct kmem_cache_cpu that were used to cache data from
      struct kmem_cache when they were in different cachelines. The cacheline that
      holds the per cpu array pointer now also holds these values. We can cut down
      the struct kmem_cache_cpu size to almost half.
      
      The get_freepointer() and set_freepointer() functions that used to be only
      intended for the slow path now are also useful for the hot path since access
      to the size field does not require accessing an additional cacheline anymore.
      This results in consistent use of functions for setting the freepointer of
      objects throughout SLUB.
      
      Also we initialize all possible kmem_cache_cpu structures when a slab is
      created. No need to initialize them when a processor or node comes online.
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
    • SLUB: Get rid of dynamic DMA kmalloc cache allocation · 756dee75
      Committed by Christoph Lameter
      Dynamic DMA kmalloc cache allocation is troublesome since the
      new percpu allocator does not support allocations in atomic contexts.
      Reserve some statically allocated kmalloc_cpu structures instead.
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
    • SLUB: Use this_cpu operations in slub · 9dfc6e68
      Committed by Christoph Lameter
      Using per cpu allocations removes the need for the per cpu arrays in the
      kmem_cache struct. These could get quite big if we have to support systems
      with thousands of cpus. The use of this_cpu_xx operations results in:
      
      1. The size of kmem_cache for SMP configuration shrinks since we will only
         need 1 pointer instead of NR_CPUS. The same pointer can be used by all
         processors. Reduces cache footprint of the allocator.
      
      2. We can dynamically size kmem_cache according to the actual nodes in the
         system meaning less memory overhead for configurations that may potentially
         support up to 1k NUMA nodes / 4k cpus.
      
      3. We can remove the diddle widdle with allocating and releasing of
         kmem_cache_cpu structures when bringing up and shutting down cpus. The cpu
         alloc logic will do it all for us. Removes some portions of the cpu hotplug
         functionality.
      
      4. Fastpath performance increases since per cpu pointer lookups and
         address calculations are avoided.
      
      V7-V8
      - Convert missed get_cpu_slab() under CONFIG_SLUB_STATS
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
  12. 11 Dec, 2009 1 commit
  13. 30 Aug, 2009 1 commit
  14. 06 Aug, 2009 1 commit
  15. 08 Jul, 2009 1 commit
  16. 12 Jun, 2009 1 commit
    • slab,slub: don't enable interrupts during early boot · 7e85ee0c
      Committed by Pekka Enberg
      As explained by Benjamin Herrenschmidt:
      
        Oh and btw, your patch alone doesn't fix powerpc, because it's missing
        a whole bunch of GFP_KERNEL's in the arch code... You would have to
        grep the entire kernel for things that check slab_is_available() and
        even then you'll be missing some.
      
        For example, slab_is_available() didn't always exist, and so in the
        early days on powerpc, we used a mem_init_done global that is set from
        mem_init() (not perfect but works in practice). And we still have code
        using that to do the test.
      
      Therefore, mask out __GFP_WAIT, __GFP_IO, and __GFP_FS in the slab allocators
      in early boot code to avoid enabling interrupts.
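
The masking idea can be sketched as follows. The flag values and all *_sketch / *_SKETCH names are illustrative assumptions (the real definitions live in include/linux/gfp.h), and the function that lifts the restriction is a hypothetical stand-in for whatever the kernel does once interrupts may be enabled.

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Illustrative bit values only; not taken from the kernel headers. */
#define GFP_WAIT_SKETCH   0x10u	/* allocation may sleep */
#define GFP_HIGH_SKETCH   0x20u	/* unrelated flag, must pass through */
#define GFP_IO_SKETCH     0x40u	/* allocation may start I/O */
#define GFP_FS_SKETCH     0x80u	/* allocation may call into the filesystem */
#define GFP_KERNEL_SKETCH (GFP_WAIT_SKETCH | GFP_IO_SKETCH | GFP_FS_SKETCH)

/* During early boot the three flags that could enable interrupts are masked out. */
static gfp_t gfp_allowed_mask_sketch =
	~(GFP_WAIT_SKETCH | GFP_IO_SKETCH | GFP_FS_SKETCH);

/* Every allocation request is filtered through the currently allowed mask. */
static inline gfp_t filter_gfp_sketch(gfp_t flags)
{
	return flags & gfp_allowed_mask_sketch;
}

/* Hypothetical hook: once boot has progressed far enough, allow all flags. */
static inline void boot_finished_sketch(void)
{
	gfp_allowed_mask_sketch = ~0u;
}
```

With this in place, arch code that passes GFP_KERNEL during early boot does not have to be audited call site by call site: the allocator itself strips the problematic flags.
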
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
  17. 12 Apr, 2009 1 commit
  18. 03 Apr, 2009 1 commit
  19. 23 Feb, 2009 1 commit
  20. 20 Feb, 2009 3 commits
  21. 30 Dec, 2008 1 commit
    • tracing/kmemtrace: normalize the raw tracer event to the unified tracing API · 36994e58
      Committed by Frederic Weisbecker
      Impact: new tracer plugin
      
      This patch adapts kmemtrace raw events tracing to the unified tracing API.
      
      To enable and use this tracer, just do the following:
      
       echo kmemtrace > /debugfs/tracing/current_tracer
       cat /debugfs/tracing/trace
      
      You will have the following output:
      
       # tracer: kmemtrace
       #
       #
       # ALLOC  TYPE  REQ   GIVEN  FLAGS           POINTER         NODE    CALLER
       # FREE   |      |     |       |              |   |            |        |
       # |
      
      type_id 1 call_site 18446744071565527833 ptr 18446612134395152256
      type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
      type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
      type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
      type_id 0 call_site 18446744071565636711 ptr 18446612134345164672 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1
      type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
      type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
      type_id 0 call_site 18446744071565636711 ptr 18446612134345164912 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1
      type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
      type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
      type_id 0 call_site 18446744071565636711 ptr 18446612134345165152 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1
      type_id 0 call_site 18446744071566144042 ptr 18446612134346191680 bytes_req 1304 bytes_alloc 1312 gfp_flags 208 node -1
      type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
      type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
      type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
      
      That was to stay backward compatible with the output format produced in
      linux/tracepoint.h.

      This is the default output, but note that I tried something else.
      
      If you change an option:
      
      echo kmem_minimalistic > /debugfs/trace_options
      
      and then cat /debugfs/trace, you will have the following output:
      
       # tracer: kmemtrace
       #
       #
       # ALLOC  TYPE  REQ   GIVEN  FLAGS           POINTER         NODE    CALLER
       # FREE   |      |     |       |              |   |            |        |
       # |
      
         -      C                            0xffff88007c088780          file_free_rcu
         +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
         -      C                            0xffff88007cad6000          putname
         +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
         +      K    240    240   000000d0   0xffff8800790dc780     -1   d_alloc
         -      C                            0xffff88007cad6000          putname
         +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
         +      K    240    240   000000d0   0xffff8800790dc870     -1   d_alloc
         -      C                            0xffff88007cad6000          putname
         +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
         +      K    240    240   000000d0   0xffff8800790dc960     -1   d_alloc
         +      K   1304   1312   000000d0   0xffff8800791d7340     -1   reiserfs_alloc_inode
         -      C                            0xffff88007cad6000          putname
         +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
         -      C                            0xffff88007cad6000          putname
         +      K    992   1000   000000d0   0xffff880079045b58     -1   alloc_inode
         +      K    768   1024   000080d0   0xffff88007c096400     -1   alloc_pipe_info
         +      K    240    240   000000d0   0xffff8800790dca50     -1   d_alloc
         +      K    272    320   000080d0   0xffff88007c088780     -1   get_empty_filp
         +      K    272    320   000080d0   0xffff88007c088000     -1   get_empty_filp
      
      Yeah, I shall confess that kmem_minimalistic should be kmem_alternative.

      Whatever, I find it more readable, but this is a personal opinion of course.
      We can drop it if you want.
      
      On the ALLOC/FREE column, + means an allocation and - a free.
      
      On the type column, you have K = kmalloc, C = cache, P = page
      
      I would like the flags to be GFP_* strings, but it would not be easy to do
      that without breaking the column alignment....
      
      About the node...it seems to always be -1. I don't know why but that shouldn't
      be difficult to find.
      
      I moved linux/tracepoint.h to trace/tracepoint.h as well. I think it will
      be easier to find the tracer headers if they are all in a common
      directory.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  22. 29 Dec, 2008 1 commit
  23. 05 Aug, 2008 1 commit
    • SLUB: dynamic per-cache MIN_PARTIAL · 5595cffc
      Committed by Pekka Enberg
      This patch changes the static MIN_PARTIAL to a dynamic per-cache ->min_partial
      value that is calculated from object size. The bigger the object size, the more
      pages we keep on the partial list.
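
One way such a size-dependent value could be computed is sketched below. The base-2 logarithm heuristic and the clamp bounds of 5 and 10 are assumptions made for illustration, and the *_sketch names are hypothetical; the commit message itself only states that the value grows with object size.

```c
#include <assert.h>

/* Assumed clamp bounds for the number of partial slabs kept per node. */
#define MIN_PARTIAL_SKETCH 5UL
#define MAX_PARTIAL_SKETCH 10UL

/* Integer base-2 logarithm (floor); expects n > 0. */
static inline unsigned long ilog2_sketch(unsigned long n)
{
	unsigned long log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/* Bigger objects yield a bigger min_partial, within the assumed bounds. */
static inline unsigned long calc_min_partial_sketch(unsigned long object_size)
{
	unsigned long min = ilog2_sketch(object_size);

	if (min < MIN_PARTIAL_SKETCH)
		min = MIN_PARTIAL_SKETCH;
	else if (min > MAX_PARTIAL_SKETCH)
		min = MAX_PARTIAL_SKETCH;
	return min;
}
```

Under these assumptions an 8-byte cache keeps the minimum of 5 partial slabs, a 256-byte cache keeps 8, and a 4096-byte cache is capped at 10.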
      
      I tested SLAB, SLUB, and SLUB with this patch on Jens Axboe's 'netio' example
      script of the fio benchmarking tool. The script stresses the networking
      subsystem which should also give a fairly good beating of kmalloc() et al.
      
      To run the test yourself, first clone the fio repository:
      
        git clone git://git.kernel.dk/fio.git
      
      and then run the following command n times on your machine:
      
        time ./fio examples/netio
      
      The results on my 2-way 64-bit x86 machine are as follows:
      
        [ the minimum, maximum, and average are captured from 50 individual runs ]
      
                       real time (seconds)
                       min      max      avg      sd
        SLAB           22.76    23.38    22.98    0.17
        SLUB           22.80    25.78    23.46    0.72
        SLUB (dynamic) 22.74    23.54    23.00    0.20
      
                       sys time (seconds)
                       min      max      avg      sd
        SLAB           6.90     8.28     7.70     0.28
        SLUB           7.42     16.95    8.89     2.28
        SLUB (dynamic) 7.17     8.64     7.73     0.29
      
                       user time (seconds)
                       min      max      avg      sd
        SLAB           36.89    38.11    37.50    0.29
        SLUB           30.85    37.99    37.06    1.67
        SLUB (dynamic) 36.75    38.07    37.59    0.32
      
      As you can see from the above numbers, this patch brings SLUB to the same level
      as SLAB for this particular workload fixing a ~2% regression. I'd expect this
      change to help similar workloads that allocate a lot of objects that are close
      to the size of a page.
      
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
  24. 27 Jul, 2008 1 commit
  25. 05 Jul, 2008 1 commit
  26. 04 Jul, 2008 1 commit
  27. 27 Apr, 2008 3 commits
    • slub: Fallback to minimal order during slab page allocation · 65c3376a
      Committed by Christoph Lameter
      If any higher order allocation fails then fall back to the smallest order
      necessary to contain at least one object. This enables fallback for all
      allocations to order 0 pages. The fallback will waste more memory (objects
      will not fit neatly) and the fallback slabs will not be as efficient as
      larger slabs since they contain fewer objects.

      Note that SLAB also depends on order 1 allocations for some slabs that waste
      too much memory if forced into a PAGE_SIZE'd page. SLUB can now deal with
      failing order 1 allocs, which SLAB cannot do.

      Add a new field min that will contain the number of objects for the
      smallest possible order for a slab cache.
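
The fallback logic can be sketched in userspace as below. The page size of 4096, the callback type, and all *_sketch names are assumptions for illustration; the callback stands in for the page allocator, and only_order0_available_sketch() simulates a fragmented system.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SIZE_SKETCH 4096UL	/* assumed page size */

/* Hypothetical page allocator hook: returns true if 2^order pages are available. */
typedef bool (*page_alloc_fn)(unsigned int order);

/* Smallest order whose slab can hold at least one object of the given size. */
static inline unsigned int min_order_for_sketch(unsigned long size)
{
	unsigned int order = 0;

	while ((PAGE_SIZE_SKETCH << order) < size)
		order++;
	return order;
}

/*
 * Try the preferred (higher) order first; if that fails, fall back to the
 * minimal order. Returns the order obtained, or -1 if both attempts fail.
 */
static inline int alloc_slab_order_sketch(page_alloc_fn try_alloc,
					  unsigned int preferred,
					  unsigned long size)
{
	unsigned int min = min_order_for_sketch(size);

	if (preferred > min && try_alloc(preferred))
		return (int)preferred;
	if (try_alloc(min))
		return (int)min;
	return -1;
}

/* Stand-in allocator simulating fragmentation: only single pages succeed. */
static inline bool only_order0_available_sketch(unsigned int order)
{
	return order == 0;
}
```

For 256-byte objects with a preferred order of 3, the fallback slab holds 4096 / 256 = 16 objects instead of 32768 / 256 = 128, which is the efficiency loss the message mentions.
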
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
    • slub: Update statistics handling for variable order slabs · 205ab99d
      Committed by Christoph Lameter
      Change the statistics to consider that slabs of the same slabcache
      can have a different number of objects in them since they may be of
      different order.
      
      Provide a new sysfs field
      
      	total_objects
      
      which shows the total objects that the allocated slabs of a slabcache
      could hold.
      
      Add a max field that holds the largest slab order that was ever used
      for a slab cache.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
    • slub: Add kmem_cache_order_objects struct · 834f3d11
      Committed by Christoph Lameter
      Pack the order and the number of objects into a single word.
      This saves some memory in the kmem_cache structure and more importantly
      allows us to fetch both values atomically.
      
      Later the slab orders become runtime configurable and we need to fetch these
      two items together in order to properly allocate a slab and initialize its
      objects.
      
      Fix the race by fetching the order and the number of objects in one word.
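
A minimal sketch of the packing, assuming a 16-bit split between object count and order and a 4096-byte page; the OO_*_SKETCH constants and *_sketch helper names are hypothetical and chosen for illustration.

```c
#include <assert.h>

#define PAGE_SIZE_SKETCH 4096UL			/* assumed page size */
#define OO_SHIFT_SKETCH  16			/* assumed: low 16 bits hold the object count */
#define OO_MASK_SKETCH   ((1UL << OO_SHIFT_SKETCH) - 1)

/* Order and object count packed into one word so both can be read in a single load. */
struct kmem_cache_order_objects {
	unsigned long x;
};

/* Pack the slab order together with the number of objects that fit at that order. */
static inline struct kmem_cache_order_objects oo_make_sketch(unsigned int order,
							     unsigned long size)
{
	struct kmem_cache_order_objects oo = {
		((unsigned long)order << OO_SHIFT_SKETCH) +
			(PAGE_SIZE_SKETCH << order) / size
	};

	assert(size > 0);
	return oo;
}

static inline unsigned int oo_order_sketch(struct kmem_cache_order_objects oo)
{
	return (unsigned int)(oo.x >> OO_SHIFT_SKETCH);
}

static inline unsigned int oo_objects_sketch(struct kmem_cache_order_objects oo)
{
	return (unsigned int)(oo.x & OO_MASK_SKETCH);
}
```

A reader that copies the word once and then decodes it always sees an order and an object count that belong together, even if the cache's order is changed concurrently.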
      
      [penberg@cs.helsinki.fi: fix memset() page order in new_slab()]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
  28. 14 Apr, 2008 1 commit
    • slub: No need for per node slab counters if !SLUB_DEBUG · 0f389ec6
      Committed by Christoph Lameter
      The per node counters are used mainly for showing data through the sysfs API.
      If that API is not compiled in then there is no point in keeping track of this
      data. Disable counters for the number of slabs and the number of total slabs
      if !SLUB_DEBUG. Incrementing the per node counters is also accessing a
      potentially contended cacheline so this could actually be a performance
      benefit to embedded systems.
      
      SLABINFO support is also affected. It now must depend on SLUB_DEBUG (which
      is on by default).
      
      Patch also avoids a check for a NULL kmem_cache_node pointer in new_slab()
      if the system is not compiled with NUMA support.
      
      [penberg@cs.helsinki.fi: fix oops and move ->nr_slabs into CONFIG_SLUB_DEBUG]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
  29. 04 Mar, 2008 1 commit
  30. 15 Feb, 2008 3 commits