1. 17 Oct, 2007 2 commits
    • SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab · dfb4f096
      Committed by Christoph Lameter
      A remote free may access the same page struct that also contains the lockless
      freelist for the cpu slab. If objects have a short lifetime and are freed by
      a different processor then remote frees back to the slab from which we are
      currently allocating are frequent. The cacheline with the page struct needs
      to be repeatedly acquired in exclusive mode by both the allocating thread and
      the freeing thread. If this is frequent enough then performance will suffer
      because of cacheline bouncing.
      
      This patchset puts the lockless_freelist pointer in its own cacheline. In
      order to make that happen we introduce a per cpu structure called
      kmem_cache_cpu.
      
      Instead of keeping an array of pointers to page structs we now keep an array
      to a per cpu structure that--among other things--contains the pointer to the
      lockless freelist. The freeing thread can then keep possession of exclusive
      access to the page struct cacheline while the allocating thread keeps its
      exclusive access to the cacheline containing the per cpu structure.
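The layout change described above can be sketched in userspace C. This is only an illustration: the 64-byte cacheline, the CPU count and all `*_sketch` names and fields are assumptions made for the example, not the structures from the patch itself.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE_BYTES 64  /* assumption: the real value is architecture specific */
#define NR_CPUS_SKETCH  4   /* hypothetical cpu count */

/* Stand-in for the page struct: remote frees write to this cacheline. */
struct page_sketch {
    void *freelist;      /* objects freed back to the slab */
    unsigned int inuse;  /* number of objects handed out */
};

/* Per cpu structure: only the allocating cpu writes to this cacheline. */
struct kmem_cache_cpu_sketch {
    alignas(CACHELINE_BYTES) void **freelist; /* lockless freelist */
    struct page_sketch *page;                 /* current cpu slab */
    int node;                                 /* node of the cpu slab */
};

struct kmem_cache_sketch {
    /* Before the patch: struct page_sketch *cpu_slab[NR_CPUS_SKETCH]; */
    struct kmem_cache_cpu_sketch *cpu_slab[NR_CPUS_SKETCH];
};

/* The per cpu structure occupies exactly one cacheline of its own. */
static_assert(sizeof(struct kmem_cache_cpu_sketch) == CACHELINE_BYTES,
              "per cpu structure should fill one cacheline");
static_assert(alignof(struct kmem_cache_cpu_sketch) == CACHELINE_BYTES,
              "per cpu structure should start on a cacheline boundary");
```

Because the per cpu structure is padded and aligned to a full cacheline, two adjacent instances never share a line, and neither shares a line with the page struct.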
      
      This works as long as the allocating cpu is able to service its request
      from the lockless freelist. If the lockless freelist runs empty then the
      allocating thread needs to acquire exclusive access to the cacheline with
      the page struct and lock the slab.
      
      The allocating thread will then check if new objects were freed to the per
      cpu slab. If so it will keep the slab as the cpu slab and continue with the
      recently remote freed objects. So the allocating thread can take a series
      of just remotely freed objects and dish them out again. Ideally allocations
      could be just recycling objects in the same slab this way which will lead
      to an ideal allocation / remote free pattern.
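The takeover of remotely freed objects can be sketched as a single-threaded userspace simulation. The names `slab_sketch`, `cpu_sketch`, `remote_free` and `alloc_object` are hypothetical, and the slab lock that the real slow path would take is omitted.

```c
#include <assert.h>
#include <stddef.h>

struct slab_sketch {
    void *freelist;            /* objects freed back to the slab by other cpus */
};

struct cpu_sketch {
    void *freelist;            /* lockless per cpu freelist */
    struct slab_sketch *slab;  /* current cpu slab */
};

/* A free object stores the pointer to the next free object in its first word. */
static void *next_free(void *object)
{
    return *(void **)object;
}

/* Remote free: push the object onto the slab's own freelist. */
static void remote_free(struct slab_sketch *slab, void *object)
{
    assert(slab && object);
    *(void **)object = slab->freelist;
    slab->freelist = object;
}

/* Allocate from the per cpu list; when it is empty, take over whatever
 * was remotely freed to the cpu slab in the meantime. */
static void *alloc_object(struct cpu_sketch *c)
{
    void *object = c->freelist;

    if (!object) {
        /* Slow path: this is where the real code would lock the slab. */
        object = c->slab->freelist;
        c->slab->freelist = NULL;
        if (!object)
            return NULL;  /* nothing came back: a new slab is needed */
    }
    c->freelist = next_free(object);
    return object;
}
```

In this sketch the fast path touches only `cpu_sketch`, while `remote_free` touches only `slab_sketch`; the two meet only when the per cpu list runs empty.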
      
      The number of objects that can be handled in this way is limited by the
      capacity of one slab. Increasing slab size via slub_min_objects/
      slub_max_order may increase the number of objects and therefore performance.
      
      If the allocating thread runs out of objects and finds that no objects were
      put back by the remote processor then it will retrieve a new slab (from the
      partial lists or from the page allocator) and start with a whole
      new set of objects while the remote thread may still be freeing objects to
      the old cpu slab. This may then repeat until the new slab is also exhausted.
      If remote freeing has freed objects in the earlier slab then that earlier
      slab will now be on the partial freelist and the allocating thread will
      pick that slab next for allocation. So the loop is extended. However,
      both threads need to take the list_lock to make the swizzling via
      the partial list happen.
      
      It is likely that this kind of scheme will keep the objects being passed
      around to a small set that can be kept in the cpu caches leading to increased
      performance.
      
      More code cleanups become possible:
      
      - Instead of passing a cpu we can now pass a kmem_cache_cpu structure around.
        Allows reducing the number of parameters to various functions.
      - Can define a new node_match() function for NUMA to encapsulate locality
        checks.
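A locality check of the kind mentioned above might look like the following sketch. The structure, the `NUMA_SKETCH` switch (standing in for a NUMA build option) and the convention that -1 means "any node" are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_SKETCH 1  /* hypothetical stand-in for a NUMA-enabled build */

struct kmem_cache_cpu_sketch {
    int node;  /* node that the current cpu slab belongs to */
};

/* Return 1 if the cpu slab can satisfy a request for the given node.
 * Assumption: node == -1 means the caller does not care about locality. */
static int node_match(const struct kmem_cache_cpu_sketch *c, int node)
{
    assert(c != NULL);
#if NUMA_SKETCH
    if (node != -1 && c->node != node)
        return 0;
#endif
    return 1;
}
```

Keeping the check in one helper means that a non-NUMA build can compile it down to a constant.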
      
      Effect on allocations:
      
      Cachelines touched before this patch:
      
      	Write:	page cache struct and first cacheline of object
      
      Cachelines touched after this patch:
      
      	Write:	kmem_cache_cpu cacheline and first cacheline of object
      	Read: page cache struct (but see later patch that avoids touching
      		that cacheline)
      
      The handling when the lockless alloc list runs empty gets to be a bit more
      complicated since another cacheline has now to be written to. But that is
      halfway out of the hot path.
      
      Effect on freeing:
      
      Cachelines touched before this patch:
      
      	Write: page_struct and first cacheline of object
      
      Cachelines touched after this patch depending on how we free:
      
        Write(to cpu_slab):	kmem_cache_cpu struct and first cacheline of object
        Write(to other):	page struct and first cacheline of object
      
        Read(to cpu_slab):	page struct to id slab etc. (but see later patch that
        			avoids touching the page struct on free)
        Read(to other):	cpu local kmem_cache_cpu struct to verify it's not
        			the cpu slab.
      
      Summary:
      
      Pro:
      	- Distinct cachelines so that concurrent remote frees and local
      	  allocs on a cpuslab can occur without cacheline bouncing.
      	- Avoids potential bouncing cachelines because of neighboring
      	  per cpu pointer updates in kmem_cache's cpu_slab structure since
      	  it now grows to a cacheline (Therefore remove the comment
      	  that talks about that concern).
      
      Cons:
      	- Freeing objects now requires the reading of one additional
      	  cacheline. That can be mitigated for some cases by the following
      	  patches but it's not possible to completely eliminate these
      	  references.
      
      	- Memory usage grows slightly.
      
      	The size of each per cpu object is blown up from one word
      	(pointing to the page_struct) to one cacheline with various data.
      	So this is NR_CPUS*NR_SLABS*L1_BYTES more memory use. Let's say
      	NR_SLABS is 100 and a cache line size of 128 then we have just
      	increased SLAB metadata requirements by 12.8k per cpu.
      	(Another later patch reduces these requirements)
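The estimate above can be checked with a few lines of C. The helper name is hypothetical; the inputs (100 slab caches, 128-byte cachelines) are the ones quoted in the text.

```c
#include <assert.h>

/* Bytes used by per cpu structures: one cacheline per cpu per slab cache. */
static unsigned long percpu_struct_bytes(unsigned long nr_cpus,
                                         unsigned long nr_slabs,
                                         unsigned long l1_bytes)
{
    assert(l1_bytes > 0);
    return nr_cpus * nr_slabs * l1_bytes;
}
```

For one cpu, `percpu_struct_bytes(1, 100, 128)` gives 12800 bytes, which matches the 12.8k per cpu figure.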
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLUB: direct pass through of page size or higher kmalloc requests · aadb4bc4
      Committed by Christoph Lameter
      This gets rid of all kmalloc caches larger than page size.  A kmalloc
      request larger than PAGE_SIZE / 2 is going to be passed through to the page
      allocator.  This works both inline where we will call __get_free_pages
      instead of kmem_cache_alloc and in __kmalloc.
      
      kfree is modified to check if the object is in a slab page. If not then
      the page is freed via the page allocator instead. Roughly similar to what
      SLOB does.
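The dispatch on both paths can be sketched as below. This assumes a 4096-byte page and a threshold of half a page; the enum, the `page_sketch` flag and the helper names are hypothetical and only model the decision, not the actual allocation.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096UL  /* assumption: page size is architecture specific */

enum backend { BACKEND_SLAB, BACKEND_PAGE_ALLOCATOR };

/* Allocation side: large requests bypass the slab allocator. */
static enum backend kmalloc_backend(size_t size)
{
    return size > PAGE_SIZE_SKETCH / 2 ? BACKEND_PAGE_ALLOCATOR : BACKEND_SLAB;
}

/* Smallest order such that (page size << order) covers the request. */
static unsigned int order_for_size(size_t size)
{
    unsigned int order = 0;

    assert(size > 0);
    while ((PAGE_SIZE_SKETCH << order) < size)
        order++;
    return order;
}

/* Free side: the page tells us which allocator owns the object. */
struct page_sketch {
    int is_slab;  /* hypothetical flag: set for pages managed by the slab allocator */
};

static enum backend kfree_backend(const struct page_sketch *page)
{
    return page->is_slab ? BACKEND_SLAB : BACKEND_PAGE_ALLOCATOR;
}
```

Since the free path decides from the page rather than from the size, callers do not need to remember how an object was allocated.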
      
      Advantages:
      - Reduces memory overhead for kmalloc array
      - Large kmalloc operations are faster since they do not
        need to pass through the slab allocator to get to the
        page allocator.
      - Performance increase of 10%-20% on alloc and 50% on free for
        PAGE_SIZEd allocations.
        SLUB must call the page allocator for each alloc anyway since
        the higher order pages that allowed avoiding the page alloc calls
        are not available in a reliable way anymore. So we are basically removing
        useless slab allocator overhead.
      - Large kmallocs yield page aligned objects, which is what
        SLAB did. Bad things like using page sized kmalloc allocations to
        stand in for page allocator allocs can be transparently handled and are not
        distinguishable from page allocator uses.
      - Checking for too large objects can be removed since
        it is done by the page allocator.
      
      Drawbacks:
      - No accounting for large kmalloc slab allocations anymore
      - No debugging of large kmalloc slab allocations.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 31 Aug, 2007 1 commit
  3. 20 Jul, 2007 1 commit
  4. 18 Jul, 2007 3 commits
  5. 17 Jul, 2007 1 commit
  6. 17 Jun, 2007 1 commit
    • SLUB: minimum alignment fixes · 4b356be0
      Committed by Christoph Lameter
      If ARCH_KMALLOC_MINALIGN is set to a value greater than 8 (SLUB's smallest
      kmalloc cache) then SLUB may generate duplicate slabs in sysfs (yes again)
      because the object size is padded to reach ARCH_KMALLOC_MINALIGN.  Thus the
      size of the small slabs is all the same.
      
      No arch sets ARCH_KMALLOC_MINALIGN larger than 8 though except mips which
      for some reason wants a 128 byte alignment.
      
      This patch increases the size of the smallest cache if
      ARCH_KMALLOC_MINALIGN is greater than 8.  In that case more and more of the
      smallest caches are disabled.
      
      If we do that then the count of the active general caches that is displayed
      on boot is not correct anymore since we may skip elements of the kmalloc
      array.  So count them separately.
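The effect on the kmalloc array can be sketched as follows. This simplified model assumes power-of-two cache sizes only and a power-of-two alignment; the helper names are hypothetical.

```c
#include <assert.h>

#define SMALLEST_DEFAULT 8u  /* smallest kmalloc cache when no larger alignment is required */

/* Size of the smallest cache that is still usable for a given minimum alignment. */
static unsigned int kmalloc_min_size(unsigned int arch_minalign)
{
    return arch_minalign > SMALLEST_DEFAULT ? arch_minalign : SMALLEST_DEFAULT;
}

/* Count the power-of-two caches from the smallest usable size up to max_size. */
static unsigned int active_general_caches(unsigned int arch_minalign,
                                          unsigned int max_size)
{
    unsigned int count = 0;

    assert(max_size <= (1u << 30));  /* keeps the doubling below from overflowing */
    for (unsigned int size = kmalloc_min_size(arch_minalign);
         size <= max_size; size *= 2)
        count++;
    return count;
}
```

With a 128-byte minimum alignment and caches up to 2048 bytes, only 5 of the 9 power-of-two caches remain in this model, which is why the boot-time count has to be computed separately rather than derived from the array length.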
      
      This approach was tested by Haavard yesterday.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 09 Jun, 2007 1 commit
  8. 17 May, 2007 4 commits
  9. 15 May, 2007 1 commit
  10. 08 May, 2007 3 commits
    • slub: enable tracking of full slabs · 643b1138
      Committed by Christoph Lameter
      If slab tracking is on then build a list of full slabs so that we can verify
      the integrity of all slabs and are also able to build a list of alloc/free
      callers.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLUB: allocate smallest object size if the user asks for 0 bytes · 614410d5
      Committed by Christoph Lameter
      Makes SLUB behave like SLAB in this area to avoid issues....
      
      Throw a stack dump to alert people.
      
      At some point the behavior should be switched back.  NULL is no memory as
      far as I can tell and if the user asked for 0 bytes then he needs to get no
      memory.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLUB core · 81819f0f
      Committed by Christoph Lameter
      This is a new slab allocator which was motivated by the complexity of the
      existing code in mm/slab.c. It attempts to address a variety of concerns
      with the existing implementation.
      
      A. Management of object queues
      
         A particular concern was the complex management of the numerous object
         queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
         each allocating CPU and use objects from a slab directly instead of
         queueing them up.
      
      B. Storage overhead of object queues
      
         SLAB Object queues exist per node, per CPU. The alien cache queue even
         has a queue array that contains a queue for each processor on each
         node. For very large systems the number of queues and the number of
         objects that may be caught in those queues grows exponentially. On our
         systems with 1k nodes / processors we have several gigabytes just tied up
         for storing references to objects for those queues. This does not include
         the objects that could be on those queues. One fears that the whole
         memory of the machine could one day be consumed by those queues.
      
      C. SLAB meta data overhead
      
         SLAB has overhead at the beginning of each slab. This means that data
         cannot be naturally aligned at the beginning of a slab block. SLUB keeps
         all meta data in the corresponding page_struct. Objects can be naturally
         aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
         boundaries and can fit tightly into a 4k page with no bytes left over.
         SLAB cannot do this.
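The packing claim can be verified with a small calculation. The helper below is a sketch; the 32-byte on-slab header used for comparison is a made-up figure, not a measured SLAB value.

```c
#include <assert.h>
#include <stddef.h>

struct slab_layout {
    size_t objects;   /* number of objects that fit into the slab */
    size_t leftover;  /* unused bytes after the last object */
};

/* Compute how many objects fit when header_bytes of the slab are
 * reserved for metadata stored inside the slab itself. */
static struct slab_layout compute_slab_layout(size_t slab_bytes,
                                              size_t object_size,
                                              size_t header_bytes)
{
    struct slab_layout layout;

    assert(object_size > 0 && header_bytes <= slab_bytes);
    layout.objects = (slab_bytes - header_bytes) / object_size;
    layout.leftover = (slab_bytes - header_bytes) % object_size;
    return layout;
}
```

With no header, `compute_slab_layout(4096, 128, 0)` yields 32 objects and 0 leftover bytes. With the hypothetical 32-byte header it yields 31 objects and 96 wasted bytes, and the objects no longer start on 128-byte boundaries.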
      
      D. SLAB has a complex cache reaper
      
         SLUB does not need a cache reaper for UP systems. On SMP systems
         the per CPU slab may be pushed back into partial list but that
         operation is simple and does not require an iteration over a list
         of objects. SLAB expires per CPU, shared and alien object queues
         during cache reaping which may cause strange hold offs.
      
      E. SLAB has complex NUMA policy layer support
      
         SLUB pushes NUMA policy handling into the page allocator. This means that
         allocation is coarser (SLUB does interleave on a page level) but that
         situation was also present before 2.6.13. SLAB's application of
         policies to individual slab objects allocated in SLAB is
         certainly a performance concern due to the frequent references to
         memory policies which may lead a sequence of objects to come from
         one node after another. SLUB will get a slab full of objects
         from one node and then will switch to the next.
      
      F. Reduction of the size of partial slab lists
      
         SLAB has per node partial lists. This means that over time a large
         number of partial slabs may accumulate on those lists. These can
         only be reused if allocations occur on specific nodes. SLUB has a global
         pool of partial slabs and will consume slabs from that pool to
         decrease fragmentation.
      
      G. Tunables
      
         SLAB has sophisticated tuning abilities for each slab cache. One can
         manipulate the queue sizes in detail. However, filling the queues still
         requires the use of the spin lock to check out slabs. SLUB has a global
         parameter (slub_min_order) for tuning. Increasing the minimum slab
         order can decrease the locking overhead. The bigger the slab order the
         less motions of pages between per CPU and partial lists occur and the
         better SLUB will be scaling.
      
      G. Slab merging
      
         We often have slab caches with similar parameters. SLUB detects those
         on boot up and merges them into the corresponding general caches. This
         leads to more effective memory use. About 50% of all caches can
         be eliminated through slab merging. This will also decrease
         slab fragmentation because partially allocated slabs can be filled
         up again. Slab merging can be switched off by specifying
         slub_nomerge on boot up.
      
         Note that merging can expose heretofore unknown bugs in the kernel
         because corrupted objects may now be placed differently and corrupt
         differing neighboring objects. Enable sanity checks to find those.
      
      H. Diagnostics
      
         The current slab diagnostics are difficult to use and require a
         recompilation of the kernel. SLUB contains debugging code that
         is always available (but is kept out of the hot code paths).
         SLUB diagnostics can be enabled via the "slub_debug" option.
         Parameters can be specified to select a single or a group of
         slab caches for diagnostics. This means that the system is running
         with the usual performance and it is much more likely that
         race conditions can be reproduced.
      
      I. Resiliency
      
         If basic sanity checks are on then SLUB is capable of detecting
         common error conditions and recovering as best as possible to allow the
         system to continue.
      
      J. Tracing
      
         Tracing can be enabled via the slub_debug=T,<slabcache> option
         during boot. SLUB will then log all actions on that slabcache
         and dump the object contents on free.
      
      K. On demand DMA cache creation.
      
         Generally DMA caches are not needed. If a kmalloc is used with
         __GFP_DMA then just create this single slabcache that is needed.
         For systems that have no ZONE_DMA requirement the support is
         completely eliminated.
      
      L. Performance increase
      
         Some benchmarks have shown speed improvements on kernbench in the
         range of 5-10%. The locking overhead of slub is based on the
         underlying base allocation size. If we can reliably allocate
         larger order pages then it is possible to increase slub
         performance much further. The anti-fragmentation patches may
         enable further performance increases.
      
      Tested on:
      i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator
      
      SLUB Boot options
      
      slub_nomerge		Disable merging of slabs
      slub_min_order=x	Require a minimum order for slab caches. This
      			increases the managed chunk size and therefore
      			reduces meta data and locking overhead.
      slub_min_objects=x	Minimum objects per slab. Default is 8.
      slub_max_order=x	Avoid generating slabs larger than order specified.
      slub_debug		Enable all diagnostics for all caches
      slub_debug=<options>	Enable selective options for all caches
      slub_debug=<o>,<cache>	Enable selective options for a certain set of
      			caches
      
      Available Debug options
      F		Double Free checking, sanity and resiliency
      R		Red zoning
      P		Object / padding poisoning
      U		Track last free / alloc
      T		Trace all allocs / frees (only use for individual slabs).
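How the option letters listed above could map to flags is sketched below. The flag names and the parser are hypothetical; the sketch accepts upper case letters only and stops at the comma that introduces a cache name.

```c
#include <assert.h>
#include <stddef.h>

enum {
    DBG_SANITY  = 1 << 0,  /* F: double free checking, sanity and resiliency */
    DBG_REDZONE = 1 << 1,  /* R: red zoning */
    DBG_POISON  = 1 << 2,  /* P: object / padding poisoning */
    DBG_TRACK   = 1 << 3,  /* U: track last free / alloc */
    DBG_TRACE   = 1 << 4   /* T: trace all allocs / frees */
};

/* Parse the letters before an optional ",<cache>" suffix.
 * Returns the combined flags, or -1 if an unknown letter is found. */
static int parse_debug_options(const char *opts)
{
    int flags = 0;

    assert(opts != NULL);
    for (; *opts != '\0' && *opts != ','; opts++) {
        switch (*opts) {
        case 'F': flags |= DBG_SANITY;  break;
        case 'R': flags |= DBG_REDZONE; break;
        case 'P': flags |= DBG_POISON;  break;
        case 'U': flags |= DBG_TRACK;   break;
        case 'T': flags |= DBG_TRACE;   break;
        default:  return -1;
        }
    }
    return flags;
}
```

For example, `parse_debug_options("T,dentry")` returns only the trace flag, where `dentry` is just a placeholder cache name. An empty option string returns 0 here; treating a bare `slub_debug` as "all diagnostics" would be left to the caller.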
      
      To use SLUB: Apply this patch and then select SLUB as the default slab
      allocator.
      
      [hugh@veritas.com: fix an oops-causing locking error]
      [akpm@linux-foundation.org: various stupid cleanups and small fixes]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>