1. 17 Oct, 2007 9 commits
    • SLUB: Do not use page->mapping · 8e65d24c
      Christoph Lameter committed
      After moving the lockless_freelist to kmem_cache_cpu we no longer need
      page->lockless_freelist. Restructure the use of the struct page fields in
      such a way that we never touch the mapping field.
      
      This in turn allows us to remove the special casing of SLUB when determining
      the mapping of a page (needed for corner cases on machines with virtual caches
      that need to flush the caches of processors mapping a page).
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab · dfb4f096
      Christoph Lameter committed
      A remote free may access the same page struct that also contains the lockless
      freelist for the cpu slab. If objects have a short lifetime and are freed by
      a different processor then remote frees back to the slab from which we are
      currently allocating are frequent. The cacheline with the page struct needs
      to be repeatedly acquired in exclusive mode by both the allocating thread and
      the freeing thread. If this is frequent enough then performance will suffer
      because of cacheline bouncing.
      
      This patchset puts the lockless_freelist pointer in its own cacheline. In
      order to make that happen we introduce a per cpu structure called
      kmem_cache_cpu.
      
      Instead of keeping an array of pointers to page structs we now keep an array
      to a per cpu structure that--among other things--contains the pointer to the
      lockless freelist. The freeing thread can then keep possession of exclusive
      access to the page struct cacheline while the allocating thread keeps its
      exclusive access to the cacheline containing the per cpu structure.
      
      This works as long as the allocating cpu is able to service its request
      from the lockless freelist. If the lockless freelist runs empty then the
      allocating thread needs to acquire exclusive access to the cacheline with
      the page struct and lock the slab.
      
      The allocating thread will then check if new objects were freed to the per
      cpu slab. If so it will keep the slab as the cpu slab and continue with the
      recently remote freed objects. So the allocating thread can take a series
      of just freed remote pages and dish them out again. Ideally allocations
      could be just recycling objects in the same slab this way which will lead
      to an ideal allocation / remote free pattern.
      
      The number of objects that can be handled in this way is limited by the
      capacity of one slab. Increasing slab size via slub_min_objects/
      slub_max_order may increase the number of objects and therefore performance.
      
      If the allocating thread runs out of objects and finds that no objects were
      put back by the remote processor then it will retrieve a new slab (from the
      partial lists or from the page allocator) and start with a whole
      new set of objects while the remote thread may still be freeing objects to
      the old cpu slab. This may then repeat until the new slab is also exhausted.
      If remote freeing has freed objects in the earlier slab then that earlier
      slab will now be on the partial freelist and the allocating thread will
      pick that slab next for allocation. So the loop is extended. However,
      both threads need to take the list_lock to make the swizzling via
      the partial list happen.
      
      It is likely that this kind of scheme will keep the objects being passed
      around to a small set that can be kept in the cpu caches leading to increased
      performance.
      
      More code cleanups become possible:
      
      - Instead of passing a cpu we can now pass a kmem_cache_cpu structure around.
        Allows reducing the number of parameters to various functions.
      - Can define a new node_match() function for NUMA to encapsulate locality
        checks.
      
      Effect on allocations:
      
      Cachelines touched before this patch:
      
      	Write:	page cache struct and first cacheline of object
      
      Cachelines touched after this patch:
      
      	Write:	kmem_cache_cpu cacheline and first cacheline of object
      	Read: page cache struct (but see later patch that avoids touching
      		that cacheline)
      
      The handling when the lockless alloc list runs empty gets to be a bit more
      complicated since another cacheline has now to be written to. But that is
      halfway out of the hot path.
      
      Effect on freeing:
      
      Cachelines touched before this patch:
      
      	Write: page_struct and first cacheline of object
      
      Cachelines touched after this patch depending on how we free:
      
        Write(to cpu_slab):	kmem_cache_cpu struct and first cacheline of object
        Write(to other):	page struct and first cacheline of object
      
        Read(to cpu_slab):	page struct to id slab etc. (but see later patch that
        			avoids touching the page struct on free)
        Read(to other):	cpu local kmem_cache_cpu struct to verify it is not
        			the cpu slab.
      
      Summary:
      
      Pro:
      	- Distinct cachelines so that concurrent remote frees and local
      	  allocs on a cpuslab can occur without cacheline bouncing.
      	- Avoids potential bouncing cachelines because of neighboring
      	  per cpu pointer updates in kmem_cache's cpu_slab structure since
      	  it now grows to a cacheline (Therefore remove the comment
      	  that talks about that concern).
      
      Cons:
      	- Freeing objects now requires the reading of one additional
      	  cacheline. That can be mitigated for some cases by the following
      	  patches but it is not possible to completely eliminate these
      	  references.
      
      	- Memory usage grows slightly.
      
      	The size of each per cpu object is blown up from one word
      	(pointing to the page_struct) to one cacheline with various data.
      	So this is NR_CPUS*NR_SLABS*L1_BYTES more memory use. Let's say
      	NR_SLABS is 100 and a cache line size of 128 then we have just
      	increased SLAB metadata requirements by 12.8k per cpu.
      	(Another later patch reduces these requirements)
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
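The cacheline split and the memory estimate described in this commit can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions: `page_model`, `cpu_slab_model`, `fast_alloc()` and `metadata_bytes_per_cpu()` are hypothetical names, the field layout is illustrative rather than the kernel's actual `kmem_cache_cpu`, and the 128 byte line size is taken from the commit's own example. The alignment attribute is a GCC/Clang extension.

```c
#include <stddef.h>

#define CACHELINE_BYTES 128 /* assumed line size, matching the commit's example */

/* Simplified stand-in for struct page: remote frees write here. */
struct page_model {
	void *freelist;
	unsigned int inuse;
};

/*
 * Hypothetical per cpu structure, padded to one cacheline so that the
 * allocating cpu's freelist pointer never shares a line with page_model.
 */
struct cpu_slab_model {
	void **lockless_freelist; /* only touched by the owning cpu */
	struct page_model *page;  /* slab this cpu currently allocates from */
	int node;
} __attribute__((aligned(CACHELINE_BYTES)));

/*
 * Fast path: pop the first object. Each free object stores a pointer to
 * the next free object in its first word. Returns NULL when the list is
 * empty, which is where a real allocator would lock the page instead.
 */
static void *fast_alloc(struct cpu_slab_model *c)
{
	void **object = c->lockless_freelist;

	if (!object)
		return NULL;
	c->lockless_freelist = *object;
	return object;
}

/* Metadata cost per cpu: one padded structure per slab cache. */
static size_t metadata_bytes_per_cpu(size_t nr_slabs)
{
	return nr_slabs * sizeof(struct cpu_slab_model);
}
```

With 100 slab caches and the assumed 128 byte line, `metadata_bytes_per_cpu(100)` gives 12800 bytes, which is the 12.8k per cpu figure quoted in the commit message.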
    • Group short-lived and reclaimable kernel allocations · e12ba74d
      Mel Gorman committed
      This patch marks a number of allocations that are either short-lived such as
      network buffers or are reclaimable such as inode allocations.  When something
      like updatedb is called, long-lived and unmovable kernel allocations tend to
      be spread throughout the address space which increases fragmentation.
      
      This patch groups these allocations together as much as possible by adding a
      new MIGRATE_TYPE.  The MIGRATE_RECLAIMABLE type is for allocations that can be
      reclaimed on demand, but not moved.  i.e.  they can be migrated by deleting
      them and re-reading the information from elsewhere.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
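One way to picture the grouping is as a mapping from allocation flags to a migrate type. The sketch below is an illustration only: the `X_` prefixed flag bits and enum values are hypothetical placeholders, not the kernel's definitions, and `flags_to_migratetype()` is an assumed helper name. Only the idea of a separate reclaimable class comes from the commit message.

```c
/* Hypothetical flag bits; the real __GFP_* values are not given in the text. */
#define X_GFP_RECLAIMABLE 0x1u
#define X_GFP_MOVABLE     0x2u

/* Hypothetical numbering of the migrate types. */
enum migrate_type_model {
	X_MIGRATE_UNMOVABLE   = 0, /* long-lived kernel allocations */
	X_MIGRATE_RECLAIMABLE = 1, /* can be dropped and re-read, but not moved */
	X_MIGRATE_MOVABLE     = 2, /* can be relocated */
};

/* Pick the group an allocation is placed in, based on its flags. */
static enum migrate_type_model flags_to_migratetype(unsigned int gfp_flags)
{
	if (gfp_flags & X_GFP_MOVABLE)
		return X_MIGRATE_MOVABLE;
	if (gfp_flags & X_GFP_RECLAIMABLE)
		return X_MIGRATE_RECLAIMABLE;
	return X_MIGRATE_UNMOVABLE;
}
```

An inode cache allocation would, under these assumptions, carry the reclaimable flag and land in the reclaimable group, away from unmovable allocations.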
    • Categorize GFP flags · 6cb06229
      Christoph Lameter committed
      The function of GFP_LEVEL_MASK seems to be unclear.  In order to clear up
      the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP
      flags:
      
      GFP_RECLAIM_MASK	Flags used to control page allocator reclaim behavior.
      
      GFP_CONSTRAINT_MASK	Flags used to limit where allocations can occur.
      
      GFP_SLAB_BUG_MASK	Flags that the slab allocator BUG()s on.
      
      These replace the uses of GFP_LEVEL_MASK in the slab allocators and in
      vmalloc.c.
      
      The use of the flags not included in these sets may occur as a result of a
      slab allocation standing in for a page allocation when constructing scatter
      gather lists.  Extraneous flags are cleared and not passed through to the
      page allocator.  __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will
      now be ignored if passed to a slab allocator.
      
      Change the allocation of allocator meta data in SLAB and vmalloc to not
      pass through flags listed in GFP_CONSTRAINT_MASK.  SLAB already removes the
      __GFP_THISNODE flag for such allocations.  Generalize that to also cover
      vmalloc.  The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL.
      
      The impact of allocator metadata placement on access latency to the
      cachelines of the object itself is minimal since metadata is only
      referenced on alloc and free.  The attempt is still made to place the meta
      data optimally but we consistently allow fallback both in SLAB and vmalloc
      (SLUB does not need to allocate metadata like that).
      
      Allocator metadata may serve multiple in-kernel users and thus should not
      be subject to the limitations arising from a single allocation context.
      
      [akpm@linux-foundation.org: fix fallback_alloc()]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
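The three flag sets can be sketched as bit masks plus two small filters. Everything prefixed with `X_` below is a hypothetical placeholder: the bit values, and which individual flags belong to which mask, are assumptions made for illustration. Only the three mask names and their stated purposes come from the commit message.

```c
/* Hypothetical bit assignments for a handful of flags. */
#define X_GFP_WAIT     0x01u
#define X_GFP_IO       0x02u
#define X_GFP_HARDWALL 0x04u
#define X_GFP_THISNODE 0x08u
#define X_GFP_COLD     0x10u
#define X_GFP_COMP     0x20u
#define X_GFP_HIGHMEM  0x40u

/* Assumed membership of each set, for illustration only. */
#define X_GFP_RECLAIM_MASK    (X_GFP_WAIT | X_GFP_IO)
#define X_GFP_CONSTRAINT_MASK (X_GFP_HARDWALL | X_GFP_THISNODE)
#define X_GFP_SLAB_BUG_MASK   (X_GFP_HIGHMEM)

/*
 * Filter flags on entry to a slab allocator. Returns -1 where a real
 * allocator would BUG(); otherwise stores the flags that are passed on
 * to the page allocator and returns 0. Extraneous flags are dropped.
 */
static int slab_filter_flags(unsigned int flags, unsigned int *page_flags)
{
	if (flags & X_GFP_SLAB_BUG_MASK)
		return -1;
	*page_flags = flags & (X_GFP_RECLAIM_MASK | X_GFP_CONSTRAINT_MASK);
	return 0;
}

/* Allocator metadata ignores placement constraints so fallback is allowed. */
static unsigned int metadata_flags(unsigned int flags)
{
	return flags & ~X_GFP_CONSTRAINT_MASK;
}
```

In this model a request carrying `X_GFP_COLD` or `X_GFP_COMP` is accepted, but those bits never reach the page allocator, which mirrors the "cleared and not passed through" behavior described above.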
    • Memoryless nodes: SLUB support · f64dc58c
      Christoph Lameter committed
      Simply switch all for_each_online_node to for_each_node_state(NORMAL_MEMORY).
      That way SLUB only operates on nodes with regular memory.  Any allocation
      attempt on a memoryless node or a node with just highmem will fail, whereupon
      SLUB will fetch memory from a nearby node (depending on how memory policies
      and cpuset describe fallback).
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Bob Picco <bob.picco@hp.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Slab allocators: fail if ksize is called with a NULL parameter · ef8b4520
      Christoph Lameter committed
      A NULL pointer means that the object was not allocated.  One cannot
      determine the size of an object that has not been allocated.  Currently we
      return 0 but we really should BUG() on attempts to determine the size of
      something nonexistent.
      
      krealloc() interprets NULL to mean a zero sized object.  Handle that
      separately in krealloc().
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
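The division of responsibility between ksize() and krealloc() can be illustrated with a small userspace model. This is a sketch, not kernel code: `model_kmalloc()`, `model_kfree()`, `model_ksize()` and `model_krealloc()` are hypothetical names, sizes are tracked in a header placed in front of each object, and `assert()` stands in for BUG().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical size header stored in front of every object. */
struct hdr {
	size_t size;
};

static void *model_kmalloc(size_t n)
{
	struct hdr *h = malloc(sizeof(*h) + n);

	if (!h)
		return NULL;
	h->size = n;
	return h + 1;
}

static void model_kfree(void *p)
{
	if (p)
		free((struct hdr *)p - 1);
}

/* Asking for the size of a nonexistent object is a caller bug. */
static size_t model_ksize(const void *p)
{
	assert(p != NULL); /* stands in for BUG() */
	return ((const struct hdr *)p - 1)->size;
}

/* NULL means "zero sized object" here, so it is handled before ksize. */
static void *model_krealloc(void *p, size_t n)
{
	size_t old = p ? model_ksize(p) : 0;
	void *q;

	if (p && n <= old)
		return p; /* existing object is already large enough */
	q = model_kmalloc(n);
	if (!q)
		return NULL;
	if (old)
		memcpy(q, p, old);
	model_kfree(p);
	return q;
}
```

Calling `model_krealloc(NULL, 4)` behaves like a plain allocation, while `model_ksize(NULL)` trips the assertion, which is the split the commit message describes.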
    • {slub, slob}: use unlikely() for kfree(ZERO_OR_NULL_PTR) check · 2408c550
      Satyam Sharma committed
      kfree(NULL) would normally occur only in error paths and
      kfree(ZERO_SIZE_PTR) is uncommon as well, so let's use unlikely() for the
      condition check in SLUB's and SLOB's kfree() to optimize for the common
      case.  SLAB has this already.
      Signed-off-by: Satyam Sharma <satyam@infradead.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
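The branch hint can be sketched in userspace with the GCC/Clang builtin `__builtin_expect`. The definitions below are written for this illustration: the `X_` prefixed macros and `kfree_model()` are hypothetical, and the sentinel value 16 is an assumption rather than something stated in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Tell the compiler the condition is expected to be false. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Assumed sentinel for zero sized allocations (illustrative value). */
#define X_ZERO_SIZE_PTR ((void *)16)

/* One comparison covers both NULL and the zero size sentinel. */
#define X_ZERO_OR_NULL_PTR(p) ((uintptr_t)(p) <= (uintptr_t)X_ZERO_SIZE_PTR)

/* Returns true if there was a real object to free. */
static bool kfree_model(const void *p)
{
	if (unlikely(X_ZERO_OR_NULL_PTR(p)))
		return false; /* rare: nothing to do */
	/* common case: a real allocator would release the object here */
	return true;
}
```

The hint does not change the result of the check; it only lets the compiler lay out the common path as the fall-through case.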
    • SLUB: direct pass through of page size or higher kmalloc requests · aadb4bc4
      Christoph Lameter committed
      This gets rid of all kmalloc caches larger than page size.  A kmalloc
      request larger than PAGE_SIZE / 2 is going to be passed through to the page
      allocator.  This works both inline where we will call __get_free_pages
      instead of kmem_cache_alloc and in __kmalloc.
      
      kfree is modified to check if the object is in a slab page. If not then
      the page is freed via the page allocator instead. Roughly similar to what
      SLOB does.
      
      Advantages:
      - Reduces memory overhead for kmalloc array
      - Large kmalloc operations are faster since they do not
        need to pass through the slab allocator to get to the
        page allocator.
      - Performance increase of 10%-20% on alloc and 50% on free for
        PAGE_SIZEd allocations.
        SLUB must call the page allocator for each alloc anyway since
        the higher order pages that allowed avoiding the page alloc calls
        are not available in a reliable way anymore. So we are basically removing
        useless slab allocator overhead.
      - Large kmallocs yield page aligned objects, which is what
        SLAB did. Bad things like using page sized kmalloc allocations to
        stand in for page allocator allocs can be transparently handled and are not
        distinguishable from page allocator uses.
      - Checking for too large objects can be removed since
        it is done by the page allocator.
      
      Drawbacks:
      - No accounting for large kmalloc slab allocations anymore
      - No debugging of large kmalloc slab allocations.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
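The routing decision can be sketched as follows, assuming a threshold of half a page and a 4096 byte page. `X_PAGE_SIZE`, `kmalloc_path()` and `order_for()` are hypothetical names for this illustration and do not reproduce the kernel's inline kmalloc.

```c
#include <stddef.h>

#define X_PAGE_SIZE 4096u /* assumed page size */

enum alloc_path {
	PATH_SLAB,           /* served from a kmalloc slab cache */
	PATH_PAGE_ALLOCATOR, /* passed straight to the page allocator */
};

/* Requests above half a page bypass the slab caches (assumed threshold). */
static enum alloc_path kmalloc_path(size_t size)
{
	return size > X_PAGE_SIZE / 2 ? PATH_PAGE_ALLOCATOR : PATH_SLAB;
}

/* Smallest order such that (X_PAGE_SIZE << order) >= size. */
static unsigned int order_for(size_t size)
{
	unsigned int order = 0;
	size_t span = X_PAGE_SIZE;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

Under these assumptions a 2048 byte request still uses a slab cache, a 2049 byte request goes to the page allocator as an order 0 page, and a 16384 byte request becomes an order 2 allocation.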
    • slub.c:early_kmem_cache_node_alloc() shouldn't be __init · 1cd7daa5
      Adrian Bunk committed
      WARNING: mm/built-in.o(.text+0x24bd3): Section mismatch: reference to .init.text:early_kmem_cache_node_alloc (between 'init_kmem_cache_nodes' and 'calculate_sizes')
      ...
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 12 Sep, 2007 1 commit
  3. 31 Aug, 2007 1 commit
  4. 23 Aug, 2007 2 commits
  5. 10 Aug, 2007 2 commits
    • SLUB: Fix dynamic dma kmalloc cache creation · 1ceef402
      Christoph Lameter committed
      The dynamic dma kmalloc creation can run into trouble if a
      GFP_ATOMIC allocation is the first one performed for a certain size
      of dma kmalloc slab.
      
      - Move the adding of the slab to sysfs into a workqueue
        (sysfs does GFP_KERNEL allocations)
      - Do not call kmem_cache_destroy() (uses slub_lock)
      - Only acquire the slub_lock once and--if we cannot wait--do a trylock.
      
        This introduces a slight risk of the first kmalloc(x, GFP_DMA|GFP_ATOMIC)
        for a range of sizes failing due to another process holding the slub_lock.
        However, we only need to acquire the spinlock once in order to establish
        each power of two DMA kmalloc cache. The possible conflict is with the
        slub_lock taken during slab management actions (create / remove slab cache).
      
        It is rather typical that a driver will first fill its buffers using
        GFP_KERNEL allocations which will wait until the slub_lock can be acquired.
        Drivers will also create their slab caches first outside of an atomic
        context before starting to use atomic kmalloc from an interrupt context.
      
        If there are any failures then they will occur early after boot or when
        loading multiple drivers concurrently. Drivers can already accommodate
        failures of GFP_ATOMIC for other reasons. Retries will then create the slab.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
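The "lock if we may wait, otherwise trylock" rule can be modeled with a POSIX mutex. This is a userspace sketch under stated assumptions: `model_lock` stands in for slub_lock, `create_dma_cache()` and `cache_created` are hypothetical names, and the sysfs registration that the commit defers to a workqueue is not modeled.

```c
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for slub_lock. */
static pthread_mutex_t model_lock = PTHREAD_MUTEX_INITIALIZER;
static bool cache_created;

/*
 * Try to establish the cache. Callers that may sleep block on the lock;
 * atomic callers only try once and report failure if the lock is busy,
 * leaving it to a later retry to create the cache.
 */
static bool create_dma_cache(bool can_wait)
{
	if (can_wait)
		pthread_mutex_lock(&model_lock);
	else if (pthread_mutex_trylock(&model_lock) != 0)
		return false;

	cache_created = true; /* the lock is taken only once per cache */
	pthread_mutex_unlock(&model_lock);
	return true;
}
```

If another slab management action holds the lock, an atomic caller gets `false` back, which corresponds to the early GFP_ATOMIC failure the commit message accepts as a slight risk.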
    • SLUB: Remove checks for MAX_PARTIAL from kmem_cache_shrink · fcda3d89
      Christoph Lameter committed
      The MAX_PARTIAL checks were supposed to be an optimization. However, slab
      shrinking is a manually triggered process either through running slabinfo
      or by the kernel calling kmem_cache_shrink.
      
      If one really wants to shrink a slab then all operations should be done
      regardless of the size of the partial list. This also fixes an issue that
      could surface if the number of partial slabs was initially above MAX_PARTIAL
      in kmem_cache_shrink and later drops below MAX_PARTIAL through the
      elimination of empty slabs on the partial list (rare). In that case a few
      slabs may be left off the partial list (and only be put back when they
      are empty).
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
  6. 31 Jul, 2007 2 commits
  7. 20 Jul, 2007 2 commits
    • mm: Remove slab destructors from kmem_cache_create(). · 20c2df83
      Paul Mundt committed
      Slab destructors were no longer supported after Christoph's
      c59def9f change. They've been
      BUGs for both slab and slub, and slob never supported them
      either.
      
      This rips out support for the dtor pointer from kmem_cache_create()
      completely and fixes up every single callsite in the kernel (there were
      about 224, not including the slab allocator definitions themselves,
      or the documentation references).
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
    • slub: fix ksize() for zero-sized pointers · 9550b105
      Linus Torvalds committed
      The slab and slob allocators already did this right, but slub would call
      "get_object_page()" on the magic ZERO_SIZE_PTR, with all kinds of nasty
      end results.
      
      Noted by Ingo Molnar.
      
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 18 Jul, 2007 20 commits
  9. 17 Jul, 2007 1 commit