1. 14 April 2008, 1 commit
    • slub: No need for per node slab counters if !SLUB_DEBUG · 0f389ec6
      Committed by Christoph Lameter
      The per node counters are used mainly for showing data through the sysfs API.
      If that API is not compiled in then there is no point in keeping track of this
      data. Disable counters for the number of slabs and the number of total slabs
      if !SLUB_DEBUG. Incrementing the per node counters is also accessing a
      potentially contended cacheline so this could actually be a performance
      benefit to embedded systems.
      
      SLABINFO support is also affected. It now must depend on SLUB_DEBUG (which
      is on by default).
      
      Patch also avoids a check for a NULL kmem_cache_node pointer in new_slab()
      if the system is not compiled with NUMA support.
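
      Below is a minimal sketch of the resulting layout, assuming field and helper
      names in the spirit of the SLUB code of that era; only the #ifdef structure
      is the point, not the exact mainline definitions.

      struct kmem_cache_node {
              spinlock_t list_lock;
              unsigned long nr_partial;
              struct list_head partial;
      #ifdef CONFIG_SLUB_DEBUG
              atomic_long_t nr_slabs;         /* only needed for the sysfs API */
              struct list_head full;
      #endif
      };

      #ifdef CONFIG_SLUB_DEBUG
      static inline void inc_slabs_node(struct kmem_cache_node *n)
      {
              atomic_long_inc(&n->nr_slabs);  /* potentially contended cacheline */
      }
      #else
      static inline void inc_slabs_node(struct kmem_cache_node *n) {}
      #endif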
      
      [penberg@cs.helsinki.fi: fix oops and move ->nr_slabs into CONFIG_SLUB_DEBUG]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
  2. 04 March 2008, 1 commit
  3. 15 February 2008, 3 commits
  4. 08 February 2008, 1 commit
    • SLUB: Support for performance statistics · 8ff12cfc
      Committed by Christoph Lameter
      The statistics provided here allow the monitoring of allocator behavior but
      at the cost of some (minimal) loss of performance. Counters are placed in
      SLUB's per cpu data structure. The per cpu structure may be extended by the
      statistics to grow larger than one cacheline which will increase the cache
      footprint of SLUB.
      
      There is a compile option to enable/disable the inclusion of the runtime
      statistics; it is off by default.
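
      A sketch of the counter placement, assuming the CONFIG_SLUB_STATS option and
      an enum of counter items chosen here to match the report below; the exact
      mainline names are not guaranteed.

      enum stat_item {
              ALLOC_FASTPATH, ALLOC_SLOWPATH,
              FREE_FASTPATH, FREE_SLOWPATH,
              FREE_ADD_PARTIAL, FREE_REMOVE_PARTIAL,
              NR_SLUB_STAT_ITEMS
      };

      struct kmem_cache_cpu {
              void **freelist;                        /* hot allocator state */
              struct page *page;
      #ifdef CONFIG_SLUB_STATS
              unsigned stat[NR_SLUB_STAT_ITEMS];      /* may push past one cacheline */
      #endif
      };

      static inline void stat(struct kmem_cache_cpu *c, enum stat_item si)
      {
      #ifdef CONFIG_SLUB_STATS
              c->stat[si]++;          /* compiles away when statistics are off */
      #endif
      }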
      
      The slabinfo tool is enhanced to support these statistics via the following options:
      
      -D 	Switches the line of information displayed for a slab from size
      	mode to activity mode.
      
      -A	Sorts the slabs displayed by activity. This allows the display of
      	the slabs most important to the performance of a certain load.
      
      -r	Report option: reports detailed statistics on a slab cache.
      
      Example (tbench load):
      
      slabinfo -AD		->Shows the most active slabs
      
      Name                   Objects    Alloc     Free   %Fast (alloc free)
      skbuff_fclone_cache         33 111953835 111953835  99  99
      :0000192                  2666  5283688  5281047  99  99
      :0001024                   849  5247230  5246389  83  83
      vm_area_struct            1349   119642   118355  91  22
      :0004096                    15    66753    66751  98  98
      :0000064                  2067    25297    23383  98  78
      dentry                   10259    28635    18464  91  45
      :0000080                 11004    18950     8089  98  98
      :0000096                  1703    12358    10784  99  98
      :0000128                   762    10582     9875  94  18
      :0000512                   184     9807     9647  95  81
      :0002048                   479     9669     9195  83  65
      anon_vma                   777     9461     9002  99  71
      kmalloc-8                 6492     9981     5624  99  97
      :0000768                   258     7174     6931  58  15
      
      So the skbuff_fclone_cache is of highest importance for the tbench load.
      Pretty high load on the 192 sized slab. Look for the aliases
      
      slabinfo -a | grep 000192
      :0000192     <- xfs_btree_cur filp kmalloc-192 uid_cache tw_sock_TCP
      	request_sock_TCPv6 tw_sock_TCPv6 skbuff_head_cache xfs_ili
      
      Likely skbuff_head_cache.
      
      
      Looking into the statistics of the skbuff_fclone_cache is possible through
      
      slabinfo skbuff_fclone_cache	->-r option implied if cache name is mentioned
      
      
      .... Usual output ...
      
      Slab Perf Counter       Alloc     Free %Al %Fr
      --------------------------------------------------
      Fastpath             111953360 111946981  99  99
      Slowpath                 1044     7423   0   0
      Page Alloc                272      264   0   0
      Add partial                25      325   0   0
      Remove partial             86      264   0   0
      RemoteObj/SlabFrozen      350     4832   0   0
      Total                111954404 111954404
      
      Flushes       49 Refill        0
      Deactivate Full=325(92%) Empty=0(0%) ToHead=24(6%) ToTail=1(0%)
      
      Looks good because the fastpath is overwhelmingly taken.
      
      
      skbuff_head_cache:
      
      Slab Perf Counter       Alloc     Free %Al %Fr
      --------------------------------------------------
      Fastpath              5297262  5259882  99  99
      Slowpath                 4477    39586   0   0
      Page Alloc                937      824   0   0
      Add partial                 0     2515   0   0
      Remove partial           1691      824   0   0
      RemoteObj/SlabFrozen     2621     9684   0   0
      Total                 5301739  5299468
      
      Deactivate Full=2620(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%)
      
      
      Descriptions of the output:
      
      Total:		The total number of allocations and frees that occurred for a
      		slab
      
      Fastpath:	The number of allocations/frees that used the fastpath.
      
      Slowpath:	Other allocations
      
      Page Alloc:	Number of calls to the page allocator as a result of slowpath
      		processing
      
      Add Partial:	Number of slabs added to the partial list through free or
      		alloc (occurs during cpuslab flushes)
      
      Remove Partial:	Number of slabs removed from the partial list as a result of
      		allocations retrieving a partial slab or by a free freeing
      		the last object of a slab.
      
      RemoteObj/Froz:	How many times remotely freed objects were encountered when a
      		slab was about to be deactivated. Frozen: how many times a
      		free was able to skip list processing because the slab was in use
      		as the cpuslab of another processor.
      
      Flushes:	Number of times the cpuslab was flushed on request
      		(kmem_cache_shrink, may result from races in __slab_alloc)
      
      Refill:		Number of times we were able to refill the cpuslab from
      		remotely freed objects for the same slab.
      
      Deactivate:	Statistics on how slabs were deactivated. Shows how they were
      		put onto the partial list.
      
      In general the fastpath is very good. Slowpath without partial list processing is
      also desirable. Any touching of the partial list uses node specific locks which
      may potentially cause list lock contention.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
  5. 05 February 2008, 2 commits
  6. 03 January 2008, 1 commit
  7. 02 January 2008, 1 commit
  8. 17 October 2007, 6 commits
    • Slab API: remove useless ctor parameter and reorder parameters · 4ba9b9d0
      Committed by Christoph Lameter
      Slab constructors currently have a flags parameter that is never used.  And
      the order of the arguments is opposite to other slab functions.  The object
      pointer is placed before the kmem_cache pointer.
      
      Convert
      
              ctor(void *object, struct kmem_cache *s, unsigned long flags)
      
      to
      
              ctor(struct kmem_cache *s, void *object)
      
      throughout the kernel
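
      For illustration, a constructor under the new ordering might look like the
      sketch below; the "example" structure and cache are hypothetical, and only
      the signature reflects the conversion.

      struct example {
              spinlock_t lock;
      };

      static void example_ctor(struct kmem_cache *s, void *object)
      {
              struct example *e = object;

              spin_lock_init(&e->lock);       /* one-time init when a slab is populated */
      }

      /* created as before, e.g.:
       * kmem_cache_create("example", sizeof(struct example), 0, 0, example_ctor);
       */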
      
      [akpm@linux-foundation.org: coupla fixes]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLUB: Optimize cacheline use for zeroing · 42a9fdbb
      Committed by Christoph Lameter
      We touch a cacheline in the kmem_cache structure for zeroing just to get the
      object size. However, the hot paths in slab_alloc and slab_free do not reference
      any other fields in kmem_cache, so we end up bringing in that cacheline for
      this one access.
      
      Add a new field to kmem_cache_cpu that contains the object size. That
      cacheline must already be used in the hotpaths. So we save one cacheline
      on every slab_alloc if we zero.
      
      We need to update the kmem_cache_cpu object size if an aliasing operation
      changes the objsize of a non-debug slab.
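
      An illustrative helper (not the actual patch) showing the idea: the zeroing
      path reads the size from the hot kmem_cache_cpu cacheline rather than from
      struct kmem_cache.

      static __always_inline void *maybe_zero(struct kmem_cache_cpu *c,
                                              void *object, gfp_t gfpflags)
      {
              if (object && unlikely(gfpflags & __GFP_ZERO))
                      memset(object, 0, c->objsize);  /* mirrors kmem_cache->objsize */
              return object;
      }
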
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLUB: Place kmem_cache_cpu structures in a NUMA aware way · 4c93c355
      Committed by Christoph Lameter
      The kmem_cache_cpu structures introduced are currently an array placed in the
      kmem_cache struct, which means the kmem_cache_cpu structures are overwhelmingly
      on the wrong node for systems with a larger number of nodes. These are
      performance critical structures since this per cpu information has
      to be touched for every alloc and free in a slab.
      
      In order to place the kmem_cache_cpu structure optimally we put an array
      of pointers to kmem_cache_cpu structs in kmem_cache (similar to SLAB).
      
      However, the kmem_cache_cpu structures can now be allocated in a more
      intelligent way.
      
      We would like to put per cpu structures for the same cpu but different
      slab caches in cachelines together to save space and decrease the cache
      footprint. However, the slab allocator itself controls only allocations
      per node. We set up a simple per cpu array for every processor with
      100 per cpu structures which is usually enough to get them all set up right.
      If we run out then we fall back to kmalloc_node. This also solves the
      bootstrap problem since we do not have to use slab allocator functions
      early in boot to get memory for the small per cpu structures.
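
      A sketch of the allocation scheme described above; NR_KMEM_CACHE_CPU and the
      static-pool helper are assumptions, while kmalloc_node(), cpu_to_node() and
      cache_line_size() are the real kernel interfaces.

      #define NR_KMEM_CACHE_CPU 100

      /* small static pool for each cpu, usable before the slab allocator is up */
      static DEFINE_PER_CPU(struct kmem_cache_cpu,
                            kmem_cache_cpu_pool)[NR_KMEM_CACHE_CPU];

      static struct kmem_cache_cpu *alloc_kmem_cache_cpu(int cpu, gfp_t flags)
      {
              struct kmem_cache_cpu *c = take_from_static_pool(cpu); /* hypothetical */

              if (!c)         /* pool exhausted: allocate on the cpu's own node */
                      c = kmalloc_node(ALIGN(sizeof(*c), cache_line_size()),
                                       flags, cpu_to_node(cpu));
              return c;
      }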
      
      Pro:
      	- NUMA aware placement improves memory performance
      	- All global structures in struct kmem_cache become readonly
      	- Dense packing of per cpu structures reduces cacheline
      	  footprint in SMP and NUMA.
      	- Potential avoidance of exclusive cacheline fetches
      	  on the free and alloc hotpath since multiple kmem_cache_cpu
      	  structures are in one cacheline. This is particularly important
      	  for the kmalloc array.
      
      Cons:
      	- Additional reference to one read only cacheline (per cpu
      	  array of pointers to kmem_cache_cpu) in both slab_alloc()
      	  and slab_free().
      
      [akinobu.mita@gmail.com: fix cpu hotplug offline/online path]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: "Pekka Enberg" <penberg@cs.helsinki.fi>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLUB: Move page->offset to kmem_cache_cpu->offset · b3fba8da
      Committed by Christoph Lameter
      We need the offset from the page struct during slab_alloc and slab_free. In
      both cases we also reference the cacheline of the kmem_cache_cpu structure.
      We can therefore move the offset field into the kmem_cache_cpu structure
      freeing up 16 bits in the page struct.
      
      Moving the offset allows an allocation from slab_alloc() without touching the
      page struct in the hot path.
      
      The only thing left in slab_free() that touches the page struct cacheline for
      per cpu freeing is the checking of SlabDebug(page). The next patch deals with
      that.
      
      Use the available 16 bits to broaden page->inuse. More than 64k objects per
      slab become possible and we can get rid of the checks for that limitation.
      
      No need anymore to shrink the order of slabs if we boot with 2M sized slabs
      (slub_min_order=9).
      
      No need anymore to switch off the offset calculation for very large slabs
      since the field in the kmem_cache_cpu structure is 32 bits and so the offset
      field can now handle slab sizes of up to 8GB.
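
      A sketch of the resulting fast path, with field names following the
      description above (treat the exact layout as an assumption):

      struct kmem_cache_cpu {
              void **freelist;        /* lockless freelist */
              struct page *page;      /* current cpu slab */
              int node;
              unsigned int offset;    /* free pointer offset, formerly page->offset */
      };

      static __always_inline void *pop_object(struct kmem_cache_cpu *c)
      {
              void **object = c->freelist;

              if (object)
                      c->freelist = object[c->offset];        /* offset is in words */
              return object;                                  /* no page struct access */
      }
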
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab · dfb4f096
      Committed by Christoph Lameter
      A remote free may access the same page struct that also contains the lockless
      freelist for the cpu slab. If objects have a short lifetime and are freed by
      a different processor then remote frees back to the slab from which we are
      currently allocating are frequent. The cacheline with the page struct needs
      to be repeatedly acquired in exclusive mode by both the allocating thread and
      the freeing thread. If this is frequent enough then performance will suffer
      because of cacheline bouncing.
      
      This patchset puts the lockless_freelist pointer in its own cacheline. In
      order to make that happen we introduce a per cpu structure called
      kmem_cache_cpu.
      
      Instead of keeping an array of pointers to page structs we now keep an array
      to a per cpu structure that--among other things--contains the pointer to the
      lockless freelist. The freeing thread can then keep possession of exclusive
      access to the page struct cacheline while the allocating thread keeps its
      exclusive access to the cacheline containing the per cpu structure.
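
      A minimal sketch of the split, assuming field names in the spirit of the
      patch; the point is that the lockless freelist no longer shares a cacheline
      with the page struct that remote frees write to.

      struct kmem_cache_cpu {
              void **freelist;        /* lockless per cpu freelist (local allocations) */
              struct page *page;      /* cpu slab; its page struct is what remote frees hit */
              int node;
      } ____cacheline_aligned_in_smp;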
      
      This works as long as the allocating cpu is able to service its request
      from the lockless freelist. If the lockless freelist runs empty then the
      allocating thread needs to acquire exclusive access to the cacheline with
      the page struct in order to lock the slab.
      
      The allocating thread will then check if new objects were freed to the per
      cpu slab. If so it will keep the slab as the cpu slab and continue with the
      recently remote freed objects. So the allocating thread can take a series
      of just freed remote pages and dish them out again. Ideally allocations
      could be just recycling objects in the same slab this way which will lead
      to an ideal allocation / remote free pattern.
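
      A sketch of that refill step: when the lockless freelist is empty, adopt
      whatever objects remote cpus have freed back to the cpu slab. Locking and
      flag handling are omitted and the names are illustrative, not the exact code.

      static void *refill_from_cpu_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
      {
              void **object = c->page->freelist;      /* objects freed, possibly remotely */

              if (!object)
                      return NULL;                    /* nothing came back: get a new slab */

              c->page->freelist = NULL;               /* take over the whole chain */
              c->freelist = get_freepointer(s, object); /* keep the rest for the fast path */
              return object;                          /* hand out the first object */
      }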
      
      The number of objects that can be handled in this way is limited by the
      capacity of one slab. Increasing slab size via slub_min_objects/
      slub_max_order may increase the number of objects and therefore performance.
      
      If the allocating thread runs out of objects and finds that no objects were
      put back by the remote processor then it will retrieve a new slab (from the
      partial lists or from the page allocator) and start with a whole
      new set of objects while the remote thread may still be freeing objects to
      the old cpu slab. This may then repeat until the new slab is also exhausted.
      If remote freeing has freed objects in the earlier slab then that earlier
      slab will now be on the partial list and the allocating thread will
      pick that slab next for allocation. So the loop is extended. However,
      both threads need to take the list_lock to make the swizzling via
      the partial list happen.
      
      It is likely that this kind of scheme will keep the objects being passed
      around to a small set that can be kept in the cpu caches leading to increased
      performance.
      
      More code cleanups become possible:
      
      - Instead of passing a cpu we can now pass a kmem_cache_cpu structure around.
        Allows reducing the number of parameters to various functions.
      - Can define a new node_match() function for NUMA to encapsulate locality
        checks.
      
      Effect on allocations:
      
      Cachelines touched before this patch:
      
      	Write:	page struct cacheline and first cacheline of object
      
      Cachelines touched after this patch:
      
      	Write:	kmem_cache_cpu cacheline and first cacheline of object
      	Read: page struct cacheline (but see later patch that avoids touching
      		that cacheline)
      
      The handling when the lockless alloc list runs empty gets to be a bit more
      complicated since another cacheline now has to be written to. But that is
      halfway out of the hot path.
      
      Effect on freeing:
      
      Cachelines touched before this patch:
      
      	Write: page_struct and first cacheline of object
      
      Cachelines touched after this patch depending on how we free:
      
        Write(to cpu_slab):	kmem_cache_cpu struct and first cacheline of object
        Write(to other):	page struct and first cacheline of object
      
        Read(to cpu_slab):	page struct to id slab etc. (but see later patch that
        			avoids touching the page struct on free)
        Read(to other):	cpu local kmem_cache_cpu struct to verify it's not
        			the cpu slab.
      
      Summary:
      
      Pro:
      	- Distinct cachelines so that concurrent remote frees and local
      	  allocs on a cpuslab can occur without cacheline bouncing.
      	- Avoids potential bouncing cachelines because of neighboring
      	  per cpu pointer updates in kmem_cache's cpu_slab structure since
      	  it now grows to a cacheline (Therefore remove the comment
      	  that talks about that concern).
      
      Cons:
      	- Freeing objects now requires the reading of one additional
      	  cacheline. That can be mitigated for some cases by the following
      	  patches but it's not possible to completely eliminate these
      	  references.
      
      	- Memory usage grows slightly.
      
      	The size of each per cpu object is blown up from one word
      	(pointing to the page_struct) to one cacheline with various data.
      	So this is NR_CPUS*NR_SLABS*L1_BYTES more memory use. Let's say
      	NR_SLABS is 100 and a cache line size of 128 then we have just
      	increased SLAB metadata requirements by 12.8k per cpu.
      	(Another later patch reduces these requirements)
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLUB: direct pass through of page size or higher kmalloc requests · aadb4bc4
      Committed by Christoph Lameter
      This gets rid of all kmalloc caches larger than page size.  A kmalloc
      request larger than PAGE_SIZE / 2 is passed through to the page
      allocator.  This works both inline, where we will call __get_free_pages
      instead of kmem_cache_alloc, and in __kmalloc.
      
      kfree is modified to check if the object is in a slab page. If not then
      the page is freed via the page allocator instead. Roughly similar to what
      SLOB does.
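
      A sketch of the pass-through under the PAGE_SIZE/2 threshold described above;
      get_slab() stands in for the kmalloc cache lookup and is illustrative only.

      void *__kmalloc(size_t size, gfp_t flags)
      {
              if (unlikely(size > PAGE_SIZE / 2))     /* too big for a kmalloc cache */
                      return (void *)__get_free_pages(flags | __GFP_COMP,
                                                      get_order(size));

              return kmem_cache_alloc(get_slab(size, flags), flags);
      }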
      
      Advantages:
      - Reduces memory overhead for kmalloc array
      - Large kmalloc operations are faster since they do not
        need to pass through the slab allocator to get to the
        page allocator.
      - Performance increase of 10%-20% on alloc and 50% on free for
        PAGE_SIZEd allocations.
        SLUB must call the page allocator for each alloc anyway, since the
        higher order pages that previously allowed avoiding those page allocator
        calls are no longer reliably available. So we are basically removing
        useless slab allocator overhead.
      - Large kmallocs yield page aligned objects, which is what
        SLAB did. Bad things like using page sized kmalloc allocations to
        stand in for page allocator allocations can be transparently handled and
        are not distinguishable from page allocator uses.
      - Checking for too large objects can be removed since
        it is done by the page allocator.
      
      Drawbacks:
      - No accounting for large kmalloc slab allocations anymore
      - No debugging of large kmalloc slab allocations.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 31 August 2007, 1 commit
  10. 20 July 2007, 1 commit
  11. 18 July 2007, 3 commits
  12. 17 July 2007, 1 commit
  13. 17 June 2007, 1 commit
    • SLUB: minimum alignment fixes · 4b356be0
      Committed by Christoph Lameter
      If ARCH_KMALLOC_MINALIGN is set to a value greater than 8 (SLUBs smallest
      kmalloc cache) then SLUB may generate duplicate slabs in sysfs (yes again)
      because the object size is padded to reach ARCH_KMALLOC_MINALIGN.  Thus the
      small slabs all end up the same size.
      
      No arch sets ARCH_KMALLOC_MINALIGN larger than 8 though except mips which
      for some reason wants a 128 byte alignment.
      
      This patch increases the size of the smallest cache if
      ARCH_KMALLOC_MINALIGN is greater than 8.  In that case more and more of the
      smallest caches are disabled.
      
      If we do that then the count of the active general caches that is displayed
      on boot is not correct anymore since we may skip elements of the kmalloc
      array.  So count them separately.
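
      A sketch of the size floor, using macro names in the spirit of the SLUB
      headers of that era (treat them as assumptions here):

      #if defined(ARCH_KMALLOC_MINALIGN) && ARCH_KMALLOC_MINALIGN > 8
      #define KMALLOC_MIN_SIZE ARCH_KMALLOC_MINALIGN  /* smaller caches are skipped */
      #else
      #define KMALLOC_MIN_SIZE 8
      #endif

      #define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)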
      
      This approach was tested by Havard yesterday.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 09 June 2007, 1 commit
  15. 17 May 2007, 4 commits
  16. 15 May 2007, 1 commit
  17. 08 May 2007, 3 commits
    • slub: enable tracking of full slabs · 643b1138
      Committed by Christoph Lameter
      If slab tracking is on then build a list of full slabs so that we can verify
      the integrity of all slabs and are also able to build lists of alloc/free
      callers.
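
      A sketch of the approach (locking and the SLAB_STORE_USER check are elided;
      the names illustrate the idea rather than the exact patch):

      static void add_full(struct kmem_cache_node *n, struct page *page)
      {
              list_add(&page->lru, &n->full);         /* page->lru is free for slab pages */
      }

      static void remove_full(struct kmem_cache_node *n, struct page *page)
      {
              list_del(&page->lru);
      }
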
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLUB: allocate smallest object size if the user asks for 0 bytes · 614410d5
      Committed by Christoph Lameter
      Makes SLUB behave like SLAB in this area to avoid issues....
      
      Throw a stack dump to alert people.
      
      At some point the behavior should be switched back.  NULL is no memory as
      far as I can tell and if the user asked for 0 bytes then he needs to get no
      memory.
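
      A sketch of the interim behavior (names are illustrative, not the exact
      patch): a zero byte request maps to the smallest cache and emits a warning.

      static struct kmem_cache *get_slab(size_t size, gfp_t flags)
      {
              if (unlikely(!size)) {
                      WARN_ON_ONCE(1);        /* stack dump to alert callers passing 0 */
                      size = 1;               /* fall through to the smallest kmalloc cache */
              }
              return kmalloc_caches + kmalloc_index(size);
      }
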
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLUB core · 81819f0f
      Committed by Christoph Lameter
      This is a new slab allocator which was motivated by the complexity of the
      existing code in mm/slab.c. It attempts to address a variety of concerns
      with the existing implementation.
      
      A. Management of object queues
      
         A particular concern was the complex management of the numerous object
         queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
         each allocating CPU and use objects from a slab directly instead of
         queueing them up.
      
      B. Storage overhead of object queues
      
         SLAB Object queues exist per node, per CPU. The alien cache queue even
         has a queue array that contains a queue for each processor on each
         node. For very large systems the number of queues and the number of
         objects that may be caught in those queues grows exponentially. On our
         systems with 1k nodes / processors we have several gigabytes just tied up
         for storing references to objects for those queues.  This does not include
         the objects that could be on those queues. One fears that the whole
         memory of the machine could one day be consumed by those queues.
      
      C. SLAB meta data overhead
      
         SLAB has overhead at the beginning of each slab. This means that data
         cannot be naturally aligned at the beginning of a slab block. SLUB keeps
         all meta data in the corresponding page_struct. Objects can be naturally
         aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
         boundaries and can fit tightly into a 4k page with no bytes left over.
         SLAB cannot do this.
      
      D. SLAB has a complex cache reaper
      
         SLUB does not need a cache reaper for UP systems. On SMP systems
         the per CPU slab may be pushed back into the partial list but that
         operation is simple and does not require an iteration over a list
         of objects. SLAB expires per CPU, shared and alien object queues
         during cache reaping which may cause strange hold offs.
      
      E. SLAB has complex NUMA policy layer support
      
         SLUB pushes NUMA policy handling into the page allocator. This means that
         allocation is coarser (SLUB does interleave on a page level) but that
         situation was also present before 2.6.13. SLAB's application of
         policies to individual slab objects is
         certainly a performance concern due to the frequent references to
         memory policies, which may cause a sequence of objects to come from
         one node after another. SLUB will get a slab full of objects
         from one node and then will switch to the next.
      
      F. Reduction of the size of partial slab lists
      
         SLAB has per node partial lists. This means that over time a large
         number of partial slabs may accumulate on those lists. These can
         only be reused if allocations occur on specific nodes. SLUB has a global
         pool of partial slabs and will consume slabs from that pool to
         decrease fragmentation.
      
      G. Tunables
      
         SLAB has sophisticated tuning abilities for each slab cache. One can
         manipulate the queue sizes in detail. However, filling the queues still
         requires the use of the spin lock to check out slabs. SLUB has a global
         parameter (slub_min_order) for tuning. Increasing the minimum slab
         order can decrease the locking overhead. The bigger the slab order the
         fewer movements of pages between per CPU and partial lists occur and the
         better SLUB will scale.
      
      H. Slab merging
      
         We often have slab caches with similar parameters. SLUB detects those
         on boot up and merges them into the corresponding general caches. This
         leads to more effective memory use. About 50% of all caches can
         be eliminated through slab merging. This will also decrease
         slab fragmentation because partially allocated slabs can be filled
         up again. Slab merging can be switched off by specifying
         slub_nomerge on boot up.
      
         Note that merging can expose heretofore unknown bugs in the kernel
         because corrupted objects may now be placed differently and corrupt
         differing neighboring objects. Enable sanity checks to find those.
      
      I. Diagnostics
      
         The current slab diagnostics are difficult to use and require a
         recompilation of the kernel. SLUB contains debugging code that
         is always available (but is kept out of the hot code paths).
         SLUB diagnostics can be enabled via the "slub_debug" option.
         Parameters can be specified to select a single or a group of
         slab caches for diagnostics. This means that the system is running
         with the usual performance and it is much more likely that
         race conditions can be reproduced.
      
      J. Resiliency
      
         If basic sanity checks are on then SLUB is capable of detecting
         common error conditions and recovering as best as possible to allow the
         system to continue.
      
      K. Tracing
      
         Tracing can be enabled via the slub_debug=T,<slabcache> option
         during boot. SLUB will then record all actions on that slabcache
         and dump the object contents on free.
      
      L. On demand DMA cache creation.
      
         Generally DMA caches are not needed. If a kmalloc with __GFP_DMA is
         used then just the single slabcache that is needed is created.
         For systems that have no ZONE_DMA requirement the support is
         completely eliminated.
      
      M. Performance increase
      
         Some benchmarks have shown speed improvements on kernbench in the
         range of 5-10%. The locking overhead of SLUB is based on the
         underlying base allocation size. If we can reliably allocate
         larger order pages then it is possible to increase slub
         performance much further. The anti-fragmentation patches may
         enable further performance increases.
      
      Tested on:
      i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator
      
      SLUB Boot options
      
      slub_nomerge		Disable merging of slabs
      slub_min_order=x	Require a minimum order for slab caches. This
      			increases the managed chunk size and therefore
      			reduces meta data and locking overhead.
      slub_min_objects=x	Minimum objects per slab. Default is 8.
      slub_max_order=x	Avoid generating slabs larger than order specified.
      slub_debug		Enable all diagnostics for all caches
      slub_debug=<options>	Enable selective options for all caches
      slub_debug=<o>,<cache>	Enable selective options for a certain set of
      			caches
      
      Available Debug options
      F		Double Free checking, sanity and resiliency
      R		Red zoning
      P		Object / padding poisoning
      U		Track last free / alloc
      T		Trace all allocs / frees (only use for individual slabs).
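
      As an illustration (the cache name here is just an example), booting with

      	slub_debug=FPU,dentry

      enables double free checking, poisoning and alloc/free tracking for the
      dentry cache only, while a bare slub_debug turns on all diagnostics for
      every cache.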
      
      To use SLUB: Apply this patch and then select SLUB as the default slab
      allocator.
      
      [hugh@veritas.com: fix an oops-causing locking error]
      [akpm@linux-foundation.org: various stupid cleanups and small fixes]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>