1. 18 Jul, 2007 20 commits
    • C
      SLUB: faster more efficient slab determination for __kmalloc · f1b26339
      Christoph Lameter committed
      kmalloc_index is a long series of comparisons.  The attempt to replace
      kmalloc_index with something more efficient like ilog2 failed due to compiler
      issues with constant folding on gcc 3.3 / powerpc.
      
      kmalloc_index()'s long list of comparisons works fine for constant folding
      since all the comparisons are optimized away.  However, SLUB also uses
      kmalloc_index to determine the slab to use for the __kmalloc_xxx functions.
      This leads to a large set of comparisons in get_slab().
      
      This patch gets rid of that list of comparisons in get_slab():
      
      1. If the requested size is larger than 192 then we can simply use
         fls to determine the slab index since all larger slabs are
         power-of-two sized.
      
      2. If the requested size is smaller then we cannot use fls since there
         are non-power-of-two caches to be considered. However, the sizes are
         in a manageable range. So we divide the size by 8. Then we have only
         24 possibilities left and we simply look up the kmalloc index
         in a table.
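
The two cases above can be sketched in user-space C as follows. This is only an illustration under stated assumptions: `fls_sketch`, `size_index` and `slab_index` are hypothetical names, the table contents are inferred from the description (index 1 for a 96-byte cache, index 2 for a 192-byte cache, any other index n for a 2^n-byte cache), and the lookup uses `(size - 1) / 8` so that exact multiples of 8 fall into their own bucket.

```c
#include <stddef.h>

/* Portable "find last set": 1-based position of the highest set bit,
 * 0 when x is 0. Stands in for the kernel's fls(). */
static int fls_sketch(unsigned long x)
{
    int pos = 0;

    while (x) {
        pos++;
        x >>= 1;
    }
    return pos;
}

/* Hypothetical table for sizes 1..192, one entry per 8-byte step.
 * Entry i covers sizes 8*i+1 .. 8*i+8. */
static const signed char size_index[24] = {
    3,                      /* 8 */
    4,                      /* 16 */
    5, 5,                   /* 24, 32 */
    6, 6, 6, 6,             /* 40 .. 64 */
    1, 1, 1, 1,             /* 72 .. 96   -> 96-byte cache */
    7, 7, 7, 7,             /* 104 .. 128 */
    2, 2, 2, 2, 2, 2, 2, 2  /* 136 .. 192 -> 192-byte cache */
};

/* Map a requested size to a cache index; -1 for a zero-size request. */
static int slab_index(size_t size)
{
    if (size == 0)
        return -1;
    if (size <= 192)
        return size_index[(size - 1) / 8];  /* small sizes: table lookup */
    return fls_sketch(size - 1);            /* large sizes: next power of two */
}
```

For instance, `slab_index(100)` yields 7 (the 128-byte cache) and `slab_index(193)` yields 8 (the 256-byte cache), with no chain of comparisons in either path.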
      
      Code size of slub.o decreases by more than 200 bytes through this patch.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      SLUB: do proper locking during dma slab creation · dfce8648
      Christoph Lameter committed
      We modify the kmalloc_cache_dma[] array without proper locking.  Do the proper
      locking and undo the dma cache creation if another processor has already
      created it.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      SLUB: extract dma_kmalloc_cache from get_cache. · 2e443fd0
      Christoph Lameter committed
      The rarely used dma functionality in get_slab() makes the function too
      complex.  The compiler begins to spill variables from the working set onto the
      stack.  The created function is only used in extremely rare cases so make sure
      that the compiler does not decide on its own to merge it back into get_slab().
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      SLUB: add some more inlines and #ifdef CONFIG_SLUB_DEBUG · 0c710013
      Christoph Lameter committed
      Add #ifdefs around data structures only needed if debugging is compiled into
      SLUB.
      
      Add inlines to small functions to reduce code size.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      Slab allocators: support __GFP_ZERO in all allocators · d07dbea4
      Christoph Lameter committed
      A kernel convention for many allocators is that if __GFP_ZERO is passed to an
      allocator then the allocated memory should be zeroed.
      
      This is currently not supported by the slab allocators.  The inconsistency
      makes it difficult to implement in derived allocators such as in the uncached
      allocator and the pool allocators.
      
      In addition, the support for zeroed allocations in the slab allocators does
      not have a consistent API.  There are no zeroing allocator functions for NUMA
      node placement (kmalloc_node, kmem_cache_alloc_node).  The zeroing allocations
      are only provided for default allocs (kzalloc, kmem_cache_zalloc_node).
      __GFP_ZERO will make zeroing universally available and does not require any
      additional functions.
      
      So add the necessary logic to all slab allocators to support __GFP_ZERO.
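
A minimal user-space sketch of the idea, assuming a hypothetical flag value and hypothetical function names (`SKETCH_GFP_ZERO`, `sketch_alloc`, `sketch_zalloc`), with `malloc` standing in for the slab allocator: once the base allocation path honours a zeroing flag, a zeroing variant of any entry point reduces to OR-ing in that flag.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag bit; the real __GFP_ZERO value is not taken from here. */
#define SKETCH_GFP_ZERO 0x1u

/* Base allocation path: zero the object only when the caller asks for it. */
static void *sketch_alloc(size_t size, unsigned int flags)
{
    void *obj = malloc(size);

    if (obj && (flags & SKETCH_GFP_ZERO))
        memset(obj, 0, size);
    return obj;
}

/* A zeroing wrapper needs no separate implementation. */
static void *sketch_zalloc(size_t size, unsigned int flags)
{
    return sketch_alloc(size, flags | SKETCH_GFP_ZERO);
}
```

The flag test sits directly in the allocation path, which mirrors the commit's remark that the check is cheap because the flags are already at hand.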
      
      The code is added to the hot path.  The gfp flags are on the stack and so the
      cacheline is readily available for checking if we want a zeroed object.
      
      Zeroing while allocating is now a frequent operation and we seem to be
      gradually approaching a 1-1 parity between zeroing and not zeroing allocs.
      The current tree has 3476 uses of kmalloc vs 2731 uses of kzalloc.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      Slab allocators: consistent ZERO_SIZE_PTR support and NULL result semantics · 6cb8f913
      Christoph Lameter committed
      Define ZERO_OR_NULL_PTR macro to be able to remove the checks from the
      allocators.  Move ZERO_SIZE_PTR related stuff into slab.h.
      
      Make ZERO_SIZE_PTR work for all slab allocators and get rid of the
      WARN_ON_ONCE(size == 0) that is still remaining in SLAB.
      
      Make slub return NULL like the other allocators if a too large memory segment
      is requested via __kmalloc.
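
The following user-space sketch illustrates how such macros can remove special cases from callers. The macro names come from the text above; their definitions here are assumptions for illustration (a small non-NULL sentinel address and a single unsigned comparison), and `sketch_kmalloc` / `sketch_kfree` are hypothetical wrappers around `malloc` / `free`.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed definitions: a distinct, never-dereferenced sentinel for
 * zero-size allocations, and one comparison that catches both NULL
 * and the sentinel. */
#define ZERO_SIZE_PTR ((void *)16)
#define ZERO_OR_NULL_PTR(x) ((uintptr_t)(x) <= (uintptr_t)ZERO_SIZE_PTR)

/* Zero-size requests succeed but return the sentinel, not real memory. */
static void *sketch_kmalloc(size_t size)
{
    if (size == 0)
        return ZERO_SIZE_PTR;
    return malloc(size);
}

/* Freeing NULL or the sentinel is a no-op. */
static void sketch_kfree(void *p)
{
    if (ZERO_OR_NULL_PTR(p))
        return;
    free(p);
}
```

A caller can pass the result of a zero-size allocation straight to `sketch_kfree` without checking for the sentinel itself.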
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      Slab allocators: consolidate code for krealloc in mm/util.c · ef2ad80c
      Christoph Lameter committed
      The size of a kmalloc object is readily available via ksize().  ksize is
      provided by all allocators and thus we can implement krealloc in a generic
      way.
      
      Implement krealloc in mm/util.c and drop slab specific implementations of
      krealloc.
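
To show why a size query is enough for an allocator-independent realloc, here is a toy user-space sketch. All names (`toy_alloc`, `toy_ksize`, `toy_free`, `toy_krealloc`) are hypothetical, the toy allocator records the size in a header in front of each object, and the handling of a zero new size is an assumption of this sketch rather than the kernel's behaviour.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy allocator: a header in front of each object records its size.
 * The union keeps the object suitably aligned. */
union toy_header {
    size_t size;
    max_align_t align;
};

static void *toy_alloc(size_t size)
{
    union toy_header *h = malloc(sizeof(*h) + size);

    if (!h)
        return NULL;
    h->size = size;
    return h + 1;
}

/* Counterpart of ksize(): usable size of an allocated object. */
static size_t toy_ksize(const void *p)
{
    return ((const union toy_header *)p - 1)->size;
}

static void toy_free(void *p)
{
    if (p)
        free((union toy_header *)p - 1);
}

/* Generic realloc built only on alloc, free and the size query. */
static void *toy_krealloc(void *p, size_t new_size)
{
    size_t old_size;
    void *ret;

    if (!p)
        return toy_alloc(new_size);
    if (new_size == 0) {            /* assumption: treat as free */
        toy_free(p);
        return NULL;
    }
    old_size = toy_ksize(p);
    if (old_size >= new_size)       /* object is already big enough */
        return p;
    ret = toy_alloc(new_size);
    if (ret) {
        memcpy(ret, p, old_size);
        toy_free(p);
    }
    return ret;
}
```

Nothing in `toy_krealloc` depends on how the allocator lays out its memory, which is the property that lets one implementation serve every allocator that provides a size query.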
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      SLUB Debug: fix initial object debug state of NUMA bootstrap objects · d45f39cb
      Christoph Lameter committed
      The function we are calling to initialize object debug state during early NUMA
      bootstrap sets up an inactive object giving it the wrong redzone signature.
      The bootstrap nodes are active objects and should have active redzone
      signatures.
      
      Currently slab validation complains and reverts the object to active state.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      SLUB: ensure that the number of objects per slab stays low for high orders · 6300ea75
      Christoph Lameter committed
      Currently SLUB has no provision to deal with too high page orders that may
      be specified on the kernel boot line.  If an order higher than 6 (on a 4k
      platform) is generated then we will BUG() because slabs get more than 65535
      objects.
      
      Add some logic that decreases order for slabs that have too many objects.
      This allows booting with slab sizes up to MAX_ORDER.
      
      For example
      
      	slub_min_order=10
      
      will boot with a default slab size of 4M and reduce slab sizes for small
      object sizes to lower orders if the number of objects becomes too big.
      Large slab sizes like that concentrate objects of the same slab cache
      under as few TLB entries as possible and thus potentially reduce TLB
      pressure.
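
The order reduction can be sketched as a simple loop. This is an illustration, not the code from the patch: the function name and constants are hypothetical, a 4k page size is assumed as in the text, and metadata overhead inside the slab is ignored.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE   4096UL   /* assumed 4k platform */
#define SKETCH_MAX_OBJECTS 65535UL  /* limit named in the commit message */

/* Lower the requested order until the slab holds at most
 * SKETCH_MAX_OBJECTS objects of the given size. */
static int sketch_slab_order(size_t obj_size, int min_order)
{
    int order = min_order;

    while (order > 0 &&
           (SKETCH_PAGE_SIZE << order) / obj_size > SKETCH_MAX_OBJECTS)
        order--;
    return order;
}
```

With `min_order` set to 10, 8-byte objects drop to order 6 (32768 objects per slab) while 4096-byte objects keep the full order-10 slab of 4M.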
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      SLUB slab validation: Move tracking information alloc outside of lock · 68dff6a9
      Christoph Lameter committed
      We currently have to do a GFP_ATOMIC allocation because the list_lock is
      already taken when we first allocate memory for tracking allocation
      information.  It would be better if we could avoid atomic allocations.
      
      Allocate a size of the tracking table that is usually sufficient (one page)
      before we take the list lock.  We will then only do the atomic allocation
      if we need to resize the table to become larger than a page (mostly only
      needed under large NUMA because of the tracking of cpus and nodes;
      otherwise the table stays small).
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      SLUB: use list_for_each_entry for loops over all slabs · 5b95a4ac
      Christoph Lameter committed
      Use list_for_each_entry() instead of list_for_each().
      
      Get rid of for_all_slabs(). It had only one user. So fold it into the
      callback. This also gets rid of cpu_slab_flush.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      SLUB: change error reporting format to follow lockdep loosely · 24922684
      Christoph Lameter committed
      Changes the error reporting format to loosely follow lockdep.
      
      If data corruption is detected then we generate the following lines:
      
      ============================================
      BUG <slab-cache>: <problem>
      --------------------------------------------
      
      INFO: <more information> [possibly multiple times]
      
      <object dump>
      
      FIX <slab-cache>: <remedial action>
      
      This also adds some more intelligence to the data corruption detection. It
      is now capable of figuring out the start and end.
      
      Add a comment on how to configure SLUB so that a production system may
      continue to operate even though occasional slab corruption occurs through
      a misbehaving kernel component. See "Emergency operations" in
      Documentation/vm/slub.txt.
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      mm: clean up and kernelify shrinker registration · 8e1f936b
      Rusty Russell committed
      I can never remember what the function to register to receive VM pressure
      is called.  I have to trace down from __alloc_pages() to find it.
      
      It's called "set_shrinker()", and it needs Your Help.
      
      1) Don't hide struct shrinker.  It contains no magic.
      2) Don't allocate "struct shrinker".  It's not helpful.
      3) Call them "register_shrinker" and "unregister_shrinker".
      4) Call the function "shrink" not "shrinker".
      5) Reduce the 17 lines of waffly comments to 13, but document it properly.
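
The shape of the interface described in points 1 to 4 can be mocked up in user space. This is a sketch under loud assumptions: the struct layout, the callback signature and every `sketch_` / `demo_` name are hypothetical, and a plain singly linked list stands in for whatever the kernel uses. It only shows a caller-owned, visible struct with a `shrink` callback being registered and unregistered.

```c
#include <stddef.h>

/* Caller-owned and fully visible: nothing is allocated on registration. */
struct sketch_shrinker {
    int (*shrink)(int nr_to_scan);   /* hypothetical signature */
    struct sketch_shrinker *next;
};

static struct sketch_shrinker *shrinker_list;

static void sketch_register_shrinker(struct sketch_shrinker *s)
{
    s->next = shrinker_list;
    shrinker_list = s;
}

static void sketch_unregister_shrinker(struct sketch_shrinker *s)
{
    struct sketch_shrinker **pp;

    for (pp = &shrinker_list; *pp; pp = &(*pp)->next) {
        if (*pp == s) {
            *pp = s->next;
            s->next = NULL;
            return;
        }
    }
}

/* Simulated memory pressure: ask every registered shrinker to give
 * back up to nr_to_scan objects and return the total released. */
static int sketch_shrink_all(int nr_to_scan)
{
    struct sketch_shrinker *s;
    int freed = 0;

    for (s = shrinker_list; s; s = s->next)
        freed += s->shrink(nr_to_scan);
    return freed;
}

/* Example user: a cache that can drop objects on demand. */
static int demo_cache_objects = 10;

static int demo_shrink(int nr_to_scan)
{
    int n = nr_to_scan < demo_cache_objects ? nr_to_scan : demo_cache_objects;

    demo_cache_objects -= n;
    return n;
}

static struct sketch_shrinker demo_shrinker = { .shrink = demo_shrink };
```

A subsystem embeds or declares its own `struct sketch_shrinker`, registers it once, and unregisters the same object on teardown.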
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: David Chinner <dgc@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      Lumpy Reclaim V4 · 5ad333eb
      Andy Whitcroft committed
      When we are out of memory of a suitable size we enter reclaim.  The current
      reclaim algorithm targets pages in LRU order, which is great for fairness at
      order-0 but highly unsuitable if you desire pages at higher orders.  To get
      pages of higher order we must shoot down a very high proportion of memory;
      >95% in a lot of cases.
      
      This patch set adds a lumpy reclaim algorithm to the allocator.  It targets
      groups of pages at the specified order anchored at the end of the active and
      inactive lists.  This encourages groups of pages at the requested orders to
      move from active to inactive, and active to free lists.  This behaviour is
      only triggered out of direct reclaim when higher order pages have been
      requested.
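
One way to picture "a group of pages at the specified order" around an anchor page is the naturally aligned block of 2^order page frames that contains it. The sketch below computes that block; the struct and function names are hypothetical, and treating the group as an aligned block is an assumption of this sketch, not something stated in the text above.

```c
/* Half-open page frame number range [start, end). */
struct pfn_range {
    unsigned long start;
    unsigned long end;
};

/* Aligned block of (1 << order) page frames containing pfn. */
static struct pfn_range lumpy_block(unsigned long pfn, unsigned int order)
{
    struct pfn_range r;

    r.start = pfn & ~((1UL << order) - 1);  /* round down to block start */
    r.end = r.start + (1UL << order);
    return r;
}
```

For an anchor page at frame 1027 and order 3, the candidate group is frames 1024 to 1031; at order 0 the group is just the anchor page itself.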
      
      This patch set is particularly effective when utilised with an
      anti-fragmentation scheme which groups pages of similar reclaimability
      together.
      
      This patch set is based on Peter Zijlstra's lumpy reclaim V2 patch which forms
      the foundation.  Credit to Mel Gorman for sanity checking.
      
      Mel said:
      
        The patches have an application with hugepage pool resizing.
      
        When lumpy-reclaim is used with ZONE_MOVABLE, the hugepages pool can
        be resized with greater reliability.  Testing on a desktop machine with 2GB
        of RAM showed that growing the hugepage pool with ZONE_MOVABLE on its own
        was very slow as the success rate was quite low.  Without lumpy-reclaim,
        each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages.
        With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical.
      
      [akpm@osdl.org: ia64 pfn_to_nid fixes and loop cleanup]
      [bunk@stusta.de: static declarations for internal functions]
      [a.p.zijlstra@chello.nl: initial lumpy V2 implementation]
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Bob Picco <bob.picco@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      Add a movablecore= parameter for sizing ZONE_MOVABLE · 7e63efef
      Mel Gorman committed
      This patch adds a new parameter for sizing ZONE_MOVABLE called
      movablecore=.  While kernelcore= is used to specify the minimum amount of
      memory that must be available for all allocation types, movablecore= is
      used to specify the minimum amount of memory that is used for migratable
      allocations.  The amount of memory used for migratable allocations
      determines how large the huge page pool could be dynamically resized to at
      runtime for example.
      
      How movablecore is actually handled is that the total number of pages in
      the system is calculated and a value is set for kernelcore that is
      
      kernelcore == totalpages - movablecore
      
      Both kernelcore= and movablecore= can be safely specified at the same time.
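
The relationship above is plain arithmetic on page counts, sketched here with hypothetical function names. How the two parameters are combined when both are given is not spelled out in the text; taking the larger kernelcore value is an assumption of this sketch, as is using 0 to mean "parameter not given".

```c
/* kernelcore implied by a movablecore request, in pages. */
static unsigned long kernelcore_from_movablecore(unsigned long totalpages,
                                                 unsigned long movablecore)
{
    if (movablecore >= totalpages)
        return 0;
    return totalpages - movablecore;
}

/* Combine both parameters; 0 means "not specified".
 * Assumption: when both are given, the larger kernelcore wins. */
static unsigned long effective_kernelcore(unsigned long totalpages,
                                          unsigned long kernelcore,
                                          unsigned long movablecore)
{
    unsigned long derived;

    if (movablecore == 0)
        return kernelcore;
    derived = kernelcore_from_movablecore(totalpages, movablecore);
    return kernelcore > derived ? kernelcore : derived;
}
```

With 1000 pages in total and movablecore set to 300 pages, the derived kernelcore is 700 pages.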
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      handle kernelcore=: generic · ed7ed365
      Mel Gorman committed
      This patch adds the kernelcore= parameter for x86.
      
      Once all patches are applied, a new command-line parameter and a new
      sysctl exist.  This patch adds the necessary documentation.
      
      From: Yasunori Goto <y-goto@jp.fujitsu.com>
      
        When "kernelcore" boot option is specified, kernel can't boot up on ia64
        because of an infinite loop.  In addition, the parsing code can be handled
        in an architecture-independent manner.
      
        This patch uses common code to handle the kernelcore= parameter.  It is
        only available to architectures that support arch-independent zone-sizing
        (i.e.  define CONFIG_ARCH_POPULATES_NODE_MAP).  Other architectures will
        ignore the boot parameter.
      
      [bunk@stusta.de: make cmdline_parse_kernelcore() static]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Acked-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      Allow huge page allocations to use GFP_HIGH_MOVABLE · 396faf03
      Mel Gorman committed
      Huge pages are not movable so are not allocated from ZONE_MOVABLE.  However,
      as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it
      can be used to satisfy hugepage allocations even when the system has been
      running a long time.  This allows an administrator to resize the hugepage pool
      at runtime depending on the size of ZONE_MOVABLE.
      
      This patch adds a new sysctl called hugepages_treat_as_movable.  When a
      non-zero value is written to it, future allocations for the huge page pool
      will use ZONE_MOVABLE.  Despite huge pages being non-movable, we do not
      introduce additional external fragmentation of note as huge pages are always
      the largest contiguous block we care about.
      
      [akpm@linux-foundation.org: various fixes]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      Create the ZONE_MOVABLE zone · 2a1e274a
      Mel Gorman committed
      The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE
      that is only usable by allocations that specify both __GFP_HIGHMEM and
      __GFP_MOVABLE.  This has the effect of keeping all non-movable pages within a
      single memory partition while allowing movable allocations to be satisfied
      from either partition.  The patches may be applied with the list-based
      anti-fragmentation patches that group pages together based on mobility.
      
      The size of the zone is determined by a kernelcore= parameter specified at
      boot-time.  This specifies how much memory is usable by non-movable
      allocations and the remainder is used for ZONE_MOVABLE.  Any range of pages
      within ZONE_MOVABLE can be released by migrating the pages or by reclaiming.
      
      When selecting a zone to take pages from for ZONE_MOVABLE, there are two
      things to consider.  First, only memory from the highest populated zone is
      used for ZONE_MOVABLE.  On the x86, this is probably going to be ZONE_HIGHMEM
      but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64.  Second,
      the amount of memory usable by the kernel will be spread evenly throughout
      NUMA nodes where possible.  If the nodes are not of equal size, the amount of
      memory usable by the kernel on some nodes may be greater than others.
      
      By default, the zone is not as useful for hugetlb allocations because they are
      pinned and non-migratable (currently at least).  A sysctl is provided that
      allows huge pages to be allocated from that zone.  This means that the huge
      page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
      the system assuming that pages are not mlocked.  Despite huge pages being
      non-movable, we do not introduce additional external fragmentation of note as
      huge pages are always the largest contiguous block we care about.
      
      Credit goes to Andy Whitcroft for catching a large variety of problems during
      review of the patches.
      
      This patch creates an additional zone, ZONE_MOVABLE.  This zone is only usable
      by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE.  Hot-added
      memory continues to be placed in its existing destination as there is no
      mechanism to redirect it to a specific zone.
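
The "both flags" rule can be written as a small zone-selection function. The flag bit values, the reduced zone list and all `SK_` / `sk_` names below are hypothetical and chosen only for illustration; the sketch shows the rule that the movable zone is selected only when both flags are present, and that either flag alone is not enough.

```c
/* Hypothetical flag bits; not the kernel's actual values. */
#define SK_GFP_HIGHMEM 0x02u
#define SK_GFP_MOVABLE 0x08u

/* Reduced zone list for illustration. */
enum sk_zone { SK_ZONE_NORMAL, SK_ZONE_HIGHMEM, SK_ZONE_MOVABLE };

/* Highest zone an allocation with these flags may use. */
static enum sk_zone sk_gfp_zone(unsigned int flags)
{
    const unsigned int both = SK_GFP_HIGHMEM | SK_GFP_MOVABLE;

    if ((flags & both) == both)
        return SK_ZONE_MOVABLE;
    if (flags & SK_GFP_HIGHMEM)
        return SK_ZONE_HIGHMEM;
    return SK_ZONE_NORMAL;
}
```

An allocation flagged only as movable stays out of the movable zone in this sketch, which keeps non-highmem-capable allocations in the lower zones.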
      
      [y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code]
      [akpm@linux-foundation.org: various fixes]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      Add __GFP_MOVABLE for callers to flag allocations from high memory that may be migrated · 769848c0
      Mel Gorman committed
      It is often known at allocation time whether a page may be migrated or not.
      This patch adds a flag called __GFP_MOVABLE and a new mask called
      GFP_HIGH_MOVABLE.  Allocations using the __GFP_MOVABLE can be either migrated
      using the page migration mechanism or reclaimed by syncing with backing
      storage and discarding.
      
      An API function very similar to alloc_zeroed_user_highpage() is added for
      __GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable().  The
      flags used by alloc_zeroed_user_highpage() are not changed because it would
      change the semantics of an existing API.  After this patch is applied there
      are no in-kernel users of alloc_zeroed_user_highpage() so it probably should
      be marked deprecated if this patch is merged.
      
      Note that this patch includes a minor cleanup to the use of __GFP_ZERO in
      shmem.c to keep all flag modifications to inode->mapping in the
      shmem_dir_alloc() helper function.  This clean-up suggestion is courtesy of
      Hugh Dickins.
      
      Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the
      concept.  Credit to Hugh Dickins for catching issues with shmem swap vector
      and ramfs allocations.
      
      [akpm@linux-foundation.org: build fix]
      [hugh@veritas.com: __GFP_ZERO cleanup]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • N
      Fix read/truncate race · a32ea1e1
      NeilBrown committed
      do_generic_mapping_read currently samples the i_size at the start and doesn't
      do so again unless it needs to call ->readpage to load a page.  After
      ->readpage it has to re-sample i_size as a truncate may have caused that page
      to be filled with zeros, and the read() call should not see these.
      
      However there are other activities that might cause ->readpage to be called on
      a page between the time that do_generic_mapping_read samples i_size and when
      it finds that it has an uptodate page.  These include at least read-ahead and
      possibly another thread performing a read.
      
      So do_generic_mapping_read must sample i_size *after* it has an uptodate page.
      Thus the current sampling at the start and after a read can be replaced with
      a sampling before the copy-out.
      
      The same change is applied to __generic_file_splice_read.
      
      Note that this fixes any race with truncate_complete_page, but does not fix a
      possible race with truncate_partial_page.  If a partial truncate happens after
      do_generic_mapping_read samples i_size and before the copy-out, the NULs that
      truncate_partial_page places in the page could be copied out incorrectly.
      
      I think the best fix for that is to *not* zero out parts of the page in
      truncate_partial_page, but rather to zero out the tail of a page when
      increasing i_size.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 17 Jul, 2007 20 commits