1. 08 5月, 2007 40 次提交
    • C
      slab allocators: Remove SLAB_CTOR_ATOMIC · 4f104934
      Christoph Lameter 提交于
      SLAB_CTOR atomic is never used which is no surprise since I cannot imagine
      that one would want to do something serious in a constructor or destructor.
       In particular given that the slab allocators run with interrupts disabled.
       Actions in constructors and destructors are by their nature very limited
      and usually do not go beyond initializing variables and list operations.
      
      (The i386 pgd ctor and dtors do take a spinlock in constructor and
      destructor.....  I think that is the furthest we go at this point.)
      
      There is no flag passed to the destructor so removing SLAB_CTOR_ATOMIC also
      establishes a certain symmetry.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f104934
    • C
      slab allocators: Remove SLAB_DEBUG_INITIAL flag · 50953fe9
      Christoph Lameter 提交于
      I have never seen a use of SLAB_DEBUG_INITIAL.  It is only supported by
      SLAB.
      
      I think its purpose was to have a callback after an object has been freed
      to verify that the state is the constructor state again?  The callback is
      performed before each freeing of an object.
      
      I would think that it is much easier to check the object state manually
      before the free.  That also places the check near the code object
      manipulation of the object.
      
      Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
      compiled with SLAB debugging on.  If there would be code in a constructor
      handling SLAB_DEBUG_INITIAL then it would have to be conditional on
      SLAB_DEBUG otherwise it would just be dead code.  But there is no such code
      in the kernel.  I think SLUB_DEBUG_INITIAL is too problematic to make real
      use of, difficult to understand and there are easier ways to accomplish the
      same effect (i.e.  add debug code before kfree).
      
      There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
      clear in fs inode caches.  Remove the pointless checks (they would even be
      pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.
      
      This is the last slab flag that SLUB did not support.  Remove the check for
      unimplemented flags from SLUB.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      50953fe9
    • B
      get_unmapped_area doesn't need hugetlbfs hacks anymore · 4b1d8929
      Benjamin Herrenschmidt 提交于
      Remove the hugetlbfs specific hacks in toplevel get_unmapped_area() now that
      all archs and hugetlbfs itself do the right thing for both cases.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: NWilliam Irwin <bill.irwin@oracle.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Grant Grundler <grundler@parisc-linux.org>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b1d8929
    • B
      get_unmapped_area handles MAP_FIXED in generic code · 06abdfb4
      Benjamin Herrenschmidt 提交于
      generic arch_get_unmapped_area() now handles MAP_FIXED.  Now that all
      implementations have been fixed, change the toplevel get_unmapped_area() to
      call into arch or drivers for the MAP_FIXED case.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Grant Grundler <grundler@parisc-linux.org>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: William Irwin <bill.irwin@oracle.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      06abdfb4
    • D
      oom: fix constraint deadlock · 2b45ab33
      David Rientjes 提交于
      Fixes a deadlock in the OOM killer for allocations that are not
      __GFP_HARDWALL.
      
      Before the OOM killer checks for the allocation constraint, it takes
      callback_mutex.
      
      constrained_alloc() iterates through each zone in the allocation zonelist
      and calls cpuset_zone_allowed_softwall() to determine whether an allocation
      for gfp_mask is possible.  If a zone's node is not in the OOM-triggering
      task's mems_allowed, it is not exiting, and we did not fail on a
      __GFP_HARDWALL allocation, cpuset_zone_allowed_softwall() attempts to take
      callback_mutex to check the nearest exclusive ancestor of current's cpuset.
       This results in deadlock.
      
      We now take callback_mutex after iterating through the zonelist since we
      don't need it yet.
      
      Cc: Andi Kleen <ak@suse.de>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Martin J. Bligh <mbligh@mbligh.org>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b45ab33
    • Y
      mm: fix handling of panic_on_oom when cpusets are in use · 2b744c01
      Yasunori Goto 提交于
      The current panic_on_oom may not work if there is a process using
      cpusets/mempolicy, because other nodes' memory may remain.  But some people
      want failover by panic ASAP even if they are used.  This patch makes new
      setting for its request.
      
      This is tested on my ia64 box which has 3 nodes.
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Ethan Solomita <solo@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b744c01
    • A
      fault injection: fix failslab with CONFIG_NUMA · 824ebef1
      Akinobu Mita 提交于
      Currently failslab injects failures into ____cache_alloc().  But with enabling
      CONFIG_NUMA it's not enough to let actual slab allocator functions (kmalloc,
      kmem_cache_alloc, ...) return NULL.
      
      This patch moves fault injection hook inside of __cache_alloc() and
      __cache_alloc_node().  These are lower call path than ____cache_alloc() and
      enable to inject faulures to slab allocators with CONFIG_NUMA.
      Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      824ebef1
    • C
      slab allocators: Remove obsolete SLAB_MUST_HWCACHE_ALIGN · 5af60839
      Christoph Lameter 提交于
      This patch was recently posted to lkml and acked by Pekka.
      
      The flag SLAB_MUST_HWCACHE_ALIGN is
      
      1. Never checked by SLAB at all.
      
      2. A duplicate of SLAB_HWCACHE_ALIGN for SLUB
      
      3. Fulfills the role of SLAB_HWCACHE_ALIGN for SLOB.
      
      The only remaining use is in sparc64 and ppc64 and their use there
      reflects some earlier role that the slab flag once may have had. If
      its specified then SLAB_HWCACHE_ALIGN is also specified.
      
      The flag is confusing, inconsistent and has no purpose.
      
      Remove it.
      Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5af60839
    • N
      mm: madvise avoid exclusive mmap_sem · 0a27a14a
      Nick Piggin 提交于
      Avoid down_write of the mmap_sem in madvise when we can help it.
      Acked-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a27a14a
    • M
    • A
      slob: handle SLAB_PANIC flag · bc0055ae
      Akinobu Mita 提交于
      kmem_cache_create() for slob doesn't handle SLAB_PANIC.
      Signed-off-by: NMatt Mackall <mpm@selenic.com>
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc0055ae
    • C
      Quicklists for page table pages · 6225e937
      Christoph Lameter 提交于
      On x86_64 this cuts allocation overhead for page table pages down to a
      fraction (kernel compile / editing load.  TSC based measurement of times spend
      in each function):
      
      no quicklist
      
      pte_alloc               1569048 4.3s(401ns/2.7us/179.7us)
      pmd_alloc                780988 2.1s(337ns/2.7us/86.1us)
      pud_alloc                780072 2.2s(424ns/2.8us/300.6us)
      pgd_alloc                260022 1s(920ns/4us/263.1us)
      
      quicklist:
      
      pte_alloc                452436 573.4ms(8ns/1.3us/121.1us)
      pmd_alloc                196204 174.5ms(7ns/889ns/46.1us)
      pud_alloc                195688 172.4ms(7ns/881ns/151.3us)
      pgd_alloc                 65228 9.8ms(8ns/150ns/6.1us)
      
      pgd allocations are the most complex and there we see the most dramatic
      improvement (may be we can cut down the amount of pgds cached somewhat?).  But
      even the pte allocations still see a doubling of performance.
      
      1. Proven code from the IA64 arch.
      
      	The method used here has been fine tuned for years and
      	is NUMA aware. It is based on the knowledge that accesses
      	to page table pages are sparse in nature. Taking a page
      	off the freelists instead of allocating a zeroed pages
      	allows a reduction of number of cachelines touched
      	in addition to getting rid of the slab overhead. So
      	performance improves. This is particularly useful if pgds
      	contain standard mappings. We can save on the teardown
      	and setup of such a page if we have some on the quicklists.
      	This includes avoiding lists operations that are otherwise
      	necessary on alloc and free to track pgds.
      
      2. Light weight alternative to use slab to manage page size pages
      
      	Slab overhead is significant and even page allocator use
      	is pretty heavy weight. The use of a per cpu quicklist
      	means that we touch only two cachelines for an allocation.
      	There is no need to access the page_struct (unless arch code
      	needs to fiddle around with it). So the fast past just
      	means bringing in one cacheline at the beginning of the
      	page. That same cacheline may then be used to store the
      	page table entry. Or a second cacheline may be used
      	if the page table entry is not in the first cacheline of
      	the page. The current code will zero the page which means
      	touching 32 cachelines (assuming 128 byte). We get down
      	from 32 to 2 cachelines in the fast path.
      
      3. x86_64 gets lightweight page table page management.
      
      	This will allow x86_64 arch code to faster repopulate pgds
      	and other page table entries. The list operations for pgds
      	are reduced in the same way as for i386 to the point where
      	a pgd is allocated from the page allocator and when it is
      	freed back to the page allocator. A pgd can pass through
      	the quicklists without having to be reinitialized.
      
      64 Consolidation of code from multiple arches
      
      	So far arches have their own implementation of quicklist
      	management. This patch moves that feature into the core allowing
      	an easier maintenance and consistent management of quicklists.
      
      Page table pages have the characteristics that they are typically zero or in a
      known state when they are freed.  This is usually the exactly same state as
      needed after allocation.  So it makes sense to build a list of freed page
      table pages and then consume the pages already in use first.  Those pages have
      already been initialized correctly (thus no need to zero them) and are likely
      already cached in such a way that the MMU can use them most effectively.  Page
      table pages are used in a sparse way so zeroing them on allocation is not too
      useful.
      
      Such an implementation already exits for ia64.  Howver, that implementation
      did not support constructors and destructors as needed by i386 / x86_64.  It
      also only supported a single quicklist.  The implementation here has
      constructor and destructor support as well as the ability for an arch to
      specify how many quicklists are needed.
      
      Quicklists are defined by an arch defining CONFIG_QUICKLIST.  If more than one
      quicklist is necessary then we can define NR_QUICK for additional lists.  F.e.
       i386 needs two and thus has
      
      config NR_QUICK
      	int
      	default 2
      
      If an arch has requested quicklist support then pages can be allocated
      from the quicklist (or from the page allocator if the quicklist is
      empty) via:
      
      quicklist_alloc(<quicklist-nr>, <gfpflags>, <constructor>)
      
      Page table pages can be freed using:
      
      quicklist_free(<quicklist-nr>, <destructor>, <page>)
      
      Pages must have a definite state after allocation and before
      they are freed. If no constructor is specified then pages
      will be zeroed on allocation and must be zeroed before they are
      freed.
      
      If a constructor is used then the constructor will establish
      a definite page state. F.e. the i386 and x86_64 pgd constructors
      establish certain mappings.
      
      Constructors and destructors can also be used to track the pages.
      i386 and x86_64 use a list of pgds in order to be able to dynamically
      update standard mappings.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6225e937
    • C
      slub: remove object activities out of checking functions · 70d71228
      Christoph Lameter 提交于
      Make sure that the check function really only check things and do not perform
      activities.  Extract the tracing and object seeding out of the two check
      functions and place them into slab_alloc and slab_free
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70d71228
    • C
      SLUB: Free slabs and sort partial slab lists in kmem_cache_shrink · 2086d26a
      Christoph Lameter 提交于
      At kmem_cache_shrink check if we have any empty slabs on the partial
      if so then remove them.
      
      Also--as an anti-fragmentation measure--sort the partial slabs so that
      the most fully allocated ones come first and the least allocated last.
      
      The next allocations may fill up the nearly full slabs. Having the
      least allocated slabs last gives them the maximum chance that their
      remaining objects may be freed. Thus we can hopefully minimize the
      partial slabs.
      
      I think this is the best one can do in terms antifragmentation
      measures. Real defragmentation (meaning moving objects out of slabs with
      the least free objects to those that are almost full) can be implemted
      by reverse scanning through the list produced here but that would mean
      that we need to provide a callback at slab cache creation that allows
      the deletion or moving of an object. This will involve slab API
      changes, so defer for now.
      
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2086d26a
    • C
      slub: add ability to list alloc / free callers per slab · 88a420e4
      Christoph Lameter 提交于
      This patch enables listing the callers who allocated or freed objects in a
      cache.
      
      For example to list the allocators for kmalloc-128 do
      
      cat /sys/slab/kmalloc-128/alloc_calls
            7 sn_io_slot_fixup+0x40/0x700
            7 sn_io_slot_fixup+0x80/0x700
            9 sn_bus_fixup+0xe0/0x380
            6 param_sysfs_setup+0xf0/0x280
          276 percpu_populate+0xf0/0x1a0
           19 __register_chrdev_region+0x30/0x360
            8 expand_files+0x2e0/0x6e0
            1 sys_epoll_create+0x60/0x200
            1 __mounts_open+0x140/0x2c0
           65 kmem_alloc+0x110/0x280
            3 alloc_disk_node+0xe0/0x200
           33 as_get_io_context+0x90/0x280
           74 kobject_kset_add_dir+0x40/0x140
           12 pci_create_bus+0x2a0/0x5c0
            1 acpi_ev_create_gpe_block+0x120/0x9e0
           41 con_insert_unipair+0x100/0x1c0
            1 uart_open+0x1c0/0xba0
            1 dma_pool_create+0xe0/0x340
            2 neigh_table_init_no_netlink+0x260/0x4c0
            6 neigh_parms_alloc+0x30/0x200
            1 netlink_kernel_create+0x130/0x320
            5 fz_hash_alloc+0x50/0xe0
            2 sn_common_hubdev_init+0xd0/0x6e0
           28 kernel_param_sysfs_setup+0x30/0x180
           72 process_zones+0x70/0x2e0
      
      cat /sys/slab/kmalloc-128/free_calls
          558 <not-available>
            3 sn_io_slot_fixup+0x600/0x700
           84 free_fdtable_rcu+0x120/0x260
            2 seq_release+0x40/0x60
            6 kmem_free+0x70/0xc0
           24 free_as_io_context+0x20/0x200
            1 acpi_get_object_info+0x3a0/0x3e0
            1 acpi_add_single_object+0xcf0/0x1e40
            2 con_release_unimap+0x80/0x140
            1 free+0x20/0x40
      
      SLAB_STORE_USER must be enabled for a slab cache by either booting with
      "slab_debug" or enabling user tracking specifically for the slab of interest.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88a420e4
    • C
      SLUB: Add MIN_PARTIAL · e95eed57
      Christoph Lameter 提交于
      We leave a mininum of partial slabs on nodes when we search for
      partial slabs on other node. Define a constant for that value.
      
      Then modify slub to keep MIN_PARTIAL slabs around.
      
      This avoids bad situations where a function frees the last object
      in a slab (which results in the page being returned to the page
      allocator) only to then allocate one again (which requires getting
      a page back from the page allocator if the partial list was empty).
      Keeping a couple of slabs on the partial list reduces overhead.
      
      Empty slabs are added to the end of the partial list to insure that
      partially allocated slabs are consumed first (defragmentation).
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e95eed57
    • C
      slub: validation of slabs (metadata and guard zones) · 53e15af0
      Christoph Lameter 提交于
      This enables validation of slab.  Validation means that all objects are
      checked to see if there are redzone violations, if padding has been
      overwritten or any pointers have been corrupted.  Also checks the consistency
      of slab counters.
      
      Validation enables the detection of metadata corruption without the kernel
      having to execute code that actually uses (allocs/frees) and object.  It
      allows one to make sure that the slab metainformation and the guard values
      around an object have not been compromised.
      
      A single slabcache can be checked by writing a 1 to the "validate" file.
      
      i.e.
      
      echo 1 >/sys/slab/kmalloc-128/validate
      
      or use the slabinfo tool to check all slabs
      
      slabinfo -v
      
      Error messages will show up in the syslog.
      
      Note that validation can only reach slabs that are on a list.  This means that
      we are usually restricted to partial slabs and active slabs unless
      SLAB_STORE_USER is active which will build a full slab list and allows
      validation of slabs that are fully in use.  Booting with "slub_debug" set will
      enable SLAB_STORE_USER and then full diagnostic are available.
      
      Note that we attempt to push cpu slabs back to the lists when we start the
      check.  If the cpu slab is reactivated before we get to it (another processor
      grabs it before we get to it) then it cannot be checked.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53e15af0
    • C
      slub: enable tracking of full slabs · 643b1138
      Christoph Lameter 提交于
      If slab tracking is on then build a list of full slabs so that we can verify
      the integrity of all slabs and are also able to built list of alloc/free
      callers.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      643b1138
    • C
      slub: fix object tracking · 77c5e2d0
      Christoph Lameter 提交于
      Object tracking did not work the right way for several call chains. Fix this up
      by adding a new parameter to slub_alloc and slub_free that specifies the
      caller address explicitly.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      77c5e2d0
    • C
    • C
      mm: optimize compound_head() by avoiding a shared page flag · 6d777953
      Christoph Lameter 提交于
      The patch adds PageTail(page) and PageHead(page) to check if a page is the
      head or the tail of a compound page.  This is done by masking the two bits
      describing the state of a compound page and then comparing them.  So one
      comparision and a branch instead of two bit checks and two branches.
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d777953
    • C
      Make page->private usable in compound pages · d85f3385
      Christoph Lameter 提交于
      If we add a new flag so that we can distinguish between the first page and the
      tail pages then we can avoid to use page->private in the first page.
      page->private == page for the first page, so there is no real information in
      there.
      
      Freeing up page->private makes the use of compound pages more transparent.
      They become more usable like real pages.  Right now we have to be careful f.e.
       if we are going beyond PAGE_SIZE allocations in the slab on i386 because we
      can then no longer use the private field.  This is one of the issues that
      cause us not to support debugging for page size slabs in SLAB.
      
      Having page->private available for SLUB would allow more meta information in
      the page struct.  I can probably avoid the 16 bit ints that I have in there
      right now.
      
      Also if page->private is available then a compound page may be equipped with
      buffer heads.  This may free up the way for filesystems to support larger
      blocks than page size.
      
      We add PageTail as an alias of PageReclaim.  Compound pages cannot currently
      be reclaimed.  Because of the alias one needs to check PageCompound first.
      
      The RFC for the this approach was discussed at
      http://marc.info/?t=117574302800001&r=1&w=2
      
      [nacc@us.ibm.com: fix hugetlbfs]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d85f3385
    • C
      SLUB: allocate smallest object size if the user asks for 0 bytes · 614410d5
      Christoph Lameter 提交于
      Makes SLUB behave like SLAB in this area to avoid issues....
      
      Throw a stack dump to alert people.
      
      At some point the behavior should be switched back.  NULL is no memory as
      far as I can tell and if the use asked for 0 bytes then he need to get no
      memory.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      614410d5
    • C
      SLUB: change default alignments · 47bfdc0d
      Christoph Lameter 提交于
      Structures may contain u64 items on 32 bit platforms that are only able to
      address 64 bit items on 64 bit boundaries.  Change the mininum alignment of
      slabs to conform to those expectations.
      
      ARCH_KMALLOC_MINALIGN must be changed for good since a variety of structure
      are mixed in the general slabs.
      
      ARCH_SLAB_MINALIGN is changed because currently there is no consistent
      specification of object alignment.  We may have that in the future when the
      KMEM_CACHE and related macros are used to generate slabs.  These pass the
      alignment of the structure generated by the compiler to the slab.
      
      With KMEM_CACHE etc we could align structures that do not contain 64
      bit values to 32 bit boundaries potentially saving some memory.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      47bfdc0d
    • C
      SLUB core · 81819f0f
      Christoph Lameter 提交于
      This is a new slab allocator which was motivated by the complexity of the
      existing code in mm/slab.c. It attempts to address a variety of concerns
      with the existing implementation.
      
      A. Management of object queues
      
         A particular concern was the complex management of the numerous object
         queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
         each allocating CPU and use objects from a slab directly instead of
         queueing them up.
      
      B. Storage overhead of object queues
      
         SLAB Object queues exist per node, per CPU. The alien cache queue even
         has a queue array that contain a queue for each processor on each
         node. For very large systems the number of queues and the number of
         objects that may be caught in those queues grows exponentially. On our
         systems with 1k nodes / processors we have several gigabytes just tied up
         for storing references to objects for those queues  This does not include
         the objects that could be on those queues. One fears that the whole
         memory of the machine could one day be consumed by those queues.
      
      C. SLAB meta data overhead
      
         SLAB has overhead at the beginning of each slab. This means that data
         cannot be naturally aligned at the beginning of a slab block. SLUB keeps
         all meta data in the corresponding page_struct. Objects can be naturally
         aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
         boundaries and can fit tightly into a 4k page with no bytes left over.
         SLAB cannot do this.
      
      D. SLAB has a complex cache reaper
      
         SLUB does not need a cache reaper for UP systems. On SMP systems
         the per CPU slab may be pushed back into partial list but that
         operation is simple and does not require an iteration over a list
         of objects. SLAB expires per CPU, shared and alien object queues
         during cache reaping which may cause strange hold offs.
      
      E. SLAB has complex NUMA policy layer support
      
         SLUB pushes NUMA policy handling into the page allocator. This means that
         allocation is coarser (SLUB does interleave on a page level) but that
         situation was also present before 2.6.13. SLABs application of
         policies to individual slab objects allocated in SLAB is
         certainly a performance concern due to the frequent references to
         memory policies which may lead a sequence of objects to come from
         one node after another. SLUB will get a slab full of objects
         from one node and then will switch to the next.
      
      F. Reduction of the size of partial slab lists
      
         SLAB has per node partial lists. This means that over time a large
         number of partial slabs may accumulate on those lists. These can
         only be reused if allocator occur on specific nodes. SLUB has a global
         pool of partial slabs and will consume slabs from that pool to
         decrease fragmentation.
      
      G. Tunables
      
         SLAB has sophisticated tuning abilities for each slab cache. One can
         manipulate the queue sizes in detail. However, filling the queues still
         requires the uses of the spin lock to check out slabs. SLUB has a global
         parameter (min_slab_order) for tuning. Increasing the minimum slab
         order can decrease the locking overhead. The bigger the slab order the
         less motions of pages between per CPU and partial lists occur and the
         better SLUB will be scaling.
      
      G. Slab merging
      
         We often have slab caches with similar parameters. SLUB detects those
         on boot up and merges them into the corresponding general caches. This
         leads to more effective memory use. About 50% of all caches can
         be eliminated through slab merging. This will also decrease
         slab fragmentation because partial allocated slabs can be filled
         up again. Slab merging can be switched off by specifying
         slub_nomerge on boot up.
      
         Note that merging can expose heretofore unknown bugs in the kernel
         because corrupted objects may now be placed differently and corrupt
         differing neighboring objects. Enable sanity checks to find those.
      
      H. Diagnostics
      
         The current slab diagnostics are difficult to use and require a
         recompilation of the kernel. SLUB contains debugging code that
         is always available (but is kept out of the hot code paths).
         SLUB diagnostics can be enabled via the "slab_debug" option.
         Parameters can be specified to select a single or a group of
         slab caches for diagnostics. This means that the system is running
         with the usual performance and it is much more likely that
         race conditions can be reproduced.
      
      I. Resiliency
      
         If basic sanity checks are on then SLUB is capable of detecting
         common error conditions and recover as best as possible to allow the
         system to continue.
      
      J. Tracing
      
         Tracing can be enabled via the slab_debug=T,<slabcache> option
         during boot. SLUB will then protocol all actions on that slabcache
         and dump the object contents on free.
      
      K. On demand DMA cache creation.
      
         Generally DMA caches are not needed. If a kmalloc is used with
         __GFP_DMA then just create this single slabcache that is needed.
         For systems that have no ZONE_DMA requirement the support is
         completely eliminated.
      
      L. Performance increase
      
         Some benchmarks have shown speed improvements on kernbench in the
         range of 5-10%. The locking overhead of slub is based on the
         underlying base allocation size. If we can reliably allocate
         larger order pages then it is possible to increase slub
         performance much further. The anti-fragmentation patches may
         enable further performance increases.
      
      Tested on:
      i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator
      
      SLUB Boot options
      
      slub_nomerge		Disable merging of slabs
      slub_min_order=x	Require a minimum order for slab caches. This
      			increases the managed chunk size and therefore
      			reduces meta data and locking overhead.
      slub_min_objects=x	Mininum objects per slab. Default is 8.
      slub_max_order=x	Avoid generating slabs larger than order specified.
      slub_debug		Enable all diagnostics for all caches
      slub_debug=<options>	Enable selective options for all caches
      slub_debug=<o>,<cache>	Enable selective options for a certain set of
      			caches
      
      Available Debug options
      F		Double Free checking, sanity and resiliency
      R		Red zoning
      P		Object / padding poisoning
      U		Track last free / alloc
      T		Trace all allocs / frees (only use for individual slabs).
      
      To use SLUB: Apply this patch and then select SLUB as the default slab
      allocator.
      
      [hugh@veritas.com: fix an oops-causing locking error]
      [akpm@linux-foundation.org: various stupid cleanups and small fixes]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81819f0f
    • A
      slab: mark set_up_list3s() __init · a3a02be7
      Andrew Morton 提交于
      It is only ever used prior to free_initmem().
      
      (It will cause a warning when we run the section checking, but that's a
      false-positive and it simply changes the source of an existing warning, which
      is also a false-positive)
      
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a3a02be7
    • M
      Do not disable interrupts when reading min_free_kbytes · 3b1d92c5
      Mel Gorman 提交于
      The sysctl handler for min_free_kbytes calls setup_per_zone_pages_min() on
      read or write.  This function iterates through every zone and calls
      spin_lock_irqsave() on the zone LRU lock.  When reading min_free_kbytes,
      this is a total waste of time that disables interrupts on the local
      processor.  It might even be noticable machines with large numbers of zones
      if a process started constantly reading min_free_kbytes.
      
      This patch only calls setup_per_zone_pages_min() only on write. Tested on
      an x86 laptop and it did the right thing.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b1d92c5
    • E
      slab: NUMA kmem_cache diet · 8da3430d
      Eric Dumazet 提交于
      Some NUMA machines have a big MAX_NUMNODES (possibly 1024), but fewer
      possible nodes.  This patch dynamically sizes the 'struct kmem_cache' to
      allocate only needed space.
      
      I moved nodelists[] field at the end of struct kmem_cache, and use the
      following computation in kmem_cache_init()
      
      cache_cache.buffer_size = offsetof(struct kmem_cache, nodelists) +
                                       nr_node_ids * sizeof(struct kmem_list3 *);
      
      On my two nodes x86_64 machine, kmem_cache.obj_size is now 192 instead of 704
      (This is because on x86_64, MAX_NUMNODES is 64)
      
      On bigger NUMA setups, this might reduce the gfporder of "cache_cache"
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8da3430d
    • E
      SLAB: don't allocate empty shared caches · 63109846
      Eric Dumazet 提交于
      We can avoid allocating empty shared caches and avoid unecessary check of
      cache->limit.  We save some memory.  We avoid bringing into CPU cache
      unecessary cache lines.
      
      All accesses to l3->shared are already checking NULL pointers so this patch is
      safe.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      63109846
    • E
      SLAB: use num_possible_cpus() in enable_cpucache() · 364fbb29
      Eric Dumazet 提交于
      The existing comment in mm/slab.c is *perfect*, so I reproduce it :
      
               /*
                * CPU bound tasks (e.g. network routing) can exhibit cpu bound
                * allocation behaviour: Most allocs on one cpu, most free operations
                * on another cpu. For these cases, an efficient object passing between
                * cpus is necessary. This is provided by a shared array. The array
                * replaces Bonwick's magazine layer.
                * On uniprocessor, it's functionally equivalent (but less efficient)
                * to a larger limit. Thus disabled by default.
                */
      
      As most shiped linux kernels are now compiled with CONFIG_SMP, there is no way
      a preprocessor #if can detect if the machine is UP or SMP. Better to use
      num_possible_cpus().
      
      This means on UP we allocate a 'size=0 shared array', to be more efficient.
      
      Another patch can later avoid the allocations of 'empty shared arrays', to
      save some memory.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      364fbb29
    • J
      readahead: code cleanup · 6ce745ed
      Jan Kara 提交于
      Rename file_ra_state.prev_page to prev_index and file_ra_state.offset to
      prev_offset.  Also update of prev_index in do_generic_mapping_read() is now
      moved close to the update of prev_offset.
      
      [wfg@mail.ustc.edu.cn: fix it]
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6ce745ed
    • J
      readahead: improve heuristic detecting sequential reads · ec0f1637
      Jan Kara 提交于
      Introduce ra.offset and store in it an offset where the previous read
      ended.  This way we can detect whether reads are really sequential (and
      thus we should not mark the page as accessed repeatedly) or whether they
      are random and just happen to be in the same page (and the page should
      really be marked accessed again).
      Signed-off-by: NJan Kara <jack@suse.cz>
      Acked-by: NNick Piggin <nickpiggin@yahoo.com.au>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec0f1637
    • B
      Add unitialized_var() macro for suppressing gcc warnings · 94909914
      Borislav Petkov 提交于
      Introduce a macro for suppressing gcc from generating a warning about a
      probable uninitialized state of a variable.
      
      Example:
      
      -	spinlock_t *ptl;
      +	spinlock_t *uninitialized_var(ptl);
      
      Not a happy solution, but those warnings are obnoxious.
      
      - Using the usual pointlessly-set-it-to-zero approach wastes several
        bytes of text.
      
      - Using a macro means we can (hopefully) do something else if gcc changes
        cause the `x = x' hack to stop working
      
      - Using a macro means that people who are worried about hiding true bugs
        can easily turn it off.
      Signed-off-by: NBorislav Petkov <bbpetkov@yahoo.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      94909914
    • N
      mm: simplify filemap_nopage · a8127717
      Nick Piggin 提交于
      Identical block is duplicated twice: contrary to the comment, we have been
      re-reading the page *twice* in filemap_nopage rather than once.
      
      If any retry logic or anything is needed, it belongs in lower levels anyway.
      Only retry once.  Linus agrees.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a8127717
    • A
      add pfn_valid_within helper for sub-MAX_ORDER hole detection · 14e07298
      Andy Whitcroft 提交于
      Generally we work under the assumption that memory the mem_map array is
      contigious and valid out to MAX_ORDER_NR_PAGES block of pages, ie.  that if we
      have validated any page within this MAX_ORDER_NR_PAGES block we need not check
      any other.  This is not true when CONFIG_HOLES_IN_ZONE is set and we must
      check each and every reference we make from a pfn.
      
      Add a pfn_valid_within() helper which should be used when scanning pages
      within a MAX_ORDER_NR_PAGES block when we have already checked the validility
      of the block normally with pfn_valid().  This can then be optimised away when
      we do not have holes within a MAX_ORDER_NR_PAGES block of pages.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NBob Picco <bob.picco@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      14e07298
    • J
      allow oom_adj of saintly processes · 9a82782f
      Joshua N Pritikin 提交于
      If the badness of a process is zero then oom_adj>0 has no effect.  This
      patch makes sure that the oom_adj shift actually increases badness points
      appropriately.
      Signed-off-by: NJoshua N. Pritikin <jpritikin@pobox.com>
      Cc: Andrea Arcangeli <andrea@novell.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a82782f
    • N
      mm: make read_cache_page synchronous · 6fe6900e
      Nick Piggin 提交于
      Ensure pages are uptodate after returning from read_cache_page, which allows
      us to cut out most of the filesystem-internal PageUptodate calls.
      
      I didn't have a great look down the call chains, but this appears to fixes 7
      possible use-before uptodate in hfs, 2 in hfsplus, 1 in jfs, a few in
      ecryptfs, 1 in jffs2, and a possible cleared data overwritten with readpage in
      block2mtd.  All depending on whether the filler is async and/or can return
      with a !uptodate page.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6fe6900e
    • P
      slab: ensure cache_alloc_refill terminates · 714b8171
      Pekka Enberg 提交于
      If slab->inuse is corrupted, cache_alloc_refill can enter an infinite
      loop as detailed by Michael Richardson in the following post:
      <http://lkml.org/lkml/2007/2/16/292>. This adds a BUG_ON to catch
      those cases.
      
      Cc: Michael Richardson <mcr@sandelman.ca>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      714b8171
    • N
      mm: remove gcc workaround · 5f22df00
      Nick Piggin 提交于
      Minimum gcc version is 3.2 now.  However, with likely profiling, even
      modern gcc versions cannot always eliminate the call.
      
      Replace the placeholder functions with the more conventional empty static
      inlines, which should be optimal for everyone.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f22df00
    • C
      Use ZVC counters to establish exact size of dirtyable pages · 1b424464
      Christoph Lameter 提交于
      We can use the global ZVC counters to establish the exact size of the LRU
      and the free pages.  This allows a more accurate determination of the dirty
      ratio.
      
      This patch will fix the broken ratio calculations if large amounts of
      memory are allocated to huge pags or other consumers that do not put the
      pages on to the LRU.
      
      Notes:
      - I did not add NR_SLAB_RECLAIMABLE to the calculation of the
        dirtyable pages. Those may be reclaimable but they are at this
        point not dirtyable. If NR_SLAB_RECLAIMABLE would be considered
        then a huge number of reclaimable pages would stop writeback
        from occurring.
      
      - This patch used to be in mm as the last one in a series of patches.
        It was removed when Linus updated the treatment of highmem because
        there was a conflict. I updated the patch to follow Linus' approach.
        This patch is neede to fulfill the claims made in the beginning of the
        patchset that is now in Linus' tree.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1b424464