1. 08 Dec, 2006 30 commits
    • [PATCH] slab: remove SLAB_LEVEL_MASK · a06d72c1
      Christoph Lameter committed
      SLAB_LEVEL_MASK is only used internally to the slab and is
      an alias of GFP_LEVEL_MASK.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] slab: remove SLAB_NO_GROW · 6e0eaa4b
      Christoph Lameter committed
      It is only used internally in the slab.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] kill install_file_pte's pte_val · 2d4d862f
      Hugh Dickins committed
      David Binderman and his Intel C compiler rightly observe that
      install_file_pte no longer has any use for its pte_val.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: d binderman <dcb314@hotmail.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: cleanup indentation on switch for CPU operations · ce421c79
      Andy Whitcroft committed
      These patches introduced new switch statements which are indented contrary
      to the consensus in mm/*.c.  Fix them up to match that consensus.
      
          [PATCH] node local per-cpu-pages
          [PATCH] ZVC: Scale thresholds depending on the size of the system
          commit e7c8d5c9
          commit df9ecaba
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] reject corrupt swapfiles earlier · 5d1854e1
      Eric Sandeen committed
      The fsfuzzer found this; with a corrupt small swapfile that claims to have
      many pages:
      
        [root]# file swap.741.img
        swap.741.img: Linux/i386 swap file (new style) 1 (4K pages) size 1040191487 pages
        [root]# ls -l swap.741.img
        -rw-r--r-- 1 root root 16777216 Nov 22 05:18 swap.741.img
      
      sys_swapon() will try to vmalloc all those pages, and -then- check to see if
      the file is actually that large:
      
                      if (!(p->swap_map = vmalloc(maxpages * sizeof(short)))) {
        <snip>
              if (swapfilesize && maxpages > swapfilesize) {
                      printk(KERN_WARNING
                             "Swap area shorter than signature indicates\n");
      
      It seems to me that it would make more sense to move this test up before
      the vmalloc, with the other checks, to avoid the OOM-killer in this
      situation...
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
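The reordering argued for in the swapfile commit above can be sketched in user space. This is only an illustration of validating an untrusted size before allocating from it: `alloc_swap_map_sketch` and its parameters are hypothetical names, and `calloc` stands in for the kernel's `vmalloc`.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical user-space stand-in for the check ordering in sys_swapon().
 * maxpages:     page count claimed by the swap header (untrusted)
 * swapfilesize: page count the file can actually hold (0 if unknown)
 * Returns a zeroed map with one short per page, or NULL if rejected.
 */
unsigned short *alloc_swap_map_sketch(size_t maxpages, size_t swapfilesize)
{
    /* Reject a header that claims more pages than the file holds... */
    if (swapfilesize && maxpages > swapfilesize)
        return NULL;
    if (maxpages == 0)
        return NULL;

    /* ...and only then allocate (calloc stands in for vmalloc here). */
    return calloc(maxpages, sizeof(unsigned short));
}
```

With the values from the example above, a 16777216-byte file holds 4096 pages of 4K, so a header claiming 1040191487 pages is rejected before any allocation is attempted.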
    • [PATCH] numa node ids are int, page_to_nid and zone_to_nid should return int · 25ba77c1
      Andy Whitcroft committed
      NUMA node ids are passed as either int or unsigned int almost exclusively,
      yet page_to_nid and zone_to_nid both return unsigned long.  This is a
      throwback to when page_to_nid was a #define and was thus exposing the real
      type of the page flags field.
      
      In addition to fixing up the definitions of page_to_nid and zone_to_nid I
      audited the users of these functions identifying the following incorrect
      uses:
      
      1) mm/page_alloc.c show_node() -- printk dumping the node id,
      2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
         against numa_node_id() which returns an int from cpu_to_node(), and
      3) mm/mpolicy.c check_pte_range -- used as an index in node_isset which
         uses bit_set which in generic code takes an int.
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] drain_node_page(): Drain pages in batch units · bc4ba393
      Christoph Lameter committed
      drain_node_pages() currently drains the complete pageset of all pages.  If
      there are a large number of pages in the queues then we may hold off
      interrupts for too long.
      
      Duplicate the method used in free_hot_cold_page.  Only drain pcp->batch
      pages at one time.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
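The batching idea in the drain commit above can be shown with counters alone. This is a simplified sketch, not the kernel code: `struct pcp_sketch` and both functions are hypothetical, and the actual freeing of pages is reduced to decrementing a count.

```c
/* Hypothetical, simplified per-cpu page list: only the counters matter here. */
struct pcp_sketch {
    int count;  /* pages currently queued */
    int batch;  /* maximum pages to release in one pass */
};

/* Release at most one batch; returns how many pages were released. */
int drain_one_batch(struct pcp_sketch *pcp)
{
    int to_drain = pcp->count >= pcp->batch ? pcp->batch : pcp->count;

    /* In the kernel, interrupts would be held off only around this step. */
    pcp->count -= to_drain;
    return to_drain;
}

/* Count the passes needed to empty the queue, one batch per pass. */
int passes_to_empty(struct pcp_sketch *pcp)
{
    int passes = 0;

    while (pcp->count > 0 && drain_one_batch(pcp) > 0)
        passes++;
    return passes;
}
```

A queue of 70 pages with a batch of 31 is emptied in three short passes (31, 31, 8) rather than one long one, which bounds the time spent with interrupts disabled in each pass.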
    • [PATCH] make mm/thrash.c:global_faults static · e3050055
      Adrian Bunk committed
      This patch makes the needlessly global "global_faults" static.
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] enable booting a NUMA system where some nodes have no memory · 7c309a64
      Christian Krafft committed
      When booting a NUMA system with nodes that have no memory (e.g. by limiting
      memory), bootmem_alloc_core tried to find pages in an uninitialized
      bootmem_map.  This caused a NULL pointer access.  This fix adds a check, so
      that NULL is returned.  That will enable the caller (bootmem_alloc_nopanic)
      to allocate memory on other nodes without a panic.
      Signed-off-by: Christian Krafft <krafft@de.ibm.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Martin Bligh <mbligh@google.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Allow NULL pointers in percpu_free · a1205868
      Alan Stern committed
      The patch (as824b) makes percpu_free() ignore NULL arguments, as one would
      expect for a deallocation routine.  (Note that free_percpu is #defined as
      percpu_free in include/linux/percpu.h.) A few callers are updated to remove
      now-unneeded tests for NULL.  A few other callers already seem to assume
      that passing a NULL pointer to percpu_free() is okay!
      
      The patch also removes an unnecessary NULL check in percpu_depopulate().
      Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
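The convention adopted for percpu_free() above mirrors free(NULL) in user space. A minimal sketch of a deallocation routine that tolerates NULL follows; `struct percpu_sketch` and both functions are hypothetical stand-ins, not the kernel's per-cpu allocator.

```c
#include <stdlib.h>

/* Hypothetical container holding one allocation per CPU. */
struct percpu_sketch {
    int    ncpus;
    void **ptrs;
};

struct percpu_sketch *percpu_alloc_sketch(int ncpus, size_t size)
{
    struct percpu_sketch *p;

    if (ncpus <= 0)
        return NULL;
    p = calloc(1, sizeof(*p));
    if (!p)
        return NULL;
    p->ptrs = calloc((size_t)ncpus, sizeof(void *));
    if (!p->ptrs) {
        free(p);
        return NULL;
    }
    p->ncpus = ncpus;
    for (int i = 0; i < ncpus; i++)
        p->ptrs[i] = calloc(1, size);   /* a failed slot simply stays NULL */
    return p;
}

/* Like free(): passing NULL is a harmless no-op, so callers need no test. */
void percpu_free_sketch(struct percpu_sketch *p)
{
    if (!p)
        return;
    for (int i = 0; i < p->ncpus; i++)
        free(p->ptrs[i]);
    free(p->ptrs);
    free(p);
}
```

Error paths in callers can then call the free routine unconditionally, which is what lets the now-unneeded NULL tests be removed.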
    • [PATCH] leak tracking for kmalloc_node · 8b98c169
      Christoph Hellwig committed
      We have variants of kmalloc and kmem_cache_alloc that leave leak tracking to
      the caller.  This is used for subsystem-specific allocators like skb_alloc.
      
      To make skb_alloc node-aware we need similar routines for the node-aware slab
      allocator, which this patch adds.
      
      Note that the code is rather ugly, but it mirrors the non-node-aware code 1:1:
      
      [akpm@osdl.org: add module export]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Always print out the header line in /proc/swaps · 881e4aab
      Suleiman Souhlal committed
      It would be possible for /proc/swaps to not always print out the header:
      
      swapon /dev/hdc2
      swapon /dev/hde2
      swapoff /dev/hdc2
      
      At this point /proc/swaps would not have a header.
      Signed-off-by: Suleiman Souhlal <suleiman@google.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
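One plausible reading of the /proc/swaps scenario above is that the header was tied to the first slot of the swap table, so it vanished once that slot was released. The sketch below takes the opposite approach and emits the header before any slot is examined. All names are hypothetical, and the header text is abbreviated rather than copied from the kernel.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical swap table entry. */
struct swap_slot_sketch {
    const char *path;
    int         in_use;
};

/* Append s to buf at offset off if it fits; returns the new offset. */
static size_t append_str(char *buf, size_t len, size_t off, const char *s)
{
    size_t n = strlen(s);

    if (off + n + 1 > len)
        return off;             /* drop what does not fit */
    memcpy(buf + off, s, n + 1);
    return off + n;
}

/* Render the listing; the header never depends on which slots are in use. */
size_t format_swaps_sketch(const struct swap_slot_sketch *slots, int nslots,
                           char *buf, size_t len)
{
    size_t off = 0;

    if (len == 0)
        return 0;
    buf[0] = '\0';
    off = append_str(buf, len, off, "Filename\tType\tSize\tUsed\tPriority\n");
    for (int i = 0; i < nslots; i++) {
        if (!slots[i].in_use)
            continue;           /* a freed first slot no longer hides the header */
        off = append_str(buf, len, off, slots[i].path);
        off = append_str(buf, len, off, "\n");
    }
    return off;
}
```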
    • [PATCH] OOM can panic due to processes stuck in __alloc_pages() · b43a57bb
      Kirill Korotaev committed
      OOM can panic due to the processes stuck in __alloc_pages() doing infinite
      rebalance loop while no memory can be reclaimed.  OOM killer tries to kill
      some processes, but unfortunately, the rebalance label was moved by someone
      below the TIF_MEMDIE check, so the buddy allocator doesn't see that the
      process is OOM-killed and that it can simply fail the allocation :/
      
      Observed in reality on RHEL4(2.6.9)+OpenVZ kernel when a user doing some
      memory allocation tricks triggered OOM panic.
      Signed-off-by: Denis Lunev <den@sw.ru>
      Signed-off-by: Kirill Korotaev <dev@openvz.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mlock cleanup · a3eea484
      Rik Bobbaers committed
      mm is defined as vma->vm_mm, so use that.
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: add noaliencache boot option to disable numa alien caches · 3395ee05
      Paul Menage committed
      When using numa=fake on non-NUMA hardware there is no benefit to having the
      alien caches, and they consume much memory.
      
      Add a kernel boot option to disable them.
      
      Christoph sayeth "This is good to have even on large NUMA.  The problem is
      that the alien caches grow by the square of the size of the system in terms of
      nodes."
      
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
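The quadratic growth quoted in the noaliencache commit above can be made concrete with a little arithmetic. The sketch assumes, for illustration only, that each node keeps one alien queue for every other node per slab cache; the function name is hypothetical.

```c
/*
 * Assumption for illustration: per slab cache, every node keeps one alien
 * queue for each *other* node, giving nodes * (nodes - 1) queues in total.
 */
unsigned long alien_queues_per_cache(unsigned long nodes)
{
    return nodes ? nodes * (nodes - 1) : 0;
}
```

Under that assumption, going from 4 to 8 nodes raises the count from 12 to 56 queues per slab cache, which is why the option is attractive both for numa=fake setups and for large real NUMA machines.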
    • [PATCH] mm: slab: eliminate lock_cpu_hotplug from slab · 8f5be20b
      Ravikiran G Thirumalai committed
      Here's an attempt towards doing away with lock_cpu_hotplug in the slab
      subsystem.  This approach also fixes a bug which shows up when cpus are
      being offlined/onlined and slab caches are being tuned simultaneously.
      
      http://marc.theaimsgroup.com/?l=linux-kernel&m=116098888100481&w=2
      
      The patch has been stress tested overnight on a 2 socket 4 core AMD box with
      repeated cpu online and offline, while dbench and kernbench processes are
      running, and slab caches being tuned at the same time.
      There were no lockdep warnings either.  (This test was run on 2.6.18, as
      2.6.19-rc crashes at __drain_pages
      http://marc.theaimsgroup.com/?l=linux-kernel&m=116172164217678&w=2 )
      
      The approach here is to hold cache_chain_mutex from CPU_UP_PREPARE until
      CPU_ONLINE (similar in approach to workqueue_mutex).  Slab code sensitive
      to cpu_online_map (kmem_cache_create, kmem_cache_destroy, slabinfo_write,
      __cache_shrink) is already serialized with cache_chain_mutex.  (This patch
      lengthens cache_chain_mutex hold time at kmem_cache_destroy to cover this).
      This patch also takes the cache_chain_sem at kmem_cache_shrink to protect
      sanity of cpu_online_map at __cache_shrink, as viewed by slab.
      (kmem_cache_shrink->__cache_shrink->drain_cpu_caches).  But, really,
      kmem_cache_shrink is used at just one place in the acpi subsystem!  Do we
      really need to keep kmem_cache_shrink at all?
      
      Another note.  Looks like a cpu hotplug event can send CPU_UP_CANCELED to
      a registered subsystem even if the subsystem did not receive CPU_UP_PREPARE.
      This could be due to a subsystem registered for notification earlier than
      the current subsystem crapping out with NOTIFY_BAD.  Badness can occur within
      the CPU_UP_CANCELED code path at slab if this happens (the same would
      apply for workqueue.c as well).  To overcome this, we might have to either
      a) use a per subsystem flag and avoid handling of CPU_UP_CANCELED, or
      b) use special notifier events like LOCK_ACQUIRE/RELEASE as Gautham was
         using in his experiments, or
      c) not send CPU_UP_CANCELED to a subsystem which did not receive
         CPU_UP_PREPARE.
      
      I would prefer c).
      Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
      Signed-off-by: Shai Fultheim <shai@scalex86.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] slab debug and ARCH_SLAB_MINALIGN don't get along · a44b56d3
      Kevin Hilman committed
      When CONFIG_SLAB_DEBUG is used in combination with ARCH_SLAB_MINALIGN, some
      debug flags should be disabled which depend on BYTES_PER_WORD alignment.
      
      The disabling of these debug flags is not properly handled when
      BYTES_PER_WORD < ARCH_SLAB_MINALIGN < cache_line_size()
      
      This patch fixes that and also adds an alignment check to
      cache_alloc_debugcheck_after() when ARCH_SLAB_MINALIGN is used.
      Signed-off-by: Kevin Hilman <khilman@mvista.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
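The two pieces of the slab debug commit above, dropping word-aligned debug features when a larger alignment is required and checking the alignment of returned objects, can be sketched as follows. The flag names and values here are hypothetical, and which flags count as "some debug flags" is an assumption, since the message does not list them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits standing in for word-aligned debug features. */
#define SKETCH_RED_ZONE       0x1u
#define SKETCH_STORE_USER     0x2u
#define SKETCH_BYTES_PER_WORD sizeof(void *)

/* Drop debug features that only work when objects are word aligned. */
unsigned int adjust_debug_flags_sketch(unsigned int flags, size_t align)
{
    if (align > SKETCH_BYTES_PER_WORD)
        flags &= ~(SKETCH_RED_ZONE | SKETCH_STORE_USER);
    return flags;
}

/* Check an object address against a power-of-two alignment. */
int object_is_aligned_sketch(const void *objp, size_t align)
{
    return ((uintptr_t)objp & (align - 1)) == 0;
}
```

The comparison is made against the word size rather than the cache line size, so an alignment that falls strictly between the two still disables the flags.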
    • [PATCH] htlb forget rss with pt sharing · cace673d
      Chen, Kenneth W committed
      Imprecise RSS accounting is an irritating ill effect of pt sharing.  After
      consulting with several VM experts, I have tried various methods to solve
      that problem: (1) iterate through all mm_structs that share the PT and
      increment count; (2) keep RSS count in page table structure and then sum
      them up at reporting time.  Neither of the above methods yields a
      satisfactory implementation.
      
      Since process RSS accounting is purely informational, I propose we don't
      count them at all for hugetlb pages.  rlimit has such a field, though there
      is absolutely no enforcement on limiting that resource.  One other method is
      to account all RSS at hugetlb mmap time regardless of whether they are
      faulted or not.  I opt for the simplicity of no accounting at all.
      
      Hugetlb pages are special: they are reserved up front in a global
      reservation pool and are not reclaimable.  From a physical memory resource
      point of view, they are already consumed regardless of whether there are
      users using them.
      
      If the concern is that RSS can be used to control resource allocation, we
      can already specify a hugetlb fs size limit and the sysadmin can enforce
      that at mount time.  Combined with the two points mentioned above, I fail to
      see anything that would be affected by this patch.
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Cc: Dave McCracken <dmccr@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] shared page table for hugetlb page · 39dde65c
      Chen, Kenneth W committed
      This follows up on the shared page table work done by Dave McCracken.  This
      set of patches targets shared page tables for hugetlb memory only.
      
      Shared page tables are particularly useful when a large number of
      independent processes share large shared memory segments.  In the normal
      page case, the amount of memory saved from the processes' page tables is
      quite significant.  For hugetlb, the saving on page table memory is not the
      primary objective (as hugetlb itself already cuts down page table overhead
      significantly); instead, the purpose of using shared page tables on hugetlb
      is to allow faster TLB refill and less cache pollution upon TLB miss.
      
      With PT sharing, pte entries are shared among hundreds of processes, so the
      cache consumed by all the page tables is smaller and, in return, the
      application gets a much higher cache hit ratio.  One other effect is that
      the cache hit ratio of the hardware page walker hitting on ptes in cache
      will be higher, which helps to reduce TLB miss latency.  These two effects
      contribute to higher application performance.
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Cc: Dave McCracken <dmccr@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] balance_pgdat() cleanup · e1dbeda6
      Andrew Morton committed
      Despaghettify balance_pgdat() a bit.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: add arch_alloc_page · cc102509
      Nick Piggin committed
      Add an arch_alloc_page to match arch_free_page.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] new scheme to preempt swap token · 7602bdf2
      Ashwin Chaugule committed
      The new swap token patches replace the current token traversal algo.  The old
      algo had a crude timeout parameter that was used to hand over the token from
      one task to another.  This algo transfers the token to the tasks that are in
      need of the token.  The urgency for the token is based on the number of times
      a task is required to swap in pages.  Accordingly, the priority of a task is
      incremented if it has been badly affected by swap-outs.  To ensure that
      the token doesn't bounce around rapidly, the token holders are given a
      priority boost.  The priority of tasks is also decremented if their rate of
      swap-ins keeps reducing.  This way, the condition to check whether to preempt
      the swap token is a matter of comparing two tasks' priority fields.
      
      [akpm@osdl.org: cleanups]
      Signed-off-by: Ashwin Chaugule <ashwin.chaugule@celunite.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
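The priority scheme described in the swap token commit above can be sketched as a small state update. This is a simplified illustration built only from that description, not the code in mm/thrash.c: the struct, the boost value of 2, and the use of a "faults since last swap-in" interval as the urgency signal are all assumptions.

```c
#include <stddef.h>

#define TOKEN_BOOST_SKETCH 2u   /* assumed size of the holder's boost */

/* Hypothetical per-task state for the sketch. */
struct mm_sketch {
    unsigned int  token_priority;
    unsigned long last_interval;  /* faults seen between its last two swap-ins */
};

/*
 * Called when mm has to swap in a page.  interval is the number of faults
 * since mm's previous swap-in; a shrinking interval means rising urgency.
 */
void contend_for_token_sketch(struct mm_sketch **holder, struct mm_sketch *mm,
                              unsigned long interval)
{
    if (*holder == NULL) {
        *holder = mm;                               /* token was free */
    } else if (*holder == mm) {
        mm->token_priority += TOKEN_BOOST_SKETCH;   /* keep it from bouncing */
    } else {
        if (interval < mm->last_interval)
            mm->token_priority++;                   /* swapping in more often */
        else if (mm->token_priority > 0)
            mm->token_priority--;                   /* pressure is easing */

        /* Preemption reduces to comparing two priority fields. */
        if (mm->token_priority > (*holder)->token_priority) {
            mm->token_priority += TOKEN_BOOST_SKETCH;
            *holder = mm;
        }
    }
    mm->last_interval = interval;
}
```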
    • [PATCH] grab swap token reordered · 098fe651
      Ashwin Chaugule committed
      Make sure the contention for the token happens _before_ any read-in and
      kicks the swap-token algo only when the VM is under pressure.
      Signed-off-by: Ashwin Chaugule <ashwin.chaugule@celunite.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] oom: less memdie · f2a2a710
      Nick Piggin committed
      Don't cause all threads in all other thread groups to gain TIF_MEMDIE
      otherwise we'll get a thundering herd eating our memory reserve.  This may not
      be the optimal scheme, but it fits our policy of allowing just one TIF_MEMDIE
      in the system at once.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] oom: cleanup messages · f3af38d3
      Nick Piggin committed
      Clean up the OOM killer messages to be more consistent.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] oom: don't kill unkillable children or siblings · c33e0fca
      Nick Piggin committed
      Abort the kill if any of our threads have OOM_DISABLE set.  Having this
      test here also prevents any OOM_DISABLE child of the "selected" process
      from being killed.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] memory page_alloc zonelist caching speedup · 9276b1bc
      Paul Jackson committed
      Optimize the critical zonelist scanning for free pages in the kernel memory
      allocator by caching the zones that were found to be full recently, and
      skipping them.
      
      Remembers the zones in a zonelist that were short of free memory in the
      last second.  And it stashes a zone-to-node table in the zonelist struct,
      to optimize that conversion (minimize its cache footprint.)
      
      Recent changes:
      
          This differs in a significant way from a similar patch that I
          posted a week ago.  Now, instead of having a nodemask_t of
          recently full nodes, I have a bitmask of recently full zones.
          This solves a problem that last week's patch had, which on
          systems with multiple zones per node (such as DMA zone) would
          take seeing any of these zones full as meaning that all zones
          on that node were full.
      
          Also I changed names - from "zonelist faster" to "zonelist cache",
          as that seemed to better convey what we're doing here - caching
          some of the key zonelist state (for faster access.)
      
          See below for some performance benchmark results.  After all that
          discussion with David on why I didn't need them, I went and got
          some ;).  I wanted to verify that I had not hurt the normal case
          of memory allocation noticeably.  At least for my one little
          microbenchmark, I found (1) the normal case wasn't affected, and
          (2) workloads that forced scanning across multiple nodes for
          memory improved up to 10% fewer System CPU cycles and lower
          elapsed clock time ('sys' and 'real').  Good.  See details, below.
      
          I didn't have the logic in get_page_from_freelist() for various
          full nodes and zone reclaim failures correct.  That should be
          fixed up now - notice the new goto labels zonelist_scan,
          this_zone_full, and try_next_zone, in get_page_from_freelist().
      
      There are two reasons I pursued this alternative, over some earlier
      proposals that would have focused on optimizing the fake numa
      emulation case by caching the last useful zone:
      
       1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
          have seen real customer loads where the cost to scan the zonelist
          was a problem, due to many nodes being full of memory before
          we got to a node we could use.  Or at least, I think we have.
          This was related to me by another engineer, based on experiences
          from some time past.  So this is not guaranteed.  Most likely, though.
      
          The following approach should help such real numa systems just as
          much as it helps fake numa systems, or any combination thereof.
      
       2) The effort to distinguish fake from real numa, using node_distance,
          so that we could cache a fake numa node and optimize choosing
          it over equivalent distance fake nodes, while continuing to
          properly scan all real nodes in distance order, was going to
          require a nasty blob of zonelist and node distance munging.
      
          The following approach has no new dependency on node distances or
          zone sorting.
      
      See comment in the patch below for a description of what it actually does.
      
      Technical details of note (or controversy):
      
       - See the use of "zlc_active" and "did_zlc_setup" below, to delay
         adding any work for this new mechanism until we've looked at the
         first zone in zonelist.  I figured the odds of the first zone
         having the memory we needed were high enough that we should just
         look there, first, then get fancy only if we need to keep looking.
      
       - Some odd hackery was needed to add items to struct zonelist, while
         not tripping up the custom zonelists built by the mm/mempolicy.c
         code for MPOL_BIND.  My usual wordy comments below explain this.
         Search for "MPOL_BIND".
      
       - Some per-node data in the struct zonelist is now modified frequently,
         with no locking.  Multiple CPU cores on a node could hit and mangle
         this data.  The theory is that this is just performance hint data,
         and the memory allocator will work just fine despite any such mangling.
         The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
         (a bitmask) and 'last_full_zap' (unsigned long jiffies).  It should
         all be self correcting after at most a one second delay.
      
       - This still does a linear scan of the same lengths as before.  All
         I've optimized is making the scan faster, not algorithmically
         shorter.  It is now able to scan a compact array of 'unsigned
         short' in the case of many full nodes, so one cache line should
         cover quite a few nodes, rather than each node hitting another
         one or two new and distinct cache lines.
      
       - If both Andi and Nick don't find this too complicated, I will be
         (pleasantly) flabbergasted.
      
       - I removed the comment claiming we only use one cacheline's worth of
         zonelist.  We seem, at least in the fake numa case, to have put the
         lie to that claim.
      
       - I pay no attention to the various watermarks and such in this performance
         hint.  A node could be marked full for one watermark, and then skipped
         over when searching for a page using a different watermark.  I think
         that's actually quite ok, as it will tend to slightly increase the
         spreading of memory over other nodes, away from a memory stressed node.
      
      ===============
      
      Performance - some benchmark results and analysis:
      
      This benchmark runs a memory hog program that uses multiple
      threads to touch a lot of memory as quickly as it can.
      
      Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
      the total 96 GBytes on the system, and using 1, 19, 37, or 55
      threads (on a 56 CPU system.)  System, user and real (elapsed)
      timings were recorded for each run, shown in units of seconds,
      in the table below.
      
      Two kernels were tested - 2.6.18-mm3 and the same kernel with
      this zonelist caching patch added.  The table also shows the
      percentage improvement the zonelist caching sys time is over
      (lower than) the stock *-mm kernel.
      
          number      2.6.18-mm3        zonelist-cache     delta (< 0 good)   percent
       GBs     N   -----------------   -----------------   -----------------  systime
       mem threads   sys  user  real     sys  user  real     sys  user  real   better
        12     1     153    24   177     151    24   176      -2     0    -1       1%
        12    19      99    22     8      99    22     8       0     0     0       0%
        12    37     111    25     6     112    25     6       1     0     0      -0%
        12    55     115    25     5     110    23     5      -5    -2     0       4%
        38     1     502    74   576     497    73   570      -5    -1    -6       0%
        38    19     426    78    48     373    76    39     -53    -2    -9      12%
        38    37     544    83    36     547    82    36       3    -1     0      -0%
        38    55     501    77    23     511    80    24      10     3     1      -1%
        64     1     917   125  1042     890   124  1014     -27    -1   -28       2%
        64    19    1118   138   119     965   141   103    -153     3   -16      13%
        64    37    1202   151    94    1136   150    81     -66    -1   -13       5%
        64    55    1118   141    61    1072   140    58     -46    -1    -3       4%
        90     1    1342   177  1519    1275   174  1450     -67    -3   -69       4%
        90    19    2392   199   192    2116   189   176    -276   -10   -16      11%
        90    37    3313   238   175    2972   225   145    -341   -13   -30      10%
        90    55    1948   210   104    1843   213   100    -105     3    -4       5%
      
      Notes:
       1) This test ran a memory hog program that started a specified number N of
          threads, and had each thread allocate and touch 1/N'th of
          the total memory to be used in the test run in a single loop,
          writing a constant word to memory, one store every 4096 bytes.
          Watching this test during some earlier trial runs, I would see
          each of these threads sit down on one CPU and stay there, for
          the remainder of the pass, a different CPU for each thread.
      
       2) The 'real' column is not comparable to the 'sys' or 'user' columns.
          The 'real' column is seconds wall clock time elapsed, from beginning
          to end of that test pass.  The 'sys' and 'user' columns are total
          CPU seconds spent on that test pass.  For a 19 thread test run,
          for example, the sum of 'sys' and 'user' could be up to 19 times the
          number of 'real' elapsed wall clock seconds.
      
       3) Tests were run on a fresh, single-user boot, to minimize the amount
          of memory already in use at the start of the test, and to minimize
          the amount of background activity that might interfere.
      
       4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.
      
       5) Notice that the 'real' time gets large for the single thread runs, even
          though the measured 'sys' and 'user' times are modest.  I'm not sure what
          that means - probably something to do with it being slow for one thread to
          be accessing memory a long way away.  Perhaps the fake numa system, running
          ostensibly the same workload, would not show this substantial degradation
          of 'real' time for one thread on many nodes -- let's hope not.
      
       6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
          ran quite efficiently, as one might expect.  Each pair of threads needed
          to allocate and touch the memory on the node the two threads shared, a
          pleasantly parallelizable workload.
      
       7) The intermediate thread count passes, when asking for a lot of memory
          forcing them to go to a few neighboring nodes, improved the most with this
          zonelist caching patch.
      
      Conclusions:
       * This zonelist cache patch probably makes little difference one way or the
         other for most workloads on real numa hardware, if those workloads avoid
         heavy off node allocations.
       * For memory intensive workloads requiring substantial off-node allocations
         on real numa hardware, this patch improves both kernel and elapsed timings
         up to ten percent.
       * For fake numa systems, I'm optimistic, but will have to leave that up to
         Rohit Seth to actually test (once I get him a 2.6.18 backport.)
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Cc: Rohit Seth <rohitseth@google.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: David Rientjes <rientjes@cs.washington.edu>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9276b1bc
    • C
      [PATCH] Get rid of zone_table[] · 89689ae7
      Christoph Lameter committed
      The zone table is mostly not needed.  If we have a node in the page flags
      then we can get to the zone via NODE_DATA() which is much more likely to be
      already in the cpu cache.
      
      In the case of SMP and UP, NODE_DATA() is a constant pointer, which allows us
      to access an exact replica of zone_table in the node_zones field.  In all of
      the above cases there will be no need at all for the zone table.
      
      The only remaining case is if in a NUMA system the node numbers do not fit
      into the page flags.  In that case we make sparse generate a table that
      maps sections to nodes and use that table to figure out the node number.
      This table is sized to fit in a single cache line for the known 32 bit
      NUMA platform which makes it very likely that the information can be
      obtained without a cache miss.
      
      For sparsemem the zone table seems to have been fairly large, based on
      the maximum possible number of sections and the number of zones per node.
      There is some memory saving by removing zone_table.  The main benefit is to
      reduce the cache footprint of the VM from the frequent lookups of zones.
      Plus it simplifies the page allocator.
      
      [akpm@osdl.org: build fix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      89689ae7
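The lookup path described in the zone_table commit above can be sketched as follows. This is a minimal user-space illustration, not the kernel's code: the `toy_` names, the bit widths, and the flag layout are all assumptions made for the example. It only shows the idea that a node id and zone index packed into the page flags are enough to reach the zone through per-node data, with no global table.

```c
#include <stddef.h>

/* Hypothetical flag layout for illustration; not the kernel's real encoding. */
#define TOY_ZONE_BITS  2
#define TOY_NODE_BITS  6
#define TOY_MAX_ZONES  (1u << TOY_ZONE_BITS)
#define TOY_MAX_NODES  (1u << TOY_NODE_BITS)
#define TOY_ZONE_SHIFT 0
#define TOY_NODE_SHIFT TOY_ZONE_BITS

struct toy_zone { unsigned long free_pages; };
struct toy_node { struct toy_zone node_zones[TOY_MAX_ZONES]; };
struct toy_page { unsigned long flags; };

/* Stand-in for the per-node data that NODE_DATA() would return. */
static struct toy_node toy_nodes[TOY_MAX_NODES];

/* Pack a node id and a zone index into the low bits of the flags word. */
static unsigned long toy_encode_flags(unsigned node, unsigned zone)
{
    return ((unsigned long)node << TOY_NODE_SHIFT) |
           ((unsigned long)zone << TOY_ZONE_SHIFT);
}

static unsigned toy_page_to_nid(const struct toy_page *page)
{
    return (unsigned)(page->flags >> TOY_NODE_SHIFT) & (TOY_MAX_NODES - 1);
}

static unsigned toy_page_zonenum(const struct toy_page *page)
{
    return (unsigned)(page->flags >> TOY_ZONE_SHIFT) & (TOY_MAX_ZONES - 1);
}

/* Reach the zone through the node's own zone array, with no global table. */
static struct toy_zone *toy_page_zone(const struct toy_page *page)
{
    return &toy_nodes[toy_page_to_nid(page)].node_zones[toy_page_zonenum(page)];
}
```

For a page whose flags were built with `toy_encode_flags(5, 2)`, `toy_page_zone()` resolves to `toy_nodes[5].node_zones[2]`.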
    • C
      [PATCH] __unmap_hugepage_range(): add comment · c0a499c2
      Chen, Kenneth W committed
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c0a499c2
    • P
      [PATCH] memory page alloc minor cleanups · 0798e519
      Paul Jackson committed
      - s/freeliest/freelist/ spelling fix
      
      - Check for NULL *z zone seems useless - even if it could happen, so
        what?  Perhaps we should have a check later on if we are faced with an
        allocation request that is not allowed to fail - shouldn't that be a
        serious kernel error, passing an empty zonelist with a mandate to not
        fail?
      
      - Initializing 'z' to zonelist->zones can wait until after the first
        get_page_from_freelist() fails; we only use 'z' in the wakeup_kswapd()
        loop, so let's initialize 'z' there, in a 'for' loop.  Seems clearer.
      
      - Remove superfluous braces around a break
      
      - Fix a couple errant spaces
      
      - Adjust indentation on the cpuset_zone_allowed() check, to match the
        lines just before it -- seems easier to read in this case.
      
      - Add another set of braces to the zone_watermark_ok logic
      
      From: Paul Jackson <pj@sgi.com>
      
        Back out one item from a previous "memory page_alloc minor cleanups" patch.
        Until and unless we are certain that no one can ever pass an empty zonelist
        to __alloc_pages(), this check for an empty zonelist (or some BUG
        equivalent) is essential.  The code in get_page_from_freelist() blows up if
        passed an empty zonelist.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0798e519
  2. 06 Dec, 2006 1 commit
    • M
      [PATCH] uclinux: fix mmap() of directory for nommu case · f81cff0d
      Mike Frysinger committed
      I was playing with blackfin when I hit a neat bug ... doing an open() on a
      directory and then passing that fd to mmap() would cause the kernel to hang.
      
      After poking into the code a bit more, I found that
      mm/nommu.c:validate_mmap_request() checks the length and if it is 0, just
      returns the address ... this is in stark contrast to mmu's
      mm/mmap.c:do_mmap_pgoff() where it returns -EINVAL for 0 length requests ...
      I then noticed that some other parts of the logic are out of date between the
      two funcs, so perhaps that's the easy fix?
      Signed-off-by: Greg Ungerer <gerg@uclinux.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f81cff0d
  3. 02 Dec, 2006 1 commit
  4. 24 Nov, 2006 1 commit
    • M
      [PATCH] x86_64: fix bad page state in process 'swapper' · 1abbfb41
      Mel Gorman committed
      find_min_pfn_for_node() and find_min_pfn_with_active_regions() both
      depend on a sorted early_node_map[].  However, sort_node_map() is being
      called after find_min_pfn_with_active_regions() in
      free_area_init_nodes().
      
      In most cases, this is ok, but on at least one x86_64, the SRAT table
      caused the E820 ranges to be registered out of order.  This gave the
      wrong values for the min PFN range resulting in some pages not being
      initialised.
      
      This patch sorts the early_node_map in find_min_pfn_for_node().  It has
      been boot tested on x86, x86_64, ppc64 and ia64.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1abbfb41
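The ordering dependency in the early_node_map commit above can be sketched in user space. This is a simplified illustration under stated assumptions: `toy_node_range` and the `toy_` functions are hypothetical stand-ins, and the standard library's `qsort()` takes the place of the kernel's sort_node_map(). The point is only that "first matching entry is the minimum" holds once the map is sorted inside the lookup itself.

```c
#include <stdlib.h>

/* Hypothetical stand-in for one early_node_map[] entry. */
struct toy_node_range {
    int nid;
    unsigned long start_pfn;
    unsigned long end_pfn;
};

/* Order ranges by ascending start PFN. */
static int toy_cmp_ranges(const void *a, const void *b)
{
    const struct toy_node_range *ra = a;
    const struct toy_node_range *rb = b;

    if (ra->start_pfn < rb->start_pfn)
        return -1;
    if (ra->start_pfn > rb->start_pfn)
        return 1;
    return 0;
}

/*
 * Sort first, then return the start PFN of the first range that belongs
 * to the node.  Without the sort, ranges registered out of order would
 * yield a start PFN that is not the node's minimum.
 * Returns 0 if the node has no ranges.
 */
static unsigned long toy_find_min_pfn_for_node(struct toy_node_range *map,
                                               size_t count, int nid)
{
    size_t i;

    qsort(map, count, sizeof(*map), toy_cmp_ranges);
    for (i = 0; i < count; i++)
        if (map[i].nid == nid)
            return map[i].start_pfn;
    return 0;
}
```

Sorting inside the lookup means callers cannot observe an unsorted map, whatever order the ranges were registered in.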
  5. 22 Nov, 2006 3 commits
    • D
      WorkStruct: make allyesconfig · c4028958
      David Howells committed
      Fix up for make allyesconfig.
      Signed-Off-By: David Howells <dhowells@redhat.com>
      c4028958
    • D
      WorkStruct: Pass the work_struct pointer instead of context data · 65f27f38
      David Howells committed
      Pass the work_struct pointer to the work function rather than context data.
      The work function can use container_of() to work out the data.
      
      For the cases where the container of the work_struct may go away the moment the
      pending bit is cleared, it is made possible to defer the release of the
      structure by deferring the clearing of the pending bit.
      
      To make this work, an extra flag is introduced into the management side of the
      work_struct.  This governs auto-release of the structure upon execution.
      
      Ordinarily, the work queue executor would release the work_struct for further
      scheduling or deallocation by clearing the pending bit prior to jumping to the
      work function.  This means that, unless the driver makes some guarantee itself
      that the work_struct won't go away, the work function may not access anything
      else in the work_struct or its container lest they be deallocated.  This is a
      problem if the auxiliary data is taken away (as done by the last patch).
      
      However, if the pending bit is *not* cleared before jumping to the work
      function, then the work function *may* access the work_struct and its container
      with no problems.  But then the work function must itself release the
      work_struct by calling work_release().
      
      In most cases, automatic release is fine, so this is the default.  Special
      initiators exist for the non-auto-release case (ending in _NAR).
      Signed-Off-By: David Howells <dhowells@redhat.com>
      65f27f38
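The container_of() pattern mentioned in the WorkStruct commit above can be sketched like this. It is a user-space illustration only: `toy_work`, `toy_device`, and `toy_run_work()` are hypothetical names, and `toy_container_of` is a simplified version built on the standard `offsetof`, without the kernel macro's type checking. The pending bit and release handling described in the commit are left out.

```c
#include <stddef.h>

/* Simplified container_of: step back from a member to its enclosing struct. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical work item: the handler receives the work item itself. */
struct toy_work {
    void (*func)(struct toy_work *work);
};

/* Hypothetical driver state that embeds its work item. */
struct toy_device {
    int id;
    int events_handled;
    struct toy_work work;
};

/* The handler recovers its device from the embedded work item. */
static void toy_device_work(struct toy_work *work)
{
    struct toy_device *dev = toy_container_of(work, struct toy_device, work);

    dev->events_handled++;
}

/* Stand-in for the executor: pass the work pointer, not separate context. */
static void toy_run_work(struct toy_work *work)
{
    work->func(work);
}
```

Because the handler derives its context from the work pointer, the work item no longer needs a separate data field, which is what lets the structure shrink.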
    • D
      WorkStruct: Separate delayable and non-delayable events. · 52bad64d
      David Howells committed
      Separate delayable work items from non-delayable work items by splitting them
      into a separate structure (delayed_work), which incorporates a work_struct and
      the timer_list removed from work_struct.
      
      The work_struct struct is huge, and this limits its usefulness.  On a 64-bit
      architecture it's nearly 100 bytes in size.  This reduces that by half for the
      non-delayable type of event.
      Signed-Off-By: David Howells <dhowells@redhat.com>
      52bad64d
  6. 17 Nov, 2006 1 commit
  7. 15 Nov, 2006 3 commits
    • H
      [PATCH] hugetlb: fix error return for brk() entering a hugepage region · cd2579d7
      Hugh Dickins committed
      Commit cb07c9a1 causes the wrong return
      value.  is_hugepage_only_range() is a boolean, so we should return
      -EINVAL rather than 1.
      
      Also - we can use "mm" instead of looking up "current->mm" again.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      cd2579d7
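The return-value fix in the brk() commit above comes down to not passing a boolean result back as an error code. A minimal sketch, assuming a made-up reserved region and hypothetical `toy_` names (the real is_hugepage_only_range() is architecture specific and takes the mm as well):

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical hugepage-only window, for illustration only. */
#define TOY_HUGE_START 0x1000000UL
#define TOY_HUGE_END   0x2000000UL

/* Boolean predicate: does [addr, addr + len) overlap the reserved window? */
static bool toy_is_hugepage_only_range(unsigned long addr, unsigned long len)
{
    return addr < TOY_HUGE_END && addr + len > TOY_HUGE_START;
}

/*
 * Translate the boolean into a proper errno value.  Returning the
 * predicate's result directly would hand the caller 1, which is not
 * an error code.
 */
static int toy_check_brk_range(unsigned long addr, unsigned long len)
{
    if (toy_is_hugepage_only_range(addr, len))
        return -EINVAL;
    return 0;
}
```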
    • D
      [PATCH] hugetlb: check for brk() entering a hugepage region · cb07c9a1
      David Gibson committed
      Unlike mmap(), the codepath for brk() creates a vma without first checking
      that it doesn't touch a region exclusively reserved for hugepages.  On
      powerpc, this can allow it to create a normal page vma in a hugepage
      region, causing oopses and other badness.
      
      Add a test to prevent this.  With this patch, brk() will simply fail if it
      attempts to move the break into a hugepage reserved region.
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      cb07c9a1
    • H
      [PATCH] hugetlb: prepare_hugepage_range check offset too · 68589bc3
      Hugh Dickins committed
      (David:)
      
      If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
      because the given file offset is not hugepage aligned - then do_mmap_pgoff
      will go to the unmap_and_free_vma backout path.
      
      But at this stage the vma hasn't been marked as hugepage, and the backout path
      will call unmap_region() on it.  That will eventually call down to the
      non-hugepage version of unmap_page_range().  On ppc64, at least, that will
      cause serious problems if there are any existing hugepage pagetable entries in
      the vicinity - for example if there are any other hugepage mappings under the
      same PUD.  unmap_page_range() will trigger a bad_pud() on the hugepage pud
      entries.  I suspect this will also cause bad problems on ia64, though I don't
      have a machine to test it on.
      
      (Hugh:)
      
      prepare_hugepage_range() should check file offset alignment when it checks
      virtual address and length, to stop MAP_FIXED with a bad huge offset from
      unmapping before it fails further down.  PowerPC should apply the same
      prepare_hugepage_range alignment checks as ia64 and all the others do.
      
      Then none of the alignment checks in hugetlbfs_file_mmap are required (nor
      is the check for too small a mapping); but even so, move up setting of
      VM_HUGETLB and add a comment to warn of what David Gibson discovered - if
      hugetlbfs_file_mmap fails before setting it, do_mmap_pgoff's unmap_region
      when unwinding from error will go the non-huge way, which may cause bad
      behaviour on architectures (powerpc and ia64) which segregate their huge
      mappings into a separate region of the address space.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Acked-by: David Gibson <david@gibson.dropbear.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      68589bc3
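The alignment checks described in the prepare_hugepage_range commit above can be sketched as a single up-front test. This is an illustration under stated assumptions: a 2 MiB huge page size, hypothetical `toy_` names, and a signature that is not the kernel's. It shows only that the file offset is validated together with the address and length, so that a bad request is rejected before anything is unmapped.

```c
#include <errno.h>

/* Assumed huge page size of 2 MiB; the real value is architecture specific. */
#define TOY_HPAGE_SHIFT 21
#define TOY_HPAGE_SIZE  (1UL << TOY_HPAGE_SHIFT)

/*
 * Reject any request whose address, length, or file offset is not a
 * multiple of the huge page size.  Returns 0 on success, -EINVAL otherwise.
 */
static int toy_prepare_hugepage_range(unsigned long addr, unsigned long len,
                                      unsigned long long file_offset)
{
    if (file_offset & (TOY_HPAGE_SIZE - 1))
        return -EINVAL;
    if (len & (TOY_HPAGE_SIZE - 1))
        return -EINVAL;
    if (addr & (TOY_HPAGE_SIZE - 1))
        return -EINVAL;
    return 0;
}
```

With these checks done first, a MAP_FIXED request carrying a misaligned offset fails before the existing mapping is touched.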