1. 08 12月, 2006 32 次提交
    • C
      [PATCH] add numa node information to struct device · 87348136
      Christoph Hellwig 提交于
      For node-aware skb allocations we need information about the node in struct
      net_device or struct device.  Davem suggested to put it into struct device
      which this patch does.
      
      In particular:
      
       - struct device gets a new int numa_node member if CONFIG_NUMA is set
       - there are two new helpers, dev_to_node and set_dev_node to
         transparently deal with the non-numa case
       - for pci devices the node-info is set to the value we get from
         pcibus_to_node.
      
      Note that for some architectures pcibus_to_node doesn't work yet at the time
      we call it currently.  This is harmless and will just mean skb allocations
      aren't node-local on this architectures until the implementation of
      pcibus_to_node on these architectures have been updated (There are patches for
      x86 and x86_64 floating around)
      
      [akpm@osdl.org: cleanup]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      87348136
    • C
      [PATCH] leak tracking for kmalloc_node · 8b98c169
      Christoph Hellwig 提交于
      We have variants of kmalloc and kmem_cache_alloc that leave leak tracking to
      the caller.  This is used for subsystem-specific allocators like skb_alloc.
      
      To make skb_alloc node-aware we need similar routines for the node-aware slab
      allocator, which this patch adds.
      
      Note that the code is rather ugly, but it mirrors the non-node-aware code 1:1:
      
      [akpm@osdl.org: add module export]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8b98c169
    • S
      [PATCH] Always print out the header line in /proc/swaps · 881e4aab
      Suleiman Souhlal 提交于
      It would be possible for /proc/swaps to not always print out the header:
      
      swapon /dev/hdc2
      swapon /dev/hde2
      swapoff /dev/hdc2
      
      At this point /proc/swaps would not have a header.
      Signed-off-by: NSuleiman Souhlal <suleiman@google.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      881e4aab
    • K
      [PATCH] OOM can panic due to processes stuck in __alloc_pages() · b43a57bb
      Kirill Korotaev 提交于
      OOM can panic due to the processes stuck in __alloc_pages() doing infinite
      rebalance loop while no memory can be reclaimed.  OOM killer tries to kill
      some processes, but unfortunetaly, rebalance label was moved by someone
      below the TIF_MEMDIE check, so buddy allocator doesn't see that process is
      OOM-killed and it can simply fail the allocation :/
      
      Observed in reality on RHEL4(2.6.9)+OpenVZ kernel when a user doing some
      memory allocation tricks triggered OOM panic.
      Signed-off-by: NDenis Lunev <den@sw.ru>
      Signed-off-by: NKirill Korotaev <dev@openvz.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b43a57bb
    • R
      [PATCH] mlock cleanup · a3eea484
      Rik Bobbaers 提交于
      mm is defined as vma->vm_mm, so use that.
      Acked-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      a3eea484
    • G
      [PATCH] Allow user processes to raise their oom_adj value · 8fb4fc68
      Guillem Jover 提交于
      Currently a user process cannot rise its own oom_adj value (i.e.
      unprotecting itself from the OOM killer).  As this value is stored in the
      task structure it gets inherited and the unprivileged childs will be unable
      to rise it.
      
      The EPERM will be handled by the generic proc fs layer, as only processes
      with the proper caps or the owner of the process will be able to write to
      the file.  So we allow only the processes with CAP_SYS_RESOURCE to lower
      the value, otherwise it will get an EACCES which seems more appropriate
      than EPERM.
      Signed-off-by: NGuillem Jover <guillem.jover@nokia.com>
      Acked-by: NAndrea Arcangeli <andrea@novell.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8fb4fc68
    • J
      [PATCH] Fix kunmap_atomic's use of kpte_clear_flush() · 3b17979b
      Jeremy Fitzhardinge 提交于
      kunmap_atomic() will call kpte_clear_flush with vaddr/ptep arguments which
      don't correspond if the vaddr is just a normal lowmem address (ie, not in
      the KMAP area).  This patch makes sure that the pte is only cleared if kmap
      area was actually used for the mapping.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Cc: Zachary Amsden <zach@vmware.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3b17979b
    • P
      [PATCH] mm: k{,um}map_atomic() vs in_atomic() · ad76fb6b
      Peter Zijlstra 提交于
      Make kmap_atomic/kunmap_atomic denote a pagefault disabled scope.  All non
      trivial implementations already do this anyway.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ad76fb6b
    • P
      [PATCH] mm: pagefault_{disable,enable}() · a866374a
      Peter Zijlstra 提交于
      Introduce pagefault_{disable,enable}() and use these where previously we did
      manual preempt increments/decrements to make the pagefault handler do the
      atomic thing.
      
      Currently they still rely on the increased preempt count, but do not rely on
      the disabled preemption, this might go away in the future.
      
      (NOTE: the extra barrier() in pagefault_disable might fix some holes on
             machines which have too many registers for their own good)
      
      [heiko.carstens@de.ibm.com: s390 fix]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NNick Piggin <npiggin@suse.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      a866374a
    • P
      [PATCH] mm: arch do_page_fault() vs in_atomic() · 6edaf68a
      Peter Zijlstra 提交于
      In light of the recent pagefault and filemap_copy_from_user work I've gone
      through all the arch pagefault handlers to make sure the inc_preempt_count()
      'feature' works as expected.
      
      Several sections of code (including the new filemap_copy_from_user) rely on
      the fact that faults do not take locks under increased preempt count.
      
      arch/x86_64 - good
      arch/powerpc - good
      arch/cris - fixed
      arch/i386 - good
      arch/parisc - fixed
      arch/sh - good
      arch/sparc - good
      arch/s390 - good
      arch/m68k - fixed
      arch/ppc - good
      arch/alpha - fixed
      arch/mips - good
      arch/sparc64 - good
      arch/ia64 - good
      arch/arm - fixed
      arch/um - good
      arch/avr32 - good
      arch/h8300 - NA
      arch/m32r - good
      arch/v850 - good
      arch/frv - fixed
      arch/m68knommu - NA
      arch/arm26 - fixed
      arch/sh64 - fixed
      arch/xtensa - good
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      6edaf68a
    • P
      [PATCH] mm: add noaliencache boot option to disable numa alien caches · 3395ee05
      Paul Menage 提交于
      When using numa=fake on non-NUMA hardware there is no benefit to having the
      alien caches, and they consume much memory.
      
      Add a kernel boot option to disable them.
      
      Christoph sayeth "This is good to have even on large NUMA.  The problem is
      that the alien caches grow by the square of the size of the system in terms of
      nodes."
      
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3395ee05
    • R
      [PATCH] mm: slab: eliminate lock_cpu_hotplug from slab · 8f5be20b
      Ravikiran G Thirumalai 提交于
      Here's an attempt towards doing away with lock_cpu_hotplug in the slab
      subsystem.  This approach also fixes a bug which shows up when cpus are
      being offlined/onlined and slab caches are being tuned simultaneously.
      
      http://marc.theaimsgroup.com/?l=linux-kernel&m=116098888100481&w=2
      
      The patch has been stress tested overnight on a 2 socket 4 core AMD box with
      repeated cpu online and offline, while dbench and kernbench process are
      running, and slab caches being tuned at the same time.
      There were no lockdep warnings either.  (This test on 2,6.18 as 2.6.19-rc
      crashes at __drain_pages
      http://marc.theaimsgroup.com/?l=linux-kernel&m=116172164217678&w=2 )
      
      The approach here is to hold cache_chain_mutex from CPU_UP_PREPARE until
      CPU_ONLINE (similar in approach as worqueue_mutex) .  Slab code sensitive
      to cpu_online_map (kmem_cache_create, kmem_cache_destroy, slabinfo_write,
      __cache_shrink) is already serialized with cache_chain_mutex.  (This patch
      lengthens cache_chain_mutex hold time at kmem_cache_destroy to cover this).
       This patch also takes the cache_chain_sem at kmem_cache_shrink to protect
      sanity of cpu_online_map at __cache_shrink, as viewed by slab.
      (kmem_cache_shrink->__cache_shrink->drain_cpu_caches).  But, really,
      kmem_cache_shrink is used at just one place in the acpi subsystem!  Do we
      really need to keep kmem_cache_shrink at all?
      
      Another note.  Looks like a cpu hotplug event can send  CPU_UP_CANCELED to
      a registered subsystem even if the subsystem did not receive CPU_UP_PREPARE.
      This could be due to a subsystem registered for notification earlier than
      the current subsystem crapping out with NOTIFY_BAD. Badness can occur with
      in the CPU_UP_CANCELED code path at slab if this happens (The same would
      apply for workqueue.c as well).  To overcome this, we might have to use either
      a) a per subsystem flag and avoid handling of CPU_UP_CANCELED, or
      b) Use a special notifier events like LOCK_ACQUIRE/RELEASE as Gautham was
         using in his experiments, or
      c) Do not send CPU_UP_CANCELED to a subsystem which did not receive
         CPU_UP_PREPARE.
      
      I would prefer c).
      Signed-off-by: NRavikiran Thirumalai <kiran@scalex86.org>
      Signed-off-by: NShai Fultheim <shai@scalex86.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8f5be20b
    • K
      [PATCH] slab debug and ARCH_SLAB_MINALIGN don't get along · a44b56d3
      Kevin Hilman 提交于
      When CONFIG_SLAB_DEBUG is used in combination with ARCH_SLAB_MINALIGN, some
      debug flags should be disabled which depend on BYTES_PER_WORD alignment.
      
      The disabling of these debug flags is not properly handled when
      BYTES_PER_WORD < ARCH_SLAB_MEMALIGN < cache_line_size()
      
      This patch fixes that and also adds an alignment check to
      cache_alloc_debugcheck_after() when ARCH_SLAB_MINALIGN is used.
      Signed-off-by: NKevin Hilman <khilman@mvista.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      a44b56d3
    • C
      [PATCH] htlb forget rss with pt sharing · cace673d
      Chen, Kenneth W 提交于
      Imprecise RSS accounting is an irritating ill effect with pt sharing.  After
      consulted with several VM experts, I have tried various methods to solve that
      problem: (1) iterate through all mm_structs that share the PT and increment
      count; (2) keep RSS count in page table structure and then sum them up at
      reporting time.  None of the above methods yield any satisfactory
      implementation.
      
      Since process RSS accounting is pure information only, I propose we don't
      count them at all for hugetlb page.  rlimit has such field, though there is
      absolutely no enforcement on limiting that resource.  One other method is to
      account all RSS at hugetlb mmap time regardless they are faulted or not.  I
      opt for the simplicity of no accounting at all.
      
      Hugetlb page are special, they are reserved up front in global reservation
      pool and is not reclaimable.  From physical memory resource point of view, it
      is already consumed regardless whether there are users using them.
      
      If the concern is that RSS can be used to control resource allocation, we
      already can specify hugetlb fs size limit and sysadmin can enforce that at
      mount time.  Combined with the two points mentioned above, I fail to see if
      there is anything got affected because of this patch.
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Acked-by: NHugh Dickins <hugh@veritas.com>
      Cc: Dave McCracken <dmccr@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      cace673d
    • C
      [PATCH] shared page table for hugetlb page · 39dde65c
      Chen, Kenneth W 提交于
      Following up with the work on shared page table done by Dave McCracken.  This
      set of patch target shared page table for hugetlb memory only.
      
      The shared page table is particular useful in the situation of large number of
      independent processes sharing large shared memory segments.  In the normal
      page case, the amount of memory saved from process' page table is quite
      significant.  For hugetlb, the saving on page table memory is not the primary
      objective (as hugetlb itself already cuts down page table overhead
      significantly), instead, the purpose of using shared page table on hugetlb is
      to allow faster TLB refill and smaller cache pollution upon TLB miss.
      
      With PT sharing, pte entries are shared among hundreds of processes, the cache
      consumption used by all the page table is smaller and in return, application
      gets much higher cache hit ratio.  One other effect is that cache hit ratio
      with hardware page walker hitting on pte in cache will be higher and this
      helps to reduce tlb miss latency.  These two effects contribute to higher
      application performance.
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Acked-by: NHugh Dickins <hugh@veritas.com>
      Cc: Dave McCracken <dmccr@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      39dde65c
    • A
      [PATCH] balance_pdgat() cleanup · e1dbeda6
      Andrew Morton 提交于
      Despaghettify balance_pdgat() a bit.
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e1dbeda6
    • N
      [PATCH] mm: add arch_alloc_page · cc102509
      Nick Piggin 提交于
      Add an arch_alloc_page to match arch_free_page.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      cc102509
    • A
      [PATCH] new scheme to preempt swap token · 7602bdf2
      Ashwin Chaugule 提交于
      The new swap token patches replace the current token traversal algo.  The old
      algo had a crude timeout parameter that was used to handover the token from
      one task to another.  This algo, transfers the token to the tasks that are in
      need of the token.  The urgency for the token is based on the number of times
      a task is required to swap-in pages.  Accordingly, the priority of a task is
      incremented if it has been badly affected due to swap-outs.  To ensure that
      the token doesnt bounce around rapidly, the token holders are given a priority
      boost.  The priority of tasks is also decremented, if their rate of swap-in's
      keeps reducing.  This way, the condition to check whether to pre-empt the swap
      token, is a matter of comparing two task's priority fields.
      
      [akpm@osdl.org: cleanups]
      Signed-off-by: NAshwin Chaugule <ashwin.chaugule@celunite.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      7602bdf2
    • A
      [PATCH] grab swap token reordered · 098fe651
      Ashwin Chaugule 提交于
      Make sure the contention for the token happens _before_ any read-in and
      kicks the swap-token algo only when the VM is under pressure.
      Signed-off-by: NAshwin Chaugule <ashwin.chaugule@celunite.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      098fe651
    • N
      [PATCH] mm: incorrect VM_FAULT_OOM returns from drivers · cd54e7e5
      Nick Piggin 提交于
      Some drivers are returning OOM when it is not in response to a memory
      shortage.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: Jaroslav Kysela <perex@suse.cz>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      cd54e7e5
    • N
      [PATCH] oom: less memdie · f2a2a710
      Nick Piggin 提交于
      Don't cause all threads in all other thread groups to gain TIF_MEMDIE
      otherwise we'll get a thundering herd eating our memory reserve.  This may not
      be the optimal scheme, but it fits our policy of allowing just one TIF_MEMDIE
      in the system at once.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f2a2a710
    • N
      [PATCH] oom: cleanup messages · f3af38d3
      Nick Piggin 提交于
      Clean up the OOM killer messages to be more consistent.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f3af38d3
    • N
      [PATCH] oom: don't kill unkillable children or siblings · c33e0fca
      Nick Piggin 提交于
      Abort the kill if any of our threads have OOM_DISABLE set.  Having this
      test here also prevents any OOM_DISABLE child of the "selected" process
      from being killed.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c33e0fca
    • P
      [PATCH] memory page_alloc zonelist caching reorder structure · 7253f4ef
      Paul Jackson 提交于
      Rearrange the struct members in the 'struct zonelist_cache' structure, so
      as to put the readonly (once initialized) z_to_n[] array first, where it
      will come right after the zones[] array in struct zonelist.
      
      This pretty much eliminates the chance that the two frequently written
      elements of 'struct zonelist_cache', the fullzones bitmap and last_full_zap
      times, will end up on the same cache line as the performance sensitive,
      frequently read, never (after init) written zones[] array.
      
      Keeping frequently written data off frequently read cache lines is good for
      performance.
      
      Thanks to Rohit Seth for the suggestion.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Cc: Rohit Seth <rohitseth@google.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      7253f4ef
    • P
      [PATCH] memory page_alloc zonelist caching speedup · 9276b1bc
      Paul Jackson 提交于
      Optimize the critical zonelist scanning for free pages in the kernel memory
      allocator by caching the zones that were found to be full recently, and
      skipping them.
      
      Remembers the zones in a zonelist that were short of free memory in the
      last second.  And it stashes a zone-to-node table in the zonelist struct,
      to optimize that conversion (minimize its cache footprint.)
      
      Recent changes:
      
          This differs in a significant way from a similar patch that I
          posted a week ago.  Now, instead of having a nodemask_t of
          recently full nodes, I have a bitmask of recently full zones.
          This solves a problem that last weeks patch had, which on
          systems with multiple zones per node (such as DMA zone) would
          take seeing any of these zones full as meaning that all zones
          on that node were full.
      
          Also I changed names - from "zonelist faster" to "zonelist cache",
          as that seemed to better convey what we're doing here - caching
          some of the key zonelist state (for faster access.)
      
          See below for some performance benchmark results.  After all that
          discussion with David on why I didn't need them, I went and got
          some ;).  I wanted to verify that I had not hurt the normal case
          of memory allocation noticeably.  At least for my one little
          microbenchmark, I found (1) the normal case wasn't affected, and
          (2) workloads that forced scanning across multiple nodes for
          memory improved up to 10% fewer System CPU cycles and lower
          elapsed clock time ('sys' and 'real').  Good.  See details, below.
      
          I didn't have the logic in get_page_from_freelist() for various
          full nodes and zone reclaim failures correct.  That should be
          fixed up now - notice the new goto labels zonelist_scan,
          this_zone_full, and try_next_zone, in get_page_from_freelist().
      
      There are two reasons I persued this alternative, over some earlier
      proposals that would have focused on optimizing the fake numa
      emulation case by caching the last useful zone:
      
       1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
          have seen real customer loads where the cost to scan the zonelist
          was a problem, due to many nodes being full of memory before
          we got to a node we could use.  Or at least, I think we have.
          This was related to me by another engineer, based on experiences
          from some time past.  So this is not guaranteed.  Most likely, though.
      
          The following approach should help such real numa systems just as
          much as it helps fake numa systems, or any combination thereof.
      
       2) The effort to distinguish fake from real numa, using node_distance,
          so that we could cache a fake numa node and optimize choosing
          it over equivalent distance fake nodes, while continuing to
          properly scan all real nodes in distance order, was going to
          require a nasty blob of zonelist and node distance munging.
      
          The following approach has no new dependency on node distances or
          zone sorting.
      
      See comment in the patch below for a description of what it actually does.
      
      Technical details of note (or controversy):
      
       - See the use of "zlc_active" and "did_zlc_setup" below, to delay
         adding any work for this new mechanism until we've looked at the
         first zone in zonelist.  I figured the odds of the first zone
         having the memory we needed were high enough that we should just
         look there, first, then get fancy only if we need to keep looking.
      
       - Some odd hackery was needed to add items to struct zonelist, while
         not tripping up the custom zonelists built by the mm/mempolicy.c
         code for MPOL_BIND.  My usual wordy comments below explain this.
         Search for "MPOL_BIND".
      
       - Some per-node data in the struct zonelist is now modified frequently,
         with no locking.  Multiple CPU cores on a node could hit and mangle
         this data.  The theory is that this is just performance hint data,
         and the memory allocator will work just fine despite any such mangling.
         The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
         (a bitmask) and 'last_full_zap' (unsigned long jiffies).  It should
         all be self correcting after at most a one second delay.
      
       - This still does a linear scan of the same lengths as before.  All
         I've optimized is making the scan faster, not algorithmically
         shorter.  It is now able to scan a compact array of 'unsigned
         short' in the case of many full nodes, so one cache line should
         cover quite a few nodes, rather than each node hitting another
         one or two new and distinct cache lines.
      
       - If both Andi and Nick don't find this too complicated, I will be
         (pleasantly) flabbergasted.
      
       - I removed the comment claiming we only use one cachline's worth of
         zonelist.  We seem, at least in the fake numa case, to have put the
         lie to that claim.
      
       - I pay no attention to the various watermarks and such in this performance
         hint.  A node could be marked full for one watermark, and then skipped
         over when searching for a page using a different watermark.  I think
         that's actually quite ok, as it will tend to slightly increase the
         spreading of memory over other nodes, away from a memory stressed node.
      
      ===============
      
      Performance - some benchmark results and analysis:
      
      This benchmark runs a memory hog program that uses multiple
      threads to touch alot of memory as quickly as it can.
      
      Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
      the total 96 GBytes on the system, and using 1, 19, 37, or 55
      threads (on a 56 CPU system.)  System, user and real (elapsed)
      timings were recorded for each run, shown in units of seconds,
      in the table below.
      
      Two kernels were tested - 2.6.18-mm3 and the same kernel with
      this zonelist caching patch added.  The table also shows the
      percentage improvement the zonelist caching sys time is over
      (lower than) the stock *-mm kernel.
      
            number     2.6.18-mm3	   zonelist-cache    delta (< 0 good)	percent
       GBs    N  	------------	   --------------    ----------------	systime
       mem threads   sys user  real	  sys  user  real     sys  user  real	 better
        12	 1     153   24   177	  151	 24   176      -2     0    -1	   1%
        12	19	99   22     8	   99	 22	8	0     0     0	   0%
        12	37     111   25     6	  112	 25	6	1     0     0	  -0%
        12	55     115   25     5	  110	 23	5      -5    -2     0	   4%
        38	 1     502   74   576	  497	 73   570      -5    -1    -6	   0%
        38	19     426   78    48	  373	 76    39     -53    -2    -9	  12%
        38	37     544   83    36	  547	 82    36	3    -1     0	  -0%
        38	55     501   77    23	  511	 80    24      10     3     1	  -1%
        64	 1     917  125  1042	  890	124  1014     -27    -1   -28	   2%
        64	19    1118  138   119	  965	141   103    -153     3   -16	  13%
        64	37    1202  151    94	 1136	150    81     -66    -1   -13	   5%
        64	55    1118  141    61	 1072	140    58     -46    -1    -3	   4%
        90	 1    1342  177  1519	 1275	174  1450     -67    -3   -69	   4%
        90	19    2392  199   192	 2116	189   176    -276   -10   -16	  11%
        90	37    3313  238   175	 2972	225   145    -341   -13   -30	  10%
        90	55    1948  210   104	 1843	213   100    -105     3    -4	   5%
      
      Notes:
       1) This test ran a memory hog program that started a specified number N of
          threads, and had each thread allocate and touch 1/N'th of
          the total memory to be used in the test run in a single loop,
          writing a constant word to memory, one store every 4096 bytes.
          Watching this test during some earlier trial runs, I would see
          each of these threads sit down on one CPU and stay there, for
          the remainder of the pass, a different CPU for each thread.
      
       2) The 'real' column is not comparable to the 'sys' or 'user' columns.
          The 'real' column is seconds wall clock time elapsed, from beginning
          to end of that test pass.  The 'sys' and 'user' columns are total
          CPU seconds spent on that test pass.  For a 19 thread test run,
          for example, the sum of 'sys' and 'user' could be up to 19 times the
          number of 'real' elapsed wall clock seconds.
      
       3) Tests were run on a fresh, single-user boot, to minimize the amount
          of memory already in use at the start of the test, and to minimize
          the amount of background activity that might interfere.
      
       4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.
      
       5) Notice that the 'real' time gets large for the single thread runs, even
          though the measured 'sys' and 'user' times are modest.  I'm not sure what
          that means - probably something to do with it being slow for one thread to
          be accessing memory along ways away.  Perhaps the fake numa system, running
          ostensibly the same workload, would not show this substantial degradation
          of 'real' time for one thread on many nodes -- lets hope not.
      
       6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
          ran quite efficiently, as one might expect.  Each pair of threads needed
          to allocate and touch the memory on the node the two threads shared, a
          pleasantly parallizable workload.
      
       7) The intermediate thread count passes, when asking for alot of memory forcing
          them to go to a few neighboring nodes, improved the most with this zonelist
          caching patch.
      
      Conclusions:
       * This zonelist cache patch probably makes little difference one way or the
         other for most workloads on real numa hardware, if those workloads avoid
         heavy off node allocations.
       * For memory intensive workloads requiring substantial off-node allocations
         on real numa hardware, this patch improves both kernel and elapsed timings
         up to ten per-cent.
       * For fake numa systems, I'm optimistic, but will have to leave that up to
         Rohit Seth to actually test (once I get him a 2.6.18 backport.)
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Cc: Rohit Seth <rohitseth@google.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: David Rientjes <rientjes@cs.washington.edu>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9276b1bc
    • C
      [PATCH] Get rid of zone_table[] · 89689ae7
      Christoph Lameter 提交于
      The zone table is mostly not needed.  If we have a node in the page flags
      then we can get to the zone via NODE_DATA() which is much more likely to be
      already in the cpu cache.
      
      In case of SMP and UP NODE_DATA() is a constant pointer which allows us to
      access an exact replica of zonetable in the node_zones field.  In all of
      the above cases there will be no need at all for the zone table.
      
      The only remaining case is if in a NUMA system the node numbers do not fit
      into the page flags.  In that case we make sparse generate a table that
      maps sections to nodes and use that table to to figure out the node number.
       This table is sized to fit in a single cache line for the known 32 bit
      NUMA platform which makes it very likely that the information can be
      obtained without a cache miss.
      
      For sparsemem the zone table seems to be have been fairly large based on
      the maximum possible number of sections and the number of zones per node.
      There is some memory saving by removing zone_table.  The main benefit is to
      reduce the cache foootprint of the VM from the frequent lookups of zones.
      Plus it simplifies the page allocator.
      
      [akpm@osdl.org: build fix]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      89689ae7
    • C
      [PATCH] __unmap_hugepage_range(): add comment · c0a499c2
      Chen, Kenneth W 提交于
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c0a499c2
    • P
      [PATCH] memory page alloc minor cleanups · 0798e519
      Paul Jackson 提交于
      - s/freeliest/freelist/ spelling fix
      
      - Check for NULL *z zone seems useless - even if it could happen, so
        what?  Perhaps we should have a check later on if we are faced with an
        allocation request that is not allowed to fail - shouldn't that be a
        serious kernel error, passing an empty zonelist with a mandate to not
        fail?
      
      - Initializing 'z' to zonelist->zones can wait until after the first
        get_page_from_freelist() fails; we only use 'z' in the wakeup_kswapd()
        loop, so let's initialize 'z' there, in a 'for' loop.  Seems clearer.
      
      - Remove superfluous braces around a break
      
      - Fix a couple errant spaces
      
      - Adjust indentation on the cpuset_zone_allowed() check, to match the
        lines just before it -- seems easier to read in this case.
      
      - Add another set of braces to the zone_watermark_ok logic
      
      From: Paul Jackson <pj@sgi.com>
      
        Backout one item from a previous "memory page_alloc minor cleanups" patch.
         Until and unless we are certain that no one can ever pass an empty zonelist
        to __alloc_pages(), this check for an empty zonelist (or some BUG
        equivalent) is essential.  The code in get_page_from_freelist() blow ups if
        passed an empty zonelist.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      0798e519
    • A
      [PATCH] uml: workqueue build fix · a2ce7740
      Andrew Morton 提交于
        arch/um/drivers/chan_kern.c:643: error: conflicting types for 'chan_interrupt'
        arch/um/include/chan_kern.h:31: error: previous declaration of 'chan_interrupt'
      
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      a2ce7740
    • A
      [PATCH] skip data conversion in compat_sys_mount when data_page is NULL · 822191a2
      Andrey Mirkin 提交于
      OpenVZ Linux kernel team has found a problem with mounting in compat mode.
      
      Simple command "mount -t smbfs ..." on Fedora Core 5 distro in 32-bit mode
      leads to oops:
      
        Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: compat_sys_mount+0xd6/0x290
        Process mount (pid: 14656, veid=300, threadinfo ffff810034d30000, task ffff810034c86bc0)
        Call Trace: ia32_sysret+0x0/0xa
      
      The problem is that data_page pointer can be NULL, so we should skip data
      conversion in this case.
      Signed-off-by: NAndrey Mirkin <amirkin@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      822191a2
    • A
      [PATCH] drm-sis linkage fix · a1e85378
      Andrew Morton 提交于
      Fix http://bugzilla.kernel.org/show_bug.cgi?id=7606
      
      WARNING: "drm_sman_set_manager" [drivers/char/drm/sis.ko] undefined!
      
      Cc: <daniel-silveira@gee.inatel.br>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      a1e85378
    • A
      [PATCH] add bottom_half.h · 676dcb8b
      Andrew Morton 提交于
      With CONFIG_SMP=n:
      
        drivers/input/ff-memless.c:384: warning: implicit declaration of function 'local_bh_disable'
        drivers/input/ff-memless.c:393: warning: implicit declaration of function 'local_bh_enable'
      
      Really linux/spinlock.h should include linux/interrupt.h.  But interrupt.h
      includes sched.h which will need spinlock.h.
      
      So the patch breaks the _bh declarations out into a separate header and
      includes it in both interrupt.h and spinlock.h.
      
      Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
      Cc: Andi Kleen <ak@suse.de>
      Cc: <stable@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      676dcb8b
  2. 07 12月, 2006 8 次提交