1. 17 10月, 2007 40 次提交
    • M
      Choose pages from the per-cpu list based on migration type · 535131e6
      Mel Gorman 提交于
      The freelists for each migrate type can slowly become polluted due to the
      per-cpu list.  Consider what happens when the following happens
      
      1. A 2^(MAX_ORDER-1) list is reserved for __GFP_MOVABLE pages
      2. An order-0 page is allocated from the newly reserved block
      3. The page is freed and placed on the per-cpu list
      4. alloc_page() is called with GFP_KERNEL as the gfp_mask
      5. The per-cpu list is used to satisfy the allocation
      
      This results in a kernel page is in the middle of a migratable region. This
      patch prevents this leak occuring by storing the MIGRATE_ type of the page in
      page->private. On allocate, a page will only be returned of the desired type,
      else more pages will be allocated. This may temporarily allow a per-cpu list
      to go over the pcp->high limit but it'll be corrected on the next free. Care
      is taken to preserve the hotness of pages recently freed.
      
      The additional code is not measurably slower for the workloads we've tested.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      535131e6
    • M
      Split the free lists for movable and unmovable allocations · b2a0ac88
      Mel Gorman 提交于
      This patch adds the core of the fragmentation reduction strategy.  It works by
      grouping pages together based on their ability to migrate or be reclaimed.
      Basically, it works by breaking the list in zone->free_area list into
      MIGRATE_TYPES number of lists.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b2a0ac88
    • M
      Add a bitmap that is used to track flags affecting a block of pages · 835c134e
      Mel Gorman 提交于
      Here is the latest revision of the anti-fragmentation patches.  Of particular
      note in this version is special treatment of high-order atomic allocations.
      Care is taken to group them together and avoid grouping pages of other types
      near them.  Artifical tests imply that it works.  I'm trying to get the
      hardware together that would allow setting up of a "real" test.  If anyone
      already has a setup and test that can trigger the atomic-allocation problem,
      I'd appreciate a test of these patches and a report.  The second major change
      is that these patches will apply cleanly with patches that implement
      anti-fragmentation through zones.
      
      kernbench shows effectively no performance difference varying between -0.2%
      and +2% on a variety of test machines.  Success rates for huge page allocation
      are dramatically increased.  For example, on a ppc64 machine, the vanilla
      kernel was only able to allocate 1% of memory as a hugepage and this was due
      to a single hugepage reserved as min_free_kbytes.  With these patches applied,
      17% was allocatable as superpages.  With reclaim-related fixes from Andy
      Whitcroft, it was 40% and further reclaim-related improvements should increase
      this further.
      
      Changelog Since V28
      o Group high-order atomic allocations together
      o It is no longer required to set min_free_kbytes to 10% of memory. A value
        of 16384 in most cases will be sufficient
      o Now applied with zone-based anti-fragmentation
      o Fix incorrect VM_BUG_ON within buffered_rmqueue()
      o Reorder the stack so later patches do not back out work from earlier patches
      o Fix bug were journal pages were being treated as movable
      o Bias placement of non-movable pages to lower PFNs
      o More agressive clustering of reclaimable pages in reactions to workloads
        like updatedb that flood the size of inode caches
      
      Changelog Since V27
      
      o Renamed anti-fragmentation to Page Clustering. Anti-fragmentation was giving
        the mistaken impression that it was the 100% solution for high order
        allocations. Instead, it greatly increases the chances high-order
        allocations will succeed and lays the foundation for defragmentation and
        memory hot-remove to work properly
      o Redefine page groupings based on ability to migrate or reclaim instead of
        basing on reclaimability alone
      o Get rid of spurious inits
      o Per-cpu lists are no longer split up per-type. Instead the per-cpu list is
        searched for a page of the appropriate type
      o Added more explanation commentary
      o Fix up bug in pageblock code where bitmap was used before being initalised
      
      Changelog Since V26
      o Fix double init of lists in setup_pageset
      
      Changelog Since V25
      o Fix loop order of for_each_rclmtype_order so that order of loop matches args
      o gfpflags_to_rclmtype uses gfp_t instead of unsigned long
      o Rename get_pageblock_type() to get_page_rclmtype()
      o Fix alignment problem in move_freepages()
      o Add mechanism for assigning flags to blocks of pages instead of page->flags
      o On fallback, do not examine the preferred list of free pages a second time
      
      The purpose of these patches is to reduce external fragmentation by grouping
      pages of related types together.  When pages are migrated (or reclaimed under
      memory pressure), large contiguous pages will be freed.
      
      This patch works by categorising allocations by their ability to migrate;
      
      Movable - The pages may be moved with the page migration mechanism. These are
      	generally userspace pages.
      
      Reclaimable - These are allocations for some kernel caches that are
      	reclaimable or allocations that are known to be very short-lived.
      
      Unmovable - These are pages that are allocated by the kernel that
      	are not trivially reclaimed. For example, the memory allocated for a
      	loaded module would be in this category. By default, allocations are
      	considered to be of this type
      
      HighAtomic - These are high-order allocations belonging to callers that
      	cannot sleep or perform any IO. In practice, this is restricted to
      	jumbo frame allocation for network receive. It is assumed that the
      	allocations are short-lived
      
      Instead of having one MAX_ORDER-sized array of free lists in struct free_area,
      there is one for each type of reclaimability.  Once a 2^MAX_ORDER block of
      pages is split for a type of allocation, it is added to the free-lists for
      that type, in effect reserving it.  Hence, over time, pages of the different
      types can be clustered together.
      
      When the preferred freelists are expired, the largest possible block is taken
      from an alternative list.  Buddies that are split from that large block are
      placed on the preferred allocation-type freelists to mitigate fragmentation.
      
      This implementation gives best-effort for low fragmentation in all zones.
      Ideally, min_free_kbytes needs to be set to a value equal to 4 * (1 <<
      (MAX_ORDER-1)) pages in most cases.  This would be 16384 on x86 and x86_64 for
      example.
      
      Our tests show that about 60-70% of physical memory can be allocated on a
      desktop after a few days uptime.  In benchmarks and stress tests, we are
      finding that 80% of memory is available as contiguous blocks at the end of the
      test.  To compare, a standard kernel was getting < 1% of memory as large pages
      on a desktop and about 8-12% of memory as large pages at the end of stress
      tests.
      
      Following this email are 12 patches that implement thie page grouping feature.
       The first patch introduces a mechanism for storing flags related to a whole
      block of pages.  Then allocations are split between movable and all other
      allocations.  Following that are patches to deal with per-cpu pages and make
      the mechanism configurable.  The next patch moves free pages between lists
      when partially allocated blocks are used for pages of another migrate type.
      The second last patch groups reclaimable kernel allocations such as inode
      caches together.  The final patch related to groupings keeps high-order atomic
      allocations.
      
      The last two patches are more concerned with control of fragmentation.  The
      second last patch biases placement of non-movable allocations towards the
      start of memory.  This is with a view of supporting memory hot-remove of DIMMs
      with higher PFNs in the future.  The biasing could be enforced a lot heavier
      but it would cost.  The last patch agressively clusters reclaimable pages like
      inode caches together.
      
      The fragmentation reduction strategy needs to track if pages within a block
      can be moved or reclaimed so that pages are freed to the appropriate list.
      This patch adds a bitmap for flags affecting a whole a MAX_ORDER block of
      pages.
      
      In non-SPARSEMEM configurations, the bitmap is stored in the struct zone and
      allocated during initialisation.  SPARSEMEM statically allocates the bitmap in
      a struct mem_section so that bitmaps do not have to be resized during memory
      hotadd.  This wastes a small amount of memory per unused section (usually
      sizeof(unsigned long)) but the complexity of dynamically allocating the memory
      is quite high.
      
      Additional credit to Andy Whitcroft who reviewed up an earlier implementation
      of the mechanism an suggested how to make it a *lot* cleaner.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      835c134e
    • K
      flush icache before set_pte() on ia64: flush icache at set_pte · 954ffcb3
      KAMEZAWA Hiroyuki 提交于
      Current ia64 kernel flushes icache by lazy_mmu_prot_update() *after*
      set_pte().  This is too late.  This patch removes lazy_mmu_prot_update and
      add modfied set_pte() for flushing if necessary.
      
      This patch flush icache of a page when
      	new pte has exec bit.
      	&& new pte has present bit
      	&& new pte is user's page.
      	&& (old *ptep is not present
                  || new pte's pfn is not same to old *ptep's ptn)
      	&& new pte's page has no Pg_arch_1 bit.
      	   Pg_arch_1 is set when a page is cache consistent.
      
      I think this condition checks are much easier to understand than considering
      "Where sync_icache_dcache() should be inserted ?".
      
      pte_user() for ia64 was removed by http://lkml.org/lkml/2007/6/12/67 as
      clean-up. So, I added it again.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      954ffcb3
    • K
      flush cache before installing new page at migraton · 97ee0524
      KAMEZAWA Hiroyuki 提交于
      In migration, a new page should be cache flushed before set_pte() in some
      archs which have virtually-tagged cache.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97ee0524
    • A
      make swappiness safer to use · 4106f83a
      Andrea Arcangeli 提交于
      Swappiness isn't a safe sysctl.  Setting it to 0 for example can hang a
      system.  That's a corner case but even setting it to 10 or lower can waste
      enormous amounts of cpu without making much progress.  We've customers who
      wants to use swappiness but they can't because of the current
      implementation (if you change it so the system stops swapping it really
      stops swapping and nothing works sane anymore if you really had to swap
      something to make progress).
      
      This patch from Kurt Garloff makes swappiness safer to use (no more huge
      cpu usage or hangs with low swappiness values).
      
      I think the prev_priority can also be nuked since it wastes 4 bytes per
      zone (that would be an incremental patch but I wait the nr_scan_[in]active
      to be nuked first for similar reasons).  Clearly somebody at some point
      noticed how broken that thing was and they had to add min(priority,
      prev_priority) to give it some reliability, but they didn't go the last
      mile to nuke prev_priority too.  Calculating distress only in function of
      not-racy priority is correct and sure more than enough without having to
      add randomness into the equation.
      
      Patch is tested on older kernels but it compiles and it's quite simple
      so...
      
      Overall I'm not very satisified by the swappiness tweak, since it doesn't
      rally do anything with the dirty pagecache that may be inactive.  We need
      another kind of tweak that controls the inactive scan and tunes the
      can_writepage feature (not yet in mainline despite having submitted it a
      few times), not only the active one.  That new tweak will tell the kernel
      how hard to scan the inactive list for pure clean pagecache (something the
      mainline kernel isn't capable of yet).  We already have that feature
      working in all our enterprise kernels with the default reasonable tune, or
      they can't even run a readonly backup with tar without triggering huge
      write I/O.  I think it should be available also in mainline later.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NKurt Garloff <garloff@suse.de>
      Signed-off-by: NAndrea Arcangeli <andrea@suse.de>
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4106f83a
    • C
      Categorize GFP flags · 6cb06229
      Christoph Lameter 提交于
      The function of GFP_LEVEL_MASK seems to be unclear.  In order to clear up
      the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP
      flags:
      
      GFP_RECLAIM_MASK	Flags used to control page allocator reclaim behavior.
      
      GFP_CONSTRAINT_MASK	Flags used to limit where allocations can occur.
      
      GFP_SLAB_BUG_MASK	Flags that the slab allocator BUG()s on.
      
      These replace the uses of GFP_LEVEL mask in the slab allocators and in
      vmalloc.c.
      
      The use of the flags not included in these sets may occur as a result of a
      slab allocation standing in for a page allocation when constructing scatter
      gather lists.  Extraneous flags are cleared and not passed through to the
      page allocator.  __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will
      now be ignored if passed to a slab allocator.
      
      Change the allocation of allocator meta data in SLAB and vmalloc to not
      pass through flags listed in GFP_CONSTRAINT_MASK.  SLAB already removes the
      __GFP_THISNODE flag for such allocations.  Generalize that to also cover
      vmalloc.  The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL.
      
      The impact of allocator metadata placement on access latency to the
      cachelines of the object itself is minimal since metadata is only
      referenced on alloc and free.  The attempt is still made to place the meta
      data optimally but we consistently allow fallback both in SLAB and vmalloc
      (SLUB does not need to allocate metadata like that).
      
      Allocator metadata may serve multiple in kernel users and thus should not
      be subject to the limitations arising from a single allocation context.
      
      [akpm@linux-foundation.org: fix fallback_alloc()]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6cb06229
    • Y
      Fix panic of cpu online with memory less node · 58c0a4a7
      Yasunori Goto 提交于
      When a cpu is onlined on memory-less-node box, kernel panics due to touch
      NULL pointer of pgdat->kswapd.  Current kswapd runs only nodes which have
      memory.  So, calling of set_cpus_allowed() is not necessary for memory-less
      node.
      
      This is fix for it.
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58c0a4a7
    • L
      memoryless nodes: fixup uses of node_online_map in generic code · 37b07e41
      Lee Schermerhorn 提交于
      Here's a cut at fixing up uses of the online node map in generic code.
      
      mm/shmem.c:shmem_parse_mpol()
      
      	Ensure nodelist is subset of nodes with memory.
      	Use node_states[N_HIGH_MEMORY] as default for missing
      	nodelist for interleave policy.
      
      mm/shmem.c:shmem_fill_super()
      
      	initialize policy_nodes to node_states[N_HIGH_MEMORY]
      
      mm/page-writeback.c:highmem_dirtyable_memory()
      
      	sum over nodes with memory
      
      mm/page_alloc.c:zlc_setup()
      
      	allowednodes - use nodes with memory.
      
      mm/page_alloc.c:default_zonelist_order()
      
      	average over nodes with memory.
      
      mm/page_alloc.c:find_next_best_node()
      
      	skip nodes w/o memory.
      	N_HIGH_MEMORY state mask may not be initialized at this time,
      	unless we want to depend on early_calculate_totalpages() [see
      	below].  Will ZONE_MOVABLE ever be configurable?
      
      mm/page_alloc.c:find_zone_movable_pfns_for_nodes()
      
      	spread kernelcore over nodes with memory.
      
      	This required calling early_calculate_totalpages()
      	unconditionally, and populating N_HIGH_MEMORY node
      	state therein from nodes in the early_node_map[].
      	If we can depend on this, we can eliminate the
      	population of N_HIGH_MEMORY mask from __build_all_zonelists()
      	and use the N_HIGH_MEMORY mask in find_next_best_node().
      
      mm/mempolicy.c:mpol_check_policy()
      
      	Ensure nodes specified for policy are subset of
      	nodes with memory.
      
      [akpm@linux-foundation.org: fix warnings]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      37b07e41
    • C
      Memoryless nodes: Fix GFP_THISNODE behavior · 523b9458
      Christoph Lameter 提交于
      GFP_THISNODE checks that the zone selected is within the pgdat (node) of the
      first zone of a nodelist.  That only works if the node has memory.  A
      memoryless node will have its first node on another pgdat (node).
      
      GFP_THISNODE currently will return simply memory on the first pgdat.  Thus it
      is returning memory on other nodes.  GFP_THISNODE should fail if there is no
      local memory on a node.
      
      Add a new set of zonelists for each node that only contain the nodes that
      belong to the zones itself so that no fallback is possible.
      
      Then modify gfp_type to pickup the right zone based on the presence of
      __GFP_THISNODE.
      
      Drop the existing GFP_THISNODE checks from the page_allocators hot path.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NBob Picco <bob.picco@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      523b9458
    • C
      Memoryless nodes: drop one memoryless node boot warning · 633c0666
      Christoph Lameter 提交于
      get_pfn_range_for_nid() is called multiple times for each node at boot time.
      Each time, it will warn about nodes with no memory, resulting in boot messages
      like:
      
              Node 0 active with no memory
              Node 0 active with no memory
              Node 0 active with no memory
              Node 0 active with no memory
              Node 0 active with no memory
              Node 0 active with no memory
              On node 0 totalpages: 0
              Node 0 active with no memory
              Node 0 active with no memory
                DMA zone: 0 pages used for memmap
              Node 0 active with no memory
              Node 0 active with no memory
                Normal zone: 0 pages used for memmap
              Node 0 active with no memory
              Node 0 active with no memory
                Movable zone: 0 pages used for memmap
      
      and so on for each memoryless node.
      
      We already have the "On node N totalpages: ..." and other related messages, so
      drop the "Node N active with no memory" warnings.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Bob Picco <bob.picco@hp.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      633c0666
    • C
      Memoryless nodes: Add N_CPU node state · 37c0708d
      Christoph Lameter 提交于
      We need the check for a node with cpu in zone reclaim.  Zone reclaim will not
      allow remote zone reclaim if a node has a cpu.
      
      [Lee.Schermerhorn@hp.com: Move setup of N_CPU node state mask]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NBob Picco <bob.picco@hp.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      37c0708d
    • C
      Memoryless nodes: Update memory policy and page migration · 56bbd65d
      Christoph Lameter 提交于
      Online nodes now may have no memory.  The checks and initialization must
      therefore be changed to no longer use the online functions.
      
      This will correctly initialize the interleave on bootup to only target nodes
      with memory and will make sys_move_pages return an error when a page is to be
      moved to a memoryless node.  Similarly we will get an error if MPOL_BIND and
      MPOL_INTERLEAVE is used on a memoryless node.
      
      These are somewhat new semantics.  So far one could specify memoryless nodes
      and we would maybe do the right thing and just ignore the node (or we'd do
      something strange like with MPOL_INTERLEAVE).  If we want to allow the
      specification of memoryless nodes via memory policies then we need to keep
      checking for online nodes.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NBob Picco <bob.picco@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56bbd65d
    • C
      Memoryless nodes: SLUB support · f64dc58c
      Christoph Lameter 提交于
      Simply switch all for_each_online_node to for_each_node_state(NORMAL_MEMORY).
      That way SLUB only operates on nodes with regular memory.  Any allocation
      attempt on a memoryless node or a node with just highmem will fall whereupon
      SLUB will fetch memory from a nearby node (depending on how memory policies
      and cpuset describe fallback).
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NBob Picco <bob.picco@hp.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f64dc58c
    • C
      Memoryless nodes: Slab support · 04231b30
      Christoph Lameter 提交于
      Slab should not allocate control structures for nodes without memory.  This
      may seem to work right now but its unreliable since not all allocations can
      fall back due to the use of GFP_THISNODE.
      
      Switching a few for_each_online_node's to N_NORMAL_MEMORY will allow us to
      only allocate for nodes that have regular memory.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NBob Picco <bob.picco@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04231b30
    • C
      Memoryless nodes: No need for kswapd · 9422ffba
      Christoph Lameter 提交于
      A node without memory does not need a kswapd.  So use the memory map instead
      of the online map when starting kswapd.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NBob Picco <bob.picco@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9422ffba
    • C
      Memoryless nodes: OOM: use N_HIGH_MEMORY map instead of constructing one on the fly · ee31af5d
      Christoph Lameter 提交于
      constrained_alloc() builds its own memory map for nodes with memory.  We have
      that available in N_HIGH_MEMORY now.  So simplify the code.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NBob Picco <bob.picco@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee31af5d
    • C
      Memoryless nodes: Fix interleave behavior for memoryless nodes · 6eaf806a
      Christoph Lameter 提交于
      MPOL_INTERLEAVE currently simply loops over all nodes.  Allocations on
      memoryless nodes will be redirected to nodes with memory.  This results in an
      imbalance because the neighboring nodes to memoryless nodes will get
      significantly more interleave hits that the rest of the nodes on the system.
      
      We can avoid this imbalance by clearing the nodes in the interleave node set
      that have no memory.  If we use the node map of the memory nodes instead of
      the online nodes then we have only the nodes we want.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NBob Picco <bob.picco@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6eaf806a
    • C
      Memoryless nodes: introduce mask of nodes with memory · 7ea1530a
      Christoph Lameter 提交于
      It is necessary to know if nodes have memory since we have recently begun to
      add support for memoryless nodes.  For that purpose we introduce a two new
      node states: N_HIGH_MEMORY and N_NORMAL_MEMORY.
      
      A node has its bit in N_HIGH_MEMORY set if it has any memory regardless of the
      type of mmemory.  If a node has memory then it has at least one zone defined
      in its pgdat structure that is located in the pgdat itself.
      
      A node has its bit in N_NORMAL_MEMORY set if it has a lower zone than
      ZONE_HIGHMEM.  This means it is possible to allocate memory that is not
      subject to kmap.
      
      N_HIGH_MEMORY and N_NORMAL_MEMORY can then be used in various places to insure
      that we do the right thing when we encounter a memoryless node.
      
      [akpm@linux-foundation.org: build fix]
      [Lee.Schermerhorn@hp.com: update N_HIGH_MEMORY node state for memory hotadd]
      [y-goto@jp.fujitsu.com: Fix memory hotplug + sparsemem build]
      Signed-off-by: NLee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NBob Picco <bob.picco@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ea1530a
    • C
      Memoryless nodes: Generic management of nodemasks for various purposes · 13808910
      Christoph Lameter 提交于
      Why do we need to support memoryless nodes?
      
      KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
      
      > For fujitsu, problem is called "empty" node.
      >
      > When ACPI's SRAT table includes "possible nodes", ia64 bootstrap(acpi_numa_init)
      > creates nodes, which includes no memory, no cpu.
      >
      > I tried to remove empty-node in past, but that was denied.
      > It was because we can hot-add cpu to the empty node.
      > (node-hotplug triggered by cpu is not implemented now. and it will be ugly.)
      >
      >
      > For HP, (Lee can comment on this later), they have memory-less-node.
      > As far as I hear, HP's machine can have following configration.
      >
      > (example)
      > Node0: CPU0   memory AAA MB
      > Node1: CPU1   memory AAA MB
      > Node2: CPU2   memory AAA MB
      > Node3: CPU3   memory AAA MB
      > Node4: Memory XXX GB
      >
      > AAA is very small value (below 16MB)  and will be omitted by ia64 bootstrap.
      > After boot, only Node 4 has valid memory (but have no cpu.)
      >
      > Maybe this is memory-interleave by firmware config.
      
      Christoph Lameter <clameter@sgi.com> wrote:
      
      > Future SGI platforms (actually also current one can have but nothing like
      > that is deployed to my knowledge) have nodes with only cpus. Current SGI
      > platforms have nodes with just I/O that we so far cannot manage in the
      > core. So the arch code maps them to the nearest memory node.
      
      Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
      
      > For the HP platforms, we can configure each cell with from 0% to 100%
      > "cell local memory".  When we configure with <100% CLM, the "missing
      > percentages" are interleaved by hardware on a cache-line granularity to
      > improve bandwidth at the expense of latency for numa-challenged
      > applications [and OSes, but not our problem ;-)].  When we boot Linux on
      > such a config, all of the real nodes have no memory--it all resides in a
      > single interleaved pseudo-node.
      >
      > When we boot Linux on a 100% CLM configuration [== NUMA], we still have
      > the interleaved pseudo-node.  It contains a few hundred MB stolen from
      > the real nodes to contain the DMA zone.  [Interleaved memory resides at
      > phys addr 0].  The memoryless-nodes patches, along with the zoneorder
      > patches, support this config as well.
      >
      > Also, when we boot a NUMA config with the "mem=" command line,
      > specifying less memory than actually exists, Linux takes the excluded
      > memory "off the top" rather than distributing it across the nodes.  This
      > can result in memoryless nodes, as well.
      >
      
      This patch:
      
      Preparation for memoryless node patches.
      
      Provide a generic way to keep nodemasks describing various characteristics of
      NUMA nodes.
      
      Remove the node_online_map and the node_possible map and realize the same
      functionality using two nodes stats: N_POSSIBLE and N_ONLINE.
      
      [Lee.Schermerhorn@hp.com: Initialize N_*_MEMORY and N_CPU masks for non-NUMA config]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NBob Picco <bob.picco@hp.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13808910
    • N
      fs: remove some AOP_TRUNCATED_PAGE · 55144768
      Nick Piggin 提交于
      prepare/commit_write no longer returns AOP_TRUNCATED_PAGE since OCFS2 and
      GFS2 were converted to the new aops, so we can make some simplifications
      for that.
      
      [michal.k.k.piotrowski@gmail.com: fix warning]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: NMichal Piotrowski <michal.k.k.piotrowski@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      55144768
    • N
      fs: new cont helpers · 89e10787
      Nick Piggin 提交于
      Rework the generic block "cont" routines to handle the new aops.  Supporting
      cont_prepare_write would take quite a lot of code to support, so remove it
      instead (and we later convert all filesystems to use it).
      
      write_begin gets passed AOP_FLAG_CONT_EXPAND when called from
      generic_cont_expand, so filesystems can avoid the old hacks they used.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89e10787
    • N
      implement simple fs aops · 800d15a5
      Nick Piggin 提交于
      Implement new aops for some of the simpler filesystems.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      800d15a5
    • N
      mm: restore KERNEL_DS optimisations · 674b892e
      Nick Piggin 提交于
      Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
      path.
      
      This may be a pretty questionable gain in most cases, especially after the
      legacy 2copy write path is removed, but it doesn't cost much.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      674b892e
    • N
      fs: introduce write_begin, write_end, and perform_write aops · afddba49
      Nick Piggin 提交于
      These are intended to replace prepare_write and commit_write with more
      flexible alternatives that are also able to avoid the buffered write
      deadlock problems efficiently (which prepare_write is unable to do).
      
      [mark.fasheh@oracle.com: API design contributions, code review and fixes]
      [akpm@linux-foundation.org: various fixes]
      [dmonakhov@sw.ru: new aop block_write_begin fix]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NMark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: NDmitriy Monakhov <dmonakhov@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      afddba49
    • N
      mm: buffered write iterator · 2f718ffc
      Nick Piggin 提交于
      Add an iterator data structure to operate over an iovec.  Add usercopy
      operators needed by generic_file_buffered_write, and convert that function
      over.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2f718ffc
    • N
      mm: fix pagecache write deadlocks · 08291429
      Nick Piggin 提交于
      Modify the core write() code so that it won't take a pagefault while holding a
      lock on the pagecache page. There are a number of different deadlocks possible
      if we try to do such a thing:
      
      1.  generic_buffered_write
      2.   lock_page
      3.    prepare_write
      4.     unlock_page+vmtruncate
      5.     copy_from_user
      6.      mmap_sem(r)
      7.       handle_mm_fault
      8.        lock_page (filemap_nopage)
      9.    commit_write
      10.  unlock_page
      
      a. sys_munmap / sys_mlock / others
      b.  mmap_sem(w)
      c.   make_pages_present
      d.    get_user_pages
      e.     handle_mm_fault
      f.      lock_page (filemap_nopage)
      
      2,8	- recursive deadlock if page is same
      2,8;2,8	- ABBA deadlock is page is different
      2,6;b,f	- ABBA deadlock if page is same
      
      The solution is as follows:
      1.  If we find the destination page is uptodate, continue as normal, but use
          atomic usercopies which do not take pagefaults and do not zero the uncopied
          tail of the destination. The destination is already uptodate, so we can
          commit_write the full length even if there was a partial copy: it does not
          matter that the tail was not modified, because if it is dirtied and written
          back to disk it will not cause any problems (uptodate *means* that the
          destination page is as new or newer than the copy on disk).
      
      1a. The above requires that fault_in_pages_readable correctly returns access
          information, because atomic usercopies cannot distinguish between
          non-present pages in a readable mapping, from lack of a readable mapping.
      
      2.  If we find the destination page is non uptodate, unlock it (this could be
          made slightly more optimal), then allocate a temporary page to copy the
          source data into. Relock the destination page and continue with the copy.
          However, instead of a usercopy (which might take a fault), copy the data
          from the pinned temporary page via the kernel address space.
      
      (also, rename maxlen to seglen, because it was confusing)
      
      This increases the CPU/memory copy cost by almost 50% on the affected
      workloads. That will be solved by introducing a new set of pagecache write
      aops in a subsequent patch.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08291429
    • N
      mm: write iovec cleanup · 4a9e5ef1
      Nick Piggin 提交于
      Hide some of the open-coded nr_segs tests into the iovec helpers.  This is all
      to simplify generic_file_buffered_write, because that gets more complex in the
      next patch.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4a9e5ef1
    • N
      mm: buffered write cleanup · eb2be189
      Nick Piggin 提交于
      Quite a bit of code is used in maintaining these "cached pages" that are
      probably pretty unlikely to get used. It would require a narrow race where
      the page is inserted concurrently while this process is allocating a page
      in order to create the spare page. Then a multi-page write into an uncached
      part of the file, to make use of it.
      
      Next, the buffered write path (and others) uses its own LRU pagevec when it
      should be just using the per-CPU LRU pagevec (which will cut down on both data
      and code size cacheline footprint). Also, these private LRU pagevecs are
      emptied after just a very short time, in contrast with the per-CPU pagevecs
      that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
      to add the pages to pagecache for a bulk write (in 4K chunks).
      
      [this gets rid of some cond_resched() calls in readahead.c and mpage.c due
       to clashes in -mm. What put them there, and why? ]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eb2be189
    • N
      mm: trim more holes · 64649a58
      Nick Piggin 提交于
      If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
      we may have failed the write operation despite prepare_write having
      instantiated blocks past i_size.  Fix this, and consolidate the trimming into
      one place.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64649a58
    • N
      mm: debug write deadlocks · 5fe17237
      Nick Piggin 提交于
      Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the
      Makes the race much easier to hit.
      
      This is useful for demonstration and testing purposes, but is removed in a
      subsequent patch.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5fe17237
    • A
      mm: clean up buffered write code · ae37461c
      Andrew Morton 提交于
      Rename some variables and fix some types.
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae37461c
    • A
      Revert "[PATCH] generic_file_buffered_write(): deadlock on vectored write" · 6814d7a9
      Andrew Morton 提交于
      This reverts commit 6527c2bd, which
      fixed the following bug:
      
        When prefaulting in the pages in generic_file_buffered_write(), we only
        faulted in the pages for the firts segment of the iovec.  If the second of
        successive segment described a mmapping of the page into which we're
        write()ing, and that page is not up-to-date, the fault handler tries to lock
        the already-locked page (to bring it up to date) and deadlocks.
      
        An exploit for this bug is in writev-deadlock-demo.c, in
        http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.
      
        (These demos assume blocksize < PAGE_CACHE_SIZE).
      
      The problem with this fix is that it takes the kernel back to doing a single
      prepare_write()/commit_write() per iovec segment.  So in the worst case we'll
      run prepare_write+commit_write 1024 times where we previously would have run
      it once. The other problem with the fix is that it fix all the locking problems.
      
      <insert numbers obtained via ext3-tools's writev-speed.c here>
      
      And apparently this change killed NFS overwrite performance, because, I
      suppose, it talks to the server for each prepare_write+commit_write.
      
      So just back that patch out - we'll be fixing the deadlock by other means.
      
      Nick says: also it only ever actually papered over the bug, because after
      faulting in the pages, they might be unmapped or reclaimed.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6814d7a9
    • A
      Revert "[PATCH] generic_file_buffered_write(): handle zero-length iovec segments" · 4b49643f
      Andrew Morton 提交于
      This reverts commit 81b0c871, which was
      a bugfix against 6527c2bd ("[PATCH]
      generic_file_buffered_write(): deadlock on vectored write"), which we
      also revert.
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b49643f
    • N
      mm: revert KERNEL_DS buffered write optimisation · 41cb8ac0
      Nick Piggin 提交于
      Revert the patch from Neil Brown to optimise NFSD writev handling.
      
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41cb8ac0
    • H
      mm: use pagevec to rotate reclaimable page · 902aaed0
      Hisashi Hifumi 提交于
      While running some memory intensive load, system response deteriorated just
      after swap-out started.
      
      The cause of this problem is that when a PG_reclaim page is moved to the tail
      of the inactive LRU list in rotate_reclaimable_page(), lru_lock spin lock is
      acquired every page writeback .  This deteriorates system performance and
      makes interrupt hold off time longer when swap-out started.
      
      Following patch solves this problem.  I use pagevec in rotating reclaimable
      pages to mitigate LRU spin lock contention and reduce interrupt hold off time.
      
      I did a test that allocating and touching pages in multiple processes, and
      pinging to the test machine in flooding mode to measure response under memory
      intensive load.
      
      The test result is:
      
      	-2.6.23-rc5
      	--- testmachine ping statistics ---
      	3000 packets transmitted, 3000 received, 0% packet loss, time 53222ms
      	rtt min/avg/max/mdev = 0.074/0.652/172.228/7.176 ms, pipe 11, ipg/ewma
      17.746/0.092 ms
      
      	-2.6.23-rc5-patched
      	--- testmachine ping statistics ---
      	3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
      	rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma
      17.314/0.091 ms
      
      Max round-trip-time was improved.
      
      The test machine spec is that 4CPU(3.16GHz, Hyper-threading enabled)
      8GB memory , 8GB swap.
      
      I did ping test again to observe performance deterioration caused by taking
      a ref.
      
      	-2.6.23-rc6-with-modifiedpatch
      	--- testmachine ping statistics ---
      	3000 packets transmitted, 3000 received, 0% packet loss, time 53386ms
      	rtt min/avg/max/mdev = 0.074/0.110/4.716/0.147 ms, pipe 2, ipg/ewma 17.801/0.129 ms
      
      The result for my original patch is as follows.
      
      	-2.6.23-rc5-with-originalpatch
      	--- testmachine ping statistics ---
      	3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
      	rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms
      
      The influence to response was small.
      
      [akpm@linux-foundation.org: fix uninitalised var warning]
      [hugh@veritas.com: fix locking]
      [randy.dunlap@oracle.com: fix function declaration]
      [hugh@veritas.com: fix BUG at include/linux/mm.h:220!]
      [hugh@veritas.com: kill redundancy in rotate_reclaimable_page]
      [hugh@veritas.com: move_tail_pages into lru_add_drain]
      Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      902aaed0
    • L
      Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag · 754af6f5
      Lee Schermerhorn 提交于
      Allow an application to query the memories allowed by its context.
      
      Updated numa_memory_policy.txt to mention that applications can use this to
      obtain allowed memories for constructing valid policies.
      
      TODO:  update out-of-tree libnuma wrapper[s], or maybe add a new
      wrapper--e.g.,  numa_get_mems_allowed() ?
      
      Also, update numa syscall man pages.
      
      Tested with memtoy V>=0.13.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      754af6f5
    • R
      mm: prevent kswapd from freeing excessive amounts of lowmem · 32a4330d
      Rik van Riel 提交于
      The current VM can get itself into trouble fairly easily on systems with a
      small ZONE_HIGHMEM, which is common on i686 computers with 1GB of memory.
      
      On one side, page_alloc() will allocate down to zone->pages_low, while on
      the other side, kswapd() and balance_pgdat() will try to free memory from
      every zone, until every zone has more free pages than zone->pages_high.
      
      Highmem can be filled up to zone->pages_low with page tables, ramfs,
      vmalloc allocations and other unswappable things quite easily and without
      many bad side effects, since we still have a huge ZONE_NORMAL to do future
      allocations from.
      
      However, as long as the number of free pages in the highmem zone is below
      zone->pages_high, kswapd will continue swapping things out from
      ZONE_NORMAL, too!
      
      Sami Farin managed to get his system into a stage where kswapd had freed
      about 700MB of low memory and was still "going strong".
      
      The attached patch will make kswapd stop paging out data from zones when
      there is more than enough memory free.  We do go above zone->pages_high in
      order to keep pressure between zones equal in normal circumstances, but the
      patch should prevent the kind of excesses that made Sami's computer totally
      unusable.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      32a4330d
    • J
      mm: no need to cast vmalloc() return value in zone_wait_table_init() · 8691f3a7
      Jesper Juhl 提交于
      vmalloc() returns a void pointer, so there's no need to cast its
      return value in mm/page_alloc.c::zone_wait_table_init().
      Signed-off-by: NJesper Juhl <jesper.juhl@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8691f3a7
    • C
      Slab allocators: fail if ksize is called with a NULL parameter · ef8b4520
      Christoph Lameter 提交于
      A NULL pointer means that the object was not allocated.  One cannot
      determine the size of an object that has not been allocated.  Currently we
      return 0 but we really should BUG() on attempts to determine the size of
      something nonexistent.
      
      krealloc() interprets NULL to mean a zero sized object.  Handle that
      separately in krealloc().
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef8b4520