1. 29 Oct, 2006 1 commit
    • M
      [PATCH] vmscan: Fix temp_priority race · 3bb1a852
      Martin Bligh committed
      The temp_priority field in zone is racy, as we can walk through a reclaim
      path, and just before we copy it into prev_priority, it can be overwritten
      (say with DEF_PRIORITY) by another reclaimer.
      
      The same bug is contained in both try_to_free_pages and balance_pgdat, but
      it is fixed slightly differently.  In balance_pgdat, we keep a separate
      priority record per zone in a local array.  In try_to_free_pages there is
      no need to do this, as the priority level is the same for all zones that we
      reclaim from.
      
      Impact of this bug is that temp_priority is copied into prev_priority, and
      setting this artificially high causes reclaimers to set distress
      artificially low.  They then fail to reclaim mapped pages, when they are,
      in fact, under severe memory pressure (their priority may be as low as 0).
      This causes the OOM killer to fire incorrectly.
      
      From: Andrew Morton <akpm@osdl.org>
      
      __zone_reclaim() isn't modifying zone->prev_priority.  But zone->prev_priority
      is used in the decision whether or not to bring mapped pages onto the inactive
      list.  Hence there's a risk here that __zone_reclaim() will fail because
      zone->prev_priority is large (ie: low urgency) and lots of mapped pages end up
      stuck on the active list.
      
      Fix that up by decreasing (ie making more urgent) zone->prev_priority as
      __zone_reclaim() scans the zone's pages.
      
      This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created.  It should be
      possible to remove that now, and to just start out at DEF_PRIORITY?
      
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3bb1a852
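The race described in this commit can be replayed in a single-threaded sketch. This is a simplified model, not the kernel code: `zone_model`, `finish_reclaim_racy`, `finish_reclaim_fixed` and `replay` are hypothetical names, and the interleaving of two reclaimers is written out by hand. It only illustrates why publishing a locally tracked priority is safer than copying a shared scratch field.

```c
#include <assert.h>

#define DEF_PRIORITY 12  /* lowest urgency; 0 is the most urgent */

/* Hypothetical, cut-down stand-in for the two zone fields involved. */
struct zone_model {
    int temp_priority;  /* shared scratch field, writable by any reclaimer */
    int prev_priority;  /* consulted later to judge memory pressure */
};

/* Racy variant: publish whatever currently sits in the shared field. */
static void finish_reclaim_racy(struct zone_model *z)
{
    z->prev_priority = z->temp_priority;
}

/* Fixed variant: publish the priority the caller tracked locally. */
static void finish_reclaim_fixed(struct zone_model *z, int local_priority)
{
    assert(local_priority >= 0 && local_priority <= DEF_PRIORITY);
    z->prev_priority = local_priority;
}

/* Replay the interleaving and return the resulting prev_priority. */
static int replay(int use_fix)
{
    struct zone_model z = { DEF_PRIORITY, DEF_PRIORITY };
    int priority = 0;                /* reclaimer A worked down to priority 0 */

    z.temp_priority = priority;      /* A records its progress */
    z.temp_priority = DEF_PRIORITY;  /* reclaimer B starts and overwrites it */

    if (use_fix)
        finish_reclaim_fixed(&z, priority);
    else
        finish_reclaim_racy(&z);
    return z.prev_priority;
}
```

In this model the racy variant leaves prev_priority at DEF_PRIORITY even though reclaimer A reached priority 0, which is the "artificially high" value the commit message warns about.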
  2. 22 Oct, 2006 1 commit
  3. 21 Oct, 2006 2 commits
    • A
      [PATCH] highest_possible_node_id() linkage fix · 6220ec78
      Andrew Morton committed
      Quoting Adrian:
      
      - net/sunrpc/svc.c uses highest_possible_node_id()
      
      - include/linux/nodemask.h says highest_possible_node_id() is
        out-of-line #if MAX_NUMNODES > 1
      
      - the out-of-line highest_possible_node_id() is in lib/cpumask.c
      
      - lib/Makefile: lib-$(CONFIG_SMP) += cpumask.o
        CONFIG_ARCH_DISCONTIGMEM_ENABLE=y, CONFIG_SMP=n, CONFIG_SUNRPC=y
      
      -> highest_possible_node_id() is used in net/sunrpc/svc.c
         CONFIG_NODES_SHIFT defined and > 0
      
      -> include/linux/numa.h: MAX_NUMNODES > 1
      
      -> compile error
      
      The bug is not present on architectures where ARCH_DISCONTIGMEM_ENABLE
      depends on NUMA (but m32r isn't the only affected architecture).
      
      So move the function into page_alloc.c
      
      Cc: Adrian Bunk <bunk@stusta.de>
      Cc: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6220ec78
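The linkage problem above comes from a declaration and its definition being guarded by unrelated conditions. A minimal single-file sketch of the safer arrangement follows, where both are keyed on the same node-count condition. `MODEL_MAX_NUMNODES` and `model_highest_possible_node_id` are hypothetical stand-ins, and the bitmask argument is an assumption made to keep the example self-contained.

```c
#include <assert.h>

#define MODEL_MAX_NUMNODES 4  /* hypothetical stand-in for MAX_NUMNODES */

/* "Header" part: out-of-line only when more than one node is possible. */
#if MODEL_MAX_NUMNODES > 1
unsigned int model_highest_possible_node_id(unsigned long possible_map);
#else
#define model_highest_possible_node_id(map) 0u
#endif

/*
 * "Source" part: guarded by the SAME condition as the declaration, so any
 * configuration that sees the prototype also gets the definition.
 */
#if MODEL_MAX_NUMNODES > 1
unsigned int model_highest_possible_node_id(unsigned long possible_map)
{
    unsigned int node, highest = 0;

    assert(possible_map != 0);  /* at least one node is always possible */
    for (node = 0; node < MODEL_MAX_NUMNODES; node++)
        if (possible_map & (1UL << node))
            highest = node;
    return highest;
}
#endif
```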
    • A
      [PATCH] separate bdi congestion functions from queue congestion functions · 3fcfab16
      Andrew Morton committed
      Separate out the concept of "queue congestion" from "backing-dev congestion".
      Congestion is a backing-dev concept, not a queue concept.
      
      The blk_* congestion functions are retained, as wrappers around the core
      backing-dev congestion functions.
      
      This proper layering is needed so that NFS can cleanly use the congestion
      functions, and so that CONFIG_BLOCK=n actually links.
      
      Cc: "Thomas Maier" <balagi@justmail.de>
      Cc: "Jens Axboe" <jens.axboe@oracle.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3fcfab16
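The layering described in this commit, with queue-level helpers kept as thin wrappers over backing-dev helpers, might look roughly like the sketch below. All `model_*` names and the one-bit-per-direction state encoding are assumptions for illustration, not the kernel's definitions.

```c
#include <assert.h>

enum { MODEL_READ = 0, MODEL_WRITE = 1 };

struct model_bdi   { unsigned long state; };    /* congestion bits live here */
struct model_queue { struct model_bdi bdi; };   /* a queue embeds a bdi */

static unsigned long model_congestion_bit(int rw)
{
    assert(rw == MODEL_READ || rw == MODEL_WRITE);
    return 1UL << rw;
}

/* Core layer: operates on the backing device only. */
static void model_set_bdi_congested(struct model_bdi *bdi, int rw)
{
    bdi->state |= model_congestion_bit(rw);
}

static void model_clear_bdi_congested(struct model_bdi *bdi, int rw)
{
    bdi->state &= ~model_congestion_bit(rw);
}

static int model_bdi_congested(const struct model_bdi *bdi, int rw)
{
    return (bdi->state & model_congestion_bit(rw)) != 0;
}

/* Block layer: retained as wrappers that forward to the core layer. */
static void model_blk_set_queue_congested(struct model_queue *q, int rw)
{
    model_set_bdi_congested(&q->bdi, rw);
}

static void model_blk_clear_queue_congested(struct model_queue *q, int rw)
{
    model_clear_bdi_congested(&q->bdi, rw);
}
```

A user without a request queue (NFS is the case named in the commit message) would call only the core-layer functions, which is why they must not depend on the block layer.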
  4. 12 Oct, 2006 3 commits
  5. 04 Oct, 2006 2 commits
  6. 27 Sep, 2006 9 commits
    • R
      [PATCH] mm/page_alloc: use NULL instead of 0 for ptr · 423b41d7
      Randy Dunlap committed
      Use NULL instead of 0 for pointer value, eliminate sparse warnings.
      Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      423b41d7
    • C
      [PATCH] Do not allocate pagesets for unpopulated zones. · 66a55030
      Christoph Lameter committed
      We do not need to allocate pagesets for unpopulated zones.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      66a55030
    • C
      [PATCH] Add node to zone for the NUMA case · d5f541ed
      Christoph Lameter committed
      Add the node in order to optimize zone_to_nid.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d5f541ed
    • C
      [PATCH] Add NUMA_BUILD definition in kernel.h to avoid #ifdef CONFIG_NUMA · 08e0f6a9
      Christoph Lameter committed
      The NUMA_BUILD constant is always available and is set to 1 on NUMA
      builds.  That way, checks that are valid only under CONFIG_NUMA can easily
      be done without #ifdef CONFIG_NUMA.
      
      For example:
      
      if (NUMA_BUILD && <numa_condition>) {
      ...
      }
      
      [akpm: not a thing we'd normally do, but CONFIG_NUMA is special: it is
       causing ifdef explosion in core kernel, so let's see if this is a comfortable
       way in which to control that]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      08e0f6a9
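The pattern from this commit can be tried outside the kernel with a small sketch. `MODEL_CONFIG_NUMA`, `MODEL_NUMA_BUILD` and `should_try_remote_node` are hypothetical names chosen for the example; the point is only that the constant is always defined, so the condition compiles in every configuration and the compiler can drop the dead branch when it is 0.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the config symbol: build with
 * -DMODEL_CONFIG_NUMA to model a NUMA kernel.
 */
#ifdef MODEL_CONFIG_NUMA
#define MODEL_NUMA_BUILD 1
#else
#define MODEL_NUMA_BUILD 0
#endif

/* Decide whether to look at a remote node; always 0 on non-NUMA builds. */
static int should_try_remote_node(int local_node_full)
{
    assert(local_node_full == 0 || local_node_full == 1);

    /* No #ifdef needed: the constant folds away when it is 0. */
    if (MODEL_NUMA_BUILD && local_node_full)
        return 1;
    return 0;
}
```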
    • J
      [PATCH] Condense output of show_free_areas() · c7241913
      Jes Sorensen committed
      On larger systems, the amount of output dumped on the console when you do
      SysRq-M is beyond insane.  This patch is trying to reduce it somewhat as
      even with the smaller NUMA systems that have hit the desktop this seems to
      be a fair thing to do.
      
      The philosophy I have taken is as follows:
       1) If a zone is empty, don't report it; we don't need yet another line
          telling us so. The information is still available, since one can look
          up how many zones were initialized in the first place.
       2) Put as much information on a line as possible; if it can be done
          in one line rather than two, then do it in one. I tried to format
          the temperature stuff for easy reading.
      
      Change show_free_areas() to not print lines for empty zones.  If no zone
      output is printed, the zone is empty.  This reduces the number of lines
      dumped to the console in sysrq on a large system by several thousand lines.
      
      Change the zone temperature printouts to use one line per CPU instead of
      two lines (one hot, one cold).  On a 1024 CPU, 1024 node system, this
      reduces the console output by over a million lines of output.
      
      While this is a bigger problem on large NUMA systems, it is also applicable
      to smaller desktop sized and mid range NUMA systems.
      
      Old format:
      
      Mem-info:
      Node 0 DMA per-cpu:
      cpu 0 hot: high 42, batch 7 used:24
      cpu 0 cold: high 14, batch 3 used:1
      cpu 1 hot: high 42, batch 7 used:34
      cpu 1 cold: high 14, batch 3 used:0
      cpu 2 hot: high 42, batch 7 used:0
      cpu 2 cold: high 14, batch 3 used:0
      cpu 3 hot: high 42, batch 7 used:0
      cpu 3 cold: high 14, batch 3 used:0
      cpu 4 hot: high 42, batch 7 used:0
      cpu 4 cold: high 14, batch 3 used:0
      cpu 5 hot: high 42, batch 7 used:0
      cpu 5 cold: high 14, batch 3 used:0
      cpu 6 hot: high 42, batch 7 used:0
      cpu 6 cold: high 14, batch 3 used:0
      cpu 7 hot: high 42, batch 7 used:0
      cpu 7 cold: high 14, batch 3 used:0
      Node 0 DMA32 per-cpu: empty
      Node 0 Normal per-cpu: empty
      Node 0 HighMem per-cpu: empty
      Node 1 DMA per-cpu:
      [snip]
      Free pages:     5410688kB (0kB HighMem)
      Active:9536 inactive:4261 dirty:6 writeback:0 unstable:0 free:338168 slab:1931 mapped:1900 pagetables:208
      Node 0 DMA free:1676304kB min:3264kB low:4080kB high:4896kB active:128048kB inactive:61568kB present:1970880kB pages_scanned:0 all_unreclaimable? no
      lowmem_reserve[]: 0 0 0 0
      Node 0 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
      lowmem_reserve[]: 0 0 0 0
      Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
      lowmem_reserve[]: 0 0 0 0
      Node 0 HighMem free:0kB min:512kB low:512kB high:512kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
      lowmem_reserve[]: 0 0 0 0
      Node 1 DMA free:1951728kB min:3280kB low:4096kB high:4912kB active:5632kB inactive:1504kB present:1982464kB pages_scanned:0 all_unreclaimable? no
      lowmem_reserve[]: 0 0 0 0
      ....
      
      New format:
      
      Mem-info:
      Node 0 DMA per-cpu:
      CPU    0: Hot: hi:   42, btch:   7 usd:  41   Cold: hi:   14, btch:   3 usd:   2
      CPU    1: Hot: hi:   42, btch:   7 usd:  40   Cold: hi:   14, btch:   3 usd:   1
      CPU    2: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
      CPU    3: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
      CPU    4: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
      CPU    5: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
      CPU    6: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
      CPU    7: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
      Node 1 DMA per-cpu:
      [snip]
      Free pages:     5411088kB (0kB HighMem)
      Active:9558 inactive:4233 dirty:6 writeback:0 unstable:0 free:338193 slab:1942 mapped:1918 pagetables:208
      Node 0 DMA free:1677648kB min:3264kB low:4080kB high:4896kB active:129296kB inactive:58864kB present:1970880kB pages_scanned:0 all_unreclaimable? no
      lowmem_reserve[]: 0 0 0 0
      Node 1 DMA free:1948448kB min:3280kB low:4096kB high:4912kB active:6864kB inactive:3536kB present:1982464kB pages_scanned:0 all_unreclaimable? no
      lowmem_reserve[]: 0 0 0 0
      Signed-off-by: Jes Sorensen <jes@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c7241913
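The one-line-per-CPU layout shown under "New format" can be reproduced with a short sketch. The format string below is inferred from the sample output in the commit message rather than copied from the kernel source, and `model_pcp`, `format_pcp_line` and `lines_saved` are hypothetical names. The line-count helper assumes one populated zone per node, which is what the sample output shows.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-CPU list summary: high mark, batch size, pages in use. */
struct model_pcp { int high, batch, count; };

/* Format the hot and cold lists of one CPU on a single line. */
static int format_pcp_line(char *buf, size_t len, int cpu,
                           const struct model_pcp *hot,
                           const struct model_pcp *cold)
{
    assert(buf && hot && cold);
    return snprintf(buf, len,
                    "CPU %4d: Hot: hi: %4d, btch: %3d usd: %3d   "
                    "Cold: hi: %4d, btch: %3d usd: %3d",
                    cpu, hot->high, hot->batch, hot->count,
                    cold->high, cold->batch, cold->count);
}

/* Going from two lines to one saves one line per CPU per populated zone. */
static unsigned long lines_saved(unsigned long cpus,
                                 unsigned long populated_zones)
{
    return cpus * populated_zones;
}
```

With 1024 CPUs and 1024 nodes of one populated zone each, `lines_saved(1024, 1024)` gives 1048576, consistent with the "over a million lines" figure quoted above.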
    • M
      [PATCH] Allow an arch to expand node boundaries · fb01439c
      Mel Gorman committed
      Arch-independent zone-sizing determines the size of a node
      (pgdat->node_spanned_pages) based on the physical memory that was
      registered by the architecture.  However, when
      CONFIG_MEMORY_HOTPLUG_RESERVE is set, the architecture expects that the
      spanned_pages will be much larger and that a mem_map will be allocated for
      use later on memory hot-add.
      
      This patch allows an architecture that sets CONFIG_MEMORY_HOTPLUG_RESERVE
      to call push_node_boundaries() which will set the node beginning and end to
      at *least* the requested boundary.
      
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Keith Mannthey" <kmannth@gmail.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fb01439c
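The "at least the requested boundary" behaviour amounts to a one-way expansion of the node's PFN span. A minimal sketch follows, assuming a hypothetical `model_node` structure and a simplified signature; the real push_node_boundaries() is not reproduced here.

```c
#include <assert.h>

/* Hypothetical node span: [start_pfn, end_pfn). */
struct model_node { unsigned long start_pfn, end_pfn; };

/* Expand the node span so it covers the request; never shrink it. */
static void model_push_node_boundaries(struct model_node *n,
                                       unsigned long start_pfn,
                                       unsigned long end_pfn)
{
    assert(n && start_pfn <= end_pfn);
    if (start_pfn < n->start_pfn)
        n->start_pfn = start_pfn;
    if (end_pfn > n->end_pfn)
        n->end_pfn = end_pfn;
}
```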
    • M
      [PATCH] Account for holes that are outside the range of physical memory · 9c7cd687
      Mel Gorman committed
      absent_pages_in_range() made the assumption that users of the API would not
      care about holes beyond the end of physical memory.  This was not the
      case.  This patch will account for ranges outside of physical memory as
      holes correctly.
      
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Keith Mannthey" <kmannth@gmail.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9c7cd687
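One way to count holes so that PFNs past the last registered range are included is to subtract the covered pages from the length of the queried range. This is a sketch of that idea only; `model_range` and `model_absent_pages` are hypothetical, and the ranges are assumed to be non-overlapping.

```c
#include <assert.h>

/* Hypothetical registered range of present page frames: [start, end). */
struct model_range { unsigned long start_pfn, end_pfn; };

/*
 * Count the PFNs in [start, end) that no registered range covers.
 * Anything past the last range, including PFNs beyond the end of
 * physical memory, is counted as a hole.
 */
static unsigned long model_absent_pages(const struct model_range *r, int n,
                                        unsigned long start,
                                        unsigned long end)
{
    unsigned long present = 0;
    int i;

    assert(start <= end);
    for (i = 0; i < n; i++) {
        unsigned long lo = r[i].start_pfn > start ? r[i].start_pfn : start;
        unsigned long hi = r[i].end_pfn < end ? r[i].end_pfn : end;

        if (lo < hi)
            present += hi - lo;
    }
    return (end - start) - present;
}
```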
    • M
      [PATCH] Account for memmap and optionally the kernel image as holes · 0e0b864e
      Mel Gorman committed
      The x86_64 code accounted for memmap and some portions of the DMA zone as
      holes.  This was because those areas would never be reclaimed and accounting
      for them as memory affects min watermarks.  This patch will account for the
      memmap as a memory hole.  Architectures may optionally use set_dma_reserve()
      if they wish to account for a portion of memory in ZONE_DMA as a hole.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Keith Mannthey" <kmannth@gmail.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0e0b864e
    • M
      [PATCH] Introduce mechanism for registering active regions of memory · c713216d
      Mel Gorman committed
      At a basic level, architectures define structures to record where active
      ranges of page frames are located.  Once located, the code to calculate zone
      sizes and holes in each architecture is very similar.  Some of this zone and
      hole sizing code is difficult to read for no good reason.  This set of patches
      eliminates the similar-looking architecture-specific code.
      
      The patches introduce a mechanism where architectures register where the
      active ranges of page frames are with add_active_range().  When all areas have
      been discovered, free_area_init_nodes() is called to initialise the pgdat and
      zones.  The zone sizes and holes are then calculated in an architecture
      independent manner.
      
      Patch 1 introduces the mechanism for registering and initialising PFN ranges
      Patch 2 changes ppc to use the mechanism - 139 arch-specific LOC removed
      Patch 3 changes x86 to use the mechanism - 136 arch-specific LOC removed
      Patch 4 changes x86_64 to use the mechanism - 74 arch-specific LOC removed
      Patch 5 changes ia64 to use the mechanism - 52 arch-specific LOC removed
      Patch 6 accounts for mem_map as a memory hole as the pages are not reclaimable.
      	It adjusts the watermarks slightly
      
      Tony Luck has successfully tested for ia64 on Itanium with tiger_defconfig,
      gensparse_defconfig and defconfig.  Bob Picco has also tested and debugged on
      IA64.  Jack Steiner successfully boot tested on a mammoth SGI IA64-based
      machine.  These were on patches against 2.6.17-rc1 and release 3 of these
      patches but there have been no ia64-changes since release 3.
      
      There are differences in the zone sizes for x86_64 as the arch-specific code
      for x86_64 accounts the kernel image and the starting mem_maps as memory holes
      but the architecture-independent code accounts the memory as present.
      
      The big benefit of this set of patches is a sizable reduction of
      architecture-specific code, some of which is very hairy.  There should be a
      greater reduction when other architectures use the same mechanisms for zone
      and hole sizing but I lack the hardware to test on.
      
      Additional credit:
      	Dave Hansen for the initial suggestion and comments on early patches
      	Andy Whitcroft for reviewing early versions and catching numerous
      		errors
      	Tony Luck for testing and debugging on IA64
      	Bob Picco for fixing bugs related to pfn registration, reviewing a
      		number of patch revisions, providing a number of suggestions
      		on future direction and testing heavily
      	Jack Steiner and Robin Holt for testing on IA64 and clarifying
      		issues related to memory holes
      	Yasunori for testing on IA64
      	Andi Kleen for reviewing and feeding back about x86_64
      	Christian Kujau for providing valuable information related to ACPI
      		problems on x86_64 and testing potential fixes
      
      This patch:
      
      Define the structure to represent an active range of page frames within a node
      in an architecture independent manner.  Architectures are expected to register
      active ranges of PFNs using add_active_range(nid, start_pfn, end_pfn) and call
      free_area_init_nodes() passing the PFNs of the end of each zone.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Bob Picco <bob.picco@hp.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Keith Mannthey" <kmannth@gmail.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c713216d
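The register-then-size flow described in this commit can be modelled in user space. In the sketch below, only the `(nid, start_pfn, end_pfn)` argument order of add_active_range() comes from the commit message; the `model_*` names, the fixed-size table, and the convention that zone 0 starts at PFN 0 are assumptions made for illustration.

```c
#include <assert.h>

#define MODEL_MAX_RANGES 8  /* hypothetical table size */

struct model_active_range {
    int nid;
    unsigned long start_pfn, end_pfn;  /* [start_pfn, end_pfn) */
};

static struct model_active_range model_ranges[MODEL_MAX_RANGES];
static int model_nr_ranges;

/* Architecture side: register one active range of page frames. */
static void model_add_active_range(int nid, unsigned long start_pfn,
                                   unsigned long end_pfn)
{
    assert(start_pfn < end_pfn);
    assert(model_nr_ranges < MODEL_MAX_RANGES);
    model_ranges[model_nr_ranges].nid = nid;
    model_ranges[model_nr_ranges].start_pfn = start_pfn;
    model_ranges[model_nr_ranges].end_pfn = end_pfn;
    model_nr_ranges++;
}

/*
 * Generic side: given the end PFN of each zone, compute how many pages
 * of node nid are present in each zone. Zone 0 is assumed to start at 0.
 */
static void model_zone_present_pages(int nid,
                                     const unsigned long *zone_end_pfn,
                                     int nr_zones, unsigned long *present)
{
    unsigned long zone_start = 0;
    int z, i;

    for (z = 0; z < nr_zones; z++) {
        unsigned long zone_end = zone_end_pfn[z];

        present[z] = 0;
        for (i = 0; i < model_nr_ranges; i++) {
            unsigned long lo, hi;

            if (model_ranges[i].nid != nid)
                continue;
            lo = model_ranges[i].start_pfn > zone_start ?
                 model_ranges[i].start_pfn : zone_start;
            hi = model_ranges[i].end_pfn < zone_end ?
                 model_ranges[i].end_pfn : zone_end;
            if (lo < hi)
                present[z] += hi - lo;
        }
        zone_start = zone_end;
    }
}
```

The architecture-specific part shrinks to the registration calls; everything after that depends only on the registered table and the zone end PFNs.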
  7. 26 Sep, 2006 20 commits
  8. 23 Sep, 2006 1 commit
    • D
      [XFRM]: Dynamic xfrm_state hash table sizing. · f034b5d4
      David S. Miller committed
      The grow algorithm is simple, we grow if:
      
      1) we see a hash chain collision at insert, and
      2) we haven't hit the hash size limit (currently 1*1024*1024 slots), and
      3) the number of xfrm_state objects is > the current hash mask
      
      All of this needs some tweaking.
      
      Remove __initdata from "hashdist" so we can use it safely at run time.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f034b5d4
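The three grow conditions listed in this commit combine into a single predicate. A minimal sketch, assuming a power-of-two table described by its mask; `model_should_grow` and `MODEL_HASH_LIMIT` are hypothetical names, and only the limit value (1*1024*1024 slots) is taken from the commit message.

```c
#include <assert.h>

#define MODEL_HASH_LIMIT (1u * 1024 * 1024)  /* slot limit from the message */

/*
 * Return 1 when the table should grow:
 *   1) the insert hit a hash chain collision, and
 *   2) the table is still below the size limit, and
 *   3) there are more objects than the current hash mask.
 */
static int model_should_grow(int chain_collision, unsigned int hmask,
                             unsigned int num_states)
{
    unsigned int slots = hmask + 1;

    assert((slots & hmask) == 0);  /* table size must be a power of two */
    return chain_collision &&
           slots < MODEL_HASH_LIMIT &&
           num_states > hmask;
}
```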
  9. 04 Jul, 2006 1 commit
    • C
      [PATCH] ZVC/zone_reclaim: Leave 1% of unmapped pagecache pages for file I/O · 9614634f
      Christoph Lameter committed
      It turns out that it is advantageous to leave a small portion of unmapped file
      backed pages if all of a zone's pages (or almost all pages) are allocated and
      so the page allocator has to go off-node.
      
      This allows recently used file I/O buffers to stay on the node and
      reduces the times that zone reclaim is invoked if file I/O occurs
      when we run out of memory in a zone.
      
      The problem is that zone reclaim runs too frequently when the page cache is
      used for file I/O (read write and therefore unmapped pages!) alone and we have
      almost all pages of the zone allocated.  Zone reclaim may remove 32 unmapped
      pages.  File I/O will use these pages for the next read/write requests and the
      unmapped pages increase.  After the zone has filled up again zone reclaim will
      remove it again after only 32 pages.  This cycle is too inefficient and there
      are potentially too many zone reclaim cycles.
      
      With the 1% boundary we may still remove all unmapped pages for file I/O in
      a zone reclaim pass.  However, it will take a large number of reads and writes
      to get back to 1% again, where we trigger zone reclaim again.
      
      Zone reclaim in 2.6.16/17 does not show this behavior because we have a 30
      second timeout.
      
      [akpm@osdl.org: rename the /proc file and the variable]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9614634f
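The 1% threshold can be expressed as a simple check made before zone reclaim runs. The sketch below is a model under stated assumptions: the function names and the way the threshold is derived from a percentage are hypothetical, and the counters are passed in as plain numbers rather than read from zone statistics.

```c
#include <assert.h>

/* Threshold in pages, e.g. 1% of the zone's pages (hypothetical helper). */
static unsigned long model_min_unmapped(unsigned long zone_pages,
                                        unsigned int ratio_percent)
{
    assert(ratio_percent <= 100);
    return zone_pages / 100 * ratio_percent;
}

/*
 * Zone reclaim is only worth running when the unmapped page cache
 * (file pages that are not mapped) exceeds the threshold.
 */
static int model_zone_reclaim_worthwhile(unsigned long file_pages,
                                         unsigned long file_mapped,
                                         unsigned long min_unmapped)
{
    assert(file_mapped <= file_pages);
    return file_pages - file_mapped > min_unmapped;
}
```

Once a pass has pushed the unmapped page cache below the threshold, many reads and writes are needed before the check passes again, which is what breaks the 32-page reclaim cycle described above.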