1. 27 Sep, 2006 7 commits
    • C
      [PATCH] Add node to zone for the NUMA case · d5f541ed
      Christoph Lameter committed
      Add the node in order to optimize zone_to_nid.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d5f541ed
    • C
      [PATCH] Add NUMA_BUILD definition in kernel.h to avoid #ifdef CONFIG_NUMA · 08e0f6a9
      Christoph Lameter committed
      The NUMA_BUILD constant is always available and will be set to 1 on
      NUMA_BUILDs.  That way checks valid only under CONFIG_NUMA can easily be done
      without #ifdef CONFIG_NUMA
      
      F.e.
      
      if (NUMA_BUILD && <numa_condition>) {
      ...
      }
      
      [akpm: not a thing we'd normally do, but CONFIG_NUMA is special: it is
       causing ifdef explosion in core kernel, so let's see if this is a comfortable
       way in which to control that]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      08e0f6a9
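The pattern in the commit above works because a constant that is always defined lets the compiler discard the NUMA-only branch on non-NUMA builds while still parsing and type-checking it. A minimal sketch of the idea follows; `pick_node()` and its parameters are hypothetical and only illustrate the shape of such a check, they are not taken from the kernel.

```c
#include <assert.h>

/* The idea from the commit: a constant that exists on every build. */
#ifdef CONFIG_NUMA
#define NUMA_BUILD 1
#else
#define NUMA_BUILD 0
#endif

/* Hypothetical helper (not kernel code): choose a node for an allocation. */
static int pick_node(int preferred_node, int nr_nodes)
{
	assert(nr_nodes > 0);

	/*
	 * When NUMA_BUILD is 0 the whole condition is constant-false, so the
	 * compiler can drop the branch, yet the code is still compiled and
	 * type-checked, unlike code hidden behind #ifdef CONFIG_NUMA.
	 */
	if (NUMA_BUILD && preferred_node >= 0 && preferred_node < nr_nodes)
		return preferred_node;

	return 0;	/* fall back to node 0 */
}
```

Built without `-DCONFIG_NUMA`, `pick_node(3, 8)` returns 0; built with it, the same call returns 3.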
    • J
      [PATCH] Condense output of show_free_areas() · c7241913
      Jes Sorensen committed
      On larger systems, the amount of output dumped on the console when you do
      SysRq-M is beyond insane.  This patch is trying to reduce it somewhat as
      even with the smaller NUMA systems that have hit the desktop this seems to
      be a fair thing to do.
      
      The philosophy I have taken is as follows:
       1) If a zone is empty, don't tell, we don't need yet another line
          telling us so. The information is available since one can look up
          the fact how many zones were initialized in the first place.
       2) Put as much information on a line as possible; if it can be done
          in one line rather than two, then do it in one. I tried to format
          the temperature stuff for easy reading.
      
      Change show_free_areas() to not print lines for empty zones.  If no zone
      output is printed, the zone is empty.  This reduces the number of lines
      dumped to the console in sysrq on a large system by several thousand lines.
      
      Change the zone temperature printouts to use one line per CPU instead of
      two lines (one hot, one cold).  On a 1024 CPU, 1024 node system, this
      reduces the console output by over a million lines of output.
      
      While this is a bigger problem on large NUMA systems, it is also applicable
      to smaller desktop sized and mid range NUMA systems.
      
      Old format:
      
      Mem-info:
      Node 0 DMA per-cpu:
      cpu 0 hot: high 42, batch 7 used:24
      cpu 0 cold: high 14, batch 3 used:1
      cpu 1 hot: high 42, batch 7 used:34
      cpu 1 cold: high 14, batch 3 used:0
      cpu 2 hot: high 42, batch 7 used:0
      cpu 2 cold: high 14, batch 3 used:0
      cpu 3 hot: high 42, batch 7 used:0
      cpu 3 cold: high 14, batch 3 used:0
      cpu 4 hot: high 42, batch 7 used:0
      cpu 4 cold: high 14, batch 3 used:0
      cpu 5 hot: high 42, batch 7 used:0
      cpu 5 cold: high 14, batch 3 used:0
      cpu 6 hot: high 42, batch 7 used:0
      cpu 6 cold: high 14, batch 3 used:0
      cpu 7 hot: high 42, batch 7 used:0
      cpu 7 cold: high 14, batch 3 used:0
      Node 0 DMA32 per-cpu: empty
      Node 0 Normal per-cpu: empty
      Node 0 HighMem per-cpu: empty
      Node 1 DMA per-cpu:
      [snip]
      Free pages:     5410688kB (0kB HighMem)
      Active:9536 inactive:4261 dirty:6 writeback:0 unstable:0 free:338168 slab:1931 mapped:1900 pagetables:208
      Node 0 DMA free:1676304kB min:3264kB low:4080kB high:4896kB active:128048kB inactive:61568kB present:1970880kB pages_scanned:0 all_unreclaimable? no
      lowmem_reserve[]: 0 0 0 0
      Node 0 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
      lowmem_reserve[]: 0 0 0 0
      Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
      lowmem_reserve[]: 0 0 0 0
      Node 0 HighMem free:0kB min:512kB low:512kB high:512kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
      lowmem_reserve[]: 0 0 0 0
      Node 1 DMA free:1951728kB min:3280kB low:4096kB high:4912kB active:5632kB inactive:1504kB present:1982464kB pages_scanned:0 all_unreclaimable? no
      lowmem_reserve[]: 0 0 0 0
      ....
      
      New format:
      
      Mem-info:
      Node 0 DMA per-cpu:
      CPU    0: Hot: hi:   42, btch:   7 usd:  41   Cold: hi:   14, btch:   3 usd:   2
      CPU    1: Hot: hi:   42, btch:   7 usd:  40   Cold: hi:   14, btch:   3 usd:   1
      CPU    2: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
      CPU    3: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
      CPU    4: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
      CPU    5: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
      CPU    6: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
      CPU    7: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
      Node 1 DMA per-cpu:
      [snip]
      Free pages:     5411088kB (0kB HighMem)
      Active:9558 inactive:4233 dirty:6 writeback:0 unstable:0 free:338193 slab:1942 mapped:1918 pagetables:208
      Node 0 DMA free:1677648kB min:3264kB low:4080kB high:4896kB active:129296kB inactive:58864kB present:1970880kB pages_scanned:0 all_unreclaimable? no
      lowmem_reserve[]: 0 0 0 0
      Node 1 DMA free:1948448kB min:3280kB low:4096kB high:4912kB active:6864kB inactive:3536kB present:1982464kB pages_scanned:0 all_unreclaimable? no
      lowmem_reserve[]: 0 0 0 0
      Signed-off-by: Jes Sorensen <jes@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c7241913
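The "over a million lines" figure in the commit above can be checked with simple arithmetic. The sketch below assumes one populated zone per node (as in the sample output, where only the DMA zone has pages) and counts only the per-CPU hot/cold lines, not the zone headers; the function names are made up for this illustration.

```c
#include <assert.h>

/* Per-CPU lines printed for `zones` populated zones on each of `nodes` nodes. */
static unsigned long percpu_lines(unsigned long nodes, unsigned long zones,
				  unsigned long cpus,
				  unsigned long lines_per_cpu)
{
	assert(lines_per_cpu > 0);
	return nodes * zones * cpus * lines_per_cpu;
}

/* Old format: two lines per CPU (hot, cold).  New format: one line per CPU. */
static unsigned long lines_saved(unsigned long nodes, unsigned long zones,
				 unsigned long cpus)
{
	return percpu_lines(nodes, zones, cpus, 2) -
	       percpu_lines(nodes, zones, cpus, 1);
}
```

Under these assumptions, `lines_saved(1024, 1, 1024)` is 1024 * 1024 = 1048576 lines.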
    • M
      [PATCH] Allow an arch to expand node boundaries · fb01439c
      Mel Gorman committed
      Arch-independent zone-sizing determines the size of a node
      (pgdat->node_spanned_pages) based on the physical memory that was
      registered by the architecture.  However, when
      CONFIG_MEMORY_HOTPLUG_RESERVE is set, the architecture expects that the
      spanned_pages will be much larger and that a mem_map will be allocated that
      is used later on for memory hot-add.
      
      This patch allows an architecture that sets CONFIG_MEMORY_HOTPLUG_RESERVE
      to call push_node_boundaries() which will set the node beginning and end to
      at *least* the requested boundary.
      
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Keith Mannthey" <kmannth@gmail.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fb01439c
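The "at *least* the requested boundary" behaviour described in the commit above amounts to a min/max over the node's start and end PFNs. The following is a user-space model of that idea only; the `model_` functions, the `node_boundary` structure and `MAX_NODES` are assumptions for illustration and do not reproduce the kernel's implementation.

```c
#include <assert.h>

#define MAX_NODES 4	/* hypothetical limit for this sketch */

struct node_boundary {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* 0 means no boundary was requested */
};

static struct node_boundary boundaries[MAX_NODES];

/* Record a requested boundary; repeated calls can only widen it. */
static void model_push_node_boundaries(int nid, unsigned long start_pfn,
				       unsigned long end_pfn)
{
	assert(nid >= 0 && nid < MAX_NODES && start_pfn < end_pfn);

	struct node_boundary *b = &boundaries[nid];

	if (b->end_pfn == 0) {
		b->start_pfn = start_pfn;
		b->end_pfn = end_pfn;
		return;
	}
	if (start_pfn < b->start_pfn)
		b->start_pfn = start_pfn;
	if (end_pfn > b->end_pfn)
		b->end_pfn = end_pfn;
}

/* Widen a span computed from registered memory to the requested boundary. */
static void model_apply_boundaries(int nid, unsigned long *start_pfn,
				   unsigned long *end_pfn)
{
	assert(nid >= 0 && nid < MAX_NODES);

	const struct node_boundary *b = &boundaries[nid];

	if (b->end_pfn == 0)
		return;		/* nothing requested for this node */
	if (b->start_pfn < *start_pfn)
		*start_pfn = b->start_pfn;
	if (b->end_pfn > *end_pfn)
		*end_pfn = b->end_pfn;
}
```

A node whose registered memory spans PFNs 0x1000 to 0x2000 and whose architecture pushed 0x800 to 0x4000 ends up spanning 0x800 to 0x4000; a node with no pushed boundary is left unchanged.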
    • M
      [PATCH] Account for holes that are outside the range of physical memory · 9c7cd687
      Mel Gorman committed
      absent_pages_in_range() made the assumption that users of the API would not
      care about holes beyond the end of physical memory.  This was not the
      case.  This patch will account for ranges outside of physical memory as
      holes correctly.
      
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Keith Mannthey" <kmannth@gmail.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9c7cd687
    • M
      [PATCH] Account for memmap and optionally the kernel image as holes · 0e0b864e
      Mel Gorman committed
      The x86_64 code accounted for memmap and some portions of the DMA zone as
      holes.  This was because those areas would never be reclaimed and accounting
      for them as memory affects min watermarks.  This patch will account for the
      memmap as a memory hole.  Architectures may optionally use set_dma_reserve()
      if they wish to account for a portion of memory in ZONE_DMA as a hole.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Keith Mannthey" <kmannth@gmail.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0e0b864e
    • M
      [PATCH] Introduce mechanism for registering active regions of memory · c713216d
      Mel Gorman committed
      At a basic level, architectures define structures to record where active
      ranges of page frames are located.  Once located, the code to calculate zone
      sizes and holes in each architecture is very similar.  Some of this zone and
      hole sizing code is difficult to read for no good reason.  This set of patches
      eliminates the similar-looking architecture-specific code.
      
      The patches introduce a mechanism where architectures register where the
      active ranges of page frames are with add_active_range().  When all areas have
      been discovered, free_area_init_nodes() is called to initialise the pgdat and
      zones.  The zone sizes and holes are then calculated in an architecture
      independent manner.
      
      Patch 1 introduces the mechanism for registering and initialising PFN ranges
      Patch 2 changes ppc to use the mechanism - 139 arch-specific LOC removed
      Patch 3 changes x86 to use the mechanism - 136 arch-specific LOC removed
      Patch 4 changes x86_64 to use the mechanism - 74 arch-specific LOC removed
      Patch 5 changes ia64 to use the mechanism - 52 arch-specific LOC removed
      Patch 6 accounts for mem_map as a memory hole as the pages are not reclaimable.
      	It adjusts the watermarks slightly
      
      Tony Luck has successfully tested for ia64 on Itanium with tiger_defconfig,
      gensparse_defconfig and defconfig.  Bob Picco has also tested and debugged on
      IA64.  Jack Steiner successfully boot tested on a mammoth SGI IA64-based
      machine.  These were on patches against 2.6.17-rc1 and release 3 of these
      patches but there have been no ia64-changes since release 3.
      
      There are differences in the zone sizes for x86_64 as the arch-specific code
      for x86_64 accounts the kernel image and the starting mem_maps as memory holes
      but the architecture-independent code accounts the memory as present.
      
      The big benefit of this set of patches is a sizable reduction of
      architecture-specific code, some of which is very hairy.  There should be a
      greater reduction when other architectures use the same mechanisms for zone
      and hole sizing but I lack the hardware to test on.
      
      Additional credit:
      	Dave Hansen for the initial suggestion and comments on early patches
      	Andy Whitcroft for reviewing early versions and catching numerous
      		errors
      	Tony Luck for testing and debugging on IA64
      	Bob Picco for fixing bugs related to pfn registration, reviewing a
      		number of patch revisions, providing a number of suggestions
      		on future direction and testing heavily
      	Jack Steiner and Robin Holt for testing on IA64 and clarifying
      		issues related to memory holes
      	Yasunori for testing on IA64
      	Andi Kleen for reviewing and feeding back about x86_64
      	Christian Kujau for providing valuable information related to ACPI
      		problems on x86_64 and testing potential fixes
      
      This patch:
      
      Define the structure to represent an active range of page frames within a node
      in an architecture independent manner.  Architectures are expected to register
      active ranges of PFNs using add_active_range(nid, start_pfn, end_pfn) and call
      free_area_init_nodes() passing the PFNs of the end of each zone.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Bob Picco <bob.picco@hp.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Keith Mannthey" <kmannth@gmail.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c713216d
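The registration scheme described in the commit above can be modelled in a few lines: the architecture records PFN ranges per node, and generic code derives present and absent (hole) pages for any PFN window from them. This is a simplified user-space model; the `model_` names, the fixed-size table and the assumption that registered ranges never overlap are choices made for this sketch, not the kernel's code.

```c
#include <assert.h>

#define MAX_ACTIVE_RANGES 16	/* hypothetical table size */

struct active_range {
	int nid;
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
};

static struct active_range ranges[MAX_ACTIVE_RANGES];
static int nr_ranges;

/* Arch side: register one range of page frames that really exists. */
static void model_add_active_range(int nid, unsigned long start_pfn,
				   unsigned long end_pfn)
{
	assert(nr_ranges < MAX_ACTIVE_RANGES && start_pfn < end_pfn);
	ranges[nr_ranges].nid = nid;
	ranges[nr_ranges].start_pfn = start_pfn;
	ranges[nr_ranges].end_pfn = end_pfn;
	nr_ranges++;
}

/* Generic side: pages of node nid inside [start_pfn, end_pfn) that exist. */
static unsigned long model_present_pages(int nid, unsigned long start_pfn,
					 unsigned long end_pfn)
{
	unsigned long present = 0;

	for (int i = 0; i < nr_ranges; i++) {
		unsigned long lo, hi;

		if (ranges[i].nid != nid)
			continue;
		/* Clip the registered range to the requested window. */
		lo = ranges[i].start_pfn > start_pfn ? ranges[i].start_pfn : start_pfn;
		hi = ranges[i].end_pfn < end_pfn ? ranges[i].end_pfn : end_pfn;
		if (lo < hi)
			present += hi - lo;
	}
	return present;
}

/* Everything in the window that is not present counts as a hole. */
static unsigned long model_absent_pages(int nid, unsigned long start_pfn,
					unsigned long end_pfn)
{
	assert(start_pfn <= end_pfn);
	return (end_pfn - start_pfn) -
	       model_present_pages(nid, start_pfn, end_pfn);
}
```

Because holes are computed as "window minus present", a window that extends past the last registered range is counted as a hole as well, which is the behaviour the "holes outside the range of physical memory" commit in this series asks for.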
  2. 26 Sep, 2006 20 commits
  3. 23 Sep, 2006 1 commit
    • D
      [XFRM]: Dynamic xfrm_state hash table sizing. · f034b5d4
      David S. Miller committed
      The grow algorithm is simple, we grow if:
      
      1) we see a hash chain collision at insert, and
      2) we haven't hit the hash size limit (currently 1*1024*1024 slots), and
      3) the number of xfrm_state objects is > the current hash mask
      
      All of this needs some tweaking.
      
      Remove __initdata from "hashdist" so we can use it safely at run time.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f034b5d4
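The three grow conditions listed in the commit above combine into a single predicate. A minimal sketch, assuming the table size is a power of two so that the hash mask is size minus one; `should_grow()`, its parameters and `HASH_SIZE_LIMIT` are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* "currently 1*1024*1024 slots" per the commit message. */
#define HASH_SIZE_LIMIT (1UL * 1024 * 1024)

/*
 * Hypothetical predicate combining the three conditions:
 *  1) a hash chain collision was seen at insert,
 *  2) the table is still below the size limit,
 *  3) there are more state objects than the current hash mask.
 */
static bool should_grow(bool chain_collision, unsigned long hmask,
			unsigned long num_states)
{
	/* hmask is table size - 1, so size must be a power of two. */
	assert(((hmask + 1) & hmask) == 0);

	return chain_collision &&
	       (hmask + 1) < HASH_SIZE_LIMIT &&
	       num_states > hmask;
}
```

With an 8-slot table (mask 7), a collision on inserting the 8th object triggers a grow, while the same collision with only 7 objects does not.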
  4. 04 Jul, 2006 1 commit
    • C
      [PATCH] ZVC/zone_reclaim: Leave 1% of unmapped pagecache pages for file I/O · 9614634f
      Christoph Lameter committed
      It turns out that it is advantageous to leave a small portion of unmapped file
      backed pages if all of a zone's pages (or almost all pages) are allocated and
      so the page allocator has to go off-node.
      
      This allows recently used file I/O buffers to stay on the node and
      reduces the times that zone reclaim is invoked if file I/O occurs
      when we run out of memory in a zone.
      
      The problem is that zone reclaim runs too frequently when the page cache is
      used for file I/O (read write and therefore unmapped pages!) alone and we have
      almost all pages of the zone allocated.  Zone reclaim may remove 32 unmapped
      pages.  File I/O will use these pages for the next read/write requests and the
      unmapped pages increase.  After the zone has filled up again zone reclaim will
      remove it again after only 32 pages.  This cycle is too inefficient and there
      are potentially too many zone reclaim cycles.
      
      With the 1% boundary we may still remove all unmapped pages for file I/O in
      a zone reclaim pass.  However, it will take a large number of reads and writes
      to get back to 1% again, where we trigger zone reclaim again.
      
      Zone reclaim in 2.6.16/17 does not show this behavior because we have a 30
      second timeout.
      
      [akpm@osdl.org: rename the /proc file and the variable]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9614634f
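The 1% boundary described in the commit above is a threshold on unmapped file-backed pages: zone reclaim is only worth attempting while more than that share of the zone is unmapped page cache. The sketch below models the check in user space; `struct zone_model`, its fields and both function names are assumptions made for this illustration rather than the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of the per-zone numbers involved. */
struct zone_model {
	unsigned long present_pages;		/* pages in the zone */
	unsigned long file_pages;		/* page cache pages */
	unsigned long file_mapped;		/* page cache pages mapped by processes */
	unsigned long min_unmapped_pages;	/* threshold derived from the ratio */
};

/* Turn a percentage (1 in the commit) into a page count for this zone. */
static void set_min_unmapped_ratio(struct zone_model *z, unsigned int percent)
{
	assert(percent <= 100);
	z->min_unmapped_pages = z->present_pages * percent / 100;
}

/* Reclaim only if more unmapped page cache exists than the reserved share. */
static bool zone_reclaim_worthwhile(const struct zone_model *z)
{
	assert(z->file_mapped <= z->file_pages);

	unsigned long unmapped = z->file_pages - z->file_mapped;

	return unmapped > z->min_unmapped_pages;
}
```

For a zone of 100000 pages the threshold is 1000 pages, so a zone holding 900 unmapped page cache pages is skipped and file I/O has to refill the cache well past 1000 before reclaim runs again.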
  5. 01 Jul, 2006 11 commits
    • C
      [PATCH] Light weight event counters · f8891e5e
      Christoph Lameter committed
      The remaining counters in page_state after the zoned VM counter patches
      have been applied are all just for show in /proc/vmstat.  They have no
      essential function for the VM.
      
      We use a simple increment of per cpu variables.  In order to avoid the most
      severe races we disable preempt.  Preempt does not prevent the race between
      an increment and an interrupt handler incrementing the same statistics
      counter.  However, that race is exceedingly rare, we may only lose one
      increment or so and there is no requirement (at least not in kernel) that
      the vm event counters have to be accurate.
      
      In the non preempt case this results in a simple increment for each
      counter.  For many architectures this will be reduced by the compiler to a
      single instruction.  This single instruction is atomic for i386 and x86_64.
       And therefore even the rare race condition in an interrupt is avoided for
      both architectures in most cases.
      
      The patchset also adds an off switch for embedded systems that allows
      building Linux kernels without these counters.
      
      The implementation of these counters is through inline code that hopefully
      results in only a single increment instruction being emitted
      (i386, x86_64) or in the increment being hidden through instruction
      concurrency (EPIC architectures such as ia64 can get that done).
      
      Benefits:
      - VM event counter operations usually reduce to a single inline instruction
        on i386 and x86_64.
      - No interrupt disable, only preempt disable for the preempt case.
        Preempt disable can also be avoided by moving the counter into a spinlock.
      - Handling is similar to zoned VM counters.
      - Simple and easily extendable.
      - Can be omitted to reduce memory use for embedded use.
      
      References:
      
      RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
      RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
      local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
      V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
      V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
      V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f8891e5e
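The scheme in the commit above is cheap because each CPU does a plain increment on its own slot and the cost of combining the slots is paid only when someone reads the statistics. A single-threaded user-space model of that layout follows; the array, the enum and the function names are hypothetical, and the preempt-disable step the commit mentions has no equivalent here.

```c
#include <assert.h>

#define EV_NR_CPUS 4	/* hypothetical CPU count for this sketch */

enum vm_event_model { EV_PGFAULT, EV_PGFREE, EV_NR_EVENTS };

/* One row of counters per CPU, so writers never share a counter. */
static unsigned long event_counts[EV_NR_CPUS][EV_NR_EVENTS];

/* Writer: a plain, non-atomic increment of this CPU's own slot. */
static void count_event(int cpu, enum vm_event_model ev)
{
	assert(cpu >= 0 && cpu < EV_NR_CPUS);
	event_counts[cpu][ev]++;
}

/* Reader: sum the slots of all CPUs; the result is allowed to be approximate. */
static unsigned long sum_event(enum vm_event_model ev)
{
	unsigned long total = 0;

	for (int cpu = 0; cpu < EV_NR_CPUS; cpu++)
		total += event_counts[cpu][ev];
	return total;
}
```

The trade-off is the one the commit states: an occasional lost increment is acceptable because these counters are only reported, never used for VM decisions.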
    • C
      [PATCH] Use Zoned VM Counters for NUMA statistics · ca889e6c
      Christoph Lameter committed
      The numa statistics are really event counters.  But they are per node and
      so we have had special treatment for these counters through additional
      fields on the pcp structure.  We can now use the per zone nature of the
      zoned VM counters to realize these.
      
      This will shrink the size of the pcp structure on NUMA systems.  We will
      have some room to add additional per zone counters that will all still fit
      in the same cacheline.
      
       Bits  Prior pcp size         Size after patch      We can add
       ------------------------------------------------------------------------------
       64    128 bytes (16 words)   80 bytes (10 words)   48 bytes
       32     76 bytes (19 words)   56 bytes (14 words)    8 bytes (64 byte cacheline)
                                                          72 bytes (128 byte cacheline)
      
      Remove the special statistics for numa and replace them with zoned vm
      counters.  This has the side effect that global sums of these events now
      show up in /proc/vmstat.
      
      Also take the opportunity to move the zone_statistics() function from
      page_alloc.c into vmstat.c.
      
      Discussions:
      V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ca889e6c
    • C
      [PATCH] zoned vm counters: conversion of nr_unstable to per zone counter · fd39fc85
      Christoph Lameter committed
      Conversion of nr_unstable to a per zone counter
      
      We need to do some special modifications to the nfs code since there are
      multiple cases of disposition and we need to have a page ref for proper
      accounting.
      
      This converts the last critical page state of the VM and therefore we need to
      remove several functions that were depending on GET_PAGE_STATE_LAST in order
      to make the kernel compile again.  We are only left with event type counters
      in page state.
      
      [akpm@osdl.org: bugfixes]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fd39fc85
    • C
      [PATCH] zoned vm counters: conversion of nr_writeback to per zone counter · ce866b34
      Christoph Lameter committed
      Conversion of nr_writeback to per zone counter.
      
      This removes the last page_state counter from arch/i386/mm/pgtable.c so we
      drop the page_state from there.
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ce866b34
    • C
      [PATCH] zoned vm counters: conversion of nr_dirty to per zone counter · b1e7a8fd
      Christoph Lameter committed
      This makes nr_dirty a per zone counter.  Looping over all processors is
      avoided during writeback state determination.
      
      The counter aggregation for nr_dirty had to be undone in the NFS layer since
      we summed up the page counts from multiple zones.  Someone more familiar with
      NFS should probably review what I have done.
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b1e7a8fd
    • C
      [PATCH] zoned vm counters: conversion of nr_pagetables to per zone counter · df849a15
      Christoph Lameter committed
      Conversion of nr_page_table_pages to a per zone counter
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      df849a15
    • C
      [PATCH] zoned vm counters: conversion of nr_slab to per zone counter · 9a865ffa
      Christoph Lameter committed
      - Allows reclaim to access counter without looping over processor counts.
      
      - Allows accurate statistics on how many pages are used in a zone by
        the slab. This may become useful to balance slab allocations over
        various zones.
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9a865ffa
    • C
      [PATCH] zoned vm counters: conversion of nr_pagecache to per zone counter · 347ce434
      Christoph Lameter committed
      Currently a single atomic variable is used to establish the size of the page
      cache in the whole machine.  The zoned VM counters have the same method of
      implementation as the nr_pagecache code but also allow the determination of
      the pagecache size per zone.
      
      Remove the special implementation for nr_pagecache and make it a zoned counter
      named NR_FILE_PAGES.
      
      Updates of the page cache counters are always performed with interrupts off.
      We can therefore use the __ variant here.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      347ce434
    • C
      [PATCH] zoned vm counters: convert nr_mapped to per zone counter · 65ba55f5
      Christoph Lameter committed
      nr_mapped is important because it allows a determination of how many pages of
      a zone are not mapped, which would allow a more efficient means of determining
      when we need to reclaim memory in a zone.
      
      We take the nr_mapped field out of the page state structure and define a new
      per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
      from NR_MAPPED in the next patch).
      
      We replace the use of nr_mapped in various kernel locations.  This avoids the
      looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
      zone reclaim).
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      65ba55f5
    • C
      [PATCH] zoned vm counters: basic ZVC (zoned vm counter) implementation · 2244b95a
      Christoph Lameter committed
      Per zone counter infrastructure
      
      The counters that we currently have for the VM are split per processor.  The
      processor however has not much to do with the zone these pages belong to.  We
      cannot tell f.e.  how many ZONE_DMA pages are dirty.
      
      So we are blind to potential imbalances in the usage of memory in various
      zones.  F.e.  in a NUMA system we cannot tell how many pages are dirty on a
      particular node.  If we knew then we could put measures into the VM to balance
      the use of memory between different zones and different nodes in a NUMA
      system.  For example it would be possible to limit the dirty pages per node so
      that fast local memory is kept available even if a process is dirtying huge
      amounts of pages.
      
      Another example is zone reclaim.  We do not know how many unmapped pages exist
      per zone.  So we just have to try to reclaim.  If it is not working then we
      pause and try again later.  It would be better if we knew when it makes sense
      to reclaim unmapped pages from a zone.  This patchset allows the determination
      of the number of unmapped pages per zone.  We can remove the zone reclaim
      interval with the counters introduced here.
      
      Furthermore the ability to have various usage statistics available will allow
      the development of new NUMA balancing algorithms that may be able to improve
      the decision making in the scheduler of when to move a process to another node
      and hopefully will also enable automatic page migration through a user space
      program that can analyse the memory load distribution and then rebalance
      memory use in order to increase performance.
      
      The counter framework here implements differential counters for each processor
      in struct zone.  The differential counters are consolidated when a threshold
      is exceeded (like done in the current implementation for nr_pagecache), when
      slab reaping occurs or when a consolidation function is called.
      
      Consolidation uses atomic operations and accumulates counters per zone in the
      zone structure and also globally in the vm_stat array.  VM functions can
      access the counts by simply indexing a global or zone specific array.
      
      The arrangement of counters in an array also simplifies processing when output
      has to be generated for /proc/*.
      
      Counters can be updated by calling inc/dec_zone_page_state or
      _inc/dec_zone_page_state analogous to *_page_state.  The second group of
      functions can be called if it is known that interrupts are disabled.
      
      Special optimized increment and decrement functions are provided.  These can
      avoid certain checks and use increment or decrement instructions that an
      architecture may provide.
      
      We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
      do DMA to all memory and therefore ZONE_NORMAL will not be populated.  This is
      only currently set for IA64 SGI SN2 and currently only affects
      node_page_state().  In the best case node_page_state can be reduced to
      retrieving a single counter for the one zone on the node.
      
      [akpm@osdl.org: cleanups]
      [akpm@osdl.org: export vm_stat[] for filesystems]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      2244b95a
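The differential counters described in the commit above can be sketched as small per-CPU deltas that are folded into atomic per-zone and global totals once a threshold is crossed. The model below is a single-threaded user-space illustration; all `zvc_` names and the threshold value of 32 are assumptions for this sketch, and the real code also has to deal with interrupts and preemption.

```c
#include <assert.h>
#include <stdatomic.h>

#define ZVC_NR_CPUS   2		/* hypothetical CPU count */
#define ZVC_THRESHOLD 32	/* assumed fold threshold for this sketch */

enum zvc_item { ZVC_FILE_PAGES, ZVC_FILE_MAPPED, ZVC_NR_ITEMS };

struct zvc_zone {
	atomic_long vm_stat[ZVC_NR_ITEMS];		/* consolidated per-zone totals */
	signed char diff[ZVC_NR_CPUS][ZVC_NR_ITEMS];	/* small per-CPU differentials */
};

/* Global totals over all zones, readable by simple indexing. */
static atomic_long zvc_global[ZVC_NR_ITEMS];

/* Move one CPU's pending differential into the zone and global totals. */
static void zvc_fold(struct zvc_zone *z, int cpu, enum zvc_item item)
{
	long delta = z->diff[cpu][item];

	z->diff[cpu][item] = 0;
	atomic_fetch_add(&z->vm_stat[item], delta);
	atomic_fetch_add(&zvc_global[item], delta);
}

/* Update a counter: usually touches only the 8 bit per-CPU differential. */
static void zvc_mod(struct zvc_zone *z, int cpu, enum zvc_item item, int delta)
{
	assert(cpu >= 0 && cpu < ZVC_NR_CPUS);
	assert(delta >= -ZVC_THRESHOLD && delta <= ZVC_THRESHOLD);

	int x = z->diff[cpu][item] + delta;

	if (x > ZVC_THRESHOLD || x < -ZVC_THRESHOLD) {
		/* Threshold exceeded: consolidate with atomic operations. */
		z->diff[cpu][item] = 0;
		atomic_fetch_add(&z->vm_stat[item], x);
		atomic_fetch_add(&zvc_global[item], x);
	} else {
		z->diff[cpu][item] = (signed char)x;
	}
}

/* Explicit consolidation, as a "consolidation function" would do. */
static void zvc_fold_all(struct zvc_zone *z)
{
	for (int cpu = 0; cpu < ZVC_NR_CPUS; cpu++)
		for (int item = 0; item < ZVC_NR_ITEMS; item++)
			zvc_fold(z, cpu, (enum zvc_item)item);
}

/* Readers index the zone array; pending differentials are not included. */
static long zvc_zone_read(struct zvc_zone *z, enum zvc_item item)
{
	return atomic_load(&z->vm_stat[item]);
}
```

The design choice this illustrates is that readers get a value by indexing one array instead of looping over all processors, at the cost of the total lagging behind by at most the threshold per CPU until the next fold.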
    • C
      [PATCH] zoned vm counters: create vmstat.c/.h from page_alloc.c/.h · f6ac2354
      Christoph Lameter committed
      NOTE: ZVC are *not* the lightweight event counters.  ZVCs are reliable whereas
      event counters do not need to be.
      
      Zone based VM statistics are necessary to be able to determine what the state
      of memory in one zone is.  In a NUMA system this can be helpful for local
      reclaim and other memory optimizations that may be able to shift VM load in
      order to get more balanced memory use.
      
      It is also useful to know how the computing load affects the memory
      allocations on various zones.  This patchset allows the retrieval of that data
      from userspace.
      
      The patchset introduces a framework for counters that is a cross between the
      existing page_stats --which are simply global counters split per cpu-- and the
      approach of deferred incremental updates implemented for nr_pagecache.
      
      Small per cpu 8 bit counters are added to struct zone.  If the counter exceeds
      certain thresholds then the counters are accumulated in an array of
      atomic_long in the zone and in a global array that sums up all zone values.
      The small 8 bit counters are next to the per cpu page pointers and so they
      will be high in the cpu cache when pages are allocated and freed.
      
      Access to VM counter information for a zone and for the whole machine is then
      possible by simply indexing an array (Thanks to Nick Piggin for pointing out
      that approach).  Access to the total number of pages of various types no
      longer requires summing up all per cpu counters.
      
      Benefits of this patchset right now:
      
      - Ability for UP and SMP configuration to determine how memory
        is balanced between the DMA, NORMAL and HIGHMEM zones.
      
      - loops over all processors are avoided in writeback and
        reclaim paths. We can avoid caching the writeback information
        because the needed information is directly accessible.
      
      - Special handling for nr_pagecache removed.
      
      - zone_reclaim_interval vanishes since VM stats can now determine
        when local reclaim is worthwhile.
      
      - Fast inline per node page state determination.
      
      - Accurate counters in /sys/devices/system/node/node*/meminfo. Current
        counters simply count which processor allocated a page somewhere
        and guesstimate based on that. So the counters were not useful to show
        the actual distribution of page use on a specific zone.
      
      - The swap_prefetch patch requires per node statistics in order to
        figure out when processors of a node can prefetch. This patch provides
        some of the needed numbers.
      
      - Detailed VM counters available in more /proc and /sys status files.
      
      References to earlier discussions:
      V1 http://marc.theaimsgroup.com/?l=linux-kernel&m=113511649910826&w=2
      V2 http://marc.theaimsgroup.com/?l=linux-kernel&m=114980851924230&w=2
      V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115014697910351&w=2
      V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767318740&w=2
      
      Performance tests with AIM7 did not show any regressions.  Seems to be a tad
      faster even.  Tested on ia64/NUMA.  Builds fine on i386, SMP / UP.  Includes
      fixes for s390/arm/uml arch code.
      
      This patch:
      
      Move counter code from page_alloc.c/page-flags.h to vmstat.c/h.
      
      Create vmstat.c/vmstat.h by separating the counter code and the proc
      functions.
      
      Move the vm_stat_text array before zoneinfo_show.
      
      [akpm@osdl.org: s390 build fix]
      [akpm@osdl.org: HOTPLUG_CPU build fix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f6ac2354