1. 29 October 2006, 1 commit
    • [PATCH] vmscan: Fix temp_priority race · 3bb1a852
      Committed by Martin Bligh
      The temp_priority field in zone is racy, as we can walk through a reclaim
      path, and just before we copy it into prev_priority, it can be overwritten
      (say with DEF_PRIORITY) by another reclaimer.
      
      The same bug is contained in both try_to_free_pages and balance_pgdat, but
      it is fixed slightly differently.  In balance_pgdat, we keep a separate
      priority record per zone in a local array.  In try_to_free_pages there is
      no need to do this, as the priority level is the same for all zones that we
      reclaim from.
      
      Impact of this bug is that temp_priority is copied into prev_priority, and
      setting this artificially high causes reclaimers to set distress
      artificially low.  They then fail to reclaim mapped pages, when they are,
      in fact, under severe memory pressure (their priority may be as low as 0).
      This causes the OOM killer to fire incorrectly.
      
      From: Andrew Morton <akpm@osdl.org>
      
      __zone_reclaim() isn't modifying zone->prev_priority.  But zone->prev_priority
      is used in the decision whether or not to bring mapped pages onto the inactive
      list.  Hence there's a risk here that __zone_reclaim() will fail because
      zone->prev_priority is large (ie: low urgency) and lots of mapped pages end up
      stuck on the active list.
      
      Fix that up by decreasing (ie making more urgent) zone->prev_priority as
      __zone_reclaim() scans the zone's pages.
      
      This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created.  It should be
      possible to remove that now, and to just start out at DEF_PRIORITY?
      
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3bb1a852
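
      A hedged sketch of the balance_pgdat() side of the fix described above:
      each reclaimer keeps its own per-zone priority in a local array instead of
      the shared, racy zone->temp_priority, and only publishes values to
      zone->prev_priority.  The helper name and loop structure are illustrative,
      not the verbatim patch.

          /* lower numerical priority == more urgent scanning */
          static void note_zone_scanning_priority(struct zone *zone, int priority)
          {
                  if (priority < zone->prev_priority)
                          zone->prev_priority = priority;
          }

          static void balance_pgdat_sketch(pg_data_t *pgdat)
          {
                  int zone_priority[MAX_NR_ZONES];  /* private copy, no racing writers */
                  int i, priority;

                  for (i = 0; i < pgdat->nr_zones; i++)
                          zone_priority[i] = DEF_PRIORITY;

                  for (priority = DEF_PRIORITY; priority >= 0; priority--) {
                          for (i = 0; i < pgdat->nr_zones; i++) {
                                  zone_priority[i] = priority;
                                  note_zone_scanning_priority(pgdat->node_zones + i, priority);
                                  /* ... shrink the zone at this priority ... */
                          }
                          /* ... break out early once enough pages are free ... */
                  }

                  /* publish the final values once, from the private copy */
                  for (i = 0; i < pgdat->nr_zones; i++)
                          pgdat->node_zones[i].prev_priority = zone_priority[i];
          }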
  2. 22 October 2006, 1 commit
  3. 27 September 2006, 4 commits
    • [PATCH] Add node to zone for the NUMA case · d5f541ed
      Committed by Christoph Lameter
      Add the node in order to optimize zone_to_nid.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d5f541ed
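
      A hedged sketch of the optimization: cache the node id in struct zone
      itself so that zone_to_nid() needs a single load instead of chasing
      zone->zone_pgdat.  Field placement and the #ifdef are illustrative, not
      the exact patch.

          struct zone {
                  /* ... existing fields ... */
          #ifdef CONFIG_NUMA
                  int node;                       /* set once at boot */
          #endif
          };

          static inline int zone_to_nid(struct zone *zone)
          {
          #ifdef CONFIG_NUMA
                  return zone->node;                      /* one dereference   */
          #else
                  return zone->zone_pgdat->node_id;       /* non-NUMA fallback */
          #endif
          }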
    • [PATCH] own header file for struct page · 5b99cd0e
      Committed by Heiko Carstens
      This moves the definition of struct page from mm.h to its own header file
      page-struct.h.  This is a prereq to fix SetPageUptodate which is broken on
      s390:
      
       #define SetPageUptodate(_page)                                     \
              do {                                                        \
                      struct page *__page = (_page);                      \
                      if (!test_and_set_bit(PG_uptodate, &__page->flags)) \
                              page_test_and_clear_dirty(_page);           \
              } while (0)
      
      _page gets used twice in this macro which can cause subtle bugs.  Using
      __page for the page_test_and_clear_dirty call doesn't work since it causes
      yet another problem with the page_test_and_clear_dirty macro as well.
      
      In order to avoid all these problems caused by macros it seems to be a good
      idea to get rid of them and convert them to static inline functions.
      Because of header file include order it's necessary to have a separate
      header file for the struct page definition.
      
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      5b99cd0e
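
      A hedged sketch of the direction described above: once struct page is
      visible from its own header, the macro can become a static inline, so the
      page argument is evaluated exactly once and the s390
      page_test_and_clear_dirty() call can no longer see a side-effecting
      expression twice.  This mirrors the idea, not the final upstream code.

          static inline void SetPageUptodate(struct page *page)
          {
                  /* "page" is an ordinary parameter: evaluated once by the caller */
                  if (!test_and_set_bit(PG_uptodate, &page->flags))
                          page_test_and_clear_dirty(page);
          }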
    • [PATCH] vm: add per-zone writeout counter · e129b5c2
      Committed by Andrew Morton
      The VM is supposed to minimise the number of pages which get written off the
      LRU (for IO scheduling efficiency, and for high reclaim-success rates).  But
      we don't actually have a clear way of showing how true this is.
      
      So add `nr_vmscan_write' to /proc/vmstat and /proc/zoneinfo - the number of
      pages which have been written by the vm scanner in this zone and globally.
      
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e129b5c2
    • [PATCH] Introduce mechanism for registering active regions of memory · c713216d
      Committed by Mel Gorman
      At a basic level, architectures define structures to record where active
      ranges of page frames are located.  Once located, the code to calculate zone
      sizes and holes in each architecture is very similar.  Some of this zone and
      hole sizing code is difficult to read for no good reason.  This set of patches
      eliminates the similar-looking architecture-specific code.
      
      The patches introduce a mechanism where architectures register where the
      active ranges of page frames are with add_active_range().  When all areas have
      been discovered, free_area_init_nodes() is called to initialise the pgdat and
      zones.  The zone sizes and holes are then calculated in an architecture
      independent manner.
      
      Patch 1 introduces the mechanism for registering and initialising PFN ranges
      Patch 2 changes ppc to use the mechanism - 139 arch-specific LOC removed
      Patch 3 changes x86 to use the mechanism - 136 arch-specific LOC removed
      Patch 4 changes x86_64 to use the mechanism - 74 arch-specific LOC removed
      Patch 5 changes ia64 to use the mechanism - 52 arch-specific LOC removed
      Patch 6 accounts for mem_map as a memory hole as the pages are not reclaimable.
      	It adjusts the watermarks slightly
      
      Tony Luck has successfully tested for ia64 on Itanium with tiger_defconfig,
      gensparse_defconfig and defconfig.  Bob Picco has also tested and debugged on
      IA64.  Jack Steiner successfully boot tested on a mammoth SGI IA64-based
      machine.  These were on patches against 2.6.17-rc1 and release 3 of these
      patches but there have been no ia64-changes since release 3.
      
      There are differences in the zone sizes for x86_64 as the arch-specific code
      for x86_64 accounts the kernel image and the starting mem_maps as memory holes
      but the architecture-independent code accounts the memory as present.
      
      The big benefit of this set of patches is a sizable reduction of
      architecture-specific code, some of which is very hairy.  There should be a
      greater reduction when other architectures use the same mechanisms for zone
      and hole sizing but I lack the hardware to test on.
      
      Additional credit:
      	Dave Hansen for the initial suggestion and comments on early patches
      	Andy Whitcroft for reviewing early versions and catching numerous
      		errors
      	Tony Luck for testing and debugging on IA64
      	Bob Picco for fixing bugs related to pfn registration, reviewing a
      		number of patch revisions, providing a number of suggestions
      		on future direction and testing heavily
      	Jack Steiner and Robin Holt for testing on IA64 and clarifying
      		issues related to memory holes
      	Yasunori for testing on IA64
      	Andi Kleen for reviewing and feeding back about x86_64
      	Christian Kujau for providing valuable information related to ACPI
      		problems on x86_64 and testing potential fixes
      
      This patch:
      
      Define the structure to represent an active range of page frames within a node
      in an architecture independent manner.  Architectures are expected to register
      active ranges of PFNs using add_active_range(nid, start_pfn, end_pfn) and call
      free_area_init_nodes() passing the PFNs of the end of each zone.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Bob Picco <bob.picco@hp.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Keith Mannthey" <kmannth@gmail.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c713216d
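
      A hedged sketch of how an architecture would use the registration API
      described above.  The node id and PFN values are made-up examples, not
      real platform numbers, and only two zones are filled in for brevity.

          void __init example_arch_zone_setup(void)
          {
                  unsigned long max_zone_pfns[MAX_NR_ZONES];

                  /* 1. register where the active page frames are, per node */
                  add_active_range(0, 0x0000, 0x1000);   /* nid 0, PFNs 0x0000-0x0fff */
                  add_active_range(0, 0x2000, 0x8000);   /* a hole lies in between    */

                  /* 2. pass the end PFN of each zone; the core then computes
                   *    zone sizes and holes architecture-independently        */
                  memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
                  max_zone_pfns[ZONE_DMA]    = 0x1000;
                  max_zone_pfns[ZONE_NORMAL] = 0x8000;
                  free_area_init_nodes(max_zone_pfns);
          }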
  4. 26 September 2006, 7 commits
  5. 02 September 2006, 1 commit
    • [PATCH] ZVC: Scale thresholds depending on the size of the system · df9ecaba
      Committed by Christoph Lameter
      The ZVC counter update threshold is currently set to a fixed value of 32.
      This patch sets up the threshold depending on the number of processors and
      the sizes of the zones in the system.
      
      With the current threshold of 32, I was able to observe slight contention
      when more than 130-140 processors concurrently updated the counters.  The
      contention vanished when I either increased the threshold to 64 or used
      Andrew's idea of overstepping the interval (see ZVC overstep patch).
      
      However, we saw contention again at 220-230 processors.  So we need higher
      values for larger systems.
      
      But the current default is already a bit of overkill for smaller
      systems.  Some systems have tiny zones where precision matters.  For
      example i386 and x86_64 have 16M DMA zones and either 900M ZONE_NORMAL or
      ZONE_DMA32.  These are even present on SMP and NUMA systems.
      
      The patch here sets up a threshold based on the number of processors in the
      system and the size of the zone that these counters are used for.  The
      threshold should grow logarithmically, so we use fls() as an easy
      approximation.
      
      Results of tests on a system with 1024 processors (4TB RAM)
      
      The following output is from a test allocating 1GB of memory concurrently
      on each processor (Forking the process.  So contention on mmap_sem and the
      pte locks is not a factor):
      
                             X                   MIN
      TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU
      fork                   1      0.552      0.552      0.540    0.012      0.552
      fork                   4      0.552      0.548      2.164    0.036      2.200
      fork                  16      0.564      0.548      8.812    0.164      8.976
      fork                 128      0.580      0.572     72.204    1.208     73.412
      fork                 256      1.300      0.660    310.400    2.160    312.560
      fork                 512      3.512      0.696   1526.836    4.816   1531.652
      fork                1020     20.024      0.700  17243.176    6.688  17249.863
      
      So a threshold of 32 is fine up to 128 processors. At 256 processors contention
      becomes a factor.
      
      Overstepping the counter (earlier patch) improves the numbers a bit:
      
      fork                   4      0.552      0.548      2.164    0.040      2.204
      fork                  16      0.552      0.548      8.640    0.148      8.788
      fork                 128      0.556      0.548     69.676    0.956     70.632
      fork                 256      0.876      0.636    212.468    2.108    214.576
      fork                 512      2.276      0.672    997.324    4.260   1001.584
      fork                1020     13.564      0.680  11586.436    6.088  11592.523
      
      Still contention at 512 and 1020. Contention at 1020 is down by a third.
      256 still has a slight bit of contention.
      
      After this patch the counter threshold will be set to 125 which reduces
      contention significantly:
      
      fork                 128      0.560      0.548     69.776    0.932     70.708
      fork                 256      0.636      0.556    143.460    2.036    145.496
      fork                 512      0.640      0.548    284.244    4.236    288.480
      fork                1020      1.500      0.588   1326.152    8.892   1335.044
      
      [akpm@osdl.org: !SMP build fix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      df9ecaba
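
      A hedged sketch of the scaling idea: grow the per-cpu counter threshold
      roughly logarithmically with both the number of online processors and the
      zone size, using fls() as the cheap log2 approximation mentioned above,
      and cap it at 125 (the value quoted for the 1024-processor machine).  The
      exact coefficients are illustrative, not quoted from the patch.

          static int calculate_threshold_sketch(struct zone *zone)
          {
                  int mem;        /* zone size in units of 128MB */
                  int threshold;

                  mem = zone->present_pages >> (27 - PAGE_SHIFT);
                  threshold = 2 * fls(num_online_cpus()) * (1 + fls(mem));

                  /* small zones and small SMP systems stay well below the cap */
                  return min(125, threshold);
          }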
  6. 04 July 2006, 1 commit
    • [PATCH] ZVC/zone_reclaim: Leave 1% of unmapped pagecache pages for file I/O · 9614634f
      Committed by Christoph Lameter
      It turns out that it is advantageous to leave a small portion of unmapped file
      backed pages if all of a zone's pages (or almost all pages) are allocated and
      so the page allocator has to go off-node.
      
      This allows recently used file I/O buffers to stay on the node and
      reduces the times that zone reclaim is invoked if file I/O occurs
      when we run out of memory in a zone.
      
      The problem is that zone reclaim runs too frequently when the page cache is
      used for file I/O alone (read/write, and therefore unmapped pages!) and we
      have almost all pages of the zone allocated.  Zone reclaim may remove 32
      unmapped pages.  File I/O will reuse these pages for the next read/write
      requests and the number of unmapped pages grows again.  Once the zone has
      filled up, zone reclaim runs again after only 32 pages.  This cycle is too
      inefficient and there are potentially too many zone reclaim cycles.
      
      With the 1% boundary we may still remove all unmapped file I/O pages in a
      zone reclaim pass.  However, it will take a large number of reads and writes
      to get back above 1%, at which point zone reclaim triggers again.
      
      Zone reclaim in 2.6.16/17 does not show this behavior because it has a 30
      second timeout.
      
      [akpm@osdl.org: rename the /proc file and the variable]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9614634f
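
      A hedged sketch of the check described above: skip zone reclaim unless the
      zone still has more unmapped pagecache pages than the configured floor
      (about 1% of the zone by default, via the sysctl that akpm renamed).  The
      field name min_unmapped_pages is illustrative, not necessarily the name
      used in the patch.

          static int zone_reclaim_worthwhile(struct zone *zone)
          {
                  unsigned long unmapped =
                          zone_page_state(zone, NR_FILE_PAGES) -
                          zone_page_state(zone, NR_FILE_MAPPED);

                  /* only bother when reclaim can actually find unmapped pagecache */
                  return unmapped > zone->min_unmapped_pages;
          }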
  7. 01 July 2006, 12 commits
    • [PATCH] Use Zoned VM Counters for NUMA statistics · ca889e6c
      Committed by Christoph Lameter
      The numa statistics are really event counters.  But they are per node and
      so we have had special treatment for these counters through additional
      fields on the pcp structure.  We can now use the per zone nature of the
      zoned VM counters to realize these.
      
      This will shrink the size of the pcp structure on NUMA systems.  We will
      have some room to add additional per zone counters that will all still fit
      in the same cacheline.
      
       Bits	Prior pcp size	  	Size after patch	We can add
       ------------------------------------------------------------------
       64	128 bytes (16 words)	80 bytes (10 words)	48
       32	 76 bytes (19 words)	56 bytes (14 words)	8 (64 byte cacheline)
      							72 (128 byte)
      
      Remove the special statistics for numa and replace them with zoned vm
      counters.  This has the side effect that global sums of these events now
      show up in /proc/vmstat.
      
      Also take the opportunity to move the zone_statistics() function from
      page_alloc.c into vmstat.c.
      
      Discussions:
      V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ca889e6c
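
      A hedged sketch of the bookkeeping zone_statistics() performs once the
      NUMA counters are ordinary zoned VM counters: classify each successful
      allocation against the preferred zone and the local node.  The branch
      structure is paraphrased from the usual meaning of
      numa_hit/miss/foreign/local/other, not quoted from the patch.

          static inline void zone_statistics_sketch(struct zone *preferred, struct zone *z)
          {
                  if (z->zone_pgdat == preferred->zone_pgdat) {
                          __inc_zone_state(z, NUMA_HIT);
                  } else {
                          __inc_zone_state(z, NUMA_MISS);
                          __inc_zone_state(preferred, NUMA_FOREIGN);
                  }
                  if (z->zone_pgdat == NODE_DATA(numa_node_id()))
                          __inc_zone_state(z, NUMA_LOCAL);
                  else
                          __inc_zone_state(z, NUMA_OTHER);
          }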
    • [PATCH] zoned vm counters: conversion of nr_bounce to per zone counter · d2c5e30c
      Committed by Christoph Lameter
      Conversion of nr_bounce to a per zone counter
      
      nr_bounce is only used for proc output.  So it could be left as an event
      counter.  However, the event counters may not be accurate and nr_bounce is
      categorizing types of pages in a zone.  So we really need this to also be a
      per zone counter.
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d2c5e30c
    • [PATCH] zoned vm counters: conversion of nr_unstable to per zone counter · fd39fc85
      Committed by Christoph Lameter
      Conversion of nr_unstable to a per zone counter
      
      We need to do some special modifications to the nfs code since there are
      multiple cases of disposition and we need to have a page ref for proper
      accounting.
      
      This converts the last critical page state of the VM and therefore we need to
      remove several functions that were depending on GET_PAGE_STATE_LAST in order
      to make the kernel compile again.  We are only left with event type counters
      in page state.
      
      [akpm@osdl.org: bugfixes]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fd39fc85
    • [PATCH] zoned vm counters: conversion of nr_writeback to per zone counter · ce866b34
      Committed by Christoph Lameter
      Conversion of nr_writeback to per zone counter.
      
      This removes the last page_state counter from arch/i386/mm/pgtable.c so we
      drop the page_state from there.
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ce866b34
    • [PATCH] zoned vm counters: conversion of nr_dirty to per zone counter · b1e7a8fd
      Committed by Christoph Lameter
      This makes nr_dirty a per zone counter.  Looping over all processors is
      avoided during writeback state determination.
      
      The counter aggregation for nr_dirty had to be undone in the NFS layer since
      we summed up the page counts from multiple zones.  Someone more familiar with
      NFS should probably review what I have done.
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b1e7a8fd
    • [PATCH] zoned vm counters: conversion of nr_pagetables to per zone counter · df849a15
      Committed by Christoph Lameter
      Conversion of nr_page_table_pages to a per zone counter
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      df849a15
    • [PATCH] zoned vm counters: conversion of nr_slab to per zone counter · 9a865ffa
      Committed by Christoph Lameter
      - Allows reclaim to access counter without looping over processor counts.
      
      - Allows accurate statistics on how many pages are used in a zone by
        the slab. This may become useful to balance slab allocations over
        various zones.
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9a865ffa
    • [PATCH] zoned vm counters: zone_reclaim: remove /proc/sys/vm/zone_reclaim_interval · 34aa1330
      Committed by Christoph Lameter
      The zone_reclaim_interval was necessary because we were not able to determine
      how many unmapped pages exist in a zone.  Therefore we had to scan in
      intervals to figure out if any pages were unmapped.
      
      With the zoned counters and NR_ANON_PAGES we now know the number of pagecache
      pages and the number of mapped pages in a zone.  So we can simply skip the
      reclaim if there is an insufficient number of unmapped pages.  We use
      SWAP_CLUSTER_MAX as the boundary.
      
      Drop all support for /proc/sys/vm/zone_reclaim_interval.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      34aa1330
    • [PATCH] zoned vm counters: split NR_ANON_PAGES off from NR_FILE_MAPPED · f3dbd344
      Committed by Christoph Lameter
      The current NR_FILE_MAPPED is used by zone reclaim and the dirty load
      calculation as the number of mapped pagecache pages.  However, that is not
      true.  NR_FILE_MAPPED includes the mapped anonymous pages.  This patch
      separates those and therefore allows an accurate tracking of the anonymous
      pages per zone.
      
      It then becomes possible to determine the number of unmapped pages per zone
      and we can avoid scanning for unmapped pages if there are none.
      
      Also it may now be possible to determine the mapped/unmapped ratio in
      get_dirty_limit.  Isn't the number of anonymous pages irrelevant in that
      calculation?
      
      Note that this will change the meaning of the number of mapped pages reported
      in /proc/vmstat /proc/meminfo and in the per node statistics.  This may affect
      user space tools that monitor these counters!  NR_FILE_MAPPED works like
      NR_FILE_DIRTY.  It is only valid for pagecache pages.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f3dbd344
    • [PATCH] zoned vm counters: conversion of nr_pagecache to per zone counter · 347ce434
      Committed by Christoph Lameter
      Currently a single atomic variable is used to establish the size of the page
      cache in the whole machine.  The zoned VM counters have the same method of
      implementation as the nr_pagecache code but also allow the determination of
      the pagecache size per zone.
      
      Remove the special implementation for nr_pagecache and make it a zoned counter
      named NR_FILE_PAGES.
      
      Updates of the page cache counters are always performed with interrupts off.
      We can therefore use the __ variant here.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      347ce434
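
      A hedged sketch of the point about the __ variants made above: page cache
      insertion and removal run with interrupts already disabled, so the cheaper
      non-irq-safe counter update can be used.  The wrapper names are
      illustrative, not the verbatim add_to_page_cache() code.

          static void account_pagecache_insert(struct page *page)
          {
                  /* caller guarantees interrupts are disabled */
                  __inc_zone_page_state(page, NR_FILE_PAGES);
          }

          static void account_pagecache_remove(struct page *page)
          {
                  __dec_zone_page_state(page, NR_FILE_PAGES);
          }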
    • [PATCH] zoned vm counters: convert nr_mapped to per zone counter · 65ba55f5
      Committed by Christoph Lameter
      nr_mapped is important because it allows a determination of how many pages of
      a zone are not mapped, which would allow a more efficient means of determining
      when we need to reclaim memory in a zone.
      
      We take the nr_mapped field out of the page state structure and define a new
      per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
      from NR_MAPPED in the next patch).
      
      We replace the use of nr_mapped in various kernel locations.  This avoids the
      looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
      zone reclaim).
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      65ba55f5
    • [PATCH] zoned vm counters: basic ZVC (zoned vm counter) implementation · 2244b95a
      Committed by Christoph Lameter
      Per zone counter infrastructure
      
      The counters that we currently have for the VM are split per processor.  The
      processor however has not much to do with the zone these pages belong to.  We
      cannot tell f.e.  how many ZONE_DMA pages are dirty.
      
      So we are blind to potential imbalances in the usage of memory in various
      zones.  F.e.  in a NUMA system we cannot tell how many pages are dirty on a
      particular node.  If we knew then we could put measures into the VM to balance
      the use of memory between different zones and different nodes in a NUMA
      system.  For example it would be possible to limit the dirty pages per node so
      that fast local memory is kept available even if a process is dirtying huge
      amounts of pages.
      
      Another example is zone reclaim.  We do not know how many unmapped pages exist
      per zone.  So we just have to try to reclaim.  If it is not working then we
      pause and try again later.  It would be better if we knew when it makes sense
      to reclaim unmapped pages from a zone.  This patchset allows the determination
      of the number of unmapped pages per zone.  We can remove the zone reclaim
      interval with the counters introduced here.
      
      Furthermore, the ability to have various usage statistics available will allow
      the development of new NUMA balancing algorithms that may be able to improve
      the decision making in the scheduler of when to move a process to another node
      and hopefully will also enable automatic page migration through a user space
      program that can analyse the memory load distribution and then rebalance
      memory use in order to increase performance.
      
      The counter framework here implements differential counters for each processor
      in struct zone.  The differential counters are consolidated when a threshold
      is exceeded (as done in the current implementation for nr_pagecache), when
      slab reaping occurs or when a consolidation function is called.
      
      Consolidation uses atomic operations and accumulates counters per zone in the
      zone structure and also globally in the vm_stat array.  VM functions can
      access the counts by simply indexing a global or zone specific array.
      
      The arrangement of counters in an array also simplifies processing when output
      has to be generated for /proc/*.
      
      Counters can be updated by calling inc/dec_zone_page_state or
      __inc/dec_zone_page_state, analogous to *_page_state.  The second group of
      functions can be called if it is known that interrupts are disabled.
      
      Special optimized increment and decrement functions are provided.  These can
      avoid certain checks and use increment or decrement instructions that an
      architecture may provide.
      
      We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
      do DMA to all memory and therefore ZONE_NORMAL will not be populated.  This is
      only currently set for IA64 SGI SN2 and currently only affects
      node_page_state().  In the best case node_page_state can be reduced to
      retrieving a single counter for the one zone on the node.
      
      [akpm@osdl.org: cleanups]
      [akpm@osdl.org: export vm_stat[] for filesystems]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      2244b95a
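
      A hedged sketch of the differential counter scheme described above: each
      cpu accumulates a small signed per-zone delta and folds it into the
      zone-wide and global atomics only when the delta crosses the threshold
      (a fixed 32 in this first implementation, per the scaling entry earlier on
      this page).  Names such as vm_stat_diff and STAT_THRESHOLD are paraphrased
      from the description, not quoted from the patch.

          #define STAT_THRESHOLD 32

          void __mod_zone_page_state_sketch(struct zone *zone,
                                            enum zone_stat_item item, int delta)
          {
                  struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
                  s8 *p = pcp->vm_stat_diff + item;
                  long x = delta + *p;

                  if (unlikely(x > STAT_THRESHOLD || x < -STAT_THRESHOLD)) {
                          /* consolidate: push the accumulated delta out of this cpu */
                          atomic_long_add(x, &zone->vm_stat[item]);
                          atomic_long_add(x, &vm_stat[item]);
                          x = 0;
                  }
                  *p = x;
          }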
  8. 23 June 2006, 3 commits
  9. 06 June 2006, 1 commit
    • [PATCH] Sparsemem build fix · 93ff66bf
      Committed by Ralf Baechle
      From: Ralf Baechle <ralf@linux-mips.org>
      
      <linux/mmzone.h> uses PAGE_SIZE, PAGE_SHIFT from <asm/page.h> without
      including that header itself.  For some sparsemem configurations this may
      result in build errors like:
      
        CC      init/initramfs.o
      In file included from include/linux/gfp.h:4,
                       from include/linux/slab.h:15,
                       from include/linux/percpu.h:4,
                       from include/linux/rcupdate.h:41,
                       from include/linux/dcache.h:10,
                       from include/linux/fs.h:226,
                       from init/initramfs.c:2:
      include/linux/mmzone.h:498:22: warning: "PAGE_SHIFT" is not defined
      In file included from include/linux/gfp.h:4,
                       from include/linux/slab.h:15,
                       from include/linux/percpu.h:4,
                       from include/linux/rcupdate.h:41,
                       from include/linux/dcache.h:10,
                       from include/linux/fs.h:226,
                       from init/initramfs.c:2:
      include/linux/mmzone.h:526: error: `PAGE_SIZE' undeclared here (not in a function)
      include/linux/mmzone.h: In function `__pfn_to_section':
      include/linux/mmzone.h:573: error: `PAGE_SHIFT' undeclared (first use in this function)
      include/linux/mmzone.h:573: error: (Each undeclared identifier is reported only once
      include/linux/mmzone.h:573: error: for each function it appears in.)
      include/linux/mmzone.h: In function `pfn_valid':
      include/linux/mmzone.h:578: error: `PAGE_SHIFT' undeclared (first use in this function)
      make[1]: *** [init/initramfs.o] Error 1
      make: *** [init] Error 2
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Seems-reasonable-to: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      93ff66bf
  10. 22 May 2006, 1 commit
    • [PATCH] Align the node_mem_map endpoints to a MAX_ORDER boundary · e984bb43
      Committed by Bob Picco
      Andy added code to the buddy allocator which does not require the zone's
      endpoints to be aligned to MAX_ORDER.  An issue is that the buddy allocator
      requires the node_mem_map's endpoints to be MAX_ORDER aligned.  Otherwise
      __page_find_buddy could compute a buddy not in node_mem_map for partial
      MAX_ORDER regions at zone's endpoints.  page_is_buddy will detect that
      these pages at endpoints are not PG_buddy (they were zeroed out by bootmem
      allocator and not part of zone).  Of course the negative here is we could
      waste a little memory but the positive is eliminating all the old checks
      for zone boundary conditions.
      
      SPARSEMEM won't encounter this issue because of MAX_ORDER size constraint
      when SPARSEMEM is configured.  ia64 VIRTUAL_MEM_MAP doesn't need the logic
      either because the holes and endpoints are handled differently.  This
      leaves checking alloc_remap and other arches which privately allocate for
      node_mem_map.
      Signed-off-by: Bob Picco <bob.picco@hp.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e984bb43
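
      A hedged sketch of the fix described above: when sizing node_mem_map,
      round the starting PFN down and the ending PFN up to a MAX_ORDER boundary
      so __page_find_buddy() can never compute a buddy outside the map.
      MAX_ORDER_NR_PAGES is used as the "pages per maximal buddy block"
      constant; the surrounding allocation code is elided.

          static unsigned long __init node_mem_map_size_sketch(struct pglist_data *pgdat)
          {
                  unsigned long start, end;

                  start = pgdat->node_start_pfn & ~(MAX_ORDER_NR_PAGES - 1);
                  end   = pgdat->node_start_pfn + pgdat->node_spanned_pages;
                  end   = ALIGN(end, MAX_ORDER_NR_PAGES);

                  /* a few struct pages may be wasted at the edges, but the
                   * old zone-boundary checks in the buddy code go away      */
                  return (end - start) * sizeof(struct page);
          }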
  11. 26 April 2006, 1 commit
  12. 28 March 2006, 5 commits
    • [PATCH] uninline zone helpers · 95144c78
      Committed by KAMEZAWA Hiroyuki
      The helper functions for for_each_online_pgdat/for_each_zone look too big
      to be inlined.  The speed of these helpers themselves is not very important
      (the inner loops tend to do more work than this).
      
      This patch makes the helper functions out-of-line.
      
      	inline		out-of-line
      .text   005c0680        005bf6a0
      
      005c0680 - 005bf6a0 = 0xFE0 ≈ 4 Kbytes.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      95144c78
    • [PATCH] for_each_online_pgdat: remove pgdat_list · ae0f15fb
      Committed by KAMEZAWA Hiroyuki
      By using for_each_online_pgdat(), pgdat_list is not necessary now.  This patch
      removes it.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ae0f15fb
    • [PATCH] define for_each_online_pgdat · 8357f869
      Committed by KAMEZAWA Hiroyuki
      This patch defines for_each_online_pgdat() as a replacement for
      for_each_pgdat().
      
      Currently, online nodes are managed by node_online_map, but for_each_pgdat()
      uses pgdat_link to iterate over all nodes (pgdats).  This means the
      management structure for online pgdats is duplicated.
      
      Using node_online_map for the iteration is simpler and saner than using
      pgdat_link.  The new macro is named for_each_online_pgdat().  A following
      patch will fix the callers of for_each_pgdat().
      
      The bootmem allocator uses for_each_pgdat() before pgdat initialization.
      I don't think that is sane; a following patch will fix it.
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      8357f869
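
      A hedged sketch of the replacement iterator described above, built on the
      online node map rather than the old pgdat_link chain.  The helper names
      and bodies are illustrative (they anticipate the "uninline zone helpers"
      entry earlier on this page), not the verbatim patch.

          #define for_each_online_pgdat(pgdat)                    \
                  for (pgdat = first_online_pgdat();              \
                       pgdat;                                     \
                       pgdat = next_online_pgdat(pgdat))

          struct pglist_data *first_online_pgdat(void)
          {
                  return NODE_DATA(first_online_node);
          }

          struct pglist_data *next_online_pgdat(struct pglist_data *pgdat)
          {
                  int nid = next_online_node(pgdat->node_id);

                  if (nid == MAX_NUMNODES)        /* ran past the last online node */
                          return NULL;
                  return NODE_DATA(nid);
          }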
    • [PATCH] remove zone_mem_map · a0140c1d
      Committed by KAMEZAWA Hiroyuki
      This patch removes zone_mem_map.
      
      pfn_to_page uses pgdat, while page_to_pfn uses zone and is the only user of
      zone_mem_map.  page_to_pfn can use pgdat instead of zone; by modifying it,
      we can remove zone_mem_map.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Christoph Lameter <christoph@lameter.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a0140c1d
    • [PATCH] unify pfn_to_page: generic functions · a117e66e
      Committed by KAMEZAWA Hiroyuki
      There are 3 memory models, FLATMEM, DISCONTIGMEM, SPARSEMEM.
      Each arch has its own page_to_pfn(), pfn_to_page() for each model.
      But most of them can use the same arithmetic.
      
      This patch adds asm-generic/memory_model.h, which includes generic
      page_to_pfn(), pfn_to_page() definitions for each memory model.
      
      When CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y, out-of-line functions are
      used instead of macros.  This is enabled by some archs and reduces
      text size.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a117e66e
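
      A hedged sketch of the kind of generic definitions asm-generic/memory_model.h
      provides, one pair per memory model.  ARCH_PFN_OFFSET stands for the
      platform's first PFN; the macros are shown as they are commonly defined,
      not quoted from the patch, and the DISCONTIGMEM variant (which indirects
      through NODE_DATA()) is omitted for brevity.

          #if defined(CONFIG_FLATMEM)
          #define __pfn_to_page(pfn)   (mem_map + ((pfn) - ARCH_PFN_OFFSET))
          #define __page_to_pfn(page)  ((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)

          #elif defined(CONFIG_SPARSEMEM)
          /* the section's encoded mem_map already accounts for the section start */
          #define __pfn_to_page(pfn)                                        \
          ({      unsigned long __pfn = (pfn);                              \
                  __section_mem_map_addr(__pfn_to_section(__pfn)) + __pfn;  \
          })
          #endif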
  13. 02 February 2006, 2 commits