1. 12 Feb, 2007 6 commits
  2. 10 Feb, 2007 1 commit
  3. 01 Feb, 2007 1 commit
  4. 12 Jan, 2007 1 commit
  5. 06 Jan, 2007 2 commits
    • C
      [PATCH] Check for populated zone in __drain_pages · f2e12bb2
      Christoph Lameter committed
      Both process_zones() and drain_node_pages() check for populated zones
      before touching pagesets.  However, __drain_pages does not do so.
      
      This may result in a NULL pointer dereference for pagesets in unpopulated
      zones if a NUMA setup is combined with cpu hotplug.
      
      Initially the unpopulated zone has the pcp pointers pointing to the boot
      pagesets.  Since the zone is not populated the boot pageset pointers will
      not be changed during page allocator and slab bootstrap.
      
      If a cpu is later brought down (first call to __drain_pages()) then the pcp
      pointers for cpus in unpopulated zones are set to NULL since __drain_pages
      does not first check for an unpopulated zone.
      
      If the cpu is then brought up again then we call process_zones() which will
      ignore the unpopulated zone.  So the pageset pointers will still be NULL.
      
      If the cpu is then again brought down then __drain_pages will attempt to
      drain pages by following the NULL pageset pointer for unpopulated zones.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f2e12bb2
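The guard described in this commit can be sketched as follows. This is a minimal illustration, not the kernel code: `struct zone_sketch`, `struct pageset_sketch`, `NR_CPUS_SKETCH` and `drain_pages_sketch()` are hypothetical stand-ins, and only the idea of skipping unpopulated zones before touching their pagesets comes from the message above.

```c
#include <stddef.h>

#define NR_CPUS_SKETCH 2

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct pageset_sketch {
    int count;                      /* pages currently held by this pageset */
};

struct zone_sketch {
    unsigned long present_pages;    /* 0 means the zone is unpopulated */
    struct pageset_sketch *pageset[NR_CPUS_SKETCH]; /* may be NULL if unpopulated */
};

static int populated_zone_sketch(const struct zone_sketch *z)
{
    return z->present_pages != 0;
}

/* Drain one cpu's pageset in every populated zone; returns pages drained. */
static int drain_pages_sketch(struct zone_sketch *zones, size_t nr_zones, int cpu)
{
    int drained = 0;

    for (size_t i = 0; i < nr_zones; i++) {
        struct zone_sketch *z = &zones[i];
        struct pageset_sketch *pset;

        if (!populated_zone_sketch(z))
            continue;   /* the added check: never follow pagesets here */

        pset = z->pageset[cpu];
        drained += pset->count;
        pset->count = 0;
    }
    return drained;
}
```

Without the `continue`, the second zone in a setup like `{ populated, unpopulated-with-NULL-pagesets }` would be dereferenced through a NULL pointer, which is the failure the commit describes.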
    • P
      [PATCH] Sanely size hash tables when using large base pages · 9ab37b8f
      Paul Mundt committed
      At the moment the inode/dentry cache hash tables (common by way of
      alloc_large_system_hash()) are incorrectly sized by their respective
      detection logic when we attempt to use large base pages on systems with
      little memory.
      
      This results in odd behaviour when using a 64kB PAGE_SIZE, such as:
      
      Dentry cache hash table entries: 8192 (order: -1, 32768 bytes)
      Inode-cache hash table entries: 4096 (order: -2, 16384 bytes)
      
      The mount cache hash table is seemingly the only one that gets this right
      by directly taking PAGE_SIZE into account.
      
      The following patch attempts to catch the bogus values and round them up
      to at least order 0.
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9ab37b8f
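The negative orders quoted above fall out of simple arithmetic, sketched here under the assumption of a 64kB PAGE_SIZE. The helper names (`log2_floor_sketch`, `raw_order_sketch`, `clamped_order_sketch`) are hypothetical and this is not the code of alloc_large_system_hash(); it only shows how subtracting the page shift from log2 of the table size goes negative and how rounding up to order 0 avoids that.

```c
/* Assumed 64kB base pages, as in the commit message (2^16 bytes). */
#define PAGE_SHIFT_SKETCH 16

/* Floor of log2(v); returns -1 for v == 0. */
static int log2_floor_sketch(unsigned long v)
{
    int r = -1;

    while (v) {
        v >>= 1;
        r++;
    }
    return r;
}

/* Order as reported in the boot log: log2(bytes) minus the page shift. */
static int raw_order_sketch(unsigned long bytes)
{
    return log2_floor_sketch(bytes) - PAGE_SHIFT_SKETCH;
}

/* Same computation, rounded up to at least order 0. */
static int clamped_order_sketch(unsigned long bytes)
{
    int order = raw_order_sketch(bytes);

    return order < 0 ? 0 : order;
}
```

For the 32768-byte dentry table the raw computation gives -1 and for the 16384-byte inode table it gives -2, matching the log lines above; the clamped variant returns 0 for both.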
  6. 14 Dec, 2006 1 commit
    • P
      [PATCH] cpuset: rework cpuset_zone_allowed api · 02a0e53d
      Paul Jackson committed
      Elaborate the API for calling cpuset_zone_allowed(), so that users have to
      explicitly choose between the two variants:
      
        cpuset_zone_allowed_hardwall()
        cpuset_zone_allowed_softwall()
      
      Until now, whether or not you got the hardwall flavor depended solely on
      whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
      argument.
      
      If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
      version.
      
      Unfortunately, this meant that users would end up with the softwall version
      without thinking about it.  Since only the softwall version might sleep,
      this led to bugs with possible sleeping in interrupt context on more than
      one occasion.
      
      The hardwall version requires that the current task's mems_allowed allows
      the node of the specified zone (or that you're in interrupt or that
      __GFP_THISNODE is set or that you're on a one cpuset system.)
      
      The softwall version, depending on the gfp_mask, might allow a node if it
      was allowed in the nearest enclosing cpuset marked mem_exclusive (which
      requires taking the cpuset lock 'callback_mutex' to evaluate.)
      
      This patch removes the cpuset_zone_allowed() call, and forces the caller to
      explicitly choose between the hardwall and the softwall case.
      
      If the caller wants the gfp_mask to determine this choice, they should (1)
      be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
      cpuset_zone_allowed_softwall() routine.
      
      This adds another 100 or 200 bytes to the kernel text space, due to the few
      lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
      routines.  It should save a few instructions executed for the calls that
      turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
      set (before the call) then check (within the call) the __GFP_HARDWALL flag.
      
      For the most critical call, from get_page_from_freelist(), the same
      instructions are executed as before -- the old cpuset_zone_allowed()
      routine it used to call is the same code as the
      cpuset_zone_allowed_softwall() routine that it calls now.
      
      Not a perfect win, but it seems worth it to reduce the chance of hitting a
      sleeping-with-irqs-off complaint again.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      02a0e53d
  7. 09 Dec, 2006 3 commits
    • D
      [PATCH] fault-injection: defaults likely to please a new user · 6b1b60f4
      Don Mullis committed
      Assign defaults most likely to please a new user:
       1) generate some logging output
          (verbose=2)
       2) avoid injecting failures likely to lock up UI
          (ignore_gfp_wait=1, ignore_gfp_highmem=1)
      Signed-off-by: Don Mullis <dwm@meer.net>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6b1b60f4
    • A
      [PATCH] fault-injection capability for alloc_pages() · 933e312e
      Akinobu Mita committed
      This patch provides fault-injection capability for alloc_pages().
      
      Boot option:
      
      fail_page_alloc=<interval>,<probability>,<space>,<times>
      
      	<interval> -- specifies the interval of failures.
      
      	<probability> -- specifies how often it should fail in percent.
      
      	<space> -- specifies the size of free space where memory can be
      		   allocated safely in pages.
      
      	<times> -- specifies how many times failures may happen at most.
      
      Debugfs:
      
      /debug/fail_page_alloc/interval
      /debug/fail_page_alloc/probability
      /debug/fail_page_alloc/space
      /debug/fail_page_alloc/times
      /debug/fail_page_alloc/ignore-gfp-highmem
      /debug/fail_page_alloc/ignore-gfp-wait
      
      Example:
      
      	fail_page_alloc=10,100,0,-1
      
      With these settings, page allocation (alloc_pages(), ...) fails once every 10 calls.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      933e312e
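How the `<interval>`, `<probability>` and `<times>` parameters could combine is sketched below. This is an assumption-laden illustration: `struct fail_attr_sketch` and `should_fail_sketch()` are hypothetical names, the `<space>` parameter is left out entirely, and the C library's `rand()` stands in for whatever random source the kernel uses.

```c
#include <stdlib.h>

/* Hypothetical attribute block; fields mirror the boot option parameters. */
struct fail_attr_sketch {
    unsigned int interval;      /* fail at most once per this many calls */
    unsigned int probability;   /* percent chance of failing an eligible call */
    int times;                  /* failures left; -1 means no limit */
    unsigned long count;        /* calls counted so far */
};

/* Returns 1 if this call should be made to fail, 0 otherwise. */
static int should_fail_sketch(struct fail_attr_sketch *attr)
{
    if (attr->times == 0)
        return 0;                       /* failure budget exhausted */

    if (attr->interval > 1) {
        attr->count++;
        if (attr->count % attr->interval)
            return 0;                   /* not on an interval boundary */
    }

    if (attr->probability <= (unsigned int)(rand() % 100))
        return 0;                       /* lost the percentage roll */

    if (attr->times > 0)
        attr->times--;                  /* consume one allowed failure */
    return 1;
}
```

With the values from the example (`interval=10`, `probability=100`, `times=-1`), every tenth call fails and there is no upper limit on the number of failures.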
    • D
      [PATCH] LOG2: Implement a general integer log2 facility in the kernel · f0d1b0b3
      David Howells committed
      This facility provides three entry points:
      
      	ilog2()		Log base 2 of unsigned long
      	ilog2_u32()	Log base 2 of u32
      	ilog2_u64()	Log base 2 of u64
      
      These facilities can either be used inside functions on dynamic data:
      
      	int do_something(long q)
      	{
      		...;
      		y = ilog2(q);
      		...;
      	}
      
      Or can be used to statically initialise global variables with constant values:
      
      	unsigned n = ilog2(27);
      
      When performing static initialisation, the compiler will report "error:
      initializer element is not constant" if asked to take a log of zero or of
      something not reducible to a constant.  They treat negative numbers as
      unsigned.
      
      When not dealing with a constant, they fall back to using fls() which permits
      them to use arch-specific log calculation instructions - such as BSR on
      x86/x86_64 or SCAN on FRV - if available.
      
      [akpm@osdl.org: MMC fix]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Wojtek Kaniewski <wojtekka@toxygen.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f0d1b0b3
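The non-constant fallback path mentioned above (log2 derived from fls()) can be illustrated with a portable sketch. `fls_sketch()` and `ilog2_sketch()` are hypothetical userspace stand-ins: the kernel's versions may use arch-specific instructions, and the compile-time constant path is not shown here.

```c
/*
 * Find last set: 1-based position of the most significant set bit,
 * or 0 when no bit is set. A plain loop replaces any arch instruction.
 */
static int fls_sketch(unsigned long x)
{
    int pos = 0;

    while (x) {
        x >>= 1;
        pos++;
    }
    return pos;
}

/* Floor of log base 2; the result for 0 is -1 in this sketch. */
static int ilog2_sketch(unsigned long n)
{
    return fls_sketch(n) - 1;
}
```

For the value used in the static-initialisation example, `ilog2_sketch(27)` evaluates to 4, since 16 <= 27 < 32.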
  8. 08 Dec, 2006 13 commits
    • H
      [PATCH] struct seq_operations and struct file_operations constification · 15ad7cdc
      Helge Deller committed
       - move some file_operations structs into the .rodata section
      
       - move static strings from policy_types[] array into the .rodata section
      
       - fix generic seq_operations usages, so that those structs may be defined
         as "const" as well
      
      [akpm@osdl.org: couple of fixes]
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      15ad7cdc
    • I
      [PATCH] hotplug CPU: clean up hotcpu_notifier() use · 02316067
      Ingo Molnar committed
      There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
      prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
      generating compiler warnings of unused symbols, hence forcing people to add
      #ifdefs.
      
      The compiler can skip truly unused functions just fine:
      
          text    data     bss     dec     hex filename
       1624412  728710 3674856 6027978  5bfaca vmlinux.before
       1624412  728710 3674856 6027978  5bfaca vmlinux.after
      
      [akpm@osdl.org: topology.c fix]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      02316067
    • A
      [PATCH] remove HASH_HIGHMEM · 04903664
      Andrew Morton committed
      It has no users and it's doubtful that we'll need it again.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      04903664
    • A
      [PATCH] mm: make compound page destructor handling explicit · 33f2ef89
      Andy Whitcroft committed
      Currently we use the lru head link of the second page of a compound page
      to hold its destructor.  This was ok when it was purely an internal
      implementation detail.  However, hugetlbfs overrides this destructor,
      violating the layering.  Abstract this out as explicit calls, also
      introduce a type for the callback function allowing them to be type
      checked.  For each callback we pre-declare the function, causing a type
      error on definition rather than on use elsewhere.
      
      [akpm@osdl.org: cleanups]
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      33f2ef89
    • C
      [PATCH] GFP_THISNODE must not trigger global reclaim · 952f3b51
      Christoph Lameter committed
      The intent of GFP_THISNODE is to make sure that an allocation occurs on a
      particular node.  If this is not possible then NULL needs to be returned so
      that the caller can choose what to do next on its own (the slab allocator
      depends on that).
      
      However, GFP_THISNODE currently triggers reclaim before returning a failure
      (GFP_THISNODE means GFP_NORETRY is set).  If we have over allocated a node
      then we will currently do some reclaim before returning NULL.  The caller
      may want memory from other nodes before reclaim should be triggered.  (If
      the caller wants reclaim then he can directly use __GFP_THISNODE instead).
      
      There is no flag to avoid reclaim in the page allocator and adding yet
      another GFP_xx flag would be difficult given that we are out of available
      flags.
      
      So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE,
      __GFP_NORETRY and __GFP_NOWARN) are set.  If so then we return NULL before
      waking up kswapd.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      952f3b51
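The "compare and see if all bits are set" test can be sketched as a mask comparison. The bit values below are made up for the illustration (the real flag values live in the kernel headers), and `skip_reclaim_sketch()` is a hypothetical helper name.

```c
/* Hypothetical bit values, chosen only for this sketch. */
#define SK_GFP_NORETRY      0x1u
#define SK_GFP_NOWARN       0x2u
#define SK_GFP_THISNODE_BIT 0x4u    /* stands in for __GFP_THISNODE */

/* Composite flag: all three bits together, as the commit describes. */
#define SK_GFP_THISNODE \
    (SK_GFP_THISNODE_BIT | SK_GFP_NORETRY | SK_GFP_NOWARN)

/*
 * Returns 1 when every bit of the composite flag is present, meaning the
 * allocation should return NULL before any reclaim is started.
 */
static int skip_reclaim_sketch(unsigned int gfp_mask)
{
    return (gfp_mask & SK_GFP_THISNODE) == SK_GFP_THISNODE;
}
```

Comparing against the full mask, rather than testing for any overlap, is what lets a caller that passes only the single node bit still get reclaim, as the parenthetical in the message points out.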
    • A
      [PATCH] mm: cleanup indentation on switch for CPU operations · ce421c79
      Andy Whitcroft committed
      These patches introduced new switch statements which are indented contrary
      to the consensus in mm/*.c.  Fix them up to match that consensus.
      
          [PATCH] node local per-cpu-pages
          [PATCH] ZVC: Scale thresholds depending on the size of the system
          commit e7c8d5c9
          commit df9ecaba
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ce421c79
    • A
      [PATCH] numa node ids are int, page_to_nid and zone_to_nid should return int · 25ba77c1
      Andy Whitcroft committed
      NUMA node ids are passed as either int or unsigned int almost exclusively,
      yet page_to_nid and zone_to_nid both return unsigned long.  This is a
      throwback to when page_to_nid was a #define and was thus exposing the real
      type of the page flags field.
      
      In addition to fixing up the definitions of page_to_nid and zone_to_nid I
      audited the users of these functions identifying the following incorrect
      uses:
      
      1) mm/page_alloc.c show_node() -- printk dumping the node id,
      2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
         against numa_node_id() which returns an int from cpu_to_node(), and
      3) mm/mpolicy.c check_pte_range -- used as an index in node_isset which
         uses bit_set which in generic code takes an int.
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      25ba77c1
    • C
      [PATCH] drain_node_page(): Drain pages in batch units · bc4ba393
      Christoph Lameter committed
      drain_node_pages() currently drains the complete pageset of all pages.  If
      there are a large number of pages in the queues then we may hold off
      interrupts for too long.
      
      Duplicate the method used in free_hot_cold_page.  Only drain pcp->batch
      pages at one time.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      bc4ba393
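Draining in batch units can be sketched as below. `struct pcp_sketch` and `drain_batch_sketch()` are hypothetical simplifications: the real code frees pages from a list with interrupts disabled, whereas this sketch only tracks the counters to show that each pass is bounded by the batch size.

```c
/* Hypothetical, simplified per-cpu page queue: counters only. */
struct pcp_sketch {
    int count;  /* pages currently queued */
    int batch;  /* maximum pages to free in one pass */
};

/*
 * Free at most pcp->batch pages in one pass and return how many were
 * freed, so the time spent with interrupts held off stays bounded.
 */
static int drain_batch_sketch(struct pcp_sketch *pcp)
{
    int to_drain;

    if (pcp->count >= pcp->batch)
        to_drain = pcp->batch;
    else
        to_drain = pcp->count;

    pcp->count -= to_drain;
    return to_drain;
}
```

A queue holding 70 pages with a batch of 32 is emptied over three passes of 32, 32 and 6 pages, instead of one long pass of 70.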
    • K
      [PATCH] OOM can panic due to processes stuck in __alloc_pages() · b43a57bb
      Kirill Korotaev committed
      OOM can panic because processes get stuck in __alloc_pages() in an infinite
      rebalance loop while no memory can be reclaimed.  The OOM killer tries to
      kill some processes, but unfortunately the rebalance label was moved by
      someone below the TIF_MEMDIE check, so the buddy allocator doesn't see that
      the process is OOM-killed and that it can simply fail the allocation :/
      
      Observed in reality on RHEL4(2.6.9)+OpenVZ kernel when a user doing some
      memory allocation tricks triggered OOM panic.
      Signed-off-by: Denis Lunev <den@sw.ru>
      Signed-off-by: Kirill Korotaev <dev@openvz.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b43a57bb
    • N
      [PATCH] mm: add arch_alloc_page · cc102509
      Nick Piggin committed
      Add an arch_alloc_page to match arch_free_page.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      cc102509
    • P
      [PATCH] memory page_alloc zonelist caching speedup · 9276b1bc
      Paul Jackson committed
      Optimize the critical zonelist scanning for free pages in the kernel memory
      allocator by caching the zones that were found to be full recently, and
      skipping them.
      
      Remembers the zones in a zonelist that were short of free memory in the
      last second.  And it stashes a zone-to-node table in the zonelist struct,
      to optimize that conversion (minimize its cache footprint.)
      
      Recent changes:
      
          This differs in a significant way from a similar patch that I
          posted a week ago.  Now, instead of having a nodemask_t of
          recently full nodes, I have a bitmask of recently full zones.
          This solves a problem that last week's patch had, which on
          systems with multiple zones per node (such as DMA zone) would
          take seeing any of these zones full as meaning that all zones
          on that node were full.
      
          Also I changed names - from "zonelist faster" to "zonelist cache",
          as that seemed to better convey what we're doing here - caching
          some of the key zonelist state (for faster access.)
      
          See below for some performance benchmark results.  After all that
          discussion with David on why I didn't need them, I went and got
          some ;).  I wanted to verify that I had not hurt the normal case
          of memory allocation noticeably.  At least for my one little
          microbenchmark, I found (1) the normal case wasn't affected, and
          (2) workloads that forced scanning across multiple nodes for
          memory improved up to 10% fewer System CPU cycles and lower
          elapsed clock time ('sys' and 'real').  Good.  See details, below.
      
          I didn't have the logic in get_page_from_freelist() for various
          full nodes and zone reclaim failures correct.  That should be
          fixed up now - notice the new goto labels zonelist_scan,
          this_zone_full, and try_next_zone, in get_page_from_freelist().
      
      There are two reasons I pursued this alternative, over some earlier
      proposals that would have focused on optimizing the fake numa
      emulation case by caching the last useful zone:
      
       1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
          have seen real customer loads where the cost to scan the zonelist
          was a problem, due to many nodes being full of memory before
          we got to a node we could use.  Or at least, I think we have.
          This was related to me by another engineer, based on experiences
          from some time past.  So this is not guaranteed.  Most likely, though.
      
          The following approach should help such real numa systems just as
          much as it helps fake numa systems, or any combination thereof.
      
       2) The effort to distinguish fake from real numa, using node_distance,
          so that we could cache a fake numa node and optimize choosing
          it over equivalent distance fake nodes, while continuing to
          properly scan all real nodes in distance order, was going to
          require a nasty blob of zonelist and node distance munging.
      
          The following approach has no new dependency on node distances or
          zone sorting.
      
      See comment in the patch below for a description of what it actually does.
      
      Technical details of note (or controversy):
      
       - See the use of "zlc_active" and "did_zlc_setup" below, to delay
         adding any work for this new mechanism until we've looked at the
         first zone in zonelist.  I figured the odds of the first zone
         having the memory we needed were high enough that we should just
         look there, first, then get fancy only if we need to keep looking.
      
       - Some odd hackery was needed to add items to struct zonelist, while
         not tripping up the custom zonelists built by the mm/mempolicy.c
         code for MPOL_BIND.  My usual wordy comments below explain this.
         Search for "MPOL_BIND".
      
       - Some per-node data in the struct zonelist is now modified frequently,
         with no locking.  Multiple CPU cores on a node could hit and mangle
         this data.  The theory is that this is just performance hint data,
         and the memory allocator will work just fine despite any such mangling.
         The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
         (a bitmask) and 'last_full_zap' (unsigned long jiffies).  It should
         all be self correcting after at most a one second delay.
      
       - This still does a linear scan of the same lengths as before.  All
         I've optimized is making the scan faster, not algorithmically
         shorter.  It is now able to scan a compact array of 'unsigned
         short' in the case of many full nodes, so one cache line should
         cover quite a few nodes, rather than each node hitting another
         one or two new and distinct cache lines.
      
       - If both Andi and Nick don't find this too complicated, I will be
         (pleasantly) flabbergasted.
      
       - I removed the comment claiming we only use one cacheline's worth of
         zonelist.  We seem, at least in the fake numa case, to have put the
         lie to that claim.
      
       - I pay no attention to the various watermarks and such in this performance
         hint.  A node could be marked full for one watermark, and then skipped
         over when searching for a page using a different watermark.  I think
         that's actually quite ok, as it will tend to slightly increase the
         spreading of memory over other nodes, away from a memory stressed node.
      
      ===============
      
      Performance - some benchmark results and analysis:
      
      This benchmark runs a memory hog program that uses multiple
      threads to touch a lot of memory as quickly as it can.
      
      Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
      the total 96 GBytes on the system, and using 1, 19, 37, or 55
      threads (on a 56 CPU system.)  System, user and real (elapsed)
      timings were recorded for each run, shown in units of seconds,
      in the table below.
      
      Two kernels were tested - 2.6.18-mm3 and the same kernel with
      this zonelist caching patch added.  The table also shows the
      percentage improvement the zonelist caching sys time is over
      (lower than) the stock *-mm kernel.
      
            number      2.6.18-mm3        zonelist-cache     delta (< 0 good)   percent
       GBs       N     ---------------     ---------------     ---------------  systime
       mem threads     sys  user  real     sys  user  real     sys  user  real   better
        12       1     153    24   177     151    24   176      -2     0    -1       1%
        12      19      99    22     8      99    22     8       0     0     0       0%
        12      37     111    25     6     112    25     6       1     0     0      -0%
        12      55     115    25     5     110    23     5      -5    -2     0       4%
        38       1     502    74   576     497    73   570      -5    -1    -6       0%
        38      19     426    78    48     373    76    39     -53    -2    -9      12%
        38      37     544    83    36     547    82    36       3    -1     0      -0%
        38      55     501    77    23     511    80    24      10     3     1      -1%
        64       1     917   125  1042     890   124  1014     -27    -1   -28       2%
        64      19    1118   138   119     965   141   103    -153     3   -16      13%
        64      37    1202   151    94    1136   150    81     -66    -1   -13       5%
        64      55    1118   141    61    1072   140    58     -46    -1    -3       4%
        90       1    1342   177  1519    1275   174  1450     -67    -3   -69       4%
        90      19    2392   199   192    2116   189   176    -276   -10   -16      11%
        90      37    3313   238   175    2972   225   145    -341   -13   -30      10%
        90      55    1948   210   104    1843   213   100    -105     3    -4       5%
      
      Notes:
       1) This test ran a memory hog program that started a specified number N of
          threads, and had each thread allocate and touch 1/N'th of
          the total memory to be used in the test run in a single loop,
          writing a constant word to memory, one store every 4096 bytes.
          Watching this test during some earlier trial runs, I would see
          each of these threads sit down on one CPU and stay there, for
          the remainder of the pass, a different CPU for each thread.
      
       2) The 'real' column is not comparable to the 'sys' or 'user' columns.
          The 'real' column is seconds wall clock time elapsed, from beginning
          to end of that test pass.  The 'sys' and 'user' columns are total
          CPU seconds spent on that test pass.  For a 19 thread test run,
          for example, the sum of 'sys' and 'user' could be up to 19 times the
          number of 'real' elapsed wall clock seconds.
      
       3) Tests were run on a fresh, single-user boot, to minimize the amount
          of memory already in use at the start of the test, and to minimize
          the amount of background activity that might interfere.
      
       4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.
      
       5) Notice that the 'real' time gets large for the single thread runs, even
          though the measured 'sys' and 'user' times are modest.  I'm not sure what
          that means - probably something to do with it being slow for one thread to
          be accessing memory a long way away.  Perhaps the fake numa system, running
          ostensibly the same workload, would not show this substantial degradation
          of 'real' time for one thread on many nodes -- let's hope not.
      
       6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
          ran quite efficiently, as one might expect.  Each pair of threads needed
          to allocate and touch the memory on the node the two threads shared, a
          pleasantly parallelizable workload.
      
       7) The intermediate thread count passes, when asking for a lot of memory forcing
          them to go to a few neighboring nodes, improved the most with this zonelist
          caching patch.
      
      Conclusions:
       * This zonelist cache patch probably makes little difference one way or the
         other for most workloads on real numa hardware, if those workloads avoid
         heavy off node allocations.
       * For memory intensive workloads requiring substantial off-node allocations
         on real numa hardware, this patch improves both kernel and elapsed timings
         up to ten percent.
       * For fake numa systems, I'm optimistic, but will have to leave that up to
         Rohit Seth to actually test (once I get him a 2.6.18 backport.)
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Cc: Rohit Seth <rohitseth@google.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: David Rientjes <rientjes@cs.washington.edu>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9276b1bc
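The core idea of the zonelist cache, a bitmask of recently full zones that is cleared after about a second, can be sketched as follows. Everything here is a simplified assumption: `struct zlc_sketch`, `SK_HZ`, and the three helper functions are hypothetical, the time is passed in explicitly instead of read from jiffies, and the zone-to-node table and the lack of locking described above are not modelled.

```c
/* Assumed number of clock ticks per second for this sketch. */
#define SK_HZ 1000UL

/* Hypothetical cache: one bit per zone position in the zonelist. */
struct zlc_sketch {
    unsigned long fullzones;        /* bit set: zone was recently found full */
    unsigned long last_full_zap;    /* tick at which the bitmask was last cleared */
};

/* Forget all "full" hints once more than a second has passed. */
static void zlc_maybe_zap_sketch(struct zlc_sketch *zlc, unsigned long now)
{
    if (now - zlc->last_full_zap > SK_HZ) {
        zlc->fullzones = 0;
        zlc->last_full_zap = now;
    }
}

/* Returns 1 if the zone is not marked full and should be scanned. */
static int zlc_worth_trying_sketch(struct zlc_sketch *zlc,
                                   unsigned int zone_idx, unsigned long now)
{
    zlc_maybe_zap_sketch(zlc, now);
    return !(zlc->fullzones & (1UL << zone_idx));
}

/* Record that a zone had no free pages for the current request. */
static void zlc_mark_full_sketch(struct zlc_sketch *zlc, unsigned int zone_idx)
{
    zlc->fullzones |= 1UL << zone_idx;
}
```

Because the bitmask is only a hint, a stale or mangled bit costs at most a skipped zone until the next zap, which is the self-correcting behaviour the message relies on.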
    • C
      [PATCH] Get rid of zone_table[] · 89689ae7
      Christoph Lameter committed
      The zone table is mostly not needed.  If we have a node in the page flags
      then we can get to the zone via NODE_DATA() which is much more likely to be
      already in the cpu cache.
      
      In case of SMP and UP NODE_DATA() is a constant pointer which allows us to
      access an exact replica of zonetable in the node_zones field.  In all of
      the above cases there will be no need at all for the zone table.
      
      The only remaining case is if in a NUMA system the node numbers do not fit
      into the page flags.  In that case we make sparse generate a table that
      maps sections to nodes and use that table to figure out the node number.
      This table is sized to fit in a single cache line for the known 32 bit
      NUMA platform which makes it very likely that the information can be
      obtained without a cache miss.
      
      For sparsemem the zone table seems to have been fairly large based on
      the maximum possible number of sections and the number of zones per node.
      There is some memory saving by removing zone_table.  The main benefit is to
      reduce the cache footprint of the VM from the frequent lookups of zones.
      Plus it simplifies the page allocator.
      
      [akpm@osdl.org: build fix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      89689ae7
    • P
      [PATCH] memory page alloc minor cleanups · 0798e519
      Paul Jackson committed
      - s/freeliest/freelist/ spelling fix
      
      - Check for NULL *z zone seems useless - even if it could happen, so
        what?  Perhaps we should have a check later on if we are faced with an
        allocation request that is not allowed to fail - shouldn't that be a
        serious kernel error, passing an empty zonelist with a mandate to not
        fail?
      
      - Initializing 'z' to zonelist->zones can wait until after the first
        get_page_from_freelist() fails; we only use 'z' in the wakeup_kswapd()
        loop, so let's initialize 'z' there, in a 'for' loop.  Seems clearer.
      
      - Remove superfluous braces around a break
      
      - Fix a couple errant spaces
      
      - Adjust indentation on the cpuset_zone_allowed() check, to match the
        lines just before it -- seems easier to read in this case.
      
      - Add another set of braces to the zone_watermark_ok logic
      
      From: Paul Jackson <pj@sgi.com>
      
        Back out one item from a previous "memory page_alloc minor cleanups" patch.
        Until and unless we are certain that no one can ever pass an empty zonelist
        to __alloc_pages(), this check for an empty zonelist (or some BUG
        equivalent) is essential.  The code in get_page_from_freelist() blows up if
        passed an empty zonelist.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0798e519
  9. 24 Nov, 2006 1 commit
    • M
      [PATCH] x86_64: fix bad page state in process 'swapper' · 1abbfb41
      Mel Gorman committed
      find_min_pfn_for_node() and find_min_pfn_with_active_regions() both
      depend on a sorted early_node_map[].  However, sort_node_map() is being
      called after find_min_pfn_with_active_regions() in
      free_area_init_nodes().
      
      In most cases, this is ok, but on at least one x86_64, the SRAT table
      caused the E820 ranges to be registered out of order.  This gave the
      wrong values for the min PFN range resulting in some pages not being
      initialised.
      
      This patch sorts the early_node_map in find_min_pfn_for_node().  It has
      been boot tested on x86, x86_64, ppc64 and ia64.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1abbfb41
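Why the lookup depends on a sorted map can be shown with a small sketch. `struct node_range_sketch`, `cmp_start_pfn_sketch()` and `find_min_pfn_sketch()` are hypothetical simplifications (the per-node filtering of the real function is omitted); the sketch only assumes that the minimum is read from the first entry, so the map has to be sorted before that read.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified entry of an early node map. */
struct node_range_sketch {
    int nid;
    unsigned long start_pfn;
    unsigned long end_pfn;
};

/* qsort() comparator ordering ranges by ascending start_pfn. */
static int cmp_start_pfn_sketch(const void *a, const void *b)
{
    const struct node_range_sketch *ra = a;
    const struct node_range_sketch *rb = b;

    if (ra->start_pfn < rb->start_pfn)
        return -1;
    if (ra->start_pfn > rb->start_pfn)
        return 1;
    return 0;
}

/*
 * Sort first, then take the first entry as the minimum. Reading map[0]
 * without sorting would return whatever range was registered first.
 */
static unsigned long find_min_pfn_sketch(struct node_range_sketch *map, size_t n)
{
    if (n == 0)
        return 0;
    qsort(map, n, sizeof(map[0]), cmp_start_pfn_sketch);
    return map[0].start_pfn;
}
```

With ranges registered out of order, for example starting at PFNs 256, 16 and 1024, the unsorted first entry would report 256 while the sorted lookup reports 16.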
  10. 04 Nov, 2006 1 commit
  11. 29 Oct, 2006 2 commits
    • M
      [PATCH] Calculation fix for memory holes beyong the end of physical memory · 0c6cb974
      Mel Gorman committed
      absent_pages_in_range() made the assumption that users of the
      arch-independent zone-sizing API would not care about holes beyond the end
      of physical memory.  This was not the case and was "fixed" in a patch
      called "Account for holes that are outside the range of physical memory".
      However, when given a range that started before a hole in "real" memory and
      ended beyond the end of memory, it would get the result wrong.  The bug is
      in mainline but a patch is below.
      
      It has been tested successfully on a number of machines and architectures.
      Additional credit to Keith Mannthey for discovering the problem, helping
      identify the correct fix and confirming it Worked For Him.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: keith mannthey <kmannth@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0c6cb974
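The case the commit above describes (a range that starts before a hole in real memory and ends beyond the end of memory) can be checked with a small sketch. This is not the kernel's algorithm: `absent_pages_sketch()` and `struct pfn_range` are hypothetical, the registered regions are assumed not to overlap, and the function counts present pages and subtracts them from the range length rather than walking the holes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical region of present memory, half-open: [start_pfn, end_pfn). */
struct pfn_range {
    unsigned long start_pfn;
    unsigned long end_pfn;
};

/* Pages in [start, end) that are not covered by any region.
 * Assumes the regions do not overlap each other. */
static unsigned long absent_pages_sketch(const struct pfn_range *r, size_t n,
                                         unsigned long start, unsigned long end)
{
    unsigned long present = 0;
    size_t i;

    assert(start <= end);
    for (i = 0; i < n; i++) {
        /* Clip the region to the queried range. */
        unsigned long lo = r[i].start_pfn > start ? r[i].start_pfn : start;
        unsigned long hi = r[i].end_pfn < end ? r[i].end_pfn : end;

        if (lo < hi)
            present += hi - lo;
    }
    /* Everything not present is a hole, including the tail past the
     * last region (the "beyond the end of physical memory" case). */
    return (end - start) - present;
}
```

With regions [0, 100) and [200, 300), a query for [50, 400) has to count both the interior hole and the tail past PFN 300, which is the combination the original code got wrong.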
    • M
      [PATCH] vmscan: Fix temp_priority race · 3bb1a852
      Martin Bligh authored
      The temp_priority field in zone is racy, as we can walk through a reclaim
      path, and just before we copy it into prev_priority, it can be overwritten
      (say with DEF_PRIORITY) by another reclaimer.
      
      The same bug is contained in both try_to_free_pages and balance_pgdat, but
      it is fixed slightly differently.  In balance_pgdat, we keep a separate
      priority record per zone in a local array.  In try_to_free_pages there is
      no need to do this, as the priority level is the same for all zones that we
      reclaim from.
      
      Impact of this bug is that temp_priority is copied into prev_priority, and
      setting this artificially high causes reclaimers to set distress
      artificially low.  They then fail to reclaim mapped pages, when they are,
      in fact, under severe memory pressure (their priority may be as low as 0).
      This causes the OOM killer to fire incorrectly.
      
      From: Andrew Morton <akpm@osdl.org>
      
      __zone_reclaim() isn't modifying zone->prev_priority.  But zone->prev_priority
      is used in the decision whether or not to bring mapped pages onto the inactive
      list.  Hence there's a risk here that __zone_reclaim() will fail because
      zone->prev_priority is large (ie: low urgency) and lots of mapped pages end up
      stuck on the active list.
      
      Fix that up by decreasing (ie making more urgent) zone->prev_priority as
      __zone_reclaim() scans the zone's pages.
      
      This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created.  It should be
      possible to remove that now, and to just start out at DEF_PRIORITY?
      
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3bb1a852
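The balance_pgdat side of the fix above (a per-zone priority record kept in a local array, copied into prev_priority only at the end) can be sketched as follows. Everything here is a simplified assumption: `struct zone_sketch`, `balance_sketch()` and the `needed_priority` input are hypothetical, DEF_PRIORITY is taken to be 12, and the distress formula `100 >> prev_priority` is recalled from vmscan.c of that era rather than taken from this commit.

```c
#include <assert.h>

#define DEF_PRIORITY 12   /* assumed starting (least urgent) priority */
#define NR_ZONES 3

/* Hypothetical zone holding only the field of interest. */
struct zone_sketch { int prev_priority; };

/* Assumed distress formula: 0 means no trouble, 100 means great trouble. */
static int distress_for(const struct zone_sketch *z)
{
    return 100 >> z->prev_priority;
}

/* needed_priority[i] is the (hypothetical) priority at which zone i
 * finally becomes balanced. The running priority is tracked in a local
 * array, so no other reclaimer can overwrite it mid-pass. */
static void balance_sketch(struct zone_sketch *zones, const int *needed_priority)
{
    int zone_priority[NR_ZONES];
    int i, priority;

    for (i = 0; i < NR_ZONES; i++) {
        assert(needed_priority[i] >= 0 && needed_priority[i] <= DEF_PRIORITY);
        zone_priority[i] = DEF_PRIORITY;
    }

    for (priority = DEF_PRIORITY; priority >= 0; priority--)
        for (i = 0; i < NR_ZONES; i++)
            if (needed_priority[i] <= priority)
                zone_priority[i] = priority;   /* still scanning this zone */

    /* Publish once, from the private record, at the end of the pass. */
    for (i = 0; i < NR_ZONES; i++)
        zones[i].prev_priority = zone_priority[i];
}
```

A zone that was only balanced at priority 0 ends up with distress 100 under the assumed formula, whereas a stale DEF_PRIORITY written by another reclaimer would have given distress 0.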
  12. 22 Oct, 2006 1 commit
  13. 21 Oct, 2006 2 commits
    • A
      [PATCH] highest_possible_node_id() linkage fix · 6220ec78
      Andrew Morton authored
      Quoting Adrian:
      
      - net/sunrpc/svc.c uses highest_possible_node_id()
      
      - include/linux/nodemask.h says highest_possible_node_id() is
        out-of-line #if MAX_NUMNODES > 1
      
      - the out-of-line highest_possible_node_id() is in lib/cpumask.c
      
      - lib/Makefile: lib-$(CONFIG_SMP) += cpumask.o
        CONFIG_ARCH_DISCONTIGMEM_ENABLE=y, CONFIG_SMP=n, CONFIG_SUNRPC=y
      
      -> highest_possible_node_id() is used in net/sunrpc/svc.c
         CONFIG_NODES_SHIFT defined and > 0
      
      -> include/linux/numa.h: MAX_NUMNODES > 1
      
      -> compile error
      
      The bug is not present on architectures where ARCH_DISCONTIGMEM_ENABLE
      depends on NUMA (but m32r isn't the only affected architecture).
      
      So move the function into page_alloc.c
      
      Cc: Adrian Bunk <bunk@stusta.de>
      Cc: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6220ec78
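The commit above is about where highest_possible_node_id() is compiled, not what it computes, but a sketch of the computation shows why it has no real dependency on SMP. This is an assumption about the function's behaviour, not the kernel source: `highest_node_id_sketch()` and `MAX_NUMNODES_SKETCH` are hypothetical, and a plain `unsigned long` bitmask stands in for the possible-node map.

```c
#include <assert.h>

#define MAX_NUMNODES_SKETCH 32   /* hypothetical node limit for the sketch */

/* Return the index of the highest set bit in the possible-node mask,
 * or 0 when the mask is empty. */
static unsigned int highest_node_id_sketch(unsigned long possible_mask)
{
    unsigned int node, highest = 0;

    for (node = 0; node < MAX_NUMNODES_SKETCH; node++)
        if (possible_mask & (1UL << node))
            highest = node;
    return highest;
}
```

Nothing in such a scan depends on CPU masks, which is consistent with moving the function out of a file that is only built when CONFIG_SMP is set.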
    • A
      [PATCH] separate bdi congestion functions from queue congestion functions · 3fcfab16
      Andrew Morton authored
      Separate out the concept of "queue congestion" from "backing-dev congestion".
      Congestion is a backing-dev concept, not a queue concept.
      
      The blk_* congestion functions are retained, as wrappers around the core
      backing-dev congestion functions.
      
      This proper layering is needed so that NFS can cleanly use the congestion
      functions, and so that CONFIG_BLOCK=n actually links.
      
      Cc: "Thomas Maier" <balagi@justmail.de>
      Cc: "Jens Axboe" <jens.axboe@oracle.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3fcfab16
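The layering described in the commit above (congestion state owned by the backing device, with the blk_* functions kept as thin wrappers) might look roughly like this in a userspace mock. All names carry a `_sketch` suffix because they are illustrative only; the structures, the bit layout and the function signatures are assumptions, not the kernel's definitions.

```c
#include <assert.h>

enum { READ_SKETCH = 0, WRITE_SKETCH = 1 };   /* hypothetical direction bits */

/* Core layer: congestion state lives in the backing device. */
struct backing_dev_info_sketch { unsigned long state; };

static void set_bdi_congested_sketch(struct backing_dev_info_sketch *bdi, int rw)
{
    bdi->state |= 1UL << rw;
}

static void clear_bdi_congested_sketch(struct backing_dev_info_sketch *bdi, int rw)
{
    bdi->state &= ~(1UL << rw);
}

static int bdi_congested_sketch(const struct backing_dev_info_sketch *bdi, int rw)
{
    return (bdi->state >> rw) & 1UL;
}

/* Block layer: a queue embeds a backing device and forwards to it. */
struct request_queue_sketch { struct backing_dev_info_sketch backing_dev_info; };

static void blk_set_queue_congested_sketch(struct request_queue_sketch *q, int rw)
{
    set_bdi_congested_sketch(&q->backing_dev_info, rw);
}

static void blk_clear_queue_congested_sketch(struct request_queue_sketch *q, int rw)
{
    clear_bdi_congested_sketch(&q->backing_dev_info, rw);
}
```

A user with no request queue at all, such as a network filesystem, could call the core functions directly on its own backing device, which is the use the commit message gives for the split.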
  14. 12 Oct, 2006 3 commits
  15. 04 Oct, 2006 2 commits