  1. 08 Dec, 2006 6 commits
    • [PATCH] oom: cleanup messages · f3af38d3
      authored by Nick Piggin
      Clean up the OOM killer messages to be more consistent.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] oom: don't kill unkillable children or siblings · c33e0fca
      authored by Nick Piggin
      Abort the kill if any of our threads have OOM_DISABLE set.  Having this
      test here also prevents any OOM_DISABLE child of the "selected" process
      from being killed.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] memory page_alloc zonelist caching speedup · 9276b1bc
      authored by Paul Jackson
      Optimize the critical zonelist scanning for free pages in the kernel memory
      allocator by caching the zones that were found to be full recently, and
      skipping them.
      
      Remembers the zones in a zonelist that were short of free memory in the
      last second.  And it stashes a zone-to-node table in the zonelist struct,
      to optimize that conversion (minimize its cache footprint.)
      
      Recent changes:
      
          This differs in a significant way from a similar patch that I
          posted a week ago.  Now, instead of having a nodemask_t of
          recently full nodes, I have a bitmask of recently full zones.
          This solves a problem that last week's patch had, which on
          systems with multiple zones per node (such as DMA zone) would
          take seeing any of these zones full as meaning that all zones
          on that node were full.
      
          Also I changed names - from "zonelist faster" to "zonelist cache",
          as that seemed to better convey what we're doing here - caching
          some of the key zonelist state (for faster access.)
      
          See below for some performance benchmark results.  After all that
          discussion with David on why I didn't need them, I went and got
          some ;).  I wanted to verify that I had not hurt the normal case
          of memory allocation noticeably.  At least for my one little
          microbenchmark, I found (1) the normal case wasn't affected, and
          (2) workloads that forced scanning across multiple nodes for
          memory improved, with up to 10% fewer System CPU cycles and lower
          elapsed clock time ('sys' and 'real').  Good.  See details, below.
      
          I didn't have the logic in get_page_from_freelist() for various
          full nodes and zone reclaim failures correct.  That should be
          fixed up now - notice the new goto labels zonelist_scan,
          this_zone_full, and try_next_zone, in get_page_from_freelist().
      
      There are two reasons I pursued this alternative, over some earlier
      proposals that would have focused on optimizing the fake numa
      emulation case by caching the last useful zone:
      
       1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
          have seen real customer loads where the cost to scan the zonelist
          was a problem, due to many nodes being full of memory before
          we got to a node we could use.  Or at least, I think we have.
          This was related to me by another engineer, based on experiences
          from some time past.  So this is not guaranteed.  Most likely, though.
      
          The following approach should help such real numa systems just as
          much as it helps fake numa systems, or any combination thereof.
      
       2) The effort to distinguish fake from real numa, using node_distance,
          so that we could cache a fake numa node and optimize choosing
          it over equivalent distance fake nodes, while continuing to
          properly scan all real nodes in distance order, was going to
          require a nasty blob of zonelist and node distance munging.
      
          The following approach has no new dependency on node distances or
          zone sorting.
      
      See comment in the patch below for a description of what it actually does.
      
      Technical details of note (or controversy):
      
       - See the use of "zlc_active" and "did_zlc_setup" below, to delay
         adding any work for this new mechanism until we've looked at the
         first zone in zonelist.  I figured the odds of the first zone
         having the memory we needed were high enough that we should just
         look there, first, then get fancy only if we need to keep looking.
      
       - Some odd hackery was needed to add items to struct zonelist, while
         not tripping up the custom zonelists built by the mm/mempolicy.c
         code for MPOL_BIND.  My usual wordy comments below explain this.
         Search for "MPOL_BIND".
      
       - Some per-node data in the struct zonelist is now modified frequently,
         with no locking.  Multiple CPU cores on a node could hit and mangle
         this data.  The theory is that this is just performance hint data,
         and the memory allocator will work just fine despite any such mangling.
         The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
         (a bitmask) and 'last_full_zap' (unsigned long jiffies).  It should
         all be self correcting after at most a one second delay.
      
       - This still does a linear scan of the same lengths as before.  All
         I've optimized is making the scan faster, not algorithmically
         shorter.  It is now able to scan a compact array of 'unsigned
         short' in the case of many full nodes, so one cache line should
         cover quite a few nodes, rather than each node hitting another
         one or two new and distinct cache lines.
      
       - If both Andi and Nick don't find this too complicated, I will be
         (pleasantly) flabbergasted.
      
       - I removed the comment claiming we only use one cache line's worth of
         zonelist.  We seem, at least in the fake numa case, to have put the
         lie to that claim.
      
       - I pay no attention to the various watermarks and such in this performance
         hint.  A node could be marked full for one watermark, and then skipped
         over when searching for a page using a different watermark.  I think
         that's actually quite ok, as it will tend to slightly increase the
         spreading of memory over other nodes, away from a memory stressed node.
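
The hint mechanism described above can be sketched in user-space C. This is a minimal illustration, not the patch itself: the struct name, the helper functions, the 64-zone limit and the tick-based "one second" interval are all assumptions made for the example, and the free_pages[] array stands in for the real per-zone watermark check. Unlike the patch, the sketch consults the cache from the first zone onward rather than delaying setup.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ZONES 64             /* hypothetical limit for this sketch */
#define ZAP_INTERVAL_TICKS 100   /* hypothetical number of ticks in one second */

/* Hint data only: which zones looked full recently, and when we last forgot. */
struct zonelist_cache_sketch {
    uint64_t fullzones;            /* bit i set: zone i was recently full */
    unsigned long last_full_zap;   /* tick at which the bitmask was last cleared */
};

/* Forget everything once the hints are at least one interval old. */
static void zlc_zap_if_stale(struct zonelist_cache_sketch *zlc, unsigned long now)
{
    if (now - zlc->last_full_zap >= ZAP_INTERVAL_TICKS) {
        zlc->fullzones = 0;
        zlc->last_full_zap = now;
    }
}

static void zlc_mark_full(struct zonelist_cache_sketch *zlc, unsigned int zone)
{
    zlc->fullzones |= (uint64_t)1 << zone;
}

static bool zlc_worth_trying(const struct zonelist_cache_sketch *zlc, unsigned int zone)
{
    return !(zlc->fullzones & ((uint64_t)1 << zone));
}

/*
 * Scan zones in order and return the index of the first one with a free
 * page, or -1 if none has any.  The scan is still linear; zones flagged
 * as full are skipped with a single bit test instead of a full check.
 */
static int scan_zonelist(struct zonelist_cache_sketch *zlc,
                         const unsigned int *free_pages,
                         unsigned int nr_zones, unsigned long now)
{
    zlc_zap_if_stale(zlc, now);
    for (unsigned int i = 0; i < nr_zones && i < MAX_ZONES; i++) {
        if (!zlc_worth_trying(zlc, i))
            continue;
        if (free_pages[i] > 0)
            return (int)i;
        zlc_mark_full(zlc, i);   /* remember the failure for later scans */
    }
    return -1;
}
```

Because the bitmask is only a hint, a stale bit costs at most one interval of skipping a zone that has since gained free pages, which matches the "self correcting after at most a one second delay" argument above.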
      
      ===============
      
      Performance - some benchmark results and analysis:
      
      This benchmark runs a memory hog program that uses multiple
      threads to touch a lot of memory as quickly as it can.
      
      Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
      the total 96 GBytes on the system, and using 1, 19, 37, or 55
      threads (on a 56 CPU system.)  System, user and real (elapsed)
      timings were recorded for each run, shown in units of seconds,
      in the table below.
      
      Two kernels were tested - 2.6.18-mm3 and the same kernel with
      this zonelist caching patch added.  The table also shows the
      percentage by which the zonelist caching sys time improves on
      (is lower than) that of the stock *-mm kernel.
      
                number      2.6.18-mm3          zonelist-cache       delta (< 0 good)   percent
       GBs       N  ------------------  ------------------  ------------------  systime
       mem threads     sys  user  real     sys  user  real     sys  user  real   better
        12       1     153    24   177     151    24   176      -2     0    -1       1%
        12      19      99    22     8      99    22     8       0     0     0       0%
        12      37     111    25     6     112    25     6       1     0     0      -0%
        12      55     115    25     5     110    23     5      -5    -2     0       4%
        38       1     502    74   576     497    73   570      -5    -1    -6       0%
        38      19     426    78    48     373    76    39     -53    -2    -9      12%
        38      37     544    83    36     547    82    36       3    -1     0      -0%
        38      55     501    77    23     511    80    24      10     3     1      -1%
        64       1     917   125  1042     890   124  1014     -27    -1   -28       2%
        64      19    1118   138   119     965   141   103    -153     3   -16      13%
        64      37    1202   151    94    1136   150    81     -66    -1   -13       5%
        64      55    1118   141    61    1072   140    58     -46    -1    -3       4%
        90       1    1342   177  1519    1275   174  1450     -67    -3   -69       4%
        90      19    2392   199   192    2116   189   176    -276   -10   -16      11%
        90      37    3313   238   175    2972   225   145    -341   -13   -30      10%
        90      55    1948   210   104    1843   213   100    -105     3    -4       5%
      
      Notes:
       1) This test ran a memory hog program that started a specified number N of
          threads, and had each thread allocate and touch 1/N'th of
          the total memory to be used in the test run in a single loop,
          writing a constant word to memory, one store every 4096 bytes.
          Watching this test during some earlier trial runs, I would see
          each of these threads sit down on one CPU and stay there, for
          the remainder of the pass, a different CPU for each thread.
      
       2) The 'real' column is not comparable to the 'sys' or 'user' columns.
          The 'real' column is seconds wall clock time elapsed, from beginning
          to end of that test pass.  The 'sys' and 'user' columns are total
          CPU seconds spent on that test pass.  For a 19 thread test run,
          for example, the sum of 'sys' and 'user' could be up to 19 times the
          number of 'real' elapsed wall clock seconds.
      
       3) Tests were run on a fresh, single-user boot, to minimize the amount
          of memory already in use at the start of the test, and to minimize
          the amount of background activity that might interfere.
      
       4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.
      
       5) Notice that the 'real' time gets large for the single thread runs, even
          though the measured 'sys' and 'user' times are modest.  I'm not sure what
          that means - probably something to do with it being slow for one thread to
          be accessing memory a long way away.  Perhaps the fake numa system, running
          ostensibly the same workload, would not show this substantial degradation
          of 'real' time for one thread on many nodes -- let's hope not.
      
       6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
          ran quite efficiently, as one might expect.  Each pair of threads needed
          to allocate and touch the memory on the node the two threads shared, a
          pleasantly parallelizable workload.
      
       7) The intermediate thread count passes, when asking for a lot of memory forcing
          them to go to a few neighboring nodes, improved the most with this zonelist
          caching patch.
      
      Conclusions:
       * This zonelist cache patch probably makes little difference one way or the
         other for most workloads on real numa hardware, if those workloads avoid
         heavy off node allocations.
       * For memory intensive workloads requiring substantial off-node allocations
         on real numa hardware, this patch improves both kernel and elapsed timings
         up to ten percent.
       * For fake numa systems, I'm optimistic, but will have to leave that up to
         Rohit Seth to actually test (once I get him a 2.6.18 backport.)
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Cc: Rohit Seth <rohitseth@google.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: David Rientjes <rientjes@cs.washington.edu>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Get rid of zone_table[] · 89689ae7
      authored by Christoph Lameter
      The zone table is mostly not needed.  If we have a node in the page flags
      then we can get to the zone via NODE_DATA() which is much more likely to be
      already in the cpu cache.
      
      In case of SMP and UP NODE_DATA() is a constant pointer which allows us to
      access an exact replica of zonetable in the node_zones field.  In all of
      the above cases there will be no need at all for the zone table.
      
      The only remaining case is if in a NUMA system the node numbers do not fit
      into the page flags.  In that case we make sparse generate a table that
      maps sections to nodes and use that table to figure out the node number.
      This table is sized to fit in a single cache line for the known 32 bit
      NUMA platform which makes it very likely that the information can be
      obtained without a cache miss.
      
      For sparsemem the zone table seems to have been fairly large based on
      the maximum possible number of sections and the number of zones per node.
      There is some memory saving by removing zone_table.  The main benefit is to
      reduce the cache footprint of the VM from the frequent lookups of zones.
      Plus it simplifies the page allocator.
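
The lookup path described above (node number from the page flags, then the zone from the node's own zone array) can be sketched as follows. The bit layout, the sizes, and every name ending in _sketch are assumptions chosen for the illustration; the real kernel derives its field widths from the configuration, and node_data[] here only plays the role that NODE_DATA() plays in the text.

```c
/* Hypothetical layout: the top bits of flags hold the node id, then the zone index. */
#define ZONES_PER_NODE 4
#define MAX_NODES      8
#define ZONE_BITS      2
#define NODE_BITS      3
#define FLAGS_BITS     (sizeof(unsigned long) * 8)
#define NODE_SHIFT     (FLAGS_BITS - NODE_BITS)
#define ZONE_SHIFT     (FLAGS_BITS - NODE_BITS - ZONE_BITS)

struct zone_sketch { unsigned long free_pages; };
struct node_sketch { struct zone_sketch node_zones[ZONES_PER_NODE]; };
struct page_sketch { unsigned long flags; };

/* Per-node data; no separate global zone table is needed. */
static struct node_sketch node_data[MAX_NODES];

static unsigned long encode_flags(unsigned int node, unsigned int zone)
{
    return ((unsigned long)node << NODE_SHIFT) |
           ((unsigned long)zone << ZONE_SHIFT);
}

static unsigned int page_node(const struct page_sketch *p)
{
    return (unsigned int)((p->flags >> NODE_SHIFT) & ((1UL << NODE_BITS) - 1));
}

static unsigned int page_zone_index(const struct page_sketch *p)
{
    return (unsigned int)((p->flags >> ZONE_SHIFT) & ((1UL << ZONE_BITS) - 1));
}

/* Two shifts and an index into data that is likely already cached. */
static struct zone_sketch *page_zone_sketch(const struct page_sketch *p)
{
    return &node_data[page_node(p)].node_zones[page_zone_index(p)];
}
```

For a page whose flags were built with encode_flags(5, 2), page_zone_sketch() returns the address of node_data[5].node_zones[2].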
      
      [akpm@osdl.org: build fix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] __unmap_hugepage_range(): add comment · c0a499c2
      authored by Chen, Kenneth W
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] memory page alloc minor cleanups · 0798e519
      authored by Paul Jackson
      - s/freeliest/freelist/ spelling fix
      
      - Check for NULL *z zone seems useless - even if it could happen, so
        what?  Perhaps we should have a check later on if we are faced with an
        allocation request that is not allowed to fail - shouldn't that be a
        serious kernel error, passing an empty zonelist with a mandate to not
        fail?
      
      - Initializing 'z' to zonelist->zones can wait until after the first
        get_page_from_freelist() fails; we only use 'z' in the wakeup_kswapd()
        loop, so let's initialize 'z' there, in a 'for' loop.  Seems clearer.
      
      - Remove superfluous braces around a break
      
      - Fix a couple errant spaces
      
      - Adjust indentation on the cpuset_zone_allowed() check, to match the
        lines just before it -- seems easier to read in this case.
      
      - Add another set of braces to the zone_watermark_ok logic
      
      From: Paul Jackson <pj@sgi.com>
      
        Back out one item from a previous "memory page_alloc minor cleanups" patch.
        Until and unless we are certain that no one can ever pass an empty zonelist
        to __alloc_pages(), this check for an empty zonelist (or some BUG
        equivalent) is essential.  The code in get_page_from_freelist() blows up if
        passed an empty zonelist.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  2. 06 Dec, 2006 1 commit
    • [PATCH] uclinux: fix mmap() of directory for nommu case · f81cff0d
      authored by Mike Frysinger
      I was playing with blackfin when I hit a neat bug: doing an open() on a
      directory and then passing that fd to mmap() would cause the kernel to hang.

      After poking into the code a bit more, I found that
      mm/nommu.c:validate_mmap_request() checks the length and, if it is 0, just
      returns the address.  This is in stark contrast to mmu's
      mm/mmap.c:do_mmap_pgoff(), which returns -EINVAL for 0 length requests.
      I then noticed that some other parts of the logic are out of date between the
      two funcs, so perhaps that's the easy fix?
      Signed-off-by: Greg Ungerer <gerg@uclinux.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  3. 02 Dec, 2006 1 commit
  4. 24 Nov, 2006 1 commit
    • [PATCH] x86_64: fix bad page state in process 'swapper' · 1abbfb41
      authored by Mel Gorman
      find_min_pfn_for_node() and find_min_pfn_with_active_regions() both
      depend on a sorted early_node_map[].  However, sort_node_map() is being
      called after find_min_pfn_with_active_regions() in
      free_area_init_nodes().
      
      In most cases, this is ok, but on at least one x86_64, the SRAT table
      caused the E820 ranges to be registered out of order.  This gave the
      wrong values for the min PFN range resulting in some pages not being
      initialised.
      
      This patch sorts the early_node_map in find_min_pfn_for_node().  It has
      been boot tested on x86, x86_64, ppc64 and ia64.
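
The ordering dependency can be illustrated with a small sketch: if the lookup takes the first matching entry, the map must be sorted before the lookup runs. The struct, the comparator and the function name below are hypothetical stand-ins for early_node_map[] and its helpers, and qsort() replaces the kernel's own sort.

```c
#include <stdlib.h>

/* Hypothetical stand-in for an early_node_map[] entry. */
struct node_active_region_sketch {
    unsigned long start_pfn;
    unsigned long end_pfn;
    int nid;
};

/* Order regions by their starting page frame number. */
static int cmp_region(const void *a, const void *b)
{
    const struct node_active_region_sketch *ra = a;
    const struct node_active_region_sketch *rb = b;

    if (ra->start_pfn < rb->start_pfn)
        return -1;
    if (ra->start_pfn > rb->start_pfn)
        return 1;
    return 0;
}

/*
 * Sort inside the lookup so callers cannot see an unsorted map, then
 * return the first (lowest) start PFN registered for the node.
 */
static unsigned long find_min_pfn_sketch(struct node_active_region_sketch *map,
                                         size_t nr_entries, int nid)
{
    qsort(map, nr_entries, sizeof(*map), cmp_region);
    for (size_t i = 0; i < nr_entries; i++)
        if (map[i].nid == nid)
            return map[i].start_pfn;
    return 0;   /* no memory registered for this node */
}
```

With ranges registered out of order, for example 4096-8192 before 0-1024 on node 0, an unsorted first-match lookup would report 4096; sorting first yields 0.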
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  5. 22 Nov, 2006 3 commits
    • WorkStruct: make allyesconfig · c4028958
      authored by David Howells
      Fix up for make allyesconfig.
      Signed-Off-By: David Howells <dhowells@redhat.com>
    • WorkStruct: Pass the work_struct pointer instead of context data · 65f27f38
      authored by David Howells
      Pass the work_struct pointer to the work function rather than context data.
      The work function can use container_of() to work out the data.
      
      For the cases where the container of the work_struct may go away the moment the
      pending bit is cleared, it is made possible to defer the release of the
      structure by deferring the clearing of the pending bit.
      
      To make this work, an extra flag is introduced into the management side of the
      work_struct.  This governs auto-release of the structure upon execution.
      
      Ordinarily, the work queue executor would release the work_struct for further
      scheduling or deallocation by clearing the pending bit prior to jumping to the
      work function.  This means that, unless the driver makes some guarantee itself
      that the work_struct won't go away, the work function may not access anything
      else in the work_struct or its container lest they be deallocated.  This is a
      problem if the auxiliary data is taken away (as done by the last patch).
      
      However, if the pending bit is *not* cleared before jumping to the work
      function, then the work function *may* access the work_struct and its container
      with no problems.  But then the work function must itself release the
      work_struct by calling work_release().
      
      In most cases, automatic release is fine, so this is the default.  Special
      initiators exist for the non-auto-release case (ending in _NAR).
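
The container_of() pattern mentioned above can be sketched in portable C. The macro below is a simplified re-implementation written for this example (the kernel's version adds type checking), and work_sketch, my_device and run_work are hypothetical names, not kernel API.

```c
#include <stddef.h>

/* Simplified container_of: step back from a member to its enclosing struct. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* A work item carries only its handler; the handler receives the item itself. */
struct work_sketch {
    void (*func)(struct work_sketch *work);
};

/* A hypothetical driver object that embeds a work item. */
struct my_device {
    int counter;
    struct work_sketch work;
};

/* The handler recovers its context from the embedded work item. */
static void my_device_work(struct work_sketch *work)
{
    struct my_device *dev = container_of_sketch(work, struct my_device, work);

    dev->counter++;
}

/* Stand-in for the work queue executor: it passes the item, not context data. */
static void run_work(struct work_sketch *work)
{
    work->func(work);
}
```

The handler touches the container here, which is exactly the access that is only safe if the container is guaranteed to outlive the call, as the discussion of the pending bit above explains.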
      Signed-Off-By: David Howells <dhowells@redhat.com>
    • WorkStruct: Separate delayable and non-delayable events. · 52bad64d
      authored by David Howells
      Separate delayable work items from non-delayable work items by splitting them
      into a separate structure (delayed_work), which incorporates a work_struct and
      the timer_list removed from work_struct.
      
      The work_struct struct is huge, and this limits its usefulness.  On a 64-bit
      architecture it's nearly 100 bytes in size.  This reduces that by half for the
      non-delayable type of event.
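
A rough sketch of the split, with made-up field lists: only the delayable variant carries the timer, so plain work items shrink. The field names and the _sketch types are assumptions for illustration and do not reproduce the real struct layouts or sizes.

```c
/* Hypothetical stand-in for the kernel's timer_list. */
struct timer_sketch {
    void *entry_next, *entry_prev;
    unsigned long expires;
    void (*function)(unsigned long);
    unsigned long data;
};

/* The slimmed-down item: no timer inside. */
struct plain_work_sketch {
    unsigned long pending;
    void *entry_next, *entry_prev;
    void (*func)(struct plain_work_sketch *work);
};

/* Delayable items embed a plain item and pay for the timer themselves. */
struct delayed_work_sketch {
    struct plain_work_sketch work;
    struct timer_sketch timer;
};

/* Code that only needs the common part can take the embedded item. */
static struct plain_work_sketch *to_plain_work(struct delayed_work_sketch *dwork)
{
    return &dwork->work;
}
```

Comparing sizeof(struct plain_work_sketch) with sizeof(struct delayed_work_sketch) on a given platform shows the saving for items that never need a delay.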
      Signed-Off-By: David Howells <dhowells@redhat.com>
  6. 17 Nov, 2006 1 commit
  7. 15 Nov, 2006 3 commits
    • [PATCH] hugetlb: fix error return for brk() entering a hugepage region · cd2579d7
      authored by Hugh Dickins
      Commit cb07c9a1 causes the wrong return
      value.  is_hugepage_only_range() is a boolean, so we should return
      -EINVAL rather than 1.
      
      Also - we can use "mm" instead of looking up "current->mm" again.
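
The class of bug fixed here, returning a predicate's truth value where an error code is expected, looks like this in a small sketch. The region bounds and both function names are hypothetical; only the -EINVAL convention comes from the text.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical reserved region for the sketch. */
#define HUGE_REGION_START 0x1000000UL
#define HUGE_REGION_END   0x2000000UL

/* A predicate: answers yes or no, never an error code. */
static bool in_hugepage_only_range(unsigned long addr, unsigned long len)
{
    return addr < HUGE_REGION_END && addr + len > HUGE_REGION_START;
}

/*
 * Returns 0 on success or a negative errno.  Returning the predicate's
 * result directly would hand the caller 1, which is not an error code.
 */
static int check_brk_range(unsigned long addr, unsigned long len)
{
    if (in_hugepage_only_range(addr, len))
        return -EINVAL;
    return 0;
}
```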
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] hugetlb: check for brk() entering a hugepage region · cb07c9a1
      authored by David Gibson
      Unlike mmap(), the codepath for brk() creates a vma without first checking
      that it doesn't touch a region exclusively reserved for hugepages.  On
      powerpc, this can allow it to create a normal page vma in a hugepage
      region, causing oopses and other badness.
      
      Add a test to prevent this.  With this patch, brk() will simply fail if it
      attempts to move the break into a hugepage reserved region.
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] hugetlb: prepare_hugepage_range check offset too · 68589bc3
      authored by Hugh Dickins
      (David:)
      
      If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
      because the given file offset is not hugepage aligned - then do_mmap_pgoff
      will go to the unmap_and_free_vma backout path.
      
      But at this stage the vma hasn't been marked as hugepage, and the backout path
      will call unmap_region() on it.  That will eventually call down to the
      non-hugepage version of unmap_page_range().  On ppc64, at least, that will
      cause serious problems if there are any existing hugepage pagetable entries in
      the vicinity - for example if there are any other hugepage mappings under the
      same PUD.  unmap_page_range() will trigger a bad_pud() on the hugepage pud
      entries.  I suspect this will also cause bad problems on ia64, though I don't
      have a machine to test it on.
      
      (Hugh:)
      
      prepare_hugepage_range() should check file offset alignment when it checks
      virtual address and length, to stop MAP_FIXED with a bad huge offset from
      unmapping before it fails further down.  PowerPC should apply the same
      prepare_hugepage_range alignment checks as ia64 and all the others do.
      
      Then none of the alignment checks in hugetlbfs_file_mmap are required (nor
      is the check for too small a mapping); but even so, move up setting of
      VM_HUGETLB and add a comment to warn of what David Gibson discovered - if
      hugetlbfs_file_mmap fails before setting it, do_mmap_pgoff's unmap_region
      when unwinding from error will go the non-huge way, which may cause bad
      behaviour on architectures (powerpc and ia64) which segregate their huge
      mappings into a separate region of the address space.
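
The idea of checking all three alignments up front, before anything is unmapped, can be sketched as below. The 16 MB huge page size and the function name are assumptions for the example; real huge page sizes are architecture specific, and the real function has per-architecture variants.

```c
#include <errno.h>

#define HPAGE_SHIFT_SKETCH 24   /* hypothetical: 16 MB huge pages */
#define HPAGE_SIZE_SKETCH  (1UL << HPAGE_SHIFT_SKETCH)

/*
 * Reject the request early if the file offset, the length or the virtual
 * address is not a multiple of the huge page size.  Failing here means the
 * caller never reaches the point where an existing mapping is torn down.
 */
static int prepare_hugepage_range_sketch(unsigned long addr, unsigned long len,
                                         unsigned long long file_offset)
{
    if (file_offset & (HPAGE_SIZE_SKETCH - 1))
        return -EINVAL;
    if (len & (HPAGE_SIZE_SKETCH - 1))
        return -EINVAL;
    if (addr & (HPAGE_SIZE_SKETCH - 1))
        return -EINVAL;
    return 0;
}
```

A request with an aligned address and length but a 4096-byte file offset is refused by the first test, which is the case the text says used to fail only after MAP_FIXED had already unmapped the old range.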
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Acked-by: David Gibson <david@gibson.dropbear.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  8. 13 Nov, 2006 1 commit
  9. 04 Nov, 2006 4 commits
  10. 30 Oct, 2006 1 commit
    • [PATCH] Fix GFP_HIGHMEM slab panic · 5211e6e6
      authored by Giridhar Pemmasani
      As reported by Martin J. Bligh <mbligh@google.com>, we let through some
      non-slab bits to slab allocation through __get_vm_area_node when doing a
      vmalloc.
      
      I haven't been able to reproduce this, although I understand why it
      happens: vmalloc allocates memory with
      
      GFP_KERNEL | __GFP_HIGHMEM
      
      and commit 52fd24ca resulted in the same
      flags being passed down to cache_alloc_refill, causing the BUG.  The
      following patch fixes it.
      
      Note that when calling kmalloc_node, I am masking off __GFP_HIGHMEM with
      GFP_LEVEL_MASK, whereas __vmalloc_area_node does the same with
      
      ~(__GFP_HIGHMEM | __GFP_ZERO).
      
      IMHO, using GFP_LEVEL_MASK is preferable, but either should fix this
      problem.
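
The two masking styles compared above differ in what happens to bits nobody thought about. In the sketch below every flag value is made up (the real __GFP_* values differ and have changed over time), and SK_GFP_LEVEL_MASK is a hypothetical allow-list standing in for GFP_LEVEL_MASK.

```c
typedef unsigned int gfp_sketch_t;

/* Hypothetical bit values, for illustration only. */
#define SK_GFP_HIGHMEM 0x0002u
#define SK_GFP_WAIT    0x0010u
#define SK_GFP_IO      0x0040u
#define SK_GFP_FS      0x0080u
#define SK_GFP_ZERO    0x8000u
#define SK_GFP_KERNEL  (SK_GFP_WAIT | SK_GFP_IO | SK_GFP_FS)

/* Hypothetical allow-list of flags the slab allocator accepts. */
#define SK_GFP_LEVEL_MASK (SK_GFP_WAIT | SK_GFP_IO | SK_GFP_FS)

/* Allow-list: keep only known-good bits; anything else is dropped. */
static gfp_sketch_t mask_allowlist(gfp_sketch_t flags)
{
    return flags & SK_GFP_LEVEL_MASK;
}

/* Deny-list: drop the named bits; anything else passes through. */
static gfp_sketch_t mask_denylist(gfp_sketch_t flags)
{
    return flags & ~(SK_GFP_HIGHMEM | SK_GFP_ZERO);
}
```

Both forms strip the highmem bit from SK_GFP_KERNEL | SK_GFP_HIGHMEM, but only the allow-list also strips a bit that neither list names, which is one reading of why the author calls the mask preferable.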
      
      Signed-off-by: Giridhar Pemmasani (pgiri@yahoo.com)
      Cc: Martin J. Bligh <mbligh@google.com>
      Cc: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  11. 29 Oct, 2006 7 commits
    • [PATCH] Calculation fix for memory holes beyond the end of physical memory · 0c6cb974
      authored by Mel Gorman
      absent_pages_in_range() made the assumption that users of the
      arch-independent zone-sizing API would not care about holes beyond the end
      of physical memory.  This was not the case and was "fixed" in a patch
      called "Account for holes that are outside the range of physical memory".
      However, when given a range that started before a hole in "real" memory and
      ended beyond the end of memory, it would get the result wrong.  The bug is
      in mainline but a patch is below.
      
      It has been tested successfully on a number of machines and architectures.
      Additional credit to Keith Mannthey for discovering the problem, helping
      identify the correct fix and confirming it Worked For Him.
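
The problem case, a range that starts before a hole in real memory and ends beyond the end of memory, can be worked through with a small sketch. Note that this uses a different technique from the kernel function (it counts present pages and subtracts, rather than walking holes), and the struct and function names are hypothetical. It assumes the regions do not overlap and that range_end is not below range_start.

```c
/* A present memory region, as a half-open PFN interval [start_pfn, end_pfn). */
struct region_sketch {
    unsigned long start_pfn;
    unsigned long end_pfn;
};

/*
 * Count pages in [range_start, range_end) that no region covers.  Pages
 * past the last region are holes too, so a range that runs beyond the end
 * of memory is handled by the same subtraction.
 */
static unsigned long absent_pages_sketch(const struct region_sketch *regions,
                                         unsigned int nr_regions,
                                         unsigned long range_start,
                                         unsigned long range_end)
{
    unsigned long present = 0;

    for (unsigned int i = 0; i < nr_regions; i++) {
        /* Clip each region to the requested range. */
        unsigned long s = regions[i].start_pfn > range_start ?
                          regions[i].start_pfn : range_start;
        unsigned long e = regions[i].end_pfn < range_end ?
                          regions[i].end_pfn : range_end;

        if (s < e)
            present += e - s;
    }
    return (range_end - range_start) - present;
}
```

With memory at PFNs 0-100 and 200-300, the range 50-400 contains 150 present pages out of 350, so 200 pages are absent: 100 in the real hole and 100 beyond the end of memory.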
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: keith mannthey <kmannth@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] hugetlb: fix absurd HugePages_Rsvd · ebed4bfc
      authored by Hugh Dickins
      If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
      area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative".  Reinstate my
      preliminary i_size check before attempting to allocate the page (though this
      only fixes the most obvious case: more work will be needed here).
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] __vmalloc with GFP_ATOMIC causes 'sleeping from invalid context' · 52fd24ca
      authored by Giridhar Pemmasani
      If __vmalloc is called to allocate memory with GFP_ATOMIC in atomic
      context, the chain of calls results in __get_vm_area_node allocating memory
      for vm_struct with GFP_KERNEL, causing the 'sleeping from invalid context'
      warning.  This patch fixes it by passing the gfp flags along so
      __get_vm_area_node allocates memory for vm_struct with the same flags.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] memory hotplug: __GFP_NOWARN is better for __kmalloc_section_memmap() · f2d0aa5b
      authored by Yasunori Goto
      Add the __GFP_NOWARN flag to the __alloc_pages() call in
      __kmalloc_section_memmap().  This reduces noisy failure messages.

      On ia64, the section size is 1 GB, which means that order 8 pages are
      necessary for each section's memmap.  This is often a very hard requirement
      to meet under heavy memory pressure.  So __alloc_pages() gives up on the
      allocation and prints a noisy stack trace for each section that got no pages.
      (My current environment shows 32 stack traces....)

      But __kmalloc_section_memmap() calls vmalloc() after that failure, and
      vmalloc() can succeed in allocating the memmap.  So the stack trace warning
      is just noise.  I suppose it shouldn't be shown.
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Use min of two prio settings in calculating distress for reclaim · bbdb396a
      authored by Martin Bligh
      If try_to_free_pages / balance_pgdat are called with a gfp_mask specifying
      GFP_IO and/or GFP_FS, they will reclaim the requisite number of pages, and then
      reset prev_priority to DEF_PRIORITY (or to some other high (ie: unurgent)
      value).
      
      However, another reclaimer without those gfp_mask flags set (say, GFP_NOIO)
      may still be struggling to reclaim pages.  The concurrent overwrite of
      zone->prev_priority will cause this GFP_NOIO thread to unexpectedly cease
      deactivating mapped pages, thus causing reclaim difficulties.
      
      The fix is to key the distress calculation not only off zone->prev_priority,
      but also to take into account the local caller's priority, by using
      min(zone->prev_priority, sc->priority)
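
A sketch of the effect, assuming distress is computed by shifting 100 right by the priority (that formula and the DEF_PRIORITY value of 12 are assumptions of this example, not stated in the text above; lower priority numbers mean more urgency).

```c
#define DEF_PRIORITY_SKETCH 12   /* assumed default (least urgent) priority */

static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/*
 * Distress rises as priority falls.  Taking the minimum means a reset of
 * the shared zone value by another reclaimer cannot hide the urgency of
 * the caller's own priority.
 */
static int distress_sketch(int zone_prev_priority, int caller_priority)
{
    return 100 >> min_int(zone_prev_priority, caller_priority);
}
```

If another reclaimer has just reset the zone value to DEF_PRIORITY_SKETCH while the caller is struggling at priority 0, keying off the zone value alone gives 100 >> 12 = 0, whereas the minimum gives 100 >> 0 = 100.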
      Signed-off-by: Martin J. Bligh <mbligh@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] vmscan: Fix temp_priority race · 3bb1a852
      authored by Martin Bligh
      The temp_priority field in zone is racy, as we can walk through a reclaim
      path, and just before we copy it into prev_priority, it can be overwritten
      (say with DEF_PRIORITY) by another reclaimer.
      
      The same bug is contained in both try_to_free_pages and balance_pgdat, but
      it is fixed slightly differently.  In balance_pgdat, we keep a separate
      priority record per zone in a local array.  In try_to_free_pages there is
      no need to do this, as the priority level is the same for all zones that we
      reclaim from.
      
      Impact of this bug is that temp_priority is copied into prev_priority, and
      setting this artificially high causes reclaimers to set distress
      artificially low.  They then fail to reclaim mapped pages, when they are,
      in fact, under severe memory pressure (their priority may be as low as 0).
      This causes the OOM killer to fire incorrectly.
      
      From: Andrew Morton <akpm@osdl.org>
      
      __zone_reclaim() isn't modifying zone->prev_priority.  But zone->prev_priority
      is used in the decision whether or not to bring mapped pages onto the inactive
      list.  Hence there's a risk here that __zone_reclaim() will fail because
      zone->prev_priority is large (ie: low urgency) and lots of mapped pages end up
      stuck on the active list.
      
      Fix that up by decreasing (ie making more urgent) zone->prev_priority as
      __zone_reclaim() scans the zone's pages.
      
      This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created.  It should be
      possible to remove that now, and to just start out at DEF_PRIORITY?
      
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: clean up pagecache allocation · 2ae88149
      authored by Nick Piggin
      - Consolidate page_cache_alloc
      
      - Fix splice: only the pagecache pages and filesystem data need to use
        mapping_gfp_mask.
      
      - Fix grab_cache_page_nowait: same as splice, also honour NUMA placement.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      2ae88149
  12. 22 Oct, 2006 2 commits
  13. 21 Oct, 2006 6 commits
    • N
      [PATCH] mm: more commenting on lock ordering · 82591e6e
      Nick Piggin committed
      Clarify lockorder comments now that sys_msync drops mmap_sem before
      calling do_fsync.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      82591e6e
    • D
      [PATCH] mm: D-cache aliasing issue in cow_user_page · c4ec7b0d
      Dmitriy Monakhov committed
      
      From mm/memory.c:

        static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va)
        {
                /*
                 * If the source page was a PFN mapping, we don't have
                 * a "struct page" for it. We do a best-effort copy by
                 * just copying from the original user address. If that
                 * fails, we just zero-fill it. Live with it.
                 */
                if (unlikely(!src)) {
                        void *kaddr = kmap_atomic(dst, KM_USER0);
                        void __user *uaddr = (void __user *)(va & PAGE_MASK);

                        /*
                         * This really shouldn't fail, because the page is there
                         * in the page tables. But it might just be unreadable,
                         * in which case we just give up and fill the result with
                         * zeroes.
                         */
                        if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
                                memset(kaddr, 0, PAGE_SIZE);
                        kunmap_atomic(kaddr, KM_USER0);
        #### The D-cache has to be flushed here.
        #### It seems this was simply forgotten.
                        return;
                }
                copy_user_highpage(dst, src, va);
        #### OK here: flush_dcache_page() is called from this function if the arch needs it.
        }
      
      The following patch fixes this issue:
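
      The patch body is not reproduced in this log.  As a rough user-space
      model only, and assuming the fix amounts to flushing the destination
      page's D-cache after the copy or zero-fill, the intended ordering looks
      like this.  struct page_model, model_flush_dcache_page() and
      model_cow_from_user() are hypothetical stand-ins for the kernel
      primitives named in the excerpt.

```c
#include <string.h>

#define MODEL_PAGE_SIZE 64 /* small stand-in for PAGE_SIZE */

struct page_model {
	unsigned char data[MODEL_PAGE_SIZE];
	int dcache_flushes; /* counts calls to the stand-in flush */
};

/* Stand-in for flush_dcache_page(); it only records that it ran. */
static void model_flush_dcache_page(struct page_model *page)
{
	page->dcache_flushes++;
}

/*
 * Models the !src branch: copy from the user address if it is readable
 * (user_src != NULL), otherwise zero-fill, then flush the D-cache for
 * the destination page before returning.
 */
static void model_cow_from_user(struct page_model *dst,
				const unsigned char *user_src)
{
	if (user_src)
		memcpy(dst->data, user_src, MODEL_PAGE_SIZE);
	else
		memset(dst->data, 0, MODEL_PAGE_SIZE);

	/* The step the commit message says was forgotten. */
	model_flush_dcache_page(dst);
}
```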
      Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c4ec7b0d
    • A
      [PATCH] highest_possible_node_id() linkage fix · 6220ec78
      Andrew Morton committed
      Quoting Adrian:
      
      - net/sunrpc/svc.c uses highest_possible_node_id()
      
      - include/linux/nodemask.h says highest_possible_node_id() is
        out-of-line #if MAX_NUMNODES > 1
      
      - the out-of-line highest_possible_node_id() is in lib/cpumask.c
      
      - lib/Makefile: lib-$(CONFIG_SMP) += cpumask.o
        CONFIG_ARCH_DISCONTIGMEM_ENABLE=y, CONFIG_SMP=n, CONFIG_SUNRPC=y
      
      -> highest_possible_node_id() is used in net/sunrpc/svc.c
         CONFIG_NODES_SHIFT defined and > 0
      
      -> include/linux/numa.h: MAX_NUMNODES > 1
      
      -> compile error
      
      The bug is not present on architectures where ARCH_DISCONTIGMEM_ENABLE
      depends on NUMA (but m32r isn't the only affected architecture).
      
      So move the function into page_alloc.c
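
      For readers unfamiliar with the helper, a user-space sketch of what a
      function like highest_possible_node_id() computes is shown here.  The
      bitmask argument and the name highest_set_node() are assumptions made
      for illustration; the kernel version walks its own map of possible
      nodes rather than taking a parameter.

```c
#include <limits.h>

/* Number of node bits representable in this model's mask. */
#define MAX_NODES_MODEL (sizeof(unsigned long) * CHAR_BIT)

/*
 * Return the highest node number whose bit is set in possible_mask,
 * or 0 if the mask is empty.
 */
static int highest_set_node(unsigned long possible_mask)
{
	int highest = 0;

	for (unsigned int node = 0; node < MAX_NODES_MODEL; node++)
		if (possible_mask & (1UL << node))
			highest = (int)node;

	return highest;
}
```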
      
      Cc: Adrian Bunk <bunk@stusta.de>
      Cc: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6220ec78
    • A
      [PATCH] OOM killer meets userspace headers · 8ac773b4
      Alexey Dobriyan committed
      Although mm.h is not an exported header, it does contain one thing
      which is part of the userspace ABI -- the value that disables the OOM
      killer for a given process. So,
      a) create and export include/linux/oom.h
      b) move OOM_DISABLE define there.
      c) turn bounding values of /proc/$PID/oom_adj into defines and export
         them too.
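
      A sketch of how userspace might use such defines to validate a value
      before writing it to /proc/$PID/oom_adj.  The macros are defined locally
      so the example is self-contained; their numeric values (-17, -16, 15)
      are assumed to match the exported header and should be checked against
      it.  oom_adj_is_valid() is a hypothetical helper.

```c
#include <stdbool.h>

/* Assumed values; real code would take these from the exported header. */
#define OOM_DISABLE	(-17)	/* special value: never OOM-kill this process */
#define OOM_ADJUST_MIN	(-16)	/* lower bound of the ordinary range */
#define OOM_ADJUST_MAX	15	/* upper bound of the ordinary range */

/* Accept either the disable value or anything inside the bounds. */
static bool oom_adj_is_valid(int value)
{
	return value == OOM_DISABLE ||
	       (value >= OOM_ADJUST_MIN && value <= OOM_ADJUST_MAX);
}
```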
      
      Note: mass __KERNEL__ removal will be done later.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      8ac773b4
    • A
      [PATCH] separate bdi congestion functions from queue congestion functions · 3fcfab16
      Andrew Morton committed
      Separate out the concept of "queue congestion" from "backing-dev congestion".
      Congestion is a backing-dev concept, not a queue concept.
      
      The blk_* congestion functions are retained, as wrappers around the core
      backing-dev congestion functions.
      
      This proper layering is needed so that NFS can cleanly use the congestion
      functions, and so that CONFIG_BLOCK=n actually links.
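
      The layering can be pictured with a small user-space model.  All type
      and function names below are hypothetical; the sketch only shows the
      shape described above, with the congestion state and its core
      operations at the backing-dev level and the queue-level helpers kept
      as thin wrappers.

```c
#include <stdbool.h>

enum { MODEL_READ = 0, MODEL_WRITE = 1 };

/* Congestion state lives in the backing-dev object. */
struct backing_dev_model {
	unsigned long state; /* bit 0: read congested, bit 1: write congested */
};

/* A queue merely embeds a backing-dev object. */
struct request_queue_model {
	struct backing_dev_model backing_dev_info;
};

/* Core operations: usable by any backing device, block-based or not. */
static void bdi_model_set_congested(struct backing_dev_model *bdi, int rw)
{
	bdi->state |= 1UL << rw;
}

static void bdi_model_clear_congested(struct backing_dev_model *bdi, int rw)
{
	bdi->state &= ~(1UL << rw);
}

static bool bdi_model_congested(const struct backing_dev_model *bdi, int rw)
{
	return (bdi->state & (1UL << rw)) != 0;
}

/* Queue-level helpers retained as wrappers around the core operations. */
static void queue_model_set_congested(struct request_queue_model *q, int rw)
{
	bdi_model_set_congested(&q->backing_dev_info, rw);
}

static void queue_model_clear_congested(struct request_queue_model *q, int rw)
{
	bdi_model_clear_congested(&q->backing_dev_info, rw);
}
```

      A user with no request queue at all can call the bdi-level functions
      directly, which is the point of the split.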
      
      Cc: "Thomas Maier" <balagi@justmail.de>
      Cc: "Jens Axboe" <jens.axboe@oracle.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3fcfab16
    • J
      [PATCH] direct-io: sync and invalidate file region when falling back to buffered write · fb5527e6
      Jeff Moyer committed
      When direct-io falls back to buffered write, it will just leave the dirty data
      floating about in pagecache, pending regular writeback.
      
      But normal direct-io semantics are that IO is synchronous, and that it leaves
      no pagecache behind.
      
      So change the fallback-to-buffered-write code to sync the file region and to
      then strip away the pagecache, just as a regular direct-io write would do.
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fb5527e6
  14. 20 Oct, 2006 1 commit
  15. 17 Oct, 2006 2 commits
    • A
      [PATCH] vmalloc(): don't pass __GFP_ZERO to slab · 286e1ea3
      Andrew Morton committed
      A recent change to the vmalloc() code accidentally resulted in us passing
      __GFP_ZERO into the slab allocator.  But we only wanted __GFP_ZERO for the
      actual pages which are being vmalloc()ed, and passing __GFP_ZERO into slab is
      not a rational thing to ask for.
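
      In flag terms, the idea is to mask the zeroing bit out of the mask that
      reaches slab while leaving the caller's mask untouched for the page
      allocations.  The sketch below is a user-space model: MODEL_GFP_ZERO
      and its value are placeholders, not the kernel's definition, and
      slab_gfp() is a hypothetical helper.

```c
/* Placeholder bit standing in for __GFP_ZERO; the value is arbitrary. */
#define MODEL_GFP_ZERO 0x8000u

/* Flags to hand to the slab allocator: the caller's mask minus zeroing. */
static unsigned int slab_gfp(unsigned int gfp_mask)
{
	return gfp_mask & ~MODEL_GFP_ZERO;
}
```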
      
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      286e1ea3
    • D
      [PATCH] knfsd: add nfs-export support to tmpfs · 91828a40
      David M. Grimes committed
      We need to encode and decode the 'file' part of a handle.  We simply use the
      inode number and generation number to construct the filehandle.
      
      The generation number is the time when the file was created.  As inode numbers
      cycle through the full 32 bits before being reused, there is no real chance of
      the same inum being allocated to different files in the same second so this is
      suitably unique.  Using time-of-day rather than e.g.  jiffies makes it less
      likely that the same filehandle can be created after a reboot.
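
      A user-space sketch of the handle layout described above: two 32-bit
      words, the inode number followed by the generation.  struct inode_model,
      encode_fh() and fh_matches() are hypothetical names, and the word order
      is an assumption; the sketch only illustrates why a reused inode number
      with a different generation does not match an old handle.

```c
#include <stdbool.h>
#include <stdint.h>

#define FH_WORDS 2 /* inode number + generation */

struct inode_model {
	uint32_t ino;
	uint32_t generation; /* creation time in seconds, per the text */
};

/* Fill fh[] from the inode; returns words used, or -1 if fh is too small. */
static int encode_fh(const struct inode_model *inode, uint32_t *fh,
		     int max_words)
{
	if (max_words < FH_WORDS)
		return -1;
	fh[0] = inode->ino;
	fh[1] = inode->generation;
	return FH_WORDS;
}

/* Decode side: a candidate found by inum must also match the generation. */
static bool fh_matches(const uint32_t *fh, int words,
		       const struct inode_model *candidate)
{
	return words >= FH_WORDS &&
	       fh[0] == candidate->ino &&
	       fh[1] == candidate->generation;
}
```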
      
      In order to be able to decode a filehandle we need to be able to lookup by
      inum, which means that the inode needs to be added to the inode hash table
      (tmpfs doesn't currently hash inodes as there is never a need to lookup by
      inum).  To avoid overhead when not exporting, we only hash an inode when it is
      first exported.  This requires a lock to ensure it isn't hashed twice.
      
      This code is separate from the patch posted in June06 from Atal Shargorodsky
      which provided the same functionality, but does borrow slightly from it.
      
      Locking comment: Most filesystems that hash their inodes do so at the point
      where the 'struct inode' is initialised, and that has suitable locking
      (I_NEW).  Here in shmem, we are hashing the inode later, the first time we
      need an NFS file handle for it.  We no longer have I_NEW to ensure only one
      thread tries to add it to the hash table.
      
      Cc: Atal Shargorodsky <atal@codefidence.com>
      Cc: Gilad Ben-Yossef <gilad@codefidence.com>
      Signed-off-by: David M. Grimes <dgrimes@navisite.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      91828a40