1. 23 Jun 2006 (7 commits)
  2. 22 May 2006 (2 commits)
    • [PATCH] Align the node_mem_map endpoints to a MAX_ORDER boundary · e984bb43
      Authored by Bob Picco
      Andy added code to buddy allocator which does not require the zone's
      endpoints to be aligned to MAX_ORDER.  An issue is that the buddy allocator
      requires the node_mem_map's endpoints to be MAX_ORDER aligned.  Otherwise
      __page_find_buddy could compute a buddy not in node_mem_map for partial
      MAX_ORDER regions at zone's endpoints.  page_is_buddy will detect that
      these pages at endpoints are not PG_buddy (they were zeroed out by bootmem
      allocator and not part of zone).  Of course the negative here is we could
      waste a little memory but the positive is eliminating all the old checks
      for zone boundary conditions.
      
      SPARSEMEM won't encounter this issue because of MAX_ORDER size constraint
      when SPARSEMEM is configured.  ia64 VIRTUAL_MEM_MAP doesn't need the logic
      either because the holes and endpoints are handled differently.  This
      leaves checking alloc_remap and other arches which privately allocate for
      node_mem_map.
      Signed-off-by: Bob Picco <bob.picco@hp.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
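      The requirement follows from how the buddy allocator locates a page's buddy: __page_find_buddy flips one bit of the page index, so both halves of every MAX_ORDER block must fall inside node_mem_map. A minimal sketch of the arithmetic (the kernel works on struct page pointers and pfns; the standalone helpers here are illustrative):

```c
#include <assert.h>

#define MAX_ORDER 11  /* kernel default of the era; blocks span 1 << (MAX_ORDER - 1) pages */

/* Buddy of the block starting at page_idx, at the given order:
 * flip the bit that selects which half of the parent block we are in. */
static inline unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
    return page_idx ^ (1UL << order);
}

/* Round the node_mem_map endpoints outward to a MAX_ORDER boundary,
 * as the patch does, so every computed buddy stays inside the map. */
static inline unsigned long align_down_max_order(unsigned long pfn)
{
    return pfn & ~((1UL << (MAX_ORDER - 1)) - 1);
}

static inline unsigned long align_up_max_order(unsigned long pfn)
{
    unsigned long mask = (1UL << (MAX_ORDER - 1)) - 1;
    return (pfn + mask) & ~mask;
}
```

      With MAX_ORDER = 11 a block spans 1024 pages, so a map starting at pfn 1000 is extended down to 0 and its end rounded up to the next multiple of 1024; the wasted slack is the "little memory" the changelog mentions.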
    • [PATCH] Cpuset: might sleep checking zones allowed fix · bdd804f4
      Authored by Paul Jackson
      Fix a couple of infrequently encountered 'sleeping function called from
      invalid context' warnings in the cpuset hooks in __alloc_pages(), which
      could sleep while interrupts are disabled.
      
      The routine cpuset_zone_allowed() is called by code in mm/page_alloc.c
      __alloc_pages() to determine if a zone is allowed in the current task's
      cpuset.  This routine can sleep, for certain GFP_KERNEL allocations, if
      the zone is on a memory node that is not allowed in the current cpuset
      but might be allowed in a parent cpuset.
      
      But we can't sleep in __alloc_pages() if in interrupt, nor if called for a
      GFP_ATOMIC request (__GFP_WAIT not set in gfp_flags).
      
      The rule was intended to be:
        Don't call cpuset_zone_allowed() if you can't sleep, unless you
        pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
        the code that might scan up ancestor cpusets and sleep.
      
      This rule was being violated in a couple of places, due to a bogus change
      made (by myself, pj) to __alloc_pages() as part of the November 2005 effort
      to cleanup its logic, and also due to a later fix to constrain which swap
      daemons were awoken.
      
      The bogus change can be seen at:
        http://linux.derkeiler.com/Mailing-Lists/Kernel/2005-11/4691.html
        [PATCH 01/05] mm fix __alloc_pages cpuset ALLOC_* flags
      
      This was first noticed on a tight memory system, in code that was disabling
      interrupts and doing allocation requests with __GFP_WAIT not set, which
      resulted in __might_sleep() writing complaints to the log "Debug: sleeping
      function called ...", when the code in cpuset_zone_allowed() tried to take
      the callback_sem cpuset semaphore.
      
      We haven't seen a system hang on this 'might_sleep' yet, but we are at
      decent risk of seeing it fairly soon, especially since the additional
      cpuset_zone_allowed() check was added, conditioning wakeup_kswapd(), in
      March 2006.
      
      Special thanks to Dave Chinner, for figuring this out, and a tip of the hat
      to Nick Piggin who warned me of this back in Nov 2005, before I was ready
      to listen.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
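      The rule quoted above can be condensed into a small predicate that a caller must satisfy before invoking cpuset_zone_allowed(). A sketch with stand-in flag values (the real bits are defined in include/linux/gfp.h; the predicate name is invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the real gfp.h bits. */
#define __GFP_WAIT     0x10u
#define __GFP_HARDWALL 0x20000u

/* The rule from the changelog: calling cpuset_zone_allowed() is only
 * safe if the caller may sleep (__GFP_WAIT set and not in interrupt
 * context), or if __GFP_HARDWALL is set, which disables the ancestor
 * cpuset scan that might sleep. */
static bool cpuset_check_is_safe(unsigned int gfp_mask, bool in_interrupt)
{
    if (gfp_mask & __GFP_HARDWALL)
        return true;  /* hardwall-only check never sleeps */
    return (gfp_mask & __GFP_WAIT) && !in_interrupt;
}
```

      A GFP_ATOMIC request (no __GFP_WAIT) or any call with interrupts disabled fails the predicate unless __GFP_HARDWALL is passed, which is exactly the combination the bogus November 2005 change had violated.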
  3. 16 May 2006 (1 commit)
  4. 26 Apr 2006 (1 commit)
  5. 20 Apr 2006 (1 commit)
  6. 11 Apr 2006 (2 commits)
    • [PATCH] overcommit: add calculate_totalreserve_pages() · cb45b0e9
      Authored by Hideo AOKI
      These patches are an enhancement of OVERCOMMIT_GUESS algorithm in
      __vm_enough_memory().
      
      - why the kernel needed patching
      
        When the kernel can't allocate anonymous pages in practice, the
        current OVERCOMMIT_GUESS could return success. This implementation
        might be the cause of an OOM kill under memory pressure.
      
        If Linux runs with page reservation features like
        /proc/sys/vm/lowmem_reserve_ratio and without a swap region, I think
        the OOM kill occurs easily.
      
      - the overall design approach in the patch
      
        When the OVERCOMMIT_GUESS algorithm calculates the number of free pages,
        the reserved free pages are regarded as non-free pages.
      
        This change helps to avoid the pitfall that the number of free pages
        becomes less than the number the kernel tries to keep free.
      
      - testing results
      
        I tested the patches using my test kernel module.
      
        If the patches aren't applied to the kernel, __vm_enough_memory()
        returns success in that situation but the actual page allocation
        fails.
      
        On the other hand, if the patches are applied to the kernel, memory
        allocation failure is avoided since __vm_enough_memory() returns
        failure in the situation.
      
        I checked that on an i386 SMP machine with 16GB of memory. I haven't
        tested in a nommu environment yet.
      
      This patch adds totalreserve_pages for __vm_enough_memory().
      
      calculate_totalreserve_pages() checks the maximum lowmem_reserve pages
      and pages_high in each zone.  Finally, the function stores the sum over
      all zones in totalreserve_pages.
      
      totalreserve_pages is calculated when the VM is initialized, and the
      variable is updated when /proc/sys/vm/lowmem_reserve_ratio
      or /proc/sys/vm/min_free_kbytes is changed.
      Signed-off-by: Hideo Aoki <haoki@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
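      The calculation described above can be sketched as follows; the struct is cut down to the two fields the changelog mentions, and the names only loosely follow the 2006-era struct zone:

```c
#include <assert.h>

#define MAX_NR_ZONES 3  /* e.g. DMA, NORMAL, HIGHMEM */

/* Simplified zone: the pages_high watermark plus the lowmem_reserve[]
 * array holding the reserves protecting this zone from higher zones'
 * allocation fallbacks. */
struct zone {
    unsigned long pages_high;
    unsigned long lowmem_reserve[MAX_NR_ZONES];
};

/* Per the changelog: for each zone take its largest lowmem_reserve
 * entry, add pages_high, and sum over all zones. */
static unsigned long calculate_totalreserve_pages(const struct zone *zones, int nr)
{
    unsigned long total = 0;
    for (int i = 0; i < nr; i++) {
        unsigned long max_reserve = 0;
        for (int j = 0; j < MAX_NR_ZONES; j++)
            if (zones[i].lowmem_reserve[j] > max_reserve)
                max_reserve = zones[i].lowmem_reserve[j];
        total += max_reserve + zones[i].pages_high;
    }
    return total;
}
```

      OVERCOMMIT_GUESS then treats this total as non-free, so __vm_enough_memory() refuses requests that would eat into the reserved pages.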
    • [PATCH] Fix buddy list race that could lead to page lru list corruptions · 676165a8
      Authored by Nick Piggin
      Rohit found an obscure bug causing buddy list corruption.
      
      page_is_buddy is using a non-atomic test (PagePrivate && page_count == 0)
      to determine whether or not a free page's buddy is itself free and in the
      buddy lists.
      
      Each of the conjuncts may be true at different times due to unrelated
      conditions, so the non-atomic page_is_buddy test may find each conjunct to
      be true even if they were not both true at the same time (i.e. the page was
      not on the buddy lists).
      Signed-off-by: Martin Bligh <mbligh@google.com>
      Signed-off-by: Rohit Seth <rohitseth@google.com>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
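      The difference between the racy conjunction and a dedicated-flag test can be sketched like this (flag values and field names are invented for illustration; the real bits live in page-flags.h):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the real page flag bits. */
#define PG_private 0x1ul   /* pre-fix: (ab)used to mark buddy pages */
#define PG_buddy   0x2ul   /* the dedicated flag this patch introduces */

struct page {
    unsigned long flags;
    int count;
};

/* Pre-fix test: two independent conditions. Each can become true at
 * different moments for unrelated reasons, so the conjunction can read
 * true even though the page was never on the buddy lists. */
static bool page_is_buddy_racy(const struct page *p)
{
    return (p->flags & PG_private) && p->count == 0;
}

/* Post-fix test: one dedicated bit, set and cleared only by the
 * allocator under the zone lock, so a single read suffices. */
static bool page_is_buddy_fixed(const struct page *p)
{
    return (p->flags & PG_buddy) != 0;
}
```

      A page that happens to have PG_private set while its count transiently hits zero passes the racy test but not the fixed one, which is the corruption scenario Rohit observed.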
  7. 28 Mar 2006 (4 commits)
  8. 26 Mar 2006 (2 commits)
  9. 24 Mar 2006 (1 commit)
  10. 22 Mar 2006 (9 commits)
  11. 10 Mar 2006 (1 commit)
    • [PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. · 8fce4d8e
      Authored by Christoph Lameter
      The cache reaper currently tries to free all alien caches and all remote
      per cpu pages in each pass of cache_reap.  For machines with a large
      number of nodes (such as Altix) this may lead to sporadic delays of
      around ~10ms.  Interrupts are disabled while reclaiming, creating
      unacceptable delays.
      
      This patch changes that behavior by adding a per cpu reap_node variable.
      Instead of attempting to free all caches, we free only one alien cache and
      the per cpu pages from one remote node.  That reduces the time spent in
      cache_reap.  However, doing so will lengthen the time it takes to
      completely drain all remote per cpu pagesets and all alien caches.  The
      time needed will grow with the number of nodes in the system.  All caches
      are drained when they overflow their respective capacity.  So the drawback
      here is only that a bit of memory may be wasted for awhile longer.
      
      Details:
      
      1. Rename drain_remote_pages to drain_node_pages to allow specifying
         which node to drain of pcp pages.
      
      2. Add additional functions init_reap_node, next_reap_node for NUMA
         that manage a per cpu reap_node counter.
      
      3. Add a reap_alien function that reaps only from the current reap_node.
      
      For us this seems to be a critical issue.  Holdoffs of an average of ~7ms
      cause some HPC benchmarks to slow down significantly.  E.g. NAS parallel
      slows down dramatically: it has a 12-16 second runtime w/o the rotor
      compared to 5.8 secs with the rotor patches.  It gets down to 5.05 secs
      with the additional interrupt holdoff reductions.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
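      The rotor itself is just a per-cpu counter advanced once per cache_reap pass. A simplified sketch, using a plain modular increment in place of the kernel's next_node() walk over node_online_map:

```c
#include <assert.h>

/* Per-cpu rotor from the changelog: each cache_reap pass drains one
 * remote node's alien cache and pcp pages, then advances to the next
 * node, instead of draining every node in a single pass. */
static int next_reap_node(int reap_node, int nr_online_nodes)
{
    return (reap_node + 1) % nr_online_nodes;
}
```

      Every node is still visited, just spread over nr_online_nodes passes, which is why full drain time grows with node count while per-pass interrupt holdoff shrinks.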
  12. 21 Feb 2006 (1 commit)
    • [PATCH] Terminate process that fails on a constrained allocation · 9b0f8b04
      Authored by Christoph Lameter
      Some allocations are restricted to a limited set of nodes (due to memory
      policies or cpuset constraints).  If the page allocator is not able to find
      enough memory then that does not mean that overall system memory is low.
      
      In particular, going postal and more or less randomly shooting at
      processes is unlikely to help the situation and may just lead to suicide
      (the whole system coming down).
      
      It is better to signal to the process that no memory exists given the
      constraints that the process (or the configuration of the process) has
      placed on the allocation behavior.  The process may be killed but then the
      sysadmin or developer can investigate the situation.  The solution is
      similar to what we do when running out of hugepages.
      
      This patch adds a check before we kill processes.  At that point
      performance considerations do not matter much so we just scan the zonelist
      and reconstruct a list of nodes.  If the list of nodes does not contain all
      online nodes then this is a constrained allocation and we should kill the
      current process.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
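      The check described in the last paragraph can be sketched with plain bitmasks standing in for nodemask_t (the function name and the flattened zonelist-as-node-array representation are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Node bitmask, as a stand-in for the kernel's nodemask_t. */
typedef unsigned long nodemask_t;

/* Reconstruct the set of nodes reachable through the zonelist and
 * compare it with the online nodes, as the OOM-path check does: if
 * some online node is missing, the allocation was constrained by a
 * cpuset or memory policy, so kill the current process rather than
 * picking an unrelated victim. */
static bool constrained_alloc(const int *zonelist_nodes, int nr_zones,
                              nodemask_t online_nodes)
{
    nodemask_t allowed = 0;
    for (int i = 0; i < nr_zones; i++)
        allowed |= 1UL << zonelist_nodes[i];
    return allowed != online_nodes;
}
```

      Performance doesn't matter on this path, which is why a full rescan of the zonelist right before the kill decision is acceptable.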
  13. 18 Feb 2006 (1 commit)
  14. 15 Feb 2006 (2 commits)
    • [PATCH] compound page: default destructor · d98c7a09
      Authored by Hugh Dickins
      Somehow I imagined that calling a NULL destructor would free a compound page
      rather than oopsing.  No, we must supply a default destructor, __free_pages_ok
      using the order noted by prep_compound_page.  hugetlb can still replace this
      as before with its own free_huge_page pointer.
      
      The case that needs this is not common: rarely does put_compound_page's
      put_page_testzero bring the count down to 0.  But if get_user_pages is applied
      to some part of a compound page, without immediate release (e.g.  AIO or
      Infiniband), then it's possible for its put_page to come after the containing
      vma has been unmapped and the driver done its free_pages.
      
      That's just the kind of case compound pages are supposed to be guarding
      against (but Nick points out, nor did PageReserved handle this right).
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] compound page: use page[1].lru · 41d78ba5
      Authored by Hugh Dickins
      If a compound page has its own put_page_testzero destructor (the only current
      example is free_huge_page), that is noted in page[1].mapping of the compound
      page.  But that's rather a poor place to keep it: functions which call
      set_page_dirty_lock after get_user_pages (e.g.  Infiniband's
      __ib_umem_release) ought to be checking first, otherwise set_page_dirty is
      liable to crash on what's not the address of a struct address_space.
      
      And now I'm about to make that worse: it turns out that every compound page
      needs a destructor, so we can no longer rely on hugetlb pages going their own
      special way, to avoid further problems of page->mapping reuse.  For example,
      not many people know that: on 50% of i386 -Os builds, the first tail page of a
      compound page purports to be PageAnon (when its destructor has an odd
      address), which surprises page_add_file_rmap.
      
      Keep the compound page destructor in page[1].lru.next instead.  And to free up
      the common pairing of mapping and index, also move compound page order from
      index to lru.prev.  Slab reuses page->lru too: but if we ever need slab to use
      compound pages, it can easily stack its use above this.
      
      (akpm: decoded version of the above: the tail pages of a compound page now
      have ->mapping==NULL, so there's no need for the set_page_dirty[_lock]()
      caller to check that they're not compound pages before doing the dirty).
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
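      The new layout can be sketched as follows; struct page and list_head are cut down to the fields the changelog touches, and the helper names only loosely follow the kernel's:

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures. */
struct list_head { void *next, *prev; };
struct page {
    void *mapping;
    unsigned long index;
    struct list_head lru;
};

typedef void (*compound_dtor_t)(struct page *);

/* After this patch the destructor lives in page[1].lru.next and the
 * order in page[1].lru.prev, leaving page[1].mapping NULL so that
 * set_page_dirty[_lock]() callers need no compound-page check. */
static void prep_compound_page(struct page pages[], unsigned long order,
                               compound_dtor_t dtor)
{
    pages[1].lru.next = (void *)dtor;
    pages[1].lru.prev = (void *)order;
    pages[1].mapping = NULL;
}

static compound_dtor_t compound_dtor(struct page pages[])
{
    return (compound_dtor_t)pages[1].lru.next;
}

static unsigned long compound_order(struct page pages[])
{
    return (unsigned long)pages[1].lru.prev;
}
```

      Stashing both values in page[1].lru frees up the mapping/index pair, and since slab also reuses page->lru, any future slab use of compound pages can simply stack above this.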
  15. 06 Feb 2006 (1 commit)
  16. 02 Feb 2006 (1 commit)
  17. 19 Jan 2006 (1 commit)
  18. 17 Jan 2006 (1 commit)
  19. 13 Jan 2006 (1 commit)