1. 26 4月, 2006 1 次提交
  2. 20 4月, 2006 1 次提交
  3. 11 4月, 2006 2 次提交
    • H
      [PATCH] overcommit: add calculate_totalreserve_pages() · cb45b0e9
      Hideo AOKI 提交于
      These patches are an enhancement of OVERCOMMIT_GUESS algorithm in
      __vm_enough_memory().
      
      - why the kernel needed patching
      
        When the kernel can't allocate anonymous pages in practice, currnet
        OVERCOMMIT_GUESS could return success. This implementation might be
        the cause of oom kill in memory pressure situation.
      
        If the Linux runs with page reservation features like
        /proc/sys/vm/lowmem_reserve_ratio and without swap region, I think
        the oom kill occurs easily.
      
      - the overall design approach in the patch
      
        When the OVERCOMMET_GUESS algorithm calculates number of free pages,
        the reserved free pages are regarded as non-free pages.
      
        This change helps to avoid the pitfall that the number of free pages
        become less than the number which the kernel tries to keep free.
      
      - testing results
      
        I tested the patches using my test kernel module.
      
        If the patches aren't applied to the kernel, __vm_enough_memory()
        returns success in the situation but autual page allocation is
        failed.
      
        On the other hand, if the patches are applied to the kernel, memory
        allocation failure is avoided since __vm_enough_memory() returns
        failure in the situation.
      
        I checked that on i386 SMP 16GB memory machine. I haven't tested on
        nommu environment currently.
      
      This patch adds totalreserve_pages for __vm_enough_memory().
      
      Calculate_totalreserve_pages() checks maximum lowmem_reserve pages and
      pages_high in each zone. Finally, the function stores the sum of each
      zone to totalreserve_pages.
      
      The totalreserve_pages is calculated when the VM is initilized.
      And the variable is updated when /proc/sys/vm/lowmem_reserve_raito
      or /proc/sys/vm/min_free_kbytes are changed.
      Signed-off-by: NHideo Aoki <haoki@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      cb45b0e9
    • N
      [PATCH] Fix buddy list race that could lead to page lru list corruptions · 676165a8
      Nick Piggin 提交于
      Rohit found an obscure bug causing buddy list corruption.
      
      page_is_buddy is using a non-atomic test (PagePrivate && page_count == 0)
      to determine whether or not a free page's buddy is itself free and in the
      buddy lists.
      
      Each of the conjuncts may be true at different times due to unrelated
      conditions, so the non-atomic page_is_buddy test may find each conjunct to
      be true even if they were not both true at the same time (ie. the page was
      not on the buddy lists).
      Signed-off-by: NMartin Bligh <mbligh@google.com>
      Signed-off-by: NRohit Seth <rohitseth@google.com>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      676165a8
  4. 28 3月, 2006 4 次提交
  5. 26 3月, 2006 2 次提交
  6. 24 3月, 2006 1 次提交
  7. 22 3月, 2006 9 次提交
  8. 10 3月, 2006 1 次提交
    • C
      [PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. · 8fce4d8e
      Christoph Lameter 提交于
      The cache reaper currently tries to free all alien caches and all remote
      per cpu pages in each pass of cache_reap.  For a machines with large number
      of nodes (such as Altix) this may lead to sporadic delays of around ~10ms.
      Interrupts are disabled while reclaiming creating unacceptable delays.
      
      This patch changes that behavior by adding a per cpu reap_node variable.
      Instead of attempting to free all caches, we free only one alien cache and
      the per cpu pages from one remote node.  That reduces the time spend in
      cache_reap.  However, doing so will lengthen the time it takes to
      completely drain all remote per cpu pagesets and all alien caches.  The
      time needed will grow with the number of nodes in the system.  All caches
      are drained when they overflow their respective capacity.  So the drawback
      here is only that a bit of memory may be wasted for awhile longer.
      
      Details:
      
      1. Rename drain_remote_pages to drain_node_pages to allow the specification
         of the node to drain of pcp pages.
      
      2. Add additional functions init_reap_node, next_reap_node for NUMA
         that manage a per cpu reap_node counter.
      
      3. Add a reap_alien function that reaps only from the current reap_node.
      
      For us this seems to be a critical issue.  Holdoffs of an average of ~7ms
      cause some HPC benchmarks to slow down significantly.  F.e.  NAS parallel
      slows down dramatically.  NAS parallel has a 12-16 seconds runtime w/o rotor
      compared to 5.8 secs with the rotor patches.  It gets down to 5.05 secs with
      the additional interrupt holdoff reductions.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8fce4d8e
  9. 21 2月, 2006 1 次提交
    • C
      [PATCH] Terminate process that fails on a constrained allocation · 9b0f8b04
      Christoph Lameter 提交于
      Some allocations are restricted to a limited set of nodes (due to memory
      policies or cpuset constraints).  If the page allocator is not able to find
      enough memory then that does not mean that overall system memory is low.
      
      In particular going postal and more or less randomly shooting at processes
      is not likely going to help the situation but may just lead to suicide (the
      whole system coming down).
      
      It is better to signal to the process that no memory exists given the
      constraints that the process (or the configuration of the process) has
      placed on the allocation behavior.  The process may be killed but then the
      sysadmin or developer can investigate the situation.  The solution is
      similar to what we do when running out of hugepages.
      
      This patch adds a check before we kill processes.  At that point
      performance considerations do not matter much so we just scan the zonelist
      and reconstruct a list of nodes.  If the list of nodes does not contain all
      online nodes then this is a constrained allocation and we should kill the
      current process.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9b0f8b04
  10. 18 2月, 2006 1 次提交
  11. 15 2月, 2006 2 次提交
    • H
      [PATCH] compound page: default destructor · d98c7a09
      Hugh Dickins 提交于
      Somehow I imagined that calling a NULL destructor would free a compound page
      rather than oopsing.  No, we must supply a default destructor, __free_pages_ok
      using the order noted by prep_compound_page.  hugetlb can still replace this
      as before with its own free_huge_page pointer.
      
      The case that needs this is not common: rarely does put_compound_page's
      put_page_testzero bring the count down to 0.  But if get_user_pages is applied
      to some part of a compound page, without immediate release (e.g.  AIO or
      Infiniband), then it's possible for its put_page to come after the containing
      vma has been unmapped and the driver done its free_pages.
      
      That's just the kind of case compound pages are supposed to be guarding
      against (but Nick points out, nor did PageReserved handle this right).
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d98c7a09
    • H
      [PATCH] compound page: use page[1].lru · 41d78ba5
      Hugh Dickins 提交于
      If a compound page has its own put_page_testzero destructor (the only current
      example is free_huge_page), that is noted in page[1].mapping of the compound
      page.  But that's rather a poor place to keep it: functions which call
      set_page_dirty_lock after get_user_pages (e.g.  Infiniband's
      __ib_umem_release) ought to be checking first, otherwise set_page_dirty is
      liable to crash on what's not the address of a struct address_space.
      
      And now I'm about to make that worse: it turns out that every compound page
      needs a destructor, so we can no longer rely on hugetlb pages going their own
      special way, to avoid further problems of page->mapping reuse.  For example,
      not many people know that: on 50% of i386 -Os builds, the first tail page of a
      compound page purports to be PageAnon (when its destructor has an odd
      address), which surprises page_add_file_rmap.
      
      Keep the compound page destructor in page[1].lru.next instead.  And to free up
      the common pairing of mapping and index, also move compound page order from
      index to lru.prev.  Slab reuses page->lru too: but if we ever need slab to use
      compound pages, it can easily stack its use above this.
      
      (akpm: decoded version of the above: the tail pages of a compound page now
      have ->mapping==NULL, so there's no need for the set_page_dirty[_lock]()
      caller to check that they're not compund pages before doing the dirty).
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      41d78ba5
  12. 06 2月, 2006 1 次提交
  13. 02 2月, 2006 1 次提交
  14. 19 1月, 2006 1 次提交
  15. 17 1月, 2006 1 次提交
  16. 13 1月, 2006 1 次提交
  17. 12 1月, 2006 3 次提交
  18. 10 1月, 2006 1 次提交
  19. 09 1月, 2006 6 次提交
    • P
      [PATCH] cpuset: memory pressure meter · 3e0d98b9
      Paul Jackson 提交于
      Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
      that the tasks in a cpuset call try_to_free_pages(), the synchronous
      (direct) memory reclaim code.
      
      This enables batch managers monitoring jobs running in dedicated cpusets to
      efficiently detect what level of memory pressure that job is causing.
      
      This is useful both on tightly managed systems running a wide mix of
      submitted jobs, which may choose to terminate or reprioritize jobs that are
      trying to use more memory than allowed on the nodes assigned them, and with
      tightly coupled, long running, massively parallel scientific computing jobs
      that will dramatically fail to meet required performance goals if they
      start to use more memory than allowed to them.
      
      This patch just provides a very economical way for the batch manager to
      monitor a cpuset for signs of memory pressure.  It's up to the batch
      manager or other user code to decide what to do about it and take action.
      
      ==> Unless this feature is enabled by writing "1" to the special file
          /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
          code of __alloc_pages() for this metric reduces to simply noticing
          that the cpuset_memory_pressure_enabled flag is zero.  So only
          systems that enable this feature will compute the metric.
      
      Why a per-cpuset, running average:
      
          Because this meter is per-cpuset, rather than per-task or mm, the
          system load imposed by a batch scheduler monitoring this metric is
          sharply reduced on large systems, because a scan of the tasklist can be
          avoided on each set of queries.
      
          Because this meter is a running average, instead of an accumulating
          counter, a batch scheduler can detect memory pressure with a single
          read, instead of having to read and accumulate results for a period of
          time.
      
          Because this meter is per-cpuset rather than per-task or mm, the
          batch scheduler can obtain the key information, memory pressure in a
          cpuset, with a single read, rather than having to query and accumulate
          results over all the (dynamically changing) set of tasks in the cpuset.
      
      A per-cpuset simple digital filter (requires a spinlock and 3 words of data
      per-cpuset) is kept, and updated by any task attached to that cpuset, if it
      enters the synchronous (direct) page reclaim code.
      
      A per-cpuset file provides an integer number representing the recent
      (half-life of 10 seconds) rate of direct page reclaims caused by the tasks
      in the cpuset, in units of reclaims attempted per second, times 1000.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3e0d98b9
    • N
      [PATCH] mm: free_pages opt · 48db57f8
      Nick Piggin 提交于
      Try to streamline free_pages_bulk by ensuring callers don't pass in a
      'count' that exceeds the list size.
      
      Some cleanups:
      Rename __free_pages_bulk to __free_one_page.
      Put the page list manipulation from __free_pages_ok into free_one_page.
      Make __free_pages_ok static.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      48db57f8
    • N
      [PATCH] mm: cleanup zone_pcp · 23316bc8
      Nick Piggin 提交于
      Use zone_pcp everywhere even though NUMA code "knows" the internal details
      of the zone.  Stop other people trying to copy, and it looks nicer.
      
      Also, only print the pagesets of online cpus in zoneinfo.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: "Seth, Rohit" <rohit.seth@intel.com>
      Cc: Christoph Lameter <christoph@lameter.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      23316bc8
    • R
      [PATCH] Make high and batch sizes of per_cpu_pagelists configurable · 8ad4b1fb
      Rohit Seth 提交于
      As recently there has been lot of traffic on the right values for batch and
      high water marks for per_cpu_pagelists.  This patch makes these two
      variables configurable through /proc interface.
      
      A new tunable /proc/sys/vm/percpu_pagelist_fraction is added.  This entry
      controls the fraction of pages at most in each zone that are allocated for
      each per cpu page list.  The min value for this is 8.  It means that we
      don't allow more than 1/8th of pages in each zone to be allocated in any
      single per_cpu_pagelist.
      
      The batch value of each per cpu pagelist is also updated as a result.  It
      is set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8)
      Signed-off-by: NRohit Seth <rohit.seth@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8ad4b1fb
    • C
      [PATCH] slab: remove nested #ifdef CONFIG_NUMA · bec6b0c8
      Christoph Lameter 提交于
      For some reason there is an #ifdef CONFIG_NUMA within another #ifdef
      CONFIG_NUMA in the page allocator.  Remove innermost #ifdef CONFIG_NUMA
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      bec6b0c8
    • A
      [PATCH] revert "mm: page_state fixes" · 84c2008a
      Andrew Morton 提交于
      Hugh says:
      
      page_alloc_cpu_notify() specifically contains code to
      
       		/* Add dead cpu's page_states to our own. */
      
      which handles this more efficiently.
      
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      84c2008a