1. 29 11月, 2005 2 次提交
    • A
      [PATCH] shrinker->nr = LONG_MAX means deadlock for icache · ea164d73
      Andrea Arcangeli 提交于
      With Andrew Morton <akpm@osdl.org>
      
      The slab scanning code tries to balance the scanning rate of slabs versus the
      scanning rate of LRU pages.  To do this, it retains state concerning how many
      slabs have been scanned - if a particular slab shrinker didn't scan enough
      objects, we remember that for next time, and scan more objects on the next
      pass.
      
      The problem with this is that with (say) a huge number of GFP_NOIO
      direct-reclaim attempts, the number of objects which are to be scanned when we
      finally get a GFP_KERNEL request can be huge.  Because some shrinker handlers
      just bail out if !__GFP_FS.
      
      So the patch clamps the number of objects-to-be-scanned to 2* the total number
      of objects in the slab cache.
      Signed-off-by: NAndrea Arcangeli <andrea@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ea164d73
    • R
      [PATCH] temporarily disable swap token on memory pressure · f7b7fd8f
      Rik van Riel 提交于
      Some users (hi Zwane) have seen a problem when running a workload that
      eats nearly all of physical memory - th system does an OOM kill, even
      when there is still a lot of swap free.
      
      The problem appears to be a very big task that is holding the swap
      token, and the VM has a very hard time finding any other page in the
      system that is swappable.
      
      Instead of ignoring the swap token when sc->priority reaches 0, we could
      simply take the swap token away from the memory hog and make sure we
      don't give it back to the memory hog for a few seconds.
      
      This patch resolves the problem Zwane ran into.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f7b7fd8f
  2. 14 11月, 2005 1 次提交
  3. 30 10月, 2005 2 次提交
    • H
      [PATCH] mm: split page table lock · 4c21e2f2
      Hugh Dickins 提交于
      Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
      a many-threaded application which concurrently initializes different parts of
      a large anonymous area.
      
      This patch corrects that, by using a separate spinlock per page table page, to
      guard the page table entries in that page, instead of using the mm's single
      page_table_lock.  (But even then, page_table_lock is still used to guard page
      table allocation, and anon_vma allocation.)
      
      In this implementation, the spinlock is tucked inside the struct page of the
      page table page: with a BUILD_BUG_ON in case it overflows - which it would in
      the case of 32-bit PA-RISC with spinlock debugging enabled.
      
      Splitting the lock is not quite for free: another cacheline access.  Ideally,
      I suppose we would use split ptlock only for multi-threaded processes on
      multi-cpu machines; but deciding that dynamically would have its own costs.
      So for now enable it by config, at some number of cpus - since the Kconfig
      language doesn't support inequalities, let preprocessor compare that with
      NR_CPUS.  But I don't think it's worth being user-configurable: for good
      testing of both split and unsplit configs, split now at 4 cpus, and perhaps
      change that to 8 later.
      
      There is a benefit even for singly threaded processes: kswapd can be attacking
      one part of the mm while another part is busy faulting.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4c21e2f2
    • L
      [PATCH] shrink_list(): skip anon pages if not may_swap · c340010e
      Lee Schermerhorn 提交于
      Martin Hicks' page cache reclaim patch added the 'may_swap' flag to the
      scan_control struct; and modified shrink_list() not to add anon pages to
      the swap cache if may_swap is not asserted.
      
      Ref:  http://marc.theaimsgroup.com/?l=linux-mm&m=111461480725322&w=4
      
      However, further down, if the page is mapped, shrink_list() calls
      try_to_unmap() which will call try_to_unmap_one() via try_to_unmap_anon ().
       try_to_unmap_one() will BUG_ON() an anon page that is NOT in the swap
      cache.  Martin says he never encountered this path in his testing, but
      agrees that it might happen.
      
      This patch modifies shrink_list() to skip anon pages that are not already
      in the swap cache when !may_swap, rather than just not adding them to the
      cache.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c340010e
  4. 28 10月, 2005 1 次提交
  5. 17 10月, 2005 1 次提交
    • L
      Fix memory ordering bug in page reclaim · 3d80636a
      Linus Torvalds 提交于
      As noticed by Nick Piggin, we need to make sure that we check the page
      count before we check for PageDirty, since the dirty check is only valid
      if the count implies that we're the only possible ones holding the page.
      
      We always did do this, but the code needs a read-memory-barrier to make
      sure that the orderign is also honored by the CPU.
      
      (The writer side is ordered due to the atomic decrement and test on the
      page count, see the discussion on linux-kernel)
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3d80636a
  6. 13 9月, 2005 1 次提交
  7. 08 9月, 2005 1 次提交
    • P
      [PATCH] cpusets: formalize intermediate GFP_KERNEL containment · 9bf2229f
      Paul Jackson 提交于
      This patch makes use of the previously underutilized cpuset flag
      'mem_exclusive' to provide what amounts to another layer of memory placement
      resolution.  With this patch, there are now the following four layers of
      memory placement available:
      
       1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
       2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
       3) The current tasks cpuset (GFP_USER allocations constrained to here), and
       4) Specific node placement, using mbind and set_mempolicy.
      
      These nest - each layer is a subset (same or within) of the previous.
      
      Layer (2) above is new, with this patch.  The call used to check whether a
      zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
      extended to take a gfp_mask argument, and its logic is extended, in the case
      that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
      hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
      placement is allowed.  The definition of GFP_USER, which used to be identical
      to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
      cpuset_gfp_hardwall_flag patch.
      
      GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks
      cpuset, so long as any node therein is not too tight on memory, but will
      escape to the larger layer, if need be.
      
      The intended use is to allow something like a batch manager to handle several
      jobs, each job in its own cpuset, but using common kernel memory for caches
      and such.  Swapper and oom_kill activity is also constrained to Layer (2).  A
      task in or below one mem_exclusive cpuset should not cause swapping on nodes
      in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
      task in another such cpuset.  Heavy use of kernel memory for i/o caching and
      such by one job should not impact the memory available to jobs in other
      non-overlapping mem_exclusive cpusets.
      
      This patch enables providing hardwall, inescapable cpusets for memory
      allocations of each job, while sharing kernel memory allocations between
      several jobs, in an enclosing mem_exclusive cpuset.
      
      Like Dinakar's patch earlier to enable administering sched domains using the
      cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
      that had previously done nothing much useful other than restrict what cpuset
      configurations were allowed.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9bf2229f
  8. 05 9月, 2005 2 次提交
  9. 29 6月, 2005 1 次提交
  10. 26 6月, 2005 1 次提交
    • C
      [PATCH] Cleanup patch for process freezing · 3e1d1d28
      Christoph Lameter 提交于
      1. Establish a simple API for process freezing defined in linux/include/sched.h:
      
         frozen(process)		Check for frozen process
         freezing(process)		Check if a process is being frozen
         freeze(process)		Tell a process to freeze (go to refrigerator)
         thaw_process(process)	Restart process
         frozen_process(process)	Process is frozen now
      
      2. Remove all references to PF_FREEZE and PF_FROZEN from all
         kernel sources except sched.h
      
      3. Fix numerous locations where try_to_freeze is manually done by a driver
      
      4. Remove the argument that is no longer necessary from two function calls.
      
      5. Some whitespace cleanup
      
      6. Clear potential race in refrigerator (provides an open window of PF_FREEZE
         cleared before setting PF_FROZEN, recalc_sigpending does not check
         PF_FROZEN).
      
      This patch does not address the problem of freeze_processes() violating the rule
      that a task may only modify its own flags by setting PF_FREEZE. This is not clean
      in an SMP environment. freeze(process) is therefore not SMP safe!
      Signed-off-by: NChristoph Lameter <christoph@lameter.com>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3e1d1d28
  11. 22 6月, 2005 5 次提交
    • D
      [PATCH] vm: try_to_free_pages unused argument · 1ad539b2
      Darren Hart 提交于
      try_to_free_pages accepts a third argument, order, but hasn't used it since
      before 2.6.0.  The following patch removes the argument and updates all the
      calls to try_to_free_pages.
      Signed-off-by: NDarren Hart <dvhltc@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1ad539b2
    • M
      [PATCH] VM: rate limit early reclaim · 1e7e5a90
      Martin Hicks 提交于
      When early zone reclaim is turned on the LRU is scanned more frequently when a
      zone is low on memory.  This limits when the zone reclaim can be called by
      skipping the scan if another thread (either via kswapd or sync reclaim) is
      already reclaiming from the zone.
      Signed-off-by: NMartin Hicks <mort@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1e7e5a90
    • M
      [PATCH] VM: early zone reclaim · 753ee728
      Martin Hicks 提交于
      This is the core of the (much simplified) early reclaim.  The goal of this
      patch is to reclaim some easily-freed pages from a zone before falling back
      onto another zone.
      
      One of the major uses of this is NUMA machines.  With the default allocator
      behavior the allocator would look for memory in another zone, which might be
      off-node, before trying to reclaim from the current zone.
      
      This adds a zone tuneable to enable early zone reclaim.  It is selected on a
      per-zone basis and is turned on/off via syscall.
      
      Adding some extra throttling on the reclaim was also required (patch
      4/4).  Without the machine would grind to a crawl when doing a "make -j"
      kernel build.  Even with this patch the System Time is higher on
      average, but it seems tolerable.  Here are some numbers for kernbench
      runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:
      
      			wall  user   sys   %cpu  ctx sw.  sleeps
      			----  ----   ---   ----   ------  ------
      No patch		1009  1384   847   258   298170   504402
      w/patch, no reclaim     880   1376   667   288   254064   396745
      w/patch & reclaim       1079  1385   926   252   291625   548873
      
      These numbers are the average of 2 runs of 3 "make -j" runs done right
      after system boot.  Run-to-run variability for "make -j" is huge, so
      these numbers aren't terribly useful except to seee that with reclaim
      the benchmark still finishes in a reasonable amount of time.
      
      I also looked at the NUMA hit/miss stats for the "make -j" runs and the
      reclaim doesn't make any difference when the machine is thrashing away.
      
      Doing a "make -j8" on a single node that is filled with page cache pages
      takes 700 seconds with reclaim turned on and 735 seconds without reclaim
      (due to remote memory accesses).
      
      The simple zone_reclaim syscall program is at
      http://www.bork.org/~mort/sgi/zone_reclaim.cSigned-off-by: NMartin Hicks <mort@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      753ee728
    • M
      [PATCH] VM: add may_swap flag to scan_control · bfbb38fb
      Martin Hicks 提交于
      Here's the next round of these patches.  These are totally different in
      an attempt to meet the "simpler" request after the last patches.  For
      reference the earlier threads are:
      
      http://marc.theaimsgroup.com/?l=linux-kernel&m=110839604924587&w=2
      http://marc.theaimsgroup.com/?l=linux-mm&m=111461480721249&w=2
      
      This set of patches replaces my other vm- patches that are currently in
      -mm.  So they're against 2.6.12-rc5-mm1 about half way through the -mm
      patchset.
      
      As I said already this patch is a lot simpler.  The reclaim is turned on
      or off on a per-zone basis using a syscall.  I haven't tested the x86
      syscall, so it might be wrong.  It uses the existing reclaim/pageout
      code with the small addition of a may_swap flag to scan_control
      (patch 1/4).
      
      I also added __GFP_NORECLAIM (patch 3/4) so that certain allocation
      types can be flagged to never cause reclaim.  This was a deficiency
      that was in all of my earlier patch sets.  Previously, doing a big
      buffered read would fill one zone with page cache and then start to
      reclaim from that same zone, leaving the other zones untouched.
      
      Adding some extra throttling on the reclaim was also required (patch
      4/4).  Without the machine would grind to a crawl when doing a "make -j"
      kernel build.  Even with this patch the System Time is higher on
      average, but it seems tolerable.  Here are some numbers for kernbench
      runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:
      
      			wall  user   sys   %cpu  ctx sw.  sleeps
      			----  ----   ---   ----   ------  ------
      No patch		1009  1384   847   258   298170   504402
      w/patch, no reclaim     880   1376   667   288   254064   396745
      w/patch & reclaim       1079  1385   926   252   291625   548873
      
      These numbers are the average of 2 runs of 3 "make -j" runs done right
      after system boot.  Run-to-run variability for "make -j" is huge, so
      these numbers aren't terribly useful except to seee that with reclaim
      the benchmark still finishes in a reasonable amount of time.
      
      I also looked at the NUMA hit/miss stats for the "make -j" runs and the
      reclaim doesn't make any difference when the machine is thrashing away.
      
      Doing a "make -j8" on a single node that is filled with page cache pages
      takes 700 seconds with reclaim turned on and 735 seconds without reclaim
      (due to remote memory accesses).
      
      The simple zone_reclaim syscall program is at
      http://www.bork.org/~mort/sgi/zone_reclaim.c
      
      This patch:
      
      This adds an extra switch to the scan_control struct.  It simply lets the
      reclaim code know if its allowed to swap pages out.
      
      This was required for a simple per-zone reclaimer.  Without this addition
      pages would be swapped out as soon as a zone ran out of memory and the early
      reclaim kicked in.
      Signed-off-by: NMartin Hicks <mort@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      bfbb38fb
    • A
      [PATCH] vmscan: notice slab shrinking · b15e0905
      akpm@osdl.org 提交于
      Fix a problem identified by Andrea Arcangeli <andrea@suse.de>
      
      kswapd will set a zone into all_unreclaimable state if it sees that we're not
      successfully reclaiming LRU pages.  But that fails to notice that we're
      successfully reclaiming slab obects, so we can set all_unreclaimable too soon.
      
      So change shrink_slab() to return a success indication if it actually
      reclaimed some objects, and don't assume that the zone is all_unreclaimable if
      that is true.  This means that we won't enter all_unreclaimable state if we
      are successfully freeing slab objects but we're not yet actually freeing slab
      pages, due to internal fragmentation.
      
      (hm, this has a shortcoming.  We could be successfully freeing ZONE_NORMAL
      slab objects while being really oom on ZONE_DMA.  If that happens then kswapd
      might burn a lot of CPU.  But given that there might be some slab objects in
      ZONE_DMA, perhaps that is appropriate.)
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b15e0905
  12. 17 4月, 2005 2 次提交