1. 20 10月, 2008 23 次提交
    • L
      vmscan: unevictable LRU scan sysctl · af936a16
      Lee Schermerhorn 提交于
      This patch adds a function to scan individual or all zones' unevictable
      lists and move any pages that have become evictable onto the respective
      zone's inactive list, where shrink_inactive_list() will deal with them.
      
      Adds sysctl to scan all nodes, and per node attributes to individual
      nodes' zones.
      
      Kosaki: If evictable page found in unevictable lru when write
      /proc/sys/vm/scan_unevictable_pages, print filename and file offset of
      these pages.
      
      [akpm@linux-foundation.org: fix one CONFIG_MMU=n build error]
      [kosaki.motohiro@jp.fujitsu.com: adapt vmscan-unevictable-lru-scan-sysctl.patch to new sysfs API]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af936a16
    • L
      swap: cull unevictable pages in fault path · 64d6519d
      Lee Schermerhorn 提交于
      In the fault paths that install new anonymous pages, check whether the
      page is evictable or not using lru_cache_add_active_or_unevictable().  If
      the page is evictable, just add it to the active lru list [via the pagevec
      cache], else add it to the unevictable list.
      
      This "proactive" culling in the fault path mimics the handling of mlocked
      pages in Nick Piggin's series to keep mlocked pages off the lru lists.
      
      Notes:
      
      1) This patch is optional--e.g., if one is concerned about the
         additional test in the fault path.  We can defer the moving of
         nonreclaimable pages until when vmscan [shrink_*_list()]
         encounters them.  Vmscan will only need to handle such pages
         once, but if there are a lot of them it could impact system
         performance.
      
      2) The 'vma' argument to page_evictable() is require to notice that
         we're faulting a page into an mlock()ed vma w/o having to scan the
         page's rmap in the fault path.   Culling mlock()ed anon pages is
         currently the only reason for this patch.
      
      3) We can't cull swap pages in read_swap_cache_async() because the
         vma argument doesn't necessarily correspond to the swap cache
         offset passed in by swapin_readahead().  This could [did!] result
         in mlocking pages in non-VM_LOCKED vmas if [when] we tried to
         cull in this path.
      
      4) Move set_pte_at() to after where we add page to lru to keep it
         hidden from other tasks that might walk the page table.
         We already do it in this order in do_anonymous() page.  And,
         these are COW'd anon pages.  Is this safe?
      
      [riel@redhat.com: undo an overzealous code cleanup]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64d6519d
    • N
      vmstat: mlocked pages statistics · 5344b7e6
      Nick Piggin 提交于
      Add NR_MLOCK zone page state, which provides a (conservative) count of
      mlocked pages (actually, the number of mlocked pages moved off the LRU).
      
      Reworked by lts to fit in with the modified mlock page support in the
      Reclaim Scalability series.
      
      [kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
      [lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5344b7e6
    • R
      mmap: handle mlocked pages during map, remap, unmap · ba470de4
      Rik van Riel 提交于
      Originally by Nick Piggin <npiggin@suse.de>
      
      Remove mlocked pages from the LRU using "unevictable infrastructure"
      during mmap(), munmap(), mremap() and truncate().  Try to move back to
      normal LRU lists on munmap() when last mlocked mapping removed.  Remove
      PageMlocked() status when page truncated from file.
      
      [akpm@linux-foundation.org: cleanup]
      [kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
      [kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
      [lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
      [akpm@linux-foundation.org: remove bogus kerneldoc token]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba470de4
    • L
      mlock: downgrade mmap sem while populating mlocked regions · 8edb08ca
      Lee Schermerhorn 提交于
      We need to hold the mmap_sem for write to initiatate mlock()/munlock()
      because we may need to merge/split vmas.  However, this can lead to very
      long lock hold times attempting to fault in a large memory region to mlock
      it into memory.  This can hold off other faults against the mm
      [multithreaded tasks] and other scans of the mm, such as via /proc.  To
      alleviate this, downgrade the mmap_sem to read mode during the population
      of the region for locking.  This is especially the case if we need to
      reclaim memory to lock down the region.  We [probably?] don't need to do
      this for unlocking as all of the pages should be resident--they're already
      mlocked.
      
      Now, the caller's of the mlock functions [mlock_fixup() and
      mlock_vma_pages_range()] expect the mmap_sem to be returned in write mode.
       Changing all callers appears to be way too much effort at this point.
      So, restore write mode before returning.  Note that this opens a window
      where the mmap list could change in a multithreaded process.  So, at least
      for mlock_fixup(), where we could be called in a loop over multiple vmas,
      we check that a vma still exists at the start address and that vma still
      covers the page range [start,end).  If not, we return an error, -EAGAIN,
      and let the caller deal with it.
      
      Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup() if
      the vma at 'start' disappears or changes so that the page range
      [start,end) is no longer contained in the vma.  Again, let the caller deal
      with it.  Looks like only sys_remap_file_pages() [via mmap_region()]
      should actually care.
      
      With this patch, I no longer see processes like ps(1) blocked for seconds
      or minutes at a time waiting for a large [multiple gigabyte] region to be
      locked down.  However, I occassionally see delays while unlocking or
      unmapping a large mlocked region.  Should we also downgrade the mmap_sem
      for the unlock path?
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8edb08ca
    • N
      mlock: mlocked pages are unevictable · b291f000
      Nick Piggin 提交于
      Make sure that mlocked pages also live on the unevictable LRU, so kswapd
      will not scan them over and over again.
      
      This is achieved through various strategies:
      
      1) add yet another page flag--PG_mlocked--to indicate that
         the page is locked for efficient testing in vmscan and,
         optionally, fault path.  This allows early culling of
         unevictable pages, preventing them from getting to
         page_referenced()/try_to_unmap().  Also allows separate
         accounting of mlock'd pages, as Nick's original patch
         did.
      
         Note:  Nick's original mlock patch used a PG_mlocked
         flag.  I had removed this in favor of the PG_unevictable
         flag + an mlock_count [new page struct member].  I
         restored the PG_mlocked flag to eliminate the new
         count field.
      
      2) add the mlock/unevictable infrastructure to mm/mlock.c,
         with internal APIs in mm/internal.h.  This is a rework
         of Nick's original patch to these files, taking into
         account that mlocked pages are now kept on unevictable
         LRU list.
      
      3) update vmscan.c:page_evictable() to check PageMlocked()
         and, if vma passed in, the vm_flags.  Note that the vma
         will only be passed in for new pages in the fault path;
         and then only if the "cull unevictable pages in fault
         path" patch is included.
      
      4) add try_to_unlock() to rmap.c to walk a page's rmap and
         ClearPageMlocked() if no other vmas have it mlocked.
         Reuses as much of try_to_unmap() as possible.  This
         effectively replaces the use of one of the lru list links
         as an mlock count.  If this mechanism let's pages in mlocked
         vmas leak through w/o PG_mlocked set [I don't know that it
         does], we should catch them later in try_to_unmap().  One
         hopes this will be rare, as it will be relatively expensive.
      
      Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      
      splitlru: introduce __get_user_pages():
      
        New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
        because current get_user_pages() can't grab PROT_NONE pages theresore it
        cause PROT_NONE pages can't munlock.
      
      [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
      [akpm@linux-foundation.org: untangle patch interdependencies]
      [akpm@linux-foundation.org: fix things after out-of-order merging]
      [hugh@veritas.com: fix page-flags mess]
      [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
      [kosaki.motohiro@jp.fujitsu.com: build fix]
      [kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
      [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b291f000
    • L
      SHM_LOCKED pages are unevictable · 89e004ea
      Lee Schermerhorn 提交于
      Shmem segments locked into memory via shmctl(SHM_LOCKED) should not be
      kept on the normal LRU, since scanning them is a waste of time and might
      throw off kswapd's balancing algorithms.  Place them on the unevictable
      LRU list instead.
      
      Use the AS_UNEVICTABLE flag to mark address_space of SHM_LOCKed shared
      memory regions as unevictable.  Then these pages will be culled off the
      normal LRU lists during vmscan.
      
      Add new wrapper function to clear the mapping's unevictable state when/if
      shared memory segment is munlocked.
      
      Add 'scan_mapping_unevictable_page()' to mm/vmscan.c to scan all pages in
      the shmem segment's mapping [struct address_space] for evictability now
      that they're no longer locked.  If so, move them to the appropriate zone
      lru list.
      
      Changes depend on [CONFIG_]UNEVICTABLE_LRU.
      
      [kosaki.motohiro@jp.fujitsu.com: revert shm change]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89e004ea
    • L
      Ramfs and Ram Disk pages are unevictable · ba9ddf49
      Lee Schermerhorn 提交于
      Christoph Lameter pointed out that ram disk pages also clutter the LRU
      lists.  When vmscan finds them dirty and tries to clean them, the ram disk
      writeback function just redirties the page so that it goes back onto the
      active list.  Round and round she goes...
      
      With the ram disk driver [rd.c] replaced by the newer 'brd.c', this is no
      longer the case, as ram disk pages are no longer maintained on the lru.
      [This makes them unmigratable for defrag or memory hot remove, but that
      can be addressed by a separate patch series.] However, the ramfs pages
      behave like ram disk pages used to, so:
      
      Define new address_space flag [shares address_space flags member with
      mapping's gfp mask] to indicate that the address space contains all
      unevictable pages.  This will provide for efficient testing of ramfs pages
      in page_evictable().
      
      Also provide wrapper functions to set/test the unevictable state to
      minimize #ifdefs in ramfs driver and any other users of this facility.
      
      Set the unevictable state on address_space structures for new ramfs
      inodes.  Test the unevictable state in page_evictable() to cull
      unevictable pages.
      
      These changes depend on [CONFIG_]UNEVICTABLE_LRU.
      
      [riel@redhat.com: undo the brd.c part]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Debugged-by: NNick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba9ddf49
    • L
      Unevictable LRU Page Statistics · 7b854121
      Lee Schermerhorn 提交于
      Report unevictable pages per zone and system wide.
      
      Kosaki Motohiro added support for memory controller unevictable
      statistics.
      
      [riel@redhat.com: fix printk in show_free_areas()]
      [akpm@linux-foundation.org: fix units in /proc/vmstats]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Debugged-by: NHiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b854121
    • L
      unevictable lru: add event counting with statistics · bbfd28ee
      Lee Schermerhorn 提交于
      Fix to unevictable-lru-page-statistics.patch
      
      Add unevictable lru infrastructure vm events to the statistics patch.
      Rename the "NORECL_" and "noreclaim_" symbols and text strings to
      "UNEVICTABLE_" and "unevictable_", respectively.
      
      Currently, both the infrastructure and the mlocked pages event are
      added by a single patch later in the series.  This makes it difficult
      to add or rework the incremental patches.  The events actually "belong"
      with the stats, so pull them up to here.
      
      Also, restore the event counting to putback_lru_page().  This was removed
      from previous patch in series where it was "misplaced".  The actual events
      weren't defined that early.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bbfd28ee
    • L
      Unevictable LRU Infrastructure · 894bc310
      Lee Schermerhorn 提交于
      When the system contains lots of mlocked or otherwise unevictable pages,
      the pageout code (kswapd) can spend lots of time scanning over these
      pages.  Worse still, the presence of lots of unevictable pages can confuse
      kswapd into thinking that more aggressive pageout modes are required,
      resulting in all kinds of bad behaviour.
      
      Infrastructure to manage pages excluded from reclaim--i.e., hidden from
      vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
      maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
      them from vmscan.
      
      Kosaki Motohiro added the support for the memory controller unevictable
      lru list.
      
      Pages on the unevictable list have both PG_unevictable and PG_lru set.
      Thus, PG_unevictable is analogous to and mutually exclusive with
      PG_active--it specifies which LRU list the page is on.
      
      The unevictable infrastructure is enabled by a new mm Kconfig option
      [CONFIG_]UNEVICTABLE_LRU.
      
      A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
      not a page may be evictable.  Subsequent patches will add the various
      !evictable tests.  We'll want to keep these tests light-weight for use in
      shrink_active_list() and, possibly, the fault path.
      
      To avoid races between tasks putting pages [back] onto an LRU list and
      tasks that might be moving the page from non-evictable to evictable state,
      the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
      -- tests the "evictability" of a page after placing it on the LRU, before
      dropping the reference.  If the page has become unevictable,
      putback_lru_page() will redo the 'putback', thus moving the page to the
      unevictable list.  This way, we avoid "stranding" evictable pages on the
      unevictable list.
      
      [akpm@linux-foundation.org: fix fallout from out-of-order merge]
      [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
      [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
      [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
      [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
      [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
      [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
      [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Debugged-by: NBenjamin Kidwell <benjkidwell@yahoo.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      894bc310
    • R
      more aggressively use lumpy reclaim · 33c120ed
      Rik van Riel 提交于
      During an AIM7 run on a 16GB system, fork started failing around 32000
      threads, despite the system having plenty of free swap and 15GB of
      pageable memory.  This was on x86-64, so 8k stacks.
      
      If a higher order allocation fails, we can either:
      - keep evicting pages off the end of the LRUs and hope that
        we eventually create a contiguous region; this is somewhat
        unlikely if the system is under enough stress by new
        allocations
      - after trying normal eviction for a bit, use lumpy reclaim
      
      This patch switches the system to lumpy reclaim if the VM is having
      trouble freeing enough pages, using the same threshold for detection as
      used by pageout congestion wait.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33c120ed
    • R
      vmscan: add newly swapped in pages to the inactive list · c5fdae46
      Rik van Riel 提交于
      Swapin_readahead can read in a lot of data that the processes in memory
      never need.  Adding swap cache pages to the inactive list prevents them
      from putting too much pressure on the working set.
      
      This has the potential to help the programs that are already in memory,
      but it could also be a disadvantage to processes that are trying to get
      swapped in.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c5fdae46
    • R
      vmscan: fix pagecache reclaim referenced bit check · 7e9cd484
      Rik van Riel 提交于
      Moving referenced pages back to the head of the active list creates a huge
      scalability problem, because by the time a large memory system finally
      runs out of free memory, every single page in the system will have been
      referenced.
      
      Not only do we not have the time to scan every single page on the active
      list, but since they have will all have the referenced bit set, that bit
      conveys no useful information.
      
      A more scalable solution is to just move every page that hits the end of
      the active list to the inactive list.
      
      We clear the referenced bit off of mapped pages, which need just one
      reference to be moved back onto the active list.
      
      Unmapped pages will be moved back to the active list after two references
      (see mark_page_accessed).  We preserve the PG_referenced flag on unmapped
      pages to preserve accesses that were made while the page was on the active
      list.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7e9cd484
    • R
      vmscan: second chance replacement for anonymous pages · 556adecb
      Rik van Riel 提交于
      We avoid evicting and scanning anonymous pages for the most part, but
      under some workloads we can end up with most of memory filled with
      anonymous pages.  At that point, we suddenly need to clear the referenced
      bits on all of memory, which can take ages on very large memory systems.
      
      We can reduce the maximum number of pages that need to be scanned by not
      taking the referenced state into account when deactivating an anonymous
      page.  After all, every anonymous page starts out referenced, so why
      check?
      
      If an anonymous page gets referenced again before it reaches the end of
      the inactive list, we move it back to the active list.
      
      To keep the maximum amount of necessary work reasonable, we scale the
      active to inactive ratio with the size of memory, using the formula
      active:inactive ratio = sqrt(memory in GB * 10).
      
      Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
      instead of by the amount of memory present in the system.
      
      [kamezawa.hiroyu@jp.fujitsu.com: fix OOM with memcg]
      [kamezawa.hiroyu@jp.fujitsu.com: memcg: lru scan fix]
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      556adecb
    • R
      vmscan: split LRU lists into anon & file sets · 4f98a2fe
      Rik van Riel 提交于
      Split the LRU lists in two, one set for pages that are backed by real file
      systems ("file") and one for pages that are backed by memory and swap
      ("anon").  The latter includes tmpfs.
      
      The advantage of doing this is that the VM will not have to scan over lots
      of anonymous pages (which we generally do not want to swap out), just to
      find the page cache pages that it should evict.
      
      This patch has the infrastructure and a basic policy to balance how much
      we scan the anon lists and how much we scan the file lists.  The big
      policy changes are in separate patches.
      
      [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
      [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
      [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
      [hugh@veritas.com: memcg swapbacked pages active]
      [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
      [akpm@linux-foundation.org: fix /proc/vmstat units]
      [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
      [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
      [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f98a2fe
    • R
      define page_file_cache() function · b2e18538
      Rik van Riel 提交于
      Define page_file_cache() function to answer the question:
      	is page backed by a file?
      
      Originally part of Rik van Riel's split-lru patch.  Extracted to make
      available for other, independent reclaim patches.
      
      Moved inline function to linux/mm_inline.h where it will be needed by
      subsequent "split LRU" and "noreclaim" patches.
      
      Unfortunately this needs to use a page flag, since the PG_swapbacked state
      needs to be preserved all the way to the point where the page is last
      removed from the LRU.  Trying to derive the status from other info in the
      page resulted in wrong VM statistics in earlier split VM patchsets.
      
      The total number of page flags in use on a 32 bit machine after this patch
      is 19.
      
      [akpm@linux-foundation.org: fix up out-of-order merge fallout]
      [hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner[
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NMinChan Kim <minchan.kim@gmail.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b2e18538
    • R
      vmscan: free swap space on swap-in/activation · 68a22394
      Rik van Riel 提交于
      If vm_swap_full() (swap space more than 50% full), the system will free
      swap space at swapin time.  With this patch, the system will also free the
      swap space in the pageout code, when we decide that the page is not a
      candidate for swapout (and just wasting swap space).
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NMinChan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68a22394
    • K
      swap: use an array for the LRU pagevecs · f04e9ebb
      KOSAKI Motohiro 提交于
      Turn the pagevecs into an array just like the LRUs.  This significantly
      cleans up the source code and reduces the size of the kernel by about 13kB
      after all the LRU lists have been created further down in the split VM
      patch series.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f04e9ebb
    • C
      vmscan: Use an indexed array for LRU variables · b69408e8
      Christoph Lameter 提交于
      Currently we are defining explicit variables for the inactive and active
      list.  An indexed array can be more generic and avoid repeating similar
      code in several places in the reclaim code.
      
      We are saving a few bytes in terms of code size:
      
      Before:
      
         text    data     bss     dec     hex filename
      4097753  573120 4092484 8763357  85b7dd vmlinux
      
      After:
      
         text    data     bss     dec     hex filename
      4097729  573120 4092484 8763333  85b7c5 vmlinux
      
      Having an easy way to add new lru lists may ease future work on the
      reclaim code.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b69408e8
    • N
      vmscan: move isolate_lru_page() to vmscan.c · 62695a84
      Nick Piggin 提交于
      On large memory systems, the VM can spend way too much time scanning
      through pages that it cannot (or should not) evict from memory.  Not only
      does it use up CPU time, but it also provokes lock contention and can
      leave large systems under memory presure in a catatonic state.
      
      This patch series improves VM scalability by:
      
      1) putting filesystem backed, swap backed and unevictable pages
         onto their own LRUs, so the system only scans the pages that it
         can/should evict from memory
      
      2) switching to two handed clock replacement for the anonymous LRUs,
         so the number of pages that need to be scanned when the system
         starts swapping is bound to a reasonable number
      
      3) keeping unevictable pages off the LRU completely, so the
         VM does not waste CPU time scanning them. ramfs, ramdisk,
         SHM_LOCKED shared memory segments and mlock()ed VMA pages
         are keept on the unevictable list.
      
      This patch:
      
      isolate_lru_page logically belongs to be in vmscan.c than migrate.c.
      
      It is tough, because we don't need that function without memory migration
      so there is a valid argument to have it in migrate.c.  However a
      subsequent patch needs to make use of it in the core mm, so we can happily
      move it to vmscan.c.
      
      Also, make the function a little more generic by not requiring that it
      adds an isolated page to a given list.  Callers can do that.
      
      	Note that we now have '__isolate_lru_page()', that does
      	something quite different, visible outside of vmscan.c
      	for use with memory controller.  Methinks we need to
      	rationalize these names/purposes.	--lts
      
      [akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      62695a84
    • B
      mm: cleanup to make remove_memory() arch-neutral · 71088785
      Badari Pulavarty 提交于
      There is nothing architecture specific about remove_memory().
      remove_memory() function is common for all architectures which support
      hotplug memory remove.  Instead of duplicating it in every architecture,
      collapse them into arch neutral function.
      
      [akpm@linux-foundation.org: fix the export]
      Signed-off-by: NBadari Pulavarty <pbadari@us.ibm.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Gary Hade <garyhade@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71088785
    • L
      anon_vma_prepare: properly lock even newly allocated entries · d9d332e0
      Linus Torvalds 提交于
      The anon_vma code is very subtle, and we end up doing optimistic lookups
      of anon_vmas under RCU in page_lock_anon_vma() with no locking.  Other
      CPU's can also see the newly allocated entry immediately after we've
      exposed it by setting "vma->anon_vma" to the new value.
      
      We protect against the anon_vma being destroyed by having the SLAB
      marked as SLAB_DESTROY_BY_RCU, so the RCU lookup can depend on the
      allocation not being destroyed - but it might still be free'd and
      re-allocated here to a new vma.
      
      As a result, we should not do the anon_vma list ops on a newly allocated
      vma without proper locking.
      Acked-by: NNick Piggin <npiggin@suse.de>
      Acked-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9d332e0
  2. 18 10月, 2008 1 次提交
  3. 17 10月, 2008 6 次提交
  4. 16 10月, 2008 1 次提交
  5. 14 10月, 2008 2 次提交
    • O
      do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails · 85462323
      Oleg Nesterov 提交于
      If lock_page_killable() fails because the task was killed by SIGKILL or
      any other fatal signal, do_generic_file_read() returns -EIO.
      
      This seems to be OK, because in fact the userspace won't see this error,
      the task will dequeue SIGKILL and exit.
      
      However, /sbin/init is different, it will dequeue SIGKILL, ignore it, and
      return to the user-space with the bogus -EIO.
      
      Change the code to return the error code from lock_page_killable(), -EINTR.
      This doesn't fix the bug, but perhaps makes sense anyway. Imho, with this
      change the code looks a bit more logical, and the "good" init should handle
      the spurious EINTR or short read.
      
      Afaics we can also change lock_page_killable() to return -ERESTARTNOINTR,
      but this can't prevent the short reads.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      85462323
    • A
      vfs: Remove the range_cont writeback mode. · 74baaaae
      Aneesh Kumar K.V 提交于
      Ext4 was the only user of range_cont writeback mode and ext4 switched
      to a different method. So remove the range_cont mode which is not used
      in the kernel.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      CC: linux-fsdevel@vger.kernel.org
      74baaaae
  6. 13 10月, 2008 1 次提交
  7. 10 10月, 2008 1 次提交
  8. 09 10月, 2008 1 次提交
  9. 08 10月, 2008 1 次提交
  10. 03 10月, 2008 3 次提交
    • A
      mm: handle initialising compound pages at orders greater than MAX_ORDER · 6babc32c
      Andy Whitcroft 提交于
      When we initialise a compound page we initialise the page flags and head
      page pointer for all base pages spanned by that page.  When we initialise
      a gigantic page (a page of order greater than or equal to MAX_ORDER) we
      have to initialise more than MAX_ORDER_NR_PAGES pages.  Currently we
      assume that all elements of the mem_map in this page are contigious in
      memory.  However this is only guarenteed out to MAX_ORDER_NR_PAGES pages,
      and with SPARSEMEM enabled they will not be contigious.  This leads us to
      walk off the end of the first section and scribble on everything which
      follows, BAD.
      
      When we reach a MAX_ORDER_NR_PAGES boundary we much locate the next
      section of the mem_map.  As gigantic pages can only be maximally aligned
      we know this will occur at exact multiple of MAX_ORDER_NR_PAGES pages from
      the start of the page.
      
      This is a bug fix for the gigantic page support in hugetlbfs.
      
      Credit to Mel Gorman for spotting the issue.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6babc32c
    • N
      mm: tiny-shmem nommu fix · 4b19de6d
      Nick Piggin 提交于
      The previous patch db203d53 ("mm:
      tiny-shmem fix lock ordering: mmap_sem vs i_mutex") to fix the lock
      ordering in tiny-shmem breaks shared anonymous and IPC memory on NOMMU
      architectures because it was using the expanding truncate to signal ramfs
      to allocate a physically contiguous RAM backing the inode (otherwise it is
      unusable for "memory mapping" it to userspace).
      
      However do_truncate is what caused the lock ordering error, due to it
      taking i_mutex.  In this case, we can actually just call ramfs directly to
      allocate memory for the mapping, rather than go via truncate.
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b19de6d
    • G
      memory hotplug: missing zone->lock in test_pages_isolated() · 6c1b7f68
      Gerald Schaefer 提交于
      __test_page_isolated_in_pageblock() in mm/page_isolation.c has a comment
      saying that the caller must hold zone->lock. But the only caller of that
      function, test_pages_isolated(), does not hold zone->lock and the lock is
      also not acquired anywhere before. This patch adds the missing zone->lock
      to test_pages_isolated().
      
      We reproducibly run into BUG_ON(!PageBuddy(page)) in __offline_isolated_pages()
      during memory hotplug stress test, see trace below. This patch fixes that
      problem, it would be good if we could have it in 2.6.27.
      
      kernel BUG at /home/autobuild/BUILD/linux-2.6.26-20080909/mm/page_alloc.c:4561!
      illegal operation: 0001 [#1] PREEMPT SMP
      Modules linked in: dm_multipath sunrpc bonding qeth_l3 dm_mod qeth ccwgroup vmur
      CPU: 1 Not tainted 2.6.26-29.x.20080909-s390default #1
      Process memory_loop_all (pid: 10025, task: 2f444028, ksp: 2b10dd28)
      Krnl PSW : 040c0000 801727ea (__offline_isolated_pages+0x18e/0x1c4)
       R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0
      Krnl GPRS: 00000000 7e27fc00 00000000 7e27fc00
       00000000 00000400 00014000 7e27fc01
       00606f00 7e27fc00 00013fe0 2b10dd28
       00000005 80172662 801727b2 2b10dd28
      Krnl Code: 801727de: 5810900c l %r1,12(%r9)
       801727e2: a7f4ffb3 brc 15,80172748
       801727e6: a7f40001 brc 15,801727e8
       >801727ea: a7f4ffbc brc 15,80172762
       801727ee: a7f40001 brc 15,801727f0
       801727f2: a7f4ffaf brc 15,80172750
       801727f6: 0707 bcr 0,%r7
       801727f8: 0017 unknown
      Call Trace:
      ([<0000000000172772>] __offline_isolated_pages+0x116/0x1c4)
       [<00000000001953a2>] offline_isolated_pages_cb+0x22/0x34
       [<000000000013164c>] walk_memory_resource+0xcc/0x11c
       [<000000000019520e>] offline_pages+0x36a/0x498
       [<00000000001004d6>] remove_memory+0x36/0x44
       [<000000000028fb06>] memory_block_change_state+0x112/0x150
       [<000000000028ffb8>] store_mem_state+0x90/0xe4
       [<0000000000289c00>] sysdev_store+0x34/0x40
       [<00000000001ee048>] sysfs_write_file+0xd0/0x178
       [<000000000019b1a8>] vfs_write+0x74/0x118
       [<000000000019b9ae>] sys_write+0x46/0x7c
       [<000000000011160e>] sysc_do_restart+0x12/0x16
       [<0000000077f3e8ca>] 0x77f3e8ca
      Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6c1b7f68