1. 08 10月, 2010 1 次提交
    • A
      Clean up __page_set_anon_rmap · 4e1c1975
      Andi Kleen 提交于
      Linus asked for a cleanup of __page_set_anon_rmap to make
      it look more like the cleaner huge pages version.
      
      Factor out the duplicated PageAnon check into a single check
      at the beginning of the function.
      
      Remove obsolete comments and rewrite them into standard English.
      
      No functional changes.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      4e1c1975
  2. 05 10月, 2010 1 次提交
    • H
      ksm: fix page_address_in_vma anon_vma oops · 4829b906
      Hugh Dickins 提交于
      2.6.36-rc1 commit 21d0d443 "rmap:
      resurrect page_address_in_vma anon_vma check" was right to resurrect
      that check; but now that it's comparing anon_vma->roots instead of
      just anon_vmas, there's a danger of oopsing on a NULL anon_vma.
      
      In most cases no NULL anon_vma ever gets here; but it turns out that
      occasionally KSM, when enabled on a forked or forking process, will
      itself call page_address_in_vma() on a "half-KSM" page left over from
      an earlier failed attempt to merge - whose page_anon_vma() is NULL.
      
      It's my bug that those should be getting here at all: I thought they
      were already dealt with, this oops proves me wrong, I'll fix it in
      the next release - such pages are effectively pinned until their
      process exits, since rmap cannot find their ptes (though swapoff can).
      
      For now just work around it by making page_address_in_vma() safe (and
      add a comment on why that check is wanted anyway).  A similar check
      in __page_check_anon_rmap() is safe because do_page_add_anon_rmap()
      already excluded KSM pages.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4829b906
  3. 24 9月, 2010 2 次提交
  4. 29 8月, 2010 1 次提交
    • H
      mm: fix hang on anon_vma->root->lock · f1819427
      Hugh Dickins 提交于
      After several hours, kbuild tests hang with anon_vma_prepare() spinning on
      a newly allocated anon_vma's lock - on a box with CONFIG_TREE_PREEMPT_RCU=y
      (which makes this very much more likely, but it could happen without).
      
      The ever-subtle page_lock_anon_vma() now needs a further twist: since
      anon_vma_prepare() and anon_vma_fork() are liable to change the ->root
      of a reused anon_vma structure at any moment, page_lock_anon_vma()
      needs to check page_mapped() again before succeeding, otherwise
      page_unlock_anon_vma() might address a different root->lock.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f1819427
  5. 11 8月, 2010 2 次提交
    • N
      hwpoison: rename CONFIG · e3390f67
      Naoya Horiguchi 提交于
      CONFIG_HUGETLBFS controls hugetlbfs interface code.
      OTOH, CONFIG_HUGETLB_PAGE controls hugepage management code.
      So we should use CONFIG_HUGETLB_PAGE here.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      e3390f67
    • N
      hugetlb, rmap: add reverse mapping for hugepage · 0fe6e20b
      Naoya Horiguchi 提交于
      This patch adds reverse mapping feature for hugepage by introducing
      mapcount for shared/private-mapped hugepage and anon_vma for
      private-mapped hugepage.
      
      While hugepage is not currently swappable, reverse mapping can be useful
      for memory error handler.
      
      Without this patch, memory error handler cannot identify processes
      using the bad hugepage nor unmap it from them. That is:
      - for shared hugepage:
        we can collect processes using a hugepage through pagecache,
        but can not unmap the hugepage because of the lack of mapcount.
      - for privately mapped hugepage:
        we can neither collect processes nor unmap the hugepage.
      This patch solves these problems.
      
      This patch include the bug fix given by commit 23be7468, so reverts it.
      
      Dependency:
        "hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h"
      
      ChangeLog since May 24.
      - create hugetlb_inline.h and move is_vm_hugetlb_index() in it.
      - move functions setting up anon_vma for hugepage into mm/rmap.c.
      
      ChangeLog since May 13.
      - rebased to 2.6.34
      - fix logic error (in case that private mapping and shared mapping coexist)
      - move is_vm_hugetlb_page() into include/linux/mm.h to use this function
        from linear_page_index()
      - define and use linear_hugepage_index() instead of compound_order()
      - use page_move_anon_rmap() in hugetlb_cow()
      - copy exclusive switch of __set_page_anon_rmap() into hugepage counterpart.
      - revert commit 24be7468 completely
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Acked-by: NFengguang Wu <fengguang.wu@intel.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      0fe6e20b
  6. 10 8月, 2010 8 次提交
  7. 25 5月, 2010 3 次提交
    • M
      mm: migration: avoid race between shift_arg_pages() and rmap_walk() during... · a8bef8ff
      Mel Gorman 提交于
      mm: migration: avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
      
      Page migration requires rmap to be able to find all ptes mapping a page
      at all times, otherwise the migration entry can be instantiated, but it
      is possible to leave one behind if the second rmap_walk fails to find
      the page.  If this page is later faulted, migration_entry_to_page() will
      call BUG because the page is locked indicating the page was migrated by
      the migration PTE not cleaned up. For example
      
        kernel BUG at include/linux/swapops.h:105!
        invalid opcode: 0000 [#1] PREEMPT SMP
        ...
        Call Trace:
         [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
         [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
         [<ffffffff813099b5>] page_fault+0x25/0x30
         [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
         [<ffffffff8111329b>] search_binary_handler+0x173/0x313
         [<ffffffff81114896>] do_execve+0x219/0x30a
         [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
         [<ffffffff8100320a>] stub_execve+0x6a/0xc0
        RIP  [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
      
      There is a race between shift_arg_pages and migration that triggers this
      bug.  A temporary stack is setup during exec and later moved.  If
      migration moves a page in the temporary stack and the VMA is then removed
      before migration completes, the migration PTE may not be found leading to
      a BUG when the stack is faulted.
      
      This patch causes pages within the temporary stack during exec to be
      skipped by migration.  It does this by marking the VMA covering the
      temporary stack with an otherwise impossible combination of VMA flags.
      These flags are cleared when the temporary stack is moved to its final
      location.
      
      [kamezawa.hiroyu@jp.fujitsu.com: idea for having migration skip temporary stacks]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a8bef8ff
    • M
      mm: migration: share the anon_vma ref counts between KSM and page migration · 7f60c214
      Mel Gorman 提交于
      For clarity of review, KSM and page migration have separate refcounts on
      the anon_vma.  While clear, this is a waste of memory.  This patch gets
      KSM and page migration to share their toys in a spirit of harmony.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f60c214
    • M
      mm: migration: take a reference to the anon_vma before migrating · 3f6c8272
      Mel Gorman 提交于
      This patchset is a memory compaction mechanism that reduces external
      fragmentation memory by moving GFP_MOVABLE pages to a fewer number of
      pageblocks.  The term "compaction" was chosen as there are is a number of
      mechanisms that are not mutually exclusive that can be used to defragment
      memory.  For example, lumpy reclaim is a form of defragmentation as was
      slub "defragmentation" (really a form of targeted reclaim).  Hence, this
      is called "compaction" to distinguish it from other forms of
      defragmentation.
      
      In this implementation, a full compaction run involves two scanners
      operating within a zone - a migration and a free scanner.  The migration
      scanner starts at the beginning of a zone and finds all movable pages
      within one pageblock_nr_pages-sized area and isolates them on a
      migratepages list.  The free scanner begins at the end of the zone and
      searches on a per-area basis for enough free pages to migrate all the
      pages on the migratepages list.  As each area is respectively migrated or
      exhausted of free pages, the scanners are advanced one area.  A compaction
      run completes within a zone when the two scanners meet.
      
      This method is a bit primitive but is easy to understand and greater
      sophistication would require maintenance of counters on a per-pageblock
      basis.  This would have a big impact on allocator fast-paths to improve
      compaction which is a poor trade-off.
      
      It also does not try relocate virtually contiguous pages to be physically
      contiguous.  However, assuming transparent hugepages were in use, a
      hypothetical khugepaged might reuse compaction code to isolate free pages,
      split them and relocate userspace pages for promotion.
      
      Memory compaction can be triggered in one of three ways.  It may be
      triggered explicitly by writing any value to /proc/sys/vm/compact_memory
      and compacting all of memory.  It can be triggered on a per-node basis by
      writing any value to /sys/devices/system/node/nodeN/compact where N is the
      node ID to be compacted.  When a process fails to allocate a high-order
      page, it may compact memory in an attempt to satisfy the allocation
      instead of entering direct reclaim.  Explicit compaction does not finish
      until the two scanners meet and direct compaction ends if a suitable page
      becomes available that would meet watermarks.
      
      The series is in 14 patches.  The first three are not "core" to the series
      but are important pre-requisites.
      
      Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
      	patch, it's possible to use anon_vma after free if the caller is
      	not holding a VMA or mmap_sem for the pages in question. While
      	there should be no existing user that causes this problem,
      	it's a requirement for memory compaction to be stable. The patch
      	is at the start of the series for bisection reasons.
      Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
      	but would be slightly harder to review.
      Patch 3 skips over unmapped anon pages during migration as there are no
      	guarantees about the anon_vma existing. There is a window between
      	when a page was isolated and migration started during which anon_vma
      	could disappear.
      Patch 4 notes that PageSwapCache pages can still be migrated even if they
      	are unmapped.
      Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
      Patch 6 exports a "unusable free space index" via debugfs. It's
      	a measure of external fragmentation that takes the size of the
      	allocation request into account. It can also be calculated from
      	userspace so can be dropped if requested
      Patch 7 exports a "fragmentation index" which only has meaning when an
      	allocation request fails. It determines if an allocation failure
      	would be due to a lack of memory or external fragmentation.
      Patch 8 moves the definition for LRU isolation modes for use by compaction
      Patch 9 is the compaction mechanism although it's unreachable at this point
      Patch 10 adds a means of compacting all of memory with a proc trgger
      Patch 11 adds a means of compacting a specific node with a sysfs trigger
      Patch 12 adds "direct compaction" before "direct reclaim" if it is
      	determined there is a good chance of success.
      Patch 13 adds a sysctl that allows tuning of the threshold at which the
      	kernel will compact or direct reclaim
      Patch 14 temporarily disables compaction if an allocation failure occurs
      	after compaction.
      
      Testing of compaction was in three stages.  For the test, debugging,
      preempt, the sleep watchdog and lockdep were all enabled but nothing nasty
      popped out.  min_free_kbytes was tuned as recommended by hugeadm to help
      fragmentation avoidance and high-order allocations.  It was tested on X86,
      X86-64 and PPC64.
      
      Ths first test represents one of the easiest cases that can be faced for
      lumpy reclaim or memory compaction.
      
      1. Machine freshly booted and configured for hugepage usage with
      	a) hugeadm --create-global-mounts
      	b) hugeadm --pool-pages-max DEFAULT:8G
      	c) hugeadm --set-recommended-min_free_kbytes
      	d) hugeadm --set-recommended-shmmax
      
      	The min_free_kbytes here is important. Anti-fragmentation works best
      	when pageblocks don't mix. hugeadm knows how to calculate a value that
      	will significantly reduce the worst of external-fragmentation-related
      	events as reported by the mm_page_alloc_extfrag tracepoint.
      
      2. Load up memory
      	a) Start updatedb
      	b) Create in parallel a X files of pagesize*128 in size. Wait
      	   until files are created. By parallel, I mean that 4096 instances
      	   of dd were launched, one after the other using &. The crude
      	   objective being to mix filesystem metadata allocations with
      	   the buffer cache.
      	c) Delete every second file so that pageblocks are likely to
      	   have holes
      	d) kill updatedb if it's still running
      
      	At this point, the system is quiet, memory is full but it's full with
      	clean filesystem metadata and clean buffer cache that is unmapped.
      	This is readily migrated or discarded so you'd expect lumpy reclaim
      	to have no significant advantage over compaction but this is at
      	the POC stage.
      
      3. In increments, attempt to allocate 5% of memory as hugepages.
      	   Measure how long it took, how successful it was, how many
      	   direct reclaims took place and how how many compactions. Note
      	   the compaction figures might not fully add up as compactions
      	   can take place for orders other than the hugepage size
      
      X86				vanilla		compaction
      Final page count                    913                916 (attempted 1002)
      pages reclaimed                   68296               9791
      
      X86-64				vanilla		compaction
      Final page count:                   901                902 (attempted 1002)
      Total pages reclaimed:           112599              53234
      
      PPC64				vanilla		compaction
      Final page count:                    93                 94 (attempted 110)
      Total pages reclaimed:           103216              61838
      
      There was not a dramatic improvement in success rates but it wouldn't be
      expected in this case either.  What was important is that fewer pages were
      reclaimed in all cases reducing the amount of IO required to satisfy a
      huge page allocation.
      
      The second tests were all performance related - kernbench, netperf, iozone
      and sysbench.  None showed anything too remarkable.
      
      The last test was a high-order allocation stress test.  Many kernel
      compiles are started to fill memory with a pressured mix of unmovable and
      movable allocations.  During this, an attempt is made to allocate 90% of
      memory as huge pages - one at a time with small delays between attempts to
      avoid flooding the IO queue.
      
                                                   vanilla   compaction
      Percentage of request allocated X86               98           99
      Percentage of request allocated X86-64            95           98
      Percentage of request allocated PPC64             55           70
      
      This patch:
      
      rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
      locking an anon_vma and it does not appear to have sufficient locking to
      ensure the anon_vma does not disappear from under it.
      
      This patch copies an approach used by KSM to take a reference on the
      anon_vma while pages are being migrated.  This should prevent rmap_walk()
      running into nasty surprises later because anon_vma has been freed.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f6c8272
  8. 12 5月, 2010 1 次提交
  9. 25 4月, 2010 1 次提交
  10. 20 4月, 2010 1 次提交
    • R
      rmap: add exclusively owned pages to the newest anon_vma · e8a03feb
      Rik van Riel 提交于
      The recent anon_vma fixes cause many anonymous pages to end up
      in the parent process anon_vma, even when the page is exclusively
      owned by the current process.
      
      Adding exclusively owned anonymous pages to the top anon_vma
      reduces rmap scanning overhead, especially in workloads with
      forking servers.
      
      This patch adds a parameter to __page_set_anon_rmap that can
      be used to indicate whether or not the added page is exclusively
      owned by the current process.
      
      Pages added through page_add_new_anon_rmap are exclusively
      owned by the current process, and can be added to the top
      anon_vma.
      
      Pages added through page_add_anon_rmap can be either shared
      or exclusively owned, so we do the conservative thing and
      add it to the oldest anon_vma.
      
      A next step would be to add the exclusive parameter to
      page_add_anon_rmap, to be used from functions where we do
      know for sure whether a page is exclusively owned.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NJohannes Weiner <hannes@cmpxchg.org>
      Lightly-tested-by: NBorislav Petkov <bp@alien8.de>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      [ Edited to look nicer  - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8a03feb
  11. 13 4月, 2010 2 次提交
    • L
      anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma · ea90002b
      Linus Torvalds 提交于
      Otherwise we might be mapping in a page in a new mapping, but that page
      (through the swapcache) would later be mapped into an old mapping too.
      The page->mapping must be the case that works for everybody, not just
      the mapping that happened to page it in first.
      
      Here's the scenario:
      
       - page gets allocated/mapped by process A. Let's call the anon_vma we
         associate the page with 'A' to keep it easy to track.
      
       - Process A forks, creating process B. The anon_vma in B is 'B', and has
         a chain that looks like 'B' -> 'A'. Everything is fine.
      
       - Swapping happens. The page (with mapping pointing to 'A') gets swapped
         out (perhaps not to disk - it's enough to assume that it's just not
         mapped any more, and lives entirely in the swap-cache)
      
       - Process B pages it in, which goes like this:
      
              do_swap_page ->
                page = lookup_swap_cache(entry);
               ...
                set_pte_at(mm, address, page_table, pte);
                page_add_anon_rmap(page, vma, address);
      
         And think about what happens here!
      
         In particular, what happens is that this will now be the "first"
         mapping of that page, so page_add_anon_rmap() used to do
      
              if (first)
                      __page_set_anon_rmap(page, vma, address);
      
         and notice what anon_vma it will use? It will use the anon_vma for
         process B!
      
         What happens then? Trivial: process 'A' also pages it in (nothing
         happens, it's not the first mapping), and then process 'B' execve's
         or exits or unmaps, making anon_vma B go away.
      
         End result: process A has a page that points to anon_vma B, but
         anon_vma B does not exist any more.  This can go on forever.  Forget
         about RCU grace periods, forget about locking, forget anything like
         that.  The bug is simply that page->mapping points to an anon_vma
         that was correct at one point, but was _not_ the one that was shared
         by all users of that possible mapping.
      
      Changing it to always use the deepest anon_vma in the anonvma chain gets
      us to the safest model.
      
      This can be improved in certain cases: if we know the page is private to
      just this particular mapping (for example, it's a new page, or it is the
      only swapcache entry), we could pick the top (most specific) anon_vma.
      
      But that's a future optimization. Make it _work_ reliably first.
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Tested-by: Borislav Petkov <bp@alien8.de> [ "What do you know, I think you fixed it!" ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea90002b
    • L
      anon_vma: clone the anon_vma chain in the right order · 646d87b4
      Linus Torvalds 提交于
      We want to walk the chain in reverse order when cloning it, so that the
      order of the result chain will be the same as the order in the source
      chain.  When we add entries to the chain, they go at the head of the
      chain, so we want to add the source head last.
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Tested-by: Borislav Petkov <bp@alien8.de> [ "No, it still oopses" ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      646d87b4
  12. 06 4月, 2010 1 次提交
  13. 07 3月, 2010 7 次提交
    • J
      vmscan: detect mapped file pages used only once · 64574746
      Johannes Weiner 提交于
      The VM currently assumes that an inactive, mapped and referenced file page
      is in use and promotes it to the active list.
      
      However, every mapped file page starts out like this and thus a problem
      arises when workloads create a stream of such pages that are used only for
      a short time.  By flooding the active list with those pages, the VM
      quickly gets into trouble finding eligible reclaim canditates.  The result
      is long allocation latencies and eviction of the wrong pages.
      
      This patch reuses the PG_referenced page flag (used for unmapped file
      pages) to implement a usage detection that scales with the speed of LRU
      list cycling (i.e.  memory pressure).
      
      If the scanner encounters those pages, the flag is set and the page cycled
      again on the inactive list.  Only if it returns with another page table
      reference it is activated.  Otherwise it is reclaimed as 'not recently
      used cache'.
      
      This effectively changes the minimum lifetime of a used-once mapped file
      page from a full memory cycle to an inactive list cycle, which allows it
      to occur in linear streams without affecting the stable working set of the
      system.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64574746
    • R
      mm: remove VM_LOCK_RMAP code · fc148a5f
      Rik van Riel 提交于
      When a VMA is in an inconsistent state during setup or teardown, the worst
      that can happen is that the rmap code will not be able to find the page.
      
      The mapping is in the process of being torn down (PTEs just got
      invalidated by munmap), or set up (no PTEs have been instantiated yet).
      
      It is also impossible for the rmap code to follow a pointer to an already
      freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
      teardown code needs to take before the VMA is removed from the anon_vma
      chain.
      
      Hence, we should not need the VM_LOCK_RMAP locking at all.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc148a5f
    • R
      rmap: move exclusively owned pages to own anon_vma in do_wp_page() · c44b6743
      Rik van Riel 提交于
      When the parent process breaks the COW on a page, both the original which
      is mapped at child and the new page which is mapped parent end up in that
      same anon_vma.  Generally this won't be a problem, but for some workloads
      it could preserve the O(N) rmap scanning complexity.
      
      A simple fix is to ensure that, when a page which is mapped child gets
      reused in do_wp_page, because we already are the exclusive owner, the page
      gets moved to our own exclusive child's anon_vma.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c44b6743
    • R
      rmap: remove obsolete check from __page_check_anon_rmap() · 033a64b5
      Rik van Riel 提交于
      When an anonymous page is inherited from a parent process, the
      vma->anon_vma can differ from the page anon_vma.  This can trip up
      __page_check_anon_rmap, which is indirectly called from do_swap_page().
      
      Remove that obsolete check to prevent an oops.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      033a64b5
    • R
      mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930
      Rik van Riel 提交于
      The old anon_vma code can lead to scalability issues with heavily forking
      workloads.  Specifically, each anon_vma will be shared between the parent
      process and all its child processes.
      
      In a workload with 1000 child processes and a VMA with 1000 anonymous
      pages per process that get COWed, this leads to a system with a million
      anonymous pages in the same anon_vma, each of which is mapped in just one
      of the 1000 processes.  However, the current rmap code needs to walk them
      all, leading to O(N) scanning complexity for each page.
      
      This can result in systems where one CPU is walking the page tables of
      1000 processes in page_referenced_one, while all other CPUs are stuck on
      the anon_vma lock.  This leads to catastrophic failure for a benchmark
      like AIM7, where the total number of processes can reach in the tens of
      thousands.  Real workloads are still a factor 10 less process intensive
      than AIM7, but they are catching up.
      
      This patch changes the way anon_vmas and VMAs are linked, which allows us
      to associate multiple anon_vmas with a VMA.  At fork time, each child
      process gets its own anon_vmas, in which its COWed pages will be
      instantiated.  The parents' anon_vma is also linked to the VMA, because
      non-COWed pages could be present in any of the children.
      
      This reduces rmap scanning complexity to O(1) for the pages of the 1000
      child processes, with O(N) complexity for at most 1/N pages in the system.
       This reduces the average scanning cost in heavily forking workloads from
      O(N) to 2.
      
      The only real complexity in this patch stems from the fact that linking a
      VMA to anon_vmas now involves memory allocations.  This means vma_adjust
      can fail, if it needs to attach a VMA to anon_vma structures.  This in
      turn means error handling needs to be added to the calling functions.
      
      A second source of complexity is that, because there can be multiple
      anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
      "the" anon_vma lock.  To prevent the rmap code from walking up an
      incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
      flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
      to make sure it is impossible to compile a kernel that needs both symbolic
      values for the same bitflag.
      
      Some test results:
      
      Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
      box with 16GB RAM and not quite enough IO), the system ends up running
      >99% in system time, with every CPU on the same anon_vma lock in the
      pageout code.
      
      With these changes, AIM7 hits the cross-over point around 29.7k users.
      This happens with ~99% IO wait time, there never seems to be any spike in
      system time.  The anon_vma lock contention appears to be resolved.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5beb4930
    • K
      mm: count swap usage · b084d435
      KAMEZAWA Hiroyuki 提交于
      A frequent questions from users about memory management is what numbers of
      swap ents are user for processes.  And this information will give some
      hints to oom-killer.
      
      Besides we can count the number of swapents per a process by scanning
      /proc/<pid>/smaps, this is very slow and not good for usual process
      information handler which works like 'ps' or 'top'.  (ps or top is now
      enough slow..)
      
      This patch adds a counter of swapents to mm_counter and update is at each
      swap events.  Information is exported via /proc/<pid>/status file as
      
      [kamezawa@bluextal memory]$ cat /proc/self/status
      Name:   cat
      State:  R (running)
      Tgid:   2910
      Pid:    2910
      PPid:   2823
      TracerPid:      0
      Uid:    500     500     500     500
      Gid:    500     500     500     500
      FDSize: 256
      Groups: 500
      VmPeak:    82696 kB
      VmSize:    82696 kB
      VmLck:         0 kB
      VmHWM:       432 kB
      VmRSS:       432 kB
      VmData:      172 kB
      VmStk:        84 kB
      VmExe:        48 kB
      VmLib:      1568 kB
      VmPTE:        40 kB
      VmSwap:        0 kB <=============== this.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b084d435
    • K
      mm: clean up mm_counter · d559db08
      KAMEZAWA Hiroyuki 提交于
      Presently, per-mm statistics counter is defined by macro in sched.h
      
      This patch modifies it to
        - defined in mm.h as inlinf functions
        - use array instead of macro's name creation.
      
      This patch is for reducing patch size in future patch to modify
      implementation of per-mm counter.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d559db08
  14. 16 12月, 2009 9 次提交
    • K
      memcg: make memcg's file mapped consistent with global VM · d8046582
      KAMEZAWA Hiroyuki 提交于
      In global VM, FILE_MAPPED is used but memcg uses MAPPED_FILE.  This makes
      grep difficult.  Replace memcg's MAPPED_FILE with FILE_MAPPED
      
      And in global VM, mapped shared memory is accounted into FILE_MAPPED.
      But memcg doesn't. fix it.
      Note:
        page_is_file_cache() just checks SwapBacked or not.
        So, we need to check PageAnon.
      
      Cc: Balbir Singh <balbir@in.ibm.com>
      Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8046582
    • K
      mm: simplify try_to_unmap_one() · caed0f48
      KOSAKI Motohiro 提交于
      SWAP_MLOCK mean "We marked the page as PG_MLOCK, please move it to
      unevictable-lru". So, following code is easy confusable.
      
              if (vma->vm_flags & VM_LOCKED) {
                      ret = SWAP_MLOCK;
                      goto out_unmap;
              }
      
      Plus, if the VMA doesn't have VM_LOCKED, We don't need to check
      the needed of calling mlock_vma_page().
      
      Also, add some commentary to try_to_unmap_one().
      Acked-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      caed0f48
    • H
      ksm: rmap_walk to remove_migation_ptes · e9995ef9
      Hugh Dickins 提交于
      A side-effect of making ksm pages swappable is that they have to be placed
      on the LRUs: which then exposes them to isolate_lru_page() and hence to
      page migration.
      
      Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
      rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c.  Perhaps some
      consolidation with existing code is possible, but don't attempt that yet
      (try_to_unmap needs to handle nonlinears, but migration pte removal does
      not).
      
      rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
      remove_anon_migration_ptes() which it replaces, avoids calling
      page_lock_anon_vma(), because that includes a page_mapped() test which
      fails when all migration ptes are in place.  That was valid when NUMA page
      migration was introduced (holding mmap_sem provided the missing guarantee
      that anon_vma's slab had not already been destroyed), but I believe not
      valid in the memory hotremove case added since.
      
      For now do the same as before, and consider the best way to fix that
      unlikely race later on.  When fixed, we can probably use rmap_walk() on
      hwpoisoned ksm pages too: for now, they remain among hwpoison's various
      exceptions (its PageKsm test comes before the page is locked, but its
      page_lock_anon_vma fails safely if an anon gets upgraded).
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9995ef9
    • H
      ksm: share anon page without allocating · 80e14822
      Hugh Dickins 提交于
      When ksm pages were unswappable, it made no sense to include them in mem
      cgroup accounting; but now that they are swappable (although I see no
      strict logical connection) the principle of least surprise implies that
      they should be accounted (with the usual dissatisfaction, that a shared
      page is accounted to only one of the cgroups using it).
      
      This patch was intended to add mem cgroup accounting where necessary; but
      turned inside out, it now avoids allocating a ksm page, instead upgrading
      an anon page to ksm - which brings its existing mem cgroup accounting with
      it.  Thus mem cgroups don't appear in the patch at all.
      
      This upgrade from PageAnon to PageKsm takes place under page lock (via a
      somewhat hacky NULL kpage interface), and audit showed only one place
      which needed to cope with the race - page_referenced() is sometimes used
      without page lock, so page_lock_anon_vma() needs an ACCESS_ONCE() to be
      sure of getting anon_vma and flags together (no problem if the page goes
      ksm an instant after, the integrity of that anon_vma list is unaffected).
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      80e14822
    • H
      ksm: hold anon_vma in rmap_item · db114b83
      Hugh Dickins 提交于
      For full functionality, page_referenced_one() and try_to_unmap_one() need
      to know the vma: to pass vma down to arch-dependent flushes, or to observe
      VM_LOCKED or VM_EXEC.  But KSM keeps no record of vma: nor can it, since
      vmas get split and merged without its knowledge.
      
      Instead, note page's anon_vma in its rmap_item when adding to stable tree:
      all the vmas which might map that page are listed by its anon_vma.
      
      page_referenced_ksm() and try_to_unmap_ksm() then traverse the anon_vma,
      first to find the probable vma, that which matches rmap_item's mm; but if
      that is not enough to locate all instances, traverse again to try the
      others.  This catches those occasions when fork has duplicated a pte of a
      ksm page, but ksmd has not yet come around to assign it an rmap_item.
      
      But each rmap_item in the stable tree which refers to an anon_vma needs to
      take a reference to it.  Andrea's anon_vma design cleverly avoided a
      reference count (an anon_vma was free when its list of vmas was empty),
      but KSM now needs to add that.  Is a 32-bit count sufficient?  I believe
      so - the anon_vma is only free when both count is 0 and list is empty.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      db114b83
    • H
      ksm: let shared pages be swappable · 5ad64688
      Hugh Dickins 提交于
      Initial implementation for swapping out KSM's shared pages: add
      page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
      faced with a PageKsm page.
      
      Most of what's needed can be got from the rmap_items listed from the
      stable_node of the ksm page, without discovering the actual vma: so in
      this patch just fake up a struct vma for page_referenced_one() or
      try_to_unmap_one(), then refine that in the next patch.
      
      Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
      implicit there (being only set with VM_SHARED, already excluded), but
      let's make it explicit, to help justify the lack of nonlinear unmap.
      
      Rely on the page lock to protect against concurrent modifications to that
      page's node of the stable tree.
      
      The awkward part is not swapout but swapin: do_swap_page() and
      page_add_anon_rmap() now have to allow for new possibilities - perhaps a
      ksm page still in swapcache, perhaps a swapcache page associated with one
      location in one anon_vma now needed for another location or anon_vma.
      (And the vma might even be no longer VM_MERGEABLE when that happens.)
      
      ksm_might_need_to_copy() checks for that case, and supplies a duplicate
      page when necessary, simply leaving it to a subsequent pass of ksmd to
      rediscover the identity and merge them back into one ksm page.
      Disappointingly primitive: but the alternative would have to accumulate
      unswappable info about the swapped out ksm pages, limiting swappability.
      
      Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
      particular case it was handling, so just use it instead.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ad64688
    • H
      mm: pass address down to rmap ones · 1cb1729b
      Hugh Dickins 提交于
      KSM swapping will know where page_referenced_one() and try_to_unmap_one()
      should look.  It could hack page->index to get them to do what it wants,
      but it seems cleaner now to pass the address down to them.
      
      Make the same change to page_mkclean_one(), since it follows the same
      pattern; but there's no real need in its case.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1cb1729b
    • H
      mm: CONFIG_MMU for PG_mlocked · af8e3354
      Hugh Dickins 提交于
      Remove three degrees of obfuscation, left over from when we had
      CONFIG_UNEVICTABLE_LRU.  MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
      CONFIG_HAVE_MLOCK is CONFIG_MMU.  rmap.o (and memory-failure.o) are only
      built when CONFIG_MMU, so don't need such conditions at all.
      
      Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
      169 defconfigs: leave those to evolve in due course.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af8e3354
    • H
      mm: mlocking in try_to_unmap_one · 53f79acb
      Hugh Dickins 提交于
      There's contorted mlock/munlock handling in try_to_unmap_anon() and
      try_to_unmap_file(), which we'd prefer not to repeat for KSM swapping.
      Simplify it by moving it all down into try_to_unmap_one().
      
      One thing is then lost, try_to_munlock()'s distinction between when no vma
      holds the page mlocked, and when a vma does mlock it, but we could not get
      mmap_sem to set the page flag.  But its only caller takes no interest in
      that distinction (and is better testing SWAP_MLOCK anyway), so let's keep
      the code simple and return SWAP_AGAIN for both cases.
      
      try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly
      amusing: once unravelled, it turns out to have been choosing between two
      different ways of doing the same nothing.  Ah, no, one way was actually
      returning SWAP_FAIL when it meant to return SWAP_SUCCESS.
      
      [kosaki.motohiro@jp.fujitsu.com: comment adding to mlocking in try_to_unmap_one]
      [akpm@linux-foundation.org: remove test of MLOCK_PAGES]
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53f79acb