1. April 13, 2010 (2 commits)
    • anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma · ea90002b
      Authored by Linus Torvalds
      Otherwise we might be mapping in a page in a new mapping, but that page
      (through the swapcache) would later be mapped into an old mapping too.
      The page->mapping must be the one that works for everybody, not just
      the mapping that happened to page it in first.
      
      Here's the scenario:
      
       - page gets allocated/mapped by process A. Let's call the anon_vma we
         associate the page with 'A' to keep it easy to track.
      
       - Process A forks, creating process B. The anon_vma in B is 'B', and has
         a chain that looks like 'B' -> 'A'. Everything is fine.
      
       - Swapping happens. The page (with mapping pointing to 'A') gets swapped
         out (perhaps not to disk - it's enough to assume that it's just not
         mapped any more, and lives entirely in the swap-cache)
      
       - Process B pages it in, which goes like this:
      
              do_swap_page ->
                page = lookup_swap_cache(entry);
               ...
                set_pte_at(mm, address, page_table, pte);
                page_add_anon_rmap(page, vma, address);
      
         And think about what happens here!
      
         In particular, what happens is that this will now be the "first"
         mapping of that page, so page_add_anon_rmap() used to do
      
              if (first)
                      __page_set_anon_rmap(page, vma, address);
      
         and notice what anon_vma it will use? It will use the anon_vma for
         process B!
      
         What happens then? Trivial: process 'A' also pages it in (nothing
         happens, it's not the first mapping), and then process 'B' execve's
         or exits or unmaps, making anon_vma B go away.
      
         End result: process A has a page that points to anon_vma B, but
         anon_vma B does not exist any more.  This can go on forever.  Forget
         about RCU grace periods, forget about locking, forget anything like
         that.  The bug is simply that page->mapping points to an anon_vma
         that was correct at one point, but was _not_ the one that was shared
         by all users of that possible mapping.
      
      Changing it to always use the deepest anon_vma in the anonvma chain gets
      us to the safest model.
      
      This can be improved in certain cases: if we know the page is private to
      just this particular mapping (for example, it's a new page, or it is the
      only swapcache entry), we could pick the top (most specific) anon_vma.
      
      But that's a future optimization. Make it _work_ reliably first.
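      As a hedged illustration of the "pick the oldest" rule, here is a tiny
      user-space model (not the kernel code; names are made up): the chain is
      ordered from the process's own anon_vma down to the oldest ancestor's,
      and the page must be tied to the oldest entry, the one that every
      process able to map the page agrees on.

              /* user-space model: the last chain entry is the oldest anon_vma */
              #include <stdio.h>

              struct anon_vma { const char *owner; };

              int main(void)
              {
                      /* chain for process B after fork: B's own anon_vma
                       * first, then its parent A's (the oldest) */
                      struct anon_vma chain[] = { { "B" }, { "A" } };
                      int n = sizeof(chain) / sizeof(chain[0]);

                      /* buggy choice: first (most specific) entry, dies with B */
                      struct anon_vma *newest = &chain[0];
                      /* fixed choice: last (oldest) entry, valid for A and B */
                      struct anon_vma *oldest = &chain[n - 1];

                      printf("newest=%s oldest=%s\n", newest->owner, oldest->owner);
                      return 0;
              }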
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Tested-by: Borislav Petkov <bp@alien8.de> [ "What do you know, I think you fixed it!" ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea90002b
    • anon_vma: clone the anon_vma chain in the right order · 646d87b4
      Authored by Linus Torvalds
      We want to walk the chain in reverse order when cloning it, so that the
      order of the result chain will be the same as the order in the source
      chain.  When we add entries to the chain, they go at the head of the
      chain, so we want to add the source head last.
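      A hedged user-space sketch (illustrative names, not the kernel code) of
      why reverse iteration preserves order when every insert goes to the
      head of the destination list:

              #include <stdio.h>
              #include <stdlib.h>

              struct node { int val; struct node *next; };

              /* insert at head, as the chain-link helper effectively does */
              static void push(struct node **head, int val)
              {
                      struct node *n = malloc(sizeof(*n));
                      n->val = val;
                      n->next = *head;
                      *head = n;
              }

              int main(void)
              {
                      int src[] = { 1, 2, 3 };        /* source chain order */
                      struct node *dst = NULL;

                      /* walk the source in REVERSE so head-insertion
                       * reproduces the original order: push 3, 2, 1 -> 1 2 3 */
                      for (int i = 2; i >= 0; i--)
                              push(&dst, src[i]);

                      for (struct node *n = dst; n; n = n->next)
                              printf("%d ", n->val);
                      printf("\n");                   /* prints: 1 2 3 */
                      return 0;
              }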
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Tested-by: Borislav Petkov <bp@alien8.de> [ "No, it still oopses" ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      646d87b4
  2. April 6, 2010 (1 commit)
  3. March 7, 2010 (7 commits)
    • vmscan: detect mapped file pages used only once · 64574746
      Authored by Johannes Weiner
      The VM currently assumes that an inactive, mapped and referenced file page
      is in use and promotes it to the active list.
      
      However, every mapped file page starts out like this and thus a problem
      arises when workloads create a stream of such pages that are used only for
      a short time.  By flooding the active list with those pages, the VM
      quickly gets into trouble finding eligible reclaim candidates.  The result
      is long allocation latencies and eviction of the wrong pages.
      
      This patch reuses the PG_referenced page flag (used for unmapped file
      pages) to implement a usage detection that scales with the speed of LRU
      list cycling (i.e.  memory pressure).
      
      If the scanner encounters such a page, the flag is set and the page is
      cycled again on the inactive list.  Only if it returns with another
      page table reference is it activated.  Otherwise it is reclaimed as
      'not recently used cache'.
      
      This effectively changes the minimum lifetime of a used-once mapped file
      page from a full memory cycle to an inactive list cycle, which allows it
      to occur in linear streams without affecting the stable working set of the
      system.
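      A simplified user-space model of the two-pass detection (hedged; the
      real kernel logic is more involved): a mapped file page is activated
      only if it is found referenced on two consecutive inactive-list scans.

              #include <stdbool.h>
              #include <stdio.h>

              struct page { bool pg_referenced; };

              enum action { RECLAIM, KEEP_INACTIVE, ACTIVATE };

              /* the first reference only sets the flag and recycles the page
               * on the inactive list; a second reference within one list
               * cycle activates it */
              static enum action scan(struct page *p, bool pte_referenced)
              {
                      if (!pte_referenced)
                              return RECLAIM;       /* not recently used cache */
                      if (!p->pg_referenced) {
                              p->pg_referenced = true;
                              return KEEP_INACTIVE; /* cycle once more */
                      }
                      p->pg_referenced = false;
                      return ACTIVATE;              /* used more than once */
              }

              int main(void)
              {
                      struct page p = { false };
                      printf("%d\n", scan(&p, true)); /* 1: stays inactive */
                      printf("%d\n", scan(&p, true)); /* 2: activated */
                      return 0;
              }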
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64574746
    • mm: remove VM_LOCK_RMAP code · fc148a5f
      Authored by Rik van Riel
      When a VMA is in an inconsistent state during setup or teardown, the worst
      that can happen is that the rmap code will not be able to find the page.
      
      The mapping is in the process of being torn down (PTEs just got
      invalidated by munmap), or set up (no PTEs have been instantiated yet).
      
      It is also impossible for the rmap code to follow a pointer to an already
      freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
      teardown code needs to take before the VMA is removed from the anon_vma
      chain.
      
      Hence, we should not need the VM_LOCK_RMAP locking at all.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc148a5f
    • rmap: move exclusively owned pages to own anon_vma in do_wp_page() · c44b6743
      Authored by Rik van Riel
      When the parent process breaks COW on a page, both the original page,
      which stays mapped in the child, and the new page, which is mapped in
      the parent, end up in the same anon_vma.  Generally this won't be a
      problem, but for some workloads it can preserve the O(N) rmap scanning
      complexity.

      A simple fix is to ensure that, when a page mapped only by the child
      gets reused in do_wp_page, because the child is already the exclusive
      owner, the page is moved to the child's own exclusive anon_vma.
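      A hedged sketch of the idea in miniature (user-space model, invented
      names): on exclusive COW reuse, retarget the page's anon_vma pointer.

              #include <stdio.h>

              struct anon_vma { const char *owner; };
              struct page { struct anon_vma *mapping; int mapcount; };

              /* if we are the only mapper when reusing the page, move it
               * to our own anon_vma so rmap scans stay narrow */
              static void maybe_move_anon_rmap(struct page *p, struct anon_vma *mine)
              {
                      if (p->mapcount == 1)
                              p->mapping = mine;
              }

              int main(void)
              {
                      struct anon_vma parent = { "parent" }, child = { "child" };
                      struct page p = { &parent, 1 };

                      maybe_move_anon_rmap(&p, &child);
                      printf("%s\n", p.mapping->owner);   /* child */
                      return 0;
              }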
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c44b6743
    • rmap: remove obsolete check from __page_check_anon_rmap() · 033a64b5
      Authored by Rik van Riel
      When an anonymous page is inherited from a parent process, the
      vma->anon_vma can differ from the page anon_vma.  This can trip up
      __page_check_anon_rmap, which is indirectly called from do_swap_page().
      
      Remove that obsolete check to prevent an oops.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      033a64b5
    • mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930
      Authored by Rik van Riel
      The old anon_vma code can lead to scalability issues with heavily forking
      workloads.  Specifically, each anon_vma will be shared between the parent
      process and all its child processes.
      
      In a workload with 1000 child processes and a VMA with 1000 anonymous
      pages per process that get COWed, this leads to a system with a million
      anonymous pages in the same anon_vma, each of which is mapped in just one
      of the 1000 processes.  However, the current rmap code needs to walk them
      all, leading to O(N) scanning complexity for each page.
      
      This can result in systems where one CPU is walking the page tables of
      1000 processes in page_referenced_one, while all other CPUs are stuck on
      the anon_vma lock.  This leads to catastrophic failure for a benchmark
      like AIM7, where the total number of processes can reach into the tens
      of thousands.  Real workloads are still a factor of 10 less
      process-intensive than AIM7, but they are catching up.
      
      This patch changes the way anon_vmas and VMAs are linked, which allows us
      to associate multiple anon_vmas with a VMA.  At fork time, each child
      process gets its own anon_vmas, in which its COWed pages will be
      instantiated.  The parents' anon_vma is also linked to the VMA, because
      non-COWed pages could be present in any of the children.
      
      This reduces rmap scanning complexity to O(1) for the pages of the 1000
      child processes, with O(N) complexity for at most 1/N pages in the system.
       This reduces the average scanning cost in heavily forking workloads from
      O(N) to 2.
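      A hedged structural sketch of the new linking (user-space model with
      plain pointers instead of kernel list heads; names follow the patch's
      spirit, not its letter): one link object per (vma, anon_vma) pair lets
      a child VMA point at its own anon_vma plus every ancestor's.

              #include <stdio.h>

              struct anon_vma { int id; };
              struct vma;

              struct anon_vma_chain {
                      struct vma *vma;              /* VMA this link belongs to */
                      struct anon_vma *anon_vma;    /* one anon_vma it links to */
                      struct anon_vma_chain *next;  /* next link on VMA's list */
              };

              struct vma { struct anon_vma_chain *chain; };

              int main(void)
              {
                      struct anon_vma parent = { 1 }, child = { 2 };
                      /* the child VMA links to its own anon_vma AND the
                       * parent's, since non-COWed pages may live in either */
                      struct anon_vma_chain to_parent = { NULL, &parent, NULL };
                      struct anon_vma_chain to_child  = { NULL, &child, &to_parent };
                      struct vma v = { &to_child };

                      for (struct anon_vma_chain *c = v.chain; c; c = c->next)
                              printf("linked to anon_vma %d\n", c->anon_vma->id);
                      return 0;
              }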
      
      The only real complexity in this patch stems from the fact that linking a
      VMA to anon_vmas now involves memory allocations.  This means vma_adjust
      can fail, if it needs to attach a VMA to anon_vma structures.  This in
      turn means error handling needs to be added to the calling functions.
      
      A second source of complexity is that, because there can be multiple
      anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
      "the" anon_vma lock.  To prevent the rmap code from walking up an
      incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
      flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
      to make sure it is impossible to compile a kernel that needs both symbolic
      values for the same bitflag.
      
      Some test results:
      
      Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
      box with 16GB RAM and not quite enough IO), the system ends up running
      >99% in system time, with every CPU on the same anon_vma lock in the
      pageout code.
      
      With these changes, AIM7 hits the cross-over point around 29.7k users.
      This happens with ~99% IO wait time, there never seems to be any spike in
      system time.  The anon_vma lock contention appears to be resolved.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5beb4930
    • mm: count swap usage · b084d435
      Authored by KAMEZAWA Hiroyuki
      A frequent question from users about memory management is how many swap
      entries are used by each process.  This information would also give
      some hints to the oom-killer.

      Although we can count the number of swap entries per process by
      scanning /proc/<pid>/smaps, that is very slow and no good for the usual
      process-information handlers that work like 'ps' or 'top' (ps and top
      are slow enough already..).

      This patch adds a swap-entry counter to mm_counter and updates it at
      each swap event.  The information is exported via the /proc/<pid>/status
      file as
      
      [kamezawa@bluextal memory]$ cat /proc/self/status
      Name:   cat
      State:  R (running)
      Tgid:   2910
      Pid:    2910
      PPid:   2823
      TracerPid:      0
      Uid:    500     500     500     500
      Gid:    500     500     500     500
      FDSize: 256
      Groups: 500
      VmPeak:    82696 kB
      VmSize:    82696 kB
      VmLck:         0 kB
      VmHWM:       432 kB
      VmRSS:       432 kB
      VmData:      172 kB
      VmStk:        84 kB
      VmExe:        48 kB
      VmLib:      1568 kB
      VmPTE:        40 kB
      VmSwap:        0 kB <=============== this.
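      A hedged user-space model of the bookkeeping (counter names are
      illustrative): a per-mm array slot is bumped on swap-out and dropped on
      swap-in, so reading VmSwap is O(1) instead of a full smaps scan.

              #include <stdio.h>

              enum { MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS, NR_MM_COUNTERS };

              static long mm_counters[NR_MM_COUNTERS];

              static void swap_out_page(void)
              {
                      mm_counters[MM_ANONPAGES]--;
                      mm_counters[MM_SWAPENTS]++;
              }

              static void swap_in_page(void)
              {
                      mm_counters[MM_SWAPENTS]--;
                      mm_counters[MM_ANONPAGES]++;
              }

              int main(void)
              {
                      mm_counters[MM_ANONPAGES] = 100;
                      swap_out_page();
                      swap_out_page();
                      swap_in_page();
                      /* what /proc/<pid>/status would report as VmSwap */
                      printf("VmSwap: %ld pages\n", mm_counters[MM_SWAPENTS]);
                      return 0;
              }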
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b084d435
    • mm: clean up mm_counter · d559db08
      Authored by KAMEZAWA Hiroyuki
      Presently, the per-mm statistics counters are defined by macros in
      sched.h.

      This patch modifies them to be
        - defined in mm.h as inline functions
        - backed by an array instead of macro-generated names

      This keeps a future patch, which will modify the implementation of the
      per-mm counters, small.
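      A hedged sketch of the shape of the change (user-space model): the
      macro-generated counter names become an enum-indexed array accessed
      through small inline helpers.

              #include <stdio.h>

              enum { MM_FILEPAGES, MM_ANONPAGES, NR_MM_COUNTERS };

              struct mm_struct { long counters[NR_MM_COUNTERS]; };

              static inline long get_mm_counter(struct mm_struct *mm, int member)
              {
                      return mm->counters[member];
              }

              static inline void add_mm_counter(struct mm_struct *mm, int member,
                                                long value)
              {
                      mm->counters[member] += value;
              }

              int main(void)
              {
                      struct mm_struct mm = { { 0 } };

                      add_mm_counter(&mm, MM_ANONPAGES, 3);
                      printf("%ld\n", get_mm_counter(&mm, MM_ANONPAGES)); /* 3 */
                      return 0;
              }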
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d559db08
  4. December 16, 2009 (14 commits)
    • memcg: make memcg's file mapped consistent with global VM · d8046582
      Authored by KAMEZAWA Hiroyuki
      The global VM uses FILE_MAPPED, but memcg uses MAPPED_FILE.  This makes
      grepping difficult.  Replace memcg's MAPPED_FILE with FILE_MAPPED.

      Also, the global VM accounts mapped shared memory into FILE_MAPPED,
      but memcg doesn't.  Fix that.
      Note:
        page_is_file_cache() just checks SwapBacked or not.
        So, we also need to check PageAnon.
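      A hedged illustration of the Note (user-space model): shared memory is
      swap-backed, so a SwapBacked test alone misclassifies it; the anon test
      is what separates anon RSS from FILE_MAPPED.

              #include <stdbool.h>
              #include <stdio.h>

              struct page { bool swap_backed; bool anon; };

              /* analogue of page_is_file_cache(): only looks at SwapBacked */
              static bool page_is_file_cache(const struct page *p)
              {
                      return !p->swap_backed;
              }

              /* FILE_MAPPED accounting should cover everything that is not
               * anonymous -- including shmem, which IS swap-backed */
              static bool counts_as_file_mapped(const struct page *p)
              {
                      return !p->anon;
              }

              int main(void)
              {
                      struct page shmem = { true, false };  /* shared memory */

                      printf("file_cache=%d file_mapped=%d\n",
                             page_is_file_cache(&shmem),      /* 0 */
                             counts_as_file_mapped(&shmem));  /* 1 */
                      return 0;
              }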
      
      Cc: Balbir Singh <balbir@in.ibm.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8046582
    • mm: simplify try_to_unmap_one() · caed0f48
      Authored by KOSAKI Motohiro
      SWAP_MLOCK means "we marked the page as PG_mlocked, please move it to
      the unevictable LRU".  So the following code is easily misread:

              if (vma->vm_flags & VM_LOCKED) {
                      ret = SWAP_MLOCK;
                      goto out_unmap;
              }

      Plus, if the VMA doesn't have VM_LOCKED, we don't need to consider
      calling mlock_vma_page() at all.

      Also, add some commentary to try_to_unmap_one().
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      caed0f48
    • ksm: rmap_walk to remove_migration_ptes · e9995ef9
      Authored by Hugh Dickins
      A side-effect of making ksm pages swappable is that they have to be placed
      on the LRUs: which then exposes them to isolate_lru_page() and hence to
      page migration.
      
      Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
      rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c.  Perhaps some
      consolidation with existing code is possible, but don't attempt that yet
      (try_to_unmap needs to handle nonlinears, but migration pte removal does
      not).
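      A hedged model of the dispatch (user-space sketch; the kernel versions
      take an rmap callback and walk real VMA lists): KSM pages are a special
      kind of anon page, so PageKsm must be tested before PageAnon.

              #include <stdbool.h>
              #include <stdio.h>

              struct page { bool anon; bool ksm; };

              static void rmap_walk_anon(struct page *p) { (void)p; printf("anon walk\n"); }
              static void rmap_walk_ksm(struct page *p)  { (void)p; printf("ksm walk\n"); }
              static void rmap_walk_file(struct page *p) { (void)p; printf("file walk\n"); }

              static void rmap_walk(struct page *p)
              {
                      if (p->ksm)
                              rmap_walk_ksm(p);
                      else if (p->anon)
                              rmap_walk_anon(p);
                      else
                              rmap_walk_file(p);
              }

              int main(void)
              {
                      struct page ksm_page = { .anon = true, .ksm = true };

                      rmap_walk(&ksm_page);   /* prints: ksm walk */
                      return 0;
              }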
      
      rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
      remove_anon_migration_ptes() which it replaces, avoids calling
      page_lock_anon_vma(), because that includes a page_mapped() test which
      fails when all migration ptes are in place.  That was valid when NUMA page
      migration was introduced (holding mmap_sem provided the missing guarantee
      that anon_vma's slab had not already been destroyed), but I believe not
      valid in the memory hotremove case added since.
      
      For now do the same as before, and consider the best way to fix that
      unlikely race later on.  When fixed, we can probably use rmap_walk() on
      hwpoisoned ksm pages too: for now, they remain among hwpoison's various
      exceptions (its PageKsm test comes before the page is locked, but its
      page_lock_anon_vma fails safely if an anon gets upgraded).
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9995ef9
    • ksm: share anon page without allocating · 80e14822
      Authored by Hugh Dickins
      When ksm pages were unswappable, it made no sense to include them in mem
      cgroup accounting; but now that they are swappable (although I see no
      strict logical connection) the principle of least surprise implies that
      they should be accounted (with the usual dissatisfaction, that a shared
      page is accounted to only one of the cgroups using it).
      
      This patch was intended to add mem cgroup accounting where necessary; but
      turned inside out, it now avoids allocating a ksm page, instead upgrading
      an anon page to ksm - which brings its existing mem cgroup accounting with
      it.  Thus mem cgroups don't appear in the patch at all.
      
      This upgrade from PageAnon to PageKsm takes place under page lock (via a
      somewhat hacky NULL kpage interface), and audit showed only one place
      which needed to cope with the race - page_referenced() is sometimes used
      without page lock, so page_lock_anon_vma() needs an ACCESS_ONCE() to be
      sure of getting anon_vma and flags together (no problem if the page goes
      ksm an instant after, the integrity of that anon_vma list is unaffected).
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80e14822
    • ksm: hold anon_vma in rmap_item · db114b83
      Authored by Hugh Dickins
      For full functionality, page_referenced_one() and try_to_unmap_one() need
      to know the vma: to pass vma down to arch-dependent flushes, or to observe
      VM_LOCKED or VM_EXEC.  But KSM keeps no record of vma: nor can it, since
      vmas get split and merged without its knowledge.
      
      Instead, note page's anon_vma in its rmap_item when adding to stable tree:
      all the vmas which might map that page are listed by its anon_vma.
      
      page_referenced_ksm() and try_to_unmap_ksm() then traverse the anon_vma,
      first to find the probable vma, that which matches rmap_item's mm; but if
      that is not enough to locate all instances, traverse again to try the
      others.  This catches those occasions when fork has duplicated a pte of a
      ksm page, but ksmd has not yet come around to assign it an rmap_item.
      
      But each rmap_item in the stable tree which refers to an anon_vma needs to
      take a reference to it.  Andrea's anon_vma design cleverly avoided a
      reference count (an anon_vma was free when its list of vmas was empty),
      but KSM now needs to add that.  Is a 32-bit count sufficient?  I believe
      so - the anon_vma is only free when both count is 0 and list is empty.
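      A hedged model of the freeing rule (user-space sketch): with the new
      count, an anon_vma dies only when both the refcount has reached zero
      and the list of vmas is empty.

              #include <stdio.h>

              struct anon_vma { int refcount; int nr_vmas; };

              static void put_anon_vma(struct anon_vma *av)
              {
                      if (--av->refcount == 0 && av->nr_vmas == 0)
                              printf("anon_vma freed\n");
              }

              int main(void)
              {
                      /* a ksm rmap_item held the last reference after the
                       * vma list already went empty */
                      struct anon_vma av = { .refcount = 1, .nr_vmas = 0 };

                      put_anon_vma(&av);
                      return 0;
              }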
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db114b83
    • ksm: let shared pages be swappable · 5ad64688
      Authored by Hugh Dickins
      Initial implementation for swapping out KSM's shared pages: add
      page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
      faced with a PageKsm page.
      
      Most of what's needed can be got from the rmap_items listed from the
      stable_node of the ksm page, without discovering the actual vma: so in
      this patch just fake up a struct vma for page_referenced_one() or
      try_to_unmap_one(), then refine that in the next patch.
      
      Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
      implicit there (being only set with VM_SHARED, already excluded), but
      let's make it explicit, to help justify the lack of nonlinear unmap.
      
      Rely on the page lock to protect against concurrent modifications to that
      page's node of the stable tree.
      
      The awkward part is not swapout but swapin: do_swap_page() and
      page_add_anon_rmap() now have to allow for new possibilities - perhaps a
      ksm page still in swapcache, perhaps a swapcache page associated with one
      location in one anon_vma now needed for another location or anon_vma.
      (And the vma might even be no longer VM_MERGEABLE when that happens.)
      
      ksm_might_need_to_copy() checks for that case, and supplies a duplicate
      page when necessary, simply leaving it to a subsequent pass of ksmd to
      rediscover the identity and merge them back into one ksm page.
      Disappointingly primitive: but the alternative would have to accumulate
      unswappable info about the swapped out ksm pages, limiting swappability.
      
      Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
      particular case it was handling, so just use it instead.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ad64688
    • mm: pass address down to rmap ones · 1cb1729b
      Authored by Hugh Dickins
      KSM swapping will know where page_referenced_one() and try_to_unmap_one()
      should look.  It could hack page->index to get them to do what it wants,
      but it seems cleaner now to pass the address down to them.
      
      Make the same change to page_mkclean_one(), since it follows the same
      pattern; but there's no real need in its case.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1cb1729b
    • mm: CONFIG_MMU for PG_mlocked · af8e3354
      Authored by Hugh Dickins
      Remove three degrees of obfuscation, left over from when we had
      CONFIG_UNEVICTABLE_LRU.  MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
      CONFIG_HAVE_MLOCK is CONFIG_MMU.  rmap.o (and memory-failure.o) are only
      built when CONFIG_MMU, so don't need such conditions at all.
      
      Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
      169 defconfigs: leave those to evolve in due course.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af8e3354
    • mm: mlocking in try_to_unmap_one · 53f79acb
      Authored by Hugh Dickins
      There's contorted mlock/munlock handling in try_to_unmap_anon() and
      try_to_unmap_file(), which we'd prefer not to repeat for KSM swapping.
      Simplify it by moving it all down into try_to_unmap_one().
      
      One thing is then lost, try_to_munlock()'s distinction between when no vma
      holds the page mlocked, and when a vma does mlock it, but we could not get
      mmap_sem to set the page flag.  But its only caller takes no interest in
      that distinction (and is better testing SWAP_MLOCK anyway), so let's keep
      the code simple and return SWAP_AGAIN for both cases.
      
      try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly
      amusing: once unravelled, it turns out to have been choosing between two
      different ways of doing the same nothing.  Ah, no, one way was actually
      returning SWAP_FAIL when it meant to return SWAP_SUCCESS.
      
      [kosaki.motohiro@jp.fujitsu.com: comment adding to mlocking in try_to_unmap_one]
      [akpm@linux-foundation.org: remove test of MLOCK_PAGES]
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      53f79acb
    • mm: define PAGE_MAPPING_FLAGS · 3ca7b3c5
      Authored by Hugh Dickins
      At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
      in page->mapping, with the higher bits a pointer to the anon_vma; and have
      defined PageKsm(page) as that with NULL anon_vma.
      
      But KSM swapping will need to store a pointer there: so in preparation for
      that, now define PAGE_MAPPING_FLAGS as the low two bits, including
      PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
      other use for the bit emerges).
      
      Declare page_rmapping(page) to return the pointer part of page->mapping,
      and page_anon_vma(page) to return the anon_vma pointer when that's what it
      is.  Use these in a few appropriate places: notably, unuse_vma() has been
      testing page->mapping, but is better to be testing page_anon_vma() (cases
      may be added in which flag bits are set without any pointer).
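      A hedged user-space model of the bit-stuffing (the PAGE_MAPPING_* names
      match the text above; the rest is illustrative): the low two bits of
      page->mapping flag what the pointer means, and the remaining bits are
      the pointer itself.

              #include <stdint.h>
              #include <stdio.h>

              #define PAGE_MAPPING_ANON  1UL
              #define PAGE_MAPPING_KSM   2UL
              #define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)

              struct anon_vma { int dummy; };

              /* pointer part of page->mapping, whatever the flags say */
              static void *page_rmapping(uintptr_t mapping)
              {
                      return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
              }

              /* the anon_vma pointer, but only for a plain anon page
               * (ANON set, KSM clear; ksm pages carry ANON|KSM) */
              static struct anon_vma *page_anon_vma(uintptr_t mapping)
              {
                      if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                              return NULL;
                      return page_rmapping(mapping);
              }

              int main(void)
              {
                      static struct anon_vma av;   /* sufficiently aligned */
                      uintptr_t mapping = (uintptr_t)&av | PAGE_MAPPING_ANON;

                      printf("%p %p\n", (void *)&av, (void *)page_anon_vma(mapping));
                      return 0;
              }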
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ca7b3c5
    • rmap: move label `out' to a better place · 273f047e
      Authored by Huang Shijie
      When the code jumps to `out', `referenced' is still zero, so there is
      no need to check it.
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      273f047e
    • rmap: simplify try_to_unmap_file() · 7b511594
      Authored by Huang Shijie
      Just simplify the code when `mlocked' is true.
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b511594
    • rmap: fix the comment for try_to_unmap_anon · 8051be5e
      Authored by Huang Shijie
      Fix the comment for try_to_unmap_anon() with the new arguments.
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8051be5e
    • swap_info: swap count continuations · 570a335b
      Authored by Hugh Dickins
      Swap is duplicated (reference count incremented by one) whenever the same
      swap page is inserted into another mm (when forking finds a swap entry in
      place of a pte, or when reclaim unmaps a pte to insert the swap entry).
      
      swap_info_struct's vmalloc'ed swap_map is the array of these reference
      counts: but what happens when the unsigned short (or unsigned char since
      the preceding patch) is full? (and its high bit is kept for a cache flag)
      
      We then lose track of it, never freeing, leaving it in use until swapoff:
      at which point we _hope_ that a single pass will have found all instances,
      assume there are no more, and will lose user data if we're wrong.
      
      Swapping of KSM pages has not yet been enabled; but it is implemented,
      and makes it very easy for a user to overflow the maximum swap count:
      possible with ordinary process pages, but unlikely, even when pid_max
      has been raised from PID_MAX_DEFAULT.
      
      This patch implements swap count continuations: when the count overflows,
      a continuation page is allocated and linked to the original vmalloc'ed
      map page, and this used to hold the continuation counts for that entry
      and its neighbours.  These continuation pages are seldom referenced:
      the common paths all work on the original swap_map, only referring to
      a continuation page when the low "digit" of a count is incremented or
      decremented through SWAP_MAP_MAX.
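      A hedged arithmetic model of the continuation scheme (user-space
      sketch with a single slot; the kernel chains whole continuation
      pages): the main map holds the low "digit" of the count, and the
      continuation holds the overflow.

              #include <stdio.h>

              #define SWAP_MAP_MAX 0x3f   /* low digit saturates here */

              struct swap_slot { unsigned int low; unsigned int high; };

              static void swap_duplicate(struct swap_slot *s)
              {
                      if (s->low < SWAP_MAP_MAX)
                              s->low++;
                      else
                              s->high++;  /* carry into the continuation */
              }

              static void swap_free(struct swap_slot *s)
              {
                      if (s->low == SWAP_MAP_MAX && s->high)
                              s->high--;  /* borrow from the continuation */
                      else
                              s->low--;
              }

              int main(void)
              {
                      struct swap_slot s = { 0, 0 };
                      int i;

                      for (i = 0; i < 100; i++)
                              swap_duplicate(&s);
                      printf("low=%u high=%u total=%u\n",
                             s.low, s.high, s.low + s.high);  /* 63 37 100 */

                      for (i = 0; i < 100; i++)
                              swap_free(&s);
                      printf("low=%u high=%u\n", s.low, s.high); /* 0 0 */
                      return 0;
              }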
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      570a335b
  5. October 2, 2009 (1 commit)
  6. September 22, 2009 (2 commits)
  7. September 16, 2009 (4 commits)
    • HWPOISON: The high level memory error handler in the VM v7 · 6a46079c
      Authored by Andi Kleen
      Add the high-level memory error handler that poisons pages that got
      corrupted by hardware (typically by a two-bit flip in a DIMM or a
      cache) at the Linux level.  The goal is to prevent everyone from
      accessing these pages in the future.

      This is done at the VM level by marking a page hwpoisoned and taking
      the appropriate action based on the type of page it is.
      
      The code that does this is portable and lives in mm/memory-failure.c
      
      To quote the overview comment:
      
      High level machine check handler. Handles pages reported by the
      hardware as being corrupted usually due to a 2bit ECC memory or cache
      failure.
      
      This focuses on pages detected as corrupted in the background.
      When the current CPU tries to consume corruption the currently
      running process can just be killed directly instead. This implies
      that if the error cannot be handled for some reason it's safe to
      just ignore it because no corruption has been consumed yet. Instead
      when that happens another machine check will happen.
      
      Handles page cache pages in various states. The tricky part
      here is that we can access any page asynchronous to other VM
      users, because memory failures could happen anytime and anywhere,
      possibly violating some of their assumptions. This is why this code
      has to be extremely careful. Generally it tries to use normal locking
      rules, as in get the standard locks, even if that means the
      error handling takes potentially a long time.
      
      Some of the operations here are somewhat inefficient and have
      non-linear algorithmic complexity, because the data structures have not
      been optimized for this case. This is in particular the case
      for the mapping from a vma to a process. Since this case is expected
      to be rare we hope we can get away with this.
      
      There are in principle two strategies to kill processes on poison:
      - just unmap the data and wait for an actual reference before
      killing
      - kill as soon as corruption is detected.
      Both have advantages and disadvantages and should be used
      in different situations. Right now both are implemented and can
      be switched with a new sysctl vm.memory_failure_early_kill
      The default is early kill.
      
      The patch does some rmap data structure walking on its own to collect
      processes to kill. This is unusual because normally all rmap data structure
      knowledge is in rmap.c only. I put it here for now to keep
      everything together, and rmap knowledge has been seeping out anyway.
      
      Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
      Nick Piggin (who did a lot of great work) and others.
      
      Cc: npiggin@suse.de
      Cc: riel@redhat.com
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      6a46079c
    • HWPOISON: Handle hardware poisoned pages in try_to_unmap · 888b9f7c
      Authored by Andi Kleen
      When a page has the poison bit set replace the PTE with a poison entry.
      This causes the right error handling to be done later when a process runs
      into it.
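      A hedged model of the mechanism (user-space sketch with invented enum
      values): the poison entry is a non-present PTE with a special tag, so
      the next access faults and the error handling runs then.

              #include <stdbool.h>
              #include <stdio.h>

              enum pte_kind { PTE_PRESENT, PTE_SWAP, PTE_HWPOISON };

              struct pte { enum pte_kind kind; };

              /* when unmapping, a poisoned page leaves a poison entry
               * behind instead of a normal swap entry */
              static void unmap_page(struct pte *pte, bool page_poisoned)
              {
                      pte->kind = page_poisoned ? PTE_HWPOISON : PTE_SWAP;
              }

              static void fault(const struct pte *pte)
              {
                      if (pte->kind == PTE_HWPOISON)
                              printf("poison entry hit: run error handling\n");
              }

              int main(void)
              {
                      struct pte pte = { PTE_PRESENT };

                      unmap_page(&pte, true);
                      fault(&pte);
                      return 0;
              }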
      
      v2: add a new flag to not do that (needed for the memory-failure handler
      later) (Fengguang)
      v3: remove unnecessary is_migration_entry() test (Fengguang, Minchan)
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      888b9f7c
    • HWPOISON: Use bitmask/action code for try_to_unmap behaviour · 14fa31b8
      Authored by Andi Kleen
      try_to_unmap currently has multiple modes (migration, munlock, normal
      unmap) which are selected by magic flag variables.  The logic is not
      very straightforward, because each of these flags changes multiple
      behaviours (e.g. migration turns off aging, not only sets up migration
      ptes, etc.).  The different flags also interact in magic ways.

      A later patch in this series adds another mode to try_to_unmap, so
      this would quickly become unmanageable.

      Replace the different flags with an action code (migration, munlock,
      munmap) and some additional flags as modifiers (ignore mlock, ignore
      aging).  This makes the logic more straightforward and allows easier
      extension to new behaviours.  Change all the callers to declare what
      they want to do.
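      A hedged sketch of the resulting interface (the TTU_* names appear in
      the upstream patch; the function body here is a stand-in): low bits
      select exactly one action, high bits are independent modifiers.

              #include <stdio.h>

              enum ttu_flags {
                      TTU_UNMAP       = 0,            /* unmap mode */
                      TTU_MIGRATION   = 1,            /* migration mode */
                      TTU_MUNLOCK     = 2,            /* munlock mode */
                      TTU_ACTION_MASK = 0xff,

                      TTU_IGNORE_MLOCK  = (1 << 8),   /* modifier */
                      TTU_IGNORE_ACCESS = (1 << 9),   /* modifier */
              };

              static void try_to_unmap(enum ttu_flags flags)
              {
                      int action = flags & TTU_ACTION_MASK;

                      printf("action=%d ignore_mlock=%d\n",
                             action, !!(flags & TTU_IGNORE_MLOCK));
              }

              int main(void)
              {
                      /* a caller now declares what it wants to do */
                      try_to_unmap(TTU_MIGRATION | TTU_IGNORE_MLOCK);
                      return 0;
              }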
      
      This patch is supposed to be a nop in behaviour.  If anyone can prove
      it is not, that would be a bug.
      
      Cc: Lee.Schermerhorn@hp.com
      Cc: npiggin@suse.de
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      14fa31b8
    • HWPOISON: Export some rmap vma locking to outside world · 10be22df
      Authored by Andi Kleen
      Needed for later patch that walks rmap entries on its own.
      
      This used to be very frowned upon, but memory-failure.c does
      some rather specialized rmap walking and rmap has been stable
      for quite some time, so I think it's ok now to export it.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      10be22df
  8. August 27, 2009 (1 commit)
  9. June 19, 2009 (1 commit)
    • memcg: add file-based RSS accounting · d69b042f
      Authored by Balbir Singh
      Add file RSS tracking per memory cgroup.

      We currently don't track file RSS; the RSS we report is actually anon
      RSS.  All the file-mapped pages come in through the page cache and get
      accounted there.  This patch adds support for accounting file RSS
      pages.  It should
      
      1. Help improve the metrics reported by the memory resource controller
      2. Will form the basis for a future shared memory accounting heuristic
         that has been proposed by Kamezawa.
      
      Unfortunately, we cannot rename the existing "rss" keyword used in
      memory.stat to "anon_rss".  We do, however, add "mapped_file" data and hope to
      educate the end user through documentation.
      
      [hugh.dickins@tiscali.co.uk: fix mem_cgroup_update_mapped_file_stat oops]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d69b042f
  10. June 17, 2009 (2 commits)
  11. May 22, 2009 (1 commit)
  12. February 12, 2009 (1 commit)
  13. January 7, 2009 (3 commits)
    • badpage: remove vma from page_remove_rmap · edc315fd
      Authored by Hugh Dickins
      Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
      And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
      page_dup_rmap(): we're trying to be more resilient about that than BUGs.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      edc315fd
    • badpage: replace page_remove_rmap Eeek and BUG · 3dc14741
      Authored by Hugh Dickins
      Now that bad pages are kept out of circulation, there is no need for the
      infamous page_remove_rmap() BUG() - once that page is freed, its negative
      mapcount will issue a "Bad page state" message and the page won't be
      freed.  Removing the BUG() allows more info, on subsequent pages, to be
      gathered.
      
      We do have more info about the page at this point than bad_page() can know
      - notably, what the pmd is, which might pinpoint something like low 64kB
      corruption - but page_remove_rmap() isn't given the address to find that.
      
      In practice, there is only one call to page_remove_rmap() which has ever
      reported anything, that from zap_pte_range() (usually on exit, sometimes
      on munmap).  It has all the info, so remove page_remove_rmap()'s "Eeek"
      message and leave it all to zap_pte_range().
      
      mm/memory.c already has a hardly used print_bad_pte() function, showing
      some of the appropriate info: extend it to show what we want for the rmap
      case: pte info, page info (when there is a page) and vma info to compare.
      zap_pte_range() already knows the pmd, but print_bad_pte() is easier to
      use if it works that out for itself.
      
      Some of this info is also shown in bad_page()'s "Bad page state" message.
      Keep them separate, but adjust them to match each other as far as
      possible.  Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE
      there too.
      
      print_bad_pte() shows current->comm unconditionally (though it should get
      repeated in the usually irrelevant stack trace): sorry, I misled Nick
      Piggin to make it conditional on vm_mm == current->mm, but current->mm is
      already NULL in the exit case.  Usually current->comm is good, though
      exceptionally it may not be that of the mm (when "swapoff" for example).
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3dc14741
    • mm: further cleanup page_add_new_anon_rmap · cbf84b7a
      Authored by Hugh Dickins
      Moving lru_cache_add_active_or_unevictable() into page_add_new_anon_rmap()
      was good but stupid: we can and should SetPageSwapBacked() there too; and
      we know for sure that this anonymous, swap-backed page is not file cache.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cbf84b7a