1. 16 12月, 2009 12 次提交
    • H
      ksm: rmap_walk to remove_migation_ptes · e9995ef9
      Hugh Dickins 提交于
      A side-effect of making ksm pages swappable is that they have to be placed
      on the LRUs: which then exposes them to isolate_lru_page() and hence to
      page migration.
      
      Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
      rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c.  Perhaps some
      consolidation with existing code is possible, but don't attempt that yet
      (try_to_unmap needs to handle nonlinears, but migration pte removal does
      not).
      
      rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
      remove_anon_migration_ptes() which it replaces, avoids calling
      page_lock_anon_vma(), because that includes a page_mapped() test which
      fails when all migration ptes are in place.  That was valid when NUMA page
      migration was introduced (holding mmap_sem provided the missing guarantee
      that anon_vma's slab had not already been destroyed), but I believe not
      valid in the memory hotremove case added since.
      
      For now do the same as before, and consider the best way to fix that
      unlikely race later on.  When fixed, we can probably use rmap_walk() on
      hwpoisoned ksm pages too: for now, they remain among hwpoison's various
      exceptions (its PageKsm test comes before the page is locked, but its
      page_lock_anon_vma fails safely if an anon gets upgraded).
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9995ef9
    • H
      ksm: share anon page without allocating · 80e14822
      Hugh Dickins 提交于
      When ksm pages were unswappable, it made no sense to include them in mem
      cgroup accounting; but now that they are swappable (although I see no
      strict logical connection) the principle of least surprise implies that
      they should be accounted (with the usual dissatisfaction, that a shared
      page is accounted to only one of the cgroups using it).
      
      This patch was intended to add mem cgroup accounting where necessary; but
      turned inside out, it now avoids allocating a ksm page, instead upgrading
      an anon page to ksm - which brings its existing mem cgroup accounting with
      it.  Thus mem cgroups don't appear in the patch at all.
      
      This upgrade from PageAnon to PageKsm takes place under page lock (via a
      somewhat hacky NULL kpage interface), and audit showed only one place
      which needed to cope with the race - page_referenced() is sometimes used
      without page lock, so page_lock_anon_vma() needs an ACCESS_ONCE() to be
      sure of getting anon_vma and flags together (no problem if the page goes
      ksm an instant after, the integrity of that anon_vma list is unaffected).
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      80e14822
    • H
      ksm: take keyhole reference to page · 4035c07a
      Hugh Dickins 提交于
      There's a lamentable flaw in KSM swapping: the stable_node holds a
      reference to the ksm page, so the page to be freed cannot actually be
      freed until ksmd works its way around to removing the last rmap_item from
      its stable_node.  Which in some configurations may take minutes: not quite
      responsive enough for memory reclaim.  And we don't want to twist KSM and
      its locking more tightly into the rest of mm.  What a pity.
      
      But although the stable_node needs to hold a pointer to the ksm page, does
      it actually need to raise the reference count of that page?
      
      No.  It would need to do so if struct pages were ordinary kmalloc'ed
      objects; but they are more stable than that, and reused in particular ways
      according to particular rules.
      
      Access to stable_node from its pointer in struct page is no problem, so
      long as we never free a stable_node before the ksm page itself has been
      freed.  Access to struct page from its pointer in stable_node: reintroduce
      get_ksm_page(), and let that peep out through its keyhole (the stable_node
      pointer to ksm page), to see if that struct page still holds the right key
      to open it (the ksm page mapping pointer back to this stable_node).
      
      This relies upon the established way in which free_hot_cold_page() sets an
      anon (including ksm) page->mapping to NULL; and relies upon no other user
      of a struct page to put something which looks like the original
      stable_node pointer (with two low bits also set) into page->mapping.  It
      also needs get_page_unless_zero() technique pioneered by speculative
      pagecache; and uses rcu_read_lock() to keep the guarantees that gives.
      
      There are several drivers which put pointers of their own into page->
      mapping; but none of those could coincide with our stable_node pointers,
      since KSM won't free a stable_node until it sees that the page has gone.
      
      The only problem case found is the pagetable spinlock USE_SPLIT_PTLOCKS
      places in struct page (my own abuse): to accommodate GENERIC_LOCKBREAK's
      break_lock on 32-bit, that spans both page->private and page->mapping.
      Since break_lock is only 0 or 1, again no confusion for get_ksm_page().
      
      But what of DEBUG_SPINLOCK on 64-bit bigendian?  When owner_cpu is 3
      (matching PageKsm low bits), it might see 0xdead4ead00000003 in page->
      mapping, which might coincide?  We could get around that by...  but a
      better answer is to suppress USE_SPLIT_PTLOCKS when DEBUG_SPINLOCK or
      DEBUG_LOCK_ALLOC, to stop bloating sizeof(struct page) in their case -
      already proposed in an earlier mm/Kconfig patch.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4035c07a
    • H
      ksm: hold anon_vma in rmap_item · db114b83
      Hugh Dickins 提交于
      For full functionality, page_referenced_one() and try_to_unmap_one() need
      to know the vma: to pass vma down to arch-dependent flushes, or to observe
      VM_LOCKED or VM_EXEC.  But KSM keeps no record of vma: nor can it, since
      vmas get split and merged without its knowledge.
      
      Instead, note page's anon_vma in its rmap_item when adding to stable tree:
      all the vmas which might map that page are listed by its anon_vma.
      
      page_referenced_ksm() and try_to_unmap_ksm() then traverse the anon_vma,
      first to find the probable vma, that which matches rmap_item's mm; but if
      that is not enough to locate all instances, traverse again to try the
      others.  This catches those occasions when fork has duplicated a pte of a
      ksm page, but ksmd has not yet come around to assign it an rmap_item.
      
      But each rmap_item in the stable tree which refers to an anon_vma needs to
      take a reference to it.  Andrea's anon_vma design cleverly avoided a
      reference count (an anon_vma was free when its list of vmas was empty),
      but KSM now needs to add that.  Is a 32-bit count sufficient?  I believe
      so - the anon_vma is only free when both count is 0 and list is empty.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      db114b83
    • H
      ksm: let shared pages be swappable · 5ad64688
      Hugh Dickins 提交于
      Initial implementation for swapping out KSM's shared pages: add
      page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
      faced with a PageKsm page.
      
      Most of what's needed can be got from the rmap_items listed from the
      stable_node of the ksm page, without discovering the actual vma: so in
      this patch just fake up a struct vma for page_referenced_one() or
      try_to_unmap_one(), then refine that in the next patch.
      
      Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
      implicit there (being only set with VM_SHARED, already excluded), but
      let's make it explicit, to help justify the lack of nonlinear unmap.
      
      Rely on the page lock to protect against concurrent modifications to that
      page's node of the stable tree.
      
      The awkward part is not swapout but swapin: do_swap_page() and
      page_add_anon_rmap() now have to allow for new possibilities - perhaps a
      ksm page still in swapcache, perhaps a swapcache page associated with one
      location in one anon_vma now needed for another location or anon_vma.
      (And the vma might even be no longer VM_MERGEABLE when that happens.)
      
      ksm_might_need_to_copy() checks for that case, and supplies a duplicate
      page when necessary, simply leaving it to a subsequent pass of ksmd to
      rediscover the identity and merge them back into one ksm page.
      Disappointingly primitive: but the alternative would have to accumulate
      unswappable info about the swapped out ksm pages, limiting swappability.
      
      Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
      particular case it was handling, so just use it instead.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ad64688
    • H
      ksm: fix mlockfreed to munlocked · 73848b46
      Hugh Dickins 提交于
      When KSM merges an mlocked page, it has been forgetting to munlock it:
      that's been left to free_page_mlock(), which reports it in /proc/vmstat as
      unevictable_pgs_mlockfreed instead of unevictable_pgs_munlocked (and
      whinges "Page flag mlocked set for process" in mmotm, whereas mainline is
      silently forgiving).  Call munlock_vma_page() to fix that.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73848b46
    • H
      ksm: stable_node point to page and back · 08beca44
      Hugh Dickins 提交于
      Add a pointer to the ksm page into struct stable_node, holding a reference
      to the page while the node exists.  Put a pointer to the stable_node into
      the ksm page's ->mapping.
      
      Then we don't need get_ksm_page() while traversing the stable tree: the
      page to compare against is sure to be present and correct, even if it's no
      longer visible through any of its existing rmap_items.
      
      And we can handle the forked ksm page case more efficiently: no need to
      memcmp our way through the tree to find its match.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08beca44
    • H
      ksm: separate stable_node · 7b6ba2c7
      Hugh Dickins 提交于
      Though we still do well to keep rmap_items in the unstable tree without a
      separate tree_item at the node, for several reasons it becomes awkward to
      keep rmap_items in the stable tree without a separate stable_node: lack of
      space in the nicely-sized rmap_item, the need for an anchor as rmap_items
      are removed, the need for a node even when temporarily no rmap_items are
      attached to it.
      
      So declare struct stable_node (rb_node to place it in the tree and
      hlist_head for the rmap_items hanging off it), and convert stable tree
      handling to use it: without yet taking advantage of it.  Note how one
      stable_tree_insert() of a node now has _two_ stable_tree_append()s of the
      two rmap_items being merged.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b6ba2c7
    • H
      ksm: singly-linked rmap_list · 6514d511
      Hugh Dickins 提交于
      Free up a pointer in struct rmap_item, by making the mm_slot's rmap_list a
      singly-linked list: we always traverse that list sequentially, and we
      don't even lose any prefetches (but should consider adding a few later).
      Name it rmap_list throughout.
      
      Do we need to free up that pointer?  Not immediately, and in the end, we
      could continue to avoid it with a union; but having done the conversion,
      let's keep it this way, since there's no downside, and maybe we'll want
      more in future (struct rmap_item is a cache-friendly 32 bytes on 32-bit
      and 64 bytes on 64-bit, so we shall want to avoid expanding it).
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6514d511
    • H
      ksm: cleanup some function arguments · 8dd3557a
      Hugh Dickins 提交于
      Cleanup: make argument names more consistent from cmp_and_merge_page()
      down to replace_page(), so that it's easier to follow the rmap_item's page
      and the matching tree_page and the merged kpage through that code.
      
      In some places, e.g.  break_cow(), pass rmap_item instead of separate mm
      and address.
      
      cmp_and_merge_page() initialize tree_page to NULL, to avoid a "may be used
      uninitialized" warning seen in one config by Anil SB.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8dd3557a
    • H
      ksm: remove redundancies when merging page · 31e855ea
      Hugh Dickins 提交于
      There is no need for replace_page() to calculate a write-protected prot
      vm_page_prot must already be write-protected for an anonymous page (see
      mm/memory.c do_anonymous_page() for similar reliance on vm_page_prot).
      
      There is no need for try_to_merge_one_page() to get_page and put_page on
      newpage and oldpage: in every case we already hold a reference to each of
      them.
      
      But some instinct makes me move try_to_merge_one_page()'s unlock_page of
      oldpage down after replace_page(): that doesn't increase contention on the
      ksm page, and makes thinking about the transition easier.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31e855ea
    • H
      ksm: three remove_rmap_item_from_tree cleanups · 93d17715
      Hugh Dickins 提交于
      1. remove_rmap_item_from_tree() is called as a precaution from
         various places: don't dirty the rmap_item cacheline unnecessarily,
         just mask the flags out of the address when they have been set.
      
      2. First get_next_rmap_item() removes an unstable rmap_item from its tree,
         then shortly afterwards cmp_and_merge_page() removes a stable rmap_item
         from its tree: it's easier just to do both at once (but definitely keep
         the BUG_ON(age > 1) which guards against a future omission).
      
      3. When cmp_and_merge_page() moves an rmap_item from unstable to stable
         tree, it does its own rb_erase() and accounting: that's better
         expressed by remove_rmap_item_from_tree().
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      93d17715
  2. 10 11月, 2009 1 次提交
  3. 08 10月, 2009 1 次提交
    • H
      ksm: more on default values · c73602ad
      Hugh Dickins 提交于
      Adjust the max_kernel_pages default to a quarter of totalram_pages,
      instead of nr_free_buffer_pages() / 4: the KSM pages themselves come from
      highmem, and even on a 16GB PAE machine, 4GB of KSM pages would only be
      pinning 32MB of lowmem with their rmap_items, so no need for the more
      obscure calculation (nor for its own special init function).
      
      There is no way for the user to switch KSM on if CONFIG_SYSFS is not
      enabled, so in that case default run to KSM_RUN_MERGE.
      
      Update KSM Documentation and Kconfig to reflect the new defaults.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c73602ad
  4. 24 9月, 2009 1 次提交
    • I
      ksm: change default values to better fit into mainline kernel · 2c6854fd
      Izik Eidus 提交于
      Now that ksm is in mainline it is better to change the default values to
      better fit to most of the users.
      
      This patch change the ksm default values to be:
      
      	ksm_thread_pages_to_scan = 100 (instead of 200)
      	ksm_thread_sleep_millisecs = 20 (like before)
      	ksm_run = KSM_RUN_STOP (instead of KSM_RUN_MERGE - meaning ksm is
      	                        disabled by default)
      	ksm_max_kernel_pages = nr_free_buffer_pages / 4 (instead of 2046)
      
      The important aspect of this patch is: it disables ksm by default, and sets
      the number of the kernel_pages that can be allocated to be a reasonable
      number.
      Signed-off-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2c6854fd
  5. 22 9月, 2009 17 次提交
    • H
      ksm: unmerge is an origin of OOMs · 35451bee
      Hugh Dickins 提交于
      Just as the swapoff system call allocates many pages of RAM to various
      processes, perhaps triggering OOM, so "echo 2 >/sys/kernel/mm/ksm/run"
      (unmerge) is liable to allocate many pages of RAM to various processes,
      perhaps triggering OOM; and each is normally run from a modest admin
      process (swapoff or shell), easily repeated until it succeeds.
      
      So treat unmerge_and_remove_all_rmap_items() in the same way that we treat
      try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both
      with that, to ask the OOM killer to kill them first, to prevent them from
      spawning more and more OOM kills.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      35451bee
    • H
      ksm: clean up obsolete references · a913e182
      Hugh Dickins 提交于
      A few cleanups, given the munlock fix: the comment on ksm_test_exit() no
      longer applies, and it can be made private to ksm.c; there's no more
      reference to mmu_gather or tlb.h, and mmap.c doesn't need ksm.h.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a913e182
    • H
      ksm: sysfs and defaults · 2ffd8679
      Hugh Dickins 提交于
      At present KSM is just a waste of space if you don't have CONFIG_SYSFS=y
      to provide the /sys/kernel/mm/ksm files to tune and activate it.
      
      Make KSM depend on SYSFS?  Could do, but it might be better to provide
      some defaults so that KSM works out-of-the-box, ready for testers to
      madvise MADV_MERGEABLE, even without SYSFS.
      
      Though anyone serious is likely to want to retune the numbers to their
      taste once they have experience; and whether these settings ever reach
      2.6.32 can be discussed along the way.
      
      Save 1kB from tiny kernels by #ifdef'ing the SYSFS side of it.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ffd8679
    • A
      ksm: fix deadlock with munlock in exit_mmap · 1c2fb7a4
      Andrea Arcangeli 提交于
      Rawhide users have reported hang at startup when cryptsetup is run: the
      same problem can be simply reproduced by running a program int main() {
      mlockall(MCL_CURRENT | MCL_FUTURE); return 0; }
      
      The problem is that exit_mmap() applies munlock_vma_pages_all() to
      clean up VM_LOCKED areas, and its current implementation (stupidly)
      tries to fault in absent pages, for example where PROT_NONE prevented
      them being faulted in when mlocking.  Whereas the "ksm: fix oom
      deadlock" patch, knowing there's a race by which KSM might try to fault
      in pages after exit_mmap() had finally zapped the range, backs out of
      such faults doing nothing when its ksm_test_exit() notices mm_users 0.
      
      So revert that part of "ksm: fix oom deadlock" which moved the
      ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
      and remove those ksm_test_exit() checks from the page fault paths, so
      allowing the munlocking to proceed without interference.
      
      ksm_exit, if there are rmap_items still chained on this mm slot, takes
      mmap_sem write side: so preventing KSM from working on an mm while
      exit_mmap runs.  And KSM will bail out as soon as it notices that
      mm_users is already zero, thanks to its internal ksm_test_exit checks.
      So that when a task is killed by OOM killer or the user, KSM will not
      indefinitely prevent it from running exit_mmap to release its memory.
      
      This does break a part of what "ksm: fix oom deadlock" was trying to
      achieve.  When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even
      when ksmd itself has to cancel a KSM page, it is possible that the
      first OOM-kill victim would be the KSM process being faulted: then its
      memory won't be freed until a second victim has been selected (freeing
      memory for the unmerging fault to complete).
      
      But the OOM killer is already liable to kill a second victim once the
      intended victim's p->mm goes to NULL: so there's not much point in
      rejecting this KSM patch before fixing that OOM behaviour.  It is very
      much more important to allow KSM users to boot up, than to haggle over
      an unlikely and poorly supported OOM case.
      
      We also intend to fix munlocking to not fault pages: at which point
      this patch _could_ be reverted; though that would be controversial, so
      we hope to find a better solution.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NJustin M. Forbes <jforbes@redhat.com>
      Acked-for-now-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1c2fb7a4
    • H
      ksm: fix oom deadlock · 9ba69294
      Hugh Dickins 提交于
      There's a now-obvious deadlock in KSM's out-of-memory handling:
      imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
      trying to allocate a page to break KSM in an mm which becomes the
      OOM victim (quite likely in the unmerge case): it's killed and goes
      to exit, and hangs there waiting to acquire ksm_thread_mutex.
      
      Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
      though that made everything else: perhaps use mmap_sem somehow?
      And part of the answer lies in the comments on unmerge_ksm_pages:
      __ksm_exit should also leave all the rmap_item removal to ksmd.
      
      But there's a fundamental problem, that KSM relies upon mmap_sem to
      guarantee the consistency of the mm it's dealing with, yet exit_mmap
      tears down an mm without taking mmap_sem.  And bumping mm_users won't
      help at all, that just ensures that the pages the OOM killer assumes
      are on their way to being freed will not be freed.
      
      The best answer seems to be, to move the ksm_exit callout from just
      before exit_mmap, to the middle of exit_mmap: after the mm's pages
      have been freed (if the mmu_gather is flushed), but before its page
      tables and vma structures have been freed; and down_write,up_write
      mmap_sem there to serialize with KSM's own reliance on mmap_sem.
      
      But KSM then needs to be careful, whenever it downs mmap_sem, to
      check that the mm is not already exiting: there's a danger of using
      find_vma on a layout that's being torn apart, or writing into page
      tables which have been freed for reuse; and even do_anonymous_page
      and __do_fault need to check they're not being called by break_ksm
      to reinstate a pte after zap_pte_range has zapped that page table.
      
      Though it might be clearer to add an exiting flag, set while holding
      mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
      a zapped pte.  All we need is to check whether mm_users is 0 - but
      must remember that ksmd may detect that before __ksm_exit is reached.
      So, ksm_test_exit(mm) added to comment such checks on mm->mm_users.
      
      __ksm_exit now has to leave clearing up the rmap_items to ksmd,
      that needs ksm_thread_mutex; but shift the exiting mm just after the
      ksm_scan cursor so that it will soon be dealt with.  __ksm_enter raise
      mm_count to hold the mm_struct, ksmd's exit processing (exactly like
      its processing when it finds all VM_MERGEABLEs unmapped) mmdrop it,
      similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd).
      
      But also give __ksm_exit a fast path: when there's no complication
      (no rmap_items attached to mm and it's not at the ksm_scan cursor),
      it can safely do all the exiting work itself.  This is not just an
      optimization: when ksmd is not running, the raised mm_count would
      otherwise leak mm_structs.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ba69294
    • H
      ksm: distribute remove_mm_from_lists · cd551f97
      Hugh Dickins 提交于
      Do some housekeeping in ksm.c, to help make the next patch easier
      to understand: remove the function remove_mm_from_lists, distributing
      its code to its callsites scan_get_next_rmap_item and __ksm_exit.
      
      That turns out to be a win in scan_get_next_rmap_item: move its
      remove_trailing_rmap_items and cursor advancement up, and it becomes
      simpler than before.  __ksm_exit becomes messier, but will change
      again; and moving its remove_trailing_rmap_items up lets us strengthen
      the unstable tree item's age condition in remove_rmap_item_from_tree.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd551f97
    • H
      ksm: fix endless loop on oom · d952b791
      Hugh Dickins 提交于
      break_ksm has been looping endlessly ignoring VM_FAULT_OOM: that should
      only be a problem for ksmd when a memory control group imposes limits
      (normally the OOM killer will kill others with an mm until it succeeds);
      but in general (especially for MADV_UNMERGEABLE and KSM_RUN_UNMERGE) we
      do need to route the error (or kill) back to the caller (or sighandling).
      
      Test signal_pending in unmerge_ksm_pages, which could be a lengthy
      procedure if it has to spill into swap: returning -ERESTARTSYS so that
      trivial signals will restart but fatals will terminate (is that right?
      we do different things in different places in mm, none exactly this).
      
      unmerge_and_remove_all_rmap_items was forgetting to lock when going
      down the mm_list: fix that.  Whether it's successful or not, reset
      ksm_scan cursor to head; but only if it's successful, reset seqnr
      (shown in full_scans) - page counts will have gone down to zero.
      
      This patch leaves a significant OOM deadlock, but it's a good step
      on the way, and that deadlock is fixed in a subsequent patch.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d952b791
    • H
      ksm: five little cleanups · 81464e30
      Hugh Dickins 提交于
      1. We don't use __break_cow entry point now: merge it into break_cow.
      2. remove_all_slot_rmap_items is just a special case of
         remove_trailing_rmap_items: use the latter instead.
      3. Extend comment on unmerge_ksm_pages and rmap_items.
      4. try_to_merge_two_pages should use try_to_merge_with_ksm_page
         instead of duplicating its code; and so swap them around.
      5. Comment on cmp_and_merge_page described last year's: update it.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81464e30
    • H
      ksm: keep quiet while list empty · 6e158384
      Hugh Dickins 提交于
      ksm_scan_thread already sleeps in wait_event_interruptible until setting
      ksm_run activates it; but if there's nothing on its list to look at, i.e.
      nobody has yet said madvise MADV_MERGEABLE, it's a shame to be clocking
      up system time and full_scans: ksmd_should_run added to check that too.
      
      And move the mutex_lock out around it: the new counts showed that when
      ksm_run is stopped, a little work often got done afterwards, because it
      had been read before taking the mutex.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6e158384
    • H
      ksm: break cow once unshared · 26465d3e
      Hugh Dickins 提交于
      We kept agreeing not to bother about the unswappable shared KSM pages
      which later become unshared by others: observation suggests they're not
      a significant proportion.  But they are disadvantageous, and it is easier
      to break COW to replace them by swappable pages, than offer statistics
      to show that they don't matter; then we can stop worrying about them.
      
      Doing this in ksm_do_scan, they don't go through cmp_and_merge_page on
      this pass: give them a good chance of getting into the unstable tree
      on the next pass, or back into the stable, by computing checksum now.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26465d3e
    • H
      ksm: pages_unshared and pages_volatile · 473b0ce4
      Hugh Dickins 提交于
      The pages_shared and pages_sharing counts give a good picture of how
      successful KSM is at sharing; but no clue to how much wasted work it's
      doing to get there.  Add pages_unshared (count of unique pages waiting
      in the unstable tree, hoping to find a mate) and pages_volatile.
      
      pages_volatile is harder to define.  It includes those pages changing
      too fast to get into the unstable tree, but also whatever other edge
      conditions prevent a page getting into the trees: a high value may
      deserve investigation.  Don't try to calculate it from the various
      conditions: it's the total of rmap_items less those accounted for.
      
      Also show full_scans: the number of completed scans of everything
      registered in the mm list.
      
      The locking for all these counts is simply ksm_thread_mutex.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NIzik Eidus <ieidus@redhat.com>
      Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      473b0ce4
    • H
      ksm: move pages_sharing updates · e178dfde
      Hugh Dickins 提交于
      The pages_shared count is incremented and decremented when adding a node
      to and removing a node from the stable tree: easy to understand.  But the
      pages_sharing count was hard to follow, being adjusted in various places:
      increment and decrement it when adding to and removing from the stable tree.
      
      And the pages_sharing variable used to include the pages_shared, then those
      were subtracted when shown in the pages_sharing sysfs file: now keep it as
      an exclusive count of leaves hanging off the stable tree nodes, throughout.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NIzik Eidus <ieidus@redhat.com>
      Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e178dfde
    • H
      ksm: rename kernel_pages_allocated · b4028260
      Hugh Dickins 提交于
      We're not implementing swapping of KSM pages in its first release;
      but when that follows, "kernel_pages_allocated" will be a very poor
      name for the sysfs file showing number of nodes in the stable tree:
      rename that to "pages_shared" throughout.
      
      But we already have a "pages_shared", counting those page slots
      sharing the shared pages: first rename that to... "pages_sharing".
      
      What will become of "max_kernel_pages" when the pages shared can
      be swapped?  I guess it will just be removed, so keep that name.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NIzik Eidus <ieidus@redhat.com>
      Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4028260
    • I
      ksm: change ksm nice level to be 5 · 339aa624
      Izik Eidus 提交于
      ksm should try not to disturb other tasks as much as possible.
      Signed-off-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      339aa624
    • I
      ksm: change copyright message · 36b2528d
      Izik Eidus 提交于
      Adding Hugh Dickins into the authors list.
      Signed-off-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      36b2528d
    • I
      ksm: Kernel SamePage Merging · 31dbd01f
      Izik Eidus 提交于
      Ksm is code that allows merging of identical pages between one or more
      applications, in a way invisible to the applications that use it.  Pages
      that are merged are marked as read-only, then COWed when any application
      tries to change them.
      
      Whereas fork() allows sharing anonymous pages between parent and child,
      ksm can share anonymous pages between unrelated processes.
      
      Ksm works by walking over the memory pages of the applications it scans,
      in order to find identical pages.  It uses two sorted data structures,
      called the stable and unstable trees, to locate identical pages in an
      effective way.
      
      When ksm finds two identical pages, it marks them as readonly and merges
      them into a single page.  After the pages have been marked as readonly and
      merged into one, Linux treats them as normal copy-on-write pages, copying
      to a fresh anonymous page if write access is required later.
      
      Ksm scans and merges anonymous pages only in those memory areas that have
      been registered with it by madvise(addr, length, MADV_MERGEABLE).
      
      The ksm scanner is controlled by sysfs files in /sys/kernel/mm/ksm/:
      
      max_kernel_pages - the maximum number of unswappable kernel pages
                         which may be allocated by ksm (0 for unlimited).
      
      kernel_pages_allocated - how many ksm pages are currently allocated,
                               sharing identical content between different
                               processes (pages unswappable in this release).
      
      pages_shared - how many pages have been saved by sharing with ksm pages
                     (kernel_pages_allocated being excluded from this count).
      
      pages_to_scan - how many pages ksm should scan before sleeping.
      
      sleep_millisecs - how many milliseconds ksm should sleep between scans.
      
      run - write 0 to disable ksm, read 0 while ksm is disabled (default),
            write 1 to run ksm, read 1 while ksm is running,
            write 2 to disable ksm and unmerge all its pages.
      
      Includes contributions by Andrea Arcangeli Chris Wright and Hugh Dickins.
      
      [hugh.dickins@tiscali.co.uk: fix rare page leak]
      Signed-off-by: NIzik Eidus <ieidus@redhat.com>
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NChris Wright <chrisw@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31dbd01f
    • H
      ksm: the mm interface to ksm · f8af4da3
      Hugh Dickins 提交于
      This patch presents the mm interface to a dummy version of ksm.c, for
      better scrutiny of that interface: the real ksm.c follows later.
      
      When CONFIG_KSM is not set, madvise(2) reject MADV_MERGEABLE and
      MADV_UNMERGEABLE with EINVAL, since that seems more helpful than
      pretending that they can be serviced.  But when CONFIG_KSM=y, accept them
      even if KSM is not currently running, and even on areas which KSM will not
      touch (e.g.  hugetlb or shared file or special driver mappings).
      
      Like other madvices, report ENOMEM despite success if any area in the
      range is unmapped, and use EAGAIN to report out of memory.
      
      Define vma flag VM_MERGEABLE to identify an area on which KSM may try
      merging pages: leave it to ksm_madvise() to decide whether to set it.
      Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain
      VM_MERGEABLE areas, to minimize callouts when forking or exiting.
      
      Based upon earlier patches by Chris Wright and Izik Eidus.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NChris Wright <chrisw@redhat.com>
      Signed-off-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8af4da3