1. 28 Apr 2018 (2 commits)
  2. 17 Apr 2018 (2 commits)
  3. 07 Jul 2017 (1 commit)
    • ksm: introduce ksm_max_page_sharing per page deduplication limit · 2c653d0e
      Authored by Andrea Arcangeli
      Without a max deduplication limit for each KSM page, the list of
      rmap_items associated with each stable_node can grow without bound.
      
      During the rmap walk each entry can take up to ~10usec to process
      because of IPIs for the TLB flushing (both for the primary MMU and the
      secondary MMUs with the MMU notifier).  With only 16GB of address space
      shared in the same KSM page, that would amount to dozens of seconds of
      kernel runtime.
      
      A ~256 max deduplication factor will reduce the latencies of the
      rmap walks on KSM pages to the order of a few msec.  Just doing the
      cond_resched() during the rmap walks is not enough; the list size
      must have a limit too, otherwise the caller could get blocked in
      (schedule-friendly) kernel computations for seconds, unexpectedly.
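
      (For scale, assuming 4KiB pages: 16GB shared in a single KSM page
      is ~4 million rmap_items, and ~4M x ~10usec is ~40 seconds per
      walk, while a cap of 256 sharers bounds it to roughly
      256 x 10usec = ~2.5 msec.)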
      
      There's room for optimization to significantly reduce the IPI
      delivery cost during page_referenced(), but at least for
      page_migration in the KSM case (used by hard NUMA bindings,
      compaction and NUMA balancing) it may be inevitable to send lots
      of IPIs if each rmap_item->mm is active on a different CPU and
      there are lots of CPUs.  Even if we ignore the IPI delivery cost,
      we still have to walk the whole KSM rmap list, so we can't allow
      an unlimited number (millions or billions) of entries in the KSM
      stable_node rmap_item lists.
      
      The limit is enforced efficiently by adding a second dimension to the
      stable rbtree.  So there are three types of stable_nodes: the regular
      ones (identical as before, living in the first flat dimension of the
      stable rbtree), the "chains" and the "dups".
      
      Every "chain" and all "dups" linked into a "chain" enforce the invariant
      that they represent the same write protected memory content, even if
      each "dup" will be pointed by a different KSM page copy of that content.
      This way the stable rbtree lookup computational complexity is unaffected
      if compared to an unlimited max_sharing_limit.  It is still enforced
      that there cannot be KSM page content duplicates in the stable rbtree
      itself.
      
      Adding the second dimension to the stable rbtree only after the
      max_page_sharing limit is hit provides for a zero memory footprint
      increase on 64bit archs.  The memory overhead of the per-KSM page
      stable_tree and per virtual mapping rmap_item is unchanged.  Only
      after the max_page_sharing limit is hit do we need to allocate a
      stable_tree "chain" and rb_replace() the "regular" stable_node
      with the newly allocated stable_node "chain".  After that we
      simply add the "regular" stable_node to the chain as a stable_node
      "dup" by linking hlist_dup in the stable_node_chain->hlist.  This
      way the "regular" (flat) stable_node is converted to a stable_node
      "dup" living in the second dimension of the stable rbtree.
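
      A minimal sketch of that conversion, using the names described
      above (alloc_stable_node_chain() and the exact field layout here
      are assumptions for illustration, not necessarily the literal
      mm/ksm.c code):

        static struct stable_node *alloc_stable_node_chain(
                        struct stable_node *dup, struct rb_root *root)
        {
                struct stable_node *chain = alloc_stable_node();

                if (chain) {
                        INIT_HLIST_HEAD(&chain->hlist);
                        /* mark it as a "chain", see is_stable_node_chain() */
                        chain->rmap_hlist_len = STABLE_NODE_CHAIN;
                        /* take the dup's place in the flat dimension */
                        rb_replace_node(&dup->node, &chain->node, root);
                        /* demote the old "regular" node to a "dup" */
                        stable_node_chain_add_dup(dup, chain);
                }
                return chain;
        }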
      
      During stable rbtree lookups the stable_node "chain" is identified as
      stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
      is_stable_node_chain()).
      
      When dropping stable_nodes, the stable_node "dup" is identified as
      stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).
      
      The STABLE_NODE_DUP_HEAD must be a unique valid pointer never used
      elsewhere in any stable_node->head/node to avoid clashes with the
      stable_node->node.rb_parent_color pointer, and different from
      &migrate_nodes.  So the second field of &migrate_nodes is picked
      and verified as always safe with a BUILD_BUG_ON in case the
      list_head implementation changes in the future.
      
      STABLE_NODE_CHAIN is picked as an arbitrary negative value in
      stable_node->rmap_hlist_len.  rmap_hlist_len cannot become
      negative when it's a "regular" stable_node or a stable_node "dup",
      so the sentinel is unambiguous.
      
      The stable_node_chain->nid is irrelevant.  The stable_node_chain->kpfn
      is aliased in a union with a time field used to rate limit the
      stable_node_chain->hlist prunes.
      
      The garbage collection of the stable_node_chain happens lazily
      during stable rbtree lookups (as for all other kinds of
      stable_nodes), or when disabling KSM with
      "echo 2 >/sys/kernel/mm/ksm/run", which collects the entire
      stable rbtree.
      
      While the "regular" stable_nodes and the stable_node "dups" must wait
      for their underlying tree_page to be freed before they can be freed
      themselves, the stable_node "chains" can be freed immediately if the
      stable_node->hlist turns empty.  This is because the "chains" are never
      pointed by any page->mapping and they're effectively stable rbtree KSM
      self contained metadata.
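
      The limit is runtime tunable; a small userspace sketch that reads
      the knob (the /sys/kernel/mm/ksm/max_page_sharing path follows the
      other KSM knobs mentioned above; adjust if your kernel differs):

        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/sys/kernel/mm/ksm/max_page_sharing", "r");
                int max;

                if (!f || fscanf(f, "%d", &max) != 1)
                        return 1;   /* kernel without this commit */
                printf("max_page_sharing: %d\n", max);
                fclose(f);
                return 0;
        }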
      
      [akpm@linux-foundation.org: fix non-NUMA build]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Petr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Evgheni Dereveanchin <ederevea@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c653d0e
  4. 25 Feb 2017 (2 commits)
    • mm, madvise: fail with ENOMEM when splitting vma will hit max_map_count · def5efe0
      Authored by David Rientjes
      If madvise(2) advice will result in the underlying vma being split and
      the number of areas mapped by the process will exceed
      /proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN.
      
      EAGAIN is returned by madvise(2) when a kernel resource, such as
      slab, is temporarily unavailable.  It indicates that userspace
      should retry the advice in the near future.  This is important for
      advice such as MADV_DONTNEED, which is often used by malloc
      implementations to free memory back to the system: we really do
      want to retry and free memory back when madvise(2) returns EAGAIN,
      because it means slab objects (for vmas, anon_vmas, or
      mempolicies) could not be allocated.
      
      Hitting the /proc/sys/vm/max_map_count limit is not a temporary
      failure, however, so return ENOMEM to indicate this more serious
      issue.  A followup patch to the man page will specify this
      behavior.
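
      For userspace the distinction could be handled as sketched below
      (free_back() is a hypothetical helper, not part of this patch):

        #include <errno.h>
        #include <sys/mman.h>

        static int free_back(void *addr, size_t len)
        {
                while (madvise(addr, len, MADV_DONTNEED) != 0) {
                        if (errno == EAGAIN)
                                continue;   /* transient, e.g. slab
                                               pressure: retry */
                        return -1;          /* ENOMEM: vma split would
                                               exceed max_map_count,
                                               don't retry */
                }
                return 0;
        }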
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      def5efe0
    • mm/ksm: improve deduplication of zero pages with colouring · e86c59b1
      Authored by Claudio Imbrenda
      Some architectures have a set of zero pages (coloured zero pages)
      instead of only one zero page, in order to improve the cache
      performance.  In those cases, the kernel samepage merger (KSM) would
      merge all the allocated pages that happen to be filled with zeroes to
      the same deduplicated page, thus losing all the advantages of coloured
      zero pages.
      
      This behaviour is noticeable when a process accesses large arrays of
      allocated pages containing zeroes.  A test I conducted on s390 shows
      that there is a speed penalty when KSM merges such pages, compared to
      not merging them or using actual zero pages from the start without
      breaking the COW.
      
      This patch fixes this behaviour.  When coloured zero pages are
      present, the checksum of a zero page is calculated during
      initialisation, and compared with the checksum of the current
      candidate during merging.  In case of a match, the normal merging
      routine is used to merge the page with the correct coloured zero
      page, which also verifies that the candidate page really is equal
      to the target zero page.
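
      Schematically, in the merge path (a sketch based on the
      description above; the locking and error handling of the real
      cmp_and_merge_page() are omitted, and ZERO_PAGE() picks the
      correctly coloured zero page for a given address):

        if (ksm_use_zero_pages && checksum == zero_checksum) {
                /* candidate looks zero-filled: merge with the coloured
                 * zero page instead of another anonymous page; the
                 * merge routine re-verifies the contents really match */
                err = try_to_merge_one_page(vma, page,
                                ZERO_PAGE(rmap_item->address));
                if (!err)
                        return;     /* merged, nothing to insert */
        }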
      
      A sysfs entry is also added to toggle this behaviour, since it can
      potentially introduce performance regressions, especially on
      architectures without coloured zero pages.  It defaults to
      disabled, for backwards compatibility.
      
      With this patch, the performance with KSM is the same as with
      non-COW-broken actual zero pages, which is also the same as
      without KSM.
      
      [akpm@linux-foundation.org: make zero_checksum and ksm_use_zero_pages __read_mostly, per Andrea]
      [imbrenda@linux.vnet.ibm.com: documentation for coloured zero pages deduplication]
        Link: http://lkml.kernel.org/r/1484927522-1964-1-git-send-email-imbrenda@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/1484850953-23941-1-git-send-email-imbrenda@linux.vnet.ibm.com
      Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e86c59b1
  5. 24 Feb 2013 (2 commits)
    • ksm: add some comments · 8fdb3dbf
      Authored by Hugh Dickins
      Added slightly more detail to the Documentation of merge_across_nodes, a
      few comments in areas indicated by review, and renamed get_ksm_page()'s
      argument from "locked" to "lock_it".  No functional change.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8fdb3dbf
    • ksm: allow trees per NUMA node · 90bd6fd3
      Authored by Petr Holasek
      Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with
      Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues
      we had with that, fully enabling KSM page migration on the way.
      
      (A different kind of KSM/NUMA issue which I've certainly not begun to
      address here: when KSM pages are unmerged, there's usually no sense in
      preferring to allocate the new pages local to the caller's node.)
      
      This patch:
      
      Introduces a new sysfs boolean knob,
      /sys/kernel/mm/ksm/merge_across_nodes, which controls merging of
      pages across different NUMA nodes.  When it is set to zero, only
      pages from the same node are merged; otherwise pages from all
      nodes can be merged together (the default behavior).
      
      A typical use case would be many KVM guests on a NUMA machine,
      where CPUs on more distant nodes would see a significant increase
      in access latency to the merged KSM page.  A sysfs knob was chosen
      for flexibility, since some users may still prefer a higher amount
      of saved physical memory regardless of access latency.

      Every NUMA node gets its own stable and unstable trees, for faster
      searching and inserting.  Changing the merge_across_nodes value is
      possible only when there are no KSM shared pages in the system.
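
      Tree selection then becomes a per-node lookup along these lines
      (identifier names are illustrative, not necessarily the patch's):

        static struct rb_root *stable_tree_root(struct page *page)
        {
                /* one tree for all nodes, or one tree per NUMA node */
                int nid = ksm_merge_across_nodes ? 0 : page_to_nid(page);

                return &root_stable_tree[nid];
        }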
      
      I've tested this patch on numa machines with 2, 4 and 8 nodes and
      measured speed of memory access inside of KVM guests with memory pinned
      to one of nodes with this benchmark:
      
        http://pholasek.fedorapeople.org/alloc_pg.c
      
      Population standard deviations of access times, as a percentage of
      the average, were as follows:

      merge_across_nodes=1
        2 nodes   1.4%
        4 nodes   1.6%
        8 nodes   1.7%

      merge_across_nodes=0
        2 nodes   1%
        4 nodes   0.32%
        8 nodes   0.018%
      
      RFC: https://lkml.org/lkml/2011/11/30/91
      v1: https://lkml.org/lkml/2012/1/23/46
      v2: https://lkml.org/lkml/2012/6/29/105
      v3: https://lkml.org/lkml/2012/9/14/550
      v4: https://lkml.org/lkml/2012/9/23/137
      v5: https://lkml.org/lkml/2012/12/10/540
      v6: https://lkml.org/lkml/2012/12/23/154
      v7: https://lkml.org/lkml/2012/12/27/225
      
      Hugh notes that this patch brings two problems, whose solution needs
      further support in mm/ksm.c, which follows in subsequent patches:
      
      1) switching merge_across_nodes after running KSM is liable to oops
         on stale nodes still left over from the previous stable tree;
      
      2) memory hotremove may migrate KSM pages, but there is no provision
         here for !merge_across_nodes to migrate nodes to the proper tree.
      Signed-off-by: Petr Holasek <pholasek@redhat.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90bd6fd3
  6. 16 Dec 2009 (1 commit)
  7. 08 Oct 2009 (1 commit)
    • ksm: more on default values · c73602ad
      Authored by Hugh Dickins
      Adjust the max_kernel_pages default to a quarter of totalram_pages,
      instead of nr_free_buffer_pages() / 4: the KSM pages themselves come from
      highmem, and even on a 16GB PAE machine, 4GB of KSM pages would only be
      pinning 32MB of lowmem with their rmap_items, so no need for the more
      obscure calculation (nor for its own special init function).
      
      There is no way for the user to switch KSM on if CONFIG_SYSFS is
      not enabled, so in that case default "run" to KSM_RUN_MERGE.
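
      The CONFIG_SYSFS fallback reduces to something like this (a sketch
      of the described defaults, not the verbatim diff):

        #ifdef CONFIG_SYSFS
        static unsigned int ksm_run = KSM_RUN_STOP;
        #else
        static unsigned int ksm_run = KSM_RUN_MERGE;  /* no sysfs means
                                                         no way to
                                                         switch it on */
        #endif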
      
      Update KSM Documentation and Kconfig to reflect the new defaults.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c73602ad
  8. 22 Sep 2009 (1 commit)