1. 13 May, 2022 14 commits
  2. 10 May, 2022 3 commits
  3. 29 Apr, 2022 5 commits
  4. 23 Mar, 2022 6 commits
    • NUMA balancing: optimize page placement for memory tiering system · c574bbe9
      Committed by Huang Ying
      With the advent of various new memory types, some machines will have
      multiple types of memory, e.g.  DRAM and PMEM (persistent memory).  The
      memory subsystem of these machines can be called a memory tiering system,
      because the performance of the different types of memory usually
      differs.
      
      In such a system, because memory access patterns change over time,
      some pages in the slow memory may become hot globally.  So in this
      patch, the NUMA balancing mechanism is enhanced to dynamically optimize
      page placement among the different memory types according to how hot
      or cold the pages are.
      
      In a typical memory tiering system, there are CPUs, fast memory and slow
      memory in each physical NUMA node.  The CPUs and the fast memory will be
      put in one logical node (called fast memory node), while the slow memory
      will be put in another (faked) logical node (called slow memory node).
      That is, the fast memory is regarded as local while the slow memory is
      regarded as remote.  So it's possible for the recently accessed pages in
      the slow memory node to be promoted to the fast memory node via the
      existing NUMA balancing mechanism.
      
      The original NUMA balancing mechanism stops migrating pages if the
      free memory of the target node falls below the high watermark.  This
      is a reasonable policy if there's only one memory type.  But it makes
      the original NUMA balancing mechanism almost useless for optimizing
      page placement among different memory types.  Details are as follows.
      
      It is commonly the case that the working-set size of the workload is
      larger than the size of the fast memory nodes.  Otherwise, it's
      unnecessary to use the slow memory at all.  So there are almost never
      enough free pages in the fast memory nodes, and the globally hot
      pages in the slow memory node cannot be promoted to the fast memory
      node.  To solve the issue, we have 2 choices, as follows:
      
      a. Ignore the free pages watermark checking when promoting hot pages
         from the slow memory node to the fast memory node.  This will
         create some memory pressure in the fast memory node and thus trigger
         memory reclaim, so that the cold pages in the fast memory
         node will be demoted to the slow memory node.
      
      b. Define a new watermark called wmark_promo, which is higher than
         wmark_high, and have kswapd reclaim pages until free pages reach
         that watermark.  The scenario is as follows: when we want to promote
         hot pages from slow memory to fast memory, but the fast memory's free
         pages would drop below the high watermark with such a promotion, we
         wake up kswapd with the wmark_promo watermark in order to demote cold
         pages and free up some space.  So the next time we want to promote
         hot pages, we might have a chance of doing so.
      
      Choice "a" may create high memory pressure in the fast memory node.
      If the memory pressure of the workload is high, the memory pressure
      may become so high that the memory allocation latency of the workload
      is affected, e.g. direct reclaim may be triggered.
      
      Choice "b" works much better in this respect.  If the memory
      pressure of the workload is high, hot page promotion will stop
      earlier because its allocation watermark is higher than that of
      normal memory allocation.  So in this patch, choice "b" is implemented.
      A new zone watermark (WMARK_PROMO) is added, which is larger than the
      high watermark and can be controlled via watermark_scale_factor.
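
The promotion policy of choice "b" can be pictured with a toy user-space model.  This is only a sketch under stated assumptions: struct zone_model, try_promote() and kswapd_run() are hypothetical names, not kernel functions, and reclaim is reduced to a single step that restores free pages to the promo watermark.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of a fast-memory zone (not the kernel's struct zone). */
struct zone_model {
    unsigned long free_pages;
    unsigned long wmark_high;
    unsigned long wmark_promo;   /* must be higher than wmark_high */
    bool kswapd_awake;
};

/*
 * Promote nr_pages only if free pages would stay at or above the high
 * watermark afterwards.  Otherwise refuse and wake the (modelled) kswapd.
 */
static bool try_promote(struct zone_model *z, unsigned long nr_pages)
{
    assert(z->wmark_promo > z->wmark_high);

    if (z->free_pages >= nr_pages &&
        z->free_pages - nr_pages >= z->wmark_high) {
        z->free_pages -= nr_pages;
        return true;
    }
    z->kswapd_awake = true;
    return false;
}

/* Modelled kswapd: demote cold pages until free pages reach wmark_promo. */
static void kswapd_run(struct zone_model *z)
{
    if (!z->kswapd_awake)
        return;
    if (z->free_pages < z->wmark_promo)
        z->free_pages = z->wmark_promo;
    z->kswapd_awake = false;
}
```

In this model a refused promotion costs nothing except a wakeup, and the headroom between wmark_high and wmark_promo is what lets a later promotion attempt succeed.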
      
      In addition to the original page placement optimization among sockets,
      the NUMA balancing mechanism is extended to optimize page
      placement according to hot/cold among different memory types.  So the
      sysctl user space interface (numa_balancing) is extended in a backward
      compatible way as follows, so that users can enable/disable each
      functionality individually.
      
      The sysctl is converted from a Boolean value to a bit field.  The
      flags are defined as follows:
      
      - 0: NUMA_BALANCING_DISABLED
      - 1: NUMA_BALANCING_NORMAL
      - 2: NUMA_BALANCING_MEMORY_TIERING
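
As an illustration of how such a bit field can be tested, here is a minimal sketch.  The three flag values come from the list above; the helper numa_balancing_mode_enabled() is a hypothetical name introduced for this example and is not claimed to exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as listed in the commit message. */
#define NUMA_BALANCING_DISABLED        0x0
#define NUMA_BALANCING_NORMAL          0x1
#define NUMA_BALANCING_MEMORY_TIERING  0x2

/* Hypothetical helper: is the given mode bit set in the sysctl value? */
static bool numa_balancing_mode_enabled(int sysctl_value, int mode)
{
    /* DISABLED means "no bits set", so it is not a testable bit itself. */
    assert(mode == NUMA_BALANCING_NORMAL ||
           mode == NUMA_BALANCING_MEMORY_TIERING);
    return (sysctl_value & mode) != 0;
}
```

Because the old Boolean value 1 has the same encoding as NUMA_BALANCING_NORMAL, a configuration written for the Boolean interface keeps its meaning, while a value of 3 would turn on both bits.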
      
      We have tested the patch with the pmbench memory accessing benchmark
      with an 80:20 read/write ratio and the Gauss access address
      distribution on a 2-socket Intel server with Optane DC Persistent
      Memory Model.  The test results show that the pmbench score can
      improve by up to 95.9%.
      
      Thanks to Andrew Morton for helping fix the document format error.
      
      Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: fix documentation for page_check_references() · 96bd3e79
      Committed by Charan Teja Kalla
      Commit b518154e ("mm/vmscan: protect the workingset on anonymous
      LRU") requires looking twice to see whether mapped anon/file pages are
      used more than once before deciding between reclaim and activation.
      Correct the documentation accordingly.
      
      Link: https://lkml.kernel.org/r/1646925640-21324-1-git-send-email-quic_charante@quicinc.com
      Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: __isolate_lru_page_prepare() in isolate_migratepages_block() · 89f6c88a
      Committed by Hugh Dickins
      __isolate_lru_page_prepare() conflates two unrelated functions, with the
      flags to one disjoint from the flags to the other; and hides some of the
      important checks outside of isolate_migratepages_block(), where the
      sequence is better kept visible.  It comes from the days of lumpy
      reclaim, before compaction, when the combination made more sense.
      
      Move what's needed by mm/compaction.c isolate_migratepages_block() inline
      there, and what's needed by mm/vmscan.c isolate_lru_pages() inline there.
      
      Shorten "isolate_mode" to "mode", so the sequence of conditions is easier
      to read.  Declare a "mapping" variable, to save one call to page_mapping()
      (but not another: calling again after page is locked is necessary).
      Simplify isolate_lru_pages() with a "move_to" list pointer.
      
      Link: https://lkml.kernel.org/r/879d62a8-91cc-d3c6-fb3b-69768236df68@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Alex Shi <alexs@kernel.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/fs: delete PF_SWAPWRITE · b698f0a1
      Committed by Hugh Dickins
      PF_SWAPWRITE has been redundant since v3.2 commit ee72886d ("mm:
      vmscan: do not writeback filesystem pages in direct reclaim").
      
      Coincidentally, NeilBrown's current patch "remove inode_congested()"
      deletes may_write_to_inode(), which appeared to be the one function which
      took notice of PF_SWAPWRITE.  But if you study the old logic, and the
      conditions under which may_write_to_inode() was called, you discover that
      flag and function have been pointless for a decade.
      
      Link: https://lkml.kernel.org/r/75e80e7-742d-e3bd-531-614db8961e4@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Jan Kara <jack@suse.de>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • remove bdi_congested() and wb_congested() and related functions · b9b1335e
      Committed by NeilBrown
      These functions are no longer useful as no BDIs report congestion any
      more.
      
      Removing the test on bdi_write_congested() in current_may_throttle()
      could cause a small change in behaviour, but only when PF_LOCAL_THROTTLE
      is set.
      
      So replace the calls with 'false', simplify the code, and remove the
      functions.
      
      [akpm@linux-foundation.org: fix build]
      
      Link: https://lkml.kernel.org/r/164549983742.9187.2570198746005819592.stgit@noble.brown
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>	[nilfs]
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Paolo Valente <paolo.valente@linaro.org>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • remove inode_congested() · fe55d563
      Committed by NeilBrown
      inode_congested() reports if the backing-device for the inode is
      congested.  No bdi reports congestion any more, so this always returns
      'false'.
      
      So remove inode_congested() and related functions, and remove the call
      sites, assuming that inode_congested() always returns 'false'.
      
      Link: https://lkml.kernel.org/r/164549983741.9187.2174285592262191311.stgit@noble.brown
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Paolo Valente <paolo.valente@linaro.org>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 22 Mar, 2022 12 commits