1. 06 May, 2021 3 commits
    • Y
      mm: vmscan: use nid from shrink_control for tracepoint · 8efb4b59
      Yang Shi authored
      Patch series "Make shrinker's nr_deferred memcg aware", v10.
      
      Recently a huge one-off slab drop was seen on some vfs metadata heavy
      workloads.  It turned out that the shrinker was seeing a huge number of
      accumulated nr_deferred objects.

      On our production machine, I saw an absurd number of nr_deferred, as
      shown in the tracing result below:
      
        <...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
        super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
        2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
        9300 cache items 1667 delta 11 total_scan 833
      
      There are 2.5 trillion deferred objects on one node.  Assuming all of
      them are dentries (192 bytes per object), the total size of deferred
      objects on one node is ~480TB, which is definitely ridiculous.
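The ~480TB figure follows directly from the trace. A quick check, assuming 192 bytes per dentry as stated above and decimal terabytes (the variable names are illustrative only):

```python
# Object count taken from the mm_shrink_slab_start trace above.
deferred_objects = 2_531_805_877_005
# Per-object size assumed in the commit message (one dentry).
dentry_size_bytes = 192

total_bytes = deferred_objects * dentry_size_bytes
total_tb = total_bytes / 10**12  # decimal terabytes

print(f"{total_tb:.0f} TB of deferred dentries on one node")
```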
      
      I managed to reproduce this problem with a kernel build workload plus a
      negative dentry generator.

      First, run the kernel build test script below:
      
      NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
      
      cd /root/Buildarea/linux-stable
      
      for i in `seq 1500`; do
              cgcreate -g memory:kern_build
              echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes
      
              echo 3 > /proc/sys/vm/drop_caches
              cgexec -g memory:kern_build make clean > /dev/null 2>&1
              cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1
      
              cgdelete -g memory:kern_build
      done
      
      Then run the below negative dentry generator script:
      
      NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
      
      mkdir /sys/fs/cgroup/memory/test
      echo $$ > /sys/fs/cgroup/memory/test/tasks
      
      for i in `seq $NR_CPUS`; do
              while true; do
                      FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
                      cat $FILE 2>/dev/null
              done &
      done
      
      Then kswapd will shrink half of the dentry cache in just one loop, as
      the tracing result below shows:
      
      	kswapd0-475   [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
      	kswapd0-475   [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928
      
      There was a huge number of deferred objects before the shrinker was
      called.  The behavior does match the code, but it might not be desirable
      from the user's standpoint.
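The numbers in the two trace lines above are mutually consistent and can be replayed. The sketch below approximates the scan-count arithmetic of do_shrink_slab() in kernels of that period; the clamping rules and the helper name `shrink_once` are assumptions made for illustration, not a copy of the kernel code.

```python
BATCH_SIZE = 1024  # scan batch for vfs metadata, per the commit message


def shrink_once(nr_deferred, freeable, delta, batch=BATCH_SIZE):
    """Model one shrinker call; return (total_scan, scanned, new_deferred)."""
    wanted = nr_deferred + delta
    # Assumed clamp 1: never plan to scan more than twice the cache.
    total_scan = min(wanted, 2 * freeable)
    # Assumed clamp 2: when the new work (delta) is small, scan at most
    # half of the cache in a single call.
    if delta < freeable // 4:
        total_scan = min(total_scan, freeable // 2)
    # Objects are scanned in whole batches only.
    scanned = (total_scan // batch) * batch
    # Whatever was wanted but not scanned is carried over.
    new_deferred = max(wanted - scanned, 0)
    return total_scan, scanned, new_deferred


# Values from the kswapd0 trace: "objects to shrink", "cache items", "delta".
total_scan, scanned, new_deferred = shrink_once(
    nr_deferred=4_994_376_020, freeable=93_689_873, delta=45_746
)
print(total_scan, scanned, new_deferred)
```

Under these assumptions the model yields the trace's total_scan, return value and "new scan count", which shows that a single call drops half the cache while barely denting the deferred backlog.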
      
      The excessive amount of nr_deferred might be accumulated due to various
      reasons, for example:
      
      * GFP_NOFS allocation
      
      * A large number of small scans (< scan_batch, 1024 for vfs
        metadata)

      However, the LRUs of slabs are per memcg (memcg-aware shrinkers) but the
      deferred objects are per shrinker.  This may have some bad effects:
      
      * Poor isolation among memcgs.  Some memcgs which happen to have
        frequent limit reclaim may get nr_deferred accumulated to a huge number,
        then other innocent memcgs may take the fall.  In our case the main
        workload was hit.
      
      * Unbounded deferred objects.  There is no cap on deferred objects, so
        the count can grow ridiculously large, as the tracing result showed.

      * Easy to get out of control.  Although shrinkers take deferred objects
        into account, the count can go out of control easily.  One
        misconfigured memcg could incur an absurd amount of deferred objects
        in a period of time.

      * Assorted reclaim problems, e.g. over-reclaim, long reclaim latency,
        etc.  There may be hundred-GB slab caches for vfs metadata heavy
        workloads, and shrinking half of them may take minutes.  We observed
        latency spikes due to the prolonged reclaim.
      
      These issues also have been discussed in
      https://lore.kernel.org/linux-mm/20200916185823.5347-1-shy828301@gmail.com/.
      The patchset is the outcome of that discussion.
      
      So this patchset makes nr_deferred per-memcg to tackle the problem.  It
      does:
      
      * Have memcg_shrinker_deferred per memcg per node, just like
        shrinker_map, except that it is an atomic_long_t array.  Each element
        represents one shrinker, even if the shrinker is not memcg aware;
        this simplifies the implementation.  For memcg aware shrinkers, the
        deferred objects are accumulated to their own memcg only, and the
        shrinkers see only the nr_deferred of their own memcg.  Non memcg
        aware shrinkers still use the global nr_deferred from struct shrinker.
      
      * Once the memcg is offlined, its nr_deferred will be reparented to its
        parent along with LRUs.
      
      * The root memcg has memcg_shrinker_deferred array too.  It simplifies
        the handling of reparenting to root memcg.
      
      * Cap nr_deferred to 2x of the length of lru.  The idea is borrowed from
        Dave Chinner's series
        (https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@fromorbit.com/)
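The last bullet can be illustrated with a one-line clamp. This is only a sketch of the idea: the helper name `cap_deferred` is hypothetical, and exactly where in the reclaim path the series applies the cap is not shown in this message.

```python
def cap_deferred(nr_deferred, lru_len):
    """Clamp carried-over deferred work to twice the LRU length."""
    return min(nr_deferred, 2 * lru_len)


# With the values from the kswapd trace above, the backlog of roughly
# 5 billion objects would be cut to twice the 93,689,873 cached items.
capped = cap_deferred(4_994_376_020, 93_689_873)
print(capped)

# A backlog already below the cap is left untouched.
print(cap_deferred(10, 100))
```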
      
      The downside is each memcg has to allocate extra memory to store the
      nr_deferred array.  On our production environment, there are typically
      around 40 shrinkers, so each memcg needs ~320 bytes.  10K memcgs would
      need ~3.2MB memory.  It seems fine.
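The overhead estimate can be reproduced as follows. The sketch assumes a 64-bit kernel, where an atomic_long_t occupies 8 bytes, and a single node; since the array is kept per memcg per node, multi-node machines would scale the result by the node count.

```python
ATOMIC_LONG_BYTES = 8  # assumption: 64-bit kernel


def deferred_array_bytes(nr_shrinkers, nr_memcgs=1, nr_nodes=1):
    """Bytes needed for the nr_deferred arrays (one slot per shrinker)."""
    return nr_shrinkers * ATOMIC_LONG_BYTES * nr_memcgs * nr_nodes


per_memcg = deferred_array_bytes(40)
fleet = deferred_array_bytes(40, nr_memcgs=10_000)

print(f"{per_memcg} bytes per memcg, {fleet / 10**6:.1f} MB for 10K memcgs")
```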
      
      We have been running the patched kernel on some hosts of our fleet (test
      and production) for months, and it works very well.  The monitoring data
      shows the working set is sustained as expected.
      
      This patch (of 13):
      
      The tracepoint's nid should show which node the shrink happens on.  The
      start tracepoint uses the nid from shrinkctl, but the nid might be set
      to 0 before the end tracepoint if the shrinker is not NUMA aware, so the
      tracing log may show the shrink starting on one node but ending on
      another, which is confusing.  A following patch will also remove the
      direct use of nid in do_shrink_slab(), so this patch helps clean up the
      code as well.
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-1-shy828301@gmail.com
      Link: https://lkml.kernel.org/r/20210311190845.9708-2-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8efb4b59
    • D
      mm/vmscan: move RECLAIM* bits to uapi header · b6676de8
      Dave Hansen authored
      It is currently not obvious that the RECLAIM_* bits are part of the uapi
      since they are defined in vmscan.c.  Move them to a uapi header to make it
      obvious.
      
      This should have no functional impact.
      
      Link: https://lkml.kernel.org/r/20210219172557.08074910@viggo.jf.intel.com
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Daniel Wagner <dwagner@suse.de>
      Cc: "Tobin C. Harding" <tobin@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6676de8
    • O
      mm: make alloc_contig_range handle in-use hugetlb pages · ae37c7ff
      Oscar Salvador authored
      alloc_contig_range() will fail if it finds a HugeTLB page within the
      range, without a chance to handle it.  Since HugeTLB pages can be
      migrated like any LRU or Movable page, it does not make sense to bail
      out without trying.  Enable the interface to recognize in-use HugeTLB
      pages so we can migrate them, which gives the call a much better chance
      to succeed.
      
      Link: https://lkml.kernel.org/r/20210419075413.1064-7-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae37c7ff
  2. 25 Feb, 2021 8 commits
  3. 18 Jan, 2021 1 commit
    • L
      mm: don't put pinned pages into the swap cache · feb889fb
      Linus Torvalds authored
      So technically there is nothing wrong with adding a pinned page to the
      swap cache, but the pinning obviously means that the page can't actually
      be free'd right now anyway, so it's a bit pointless.
      
      However, the real problem is not with it being a bit pointless: the real
      issue is that after we've added it to the swap cache, we'll try to unmap
      the page.  That will succeed, because the code in mm/rmap.c doesn't know
      or care about pinned pages.
      
      Even the unmapping isn't fatal per se, since the page will stay around
      in memory due to the pinning, and we do hold the connection to it using
      the swap cache.  But when we then touch it next and take a page fault,
      the logic in do_swap_page() will map it back into the process as a
      possibly read-only page, and we'll then break the page association on
      the next COW fault.
      
      Honestly, this issue could have been fixed in any of those other places:
      (a) we could refuse to unmap a pinned page (which makes conceptual
      sense), or (b) we could make sure to re-map a pinned page writably in
      do_swap_page(), or (c) we could just make do_wp_page() not COW the
      pinned page (which was what we historically did before that "mm:
      do_wp_page() simplification" commit).
      
      But while all of them are equally valid models for breaking this chain,
      not putting pinned pages into the swap cache in the first place is the
      simplest one by far.
      
      It's also the safest one: the reason why do_wp_page() was changed in the
      first place was that getting the "can I re-use this page" wrong is so
      fraught with errors.  If you do it wrong, you end up with an incorrectly
      shared page.
      
      As a result, using "page_maybe_dma_pinned()" in either do_wp_page() or
      do_swap_page() would be a serious bug since it is only a (very good)
      heuristic.  Re-using the page requires a hard black-and-white rule with
      no room for ambiguity.
      
      In contrast, saying "this page is very likely dma pinned, so let's not
      add it to the swap cache and try to unmap it" is an obviously safe thing
      to do, and if the heuristic might very rarely be a false positive, no
      harm is done.
      
      Fixes: 09854ba9 ("mm: do_wp_page() simplification")
      Reported-and-tested-by: Martin Raiber <martin@urbackup.org>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      feb889fb
  4. 16 Dec, 2020 13 commits
  5. 15 Nov, 2020 1 commit
  6. 17 Oct, 2020 2 commits
  7. 14 Oct, 2020 2 commits
  8. 20 Sep, 2020 1 commit
    • H
      mm: fix check_move_unevictable_pages() on THP · 8d8869ca
      Hugh Dickins authored
      check_move_unevictable_pages() is used in making unevictable shmem pages
      evictable: by shmem_unlock_mapping(), drm_gem_check_release_pagevec() and
      i915/gem check_release_pagevec().  Those may pass down subpages of a huge
      page, when /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force".
      
      That does not crash or warn at present, but the accounting of vmstats
      unevictable_pgs_scanned and unevictable_pgs_rescued is inconsistent:
      scanned being incremented on each subpage, rescued only on the head (since
      tails already appear evictable once the head has been updated).
      
      5.8 commit 5d91f31f ("mm: swap: fix vmstats for huge page") has
      established that vm_events in general (and unevictable_pgs_rescued in
      particular) should count every subpage: so follow that precedent here.
      
      Do this in such a way that if mem_cgroup_page_lruvec() is made stricter
      (to check page->mem_cgroup is always set), no problem: skip the tails
      before calling it, and add thp_nr_pages() to vmstats on the head.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Yang Shi <shy828301@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301405000.5954@eggly.anvils
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d8869ca
  9. 06 Sep, 2020 1 commit
  10. 15 Aug, 2020 1 commit
  11. 13 Aug, 2020 7 commits
    • R
      mm/vmscan.c: delete or fix duplicated words · 1eba09c1
      Randy Dunlap authored
      Drop the repeated word "marked".
      Change "time time" to "same time".
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200801173822.14973-14-rdunlap@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1eba09c1
    • J
      mm/vmscan: restore active/inactive ratio for anonymous LRU · 4002570c
      Joonsoo Kim authored
      Now that workingset detection is implemented for anonymous LRU, we don't
      need large inactive list to allow detecting frequently accessed pages
      before they are reclaimed, anymore.  This effectively reverts the
      temporary measure put in by commit "mm/vmscan: make active/inactive ratio
      as 1:1 for anon lru".
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-7-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4002570c
    • J
      mm/swap: implement workingset detection for anonymous LRU · aae466b0
      Joonsoo Kim authored
      This patch implements workingset detection for anonymous LRU.  All the
      infrastructure is implemented by the previous patches so this patch just
      activates the workingset detection by installing/retrieving the shadow
      entry and adding refault calculation.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aae466b0
    • J
      mm/swapcache: support to handle the shadow entries · 3852f676
      Joonsoo Kim authored
      Workingset detection for anonymous page will be implemented in the
      following patch and it requires to store the shadow entries into the
      swapcache.  This patch implements an infrastructure to store the shadow
      entry in the swapcache.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3852f676
    • J
      mm/workingset: prepare the workingset detection infrastructure for anon LRU · 170b04b7
      Joonsoo Kim authored
      To prepare the workingset detection for anon LRU, this patch splits
      workingset event counters for refault, activate and restore into anon and
      file variants, as well as the refaults counter in struct lruvec.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      170b04b7
    • J
      mm/vmscan: protect the workingset on anonymous LRU · b518154e
      Joonsoo Kim authored
      In the current implementation, a newly created or swapped-in anonymous
      page starts on the active list.  Growing the active list triggers
      rebalancing of the active/inactive lists, so old pages on the active
      list are demoted to the inactive list.  Hence, a page on the active list
      isn't protected at all.

      The following is an example of this situation.

      Assume that there are 50 hot pages on the active list.  Numbers denote
      the number of pages on the active/inactive list (active | inactive).
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(uo) | 50(h)
      
      3. workload: another 50 newly created (used-once) pages
      50(uo) | 50(uo), swap-out 50(h)
      
      This patch tries to fix this issue.  As with the file LRU, newly created
      or swapped-in anonymous pages will be inserted into the inactive list.
      They are promoted to the active list if enough references happen.  This
      simple modification changes the above example as follows.
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(h) | 50(uo)
      
      3. workload: another 50 newly created (used-once) pages
      50(h) | 50(uo), swap-out 50(uo)
      
      As you can see, hot pages on active list would be protected.
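The two walkthroughs above can be replayed with a toy two-list model. This is a deliberately simplified sketch, not the kernel's LRU code: it assumes memory holds 100 pages, the active list is limited to 50, hot pages are not re-referenced during the run, and the names `simulate` and `insert_active` are made up for the illustration.

```python
from collections import deque


def simulate(insert_active, total=100, active_cap=50, new_pages=100):
    """Feed used-once pages into a two-list LRU; return evicted page names."""
    active = deque(f"h{i}" for i in range(50))  # 50 hot pages, head on the left
    inactive = deque()
    swapped_out = []
    for n in range(new_pages):
        page = f"uo{n}"
        if insert_active:
            # Old behaviour: new anonymous pages start on the active list
            # and push the oldest active pages down to the inactive list.
            active.appendleft(page)
            while len(active) > active_cap:
                inactive.appendleft(active.pop())
        else:
            # Patched behaviour: new pages start on the inactive list.
            inactive.appendleft(page)
        # Reclaim from the inactive tail when memory is full.
        while len(active) + len(inactive) > total:
            swapped_out.append(inactive.pop())
    return swapped_out


old_hot_lost = sum(p.startswith("h") for p in simulate(insert_active=True))
new_hot_lost = sum(p.startswith("h") for p in simulate(insert_active=False))
print(f"hot pages swapped out: old={old_hot_lost}, patched={new_hot_lost}")
```

In this model the old placement swaps out all 50 hot pages after two bursts of used-once pages, while the patched placement swaps out only used-once pages.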
      
      Note that this implementation has a drawback: a page cannot be promoted
      and will be swapped out if its re-access interval is greater than the
      size of the inactive list but less than the total size
      (active+inactive).  To solve this potential issue, a following patch
      will apply workingset detection similar to the one that's already
      applied to the file LRU.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b518154e
    • J
      mm/vmscan: make active/inactive ratio as 1:1 for anon lru · ccc5dc67
      Joonsoo Kim authored
      Patch series "workingset protection/detection on the anonymous LRU list", v7.
      
      * PROBLEM
      In the current implementation, a newly created or swapped-in anonymous
      page starts on the active list.  Growing the active list triggers
      rebalancing of the active/inactive lists, so old pages on the active
      list are demoted to the inactive list.  Hence, a hot page on the active
      list isn't protected at all.

      The following is an example of this situation.

      Assume that there are 50 hot pages on the active list and the system can
      hold 100 pages in total.  Numbers denote the number of pages on the
      active/inactive list (active | inactive).  (h) stands for hot pages and
      (uo) stands for used-once pages.
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(uo) | 50(h)
      
      3. workload: another 50 newly created (used-once) pages
      50(uo) | 50(uo), swap-out 50(h)
      
      As we can see, hot pages are swapped out, which causes swap-ins later.

      * SOLUTION
      Since this is what we want to avoid, this patchset implements workingset
      protection.  As with the file LRU list, a newly created or swapped-in
      anonymous page starts on the inactive list.  Also as with the file LRU
      list, the page is promoted if enough references happen.  This simple
      modification changes the above example as follows.
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(h) | 50(uo)
      
      3. workload: another 50 newly created (used-once) pages
      50(h) | 50(uo), swap-out 50(uo)
      
      The hot pages remain on the active list. :)
      
      * EXPERIMENT
      I tested this scenario on my test bed and confirmed that this problem
      happens on current implementation. I also checked that it is fixed by
      this patchset.
      
      * SUBJECT
      workingset detection
      
      * PROBLEM
      The later part of the patchset implements workingset detection for the
      anonymous LRU list.  There is a corner case in which workingset
      protection could cause thrashing.  If we can avoid thrashing by
      workingset detection, we can get better performance.

      The following is an example of thrashing due to workingset protection.
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (will be hot) pages
      50(h) | 50(wh)
      
      3. workload: another 50 newly created (used-once) pages
      50(h) | 50(uo), swap-out 50(wh)
      
      4. workload: 50 (will be hot) pages
      50(h) | 50(wh), swap-in 50(wh)
      
      5. workload: another 50 newly created (used-once) pages
      50(h) | 50(uo), swap-out 50(wh)
      
      6. repeat 4, 5
      
      Without workingset detection, this kind of workload cannot be promoted and
      thrashing happens forever.
      
      * SOLUTION
      Therefore, this patchset implements workingset detection.  All the
      infrastructure for workingset detection is already implemented, so there
      is not much work to do.  First, extend the workingset detection code to
      deal with the anonymous LRU list.  Then, make the swap cache handle the
      exceptional value for the shadow entry.  Lastly, install/retrieve the
      shadow value into/from the swap cache and check the refault distance.
      
      * EXPERIMENT
      I made a test program to imitate the above scenario and confirmed that
      the problem exists.  Then, I checked that this patchset fixes it.

      My test setup is a virtual machine with 8 cpus and 6100MB memory.
      However, the amount of memory that the test program can use is about
      280 MB, because the system uses a large ram-backed swap and a large
      ramdisk to capture the trace.

      The test scenario is as follows.
      
      1. allocate cold memory (512MB)
      2. allocate hot-1 memory (96MB)
      3. activate hot-1 memory (96MB)
      4. allocate another hot-2 memory (96MB)
      5. access cold memory (128MB)
      6. access hot-2 memory (96MB)
      7. repeat 5, 6
      
      Since hot-1 memory (96MB) is on the active list, the inactive list can
      contain roughly 190MB of pages.  hot-2 memory's re-access interval
      (96+128 MB) is more than 190MB, so it cannot be promoted without
      workingset detection and swap-in/out happens repeatedly.  With this
      patchset, workingset detection works and promotion happens.  Therefore,
      swap-in/out occurs less.
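The reasoning above can be written down as a small check. The sketch follows the refault-distance idea described in the kernel's workingset code, in simplified form: a page whose re-access interval exceeds the inactive list by a refault distance R is activated when R is no larger than the active list. Sizes are in MB rather than pages, the function name is hypothetical, and the real kernel comparison involves more state than this.

```python
def should_activate(access_interval, inactive_size, active_size):
    """Simplified workingset test: activate if the refault distance fits."""
    refault_distance = max(access_interval - inactive_size, 0)
    return refault_distance <= active_size


interval_mb = 96 + 128  # hot-2 re-access interval from the test scenario
inactive_mb = 190       # approximate inactive list size
active_mb = 96          # hot-1 memory sitting on the active list

# Without detection, the page must survive on the inactive list by itself.
fits_inactive = interval_mb <= inactive_mb
# With detection, the shadow entry lets the refault be recognized.
promote = should_activate(interval_mb, inactive_mb, active_mb)

print(f"fits inactive list: {fits_inactive}, promoted on refault: {promote}")
```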
      
      Here is the result (average of 5 runs).

      type    swap-in   swap-out
      base     863240     989945
      patch    681565     809273

      As we can see, the patched kernel does less swap-in/out.
      
      * OVERALL TEST (ebizzy using modified random function)
      ebizzy is a test program in which the main thread allocates lots of
      memory and child threads access it randomly for a given time.  Swap-in
      will happen if the allocated memory is larger than the system memory.

      A random function that follows the zipf distribution is used to create
      hot/cold memory.  The hot/cold ratio is controlled by the parameter.  If
      the parameter is high, hot memory is accessed much more often than cold
      memory.  If the parameter is low, the number of accesses to each memory
      region would be similar.  I use various parameters in order to show the
      effect of the patchset on workloads with various hot/cold ratios.
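To give a feel for what the zipf parameter does, the sketch below computes the fraction of accesses that land on the most popular 10% of pages when access probability is proportional to rank^(-alpha). The page count and the helper `hot_share` are assumptions for illustration; the modified random function actually used in ebizzy is not shown in this message and may differ.

```python
def hot_share(alpha, n_pages=1000, n_hot=100):
    """Share of accesses hitting the n_hot most popular of n_pages pages."""
    weights = [rank ** -alpha for rank in range(1, n_pages + 1)]
    return sum(weights[:n_hot]) / sum(weights)


for alpha in (0.1, 0.7, 1.3):
    print(f"alpha={alpha}: top 10% of pages get {hot_share(alpha):.0%} of accesses")
```

With a low alpha the top 10% of pages receive only slightly more than 10% of the accesses, which is close to uniform; with alpha 1.3 they receive the large majority, giving the pronounced hot/cold split described above.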
      
      My test setup is a virtual machine with 8 cpus, 1024 MB memory and 5120 MB
      ram swap.
      
      The result format is as follows.

      param: 1-1024-0.1
      - 1 (number of threads)
      - 1024 (allocated memory size, MB)
      - 0.1 (zipf distribution alpha;
        0.1 behaves roughly like uniform random,
        1.3 makes a small portion of memory hot and the rest cold)
      
      pswpin: smaller is better
      std: standard deviation
      improvement: negative is better
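The improvement column in the tables below appears to be the relative change in pswpin against the base kernel, in percent. A minimal sketch under that assumption, checked against the first three rows of the single-thread table (the function name is illustrative):

```python
def improvement(base_pswpin, patched_pswpin):
    """Relative change in pswpin versus base, in percent (negative is better)."""
    return (patched_pswpin - base_pswpin) / base_pswpin * 100.0


base = 14101983.40                        # base   1-1024.0-0.1
prot = improvement(base, 14065875.80)     # prot   1-1024.0-0.1
detect = improvement(base, 13910435.60)   # detect 1-1024.0-0.1

print(f"prot: {prot:.2f}  detect: {detect:.2f}")
```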
      
      * single thread
                 param        pswpin       std       improvement
            base 1-1024.0-0.1 14101983.40   79441.19
            prot 1-1024.0-0.1 14065875.80  136413.01  (   -0.26 )
          detect 1-1024.0-0.1 13910435.60  100804.82  (   -1.36 )
            base 1-1024.0-0.7 7998368.80   43469.32
            prot 1-1024.0-0.7 7622245.80   88318.74  (   -4.70 )
          detect 1-1024.0-0.7 7618515.20   59742.07  (   -4.75 )
            base 1-1024.0-1.3 1017400.80   38756.30
            prot 1-1024.0-1.3  940464.60   29310.69  (   -7.56 )
          detect 1-1024.0-1.3  945511.40   24579.52  (   -7.07 )
            base 1-1280.0-0.1 22895541.40   50016.08
            prot 1-1280.0-0.1 22860305.40   51952.37  (   -0.15 )
          detect 1-1280.0-0.1 22705565.20   93380.35  (   -0.83 )
            base 1-1280.0-0.7 13717645.60   46250.65
            prot 1-1280.0-0.7 12935355.80   64754.43  (   -5.70 )
          detect 1-1280.0-0.7 13040232.00   63304.00  (   -4.94 )
            base 1-1280.0-1.3 1654251.40    4159.68
            prot 1-1280.0-1.3 1522680.60   33673.50  (   -7.95 )
          detect 1-1280.0-1.3 1599207.00   70327.89  (   -3.33 )
            base 1-1536.0-0.1 31621775.40   31156.28
            prot 1-1536.0-0.1 31540355.20   62241.36  (   -0.26 )
          detect 1-1536.0-0.1 31420056.00  123831.27  (   -0.64 )
            base 1-1536.0-0.7 19620760.60   60937.60
            prot 1-1536.0-0.7 18337839.60   56102.58  (   -6.54 )
          detect 1-1536.0-0.7 18599128.00   75289.48  (   -5.21 )
            base 1-1536.0-1.3 2378142.40   20994.43
            prot 1-1536.0-1.3 2166260.60   48455.46  (   -8.91 )
          detect 1-1536.0-1.3 2183762.20   16883.24  (   -8.17 )
            base 1-1792.0-0.1 40259714.80   90750.70
            prot 1-1792.0-0.1 40053917.20   64509.47  (   -0.51 )
          detect 1-1792.0-0.1 39949736.40  104989.64  (   -0.77 )
            base 1-1792.0-0.7 25704884.40   69429.68
            prot 1-1792.0-0.7 23937389.00   79945.60  (   -6.88 )
          detect 1-1792.0-0.7 24271902.00   35044.30  (   -5.57 )
            base 1-1792.0-1.3 3129497.00   32731.86
            prot 1-1792.0-1.3 2796994.40   19017.26  (  -10.62 )
          detect 1-1792.0-1.3 2886840.40   33938.82  (   -7.75 )
            base 1-2048.0-0.1 48746924.40   50863.88
            prot 1-2048.0-0.1 48631954.40   24537.30  (   -0.24 )
          detect 1-2048.0-0.1 48509419.80   27085.34  (   -0.49 )
            base 1-2048.0-0.7 32046424.40   78624.22
            prot 1-2048.0-0.7 29764182.20   86002.26  (   -7.12 )
          detect 1-2048.0-0.7 30250315.80  101282.14  (   -5.60 )
            base 1-2048.0-1.3 3916723.60   24048.55
            prot 1-2048.0-1.3 3490781.60   33292.61  (  -10.87 )
          detect 1-2048.0-1.3 3585002.20   44942.04  (   -8.47 )
      
      * multi thread
                 param        pswpin       std       improvement
            base 8-1024.0-0.1 16219822.60  329474.01
            prot 8-1024.0-0.1 15959494.00  654597.45  (   -1.61 )
          detect 8-1024.0-0.1 15773790.80  502275.25  (   -2.75 )
            base 8-1024.0-0.7 9174107.80  537619.33
            prot 8-1024.0-0.7 8571915.00  385230.08  (   -6.56 )
          detect 8-1024.0-0.7 8489484.20  364683.00  (   -7.46 )
            base 8-1024.0-1.3 1108495.60   83555.98
            prot 8-1024.0-1.3 1038906.20   63465.20  (   -6.28 )
          detect 8-1024.0-1.3  941817.80   32648.80  (  -15.04 )
            base 8-1280.0-0.1 25776114.20  450480.45
            prot 8-1280.0-0.1 25430847.00  465627.07  (   -1.34 )
          detect 8-1280.0-0.1 25282555.00  465666.55  (   -1.91 )
            base 8-1280.0-0.7 15218968.00  702007.69
            prot 8-1280.0-0.7 13957947.80  492643.86  (   -8.29 )
          detect 8-1280.0-0.7 14158331.20  238656.02  (   -6.97 )
            base 8-1280.0-1.3 1792482.80   30512.90
            prot 8-1280.0-1.3 1577686.40   34002.62  (  -11.98 )
          detect 8-1280.0-1.3 1556133.00   22944.79  (  -13.19 )
            base 8-1536.0-0.1 33923761.40  575455.85
            prot 8-1536.0-0.1 32715766.20  300633.51  (   -3.56 )
          detect 8-1536.0-0.1 33158477.40  117764.51  (   -2.26 )
            base 8-1536.0-0.7 20628907.80  303851.34
            prot 8-1536.0-0.7 19329511.20  341719.31  (   -6.30 )
          detect 8-1536.0-0.7 20013934.00  385358.66  (   -2.98 )
            base 8-1536.0-1.3 2588106.40  130769.20
            prot 8-1536.0-1.3 2275222.40   89637.06  (  -12.09 )
          detect 8-1536.0-1.3 2365008.40  124412.55  (   -8.62 )
            base 8-1792.0-0.1 43328279.20  946469.12
            prot 8-1792.0-0.1 41481980.80  525690.89  (   -4.26 )
          detect 8-1792.0-0.1 41713944.60  406798.93  (   -3.73 )
            base 8-1792.0-0.7 27155647.40  536253.57
            prot 8-1792.0-0.7 24989406.80  502734.52  (   -7.98 )
          detect 8-1792.0-0.7 25524806.40  263237.87  (   -6.01 )
            base 8-1792.0-1.3 3260372.80  137907.92
            prot 8-1792.0-1.3 2879187.80   63597.26  (  -11.69 )
          detect 8-1792.0-1.3 2892962.20   33229.13  (  -11.27 )
            base 8-2048.0-0.1 50583989.80  710121.48
            prot 8-2048.0-0.1 49599984.40  228782.42  (   -1.95 )
          detect 8-2048.0-0.1 50578596.00  660971.66  (   -0.01 )
            base 8-2048.0-0.7 33765479.60  812659.55
            prot 8-2048.0-0.7 30767021.20  462907.24  (   -8.88 )
          detect 8-2048.0-0.7 32213068.80  211884.24  (   -4.60 )
            base 8-2048.0-1.3 3941675.80   28436.45
            prot 8-2048.0-1.3 3538742.40   76856.08  (  -10.22 )
          detect 8-2048.0-1.3 3579397.80   58630.95  (   -9.19 )
      
      As we can see, all the cases show improvement.  In particular, the test
      cases with zipf distribution 1.3 show more improvement.  It means that
      if there is a hot/cold tendency in anon pages, this patchset works
      better.
      
      This patch (of 6):
      
      The current implementation of LRU management for anonymous pages has
      some problems.  The most important one is that it doesn't protect the
      workingset, that is, pages on the active LRU list.  Although this
      problem will be fixed by the following patches, some preparation is
      required and this patch does it.

      What the following patch does is implement workingset protection.  After
      the following patches, newly created or swapped-in pages will start
      their lifetime on the inactive list.  If the inactive list is too small,
      a page does not get enough chance to be referenced and cannot become
      part of the workingset.

      In order to give newly created or swapped-in anonymous pages enough
      chance to be referenced again, this patch makes the active/inactive LRU
      ratio 1:1.

      This is just a temporary measure.  A later patch in the series
      introduces workingset detection for the anonymous LRU, which will be
      used to better decide whether pages should start on the active or the
      inactive list.  Afterwards this patch is effectively reverted.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-1-git-send-email-iamjoonsoo.kim@lge.com
      Link: http://lkml.kernel.org/r/1595490560-15117-2-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ccc5dc67