1. 17 1月, 2020 34 次提交
  2. 15 1月, 2020 6 次提交
    • Y
      alinux: mm: thp: move deferred split queue to memcg's nodeinfo · e6ca020b
      Yang Shi 提交于
      The commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe ("mm: thp: make
      deferred split shrinker memcg aware") makes deferred split queue per
      memcg to resolve memcg pre-mature OOM problem.  But, all nodes end up
      sharing the same queue instead of one queue per-node before the commit.
      It is not a big deal for memcg limit reclaim, but it may cause global
      kswapd shrink THPs from a different node.
      
      And, 0-day testing reported -19.6% regression of stress-ng's madvise
      test [1].  I didn't see that much regression on my test box (24 threads,
      48GB memory, 2 nodes), with the same test (stress-ng --timeout 1
      --metrics-brief --sequential 72  --class vm --exclude spawn,exec), I saw
      average -3% (run the same test 10 times then calculate the average since
      the test itself may have most 15% variation according to my test)
      regression sometimes (not every time, sometimes I didn't see regression
      at all).
      
      This might be caused by deferred split queue lock contention.  With some
      configuration (i.e. just one root memcg) the lock contention my be worse
      than before (given 2 nodes, two locks are reduced to one lock).
      
      So, moving deferred split queue to memcg's nodeinfo to make it NUMA
      aware again.
      
      With this change stress-ng's madvise test shows average 4% improvement
      sometimes and I didn't see degradation anymore.
      
      [1]: https://lore.kernel.org/lkml/20190930084604.GC17687@shao2-debian/
      
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      e6ca020b
    • Y
      mm: thp: make deferred split shrinker memcg aware · d651fcbb
      Yang Shi 提交于
      commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe upstream
      
      Currently THP deferred split shrinker is not memcg aware, this may cause
      premature OOM with some configuration.  For example the below test would
      run into premature OOM easily:
      
      $ cgcreate -g memory:thp
      $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
      $ cgexec -g memory:thp transhuge-stress 4000
      
      transhuge-stress comes from kernel selftest.
      
      It is easy to hit OOM, but there are still a lot THP on the deferred
      split queue, memcg direct reclaim can't touch them since the deferred split
      shrinker is not memcg aware.
      
      Convert deferred split shrinker memcg aware by introducing per memcg
      deferred split queue.  The THP should be on either per node or per memcg
      deferred split queue if it belongs to a memcg.  When the page is
      immigrated to the other memcg, it will be immigrated to the target
      memcg's deferred split queue too.
      
      Reuse the second tail page's deferred_list for per memcg list since the
      same THP can't be on multiple deferred split queues.
      
      [yang.shi@linux.alibaba.com: simplify deferred split queue dereference per Kirill Tkhai]
        Link: http://lkml.kernel.org/r/1566496227-84952-5-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1565144277-36240-5-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      d651fcbb
    • Y
      mm: shrinker: make shrinker not depend on memcg kmem · bd5596b4
      Yang Shi 提交于
      commit 0a432dcbeb32edcd211a5d8f7847d0da7642a8b4 upstream
      
      Currently shrinker is just allocated and can work when memcg kmem is
      enabled.  But, THP deferred split shrinker is not slab shrinker, it
      doesn't make too much sense to have such shrinker depend on memcg kmem.
      It should be able to reclaim THP even though memcg kmem is disabled.
      
      Introduce a new shrinker flag, SHRINKER_NONSLAB, for non-slab shrinker.
      When memcg kmem is disabled, just such shrinkers can be called in
      shrinking memcg slab.
      
      [yang.shi@linux.alibaba.com: add comment]
        Link: http://lkml.kernel.org/r/1566496227-84952-4-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1565144277-36240-4-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      bd5596b4
    • Y
      mm: move mem_cgroup_uncharge out of __page_cache_release() · 9b78918c
      Yang Shi 提交于
      commit 7ae88534cdd96235cd775c03b32a75009355740b upstream
      
      A later patch makes THP deferred split shrinker memcg aware, but it
      needs page->mem_cgroup information in THP destructor, which is called after
      mem_cgroup_uncharge() now.
      
      So move mem_cgroup_uncharge() from __page_cache_release() to compound
      page destructor, which is called by both THP and other compound pages except
      HugeTLB.  And call it in __put_single_page() for single order page.
      
      Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Suggested-by: N"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      9b78918c
    • Y
      mm: thp: extract split_queue_* into a struct · e65b6961
      Yang Shi 提交于
      commit 364c1eebe453f06f0c1e837eb155a5725c9cd272 upstream
      
      Patch series "Make deferred split shrinker memcg aware", v6.
      
      Currently THP deferred split shrinker is not memcg aware, this may cause
      premature OOM with some configuration.  For example the below test would
      run into premature OOM easily:
      
      $ cgcreate -g memory:thp
      $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
      $ cgexec -g memory:thp transhuge-stress 4000
      
      transhuge-stress comes from kernel selftest.
      
      It is easy to hit OOM, but there are still a lot THP on the deferred
      split queue, memcg direct reclaim can't touch them since the deferred split
      shrinker is not memcg aware.
      
      Convert deferred split shrinker memcg aware by introducing per memcg
      deferred split queue.  The THP should be on either per node or per memcg
      deferred split queue if it belongs to a memcg.  When the page is
      immigrated to the other memcg, it will be immigrated to the target
      memcg's deferred split queue too.
      
      Reuse the second tail page's deferred_list for per memcg list since the
      same THP can't be on multiple deferred split queues.
      
      Make deferred split shrinker not depend on memcg kmem since it is not
      slab.  It doesn't make sense to not shrink THP even though memcg kmem is
      disabled.
      
      With the above change the test demonstrated above doesn't trigger OOM
      even though with cgroup.memory=nokmem.
      
      This patch (of 4):
      
      Put split_queue, split_queue_lock and split_queue_len into a struct in
      order to reduce code duplication when we convert deferred_split to memcg
      aware in the later patches.
      
      Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Suggested-by: N"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      e65b6961
    • G
      alinux: mm: Support kidled · a29243e2
      Gavin Shan 提交于
      This enables scanning pages in fixed interval to determine their access
      frequency (hot/cold). The result is exported to user land on basis of
      memory cgroup by "memory.idle_page_stats". The design is highlighted as
      below:
      
         * A kernel thread is spawn when this feature is enabled by writing
           non-zero value to "/sys/kernel/mm/kidled/scan_period_in_seconds".
           The thread sequentially scans the nodes and their pages that have
           been chained up in LRU list.
      
         * For each page, its corresponding age information is stored in the
           page flags or array in node. The age represents the scanning intervals
           in which the page isn't accessed. Also, the page flag (PG_idle) is
           leveraged. The page's age is increased by one if the idle flag isn't
           cleared in two consective scans. Otherwise, the page's age is cleared out.
           Also, the page's age information is cleared when it's free'd so that
           the stale age information won't be fetched when it's allocated.
      
         * Initially, the flag is set, while the access bit in its PTE is cleared
           out by the thread. In next scanning period, its PTE access bit is
           synchronized with the page flag: clear the flag if access bit is set.
           The flag is kept otherwise. For unmapped pages, the flag is cleared
           when it's accessed.
      
         * Eventually, the page's aging information is updated to the unstable
           bucket of its corresponding memory cgroup, taking as statistics. The
           unstable bucket (statistics) is copied to stable bucket when all pages
           in all nodes are scanned for once. The stable bucket (statistics) is
           exported to user land through "memory.idle_page_stats".
      
      TESTING
      =======
      
         * cgroup1, unmapped pagecache
      
           # dd if=/dev/zero of=/ext4/test.data oflag=direct bs=1M count=128
           #
           # echo 1 > /sys/kernel/mm/kidled/use_hierarchy
           # echo 15 > /sys/kernel/mm/kidled/scan_period_in_seconds
           # mkdir -p /cgroup/memory
           # mount -tcgroup -o memory /cgroup/memory
           # echo 1 > /cgroup/memory/memory.use_hierarchy
           # mkdir -p /cgroup/memory/test
           # echo 1 > /cgroup/memory/test/memory.use_hierarchy
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # dd if=/ext4/test.data of=/dev/null bs=1M count=128
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
             cfei   0   0   0   134217728   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep cfei
             cfei   0   0   0   134217728   0   0   0   0
      
         * cgroup1, mapped pagecache
      
           # < create same file and memory cgroups as above >
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # < run program to mmap the whole created file and access the area >
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
             cfei   0   134217728   0   0   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep cfei
             cfei   0   134217728   0   0   0   0   0   0
      
         * cgroup1, mapped and locked pagecache
      
           # < create same file and memory cgroups as above >
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # < run program to mmap the whole created file and mlock the area >
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfui
             cfui   0   134217728   0   0   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep cfui
             cfui   0   134217728   0   0   0   0   0   0
      
         * cgroup1, anonymous and locked area
      
           # < create memory cgroups as above >
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # < run program to mmap anonymous area and mlock it >
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep csui
             csui   0   0   134217728   0   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep csui
             csui   0   0   134217728   0   0   0   0   0
      
         * Rerun above test cases in cgroup2 and the results are no exceptional.
           However, the cgroups are populated in different way as below:
      
           # mkdir -p /cgroup
           # mount -tcgroup2 none /cgroup
           # echo "+memory" > /cgroup/cgroup.subtree_control
           # mkdir -p /cgroup/test
      Signed-off-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      a29243e2