1. 25 9月, 2019 2 次提交
    • C
      mm, memcg: throttle allocators when failing reclaim over memory.high · 0e4b01df
      Chris Down 提交于
      We're trying to use memory.high to limit workloads, but have found that
      containment can frequently fail completely and cause OOM situations
      outside of the cgroup.  This happens especially with swap space -- either
      when none is configured, or swap is full.  These failures often also don't
      have enough warning to allow one to react, whether for a human or for a
      daemon monitoring PSI.
      
      Here is output from a simple program showing how long it takes in usec
      (column 2) to allocate a megabyte of anonymous memory (column 1) when a
      cgroup is already beyond its memory high setting, and no swap is
      available:
      
          [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
          > --wait -t timeout 300 /root/mdf
          [...]
          95  1035
          96  1038
          97  1000
          98  1036
          99  1048
          100 1590
          101 1968
          102 1776
          103 1863
          104 1757
          105 1921
          106 1893
          107 1760
          108 1748
          109 1843
          110 1716
          111 1924
          112 1776
          113 1831
          114 1766
          115 1836
          116 1588
          117 1912
          118 1802
          119 1857
          120 1731
          [...]
          [System OOM in 2-3 seconds]
      
      The delay does go up extremely marginally past the 100MB memory.high
      threshold, as now we spend time scanning before returning to usermode, but
      it's nowhere near enough to contain growth.  It also doesn't get worse the
      more pages you have, since it only considers nr_pages.
      
      The current situation goes against both the expectations of users of
      memory.high, and our intentions as cgroup v2 developers.  In
      cgroup-v2.txt, we claim that we will throttle and only under "extreme
      conditions" will memory.high protection be breached.  Likewise, cgroup v2
      users generally also expect that memory.high should throttle workloads as
      they exceed their high threshold.  However, as seen above, this isn't
      always how it works in practice -- even on banal setups like those with no
      swap, or where swap has become exhausted, we can end up with memory.high
      being breached and us having no weapons left in our arsenal to combat
      runaway growth with, since reclaim is futile.
      
      It's also hard for system monitoring software or users to tell how bad the
      situation is, as "high" events for the memcg may in some cases be benign,
      and in others be catastrophic.  The current status quo is that we fail
      containment in a way that doesn't provide any advance warning that things
      are about to go horribly wrong (for example, we are about to invoke the
      kernel OOM killer).
      
      This patch introduces explicit throttling when reclaim is failing to keep
      memcg size contained at the memory.high setting.  It does so by applying
      an exponential delay curve derived from the memcg's overage compared to
      memory.high.  In the normal case where the memcg is either below or only
      marginally over its memory.high setting, no throttling will be performed.
      
      This composes well with system health monitoring and remediation, as these
      allocator delays are factored into PSI's memory pressure calculations.
      This both creates a mechanism system administrators or applications
      consuming the PSI interface to trivially see that the memcg in question is
      struggling and use that to make more reasonable decisions, and permits
      them enough time to act.  Either of these can act with significantly more
      nuance than that we can provide using the system OOM killer.
      
      This is a similar idea to memory.oom_control in cgroup v1 which would put
      the cgroup to sleep if the threshold was violated, but it's also
      significantly improved as it results in visible memory pressure, and also
      doesn't schedule indefinitely, which previously made tracing and other
      introspection difficult (ie.  it's clamped at 2*HZ per allocation through
      MEMCG_MAX_HIGH_DELAY_JIFFIES).
      
      Contrast the previous results with a kernel with this patch:
      
          [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
          > --wait -t timeout 300 /root/mdf
          [...]
          95  1002
          96  1000
          97  1002
          98  1003
          99  1000
          100 1043
          101 84724
          102 330628
          103 610511
          104 1016265
          105 1503969
          106 2391692
          107 2872061
          108 3248003
          109 4791904
          110 5759832
          111 6912509
          112 8127818
          113 9472203
          114 12287622
          115 12480079
          116 14144008
          117 15808029
          118 16384500
          119 16383242
          120 16384979
          [...]
      
      As you can see, in the normal case, memory allocation takes around 1000
      usec.  However, as we exceed our memory.high, things start to increase
      exponentially, but fairly leniently at first.  Our first megabyte over
      memory.high takes us 0.16 seconds, then the next is 0.46 seconds, then the
      next is almost an entire second.  This gets worse until we reach our
      eventual 2*HZ clamp per batch, resulting in 16 seconds per megabyte.
      However, this is still making forward progress, so permits tracing or
      further analysis with programs like GDB.
      
      We use an exponential curve for our delay penalty for a few reasons:
      
      1. We run mem_cgroup_handle_over_high to potentially do reclaim after
         we've already performed allocations, which means that temporarily
         going over memory.high by a small amount may be perfectly legitimate,
         even for compliant workloads. We don't want to unduly penalise such
         cases.
      2. An exponential curve (as opposed to a static or linear delay) allows
         ramping up memory pressure stats more gradually, which can be useful
         to work out that you have set memory.high too low, without destroying
         application performance entirely.
      
      This patch expands on earlier work by Johannes Weiner. Thanks!
      
      [akpm@linux-foundation.org: fix max() warning]
      [akpm@linux-foundation.org: fix __udivdi3 ref on 32-bit]
      [akpm@linux-foundation.org: fix it even more]
      [chris@chrisdown.name: fix 64-bit divide even more]
      Link: http://lkml.kernel.org/r/20190723180700.GA29459@chrisdown.nameSigned-off-by: NChris Down <chris@chrisdown.name>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e4b01df
    • M
      mm: introduce compound_nr() · d8c6546b
      Matthew Wilcox (Oracle) 提交于
      Replace 1 << compound_order(page) with compound_nr(page).  Minor
      improvements in readability.
      
      Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.orgSigned-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NIra Weiny <ira.weiny@intel.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8c6546b
  2. 07 9月, 2019 2 次提交
  3. 31 8月, 2019 3 次提交
  4. 30 8月, 2019 1 次提交
  5. 27 8月, 2019 1 次提交
    • T
      writeback, memcg: Implement foreign dirty flushing · 97b27821
      Tejun Heo 提交于
      There's an inherent mismatch between memcg and writeback.  The former
      trackes ownership per-page while the latter per-inode.  This was a
      deliberate design decision because honoring per-page ownership in the
      writeback path is complicated, may lead to higher CPU and IO overheads
      and deemed unnecessary given that write-sharing an inode across
      different cgroups isn't a common use-case.
      
      Combined with inode majority-writer ownership switching, this works
      well enough in most cases but there are some pathological cases.  For
      example, let's say there are two cgroups A and B which keep writing to
      different but confined parts of the same inode.  B owns the inode and
      A's memory is limited far below B's.  A's dirty ratio can rise enough
      to trigger balance_dirty_pages() sleeps but B's can be low enough to
      avoid triggering background writeback.  A will be slowed down without
      a way to make writeback of the dirty pages happen.
      
      This patch implements foreign dirty recording and foreign mechanism so
      that when a memcg encounters a condition as above it can trigger
      flushes on bdi_writebacks which can clean its pages.  Please see the
      comment on top of mem_cgroup_track_foreign_dirty_slowpath() for
      details.
      
      A reproducer follows.
      
      write-range.c::
      
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <sys/types.h>
      
        static const char *usage = "write-range FILE START SIZE\n";
      
        int main(int argc, char **argv)
        {
      	  int fd;
      	  unsigned long start, size, end, pos;
      	  char *endp;
      	  char buf[4096];
      
      	  if (argc < 4) {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  fd = open(argv[1], O_WRONLY);
      	  if (fd < 0) {
      		  perror("open");
      		  return 1;
      	  }
      
      	  start = strtoul(argv[2], &endp, 0);
      	  if (*endp != '\0') {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  size = strtoul(argv[3], &endp, 0);
      	  if (*endp != '\0') {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  end = start + size;
      
      	  while (1) {
      		  for (pos = start; pos < end; ) {
      			  long bread, bwritten = 0;
      
      			  if (lseek(fd, pos, SEEK_SET) < 0) {
      				  perror("lseek");
      				  return 1;
      			  }
      
      			  bread = read(0, buf, sizeof(buf) < end - pos ?
      					       sizeof(buf) : end - pos);
      			  if (bread < 0) {
      				  perror("read");
      				  return 1;
      			  }
      			  if (bread == 0)
      				  return 0;
      
      			  while (bwritten < bread) {
      				  long this;
      
      				  this = write(fd, buf + bwritten,
      					       bread - bwritten);
      				  if (this < 0) {
      					  perror("write");
      					  return 1;
      				  }
      
      				  bwritten += this;
      				  pos += bwritten;
      			  }
      		  }
      	  }
        }
      
      repro.sh::
      
        #!/bin/bash
      
        set -e
        set -x
      
        sysctl -w vm.dirty_expire_centisecs=300000
        sysctl -w vm.dirty_writeback_centisecs=300000
        sysctl -w vm.dirtytime_expire_seconds=300000
        echo 3 > /proc/sys/vm/drop_caches
      
        TEST=/sys/fs/cgroup/test
        A=$TEST/A
        B=$TEST/B
      
        mkdir -p $A $B
        echo "+memory +io" > $TEST/cgroup.subtree_control
        echo $((1<<30)) > $A/memory.high
        echo $((32<<30)) > $B/memory.high
      
        rm -f testfile
        touch testfile
        fallocate -l 4G testfile
      
        echo "Starting B"
      
        (echo $BASHPID > $B/cgroup.procs
         pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &
      
        echo "Waiting 10s to ensure B claims the testfile inode"
        sleep 5
        sync
        sleep 5
        sync
        echo "Starting A"
      
        (echo $BASHPID > $A/cgroup.procs
         pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))
      
      v2: Added comments explaining why the specific intervals are being used.
      
      v3: Use 0 @nr when calling cgroup_writeback_by_id() to use best-effort
          flushing while avoding possible livelocks.
      
      v4: Use get_jiffies_64() and time_before/after64() instead of raw
          jiffies_64 and arthimetic comparisons as suggested by Jan.
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      97b27821
  6. 25 8月, 2019 2 次提交
    • R
      mm: memcontrol: flush percpu vmevents before releasing memcg · bb65f89b
      Roman Gushchin 提交于
      Similar to vmstats, percpu caching of local vmevents leads to an
      accumulation of errors on non-leaf levels.  This happens because some
      leftovers may remain in percpu caches, so that they are never propagated
      up by the cgroup tree and just disappear into nonexistence with on
      releasing of the memory cgroup.
      
      To fix this issue let's accumulate and propagate percpu vmevents values
      before releasing the memory cgroup similar to what we're doing with
      vmstats.
      
      Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
      only over online cpus.
      
      Link: http://lkml.kernel.org/r/20190819202338.363363-4-guro@fb.com
      Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb65f89b
    • R
      mm: memcontrol: flush percpu vmstats before releasing memcg · c350a99e
      Roman Gushchin 提交于
      Percpu caching of local vmstats with the conditional propagation by the
      cgroup tree leads to an accumulation of errors on non-leaf levels.
      
      Let's imagine two nested memory cgroups A and A/B.  Say, a process
      belonging to A/B allocates 100 pagecache pages on the CPU 0.  The percpu
      cache will spill 3 times, so that 32*3=96 pages will be accounted to A/B
      and A atomic vmstat counters, 4 pages will remain in the percpu cache.
      
      Imagine A/B is nearby memory.max, so that every following allocation
      triggers a direct reclaim on the local CPU.  Say, each such attempt will
      free 16 pages on a new cpu.  That means every percpu cache will have -16
      pages, except the first one, which will have 4 - 16 = -12.  A/B and A
      atomic counters will not be touched at all.
      
      Now a user removes A/B.  All percpu caches are freed and corresponding
      vmstat numbers are forgotten.  A has 96 pages more than expected.
      
      As memory cgroups are created and destroyed, errors do accumulate.  Even
      1-2 pages differences can accumulate into large numbers.
      
      To fix this issue let's accumulate and propagate percpu vmstat values
      before releasing the memory cgroup.  At this point these numbers are
      stable and cannot be changed.
      
      Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
      only over online cpus.
      
      Link: http://lkml.kernel.org/r/20190819202338.363363-2-guro@fb.com
      Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c350a99e
  7. 14 8月, 2019 2 次提交
    • R
      mm: workingset: fix vmstat counters for shadow nodes · ec9f0238
      Roman Gushchin 提交于
      Memcg counters for shadow nodes are broken because the memcg pointer is
      obtained in a wrong way. The following approach is used:
              virt_to_page(xa_node)->mem_cgroup
      
      Since commit 4d96ba35 ("mm: memcg/slab: stop setting
      page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
      set for slab pages, so memcg_from_slab_page() should be used instead.
      
      Also I doubt that it ever worked correctly: virt_to_head_page() should
      be used instead of virt_to_page().  Otherwise objects residing on tail
      pages are not accounted, because only the head page contains a valid
      mem_cgroup pointer.  That was a case since the introduction of these
      counters by the commit 68d48e6a ("mm: workingset: add vmstat counter
      for shadow nodes").
      
      Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
      Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec9f0238
    • M
      mm/memcontrol.c: fix use after free in mem_cgroup_iter() · 54a83d6b
      Miles Chen 提交于
      This patch is sent to report an use after free in mem_cgroup_iter()
      after merging commit be2657752e9e ("mm: memcg: fix use after free in
      mem_cgroup_iter()").
      
      I work with android kernel tree (4.9 & 4.14), and commit be2657752e9e
      ("mm: memcg: fix use after free in mem_cgroup_iter()") has been merged
      to the trees.  However, I can still observe use after free issues
      addressed in the commit be2657752e9e.  (on low-end devices, a few times
      this month)
      
      backtrace:
              css_tryget <- crash here
              mem_cgroup_iter
              shrink_node
              shrink_zones
              do_try_to_free_pages
              try_to_free_pages
              __perform_reclaim
              __alloc_pages_direct_reclaim
              __alloc_pages_slowpath
              __alloc_pages_nodemask
      
      To debug, I poisoned mem_cgroup before freeing it:
      
        static void __mem_cgroup_free(struct mem_cgroup *memcg)
              for_each_node(node)
              free_mem_cgroup_per_node_info(memcg, node);
              free_percpu(memcg->stat);
        +     /* poison memcg before freeing it */
        +     memset(memcg, 0x78, sizeof(struct mem_cgroup));
              kfree(memcg);
        }
      
      The coredump shows the position=0xdbbc2a00 is freed.
      
        (gdb) p/x ((struct mem_cgroup_per_node *)0xe5009e00)->iter[8]
        $13 = {position = 0xdbbc2a00, generation = 0x2efd}
      
        0xdbbc2a00:     0xdbbc2e00      0x00000000      0xdbbc2800      0x00000100
        0xdbbc2a10:     0x00000200      0x78787878      0x00026218      0x00000000
        0xdbbc2a20:     0xdcad6000      0x00000001      0x78787800      0x00000000
        0xdbbc2a30:     0x78780000      0x00000000      0x0068fb84      0x78787878
        0xdbbc2a40:     0x78787878      0x78787878      0x78787878      0xe3fa5cc0
        0xdbbc2a50:     0x78787878      0x78787878      0x00000000      0x00000000
        0xdbbc2a60:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a70:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a80:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a90:     0x00000001      0x00000000      0x00000000      0x00100000
        0xdbbc2aa0:     0x00000001      0xdbbc2ac8      0x00000000      0x00000000
        0xdbbc2ab0:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2ac0:     0x00000000      0x00000000      0xe5b02618      0x00001000
        0xdbbc2ad0:     0x00000000      0x78787878      0x78787878      0x78787878
        0xdbbc2ae0:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2af0:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b00:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b10:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b20:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b30:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b40:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b50:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b60:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b70:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b80:     0x78787878      0x78787878      0x00000000      0x78787878
        0xdbbc2b90:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2ba0:     0x78787878      0x78787878      0x78787878      0x78787878
      
      In the reclaim path, try_to_free_pages() does not setup
      sc.target_mem_cgroup and sc is passed to do_try_to_free_pages(), ...,
      shrink_node().
      
      In mem_cgroup_iter(), root is set to root_mem_cgroup because
      sc->target_mem_cgroup is NULL.  It is possible to assign a memcg to
      root_mem_cgroup.nodeinfo.iter in mem_cgroup_iter().
      
              try_to_free_pages
              	struct scan_control sc = {...}, target_mem_cgroup is 0x0;
              do_try_to_free_pages
              shrink_zones
              shrink_node
              	 mem_cgroup *root = sc->target_mem_cgroup;
              	 memcg = mem_cgroup_iter(root, NULL, &reclaim);
              mem_cgroup_iter()
              	if (!root)
              		root = root_mem_cgroup;
              	...
      
              	css = css_next_descendant_pre(css, &root->css);
              	memcg = mem_cgroup_from_css(css);
              	cmpxchg(&iter->position, pos, memcg);
      
      My device uses memcg non-hierarchical mode.  When we release a memcg:
      invalidate_reclaim_iterators() reaches only dead_memcg and its parents.
      If non-hierarchical mode is used, invalidate_reclaim_iterators() never
      reaches root_mem_cgroup.
      
        static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
        {
              struct mem_cgroup *memcg = dead_memcg;
      
              for (; memcg; memcg = parent_mem_cgroup(memcg)
              ...
        }
      
      So the use after free scenario looks like:
      
        CPU1						CPU2
      
        try_to_free_pages
        do_try_to_free_pages
        shrink_zones
        shrink_node
        mem_cgroup_iter()
            if (!root)
            	root = root_mem_cgroup;
            ...
            css = css_next_descendant_pre(css, &root->css);
            memcg = mem_cgroup_from_css(css);
            cmpxchg(&iter->position, pos, memcg);
      
              				invalidate_reclaim_iterators(memcg);
              				...
              				__mem_cgroup_free()
              					kfree(memcg);
      
        try_to_free_pages
        do_try_to_free_pages
        shrink_zones
        shrink_node
        mem_cgroup_iter()
            if (!root)
            	root = root_mem_cgroup;
            ...
            mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
            iter = &mz->iter[reclaim->priority];
            pos = READ_ONCE(iter->position);
            css_tryget(&pos->css) <- use after free
      
      To avoid this, we should also invalidate root_mem_cgroup.nodeinfo.iter
      in invalidate_reclaim_iterators().
      
      [cai@lca.pw: fix -Wparentheses compilation warning]
        Link: http://lkml.kernel.org/r/1564580753-17531-1-git-send-email-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190730015729.4406-1-miles.chen@mediatek.com
      Fixes: 5ac8fb31 ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
      Signed-off-by: NMiles Chen <miles.chen@mediatek.com>
      Signed-off-by: NQian Cai <cai@lca.pw>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54a83d6b
  8. 17 7月, 2019 1 次提交
  9. 13 7月, 2019 10 次提交
    • S
      mm, oom: remove redundant task_in_mem_cgroup() check · 6ba749ee
      Shakeel Butt 提交于
      oom_unkillable_task() can be called from three different contexts i.e.
      global OOM, memcg OOM and oom_score procfs interface.  At the moment
      oom_unkillable_task() does a task_in_mem_cgroup() check on the given
      process.  Since there is no reason to perform task_in_mem_cgroup()
      check for global OOM and oom_score procfs interface, those contexts
      provide NULL memcg and skips the task_in_mem_cgroup() check.  However
      for memcg OOM context, the oom_unkillable_task() is always called from
      mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes
      redundant and effectively dead code.  So, just remove the
      task_in_mem_cgroup() check altogether.
      
      Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6ba749ee
    • T
      mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks() · f168a9a5
      Tetsuo Handa 提交于
      Since commit c03cd773 ("cgroup: Include dying leaders with live
      threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
      mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
      only one thread from each thread group.
      
      [penguin-kernel@I-love.SAKURA.ne.jp: remove thread group leader check in oom_evaluate_task()]
        Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Link: http://lkml.kernel.org/r/c763afc8-f0ae-756a-56a7-395f625b95fc@i-love.sakura.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f168a9a5
    • R
      mm: memcg/slab: reparent memcg kmem_caches on cgroup removal · fb2f2b0a
      Roman Gushchin 提交于
      Let's reparent non-root kmem_caches on memcg offlining.  This allows us to
      release the memory cgroup without waiting for the last outstanding kernel
      object (e.g.  dentry used by another application).
      
      Since the parent cgroup is already charged, everything we need to do is to
      splice the list of kmem_caches to the parent's kmem_caches list, swap the
      memcg pointer, drop the css refcounter for each kmem_cache and adjust the
      parent's css refcounter.
      
      Please, note that kmem_cache->memcg_params.memcg isn't a stable pointer
      anymore.  It's safe to read it under rcu_read_lock(), cgroup_mutex held,
      or any other way that protects the memory cgroup from being released.
      
      We can race with the slab allocation and deallocation paths.  It's not a
      big problem: parent's charge and slab global stats are always correct, and
      we don't care anymore about the child usage and global stats.  The child
      cgroup is already offline, so we don't use or show it anywhere.
      
      Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't
      used anywhere except count_shadow_nodes().  But even there it won't break
      anything: after reparenting "nodes" will be 0 on child level (because
      we're already reparenting shrinker lists), and on parent level page stats
      always were 0, and this patch won't change anything.
      
      [guro@fb.com: properly handle kmem_caches reparented to root_mem_cgroup]
        Link: http://lkml.kernel.org/r/20190620213427.1691847-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20190611231813.3148843-11-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fb2f2b0a
    • R
      mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages · 4d96ba35
      Roman Gushchin 提交于
      Every slab page charged to a non-root memory cgroup has a pointer to the
      memory cgroup and holds a reference to it, which protects a non-empty
      memory cgroup from being released.  At the same time the page has a
      pointer to the corresponding kmem_cache, and also hold a reference to the
      kmem_cache.  And kmem_cache by itself holds a reference to the cgroup.
      
      So there is clearly some redundancy, which allows to stop setting the
      page->mem_cgroup pointer and rely on getting memcg pointer indirectly via
      kmem_cache.  Further it will allow to change this pointer easier, without
      a need to go over all charged pages.
      
      So let's stop setting page->mem_cgroup pointer for slab pages, and stop
      using the css refcounter directly for protecting the memory cgroup from
      going away.  Instead rely on kmem_cache as an intermediate object.
      
      Make sure that vmstats and shrinker lists are working as previously, as
      well as /proc/kpagecgroup interface.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-10-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d96ba35
    • R
      mm: memcg/slab: rework non-root kmem_cache lifecycle management · f0a3a24b
      Roman Gushchin 提交于
      Currently each charged slab page holds a reference to the cgroup to which
      it's charged.  Kmem_caches are held by the memcg and are released all
      together with the memory cgroup.  It means that none of kmem_caches are
      released unless at least one reference to the memcg exists, which is very
      far from optimal.
      
      Let's rework it in a way that allows releasing individual kmem_caches as
      soon as the cgroup is offline, the kmem_cache is empty and there are no
      pending allocations.
      
      To make it possible, let's introduce a new percpu refcounter for non-root
      kmem caches.  The counter is initialized to the percpu mode, and is
      switched to the atomic mode during kmem_cache deactivation.  The counter
      is bumped for every charged page and also for every running allocation.
      So the kmem_cache can't be released unless all allocations complete.
      
      To shutdown non-active empty kmem_caches, let's reuse the work queue,
      previously used for the kmem_cache deactivation.  Once the reference
      counter reaches 0, let's schedule an asynchronous kmem_cache release.
      
      * I used the following simple approach to test the performance
      (stolen from another patchset by T. Harding):
      
          time find / -name fname-no-exist
          echo 2 > /proc/sys/vm/drop_caches
          repeat 10 times
      
      Results:
      
              orig		patched
      
      real	0m1.455s	real	0m1.355s
      user	0m0.206s	user	0m0.219s
      sys	0m0.855s	sys	0m0.807s
      
      real	0m1.487s	real	0m1.699s
      user	0m0.221s	user	0m0.256s
      sys	0m0.806s	sys	0m0.948s
      
      real	0m1.515s	real	0m1.505s
      user	0m0.183s	user	0m0.215s
      sys	0m0.876s	sys	0m0.858s
      
      real	0m1.291s	real	0m1.380s
      user	0m0.193s	user	0m0.198s
      sys	0m0.843s	sys	0m0.786s
      
      real	0m1.364s	real	0m1.374s
      user	0m0.180s	user	0m0.182s
      sys	0m0.868s	sys	0m0.806s
      
      real	0m1.352s	real	0m1.312s
      user	0m0.201s	user	0m0.212s
      sys	0m0.820s	sys	0m0.761s
      
      real	0m1.302s	real	0m1.349s
      user	0m0.205s	user	0m0.203s
      sys	0m0.803s	sys	0m0.792s
      
      real	0m1.334s	real	0m1.301s
      user	0m0.194s	user	0m0.201s
      sys	0m0.806s	sys	0m0.779s
      
      real	0m1.426s	real	0m1.434s
      user	0m0.216s	user	0m0.181s
      sys	0m0.824s	sys	0m0.864s
      
      real	0m1.350s	real	0m1.295s
      user	0m0.200s	user	0m0.190s
      sys	0m0.842s	sys	0m0.811s
      
      So it looks like the difference is not noticeable in this test.
      
      [cai@lca.pw: fix an use-after-free in kmemcg_workfn()]
        Link: http://lkml.kernel.org/r/1560977573-10715-1-git-send-email-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190611231813.3148843-9-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Signed-off-by: NQian Cai <cai@lca.pw>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0a3a24b
    • R
      mm: memcg/slab: introduce __memcg_kmem_uncharge_memcg() · 49a18eae
      Roman Gushchin 提交于
      Let's separate the page counter modification code out of
      __memcg_kmem_uncharge() in a way similar to what
      __memcg_kmem_charge() and __memcg_kmem_charge_memcg() work.
      
      This will allow to reuse this code later using a new
      memcg_kmem_uncharge_memcg() wrapper, which calls
      __memcg_kmem_uncharge_memcg() if memcg_kmem_enabled()
      check is passed.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-5-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49a18eae
    • J
      mm: memcontrol: dump memory.stat during cgroup OOM · c8713d0b
      Johannes Weiner 提交于
      The current cgroup OOM memory info dump doesn't include all the memory
      we are tracking, nor does it give insight into what the VM tried to do
      leading up to the OOM. All that useful info is in memory.stat.
      
      Furthermore, the recursive printing for every child cgroup can
      generate absurd amounts of data on the console for larger cgroup
      trees, and it's not like we provide a per-cgroup breakdown during
      global OOM kills.
      
      When an OOM kill is triggered, print one set of recursive memory.stat
      items at the level whose limit triggered the OOM condition.
      
      Example output:
      
          stress invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
          CPU: 2 PID: 210 Comm: stress Not tainted 5.2.0-rc2-mm1-00247-g47d49835983c #135
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
          Call Trace:
           dump_stack+0x46/0x60
           dump_header+0x4c/0x2d0
           oom_kill_process.cold.10+0xb/0x10
           out_of_memory+0x200/0x270
           ? try_to_free_mem_cgroup_pages+0xdf/0x130
           mem_cgroup_out_of_memory+0xb7/0xc0
           try_charge+0x680/0x6f0
           mem_cgroup_try_charge+0xb5/0x160
           __add_to_page_cache_locked+0xc6/0x300
           ? list_lru_destroy+0x80/0x80
           add_to_page_cache_lru+0x45/0xc0
           pagecache_get_page+0x11b/0x290
           filemap_fault+0x458/0x6d0
           ext4_filemap_fault+0x27/0x36
           __do_fault+0x2f/0xb0
           __handle_mm_fault+0x9c5/0x1140
           ? apic_timer_interrupt+0xa/0x20
           handle_mm_fault+0xc5/0x180
           __do_page_fault+0x1ab/0x440
           ? page_fault+0x8/0x30
           page_fault+0x1e/0x30
          RIP: 0033:0x55c32167fc10
          Code: Bad RIP value.
          RSP: 002b:00007fff1d031c50 EFLAGS: 00010206
          RAX: 000000000dc00000 RBX: 00007fd2db000010 RCX: 00007fd2db000010
          RDX: 0000000000000000 RSI: 0000000010001000 RDI: 0000000000000000
          RBP: 000055c321680a54 R08: 00000000ffffffff R09: 0000000000000000
          R10: 0000000000000022 R11: 0000000000000246 R12: ffffffffffffffff
          R13: 0000000000000002 R14: 0000000000001000 R15: 0000000010000000
          memory: usage 1024kB, limit 1024kB, failcnt 75131
          swap: usage 0kB, limit 9007199254740988kB, failcnt 0
          Memory cgroup stats for /foo:
          anon 0
          file 0
          kernel_stack 36864
          slab 274432
          sock 0
          shmem 0
          file_mapped 0
          file_dirty 0
          file_writeback 0
          anon_thp 0
          inactive_anon 126976
          active_anon 0
          inactive_file 0
          active_file 0
          unevictable 0
          slab_reclaimable 0
          slab_unreclaimable 274432
          pgfault 59466
          pgmajfault 1617
          workingset_refault 2145
          workingset_activate 0
          workingset_nodereclaim 0
          pgrefill 98952
          pgscan 200060
          pgsteal 59340
          pgactivate 40095
          pgdeactivate 96787
          pglazyfree 0
          pglazyfreed 0
          thp_fault_alloc 0
          thp_collapse_alloc 0
          Tasks state (memory values in pages):
          [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
          [    200]     0   200     1121      884    53248       29             0 bash
          [    209]     0   209      905      246    45056       19             0 stress
          [    210]     0   210    66442       56   499712    56349             0 stress
          oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),oom_memcg=/foo,task_memcg=/foo,task=stress,pid=210,uid=0
          Memory cgroup out of memory: Killed process 210 (stress) total-vm:265768kB, anon-rss:0kB, file-rss:224kB, shmem-rss:0kB
          oom_reaper: reaped process 210 (stress), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      [hannes@cmpxchg.org: s/kvmalloc/kmalloc/ per Michal]
        Link: http://lkml.kernel.org/r/20190605161133.GA12453@cmpxchg.org
      Link: http://lkml.kernel.org/r/20190604210509.9744-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c8713d0b
    • S
      mm, memcg: introduce memory.events.local · 1e577f97
      Shakeel Butt 提交于
      The memory controller in cgroup v2 exposes memory.events file for each
      memcg which shows the number of times events like low, high, max, oom
      and oom_kill have happened for the whole tree rooted at that memcg.
      Users can also poll or register notification to monitor the changes in
      that file.  Any event at any level of the tree rooted at memcg will
      notify all the listeners along the path till root_mem_cgroup.  There are
      existing users which depend on this behavior.
      
      However there are users which are only interested in the events
      happening at a specific level of the memcg tree and not in the events in
      the underlying tree rooted at that memcg.  One such use-case is a
      centralized resource monitor which can dynamically adjust the limits of
      the jobs running on a system.  The jobs can create their sub-hierarchy
      for their own sub-tasks.  The centralized monitor is only interested in
      the events at the top level memcgs of the jobs as it can then act and
      adjust the limits of the jobs.  Using the current memory.events for such
      centralized monitor is very inconvenient.  The monitor will keep
      receiving events which it is not interested and to find if the received
      event is interesting, it has to read memory.event files of the next
      level and compare it with the top level one.  So, let's introduce
      memory.events.local to the memcg which shows and notify for the events
      at the memcg level.
      
      Now, does memory.stat and memory.pressure need their local versions.  IMHO
      no due to the no internal process contraint of the cgroup v2.  The
      memory.stat file of the top level memcg of a job shows the stats and
      vmevents of the whole tree.  The local stats or vmevents of the top level
      memcg will only change if there is a process running in that memcg but v2
      does not allow that.  Similarly for memory.pressure there will not be any
      process in the internal nodes and thus no chance of local pressure.
      
      Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Chris Down <chris@chrisdown.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e577f97
    • S
      memcg, oom: no oom-kill for __GFP_RETRY_MAYFAIL · 38d38493
      Shakeel Butt 提交于
      The documentation of __GFP_RETRY_MAYFAIL clearly mentioned that the OOM
      killer will not be triggered and indeed the page alloc does not invoke OOM
      killer for such allocations.  However we do trigger memcg OOM killer for
      __GFP_RETRY_MAYFAIL.  Fix that.  This flag will used later to not trigger
      oom-killer in the charging path for fanotify and inotify event
      allocations.
      
      Link: http://lkml.kernel.org/r/20190514212259.156585-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38d38493
    • Y
      mm/memcontrol: fix wrong statistics in memory.stat · dd923990
      Yafang Shao 提交于
      When we calculate total statistics for memcg1_stats and memcg1_events,
      we use the the index 'i' in the for loop as the events index.  Actually
      we should use memcg1_stats[i] and memcg1_events[i] as the events index.
      
      Link: http://lkml.kernel.org/r/1562116978-19539-1-git-send-email-laoar.shao@gmail.com
      Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty").
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yafang Shao <shaoyafang@didiglobal.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd923990
  10. 03 7月, 2019 1 次提交
  11. 14 6月, 2019 1 次提交
    • J
      mm: memcontrol: don't batch updates of local VM stats and events · 815744d7
      Johannes Weiner 提交于
      The kernel test robot noticed a 26% will-it-scale pagefault regression
      from commit 42a30035 ("mm: memcontrol: fix recursive statistics
      correctness & scalabilty").  This appears to be caused by bouncing the
      additional cachelines from the new hierarchical statistics counters.
      
      We can fix this by getting rid of the batched local counters instead.
      
      Originally, there were *only* group-local counters, and they were fully
      maintained per cpu.  A reader of a stats file high up in the cgroup tree
      would have to walk the entire subtree and collect each level's per-cpu
      counters to get the recursive view.  This was prohibitively expensive,
      and so we switched to per-cpu batched updates of the local counters
      during a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting"), reducing the complexity from nr_subgroups *
      nr_cpus to nr_subgroups.
      
      With growing machines and cgroup trees, the tree walk itself became too
      expensive for monitoring top-level groups, and this is when the culprit
      patch added hierarchy counters on each cgroup level.  When the per-cpu
      batch size would be reached, both the local and the hierarchy counters
      would get batch-updated from the per-cpu delta simultaneously.
      
      This makes local and hierarchical counter reads blazingly fast, but it
      unfortunately makes the write-side too cache line intense.
      
      Since local counter reads were never a problem - we only centralized
      them to accelerate the hierarchy walk - and use of the local counters
      are becoming rarer due to replacement with hierarchical views (ongoing
      rework in the page reclaim and workingset code), we can make those local
      counters unbatched per-cpu counters again.
      
      The scheme will then be as such:
      
         when a memcg statistic changes, the writer will:
         - update the local counter (per-cpu)
         - update the batch counter (per-cpu). If the batch is full:
         - spill the batch into the group's atomic_t
         - spill the batch into all ancestors' atomic_ts
         - empty out the batch counter (per-cpu)
      
         when a local memcg counter is read, the reader will:
         - collect the local counter from all cpus
      
         when a hiearchy memcg counter is read, the reader will:
         - read the atomic_t
      
      We might be able to simplify this further and make the recursive
      counters unbatched per-cpu counters as well (batch upward propagation,
      but leave per-cpu collection to the readers), but that will require a
      more in-depth analysis and testing of all the callsites.  Deal with the
      immediate regression for now.
      
      Link: http://lkml.kernel.org/r/20190521151647.GB2870@cmpxchg.org
      Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: Nkernel test robot <rong.a.chen@intel.com>
      Tested-by: Nkernel test robot <rong.a.chen@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      815744d7
  12. 31 5月, 2019 1 次提交
    • T
      treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 157 · c942fddf
      Thomas Gleixner 提交于
      Based on 3 normalized pattern(s):
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license as published by
        the free software foundation either version 2 of the license or at
        your option any later version this program is distributed in the
        hope that it will be useful but without any warranty without even
        the implied warranty of merchantability or fitness for a particular
        purpose see the gnu general public license for more details
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license as published by
        the free software foundation either version 2 of the license or at
        your option any later version [author] [kishon] [vijay] [abraham]
        [i] [kishon]@[ti] [com] this program is distributed in the hope that
        it will be useful but without any warranty without even the implied
        warranty of merchantability or fitness for a particular purpose see
        the gnu general public license for more details
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license as published by
        the free software foundation either version 2 of the license or at
        your option any later version [author] [graeme] [gregory]
        [gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i]
        [kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema]
        [hk] [hemahk]@[ti] [com] this program is distributed in the hope
        that it will be useful but without any warranty without even the
        implied warranty of merchantability or fitness for a particular
        purpose see the gnu general public license for more details
      
      extracted by the scancode license scanner the SPDX license identifier
      
        GPL-2.0-or-later
      
      has been chosen to replace the boilerplate/reference in 1105 file(s).
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NAllison Randal <allison@lohutok.net>
      Reviewed-by: NRichard Fontana <rfontana@redhat.com>
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Cc: linux-spdx@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c942fddf
  13. 15 5月, 2019 10 次提交
  14. 06 4月, 2019 1 次提交
    • G
      mm: writeback: use exact memcg dirty counts · 0b3d6e6f
      Greg Thelen 提交于
      Since commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting") memcg dirty and writeback counters are managed
      as:
      
       1) per-memcg per-cpu values in range of [-32..32]
      
       2) per-memcg atomic counter
      
      When a per-cpu counter cannot fit in [-32..32] it's flushed to the
      atomic.  Stat readers only check the atomic.  Thus readers such as
      balance_dirty_pages() may see a nontrivial error margin: 32 pages per
      cpu.
      
      Assuming 100 cpus:
         4k x86 page_size:  13 MiB error per memcg
        64k ppc page_size: 200 MiB error per memcg
      
      Considering that dirty+writeback are used together for some decisions the
      errors double.
      
      This inaccuracy can lead to undeserved oom kills.  One nasty case is
      when all per-cpu counters hold positive values offsetting an atomic
      negative value (i.e.  per_cpu[*]=32, atomic=n_cpu*-32).
      balance_dirty_pages() only consults the atomic and does not consider
      throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
      13..200 MiB range then there's absolutely no dirty throttling, which
      burdens vmscan with only dirty+writeback pages thus resorting to oom
      kill.
      
      It could be argued that tiny containers are not supported, but it's more
      subtle.  It's the amount the space available for file lru that matters.
      If a container has memory.max-200MiB of non reclaimable memory, then it
      will also suffer such oom kills on a 100 cpu machine.
      
      The following test reliably ooms without this patch.  This patch avoids
      oom kills.
      
        $ cat test
        mount -t cgroup2 none /dev/cgroup
        cd /dev/cgroup
        echo +io +memory > cgroup.subtree_control
        mkdir test
        cd test
        echo 10M > memory.max
        (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
        (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)
      
        $ cat memcg-writeback-stress.c
        /*
         * Dirty pages from all but one cpu.
         * Clean pages from the non dirtying cpu.
         * This is to stress per cpu counter imbalance.
         * On a 100 cpu machine:
         * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
         * - per memcg atomic is -99*32 pages
         * - thus the complete dirty limit: sum of all counters 0
         * - balance_dirty_pages() only sees atomic count -99*32 pages, which
         *   it max()s to 0.
         * - So a workload can dirty -99*32 pages before balance_dirty_pages()
         *   cares.
         */
        #define _GNU_SOURCE
        #include <err.h>
        #include <fcntl.h>
        #include <sched.h>
        #include <stdlib.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <sys/sysinfo.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        static char *buf;
        static int bufSize;
      
        static void set_affinity(int cpu)
        {
        	cpu_set_t affinity;
      
        	CPU_ZERO(&affinity);
        	CPU_SET(cpu, &affinity);
        	if (sched_setaffinity(0, sizeof(affinity), &affinity))
        		err(1, "sched_setaffinity");
        }
      
        static void dirty_on(int output_fd, int cpu)
        {
        	int i, wrote;
      
        	set_affinity(cpu);
        	for (i = 0; i < 32; i++) {
        		for (wrote = 0; wrote < bufSize; ) {
        			int ret = write(output_fd, buf+wrote, bufSize-wrote);
        			if (ret == -1)
        				err(1, "write");
        			wrote += ret;
        		}
        	}
        }
      
        int main(int argc, char **argv)
        {
        	int cpu, flush_cpu = 1, output_fd;
        	const char *output;
      
        	if (argc != 2)
        		errx(1, "usage: output_file");
      
        	output = argv[1];
        	bufSize = getpagesize();
        	buf = malloc(getpagesize());
        	if (buf == NULL)
        		errx(1, "malloc failed");
      
        	output_fd = open(output, O_CREAT|O_RDWR);
        	if (output_fd == -1)
        		err(1, "open(%s)", output);
      
        	for (cpu = 0; cpu < get_nprocs(); cpu++) {
        		if (cpu != flush_cpu)
        			dirty_on(output_fd, cpu);
        	}
      
        	set_affinity(flush_cpu);
        	if (fsync(output_fd))
        		err(1, "fsync(%s)", output);
        	if (close(output_fd))
        		err(1, "close(%s)", output);
        	free(buf);
        }
      
      Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
      collect exact per memcg counters.  This avoids the aforementioned oom
      kills.
      
      This does not affect the overhead of memory.stat, which still reads the
      single atomic counter.
      
      Why not use percpu_counter? memcg already handles cpus going offline, so
      no need for that overhead from percpu_counter.  And the percpu_counter
      spinlocks are more heavyweight than is required.
      
      It probably also makes sense to use exact dirty and writeback counters
      in memcg oom reports.  But that is saved for later.
      
      Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.comSigned-off-by: NGreg Thelen <gthelen@google.com>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.16+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b3d6e6f
  15. 06 3月, 2019 2 次提交