1. 19 April 2020, 1 commit
    • memcontrol.h: Replace zero-length array with flexible-array member · 307ed94c
      Committed by Gustavo A. R. Silva
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these ones is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent some kind of undefined behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
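
      As a minimal illustration of the quoted quirk (hypothetical struct
      names, not taken from the kernel):

      struct boo { long x; };

      /* old style: GNU zero-length array */
      struct old_way {
              int stuff;
              struct boo array[0];
      };

      /* preferred: C99 flexible array member */
      struct new_way {
              int stuff;
              struct boo array[];
      };

      /*
       * sizeof(struct old_way) == sizeof(struct new_way): the trailing array
       * contributes no size either way, so an allocation such as
       * kmalloc(sizeof(struct new_way) + n * sizeof(struct boo), GFP_KERNEL)
       * is unchanged by the conversion.  sizeof on the zero-length member
       * evaluates to 0 (the quirk quoted above), while sizeof on the flexible
       * member is a compile-time error because it has incomplete type, and
       * the compiler also complains if the flexible member is not the last
       * member of the structure.
       */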
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
  2. 03 April 2020, 5 commits
  3. 30 March 2020, 1 commit
    • mm: fork: fix kernel_stack memcg stats for various stack implementations · 8380ce47
      Committed by Roman Gushchin
      Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the
      space for task stacks can be allocated using __vmalloc_node_range(),
      alloc_pages_node() and kmem_cache_alloc_node().
      
      In the first and second cases the page->mem_cgroup pointer is set, but
      in the third it's not: memcg membership of a slab page should be
      determined using the memcg_from_slab_page() function, which looks at
      page->slab_cache->memcg_params.memcg.  In this case, using
      mod_memcg_page_state() (as in account_kernel_stack()) is incorrect: the
      page->mem_cgroup pointer is NULL even for pages charged to a non-root
      memory cgroup.
      
      It can lead to kernel_stack per-memcg counters permanently showing 0 on
      some architectures (depending on the configuration).
      
      In order to fix it, let's introduce a mod_memcg_obj_state() helper,
      which takes a pointer to a kernel object as its first argument, uses
      mem_cgroup_from_obj() to get an RCU-protected memcg pointer and calls
      mod_memcg_state().  It allows handling all possible configurations
      (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
      spilling any memcg/kmem specifics into fork.c.
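
      A minimal sketch of what such a helper can look like (building only on
      the mem_cgroup_from_obj() and mod_memcg_state() interfaces named above;
      not necessarily the exact code that was merged):

      void mod_memcg_obj_state(void *p, int idx, int val)
      {
              struct mem_cgroup *memcg;

              rcu_read_lock();
              memcg = mem_cgroup_from_obj(p);  /* works for slab and non-slab pages */
              if (memcg)
                      mod_memcg_state(memcg, idx, val);
              rcu_read_unlock();
      }

      account_kernel_stack() can then pass the stack pointer itself, without
      caring which allocator backed it.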
      
      Note: This is a special version of the patch created for stable
      backports.  It contains code from the following two patches:
        - mm: memcg/slab: introduce mem_cgroup_from_obj()
        - mm: fork: fix kernel_stack memcg stats for various stack implementations
      
      [guro@fb.com: introduce mem_cgroup_from_obj()]
        Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
      Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 02 December 2019, 3 commits
  5. 01 December 2019, 3 commits
  6. 08 October 2019, 4 commits
    • mm, memcg: make scan aggression always exclude protection · 1bc63fb1
      Committed by Chris Down
      This patch is an incremental improvement on the existing
      memory.{low,min} relative reclaim work to base its scan pressure
      calculations on how much protection is available compared to the current
      usage, rather than how much the current usage is over some protection
      threshold.
      
      In the normal case, this doesn't change the user-visible behaviour very
      much.  One benefit is that it replaces the (somewhat arbitrary)
      100% cutoff with an indefinite slope, which makes it easier to ballpark
      a memory.low value.
      
      As well as this, the old methodology doesn't quite apply generically to
      machines with varying amounts of physical memory.  Let's say we have a
      top level cgroup, workload.slice, and another top level cgroup,
      system-management.slice.  We want to roughly give 12G to
      system-management.slice, so on a 32GB machine we set memory.low to 20GB
      in workload.slice, and on a 64GB machine we set memory.low to 52GB.
      However, because these are relative amounts to the total machine size,
      while the amount of memory we want to generally be willing to yield to
      system.slice is absolute (12G), we end up putting more pressure on
      system.slice just because we have a larger machine and a larger workload
      to fill it, which seems fairly unintuitive.  With this new behaviour, we
      don't end up with this unintended side effect.
      
      Previously the way that memory.low protection works is that if you are
      50% over a certain baseline, you get 50% of your normal scan pressure.
      This is certainly better than the previous cliff-edge behaviour, but it
      can be improved even further by always considering memory under the
      currently enforced protection threshold to be out of bounds.  This means
      that we can set relatively low memory.low thresholds for variable or
      bursty workloads while still getting a reasonable level of protection,
      whereas with the previous version we may still trivially hit the 100%
      clamp.  The previous 100% clamp is also somewhat arbitrary, whereas this
      one is more concretely based on the currently enforced protection
      threshold, which is likely easier to reason about.
      
      There is also a subtle issue with the way that proportional reclaim
      worked previously -- it promotes having no memory.low, since it makes
      pressure higher during low reclaim.  This happens because we base our
      scan pressure modulation on how far memory.current is between memory.min
      and memory.low, but if memory.low is unset, we only use the overage
      method.  In most cromulent configurations, this then means that we end
      up with *more* pressure than with no memory.low at all when we're in low
      reclaim, which is not really very usable or expected.
      
      With this patch, memory.low and memory.min affect reclaim pressure in a
      more understandable and composable way.  For example, from a user
      standpoint, "protected" memory now remains untouchable from a reclaim
      aggression standpoint, and users can also have more confidence that
      bursty workloads will still receive some amount of guaranteed
      protection.
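
      In get_scan_count() terms the idea is roughly the following (a
      simplified sketch, not the literal kernel code; lruvec_size is the size
      of the LRU list being scanned, cgroup_size the memcg's current usage
      and protection the enforced memory.min/memory.low value):

      /* usage can race below the protection value; clamp to avoid underflow */
      cgroup_size = max(cgroup_size, protection);

      /* only the unprotected part of the usage is eligible for scanning */
      scan = lruvec_size - lruvec_size * protection / cgroup_size;

      Memory below the protection threshold never contributes to the scan
      target, and pressure rises smoothly, with no 100% cliff, as usage
      exceeds it.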
      
      Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: make memory.emin the baseline for utilisation determination · 9de7ca46
      Committed by Chris Down
      Roman points out that when we do the low reclaim pass, we scale the
      reclaim pressure relative to position between 0 and the maximum
      protection threshold.
      
      However, if the maximum protection is based on memory.elow, and
      memory.emin is above zero, this means we still may get binary behaviour
      on second-pass low reclaim.  This is because we scale starting at 0, not
      starting at memory.emin, and since we don't scan at all below emin, we
      end up with cliff behaviour.
      
      This should be a fairly uncommon case since usually we don't go into the
      second pass, but it makes sense to scale our low reclaim pressure
      starting at emin.
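
      Concretely, the baseline used for scaling can be picked roughly like
      this (a sketch of the idea using the effective emin/elow values; not
      necessarily the exact hunk that was merged):

      static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
                                                        bool in_low_reclaim)
      {
              if (mem_cgroup_disabled())
                      return 0;

              /* second-pass (low-breaching) reclaim scales from emin, not from 0 */
              if (in_low_reclaim)
                      return READ_ONCE(memcg->memory.emin);

              return max(READ_ONCE(memcg->memory.emin),
                         READ_ONCE(memcg->memory.elow));
      }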
      
      You can test this by catting two large sparse files, one in a cgroup
      with emin set to some moderate size compared to physical RAM, and
      another cgroup without any emin.  In both cgroups, set an elow larger
      than 50% of physical RAM.  The one with emin will have less page
      scanning, as reclaim pressure is lower.
      
      Rebase on top of and apply the same idea as what was applied to handle
      cgroup_memory=disable properly for the original proportional patch
      http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm,
      memcg: Handle cgroup_disable=memory when getting memcg protection").
      
      Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Suggested-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: proportional memory.{low,min} reclaim · 9783aa99
      Committed by Chris Down
      cgroup v2 introduces two memory protection thresholds: memory.low
      (best-effort) and memory.min (hard protection).  While they generally do
      what they say on the tin, there is a limitation in their implementation
      that makes them difficult to use effectively: that cliff behaviour often
      manifests when they become eligible for reclaim.  This patch implements
      more intuitive and usable behaviour, where we gradually mount more
      reclaim pressure as cgroups further and further exceed their protection
      thresholds.
      
      This cliff edge behaviour happens because we only choose whether or not
      to reclaim based on whether the memcg is within its protection limits
      (see the use of mem_cgroup_protected in shrink_node), but we don't vary
      our reclaim behaviour based on this information.  Imagine the following
      timeline, with the numbers the lruvec size in this zone:
      
      1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
      2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
      3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
         scanned. (?!)
      
      * Of course, we won't usually scan all available pages in the zone even
        without this patch because of scan control priority, over-reclaim
        protection, etc.  However, as shown by the tests at the end, these
        techniques don't sufficiently throttle such an extreme change in input,
        so cliff-like behaviour isn't really averted by their existence alone.
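
      As a rough worked example (illustrative numbers only): the follow-up
      "make scan aggression always exclude protection" entry above summarises
      this patch's behaviour as "if you are 50% over a certain baseline, you
      get 50% of your normal scan pressure", so step 3 in the timeline stops
      being a cliff:

         scan ~= lruvec_size * (memory.current - memory.low) / memory.low
               = 1000001 * (1000001 - 1000000) / 1000000
              ~= 1 page

      (clamped to at most the full lruvec size), instead of all 1000001 pages
      suddenly becoming eligible the moment protection is exceeded.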
      
      Here's an example of how this plays out in practice.  At Facebook, we are
      trying to protect various workloads from "system" software, like
      configuration management tools, metric collectors, etc (see this[0] case
      study).  In order to find a suitable memory.low value, we start by
      determining the expected memory range within which the workload will be
      comfortable operating.  This isn't an exact science -- memory usage deemed
      "comfortable" will vary over time due to user behaviour, differences in
      composition of work, etc, etc.  As such we need to ballpark memory.low,
      but doing this is currently problematic:
      
      1. If we end up setting it too low for the workload, it won't have
         *any* effect (see discussion above).  The group will receive the full
         weight of reclaim and won't have any priority while competing with the
         less important system software, as if we had no memory.low configured
         at all.
      
      2. Because of this behaviour, we end up erring on the side of setting
         it too high, such that the comfort range is reliably covered.  However,
         protected memory is completely unavailable to the rest of the system,
         so we might cause undue memory and IO pressure there when we *know* we
         have some elasticity in the workload.
      
      3. Even if we get the value totally right, smack in the middle of the
         comfort zone, we get extreme jumps between no pressure and full
         pressure that cause unpredictable pressure spikes in the workload due
         to the current binary reclaim behaviour.
      
      With this patch, we can set it to our ballpark estimation without too much
      worry.  Any undesirable behaviour, such as too much or too little reclaim
      pressure on the workload or system will be proportional to how far our
      estimation is off.  This means we can set memory.low much more
      conservatively and thus waste less resources *without* the risk of the
      workload falling off a cliff if we overshoot.
      
      As a more abstract technical description, this unintuitive behaviour
      results in having to give high-priority workloads a large protection
      buffer on top of their expected usage to function reliably, as otherwise
      we have abrupt periods of dramatically increased memory pressure which
      hamper performance.  Having to set these thresholds so high wastes
      resources and generally works against the principle of work conservation.
      In addition, having proportional memory reclaim behaviour has other
      benefits.  Most notably, before this patch it's basically mandatory to set
      memory.low to a higher than desirable value because otherwise as soon as
      you exceed memory.low, all protection is lost, and all pages are eligible
      to scan again.  By contrast, having a gradual ramp in reclaim pressure
      means that you now still get some protection when thresholds are exceeded,
      which means that one can now be more comfortable setting memory.low to
      lower values without worrying that all protection will be lost.  This is
      important because workingset size is really hard to know exactly,
      especially with variable workloads, so at least getting *some* protection
      if your workingset size grows larger than you expect increases user
      confidence in setting memory.low without a huge buffer on top being
      needed.
      
      Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
      assistance in thinking about how to make this work better.
      
      In testing these changes, I intended to verify that:
      
      1. Changes in page scanning become gradual and proportional instead of
         binary.
      
         To test this, I experimented stepping further and further down
         memory.low protection on a workload that floats around 19G workingset
         when under memory.low protection, watching page scan rates for the
         workload cgroup:
      
         +------------+-----------------+--------------------+--------------+
         | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
         +------------+-----------------+--------------------+--------------+
         |        21G |               0 |                  0 | N/A          |
         |        17G |             867 |               3799 | 23%          |
         |        12G |            1203 |               3543 | 34%          |
         |         8G |            2534 |               3979 | 64%          |
         |         4G |            3980 |               4147 | 96%          |
         |          0 |            3799 |               3980 | 95%          |
         +------------+-----------------+--------------------+--------------+
      
         As you can see, the test kernel (with a kernel containing this
         patch) ramps up page scanning significantly more gradually than the
         control kernel (without this patch).
      
      2. More gradual ramp up in reclaim aggression doesn't result in
         premature OOMs.
      
         To test this, I wrote a script that slowly increments the number of
         pages held by stress(1)'s --vm-keep mode until a production system
         entered severe overall memory contention.  This script runs in a highly
         protected slice taking up the majority of available system memory.
         Watching vmstat revealed that page scanning continued essentially
         nominally between test and control, without causing forward reclaim
         progress to become arrested.
      
      [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
      
      [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
      [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
        Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
      Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: only record foreign writebacks with dirty pages when memcg is not disabled · 08d1d0e6
      Committed by Baoquan He
      In the kdump kernel, memcg is usually disabled with 'cgroup_disable=memory'
      to save memory.  Now the kdump kernel always panics when dumping vmcore
      to a local disk:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000ab8
        Oops: 0000 [#1] SMP NOPTI
        CPU: 0 PID: 598 Comm: makedumpfile Not tainted 5.3.0+ #26
        Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 10/02/2018
        RIP: 0010:mem_cgroup_track_foreign_dirty_slowpath+0x38/0x140
        Call Trace:
         __set_page_dirty+0x52/0xc0
         iomap_set_page_dirty+0x50/0x90
         iomap_write_end+0x6e/0x270
         iomap_write_actor+0xce/0x170
         iomap_apply+0xba/0x11e
         iomap_file_buffered_write+0x62/0x90
         xfs_file_buffered_aio_write+0xca/0x320 [xfs]
         new_sync_write+0x12d/0x1d0
         vfs_write+0xa5/0x1a0
         ksys_write+0x59/0xd0
         do_syscall_64+0x59/0x1e0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      And this will corrupt the 1st kernel too with 'cgroup_disable=memory'.
      
      The trace and some debugging point to commit 97b27821
      ("writeback, memcg: Implement foreign dirty flushing"), which introduced
      this regression.  Disabling memcg causes the NULL pointer dereference on
      uninitialized data in mem_cgroup_track_foreign_dirty_slowpath().

      Fix it by returning early when memcg is disabled, rather than trying to
      record foreign writebacks for dirty pages.
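
      A sketch of how the fast-path check can look in memcontrol.h (assuming
      the existing mem_cgroup_disabled() helper; not necessarily the exact
      merged hunk):

      static inline void mem_cgroup_track_foreign_dirty(struct page *page,
                                                        struct bdi_writeback *wb)
      {
              /* with cgroup_disable=memory there is no memcg state to consult */
              if (mem_cgroup_disabled())
                      return;

              if (unlikely(&page->mem_cgroup->css != wb->memcg_css))
                      mem_cgroup_track_foreign_dirty_slowpath(page, wb);
      }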
      
      Link: http://lkml.kernel.org/r/20190924141928.GD31919@MiWiFi-R3L-srv
      Fixes: 97b27821 ("writeback, memcg: Implement foreign dirty flushing")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 25 September 2019, 2 commits
  8. 27 August 2019, 1 commit
    • writeback, memcg: Implement foreign dirty flushing · 97b27821
      Committed by Tejun Heo
      There's an inherent mismatch between memcg and writeback.  The former
      tracks ownership per-page while the latter tracks it per-inode.  This
      was a deliberate design decision because honoring per-page ownership in
      the writeback path is complicated, may lead to higher CPU and IO
      overheads, and was deemed unnecessary given that write-sharing an inode
      across different cgroups isn't a common use-case.
      
      Combined with inode majority-writer ownership switching, this works
      well enough in most cases but there are some pathological cases.  For
      example, let's say there are two cgroups A and B which keep writing to
      different but confined parts of the same inode.  B owns the inode and
      A's memory is limited far below B's.  A's dirty ratio can rise enough
      to trigger balance_dirty_pages() sleeps but B's can be low enough to
      avoid triggering background writeback.  A will be slowed down without
      a way to make writeback of the dirty pages happen.
      
      This patch implements foreign dirty recording and a foreign flushing
      mechanism so that when a memcg encounters a condition like the above it
      can trigger flushes on the bdi_writebacks which can clean its pages.
      Please see the comment on top of mem_cgroup_track_foreign_dirty_slowpath()
      for details.
      
      A reproducer follows.
      
      write-range.c::
      
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <sys/types.h>
      
        static const char *usage = "write-range FILE START SIZE\n";
      
        int main(int argc, char **argv)
        {
      	  int fd;
      	  unsigned long start, size, end, pos;
      	  char *endp;
      	  char buf[4096];
      
      	  if (argc < 4) {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  fd = open(argv[1], O_WRONLY);
      	  if (fd < 0) {
      		  perror("open");
      		  return 1;
      	  }
      
      	  start = strtoul(argv[2], &endp, 0);
      	  if (*endp != '\0') {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  size = strtoul(argv[3], &endp, 0);
      	  if (*endp != '\0') {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  end = start + size;
      
      	  while (1) {
      		  for (pos = start; pos < end; ) {
      			  long bread, bwritten = 0;
      
      			  if (lseek(fd, pos, SEEK_SET) < 0) {
      				  perror("lseek");
      				  return 1;
      			  }
      
      			  bread = read(0, buf, sizeof(buf) < end - pos ?
      					       sizeof(buf) : end - pos);
      			  if (bread < 0) {
      				  perror("read");
      				  return 1;
      			  }
      			  if (bread == 0)
      				  return 0;
      
      			  while (bwritten < bread) {
      				  long this;
      
      				  this = write(fd, buf + bwritten,
      					       bread - bwritten);
      				  if (this < 0) {
      					  perror("write");
      					  return 1;
      				  }
      
      				  bwritten += this;
      			  }
      			  /* advance by the completed chunk, not cumulatively */
      			  pos += bwritten;
      		  }
      	  }
        }
      
      repro.sh::
      
        #!/bin/bash
      
        set -e
        set -x
      
        sysctl -w vm.dirty_expire_centisecs=300000
        sysctl -w vm.dirty_writeback_centisecs=300000
        sysctl -w vm.dirtytime_expire_seconds=300000
        echo 3 > /proc/sys/vm/drop_caches
      
        TEST=/sys/fs/cgroup/test
        A=$TEST/A
        B=$TEST/B
      
        mkdir -p $A $B
        echo "+memory +io" > $TEST/cgroup.subtree_control
        echo $((1<<30)) > $A/memory.high
        echo $((32<<30)) > $B/memory.high
      
        rm -f testfile
        touch testfile
        fallocate -l 4G testfile
      
        echo "Starting B"
      
        (echo $BASHPID > $B/cgroup.procs
         pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &
      
        echo "Waiting 10s to ensure B claims the testfile inode"
        sleep 5
        sync
        sleep 5
        sync
        echo "Starting A"
      
        (echo $BASHPID > $A/cgroup.procs
         pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))
      
      v2: Added comments explaining why the specific intervals are being used.
      
      v3: Use 0 @nr when calling cgroup_writeback_by_id() to use best-effort
          flushing while avoiding possible livelocks.
      
      v4: Use get_jiffies_64() and time_before/after64() instead of raw
          jiffies_64 and arithmetic comparisons as suggested by Jan.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 14 August 2019, 1 commit
  10. 13 July 2019, 3 commits
    • mm, oom: remove redundant task_in_mem_cgroup() check · 6ba749ee
      Committed by Shakeel Butt
      oom_unkillable_task() can be called from three different contexts, i.e.
      global OOM, memcg OOM and the oom_score procfs interface.  At the moment
      oom_unkillable_task() does a task_in_mem_cgroup() check on the given
      process.  Since there is no reason to perform the task_in_mem_cgroup()
      check for global OOM and the oom_score procfs interface, those contexts
      provide a NULL memcg and skip the task_in_mem_cgroup() check.  However,
      in the memcg OOM context, oom_unkillable_task() is always called from
      mem_cgroup_scan_tasks(), so the task_in_mem_cgroup() check becomes
      redundant and effectively dead code.  So, just remove the
      task_in_mem_cgroup() check altogether.
      
      Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: introduce __memcg_kmem_uncharge_memcg() · 49a18eae
      Committed by Roman Gushchin
      Let's separate the page counter modification code out of
      __memcg_kmem_uncharge() in a way similar to how
      __memcg_kmem_charge() and __memcg_kmem_charge_memcg() work.

      This will allow reusing this code later through a new
      memcg_kmem_uncharge_memcg() wrapper, which calls
      __memcg_kmem_uncharge_memcg() if the memcg_kmem_enabled()
      check passes.
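
      The separated helper can be sketched roughly as follows (mirroring the
      charge-side split; the exact counter handling is an assumption, not a
      quote of the merged patch):

      void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
                                       unsigned int nr_pages)
      {
              /* the kmem counter is only tracked separately on the legacy hierarchy */
              if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
                      page_counter_uncharge(&memcg->kmem, nr_pages);

              page_counter_uncharge(&memcg->memory, nr_pages);
              if (do_memsw_account())
                      page_counter_uncharge(&memcg->memsw, nr_pages);
      }

      __memcg_kmem_uncharge() then only has to look up the page's memcg and
      call this helper, and a memcg_kmem_uncharge_memcg() wrapper can guard it
      with memcg_kmem_enabled().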
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-5-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: introduce memory.events.local · 1e577f97
      Committed by Shakeel Butt
      The memory controller in cgroup v2 exposes memory.events file for each
      memcg which shows the number of times events like low, high, max, oom
      and oom_kill have happened for the whole tree rooted at that memcg.
      Users can also poll or register notification to monitor the changes in
      that file.  Any event at any level of the tree rooted at memcg will
      notify all the listeners along the path till root_mem_cgroup.  There are
      existing users which depend on this behavior.
      
      However there are users which are only interested in the events
      happening at a specific level of the memcg tree and not in the events in
      the underlying tree rooted at that memcg.  One such use-case is a
      centralized resource monitor which can dynamically adjust the limits of
      the jobs running on a system.  The jobs can create their sub-hierarchy
      for their own sub-tasks.  The centralized monitor is only interested in
      the events at the top level memcgs of the jobs as it can then act and
      adjust the limits of the jobs.  Using the current memory.events for such
      a centralized monitor is very inconvenient.  The monitor will keep
      receiving events in which it is not interested, and to find out whether
      a received event is interesting, it has to read the memory.events files
      of the next level and compare them with the top level one.  So, let's
      introduce memory.events.local, which shows and notifies only for events
      at the memcg's own level.
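
      Conceptually the event accounting then splits into a local count and the
      existing hierarchical propagation, roughly along these lines (a sketch;
      the memory_events_local/events_local_file field names are assumed for
      illustration):

      static inline void memcg_memory_event(struct mem_cgroup *memcg,
                                            enum memcg_memory_event event)
      {
              /* local count: this memcg only, backs memory.events.local */
              atomic_long_inc(&memcg->memory_events_local[event]);
              cgroup_file_notify(&memcg->events_local_file);

              /* hierarchical count: this memcg and all of its ancestors */
              do {
                      atomic_long_inc(&memcg->memory_events[event]);
                      cgroup_file_notify(&memcg->events_file);
              } while ((memcg = parent_mem_cgroup(memcg)) &&
                       !mem_cgroup_is_root(memcg));
      }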
      
      Now, do memory.stat and memory.pressure need local versions as well?  IMHO
      no, due to the "no internal processes" constraint of cgroup v2.  The
      memory.stat file of the top level memcg of a job shows the stats and
      vmevents of the whole tree.  The local stats or vmevents of the top level
      memcg would only change if there were a process running in that memcg, but
      v2 does not allow that.  Similarly for memory.pressure, there will not be
      any process in the internal nodes and thus no chance of local pressure.
      
      Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 14 June 2019, 1 commit
    • mm: memcontrol: don't batch updates of local VM stats and events · 815744d7
      Committed by Johannes Weiner
      The kernel test robot noticed a 26% will-it-scale pagefault regression
      from commit 42a30035 ("mm: memcontrol: fix recursive statistics
      correctness & scalabilty").  This appears to be caused by bouncing the
      additional cachelines from the new hierarchical statistics counters.
      
      We can fix this by getting rid of the batched local counters instead.
      
      Originally, there were *only* group-local counters, and they were fully
      maintained per cpu.  A reader of a stats file high up in the cgroup tree
      would have to walk the entire subtree and collect each level's per-cpu
      counters to get the recursive view.  This was prohibitively expensive,
      and so we switched to per-cpu batched updates of the local counters
      during a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting"), reducing the complexity from nr_subgroups *
      nr_cpus to nr_subgroups.
      
      With growing machines and cgroup trees, the tree walk itself became too
      expensive for monitoring top-level groups, and this is when the culprit
      patch added hierarchy counters on each cgroup level.  When the per-cpu
      batch size would be reached, both the local and the hierarchy counters
      would get batch-updated from the per-cpu delta simultaneously.
      
      This makes local and hierarchical counter reads blazingly fast, but it
      unfortunately makes the write-side too cache line intense.
      
      Since local counter reads were never a problem - we only centralized
      them to accelerate the hierarchy walk - and use of the local counters
      is becoming rarer due to replacement with hierarchical views (ongoing
      rework in the page reclaim and workingset code), we can make those local
      counters unbatched per-cpu counters again.
      
      The scheme will then be as such:
      
         when a memcg statistic changes, the writer will:
         - update the local counter (per-cpu)
         - update the batch counter (per-cpu). If the batch is full:
         - spill the batch into the group's atomic_t
         - spill the batch into all ancestors' atomic_ts
         - empty out the batch counter (per-cpu)
      
         when a local memcg counter is read, the reader will:
         - collect the local counter from all cpus
      
         when a hierarchy memcg counter is read, the reader will:
         - read the atomic_t
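
      A sketch of the write side under this scheme (field names are assumed
      for illustration; the struct layout in the merged patch may differ):

      void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
      {
              long x;

              if (mem_cgroup_disabled())
                      return;

              /* local counter: plain per-cpu, never batched */
              __this_cpu_add(memcg->vmstats_local->stat[idx], val);

              /* hierarchical counters: batched per-cpu, spilled upwards */
              x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
              if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
                      struct mem_cgroup *mi;

                      for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
                              atomic_long_add(x, &mi->vmstats[idx]);
                      x = 0;
              }
              __this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
      }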
      
      We might be able to simplify this further and make the recursive
      counters unbatched per-cpu counters as well (batch upward propagation,
      but leave per-cpu collection to the readers), but that will require a
      more in-depth analysis and testing of all the callsites.  Deal with the
      immediate regression for now.
      
      Link: http://lkml.kernel.org/r/20190521151647.GB2870@cmpxchg.org
      Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Tested-by: kernel test robot <rong.a.chen@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 02 June 2019, 1 commit
    • mm, memcg: consider subtrees in memory.events · 9852ae3f
      Committed by Chris Down
      memory.stat and other files already consider subtrees in their output, and
      we should too in order to not present an inconsistent interface.
      
      The current situation is fairly confusing, because people interacting with
      cgroups expect hierarchical behaviour in the vein of memory.stat,
      cgroup.events, and other files.  For example, this causes confusion when
      debugging reclaim events under low, as currently these always read "0" at
      non-leaf memcg nodes, which frequently causes people to misdiagnose breach
      behaviour.  The same confusion applies to other counters in this file when
      debugging issues.
      
      Aggregation is done at write time instead of at read-time since these
      counters aren't hot (unlike memory.stat which is per-page, so it does it
      at read time), and it makes sense to bundle this with the file
      notifications.
      
      After this patch, events are propagated up the hierarchy:
      
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 0
          oom 0
          oom_kill 0
          [root@ktst ~]# systemd-run -p MemoryMax=1 true
          Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 7
          oom 1
          oom_kill 1
      
      As this is a change in behaviour, this can be reverted to the old
      behaviour by mounting with the `memory_localevents' flag set.  However, we
      use the new behaviour by default as there's a lack of evidence that there
      are any current users of memory.events that would find this change
      undesirable.
      
      akpm: this is a behaviour change, so Cc:stable.  This is so that
      forthcoming distros which use cgroup v2 are more likely to pick up the
      revised behaviour.
      
      Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 31 May 2019, 1 commit
    • treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 157 · c942fddf
      Committed by Thomas Gleixner
      Based on 3 normalized pattern(s):
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license as published by
        the free software foundation either version 2 of the license or at
        your option any later version this program is distributed in the
        hope that it will be useful but without any warranty without even
        the implied warranty of merchantability or fitness for a particular
        purpose see the gnu general public license for more details
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license as published by
        the free software foundation either version 2 of the license or at
        your option any later version [author] [kishon] [vijay] [abraham]
        [i] [kishon]@[ti] [com] this program is distributed in the hope that
        it will be useful but without any warranty without even the implied
        warranty of merchantability or fitness for a particular purpose see
        the gnu general public license for more details
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license as published by
        the free software foundation either version 2 of the license or at
        your option any later version [author] [graeme] [gregory]
        [gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i]
        [kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema]
        [hk] [hemahk]@[ti] [com] this program is distributed in the hope
        that it will be useful but without any warranty without even the
        implied warranty of merchantability or fitness for a particular
        purpose see the gnu general public license for more details
      
      extracted by the scancode license scanner the SPDX license identifier
      
        GPL-2.0-or-later
      
      has been chosen to replace the boilerplate/reference in 1105 file(s).
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Allison Randal <allison@lohutok.net>
      Reviewed-by: Richard Fontana <rfontana@redhat.com>
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: linux-spdx@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  14. 15 May 2019, 7 commits
  15. 06 April 2019, 1 commit
    • mm: writeback: use exact memcg dirty counts · 0b3d6e6f
      Committed by Greg Thelen
      Since commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting") memcg dirty and writeback counters are managed
      as:
      
       1) per-memcg per-cpu values in range of [-32..32]
      
       2) per-memcg atomic counter
      
      When a per-cpu counter cannot fit in [-32..32] it's flushed to the
      atomic.  Stat readers only check the atomic.  Thus readers such as
      balance_dirty_pages() may see a nontrivial error margin: 32 pages per
      cpu.
      
      Assuming 100 cpus:
         4k x86 page_size:  13 MiB error per memcg
        64k ppc page_size: 200 MiB error per memcg
      
      Considering that dirty+writeback are used together for some decisions the
      errors double.
      
      This inaccuracy can lead to undeserved oom kills.  One nasty case is
      when all per-cpu counters hold positive values offsetting an atomic
      negative value (i.e.  per_cpu[*]=32, atomic=n_cpu*-32).
      balance_dirty_pages() only consults the atomic and does not consider
      throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
      13..200 MiB range then there's absolutely no dirty throttling, which
      burdens vmscan with only dirty+writeback pages thus resorting to oom
      kill.
      
      It could be argued that tiny containers are not supported, but it's more
      subtle.  It's the amount of space available for the file lru that matters.
      If a container has memory.max-200MiB of non-reclaimable memory, then it
      will also suffer such oom kills on a 100 cpu machine.
      
      The following test reliably ooms without this patch.  This patch avoids
      oom kills.
      
        $ cat test
        mount -t cgroup2 none /dev/cgroup
        cd /dev/cgroup
        echo +io +memory > cgroup.subtree_control
        mkdir test
        cd test
        echo 10M > memory.max
        (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
        (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)
      
        $ cat memcg-writeback-stress.c
        /*
         * Dirty pages from all but one cpu.
         * Clean pages from the non dirtying cpu.
         * This is to stress per cpu counter imbalance.
         * On a 100 cpu machine:
         * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
         * - per memcg atomic is -99*32 pages
         * - thus the complete dirty limit: sum of all counters 0
         * - balance_dirty_pages() only sees atomic count -99*32 pages, which
         *   it max()s to 0.
         * - So a workload can dirty -99*32 pages before balance_dirty_pages()
         *   cares.
         */
        #define _GNU_SOURCE
        #include <err.h>
        #include <fcntl.h>
        #include <sched.h>
        #include <stdlib.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <sys/sysinfo.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        static char *buf;
        static int bufSize;
      
        static void set_affinity(int cpu)
        {
        	cpu_set_t affinity;
      
        	CPU_ZERO(&affinity);
        	CPU_SET(cpu, &affinity);
        	if (sched_setaffinity(0, sizeof(affinity), &affinity))
        		err(1, "sched_setaffinity");
        }
      
        static void dirty_on(int output_fd, int cpu)
        {
        	int i, wrote;
      
        	set_affinity(cpu);
        	for (i = 0; i < 32; i++) {
        		for (wrote = 0; wrote < bufSize; ) {
        			int ret = write(output_fd, buf+wrote, bufSize-wrote);
        			if (ret == -1)
        				err(1, "write");
        			wrote += ret;
        		}
        	}
        }
      
        int main(int argc, char **argv)
        {
        	int cpu, flush_cpu = 1, output_fd;
        	const char *output;
      
        	if (argc != 2)
        		errx(1, "usage: output_file");
      
        	output = argv[1];
        	bufSize = getpagesize();
        	buf = malloc(getpagesize());
        	if (buf == NULL)
        		errx(1, "malloc failed");
      
        	output_fd = open(output, O_CREAT|O_RDWR, 0644);	/* O_CREAT needs a mode */
        	if (output_fd == -1)
        		err(1, "open(%s)", output);
      
        	for (cpu = 0; cpu < get_nprocs(); cpu++) {
        		if (cpu != flush_cpu)
        			dirty_on(output_fd, cpu);
        	}
      
        	set_affinity(flush_cpu);
        	if (fsync(output_fd))
        		err(1, "fsync(%s)", output);
        	if (close(output_fd))
        		err(1, "close(%s)", output);
        	free(buf);
        }
      
      Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
      collect exact per memcg counters.  This avoids the aforementioned oom
      kills.
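
      The exact read can be done with a helper along these lines (a sketch
      with assumed field names, presumably plumbed into the path that
      balance_dirty_pages() uses to read the memcg dirty/writeback counts):

      static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
      {
              long x = atomic_long_read(&memcg->stat[idx]);
              int cpu;

              /* fold in the per-cpu deltas that have not been flushed yet */
              for_each_online_cpu(cpu)
                      x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
              if (x < 0)
                      x = 0;
              return x;
      }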
      
      This does not affect the overhead of memory.stat, which still reads the
      single atomic counter.
      
      Why not use percpu_counter? memcg already handles cpus going offline, so
      no need for that overhead from percpu_counter.  And the percpu_counter
      spinlocks are more heavyweight than is required.
      
      It probably also makes sense to use exact dirty and writeback counters
      in memcg oom reports.  But that is saved for later.
      
      Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.16+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 06 March 2019, 2 commits
  17. 29 December 2018, 1 commit
  18. 27 October 2018, 2 commits