1. 16 April 2020, 2 commits
    • alinux: mm, memcg: record latency of direct compact in every memcg · 4bec5cfe
      Committed by Xu Yu
      to #26424368
      
      Probe and calculate the latency of direct compaction, then record it in
      the latency histogram in struct mem_cgroup.
      
      Note that the latency in each memcg is aggregated from all child memcgs.
      
      Usage:
      
      $ cat memory.direct_compact_latency
      0-1ms:  1176
      1-5ms:  259
      5-10ms:         17
      10-100ms:       10
      100-500ms:      0
      500-1000ms:     0
      >=1000ms:       0
      total(ms):      921
      
      Each line is the count of direct compaction operations whose latency falls
      within the corresponding range.
      
      To clear the latency histogram:
      
      $ echo 0 > memory.direct_compact_latency
      $ cat memory.direct_compact_latency
      0-1ms:  0
      1-5ms:  0
      5-10ms:         0
      10-100ms:       0
      100-500ms:      0
      500-1000ms:     0
      >=1000ms:       0
      total(ms):      0
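
      The bucketing works roughly as sketched below. This is a minimal
      illustration in plain C, not the patch's code: the bucket boundaries
      mirror the output above, and the names (MEMCG_LAT_NUM_BUCKETS,
      memcg_lat_histogram, memcg_lat_record) are hypothetical.

        #include <stdint.h>

        #define MEMCG_LAT_NUM_BUCKETS 7

        /* upper bounds (exclusive) of the first six buckets, in ms */
        static const uint64_t lat_bucket_upper_ms[] = { 1, 5, 10, 100, 500, 1000 };

        struct memcg_lat_histogram {
                uint64_t count[MEMCG_LAT_NUM_BUCKETS]; /* one counter per range */
                uint64_t total_ms;                     /* accumulated latency   */
        };

        static void memcg_lat_record(struct memcg_lat_histogram *h, uint64_t lat_ms)
        {
                int i;

                for (i = 0; i < MEMCG_LAT_NUM_BUCKETS - 1; i++)
                        if (lat_ms < lat_bucket_upper_ms[i])
                                break;
                h->count[i]++;          /* i == 6 is the ">=1000ms" bucket */
                h->total_ms += lat_ms;
        }
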
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      4bec5cfe
    • alinux: mm, memcg: record latency of direct reclaim in every memcg · 83058e75
      Committed by Xu Yu
      to #26424368
      
      Probe and calculate the latency of global direct reclaim and of memcg
      direct reclaim, respectively, and record each in the latency histogram in
      struct mem_cgroup. In addition, the total latency is accumulated each time
      the histogram is updated.
      
      Note that the latency in each memcg is aggregated from all child memcgs.
      
      Usage:
      
      $ cat memory.direct_reclaim_global_latency
      0-1ms:  228
      1-5ms:  283
      5-10ms:         0
      10-100ms:       0
      100-500ms:      0
      500-1000ms:     0
      >=1000ms:       0
      total(ms):      539
      
      Each line is the count of global direct reclaim operations whose latency
      falls within the corresponding range.
      
      To clear the latency histogram:
      
      $ echo 0 > memory.direct_reclaim_global_latency
      $ cat memory.direct_reclaim_global_latency
      0-1ms:  0
      1-5ms:  0
      5-10ms:         0
      10-100ms:       0
      100-500ms:      0
      500-1000ms:     0
      >=1000ms:       0
      total(ms):      0
      
      The usage of memory.direct_reclaim_memcg_latency is the same as
      memory.direct_reclaim_global_latency.
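
      The probe itself follows the usual timestamp-around-the-call pattern. The
      sketch below is illustrative only: do_try_to_free_pages(), ktime_get_ns(),
      div_u64() and NSEC_PER_MSEC are real kernel symbols, but the wrapper and
      the memcg_record_reclaim_latency() helper are hypothetical stand-ins for
      the patch's actual hooks.

        static unsigned long timed_do_try_to_free_pages(struct zonelist *zonelist,
                                                        struct scan_control *sc)
        {
                u64 start = ktime_get_ns();
                unsigned long reclaimed = do_try_to_free_pages(zonelist, sc);
                u64 delta_ms = div_u64(ktime_get_ns() - start, NSEC_PER_MSEC);

                /* aggregate into the target memcg and its ancestors */
                memcg_record_reclaim_latency(sc->target_mem_cgroup, delta_ms);
                return reclaimed;
        }
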
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      83058e75
  2. 25 March 2020, 1 commit
  3. 18 March 2020, 19 commits
  4. 17 January 2020, 1 commit
  5. 15 January 2020, 10 commits
    • alinux: mm: thp: move deferred split queue to memcg's nodeinfo · e6ca020b
      Committed by Yang Shi
      Commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe ("mm: thp: make deferred
      split shrinker memcg aware") made the deferred split queue per-memcg in
      order to resolve a premature memcg OOM problem.  But all nodes now end up
      sharing one queue, instead of the one queue per node that existed before
      that commit.  That is not a big deal for memcg limit reclaim, but it may
      cause global kswapd to shrink THPs from a different node.
      
      In addition, 0-day testing reported a -19.6% regression on stress-ng's
      madvise test [1].  I did not see that much regression on my test box (24
      threads, 48GB memory, 2 nodes).  With the same test (stress-ng --timeout 1
      --metrics-brief --sequential 72 --class vm --exclude spawn,exec) I saw an
      average -3% regression, and only sometimes: I ran the test 10 times and
      averaged, since the test itself shows up to 15% variation in my testing,
      and in some runs I saw no regression at all.
      
      This might be caused by deferred split queue lock contention.  With some
      configurations (e.g. just one root memcg) the lock contention may be worse
      than before (given 2 nodes, two locks are reduced to one lock).
      
      So, move the deferred split queue into the memcg's nodeinfo to make it
      NUMA-aware again.
      
      With this change, stress-ng's madvise test sometimes shows an average 4%
      improvement, and I no longer see any degradation.
      
      [1]: https://lore.kernel.org/lkml/20190930084604.GC17687@shao2-debian/
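
      Structurally, the change amounts to moving the queue one level down, as
      sketched below (fields abbreviated; this is not the full struct
      definition from the patch):

        /* before: one queue per memcg, shared by all nodes */
        struct mem_cgroup {
                /* ... */
                struct deferred_split deferred_split_queue;
        };

        /* after: one queue per memcg per node */
        struct mem_cgroup_per_node {
                struct lruvec lruvec;
                /* ... */
                struct deferred_split deferred_split_queue;
        };
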
      
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      e6ca020b
    • mm: thp: make deferred split shrinker memcg aware · d651fcbb
      Committed by Yang Shi
      commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe upstream
      
      Currently the THP deferred split shrinker is not memcg aware, which may
      cause premature OOM with some configurations.  For example, the test below
      easily runs into premature OOM:
      
      $ cgcreate -g memory:thp
      $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
      $ cgexec -g memory:thp transhuge-stress 4000
      
      transhuge-stress comes from kernel selftest.
      
      It is easy to hit OOM, yet there are still a lot of THPs on the deferred
      split queue; memcg direct reclaim can't touch them since the deferred
      split shrinker is not memcg aware.
      
      Make the deferred split shrinker memcg aware by introducing a per-memcg
      deferred split queue.  A THP is kept on either the per-node or the
      per-memcg deferred split queue, depending on whether it belongs to a
      memcg.  When the page is migrated to another memcg, it is moved to the
      target memcg's deferred split queue as well.

      Reuse the second tail page's deferred_list for the per-memcg list, since
      the same THP can't be on multiple deferred split queues.
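
      Queue selection then boils down to the following helper, simplified from
      the upstream commit (CONFIG_MEMCG guards and locking omitted):

        static struct deferred_split *get_deferred_split_queue(struct page *page)
        {
                struct mem_cgroup *memcg = compound_head(page)->mem_cgroup;
                struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));

                /* a memcg-charged THP uses that memcg's queue ... */
                if (memcg)
                        return &memcg->deferred_split_queue;
                /* ... otherwise fall back to the per-node queue */
                return &pgdat->deferred_split_queue;
        }
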
      
      [yang.shi@linux.alibaba.com: simplify deferred split queue dereference per Kirill Tkhai]
        Link: http://lkml.kernel.org/r/1566496227-84952-5-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1565144277-36240-5-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      d651fcbb
    • mm: shrinker: make shrinker not depend on memcg kmem · bd5596b4
      Committed by Yang Shi
      commit 0a432dcbeb32edcd211a5d8f7847d0da7642a8b4 upstream
      
      Currently, memcg-aware shrinkers are only allocated and usable when memcg
      kmem is enabled.  But the THP deferred split shrinker is not a slab
      shrinker, so it doesn't make much sense for it to depend on memcg kmem; it
      should be able to reclaim THPs even when memcg kmem is disabled.

      Introduce a new shrinker flag, SHRINKER_NONSLAB, for non-slab shrinkers.
      When memcg kmem is disabled, only such shrinkers are invoked when
      shrinking memcg slab.
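
      For reference, the THP deferred split shrinker then declares itself along
      these lines (paraphrased from the upstream series); a shrinker marked this
      way is still invoked for a memcg even when kmem accounting is off:

        static struct shrinker deferred_split_shrinker = {
                .count_objects = deferred_split_count,
                .scan_objects  = deferred_split_scan,
                .seeks         = DEFAULT_SEEKS,
                .flags         = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE |
                                 SHRINKER_NONSLAB,
        };
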
      
      [yang.shi@linux.alibaba.com: add comment]
        Link: http://lkml.kernel.org/r/1566496227-84952-4-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1565144277-36240-4-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      bd5596b4
    • alinux: mm: Support kidled · a29243e2
      Committed by Gavin Shan
      This enables scanning pages at a fixed interval to determine their access
      frequency (hot/cold).  The result is exported to user space per memory
      cgroup through "memory.idle_page_stats".  The design is highlighted below:
      
         * A kernel thread is spawned when this feature is enabled by writing a
           non-zero value to "/sys/kernel/mm/kidled/scan_period_in_seconds".
           The thread sequentially scans the nodes and the pages that have been
           chained up in their LRU lists.

         * For each page, its age information is stored in the page flags or in
           a per-node array. The age represents the number of scanning intervals
           in which the page wasn't accessed. The page flag (PG_idle) is
           leveraged: the page's age is increased by one if the idle flag isn't
           cleared in two consecutive scans; otherwise, the page's age is
           cleared out. The page's age information is also cleared when the page
           is freed, so that stale age information isn't fetched when it's
           allocated again.
      
         * Initially, the flag is set, while the access bit in the page's PTE is
           cleared out by the thread. In the next scanning period, the PTE
           access bit is synchronized with the page flag: the flag is cleared if
           the access bit is set, and kept otherwise. For unmapped pages, the
           flag is cleared when the page is accessed.
      
         * Eventually, the page's aging information is accumulated into the
           unstable bucket of its memory cgroup as statistics. The unstable
           bucket is copied to the stable bucket once all pages in all nodes
           have been scanned. The stable bucket is exported to user space
           through "memory.idle_page_stats". The per-page step is sketched
           right after this list.
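
      A rough sketch of the per-page aging step, greatly simplified: the PTE
      access-bit synchronization for mapped pages and the bucket accounting are
      omitted, kidled_inc_page_age() and kidled_clear_page_age() are
      hypothetical stand-ins, while page_is_idle() and set_page_idle() are the
      existing PG_idle helpers from linux/page_idle.h.

        static void kidled_scan_page(struct page *page)
        {
                if (page_is_idle(page)) {
                        /* idle flag survived a full interval: not accessed */
                        kidled_inc_page_age(page);
                } else {
                        /* flag was cleared by an access: reset the age */
                        kidled_clear_page_age(page);
                }

                /* re-arm the idle flag for the next scan interval */
                set_page_idle(page);
        }
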
      
      TESTING
      =======
      
         * cgroup1, unmapped pagecache
      
           # dd if=/dev/zero of=/ext4/test.data oflag=direct bs=1M count=128
           #
           # echo 1 > /sys/kernel/mm/kidled/use_hierarchy
           # echo 15 > /sys/kernel/mm/kidled/scan_period_in_seconds
           # mkdir -p /cgroup/memory
           # mount -tcgroup -o memory /cgroup/memory
           # echo 1 > /cgroup/memory/memory.use_hierarchy
           # mkdir -p /cgroup/memory/test
           # echo 1 > /cgroup/memory/test/memory.use_hierarchy
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # dd if=/ext4/test.data of=/dev/null bs=1M count=128
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
             cfei   0   0   0   134217728   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep cfei
             cfei   0   0   0   134217728   0   0   0   0
      
         * cgroup1, mapped pagecache
      
           # < create same file and memory cgroups as above >
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # < run program to mmap the whole created file and access the area >
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
             cfei   0   134217728   0   0   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep cfei
             cfei   0   134217728   0   0   0   0   0   0
      
         * cgroup1, mapped and locked pagecache
      
           # < create same file and memory cgroups as above >
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # < run program to mmap the whole created file and mlock the area >
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfui
             cfui   0   134217728   0   0   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep cfui
             cfui   0   134217728   0   0   0   0   0   0
      
         * cgroup1, anonymous and locked area
      
           # < create memory cgroups as above >
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # < run program to mmap anonymous area and mlock it >
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep csui
             csui   0   0   134217728   0   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep csui
             csui   0   0   134217728   0   0   0   0   0
      
         * Rerun the above test cases in cgroup2 and the results show nothing
           unexpected.  However, the cgroups are populated differently, as
           below:
      
           # mkdir -p /cgroup
           # mount -tcgroup2 none /cgroup
           # echo "+memory" > /cgroup/cgroup.subtree_control
           # mkdir -p /cgroup/test
      Signed-off-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      a29243e2
    • alinux: mm: memcontrol: make distance between wmark_low and wmark_high configurable · bbaee3af
      Committed by Yang Shi
      Introduce a new interface, wmark_scale_factor, which defines the
      distance between wmark_high and wmark_low.  The unit is in fractions of
      10,000. The default value of 50 means the distance between wmark_high
      and wmark_low is 0.5% of the max limit of the cgroup.  The maximum value
      is 1000, or 10% of the max limit.
      
      The distance between wmark_low and wmark_high has an impact on how hard
      memcg kswapd reclaims.
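
      The arithmetic is straightforward. The sketch below is illustrative only;
      the helper name and the exact rounding are not taken from the patch:

        static unsigned long calc_wmark_low(unsigned long wmark_high,
                                            unsigned long max_limit,
                                            unsigned int wmark_scale_factor)
        {
                /* wmark_scale_factor is in 1/10000ths of the max limit,
                 * e.g. the default 50 puts wmark_low 0.5% below wmark_high */
                unsigned long distance = max_limit / 10000 * wmark_scale_factor;

                return wmark_high > distance ? wmark_high - distance : 0;
        }
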
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      bbaee3af
    • alinux: mm: memcontrol: treat memcg wmark reclaim work as kswapd · 6332d4e3
      Committed by Yang Shi
      Since background watermark reclaim is scheduled from a workqueue, it can
      do more work than direct reclaim, e.g. write out dirty pages.

      So, set the PF_KSWAPD flag so that current_is_kswapd() returns true for
      memcg background reclaim.  The condition "current_is_kswapd() &&
      !global_reclaim(sc)" is sufficient to tell whether current is global
      kswapd or memcg background reclaim.

      Also, kswapd is not allowed to break memory.low protection for now, so
      memcg kswapd should not break it either.
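
      A minimal sketch of the idea (the work function name is hypothetical;
      PF_KSWAPD and current_is_kswapd() are the real kernel symbols):

        static void memcg_wmark_reclaim_work(struct work_struct *work)
        {
                /* make current_is_kswapd() return true inside the reclaim */
                current->flags |= PF_KSWAPD;

                /* ... reclaim until usage drops below wmark_low ... */

                current->flags &= ~PF_KSWAPD;
        }
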
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      6332d4e3
    • alinux: mm: memcontrol: add background reclaim support for cgroupv2 · 0149d7b9
      Committed by Yang Shi
      As in v1, add background reclaim support for cgroup v2.  The interfaces
      are exactly the same as in v1.  However, if a high limit is set up for v2,
      the watermarks are calculated from the high limit instead of the max
      limit.
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      0149d7b9
    • alinux: mm: memcontrol: support background async page reclaim · 6967792f
      Committed by Yang Shi
      Currently, when memory usage exceeds the memory cgroup limit, the memory
      cgroup can only do synchronous direct reclaim.  This may incur unexpected
      stalls in applications that are sensitive to latency.  Introduce a
      background asynchronous page reclaim mechanism, similar to what kswapd
      does.
      
      Define the memcg memory usage watermark by introducing the wmark_ratio
      interface, which ranges from 0 to 100 and represents a percentage of the
      max limit.  wmark_high is calculated as (max * wmark_ratio / 100) and
      wmark_low as (wmark_high - (wmark_high >> 8)), which is an empirical
      value.  A wmark_ratio of 0 means the watermark is disabled; both
      wmark_low and wmark_high are then max, which is the default.
      
      If wmark_ratio is set, then when charging a page and usage is greater
      than wmark_high (meaning the available memory of the memcg is low), a
      work item is scheduled to do background page reclaim until memory usage
      is reduced to wmark_low, if possible.
      
      Define a dedicated unbound workqueue for scheduling the watermark reclaim
      works.
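
      The watermark setup described above reduces to the following arithmetic.
      This is an illustrative sketch, not the patch's code, and the function
      name is hypothetical:

        static void memcg_calculate_wmarks(unsigned long max,
                                           unsigned int wmark_ratio,
                                           unsigned long *wmark_high,
                                           unsigned long *wmark_low)
        {
                if (!wmark_ratio) {
                        /* watermarks disabled: both stay at max (default) */
                        *wmark_high = max;
                        *wmark_low = max;
                        return;
                }

                *wmark_high = max / 100 * wmark_ratio;
                *wmark_low = *wmark_high - (*wmark_high >> 8); /* empirical gap */
        }
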
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      6967792f
    • alinux: mm, memcg: fix possible soft lockup in try_charge · 2ced6afd
      Committed by Xu Yu
      When events such as direct reclaim and OOM occur intensively, a soft
      lockup is very likely to happen in instances with 1 vCPU and with kernel
      preemption disabled.
      
      The example soft lockup is as follows.
      
      [  160.555984] watchdog: BUG: soft lockup - CPU#0 stuck for 112s! [malloc:2188]
      [  160.557975] Modules linked in: button
      [  160.559495] CPU: 0 PID: 2188 Comm: malloc Not tainted 4.19.57-15.457.al7.x86_64 #1
      [  160.561546] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
      [  160.563707] RIP: 0010:shrink_node+0x1ae/0x450
      [  160.565391] Code: 00 00 00 49 8b 4f 20 ba 01 00 00 00 4c 8b 74 24 10 4d 8b 47 28 49 8b 77 10 48 2b 4c 24 08 41 8b 7f 1c 4d8
      [  160.570747] RSP: 0000:ffff9d0ec07a3b58 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
      [  160.572889] RAX: ffff982ab6014330 RBX: ffff982ab6014000 RCX: 0000000000000000
      [  160.574992] RDX: 0000000000000001 RSI: ffff982ab6014000 RDI: ffff982ab6014000
      [  160.577106] RBP: ffff982afffb6000 R08: 0000000000000000 R09: ffff982ab6014000
      [  160.579219] R10: 0000000000000004 R11: 0000000000aaaaaa R12: 0000000000000000
      [  160.581326] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9d0ec07a3c50
      [  160.583450] FS:  00007f8b414f7740(0000) GS:ffff982afda00000(0000) knlGS:0000000000000000
      [  160.585704] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  160.587662] CR2: 00007f8adb800010 CR3: 000000007ac9e001 CR4: 00000000003606b0
      [  160.589835] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  160.591971] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  160.594133] Call Trace:
      [  160.595602]  do_try_to_free_pages+0xcc/0x390
      [  160.597356]  try_to_free_mem_cgroup_pages+0xf9/0x1d0
      [  160.599198]  ? out_of_memory+0xb5/0x4a0
      [  160.600882]  try_charge+0x244/0x750
      [  160.602522]  ? __pagevec_lru_add_fn+0x1d0/0x330
      [  160.604310]  mem_cgroup_try_charge+0xb4/0x1d0
      [  160.606085]  mem_cgroup_try_charge_delay+0x1c/0x40
      [  160.607892]  do_anonymous_page+0xf7/0x540
      [  160.609574]  __handle_mm_fault+0x665/0xa00
      [  160.611233]  ? __switch_to_asm+0x35/0x70
      [  160.612838]  handle_mm_fault+0x122/0x1e0
      [  160.614407]  __do_page_fault+0x1b7/0x470
      [  160.615962]  do_page_fault+0x32/0x140
      [  160.617474]  ? async_page_fault+0x8/0x30
      [  160.619012]  async_page_fault+0x1e/0x30
      [  160.620526] RIP: 0033:0x40068e
      
      Fix it by adding cond_resched() in try_charge(), just before the goto
      retry after OOM_SUCCESS, in order to let the OOM killer free some memory
      first.
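
      An abbreviated excerpt of the retry path in try_charge() (4.19), showing
      where the new call sits; only the cond_resched() line is added by this
      patch:

                switch (oom_status) {
                case OOM_SUCCESS:
                        nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
                        oomed = true;
                        cond_resched(); /* let the OOM victim run and free memory */
                        goto retry;
                case OOM_FAILED:
                        goto force;
                default:
                        goto nomem;
                }
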
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      2ced6afd
    • mm, memcg: throttle allocators when failing reclaim over memory.high · 6202ab24
      Committed by Chris Down
      commit 0e4b01df865935007bd712cbc8e7299005b28894 upstream.
      
      We're trying to use memory.high to limit workloads, but have found that
      containment can frequently fail completely and cause OOM situations
      outside of the cgroup.  This happens especially with swap space -- either
      when none is configured, or swap is full.  These failures often also don't
      have enough warning to allow one to react, whether for a human or for a
      daemon monitoring PSI.
      
      Here is output from a simple program showing how long it takes in usec
      (column 2) to allocate a megabyte of anonymous memory (column 1) when a
      cgroup is already beyond its memory high setting, and no swap is
      available:
      
          [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
          > --wait -t timeout 300 /root/mdf
          [...]
          95  1035
          96  1038
          97  1000
          98  1036
          99  1048
          100 1590
          101 1968
          102 1776
          103 1863
          104 1757
          105 1921
          106 1893
          107 1760
          108 1748
          109 1843
          110 1716
          111 1924
          112 1776
          113 1831
          114 1766
          115 1836
          116 1588
          117 1912
          118 1802
          119 1857
          120 1731
          [...]
          [System OOM in 2-3 seconds]
      
      The delay does go up extremely marginally past the 100MB memory.high
      threshold, as now we spend time scanning before returning to usermode, but
      it's nowhere near enough to contain growth.  It also doesn't get worse the
      more pages you have, since it only considers nr_pages.
      
      The current situation goes against both the expectations of users of
      memory.high, and our intentions as cgroup v2 developers.  In
      cgroup-v2.txt, we claim that we will throttle and only under "extreme
      conditions" will memory.high protection be breached.  Likewise, cgroup v2
      users generally also expect that memory.high should throttle workloads as
      they exceed their high threshold.  However, as seen above, this isn't
      always how it works in practice -- even on banal setups like those with no
      swap, or where swap has become exhausted, we can end up with memory.high
      being breached and us having no weapons left in our arsenal to combat
      runaway growth with, since reclaim is futile.
      
      It's also hard for system monitoring software or users to tell how bad the
      situation is, as "high" events for the memcg may in some cases be benign,
      and in others be catastrophic.  The current status quo is that we fail
      containment in a way that doesn't provide any advance warning that things
      are about to go horribly wrong (for example, we are about to invoke the
      kernel OOM killer).
      
      This patch introduces explicit throttling when reclaim is failing to keep
      memcg size contained at the memory.high setting.  It does so by applying
      an exponential delay curve derived from the memcg's overage compared to
      memory.high.  In the normal case where the memcg is either below or only
      marginally over its memory.high setting, no throttling will be performed.
      
      This composes well with system health monitoring and remediation, as these
      allocator delays are factored into PSI's memory pressure calculations.
      This both creates a mechanism for system administrators or applications
      consuming the PSI interface to trivially see that the memcg in question is
      struggling and use that to make more reasonable decisions, and permits
      them enough time to act.  Either of these can act with significantly more
      nuance than we can provide using the system OOM killer.
      
      This is a similar idea to memory.oom_control in cgroup v1 which would put
      the cgroup to sleep if the threshold was violated, but it's also
      significantly improved as it results in visible memory pressure, and also
      doesn't schedule indefinitely, which previously made tracing and other
      introspection difficult (i.e. it's clamped at 2*HZ per allocation through
      MEMCG_MAX_HIGH_DELAY_JIFFIES).
      
      Contrast the previous results with a kernel with this patch:
      
          [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
          > --wait -t timeout 300 /root/mdf
          [...]
          95  1002
          96  1000
          97  1002
          98  1003
          99  1000
          100 1043
          101 84724
          102 330628
          103 610511
          104 1016265
          105 1503969
          106 2391692
          107 2872061
          108 3248003
          109 4791904
          110 5759832
          111 6912509
          112 8127818
          113 9472203
          114 12287622
          115 12480079
          116 14144008
          117 15808029
          118 16384500
          119 16383242
          120 16384979
          [...]
      
      As you can see, in the normal case, memory allocation takes around 1000
      usec.  However, as we exceed our memory.high, things start to increase
      exponentially, but fairly leniently at first.  Our first megabyte over
      memory.high takes us 0.16 seconds, then the next is 0.46 seconds, then the
      next is almost an entire second.  This gets worse until we reach our
      eventual 2*HZ clamp per batch, resulting in 16 seconds per megabyte.
      However, this is still making forward progress, so permits tracing or
      further analysis with programs like GDB.
      
      We use an exponential curve for our delay penalty for a few reasons:
      
      1. We run mem_cgroup_handle_over_high to potentially do reclaim after
         we've already performed allocations, which means that temporarily
         going over memory.high by a small amount may be perfectly legitimate,
         even for compliant workloads. We don't want to unduly penalise such
         cases.
      2. An exponential curve (as opposed to a static or linear delay) allows
         ramping up memory pressure stats more gradually, which can be useful
         to work out that you have set memory.high too low, without destroying
         application performance entirely.
      
      This patch expands on earlier work by Johannes Weiner. Thanks!
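
      The shape of the penalty can be pictured with the toy function below.
      This is purely illustrative: the real calculation lives in
      mem_cgroup_handle_over_high() upstream and uses different fixed-point
      scaling, but it shows the two properties that matter here, a gentle ramp
      just above memory.high and a hard clamp at 2*HZ per allocation batch.

        static unsigned long long penalty_jiffies(unsigned long long usage,
                                                  unsigned long long high,
                                                  unsigned long long hz)
        {
                unsigned long long overage, penalty, clamp = 2 * hz;

                if (!high || usage <= high)
                        return 0;

                /* relative overage in 1/1000ths; squaring keeps small
                 * overages cheap and ramps up steeply afterwards */
                overage = (usage - high) * 1000 / high;
                penalty = overage * overage * hz / (1000 * 1000);

                return penalty < clamp ? penalty : clamp;
        }
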
      
      [akpm@linux-foundation.org: fix max() warning]
      [akpm@linux-foundation.org: fix __udivdi3 ref on 32-bit]
      [akpm@linux-foundation.org: fix it even more]
      [chris@chrisdown.name: fix 64-bit divide even more]
      Link: http://lkml.kernel.org/r/20190723180700.GA29459@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      6202ab24
  6. 27 December 2019, 3 commits
  7. 01 December 2019, 1 commit
    • mm: handle no memcg case in memcg_kmem_charge() properly · 6a2245d8
      Committed by Roman Gushchin
      [ Upstream commit e68599a3c3ad0f3171a7cb4e48aa6f9a69381902 ]
      
      Mike Galbraith reported a regression caused by the commit 9b6f7e163cd0
      ("mm: rework memcg kernel stack accounting") on a system with
      "cgroup_disable=memory" boot option: the system panics with the following
      stack trace:
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8
        PGD 0 P4D 0
        Oops: 0002 [#1] PREEMPT SMP PTI
        CPU: 0 PID: 1 Comm: systemd Not tainted 4.19.0-preempt+ #410
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fed4
        RIP: 0010:page_counter_try_charge+0x22/0xc0
        Code: 41 5d c3 c3 0f 1f 40 00 0f 1f 44 00 00 48 85 ff 0f 84 a7 00 00 00 41 56 48 89 f8 49 89 fe 49
        Call Trace:
         try_charge+0xcb/0x780
         memcg_kmem_charge_memcg+0x28/0x80
         memcg_kmem_charge+0x8b/0x1d0
         copy_process.part.41+0x1ca/0x2070
         _do_fork+0xd7/0x3d0
         do_syscall_64+0x5a/0x180
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The problem occurs because get_mem_cgroup_from_current() returns a NULL
      pointer if the memory controller is disabled.  Let's check for this case
      at the beginning of memcg_kmem_charge() and just return 0 if
      mem_cgroup_disabled() returns true.  This is how we handle this case in
      many other places in the memory controller code.
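
      An abbreviated sketch of the resulting entry check (the rest of the
      function body is elided; see mm/memcontrol.c for the full code):

        int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
        {
                struct mem_cgroup *memcg;
                int ret = 0;

                /* the added early return: with cgroup_disable=memory there is
                 * nothing to account and no memcg to dereference */
                if (mem_cgroup_disabled())
                        return 0;

                memcg = get_mem_cgroup_from_current();
                /* ... charge the page to memcg as before ... */
                return ret;
        }
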
      
      Link: http://lkml.kernel.org/r/20181029215123.17830-1-guro@fb.com
      Fixes: 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Rik van Riel <riel@surriel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6a2245d8
  8. 21 November 2019, 1 commit
    • mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() · bb1bc2d8
      Committed by Roman Gushchin
      commit 00d484f354d85845991b40141d40ba9e5eb60faf upstream.
      
      We've encountered a rcu stall in get_mem_cgroup_from_mm():
      
        rcu: INFO: rcu_sched self-detected stall on CPU
        rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
        (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
        <...>
        RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
        <...>
         __memcg_kmem_charge+0x55/0x140
         __alloc_pages_nodemask+0x267/0x320
         pipe_write+0x1ad/0x400
         new_sync_write+0x127/0x1c0
         __kernel_write+0x4f/0xf0
         dump_emit+0x91/0xc0
         writenote+0xa0/0xc0
         elf_core_dump+0x11af/0x1430
         do_coredump+0xc65/0xee0
         get_signal+0x132/0x7c0
         do_signal+0x36/0x640
         exit_to_usermode_loop+0x61/0xd0
         do_syscall_64+0xd4/0x100
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The problem is caused by an exiting task which is associated with an
      offline memcg.  We're iterating over and over in the do {} while
      (!css_tryget_online()) loop, but obviously the memcg won't become online
      and the exiting task won't be migrated to a live memcg.
      
      Let's fix it by switching from css_tryget_online() to css_tryget().
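
      For context, the relevant loop in get_mem_cgroup_from_mm() looks roughly
      like this (abbreviated; the mm == NULL handling is omitted). The only
      change is the tryget variant on the last line of the loop:

        rcu_read_lock();
        do {
                memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
                if (unlikely(!memcg))
                        memcg = root_mem_cgroup;
        } while (!css_tryget(&memcg->css));     /* was css_tryget_online() */
        rcu_read_unlock();
        return memcg;
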
      
      As css_tryget_online() cannot guarantee that the memcg won't go offline,
      the check is usually useless, except some rare cases when for example it
      determines if something should be presented to a user.
      
      A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
      css_tryget() instead of css_tryget_online() in task_get_css()").
      
      Johannes:
      
      : The bug aside, it doesn't matter whether the cgroup is online for the
      : callers.  It used to matter when offlining needed to evacuate all charges
      : from the memcg, and so needed to prevent new ones from showing up, but we
      : don't care now.
      
      Link: http://lkml.kernel.org/r/20191106225131.3543616-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bb1bc2d8
  9. 13 November 2019, 1 commit
    • mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges · 8e6bf4bc
      Committed by Johannes Weiner
      commit 869712fd3de5a90b7ba23ae1272278cddc66b37b upstream.
      
      While upgrading from 4.16 to 5.2, we noticed these allocation errors in
      the log of the new kernel:
      
        SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
          cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
          node 0: slabs: 5, objs: 170, free: 0
      
              slab_out_of_memory+1
              ___slab_alloc+969
              __slab_alloc+14
              kmem_cache_alloc+346
              inet_twsk_alloc+60
              tcp_time_wait+46
              tcp_fin+206
              tcp_data_queue+2034
              tcp_rcv_state_process+784
              tcp_v6_do_rcv+405
              __release_sock+118
              tcp_close+385
              inet_release+46
              __sock_release+55
              sock_close+17
              __fput+170
              task_work_run+127
              exit_to_usermode_loop+191
              do_syscall_64+212
              entry_SYSCALL_64_after_hwframe+68
      
      accompanied by an increase in machines going completely radio silent
      under memory pressure.
      
      One thing that changed since 4.16 is e699e2c6 ("net, mm: account
      sock objects to kmemcg"), which made these slab caches subject to cgroup
      memory accounting and control.
      
      The problem with that is that cgroups, unlike the page allocator, do not
      maintain dedicated atomic reserves.  As a cgroup's usage hovers at its
      limit, atomic allocations - such as done during network rx - can fail
      consistently for extended periods of time.  The kernel is not able to
      operate under these conditions.
      
      We don't want to revert the culprit patch, because it indeed tracks a
      potentially substantial amount of memory used by a cgroup.
      
      We also don't want to implement dedicated atomic reserves for cgroups.
      There is no point in keeping a fixed margin of unused bytes in the
      cgroup's memory budget to accommodate a consumer that is impossible to
      predict - we'd be wasting memory and get into configuration headaches,
      not unlike what we have going with min_free_kbytes.  We do this for
      physical mem because we have to, but cgroups are an accounting game.
      
      Instead, account these privileged allocations to the cgroup, but let
      them bypass the configured limit if they have to.  This way, we get the
      benefits of accounting the consumed memory and have it exert pressure on
      the rest of the cgroup, but like with the page allocator, we shift the
      burden of reclaiming on behalf of atomic allocations onto the regular
      allocations that can block.
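
      Sketched against the 4.19 try_charge() (abbreviated; the exact placement
      of the new check inside the function is not shown here), the change boils
      down to routing atomic allocations to the existing force-charge path:

        /*
         * New check in try_charge(): atomic allocations are accounted but
         * never failed on the limit; they jump to the existing force-charge
         * path below (abbreviated).
         */
        if (gfp_mask & __GFP_ATOMIC)
                goto force;

        /* ... reclaim / retry / OOM handling elided ... */

        force:
                /* charge the counters even though this may overrun the limit */
                page_counter_charge(&memcg->memory, nr_pages);
                if (do_memsw_account())
                        page_counter_charge(&memcg->memsw, nr_pages);
                css_get_many(&memcg->css, nr_pages);
                return 0;
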
      
      Link: http://lkml.kernel.org/r/20191022233708.365764-1-hannes@cmpxchg.org
      Fixes: e699e2c6 ("net, mm: account sock objects to kmemcg")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8e6bf4bc
  10. 05 October 2019, 1 commit
    • memcg, kmem: do not fail __GFP_NOFAIL charges · b4a734a5
      Committed by Michal Hocko
      commit e55d9d9bfb69405bd7615c0f8d229d8fafb3e9b8 upstream.
      
      Thomas has noticed the following NULL ptr dereference when using cgroup
      v1 kmem limit:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      PGD 0
      P4D 0
      Oops: 0000 [#1] PREEMPT SMP PTI
      CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
      Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
      RIP: 0010:create_empty_buffers+0x24/0x100
      Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
      RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
      RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
      RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
      R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
      FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
      Call Trace:
       create_page_buffers+0x4d/0x60
       __block_write_begin_int+0x8e/0x5a0
       ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
       ? jbd2__journal_start+0xd7/0x1f0
       ext4_da_write_begin+0x112/0x3d0
       generic_perform_write+0xf1/0x1b0
       ? file_update_time+0x70/0x140
       __generic_file_write_iter+0x141/0x1a0
       ext4_file_write_iter+0xef/0x3b0
       __vfs_write+0x17e/0x1e0
       vfs_write+0xa5/0x1a0
       ksys_write+0x57/0xd0
       do_syscall_64+0x55/0x160
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Tetsuo then noticed that this is because __memcg_kmem_charge_memcg()
      fails a __GFP_NOFAIL charge when the kmem limit is reached.  This is
      wrong behavior, because nofail allocations are not allowed to fail.  The
      normal charge path simply forces the charge even if that means crossing
      the limit.  Kmem accounting should do the same.
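
      An abbreviated excerpt of the fix in __memcg_kmem_charge_memcg(),
      paraphrased from the upstream patch (surrounding code elided): when the
      kmem counter cannot be charged, a __GFP_NOFAIL request is force-charged
      instead of being failed.

        if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
            !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
                /*
                 * Enforce __GFP_NOFAIL allocations: callers are not prepared
                 * to handle failure, so force the charge past the limit.
                 */
                if (gfp & __GFP_NOFAIL) {
                        page_counter_charge(&memcg->kmem, nr_pages);
                        return 0;
                }
                cancel_charge(memcg, nr_pages);
                return -ENOMEM;
        }
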
      
      Link: http://lkml.kernel.org/r/20190906125608.32129-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
      Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b4a734a5