1. 08 Aug 2020, 4 commits
    • mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h · 0f876e4d
      Committed by Roman Gushchin
      To make the memcg_kmem_bypass() function available outside of
      memcontrol.c, move it to memcontrol.h.  The function is small and is a
      natural fit for a static inline helper.
      
      It will be used from the slab code.
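
      As a rough illustration, the helper is the kind of short predicate that
      fits naturally as a static inline in the header.  The body below is a
      hedged reconstruction based on this description, not necessarily the
      exact kernel code:

        static inline bool memcg_kmem_bypass(void)
        {
                /* Skip kmem accounting in interrupt context and for kernel threads. */
                if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
                        return true;
                return false;
        }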
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-12-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: save obj_cgroup for non-root slab objects · 964d4bd3
      Committed by Roman Gushchin
      Store the obj_cgroup pointer in the corresponding place of
      page->obj_cgroups for each allocated non-root slab object.  Make sure that
      each allocated object holds a reference to obj_cgroup.
      
      The objcg pointer is obtained by dereferencing memcg->objcg in
      memcg_kmem_get_cache() and is passed from the pre_alloc_hook to the
      post_alloc_hook.  If the allocation succeeds, it is then stored in the
      page->obj_cgroups vector.

      The objcg-obtaining part looks a bit bulky now, but it will be
      simplified by later commits in the series.
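
      A hedged sketch of the storing step, with simplified signatures and
      assuming helpers along the lines of virt_to_head_page(), obj_to_index()
      and page_obj_cgroups() described by this series (illustrative, not the
      literal kernel code):

        static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
                                                      struct obj_cgroup *objcg,
                                                      size_t size, void **p)
        {
                struct page *page;
                unsigned long off;
                size_t i;

                for (i = 0; i < size; i++) {
                        if (likely(p[i])) {
                                page = virt_to_head_page(p[i]);
                                off = obj_to_index(s, page, p[i]);
                                /* Each stored pointer holds a reference to the objcg. */
                                obj_cgroup_get(objcg);
                                page_obj_cgroups(page)[off] = objcg;
                        }
                }
                /* Drop the reference taken by the pre-allocation hook. */
                obj_cgroup_put(objcg);
        }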
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-9-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: obj_cgroup API · bf4f0599
      Committed by Roman Gushchin
      The obj_cgroup API provides the ability to account sub-page-sized
      kernel objects, which can potentially outlive the original memory
      cgroup.
      
      The top-level API consists of the following functions:
        bool obj_cgroup_tryget(struct obj_cgroup *objcg);
        void obj_cgroup_get(struct obj_cgroup *objcg);
        void obj_cgroup_put(struct obj_cgroup *objcg);
      
        int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
        void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
      
        struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
        struct obj_cgroup *get_obj_cgroup_from_current(void);
      
      An object cgroup is basically a pointer to a memory cgroup with a
      per-cpu reference counter.  It substitutes for a memory cgroup in places
      where it's necessary to charge a custom number of bytes instead of
      whole pages.
      
      All charged memory rounded down to pages is charged to the corresponding
      memory cgroup using __memcg_kmem_charge().
      
      The API implements reparenting: on memcg offlining the object cgroup is
      reattached to the parent memory cgroup.  Each online memory cgroup has
      an associated active object cgroup that handles new allocations, plus a
      list of all attached object cgroups.  When a cgroup is offlined, this
      list is reparented: for each object cgroup on the list, the memcg
      pointer is swapped to the parent memory cgroup.  This prevents
      long-lived objects from pinning the original memory cgroup in memory.
      
      The implementation is based on byte-sized per-cpu stocks.  A
      sub-page-sized leftover is stored in an atomic field that is part of the
      obj_cgroup object, so on cgroup offlining the leftover is automatically
      reparented.
      
      memcg->objcg is RCU-protected.  objcg->memcg is a raw pointer, which
      always points at a memory cgroup but can be atomically swapped to the
      parent memory cgroup.  A user must therefore ensure the lifetime of the
      cgroup, e.g. by grabbing rcu_read_lock or css_set_lock.
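
      A hedged usage sketch of the API above; the surrounding object layout,
      error handling and the obj_size variable are illustrative only:

        struct obj_cgroup *objcg = get_obj_cgroup_from_current();

        if (objcg) {
                if (obj_cgroup_charge(objcg, GFP_KERNEL, obj_size)) {
                        /* Charge failed: drop the reference and fall back
                         * to an uncharged allocation. */
                        obj_cgroup_put(objcg);
                        objcg = NULL;
                }
                /* On success, keep the reference and store objcg next to
                 * the object so it can be uncharged on free. */
        }

        /* Later, when the object is freed: */
        if (objcg) {
                obj_cgroup_uncharge(objcg, obj_size);
                obj_cgroup_put(objcg);
        }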
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-7-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() · eedc4e5a
      Committed by Roman Gushchin
      Patch series "The new cgroup slab memory controller", v7.
      
      The patchset moves the accounting from the page level to the object
      level.  It allows slab pages to be shared between memory cgroups, which
      leads to a significant win in slab utilization (up to 45%) and a
      corresponding drop in the total kernel memory footprint.  The reduced
      number of unmovable slab pages should also have a positive effect on
      memory fragmentation.
      
      The patchset makes the slab accounting code simpler: there is no longer
      any need for the complicated dynamic creation and destruction of
      per-cgroup slab caches; all memory cgroups use a global set of shared
      slab caches.  The lifetime of slab caches is no longer tied to the
      lifetime of memory cgroups.
      
      The more precise accounting does require more CPU, but in practice the
      difference seems to be negligible.  We've been using the new slab
      controller in Facebook production for several months with different
      workloads and haven't seen any noticeable regressions.  What we have
      seen are memory savings on the order of 1 GB per host (varying heavily
      with the actual workload, amount of RAM, number of CPUs, memory
      pressure, etc.).
      
      The third version of the patchset added yet another simplification
      step: sharing slab caches between accounted and non-accounted
      allocations.  It comes with significant upsides (most noticeably, the
      complete elimination of dynamic slab cache creation) but carries some
      regression risk, so this change sits on top of the patchset rather than
      being fully folded in; in the unlikely event of a noticeable performance
      regression it can be reverted separately.
      
      The slab memory accounting works in exactly the same way for SLAB and
      SLUB.  With both allocators the new controller shows significant memory
      savings; with SLUB the difference is bigger.  On my 16-core desktop
      machine running Fedora 32, the amount of slab memory measured after
      system start was lower by 58% with SLUB and by 38% with SLAB.
      
      As an estimate of the potential CPU overhead, below are results of the
      slab_bulk_test01 test, kindly provided by Jesper D. Brouer, who also
      helped with the evaluation of the results.

      The test can be found here: https://github.com/netoptimizer/prototype-kernel/
      The smallest number in each row should be used for comparison.
      
      SLUB-patched - bulk-API
       - SLUB-patched : bulk_quick_reuse objects=1 : 187 -  90 - 224  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=2 : 110 -  53 - 133  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=3 :  88 -  95 -  42  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=4 :  91 -  85 -  36  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=8 :  32 -  66 -  32  cycles(tsc)
      
      SLUB-original -  bulk-API
       - SLUB-original: bulk_quick_reuse objects=1 :  87 -  87 - 142  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=2 :  52 -  53 -  53  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=3 :  42 -  42 -  91  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=4 :  91 -  37 -  37  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=8 :  31 -  79 -  76  cycles(tsc)
      
      SLAB-patched -  bulk-API
       - SLAB-patched : bulk_quick_reuse objects=1 :  67 -  67 - 140  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=2 :  55 -  46 -  46  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=3 :  93 -  94 -  39  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=4 :  35 -  88 -  85  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=8 :  30 -  30 -  30  cycles(tsc)
      
      SLAB-original-  bulk-API
       - SLAB-original: bulk_quick_reuse objects=1 : 143 - 136 -  67  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=2 :  45 -  46 -  46  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=3 :  38 -  39 -  39  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=4 :  35 -  87 -  87  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=8 :  29 -  66 -  30  cycles(tsc)
      
      This patch (of 19):
      
      To convert the memcg and lruvec slab counters to bytes, there must be a
      way to change these counters without touching the node counters.
      Factor __mod_memcg_lruvec_state() out of __mod_lruvec_state().
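
      A hedged sketch of the resulting split, assuming the existing
      __mod_node_page_state() and mem_cgroup_disabled() helpers (simplified,
      not the literal diff):

        void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
                                int val)
        {
                /* Update the node-level counter unconditionally. */
                __mod_node_page_state(lruvec_pgdat(lruvec), idx, val);

                /* Update the memcg- and lruvec-level counters separately,
                 * so later patches can adjust them in bytes. */
                if (!mem_cgroup_disabled())
                        __mod_memcg_lruvec_state(lruvec, idx, val);
        }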
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20200623174037.3951353-2-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 04 Jun 2020, 10 commits
  3. 03 Jun 2020, 2 commits
    • mm/memcg: automatically penalize tasks with high swap use · 4b82ab4f
      Committed by Jakub Kicinski
      Add a memory.swap.high knob, which can be used to protect the system
      from swap exhaustion.  The penalty mechanism is similar to the
      memory.high penalty (sleep on return to user space).

      That is not to say that the knob itself is equivalent to memory.high.
      The objective is rather to protect the system from potentially buggy
      tasks consuming a lot of swap and impacting other tasks, or even
      bringing the whole system to a standstill through complete swap
      exhaustion, hopefully without the need to find per-task hard limits.
      
      Slowing misbehaving tasks down gradually allows user space oom killers
      or other protection mechanisms to react.  oomd and earlyoom already do
      killing based on swap exhaustion, and memory.swap.high protection will
      help implement such userspace oom policies more reliably.
      
      We can use a single counter for the number of pages allocated under
      pressure, to save struct task space and avoid two separate hierarchy
      walks on the hot path.  The exact overage is calculated on return to
      user space anyway.
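
      A hedged, purely illustrative sketch of the penalty model; the function
      name and the swap.high field below are hypothetical, not the kernel's
      actual API:

        /*
         * On return to user space: if the cgroup's swap usage exceeds
         * memory.swap.high, sleep for a duration that grows with the
         * overage, so userspace OOM handlers have time to react.
         */
        static void swap_high_penalty_sketch(struct mem_cgroup *memcg)
        {
                unsigned long usage = page_counter_read(&memcg->swap);
                unsigned long high = READ_ONCE(memcg->swap.high);
                unsigned long overage, penalty_jiffies;

                if (usage <= high)
                        return;

                overage = usage - high;
                /* Illustrative scaling: longer sleeps for larger overages,
                 * capped at a couple of seconds per return to user space. */
                penalty_jiffies = overage / 16;
                if (penalty_jiffies > 2 * HZ)
                        penalty_jiffies = 2 * HZ;
                schedule_timeout_killable(penalty_jiffies);
        }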
      
      Take the new high limit into account when determining if swap is "full".
      Borrowing the explanation from Johannes:
      
        The idea behind "swap full" is that as long as the workload has plenty
        of swap space available and it's not changing its memory contents, it
        makes sense to generously hold on to copies of data in the swap device,
        even after the swapin.  A later reclaim cycle can drop the page without
        any IO.  Trading disk space for IO.
      
        But the only two ways to reclaim a swap slot are when they're faulted
        in and the references go away, or by scanning the virtual address space
        like swapoff does - which is very expensive (one could argue it's too
        expensive even for swapoff, it's often more practical to just reboot).
      
        So at some point in the fill level, we have to start freeing up swap
        slots on fault/swapin.  Otherwise we could eventually run out of swap
        slots while they're filled with copies of data that is also in RAM.
      
        We don't want to OOM a workload because its available swap space is
        filled with redundant cache.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200527195846.102707-5-kuba@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: move cgroup high memory limit setting into struct page_counter · d1663a90
      Committed by Jakub Kicinski
      The high memory limit is currently recorded directly in struct
      mem_cgroup.  We are about to add a high limit for swap, so move the
      field to struct page_counter and add some helpers.
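
      A hedged sketch of the resulting layout; the field placement and the
      helper are a reconstruction, not the literal diff:

        struct page_counter {
                atomic_long_t usage;
                unsigned long min;
                unsigned long low;
                unsigned long high;     /* moved here from struct mem_cgroup */
                unsigned long max;
                /* ... other members omitted ... */
        };

        static inline void page_counter_set_high(struct page_counter *counter,
                                                 unsigned long nr_pages)
        {
                WRITE_ONCE(counter->high, nr_pages);
        }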
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200527195846.102707-4-kuba@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 15 May 2020, 1 commit
  5. 19 Apr 2020, 1 commit
    • memcontrol.h: Replace zero-length array with flexible-array member · 307ed94c
      Committed by Gustavo A. R. Silva
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      helps prevent some kinds of undefined-behavior bugs from being
      inadvertently introduced[3] into the codebase from now on.
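
      As a concrete before/after for this header, the change has the
      following shape (a hedged reconstruction of the affected declaration,
      with unrelated members omitted):

        /* Before: zero-length array, a GNU extension. */
        struct mem_cgroup {
                /* ... other members ... */
                struct mem_cgroup_per_node *nodeinfo[0];
        };

        /* After: C99 flexible array member. */
        struct mem_cgroup {
                /* ... other members ... */
                struct mem_cgroup_per_node *nodeinfo[];
        };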
      
      Also, note that dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
  6. 03 Apr 2020, 5 commits
  7. 30 Mar 2020, 1 commit
    • mm: fork: fix kernel_stack memcg stats for various stack implementations · 8380ce47
      Committed by Roman Gushchin
      Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio,
      the space for task stacks can be allocated using __vmalloc_node_range(),
      alloc_pages_node() or kmem_cache_alloc_node().

      In the first and second cases the page->mem_cgroup pointer is set, but
      in the third it is not: the memcg membership of a slab page must be
      determined using the memcg_from_slab_page() function, which looks at
      page->slab_cache->memcg_params.memcg.  In this case, using
      mod_memcg_page_state() (as account_kernel_stack() does) is incorrect:
      the page->mem_cgroup pointer is NULL even for pages charged to a
      non-root memory cgroup.
      
      It can lead to kernel_stack per-memcg counters permanently showing 0 on
      some architectures (depending on the configuration).
      
      To fix it, introduce a mod_memcg_obj_state() helper, which takes a
      pointer to a kernel object as its first argument, uses
      mem_cgroup_from_obj() to get an RCU-protected memcg pointer, and calls
      mod_memcg_state().  This handles all possible configurations
      (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
      spilling any memcg/kmem specifics into fork.c.
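
      A hedged sketch of the helper as described (simplified; the exact
      kernel code may differ):

        void mod_memcg_obj_state(void *p, int idx, int val)
        {
                struct mem_cgroup *memcg;

                rcu_read_lock();
                /* Works for slab objects and for pages with page->mem_cgroup set. */
                memcg = mem_cgroup_from_obj(p);
                if (memcg)
                        mod_memcg_state(memcg, idx, val);
                rcu_read_unlock();
        }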
      
      Note: This is a special version of the patch created for stable
      backports.  It contains code from the following two patches:
        - mm: memcg/slab: introduce mem_cgroup_from_obj()
        - mm: fork: fix kernel_stack memcg stats for various stack implementations
      
      [guro@fb.com: introduce mem_cgroup_from_obj()]
        Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
      Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 02 Dec 2019, 3 commits
  9. 01 Dec 2019, 3 commits
  10. 08 Oct 2019, 4 commits
    • mm, memcg: make scan aggression always exclude protection · 1bc63fb1
      Committed by Chris Down
      This patch is an incremental improvement on the existing
      memory.{low,min} relative reclaim work to base its scan pressure
      calculations on how much protection is available compared to the current
      usage, rather than how much the current usage is over some protection
      threshold.
      
      This change doesn't alter the user experience much in the normal case.
      One benefit is that it replaces the (somewhat arbitrary) 100% cutoff
      with an indefinite slope, which makes it easier to ballpark a memory.low
      value.
      
      As well as this, the old methodology doesn't quite apply generically to
      machines with varying amounts of physical memory.  Let's say we have a
      top level cgroup, workload.slice, and another top level cgroup,
      system-management.slice.  We want to roughly give 12G to
      system-management.slice, so on a 32GB machine we set memory.low to 20GB
      in workload.slice, and on a 64GB machine we set memory.low to 52GB.
      However, because these are amounts relative to the total machine size,
      while the amount of memory we want to be willing to yield to
      system-management.slice is absolute (12G), we end up putting more
      pressure on system-management.slice just because we have a larger
      machine and a larger workload to fill it, which seems fairly
      unintuitive.  With this new behaviour, we don't end up with this
      unintended side effect.
      
      Previously, the way memory.low protection worked was that if you were
      50% over a certain baseline, you got 50% of your normal scan pressure.
      This is certainly better than the previous cliff-edge behaviour, but it
      can be improved even further by always considering memory under the
      currently enforced protection threshold to be out of bounds.  This means
      that we can set relatively low memory.low thresholds for variable or
      bursty workloads while still getting a reasonable level of protection,
      whereas with the previous version we may still trivially hit the 100%
      clamp.  The previous 100% clamp is also somewhat arbitrary, whereas this
      one is more concretely based on the currently enforced protection
      threshold, which is likely easier to reason about.
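
      A hedged, illustrative sketch of the resulting scan-target calculation
      (not the kernel's exact code): memory under the enforced protection is
      excluded outright, and pressure scales with the unprotected share of
      the cgroup's usage.

        static unsigned long scan_target_sketch(unsigned long lruvec_size,
                                                unsigned long usage,
                                                unsigned long protection)
        {
                if (usage <= protection)
                        return 0;       /* fully protected: nothing to scan */

                /* Scale the scan target by the unprotected share of usage. */
                return lruvec_size - lruvec_size * protection / usage;
        }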
      
      There is also a subtle issue with the way that proportional reclaim
      worked previously -- it promotes having no memory.low, since it makes
      pressure higher during low reclaim.  This happens because we base our
      scan pressure modulation on how far memory.current is between memory.min
      and memory.low, but if memory.low is unset, we only use the overage
      method.  In most cromulent configurations, this then means that we end
      up with *more* pressure than with no memory.low at all when we're in low
      reclaim, which is not really very usable or expected.
      
      With this patch, memory.low and memory.min affect reclaim pressure in a
      more understandable and composable way.  For example, from a user
      standpoint, "protected" memory now remains untouchable from a reclaim
      aggression standpoint, and users can also have more confidence that
      bursty workloads will still receive some amount of guaranteed
      protection.
      
      Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: make memory.emin the baseline for utilisation determination · 9de7ca46
      Committed by Chris Down
      Roman points out that when we do the low reclaim pass, we scale the
      reclaim pressure relative to the position between 0 and the maximum
      protection threshold.
      
      However, if the maximum protection is based on memory.elow, and
      memory.emin is above zero, this means we still may get binary behaviour
      on second-pass low reclaim.  This is because we scale starting at 0, not
      starting at memory.emin, and since we don't scan at all below emin, we
      end up with cliff behaviour.
      
      This should be a fairly uncommon case since usually we don't go into the
      second pass, but it makes sense to scale our low reclaim pressure
      starting at emin.
      
      You can test this by catting two large sparse files, one in a cgroup
      with emin set to some moderate size compared to physical RAM, and
      another cgroup without any emin.  In both cgroups, set an elow larger
      than 50% of physical RAM.  The one with emin will have less page
      scanning, as reclaim pressure is lower.
      
      Rebase on top of and apply the same idea as what was applied to handle
      cgroup_disable=memory properly for the original proportional patch
      http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm,
      memcg: Handle cgroup_disable=memory when getting memcg protection").
      
      Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Suggested-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: proportional memory.{low,min} reclaim · 9783aa99
      Committed by Chris Down
      cgroup v2 introduces two memory protection thresholds: memory.low
      (best-effort) and memory.min (hard protection).  While they generally do
      what they say on the tin, there is a limitation in their implementation
      that makes them difficult to use effectively: that cliff behaviour often
      manifests when they become eligible for reclaim.  This patch implements
      more intuitive and usable behaviour, where we gradually mount more
      reclaim pressure as cgroups further and further exceed their protection
      thresholds.
      
      This cliff edge behaviour happens because we only choose whether or not
      to reclaim based on whether the memcg is within its protection limits
      (see the use of mem_cgroup_protected in shrink_node), but we don't vary
      our reclaim behaviour based on this information.  Imagine the following
      timeline, with the numbers the lruvec size in this zone:
      
      1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
      2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
      3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
         scanned. (?!)
      
      * Of course, we won't usually scan all available pages in the zone even
        without this patch because of scan control priority, over-reclaim
        protection, etc.  However, as shown by the tests at the end, these
        techniques don't sufficiently throttle such an extreme change in input,
        so cliff-like behaviour isn't really averted by their existence alone.
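
      A hedged, illustrative sketch of the gradual ramp this patch introduces
      (not the kernel's exact code): scan pressure grows with how far usage
      exceeds the protection threshold and is clamped at full pressure.

        static unsigned long scan_pressure_permille(unsigned long usage,
                                                    unsigned long protection)
        {
                unsigned long overage;

                if (!protection || usage <= protection)
                        return 0;       /* within protection: no pressure */

                /* e.g. 50% over the protection -> 50% of normal pressure */
                overage = usage - protection;
                if (overage >= protection)
                        return 1000;    /* 100% of normal scan pressure */
                return 1000 * overage / protection;
        }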
      
      Here's an example of how this plays out in practice.  At Facebook, we are
      trying to protect various workloads from "system" software, like
      configuration management tools, metric collectors, etc (see this[0] case
      study).  In order to find a suitable memory.low value, we start by
      determining the expected memory range within which the workload will be
      comfortable operating.  This isn't an exact science -- memory usage
      deemed "comfortable" will vary over time due to user behaviour,
      differences in the composition of work, and so on.  As such we need to
      ballpark memory.low, but doing this is currently problematic:
      
      1. If we end up setting it too low for the workload, it won't have
         *any* effect (see discussion above).  The group will receive the full
         weight of reclaim and won't have any priority while competing with the
         less important system software, as if we had no memory.low configured
         at all.
      
      2. Because of this behaviour, we end up erring on the side of setting
         it too high, such that the comfort range is reliably covered.  However,
         protected memory is completely unavailable to the rest of the system,
         so we might cause undue memory and IO pressure there when we *know* we
         have some elasticity in the workload.
      
      3. Even if we get the value totally right, smack in the middle of the
         comfort zone, we get extreme jumps between no pressure and full
         pressure that cause unpredictable pressure spikes in the workload due
         to the current binary reclaim behaviour.
      
      With this patch, we can set it to our ballpark estimation without too
      much worry.  Any undesirable behaviour, such as too much or too little
      reclaim pressure on the workload or system, will be proportional to how
      far our estimation is off.  This means we can set memory.low much more
      conservatively and thus waste fewer resources *without* the risk of the
      workload falling off a cliff if we overshoot.
      
      As a more abstract technical description, this unintuitive behaviour
      results in having to give high-priority workloads a large protection
      buffer on top of their expected usage to function reliably, as otherwise
      we have abrupt periods of dramatically increased memory pressure which
      hamper performance.  Having to set these thresholds so high wastes
      resources and generally works against the principle of work conservation.
      In addition, having proportional memory reclaim behaviour has other
      benefits.  Most notably, before this patch it's basically mandatory to set
      memory.low to a higher than desirable value because otherwise as soon as
      you exceed memory.low, all protection is lost, and all pages are eligible
      to scan again.  By contrast, having a gradual ramp in reclaim pressure
      means that you now still get some protection when thresholds are exceeded,
      which means that one can now be more comfortable setting memory.low to
      lower values without worrying that all protection will be lost.  This is
      important because workingset size is really hard to know exactly,
      especially with variable workloads, so at least getting *some* protection
      if your workingset size grows larger than you expect increases user
      confidence in setting memory.low without a huge buffer on top being
      needed.
      
      Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
      assistance in thinking about how to make this work better.
      
      In testing these changes, I intended to verify that:
      
      1. Changes in page scanning become gradual and proportional instead of
         binary.
      
         To test this, I experimented with stepping further and further down
         on memory.low protection for a workload that floats around a 19G
         workingset when under memory.low protection, watching page scan
         rates for the workload cgroup:
      
         +------------+-----------------+--------------------+--------------+
         | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
         +------------+-----------------+--------------------+--------------+
         |        21G |               0 |                  0 | N/A          |
         |        17G |             867 |               3799 | 23%          |
         |        12G |            1203 |               3543 | 34%          |
         |         8G |            2534 |               3979 | 64%          |
         |         4G |            3980 |               4147 | 96%          |
         |          0 |            3799 |               3980 | 95%          |
         +------------+-----------------+--------------------+--------------+
      
         As you can see, the test kernel (with a kernel containing this
         patch) ramps up page scanning significantly more gradually than the
         control kernel (without this patch).
      
      2. More gradual ramp up in reclaim aggression doesn't result in
         premature OOMs.
      
         To test this, I wrote a script that slowly increments the number of
         pages held by stress(1)'s --vm-keep mode until a production system
         entered severe overall memory contention.  This script runs in a highly
         protected slice taking up the majority of available system memory.
         Watching vmstat revealed that page scanning continued essentially
         nominally between test and control, without causing forward reclaim
         progress to become arrested.
      
      [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
      
      [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
      [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
        Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
      Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: only record foreign writebacks with dirty pages when memcg is not disabled · 08d1d0e6
      Committed by Baoquan He
      In the kdump kernel, memcg is usually disabled with
      'cgroup_disable=memory' to save memory.  Now the kdump kernel will
      always panic when dumping the vmcore to a local disk:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000ab8
        Oops: 0000 [#1] SMP NOPTI
        CPU: 0 PID: 598 Comm: makedumpfile Not tainted 5.3.0+ #26
        Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 10/02/2018
        RIP: 0010:mem_cgroup_track_foreign_dirty_slowpath+0x38/0x140
        Call Trace:
         __set_page_dirty+0x52/0xc0
         iomap_set_page_dirty+0x50/0x90
         iomap_write_end+0x6e/0x270
         iomap_write_actor+0xce/0x170
         iomap_apply+0xba/0x11e
         iomap_file_buffered_write+0x62/0x90
         xfs_file_buffered_aio_write+0xca/0x320 [xfs]
         new_sync_write+0x12d/0x1d0
         vfs_write+0xa5/0x1a0
         ksys_write+0x59/0xd0
         do_syscall_64+0x59/0x1e0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      And the same crash will also hit the 1st kernel when it is booted with
      'cgroup_disable=memory'.
      
      The trace and further debugging point to commit 97b27821
      ("writeback, memcg: Implement foreign dirty flushing"), which introduced
      this regression: with memcg disabled,
      mem_cgroup_track_foreign_dirty_slowpath() dereferences uninitialized
      data, causing the NULL pointer dereference.
      
      Fix it by returning early when memcg is disabled, instead of trying to
      record foreign writebacks with dirty pages.
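
      A hedged sketch of the shape of the fix in the memcontrol.h wrapper (a
      reconstruction, not the literal diff):

        static inline void mem_cgroup_track_foreign_dirty(struct page *page,
                                                          struct bdi_writeback *wb)
        {
                if (mem_cgroup_disabled())
                        return;

                if (unlikely(&page->mem_cgroup->css != wb->memcg_css))
                        mem_cgroup_track_foreign_dirty_slowpath(page, wb);
        }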
      
      Link: http://lkml.kernel.org/r/20190924141928.GD31919@MiWiFi-R3L-srv
      Fixes: 97b27821 ("writeback, memcg: Implement foreign dirty flushing")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 25 Sep 2019, 2 commits
  12. 27 Aug 2019, 1 commit
    • writeback, memcg: Implement foreign dirty flushing · 97b27821
      Committed by Tejun Heo
      There's an inherent mismatch between memcg and writeback.  The former
      tracks ownership per page while the latter tracks it per inode.  This
      was a deliberate design decision because honoring per-page ownership in
      the writeback path is complicated, may lead to higher CPU and IO
      overheads, and was deemed unnecessary given that write-sharing an inode
      across different cgroups isn't a common use case.
      
      Combined with inode majority-writer ownership switching, this works
      well enough in most cases but there are some pathological cases.  For
      example, let's say there are two cgroups A and B which keep writing to
      different but confined parts of the same inode.  B owns the inode and
      A's memory is limited far below B's.  A's dirty ratio can rise enough
      to trigger balance_dirty_pages() sleeps but B's can be low enough to
      avoid triggering background writeback.  A will be slowed down without
      a way to make writeback of the dirty pages happen.
      
      This patch implements foreign dirty recording and a foreign-flush
      mechanism so that when a memcg encounters a condition like the above it
      can trigger flushes on the bdi_writebacks that can clean its pages.
      Please see the comment on top of
      mem_cgroup_track_foreign_dirty_slowpath() for details.
      
      A reproducer follows.
      
      write-range.c::
      
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <sys/types.h>
      
        static const char *usage = "write-range FILE START SIZE\n";
      
        int main(int argc, char **argv)
        {
      	  int fd;
      	  unsigned long start, size, end, pos;
      	  char *endp;
      	  char buf[4096];
      
      	  if (argc < 4) {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  fd = open(argv[1], O_WRONLY);
      	  if (fd < 0) {
      		  perror("open");
      		  return 1;
      	  }
      
      	  start = strtoul(argv[2], &endp, 0);
      	  if (*endp != '\0') {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  size = strtoul(argv[3], &endp, 0);
      	  if (*endp != '\0') {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  end = start + size;
      
      	  while (1) {
      		  for (pos = start; pos < end; ) {
      			  long bread, bwritten = 0;
      
      			  if (lseek(fd, pos, SEEK_SET) < 0) {
      				  perror("lseek");
      				  return 1;
      			  }
      
      			  bread = read(0, buf, sizeof(buf) < end - pos ?
      					       sizeof(buf) : end - pos);
      			  if (bread < 0) {
      				  perror("read");
      				  return 1;
      			  }
      			  if (bread == 0)
      				  return 0;
      
      			  while (bwritten < bread) {
      				  long this;
      
      				  this = write(fd, buf + bwritten,
      					       bread - bwritten);
      				  if (this < 0) {
      					  perror("write");
      					  return 1;
      				  }
      
      				  bwritten += this;
      				  pos += this;
      			  }
      		  }
      	  }
        }
      
      repro.sh::
      
        #!/bin/bash
      
        set -e
        set -x
      
        sysctl -w vm.dirty_expire_centisecs=300000
        sysctl -w vm.dirty_writeback_centisecs=300000
        sysctl -w vm.dirtytime_expire_seconds=300000
        echo 3 > /proc/sys/vm/drop_caches
      
        TEST=/sys/fs/cgroup/test
        A=$TEST/A
        B=$TEST/B
      
        mkdir -p $A $B
        echo "+memory +io" > $TEST/cgroup.subtree_control
        echo $((1<<30)) > $A/memory.high
        echo $((32<<30)) > $B/memory.high
      
        rm -f testfile
        touch testfile
        fallocate -l 4G testfile
      
        echo "Starting B"
      
        (echo $BASHPID > $B/cgroup.procs
         pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &
      
        echo "Waiting 10s to ensure B claims the testfile inode"
        sleep 5
        sync
        sleep 5
        sync
        echo "Starting A"
      
        (echo $BASHPID > $A/cgroup.procs
         pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))
      
      v2: Added comments explaining why the specific intervals are being used.
      
      v3: Use 0 @nr when calling cgroup_writeback_by_id() to use best-effort
          flushing while avoiding possible livelocks.
      
      v4: Use get_jiffies_64() and time_before/after64() instead of raw
          jiffies_64 and arithmetic comparisons as suggested by Jan.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. 14 Aug 2019, 1 commit
  14. 13 Jul 2019, 2 commits