1. 08 Aug 2020 (11 commits)
    • mm: memcg/slab: use a single set of kmem_caches for all accounted allocations · 9855609b
      Committed by Roman Gushchin
      This is a fairly big but mostly red patch, which makes all accounted slab
      allocations use a single set of kmem_caches instead of creating a separate
      set for each memory cgroup.
      
      Because the number of non-root kmem_caches is now capped by the number of
      root kmem_caches, there is no need to shrink or destroy them prematurely.
      They can simply be destroyed together with their root counterparts.
      This allows us to dramatically simplify the management of non-root
      kmem_caches and delete a ton of code.
      
      This patch performs the following changes:
      1) introduces memcg_params.memcg_cache pointer to represent the
         kmem_cache which will be used for all non-root allocations
      2) reuses the existing memcg kmem_cache creation mechanism
         to create memcg kmem_cache on the first allocation attempt
      3) memcg kmem_caches are named <kmemcache_name>-memcg,
         e.g. dentry-memcg
      4) simplifies memcg_kmem_get_cache() to just return the memcg kmem_cache
         or schedule its creation and return the root cache (a sketch of the
         simplified lookup follows this list)
      5) removes almost all non-root kmem_cache management code
         (separate refcounter, reparenting, shrinking, etc)
      6) makes the slab debugfs code display the root_mem_cgroup css id and
         never show the :dead and :deact flags in the memcg_slabinfo attribute.
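
      A minimal sketch of the simplified lookup described in (4), assuming the
      new memcg_params.memcg_cache pointer and a pre-initialized creation work
      item (the work item name is illustrative):

        static inline struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
        {
                struct kmem_cache *memcg_cachep;

                memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
                if (unlikely(!memcg_cachep)) {
                        /* Kick off asynchronous creation, fall back to the root cache. */
                        queue_work(system_wq, &cachep->memcg_params.work);
                        return cachep;
                }

                return memcg_cachep;
        }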
      
      Following patches in the series will simplify the kmem_cache creation.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9855609b
    • mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h · 0f876e4d
      Committed by Roman Gushchin
      To make the memcg_kmem_bypass() function available outside of
      memcontrol.c, let's move it to memcontrol.h.  The function is small and
      is a good fit for a static inline.
      
      It will be used from the slab code.
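
      A sketch of the kind of check such a static inline helper performs; the
      exact conditions here are an assumption, not the in-tree definition:

        static inline bool memcg_kmem_bypass(void)
        {
                /* No memcg to charge in interrupt context. */
                if (in_interrupt())
                        return true;

                /* Kernel threads and tasks without an mm are not charged. */
                if (!current->mm || (current->flags & PF_KTHREAD))
                        return true;

                return false;
        }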
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-12-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f876e4d
    • mm: memcg/slab: deprecate memory.kmem.slabinfo · 4330a26b
      Committed by Roman Gushchin
      Deprecate memory.kmem.slabinfo.
      
      An empty file will be presented if the corresponding config options are
      enabled.
      
      The interface is implementation dependent, isn't present in cgroup v2, and
      is generally useful only for core mm debugging purposes.  In other words,
      it doesn't provide any value for the absolute majority of users.
      
      A drgn-based replacement can be found in
      tools/cgroup/memcg_slabinfo.py.  It supports both cgroup v1 and v2,
      mimics the memory.kmem.slabinfo output and also allows getting any
      additional information without the need to recompile the kernel.
      
      If a drgn-based solution is too slow for a task, a bpf-based tracing tool
      can be used, which can easily keep track of all slab allocations belonging
      to a memory cgroup.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-11-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4330a26b
    • mm: memcg/slab: save obj_cgroup for non-root slab objects · 964d4bd3
      Committed by Roman Gushchin
      Store the obj_cgroup pointer in the corresponding place of
      page->obj_cgroups for each allocated non-root slab object.  Make sure that
      each allocated object holds a reference to obj_cgroup.
      
      The objcg pointer is obtained by dereferencing memcg->objcg in
      memcg_kmem_get_cache() and is passed from pre_alloc_hook to
      post_alloc_hook.  In case of successful allocation(s) it is then stored
      in the page->obj_cgroups vector.
      
      The objcg-obtaining part looks a bit bulky now, but it will be simplified
      by subsequent commits in the series.  A minimal sketch of the
      post-allocation step is shown below.
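
      A minimal sketch of that post-allocation step, assuming helpers such as
      page_obj_cgroups() and obj_to_index() for locating the slot of each
      allocated object:

        static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
                                                      struct obj_cgroup *objcg,
                                                      size_t size, void **p)
        {
                struct page *page;
                unsigned int off;
                size_t i;

                if (!objcg)
                        return;

                for (i = 0; i < size; i++) {
                        if (!p[i])
                                continue;
                        page = virt_to_head_page(p[i]);
                        off = obj_to_index(s, page, p[i]);
                        /* Each allocated object holds a reference to its obj_cgroup. */
                        obj_cgroup_get(objcg);
                        page_obj_cgroups(page)[off] = objcg;
                }
        }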
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-9-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      964d4bd3
    • mm: memcg/slab: allocate obj_cgroups for non-root slab pages · 286e04b8
      Committed by Roman Gushchin
      Allocate and release memory to store obj_cgroup pointers for each non-root
      slab page. Reuse page->mem_cgroup pointer to store a pointer to the
      allocated space.
      
      This commit temporarily increases the memory footprint of the kernel memory
      accounting. To store obj_cgroup pointers we'll need a place for an
      objcg_pointer for each allocated object. However, the following patches
      in the series will enable sharing of slab pages between memory cgroups,
      which will dramatically increase the total slab utilization. And the final
      memory footprint will be significantly smaller than before.
      
      To distinguish between obj_cgroups and memcg pointers in cases when it's
      not obvious which one is used (as in page_cgroup_ino()), let's always set
      the lowest bit in the obj_cgroup case (see the sketch below).  The
      original obj_cgroups pointer is marked to be ignored by kmemleak, which
      would otherwise report a memory leak for each allocated vector.
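
      A sketch of the allocation path described above, assuming the vector
      shares space with page->mem_cgroup and is tagged with the lowest bit:

        static int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
                                                unsigned int objects)
        {
                void *vec;

                vec = kcalloc(objects, sizeof(struct obj_cgroup *), gfp);
                if (!vec)
                        return -ENOMEM;

                /* The vector is reachable only via page->obj_cgroups: tell kmemleak. */
                kmemleak_not_leak(vec);

                /* The lowest bit marks this pointer as an obj_cgroups vector. */
                page->obj_cgroups = (struct obj_cgroup **)((unsigned long)vec | 0x1UL);
                return 0;
        }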
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-8-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      286e04b8
    • mm: memcg/slab: obj_cgroup API · bf4f0599
      Committed by Roman Gushchin
      The obj_cgroup API provides the ability to account sub-page-sized kernel
      objects, which can potentially outlive the original memory cgroup.
      
      The top-level API consists of the following functions:
        bool obj_cgroup_tryget(struct obj_cgroup *objcg);
        void obj_cgroup_get(struct obj_cgroup *objcg);
        void obj_cgroup_put(struct obj_cgroup *objcg);
      
        int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
        void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
      
        struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
        struct obj_cgroup *get_obj_cgroup_from_current(void);
      
      An object cgroup is basically a pointer to a memory cgroup with a per-cpu
      reference counter.  It substitutes for a memory cgroup in places where
      it's necessary to charge a custom number of bytes instead of pages.  A
      minimal usage sketch follows.
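
      A minimal usage sketch of the API above; the charge site, obj_size and
      function names are illustrative:

        static int charge_one_object(size_t obj_size, struct obj_cgroup **objcgp)
        {
                struct obj_cgroup *objcg;

                /* Takes a reference; may return NULL when there is nothing to charge. */
                objcg = get_obj_cgroup_from_current();
                if (!objcg)
                        return 0;

                if (obj_cgroup_charge(objcg, GFP_KERNEL, obj_size)) {
                        obj_cgroup_put(objcg);
                        return -ENOMEM;
                }

                *objcgp = objcg;        /* keep it for the uncharge on free */
                return 0;
        }

        static void uncharge_one_object(struct obj_cgroup *objcg, size_t obj_size)
        {
                if (!objcg)
                        return;
                obj_cgroup_uncharge(objcg, obj_size);
                obj_cgroup_put(objcg);
        }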
      
      All charged memory rounded down to pages is charged to the corresponding
      memory cgroup using __memcg_kmem_charge().
      
      It implements reparenting: on memcg offlining it gets reattached to the
      parent memory cgroup.  Each online memory cgroup has an associated active
      object cgroup to handle new allocations, plus a list of all attached
      object cgroups.  On offlining of a cgroup this list is reparented, and
      for each object cgroup on the list the memcg pointer is swapped to the
      parent memory cgroup.  This prevents long-living objects from pinning the
      original memory cgroup in memory.
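
      A sketch of that reparenting step, assuming each memcg keeps its attached
      object cgroups on an objcg_list and that objcg->memcg updates are
      serialized by css_set_lock (field names are assumptions):

        static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
                                          struct mem_cgroup *parent)
        {
                struct obj_cgroup *objcg, *iter;

                /* Detach the active objcg; new allocations won't see it anymore. */
                objcg = rcu_replace_pointer(memcg->objcg, NULL, true);

                spin_lock_irq(&css_set_lock);

                /* Repoint the active objcg and all previously reparented ones. */
                list_add(&objcg->list, &memcg->objcg_list);
                list_for_each_entry(iter, &memcg->objcg_list, list)
                        WRITE_ONCE(iter->memcg, parent);
                list_splice(&memcg->objcg_list, &parent->objcg_list);

                spin_unlock_irq(&css_set_lock);

                percpu_ref_kill(&objcg->refcnt);
        }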
      
      The implementation is based on byte-sized per-cpu stocks.  A
      sub-page-sized leftover is stored in an atomic field, which is part of
      the obj_cgroup object.  So on cgroup offlining the leftover is
      automatically reparented.
      
      memcg->objcg is RCU protected.  objcg->memcg is a raw pointer, which
      always points at a memory cgroup, but can be atomically swapped to the
      parent memory cgroup.  So a user must ensure the lifetime of the cgroup,
      e.g. by grabbing rcu_read_lock or css_set_lock.
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-7-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf4f0599
    • mm: memcontrol: decouple reference counting from page accounting · 1a3e1f40
      Committed by Johannes Weiner
      The reference counting of a memcg is currently coupled directly to how
      many 4k pages are charged to it.  This doesn't work well with Roman's new
      slab controller, which maintains pools of objects and doesn't want to keep
      an extra balance sheet for the pages backing those objects.
      
      This unusual refcounting design (reference counts usually track pointers
      to an object) is only for historical reasons: memcg used to not take any
      css references and simply stalled offlining until all charges had been
      reparented and the page counters had dropped to zero.  When we got rid of
      the reparenting requirement, the simple mechanical translation was to take
      a reference for every charge.
      
      More historical context can be found in commit e8ea14cc ("mm:
      memcontrol: take a css reference for each charged page"), commit
      64f21993 ("mm: memcontrol: remove obsolete kmemcg pinning tricks") and
      commit b2052564 ("mm: memcontrol: continue cache reclaim from offlined
      groups").
      
      The new slab controller exposes the limitations of this scheme, so let's
      switch it to a more idiomatic reference counting model based on actual
      kernel pointers to the memcg (a minimal sketch follows the list below):
      
      - The per-cpu stock holds a reference to the memcg it is caching
      
      - User pages hold a reference for their page->mem_cgroup. Transparent
        huge pages will no longer acquire tail references in advance; we'll
        acquire them if needed during the split.
      
      - Kernel pages hold a reference for their page->mem_cgroup
      
      - Pages allocated in the root cgroup will acquire and release css
        references for simplicity. css_get() and css_put() optimize that.
      
      - The current memcg_charge_slab() already hacked around the per-charge
        references; this change gets rid of that as well.
      
      - TCP accounting will handle its reference in mem_cgroup_sk_{alloc,free}
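
      A minimal sketch of the per-pointer model from the list above (function
      names are illustrative):

        static void commit_charge(struct page *page, struct mem_cgroup *memcg)
        {
                /* One css reference per page->mem_cgroup pointer. */
                css_get(&memcg->css);
                page->mem_cgroup = memcg;
        }

        static void clear_page_memcg(struct page *page)
        {
                struct mem_cgroup *memcg = page->mem_cgroup;

                page->mem_cgroup = NULL;
                /* Drop the reference taken when the pointer was installed. */
                css_put(&memcg->css);
        }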
      
      Roman:
      1) Rebased on top of the current mm tree: added css_get() in
         mem_cgroup_charge(), dropped mem_cgroup_try_charge() part
      2) I've reformatted commit references in the commit log to make
         checkpatch.pl happy.
      
      [hughd@google.com: remove css_put_many() from __mem_cgroup_clear_mc()]
        Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2007302011450.2347@eggly.anvils
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-6-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a3e1f40
    • mm: memcg: convert vmstat slab counters to bytes · d42f3245
      Committed by Roman Gushchin
      In order to prepare for per-object slab memory accounting, convert
      NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.
      
      To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
      NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).
      
      Internally, global and per-node counters are stored in pages, while memcg
      and lruvec counters are stored in bytes.  This scheme may look odd, but
      only for now.  Once slab pages are shared between multiple cgroups,
      global and node counters will reflect the total number of slab pages.
      The memcg and lruvec counters, however, will be used for per-memcg slab
      memory tracking, which accounts individual kernel objects.  Keeping
      global and node counters in pages helps avoid additional overhead.
      
      The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it
      will fit into the atomic_long_t we use for vmstat counters.  A write-side
      sketch follows.
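
      A write-side sketch, assuming a helper like cache_vmstat_idx() that maps
      a cache to NR_SLAB_RECLAIMABLE_B or NR_SLAB_UNRECLAIMABLE_B:

        static void account_slab_page(struct page *page, int order,
                                      struct kmem_cache *s)
        {
                /*
                 * The delta is passed in bytes; the node-level counter keeps
                 * pages internally, the memcg/lruvec counters keep bytes.
                 */
                mod_lruvec_page_state(page, cache_vmstat_idx(s),
                                      PAGE_SIZE << order);
        }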
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d42f3245
    • mm: memcg: prepare for byte-sized vmstat items · ea426c2a
      Committed by Roman Gushchin
      To implement per-object slab memory accounting, we need to convert slab
      vmstat counters to bytes.  Actually, out of the four levels of counters
      (global, per-node, per-memcg and per-lruvec), only the last two require
      byte-sized counters.  This is because global and per-node counters will
      count the number of slab pages, while per-memcg and per-lruvec counters
      will count the amount of memory taken by charged slab objects.
      
      Converting all vmstat counters to bytes, or even just all slab counters
      to bytes, would introduce additional overhead.  So instead let's store
      global and per-node counters in pages, and memcg and lruvec counters in
      bytes.
      
      To keep the API clean, all access helpers (on both the read and write
      sides) deal with bytes.
      
      To avoid back-and-forth conversions a new flavor of read-side helpers is
      introduced, which always returns values in pages: node_page_state_pages()
      and global_node_page_state_pages().
      
      Actually, the new helpers just read raw values.  The old helpers are
      simple wrappers that will complain on an attempt to read a byte-sized
      value, because at the moment nobody actually needs bytes.  A read-side
      sketch follows.
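
      A read-side sketch of the two flavors, assuming a vmstat_item_in_bytes()
      predicate for the byte-sized items:

        unsigned long node_page_state_pages(struct pglist_data *pgdat,
                                            enum node_stat_item item)
        {
                long x = atomic_long_read(&pgdat->vm_stat[item]);

                return x < 0 ? 0 : x;   /* raw value, stored in pages */
        }

        unsigned long node_page_state(struct pglist_data *pgdat,
                                      enum node_stat_item item)
        {
                /* Byte-sized items must be read via the *_pages() flavor. */
                VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));

                return node_page_state_pages(pgdat, item);
        }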
      
      Thanks to Johannes Weiner for the idea of having the byte-sized API on top
      of the page-sized internal storage.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-3-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea426c2a
    • mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() · eedc4e5a
      Committed by Roman Gushchin
      Patch series "The new cgroup slab memory controller", v7.
      
      The patchset moves the accounting from the page level to the object
      level.  It allows sharing slab pages between memory cgroups.  This leads
      to a significant win in slab utilization (up to 45%) and a corresponding
      drop in the total kernel memory footprint.  The reduced number of
      unmovable slab pages should also have a positive effect on memory
      fragmentation.
      
      The patchset makes the slab accounting code simpler: there is no longer a
      need for the complicated dynamic creation and destruction of per-cgroup
      slab caches; all memory cgroups use a global set of shared slab caches.
      The lifetime of slab caches is no longer tied to the lifetime of memory
      cgroups.
      
      The more precise accounting does require more CPU; however, in practice
      the difference seems to be negligible.  We've been using the new slab
      controller in Facebook production for several months with different
      workloads and haven't seen any noticeable regressions.  What we've seen
      were memory savings on the order of 1 GB per host (it varied heavily
      depending on the actual workload, size of RAM, number of CPUs, memory
      pressure, etc).
      
      The third version of the patchset added yet another step towards the
      simplification of the code: sharing of slab caches between accounted and
      non-accounted allocations.  It comes with significant upsides (most
      noticeably, the complete elimination of dynamic slab cache creation) but
      not without some regression risks, so this change sits on top of the
      patchset and is not completely merged in.  So in the unlikely event of a
      noticeable performance regression it can be reverted separately.
      
      The slab memory accounting works in exactly the same way for SLAB and
      SLUB.  With both allocators the new controller shows significant memory
      savings; with SLUB the difference is bigger.  On my 16-core desktop
      machine running Fedora 32, the size of the slab memory measured after
      system start was lower by 58% and 38% with SLUB and SLAB respectively.
      
      As an estimate of the potential CPU overhead, below are the results of
      the slab_bulk_test01 test, kindly provided by Jesper D.  Brouer, who also
      helped with the evaluation of the results.
      
      The test can be found here: https://github.com/netoptimizer/prototype-kernel/
      The smallest number in each row should be picked for a comparison.
      
      SLUB-patched - bulk-API
       - SLUB-patched : bulk_quick_reuse objects=1 : 187 -  90 - 224  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=2 : 110 -  53 - 133  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=3 :  88 -  95 -  42  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=4 :  91 -  85 -  36  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=8 :  32 -  66 -  32  cycles(tsc)
      
      SLUB-original -  bulk-API
       - SLUB-original: bulk_quick_reuse objects=1 :  87 -  87 - 142  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=2 :  52 -  53 -  53  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=3 :  42 -  42 -  91  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=4 :  91 -  37 -  37  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=8 :  31 -  79 -  76  cycles(tsc)
      
      SLAB-patched -  bulk-API
       - SLAB-patched : bulk_quick_reuse objects=1 :  67 -  67 - 140  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=2 :  55 -  46 -  46  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=3 :  93 -  94 -  39  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=4 :  35 -  88 -  85  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=8 :  30 -  30 -  30  cycles(tsc)
      
      SLAB-original-  bulk-API
       - SLAB-original: bulk_quick_reuse objects=1 : 143 - 136 -  67  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=2 :  45 -  46 -  46  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=3 :  38 -  39 -  39  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=4 :  35 -  87 -  87  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=8 :  29 -  66 -  30  cycles(tsc)
      
      This patch (of 19):
      
      To convert memcg and lruvec slab counters to bytes there must be a way to
      change these counters without touching node counters.  Factor
      __mod_memcg_lruvec_state() out of __mod_lruvec_state(); a sketch of the
      resulting split follows.
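
      A sketch of the resulting split (the memcg/lruvec body is abbreviated):

        void __mod_memcg_lruvec_state(struct lruvec *lruvec,
                                      enum node_stat_item idx, int val)
        {
                /* Update memcg and per-memcg per-node (lruvec) counters only. */
        }

        void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
                                int val)
        {
                /* Update the node-level counter. */
                __mod_node_page_state(lruvec_pgdat(lruvec), idx, val);

                /* Update memcg and lruvec counters. */
                if (!mem_cgroup_disabled())
                        __mod_memcg_lruvec_state(lruvec, idx, val);
        }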
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20200623174037.3951353-2-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eedc4e5a
    • mm: kmem: make memcg_kmem_enabled() irreversible · d648bcc7
      Committed by Roman Gushchin
      Historically, kernel memory accounting was an opt-in feature, which could
      be enabled for individual cgroups.  That's no longer true: it's on by
      default on both cgroup v1 and cgroup v2, and as long as a user has at
      least one non-root memory cgroup, kernel memory accounting is on.  So in
      most setups it's either always on (if memory cgroups are in use and kmem
      accounting is not disabled) or always off (otherwise).
      
      memcg_kmem_enabled() is used in many places to guard the kernel memory
      accounting code.  If memcg_kmem_enabled() can flip from returning true
      back to returning false (as it can now), we can't rely on it on release
      paths and have to check whether it was on before.
      
      If we make memcg_kmem_enabled() irreversible (always returning true once
      it has returned true for the first time), the general logic becomes
      simpler and more robust.  It will also allow guarding some checks which
      would otherwise stay unguarded.  A sketch of the resulting one-way switch
      follows.
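
      A sketch of the resulting one-way switch; the key name and branch hint
      are taken as assumptions here:

        DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);

        static inline bool memcg_kmem_enabled(void)
        {
                return static_branch_likely(&memcg_kmem_enabled_key);
        }

        static void memcg_online_kmem(struct mem_cgroup *memcg)
        {
                /* ... per-memcg setup ... */

                /*
                 * One-way switch: enabled when the first cgroup with kmem
                 * accounting comes online, never decremented or disabled.
                 */
                static_branch_enable(&memcg_kmem_enabled_key);
        }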
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200702180926.1330769-1-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d648bcc7
  2. 25 Jul 2020 (2 commits)
    • mm/memcg: fix refcount error while moving and swapping · 8d22a935
      Committed by Hugh Dickins
      It was hard to keep a test running, moving tasks between memcgs with
      move_charge_at_immigrate, while swapping: mem_cgroup_id_get_many()'s
      refcount is discovered to be 0 (supposedly impossible), so it is then
      forced to REFCOUNT_SATURATED, and after thousands of warnings in quick
      succession, the test is at last put out of misery by being OOM killed.
      
      This is because of the way moved_swap accounting was saved up until the
      task move gets completed in __mem_cgroup_clear_mc(), deferred from when
      mem_cgroup_move_swap_account() actually exchanged old and new ids.
      Concurrent activity can free up swap faster than the task is scanned,
      bringing the id refcount down to 0 (which should only be possible when
      offlining).
      
      Just skip that optimization: do that part of the accounting immediately.
      
      Fixes: 615d66c3 ("mm: memcontrol: fix memcg id ref counter on swap charge move")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2007071431050.4726@eggly.anvils
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d22a935
    • mm/memcontrol: fix OOPS inside mem_cgroup_get_nr_swap_pages() · 82ff165c
      Committed by Bhupesh Sharma
      Prabhakar reported an OOPS inside the mem_cgroup_get_nr_swap_pages()
      function, in a corner case seen on some arm64 boards when the kdump
      kernel runs with "cgroup_disable=memory" passed via bootargs.
      
      The root cause is that the mem_cgroup_swap_init() function is currently
      registered as a subsys_initcall() instead of a core_initcall().  This
      means 'cgroup_memory_noswap' remains set to its default value (false)
      even when memcg is disabled via the "cgroup_disable=memory" boot
      parameter.
      
      This may result in a premature OOPS inside the
      mem_cgroup_get_nr_swap_pages() function in corner cases (a sketch of the
      fix follows the trace):
      
        Unable to handle kernel NULL pointer dereference at virtual address 0000000000000188
        Mem abort info:
          ESR = 0x96000006
          EC = 0x25: DABT (current EL), IL = 32 bits
          SET = 0, FnV = 0
          EA = 0, S1PTW = 0
        Data abort info:
          ISV = 0, ISS = 0x00000006
          CM = 0, WnR = 0
        [0000000000000188] user address but active_mm is swapper
        Internal error: Oops: 96000006 [#1] SMP
        Modules linked in:
        <..snip..>
        Call trace:
          mem_cgroup_get_nr_swap_pages+0x9c/0xf4
          shrink_lruvec+0x404/0x4f8
          shrink_node+0x1a8/0x688
          do_try_to_free_pages+0xe8/0x448
          try_to_free_pages+0x110/0x230
          __alloc_pages_slowpath.constprop.106+0x2b8/0xb48
          __alloc_pages_nodemask+0x2ac/0x2f8
          alloc_page_interleave+0x20/0x90
          alloc_pages_current+0xdc/0xf8
          atomic_pool_expand+0x60/0x210
          __dma_atomic_pool_init+0x50/0xa4
          dma_atomic_pool_init+0xac/0x158
          do_one_initcall+0x50/0x218
          kernel_init_freeable+0x22c/0x2d0
          kernel_init+0x18/0x110
          ret_from_fork+0x10/0x18
        Code: aa1403e3 91106000 97f82a27 14000011 (f940c663)
        ---[ end trace 9795948475817de4 ]---
        Kernel panic - not syncing: Fatal exception
        Rebooting in 10 seconds..
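
      A sketch of the shape of the fix described above (the early
      mem_cgroup_disabled() check and the initcall level are the essential
      parts; the rest is abbreviated):

        static int __init mem_cgroup_swap_init(void)
        {
                /* "cgroup_disable=memory": no memory control, no swap control. */
                if (mem_cgroup_disabled())
                        cgroup_memory_noswap = true;

                if (cgroup_memory_noswap)
                        return 0;

                /* ... register the swap control files ... */
                return 0;
        }
        core_initcall(mem_cgroup_swap_init);    /* was subsys_initcall() */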
      
      Fixes: eccb52e7 ("mm: memcontrol: prepare swap controller setup for integration")
      Reported-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
      Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Link: http://lkml.kernel.org/r/1593641660-13254-2-git-send-email-bhsharma@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82ff165c
  3. 17 Jul 2020 (1 commit)
  4. 26 Jun 2020 (3 commits)
  5. 10 Jun 2020 (2 commits)
  6. 05 Jun 2020 (1 commit)
  7. 04 Jun 2020 (17 commits)
  8. 03 Jun 2020 (3 commits)