1. Jul 04, 2022 (1 commit)
    • mm: memcontrol: introduce mem_cgroup_ino() and mem_cgroup_get_from_ino() · c15187a4
      Authored by Roman Gushchin
      Patch series "mm: introduce shrinker debugfs interface", v5.
      
      The only existing debugging mechanism is a couple of tracepoints in
      do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end.  They
      don't cover everything though: shrinkers which report 0 objects never
      show up, and there is no support for memcg-aware shrinkers.  Shrinkers
      are identified by their scan function, which is not always enough
      (e.g. it is hard to guess which super block's shrinker it is when all
      you have is "super_cache_scan").
      
      To provide better visibility and debug options for memory shrinkers,
      this patchset introduces a /sys/kernel/debug/shrinker interface, to some
      extent similar to /sys/kernel/slab.
      
      For each shrinker registered in the system a directory is created.  For
      now, the directory contains only a "scan" file, which reports the
      number of managed objects for each memory cgroup (for memcg-aware
      shrinkers) and each numa node (for numa-aware shrinkers on a numa
      machine).  Other interfaces might be added in the future.
      
      To make debugging more pleasant, the patchset also names all shrinkers, so
      that debugfs entries can have meaningful names.
      
      
      This patch (of 5):
      
      Shrinker debugfs requires a way to represent memory cgroups without using
      full paths, both for displaying information and getting input from a user.
      
      The cgroup inode number is a perfect fit for this, and is already used by bpf.
      
      This commit adds a couple of helper functions which will be used to handle
      memcg-aware shrinkers.
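
      A minimal sketch of the two helpers (illustrative; the exact upstream
      bodies and error handling may differ):

        /* Sketch: a memcg is identified by its cgroup's inode number. */
        static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
        {
            return cgroup_ino(memcg->css.cgroup);
        }

        /* Sketch: resolve an inode number back to a referenced memcg. */
        struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
        {
            struct mem_cgroup *memcg = ERR_PTR(-ENOENT);
            struct cgroup_subsys_state *css;
            struct cgroup *cgrp;

            cgrp = cgroup_get_from_id(ino);
            if (!cgrp)
                return ERR_PTR(-ENOENT);

            /* Take a reference on the memory controller css, if any. */
            css = cgroup_get_e_css(cgrp, &memory_cgrp_subsys);
            if (css)
                memcg = mem_cgroup_from_css(css);

            cgroup_put(cgrp);
            return memcg;
        }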
      
      Link: https://lkml.kernel.org/r/20220601032227.4076670-1-roman.gushchin@linux.dev
      Link: https://lkml.kernel.org/r/20220601032227.4076670-2-roman.gushchin@linux.dev
      Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. Jun 17, 2022 (2 commits)
  3. May 20, 2022 (1 commit)
    • zswap: memcg accounting · f4840ccf
      Authored by Johannes Weiner
      Applications can currently escape their cgroup memory containment when
      zswap is enabled.  This patch adds per-cgroup tracking and limiting of
      zswap backend memory to rectify this.
      
      The existing cgroup2 memory.stat file is extended to show zswap statistics
      analogous to what's in meminfo and vmstat.  Furthermore, two new control
      files, memory.zswap.current and memory.zswap.max, are added to allow
      tuning zswap usage on a per-workload basis.  This is important since not
      all workloads benefit from zswap equally; some even suffer compared to
      disk swap when memory contents don't compress well.  The optimal size of
      the zswap pool, and the threshold for writeback, also depend on the size
      of the workload's warm set.
      
      The implementation doesn't use a traditional page_counter transaction. 
      zswap is unconventional as a memory consumer in that we only know the
      amount of memory to charge once expensive compression has occurred.  If
      zswap is disabled or the limit is already exceeded, we obviously don't want
      to compress page upon page only to reject them all.  Instead, the limit is
      checked against current usage, then we compress and charge.  This allows
      some limit overrun, but not enough to matter in practice.
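
      A rough sketch of that ordering (illustrative; obj_cgroup_may_zswap()
      and obj_cgroup_charge_zswap() follow the naming used by this series,
      while compress_page() is a stand-in for the real compression step):

        static unsigned int compress_page(struct page *page); /* stand-in */

        static bool zswap_store_sketch(struct page *page, struct obj_cgroup *objcg)
        {
            unsigned int clen;

            /* 1. Cheap limit check against current usage; no page_counter
             *    transaction is opened here. */
            if (objcg && !obj_cgroup_may_zswap(objcg))
                return false;

            /* 2. Only after the (expensive) compression do we know how much
             *    memory this entry will actually consume. */
            clen = compress_page(page);

            /* 3. Charge what was consumed; a small overrun past
             *    memory.zswap.max is tolerated by design. */
            if (objcg)
                obj_cgroup_charge_zswap(objcg, clen);
            return true;
        }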
      
      [hannes@cmpxchg.org: fix for CONFIG_SLOB builds]
        Link: https://lkml.kernel.org/r/YnwD14zxYjUJPc2w@cmpxchg.org
      [hannes@cmpxchg.org: opt out of cgroups v1]
        Link: https://lkml.kernel.org/r/Yn6it9mBYFA+/lTb@cmpxchg.org
      Link: https://lkml.kernel.org/r/20220510152847.230957-7-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. May 13, 2022 (1 commit)
  5. Apr 29, 2022 (1 commit)
  6. Apr 22, 2022 (1 commit)
    • memcg: sync flush only if periodic flush is delayed · 9b301615
      Authored by Shakeel Butt
      Daniel Dao has reported [1] a regression on workloads that may trigger a
      lot of refaults (anon and file).  The underlying issue is that flushing
      rstat is expensive.  Although rstat flushes are batched with (nr_cpus *
      MEMCG_BATCH) stat updates, it seems like there are workloads which
      genuinely do stat updates larger than the batch value within a short
      amount of time.  Since the rstat flush can happen in performance-critical
      codepaths like page faults, such workloads can suffer greatly.
      
      This patch fixes the regression by making rstat flushing conditional in
      the performance-critical codepaths.  More specifically, the kernel
      relies on the async periodic rstat flusher to flush the stats, and the
      performance-critical codepaths are allowed to flush rstat themselves
      only if the periodic flusher is delayed by more than twice its normal
      time window.
      
      Now the question: what are the side-effects of this change? The worst
      that can happen is that the refault codepath will see 4-second-old lruvec
      stats and may cause false (or missed) activations of the refaulted page,
      which may under- or overestimate the workingset size.  Though that is not very
      concerning as the kernel can already miss or do false activations.
      
      There are two more codepaths whose flushing behavior is not changed by
      this patch and which we may need to come back to in the future.  One is
      the writeback stats used by dirty throttling and the second is the
      deactivation heuristic in reclaim.  For now we are keeping an eye on
      them, and if regressions due to these codepaths are reported, we will
      reevaluate then.
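
      A simplified sketch of that logic (identifiers are illustrative, not
      the verbatim patch):

        #define FLUSH_TIME (2UL * HZ)   /* periodic flush interval */

        static u64 flush_next_time;     /* deadline set by the periodic flusher */

        static void flush_memcg_stats_dwork(struct work_struct *w)
        {
            /* Periodic path: flush unconditionally, push the deadline out
             * by twice the interval, then re-queue itself. */
            WRITE_ONCE(flush_next_time, jiffies_64 + 2 * FLUSH_TIME);
            cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
            /* ... queue_delayed_work(..., FLUSH_TIME) ... */
        }

        void mem_cgroup_flush_stats_delayed(void)
        {
            /* Hot paths (e.g. refault): flush only if the periodic flusher
             * is more than 2x its interval late, i.e. stats are ~4s stale. */
            if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
                cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
        }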
      
      Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
      Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com
      Fixes: 1f828223 ("memcg: flush lruvec stats in the refault")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Daniel Dao <dqminh@cloudflare.com>
      Tested-by: Ivan Babrou <ivan@cloudflare.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Frank Hofmann <fhofmann@cloudflare.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. Mar 23, 2022 (5 commits)
    • mm: memcontrol: rename memcg_cache_id to memcg_kmem_id · 7c52f65d
      Authored by Muchun Song
      The memcg_cache_id() introduced by commit 2633d7a0 ("slab/slub:
      consider a memcg parameter in kmem_create_cache") was used to index into
      the kmem_cache->memcg_params->memcg_caches array.  Since
      kmem_cache->memcg_params.memcg_caches was removed by commit
      9855609b ("mm: memcg/slab: use a single set of kmem_caches for all
      accounted allocations"), the name no longer needs to refer to caches.
      Rename it to memcg_kmem_id to reflect that it is about kmem accounting.
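
      After the rename the helper is essentially a thin kmem-id accessor
      (sketch, modulo the exact guards in the real code):

        static inline int memcg_kmem_id(struct mem_cgroup *memcg)
        {
            return memcg ? memcg->kmemcg_id : -1;
        }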
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-17-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: list_lru: replace linear array with xarray · bbca91cc
      Authored by Muchun Song
      If we run 10k containers in the system, the size of the
      list_lru_memcg->lrus array can be ~96KB per list_lru.  When we decrease
      the number of containers, the array is never shrunk, which does not
      scale.  An xarray is a good choice for this case: we can save a lot of
      memory when there are tens of thousands of containers in the system,
      and using an xarray also lets us remove the array-resizing logic, which
      simplifies the code.
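
      A sketch of the resulting lookup (field names follow the series but are
      simplified here):

        static struct list_lru_one *
        list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
        {
            if (list_lru_memcg_aware(lru) && idx >= 0) {
                /* Entries exist only for memcgs that actually inserted
                 * objects into this lru; absent ids simply return NULL. */
                struct list_lru_memcg *mlru = xa_load(&lru->xa, idx);

                return mlru ? &mlru->node[nid] : NULL;
            }
            return &lru->node[nid].lru;
        }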
      
      [akpm@linux-foundation.org: remove unused local]
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-13-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce kmem_cache_alloc_lru · 88f2ef73
      Authored by Muchun Song
      We currently allocate scope for every memcg to be trackable on every
      superblock instantiated in the system, regardless of whether that
      superblock is even accessible to that memcg.
      
      These huge memcg counts come from container hosts where memcgs are
      confined to just a small subset of the total number of superblocks
      instantiated at any given point in time.
      
      For these systems with huge container counts, list_lru does not need the
      capability of tracking every memcg on every superblock.  What it comes
      down to is adding the memcg to the list_lru at the first insert.  So
      introduce kmem_cache_alloc_lru to allocate objects and set up their
      list_lru.  Later patches convert all inode and dentry allocations from
      kmem_cache_alloc to kmem_cache_alloc_lru.
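
      A sketch of the new entry point and a typical conversion (the inode
      example below is illustrative):

        void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
                                   gfp_t gfpflags);

        /* inode_cachep is fs/inode.c's inode slab cache */
        static struct inode *example_alloc_inode(struct super_block *sb)
        {
            /* Before: kmem_cache_alloc(inode_cachep, GFP_KERNEL);
             * after: pass the superblock's inode lru so the memcg's slot in
             * that list_lru is set up lazily, on its first allocation here. */
            return kmem_cache_alloc_lru(inode_cachep, &sb->s_inode_lru,
                                        GFP_KERNEL);
        }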
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-3-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: retrieve parent memcg from css.parent · 486bc706
      Authored by Wei Yang
      The parent we get from the page_counter is correct, but it comes from a
      different hierarchy.
      
      Let's retrieve the parent memcg from css.parent instead, just like
      parent_cs(), blkcg_parent(), etc. do.
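
      The change boils down to deriving the parent from the css hierarchy
      rather than from the page_counter; roughly:

        static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
        {
            /* Walk the cgroup (css) hierarchy, like parent_cs() and
             * blkcg_parent() do, instead of following page_counter->parent. */
            return mem_cgroup_from_css(memcg->css.parent);
        }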
      
      Link: https://lkml.kernel.org/r/20220201004643.8391-2-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: add per-memcg total kernel memory stat · a8c49af3
      Authored by Yosry Ahmed
      Currently memcg stats show several types of kernel memory: kernel stack,
      page tables, sock, vmalloc, and slab.  However, there are other
      allocations with __GFP_ACCOUNT (or supersets such as GFP_KERNEL_ACCOUNT)
      that are not accounted in any of those stats, a few examples are:
      
       - various kvm allocations (e.g. allocated pages to create vcpus)
       - io_uring
       - tmp_page in pipes during pipe_write()
       - bpf ringbuffers
       - unix sockets
      
      Keeping track of the total kernel memory is essential for the ease of
      migration from cgroup v1 to v2 as there are large discrepancies between
      v1's kmem.usage_in_bytes and the sum of the available kernel memory
      stats in v2.  Adding separate memcg stats for all __GFP_ACCOUNT kernel
      allocations is an impractical maintenance burden as there are a lot of those
      all over the kernel code, with more use cases likely to show up in the
      future.
      
      Therefore, add a "kernel" memcg stat that is analogous to the kmem page
      counter, with added benefits such as using the rstat infrastructure,
      which aggregates stats more efficiently.  Additionally, this provides a
      lighter alternative in case the legacy kmem is deprecated in the future.
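
      An illustrative sketch of how such a stat can be maintained (assuming
      the stat item is named MEMCG_KMEM; names are simplified):

        static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
        {
            /* Called from the kmem charge/uncharge paths; nr_pages is
             * negative on uncharge.  rstat aggregates this per-cpu and the
             * result shows up as "kernel" in memory.stat. */
            mod_memcg_state(memcg, MEMCG_KMEM, nr_pages);
        }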
      
      [yosryahmed@google.com: v2]
        Link: https://lkml.kernel.org/r/20220203193856.972500-1-yosryahmed@google.com
      
      Link: https://lkml.kernel.org/r/20220201200823.3283171-1-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. Feb 12, 2022 (1 commit)
    • mm: memcg: synchronize objcg lists with a dedicated spinlock · 0764db9b
      Authored by Roman Gushchin
      Alexander reported a circular lock dependency revealed by the mmap1 ltp
      test:
      
        LOCKDEP_CIRCULAR (suite: ltp, case: mtest06 (mmap1))
                WARNING: possible circular locking dependency detected
                5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1 Not tainted
                ------------------------------------------------------
                mmap1/202299 is trying to acquire lock:
                00000001892c0188 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x4a/0xe0
                but task is already holding lock:
                00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
                which lock already depends on the new lock.
                the existing dependency chain (in reverse order) is:
                -> #1 (&sighand->siglock){-.-.}-{2:2}:
                       __lock_acquire+0x604/0xbd8
                       lock_acquire.part.0+0xe2/0x238
                       lock_acquire+0xb0/0x200
                       _raw_spin_lock_irqsave+0x6a/0xd8
                       __lock_task_sighand+0x90/0x190
                       cgroup_freeze_task+0x2e/0x90
                       cgroup_migrate_execute+0x11c/0x608
                       cgroup_update_dfl_csses+0x246/0x270
                       cgroup_subtree_control_write+0x238/0x518
                       kernfs_fop_write_iter+0x13e/0x1e0
                       new_sync_write+0x100/0x190
                       vfs_write+0x22c/0x2d8
                       ksys_write+0x6c/0xf8
                       __do_syscall+0x1da/0x208
                       system_call+0x82/0xb0
                -> #0 (css_set_lock){..-.}-{2:2}:
                       check_prev_add+0xe0/0xed8
                       validate_chain+0x736/0xb20
                       __lock_acquire+0x604/0xbd8
                       lock_acquire.part.0+0xe2/0x238
                       lock_acquire+0xb0/0x200
                       _raw_spin_lock_irqsave+0x6a/0xd8
                       obj_cgroup_release+0x4a/0xe0
                       percpu_ref_put_many.constprop.0+0x150/0x168
                       drain_obj_stock+0x94/0xe8
                       refill_obj_stock+0x94/0x278
                       obj_cgroup_charge+0x164/0x1d8
                       kmem_cache_alloc+0xac/0x528
                       __sigqueue_alloc+0x150/0x308
                       __send_signal+0x260/0x550
                       send_signal+0x7e/0x348
                       force_sig_info_to_task+0x104/0x180
                       force_sig_fault+0x48/0x58
                       __do_pgm_check+0x120/0x1f0
                       pgm_check_handler+0x11e/0x180
                other info that might help us debug this:
                 Possible unsafe locking scenario:
                       CPU0                    CPU1
                       ----                    ----
                  lock(&sighand->siglock);
                                               lock(css_set_lock);
                                               lock(&sighand->siglock);
                  lock(css_set_lock);
                 *** DEADLOCK ***
                2 locks held by mmap1/202299:
                 #0: 00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
                 #1: 00000001892ad560 (rcu_read_lock){....}-{1:2}, at: percpu_ref_put_many.constprop.0+0x0/0x168
                stack backtrace:
                CPU: 15 PID: 202299 Comm: mmap1 Not tainted 5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1
                Hardware name: IBM 3906 M04 704 (LPAR)
                Call Trace:
                  dump_stack_lvl+0x76/0x98
                  check_noncircular+0x136/0x158
                  check_prev_add+0xe0/0xed8
                  validate_chain+0x736/0xb20
                  __lock_acquire+0x604/0xbd8
                  lock_acquire.part.0+0xe2/0x238
                  lock_acquire+0xb0/0x200
                  _raw_spin_lock_irqsave+0x6a/0xd8
                  obj_cgroup_release+0x4a/0xe0
                  percpu_ref_put_many.constprop.0+0x150/0x168
                  drain_obj_stock+0x94/0xe8
                  refill_obj_stock+0x94/0x278
                  obj_cgroup_charge+0x164/0x1d8
                  kmem_cache_alloc+0xac/0x528
                  __sigqueue_alloc+0x150/0x308
                  __send_signal+0x260/0x550
                  send_signal+0x7e/0x348
                  force_sig_info_to_task+0x104/0x180
                  force_sig_fault+0x48/0x58
                  __do_pgm_check+0x120/0x1f0
                  pgm_check_handler+0x11e/0x180
                INFO: lockdep is turned off.
      
      In this example a slab allocation from __send_signal() caused a
      refilling and draining of a percpu objcg stock, which resulted in the
      release of another, unrelated objcg.  The objcg release path requires
      taking css_set_lock, which is used to synchronize objcg lists.
      
      This can create a circular dependency with the sighandler lock, which is
      taken while css_set_lock is held by the freezer code (to freeze a
      task).
      
      In general it seems that using css_set_lock to synchronize objcg lists
      makes any slab allocation or deallocation under css_set_lock, or under
      any lock nested inside it, risky.
      
      To fix the problem and make the code more robust, let's stop using
      css_set_lock to synchronize objcg lists and use a new dedicated spinlock
      instead.
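
      A simplified sketch of the fix (the release path shown here is
      abbreviated):

        /* Dedicated lock for the objcg lists -- obj_cgroup_release() no
         * longer needs css_set_lock. */
        static DEFINE_SPINLOCK(objcg_lock);

        static void obj_cgroup_release(struct percpu_ref *ref)
        {
            struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
            unsigned long flags;

            /* ... flush any remaining charged bytes ... */

            spin_lock_irqsave(&objcg_lock, flags);    /* was: css_set_lock */
            list_del(&objcg->list);
            spin_unlock_irqrestore(&objcg_lock, flags);

            percpu_ref_exit(ref);
            kfree_rcu(objcg, rcu);
        }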
      
      Link: https://lkml.kernel.org/r/Yfm1IHmoGdyUR81T@carbon.dhcp.thefacebook.com
      Fixes: bf4f0599 ("mm: memcg/slab: obj_cgroup API")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
      Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com>
      Reviewed-by: Waiman Long <longman@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Jeremy Linton <jeremy.linton@arm.com>
      Tested-by: Jeremy Linton <jeremy.linton@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. Jan 15, 2022 (2 commits)
  10. Jan 06, 2022 (1 commit)
    • mm/memcg: Convert slab objcgs from struct page to struct slab · 4b5f8d9a
      Authored by Vlastimil Babka
      page->memcg_data is used with MEMCG_DATA_OBJCGS flag only for slab pages
      so convert all the related infrastructure to struct slab. Also use
      struct folio instead of struct page when resolving object pointers.
      
      This is not just mechanistic changing of types and names. Now in
      mem_cgroup_from_obj() we use folio_test_slab() to decide if we interpret
      the folio as a real slab instead of a large kmalloc, instead of relying
      on MEMCG_DATA_OBJCGS bit that used to be checked in page_objcgs_check().
      Similarly in memcg_slab_free_hook() where we can encounter
      kmalloc_large() pages (here the folio slab flag check is implied by
      virt_to_slab()). As a result, page_objcgs_check() can be dropped instead
      of converted.
      
      To avoid include cycles, move the inline definition of slab_objcgs()
      from memcontrol.h to mm/slab.h.
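
      A sketch of the moved helper (simplified; the real one also carries
      sanity checks):

        static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
        {
            unsigned long memcg_data = READ_ONCE(slab->memcg_data);

            /* For slab pages, memcg_data holds a pointer to an obj_cgroup
             * array, tagged with MEMCG_DATA_OBJCGS in the low bits. */
            return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
        }
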
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <cgroups@vger.kernel.org>
  11. Nov 07, 2021 (2 commits)
    • mm/vmpressure: fix data-race with memcg->socket_pressure · 7e6ec49c
      Authored by Yuanzheng Song
      When reading memcg->socket_pressure in mem_cgroup_under_socket_pressure()
      and writing memcg->socket_pressure in vmpressure() at the same time, the
      following data-race occurs:
      
        BUG: KCSAN: data-race in __sk_mem_reduce_allocated / vmpressure
      
        write to 0xffff8881286f4938 of 8 bytes by task 24550 on cpu 3:
         vmpressure+0x218/0x230 mm/vmpressure.c:307
         shrink_node_memcgs+0x2b9/0x410 mm/vmscan.c:2658
         shrink_node+0x9d2/0x11d0 mm/vmscan.c:2769
         shrink_zones+0x29f/0x470 mm/vmscan.c:2972
         do_try_to_free_pages+0x193/0x6e0 mm/vmscan.c:3027
         try_to_free_mem_cgroup_pages+0x1c0/0x3f0 mm/vmscan.c:3345
         reclaim_high mm/memcontrol.c:2440 [inline]
         mem_cgroup_handle_over_high+0x18b/0x4d0 mm/memcontrol.c:2624
         tracehook_notify_resume include/linux/tracehook.h:197 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:164 [inline]
         exit_to_user_mode_prepare+0x110/0x170 kernel/entry/common.c:191
         syscall_exit_to_user_mode+0x16/0x30 kernel/entry/common.c:266
         ret_from_fork+0x15/0x30 arch/x86/entry/entry_64.S:289
      
        read to 0xffff8881286f4938 of 8 bytes by interrupt on cpu 1:
         mem_cgroup_under_socket_pressure include/linux/memcontrol.h:1483 [inline]
         sk_under_memory_pressure include/net/sock.h:1314 [inline]
         __sk_mem_reduce_allocated+0x1d2/0x270 net/core/sock.c:2696
         __sk_mem_reclaim+0x44/0x50 net/core/sock.c:2711
         sk_mem_reclaim include/net/sock.h:1490 [inline]
         ......
         net_rx_action+0x17a/0x480 net/core/dev.c:6864
         __do_softirq+0x12c/0x2af kernel/softirq.c:298
         run_ksoftirqd+0x13/0x20 kernel/softirq.c:653
         smpboot_thread_fn+0x33f/0x510 kernel/smpboot.c:165
         kthread+0x1fc/0x220 kernel/kthread.c:292
         ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296
      
      Fix it by using READ_ONCE() and WRITE_ONCE() to read and write
      memcg->socket_pressure.
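
      A simplified sketch of the annotated accesses (the reader omits the
      cgroup v1 tcpmem check, and the writer is shown in an illustrative
      wrapper rather than the full vmpressure() body):

        /* Reader, mem_cgroup_under_socket_pressure(): */
        static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
        {
            do {
                if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
                    return true;
            } while ((memcg = parent_mem_cgroup(memcg)));
            return false;
        }

        /* Writer, the corresponding store from vmpressure(): signal socket
         * pressure for another second, annotated to pair with the read. */
        static void vmpressure_note_pressure(struct mem_cgroup *memcg)
        {
            WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
        }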
      
      Link: https://lkml.kernel.org/r/20211025082843.671690-1-songyuanzheng@huawei.com
      Signed-off-by: Yuanzheng Song <songyuanzheng@huawei.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: remove the kmem states · e80216d9
      Authored by Muchun Song
      Now the kmem states are only used to indicate whether the kmem is
      offline.  However, we can simply set ->kmemcg_id to -1 to indicate that
      instead.  So remove the kmem states to simplify the code.
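
      A sketch of the resulting convention (simplified):

        static void memcg_offline_kmem(struct mem_cgroup *memcg)
        {
            if (memcg->kmemcg_id == -1)   /* replaces the explicit state check */
                return;

            /* ... reparent the memcg's kmem caches and list_lrus ... */

            memcg->kmemcg_id = -1;        /* mark kmem as offline */
        }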
      
      Link: https://lkml.kernel.org/r/20211025125259.56624-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. Oct 18, 2021 (1 commit)
  13. Sep 27, 2021 (10 commits)
  14. Sep 04, 2021 (6 commits)
  15. Aug 21, 2021 (1 commit)
  16. Aug 18, 2021 (1 commit)
  17. Jun 30, 2021 (3 commits)