1. 30 Jun 2021, 14 commits
    • loop: charge i/o to mem and blk cg · c74d40e8
      Committed by Dan Schatzberg
      The current code only associates with the existing blkcg when aio is used
      to access the backing file.  This patch covers all types of i/o to the
      backing file and also associates the memcg so if the backing file is on
      tmpfs, memory is charged appropriately.
      
      This patch also exports cgroup_get_e_css and int_active_memcg so they can
      be used by the loop module.
      
      Link: https://lkml.kernel.org/r/20210610173944.1203706-4-schatzberg.dan@gmail.com
      Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c74d40e8
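
      For illustration, here is a tiny userspace model of the association described
      above: the submitter's memcg and blkcg are captured with each request and
      applied when the worker performs the backing-file i/o. Every type and function
      name below is a stand-in, not the loop driver's actual code.

        #include <stdio.h>

        /* Stand-ins for the kernel's cgroup handles. */
        struct mem_cgroup { const char *name; };
        struct blkcg      { const char *name; };

        /* Per-request context captured when the request is queued. */
        struct loop_cmd {
            struct mem_cgroup *memcg;
            struct blkcg      *blkcg;
            const char        *op;
        };

        /* Worker side: every backing-file i/o (not just aio) runs with the
         * submitter's memcg/blkcg applied, so tmpfs pages and block i/o get
         * charged to the right cgroups. */
        static void loop_do_io(struct loop_cmd *cmd)
        {
            /* In the kernel, set_active_memcg()-style association would
             * bracket the actual read/write of the backing file here. */
            printf("%s charged to memcg=%s blkcg=%s\n", cmd->op,
                   cmd->memcg->name, cmd->blkcg->name);
        }

        int main(void)
        {
            struct mem_cgroup m = { "workload-a" };
            struct blkcg b = { "workload-a" };
            struct loop_cmd write_cmd = { &m, &b, "write" };

            loop_do_io(&write_cmd);
            return 0;
        }
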
    • mm: charge active memcg when no mm is set · 04f94e3f
      Committed by Dan Schatzberg
      set_active_memcg() worked for kernel allocations but was silently ignored
      for user pages.
      
      This patch establishes a precedence order for who gets charged:
      
      1. If there is a memcg associated with the page already, that memcg is
         charged. This happens during swapin.
      
      2. If an explicit mm is passed, mm->memcg is charged. This happens
         during page faults, which can be triggered in remote VMs (eg gup).
      
      3. Otherwise consult the current process context. If there is an
         active_memcg, use that. Otherwise, current->mm->memcg.
      
      Previously, if a NULL mm was passed to mem_cgroup_charge (case 3) it would
      always charge the root cgroup.  Now it looks up the active_memcg first
      (falling back to charging the root cgroup if not set).
      
      Link: https://lkml.kernel.org/r/20210610173944.1203706-3-schatzberg.dan@gmail.com
      Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04f94e3f
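
      Below is a self-contained sketch of the precedence order described above. The
      struct layouts and helper names are stand-ins chosen for illustration; only the
      ordering of the checks follows the commit.

        #include <stddef.h>
        #include <stdio.h>

        struct mem_cgroup { const char *name; };
        struct mm_struct  { struct mem_cgroup *memcg; };
        struct page       { struct mem_cgroup *memcg; };   /* set e.g. at swapin */
        struct task       { struct mm_struct *mm;
                            struct mem_cgroup *active_memcg; };

        static struct mem_cgroup root_memcg = { "root" };

        /* Model of the precedence: the page's memcg, then the passed-in mm, then
         * the caller's active_memcg, then current->mm, finally the root cgroup. */
        static struct mem_cgroup *choose_charge_memcg(struct page *page,
                                                      struct mm_struct *mm,
                                                      struct task *current_task)
        {
            if (page->memcg)
                return page->memcg;                 /* 1. already associated   */
            if (mm && mm->memcg)
                return mm->memcg;                   /* 2. explicit mm passed   */
            if (current_task->active_memcg)
                return current_task->active_memcg;  /* 3a. set_active_memcg()  */
            if (current_task->mm && current_task->mm->memcg)
                return current_task->mm->memcg;     /* 3b. current process' mm */
            return &root_memcg;                     /* fallback                */
        }

        int main(void)
        {
            struct mem_cgroup loopcg = { "loop-worker" };
            struct task kworker = { NULL, &loopcg };   /* kernel thread, no mm */
            struct page p = { NULL };

            /* NULL mm + active_memcg set: charged to the active memcg, not root. */
            printf("%s\n", choose_charge_memcg(&p, NULL, &kworker)->name);
            return 0;
        }
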
    • mm: memcontrol: move obj_cgroup_uncharge_pages() out of css_set_lock · 271dd6b1
      Committed by Muchun Song
      The css_set_lock is used to guard the list of inherited objcgs.  So there
      is no need to uncharge kernel memory under css_set_lock.  Just move it out
      of the lock.
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-8-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      271dd6b1
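
      As a rough model of the locking change, with a pthread mutex standing in for
      css_set_lock and stub helpers standing in for the objcg list walk and the
      uncharge, only the lock scope is the point here:

        #include <pthread.h>
        #include <stdio.h>

        static pthread_mutex_t css_set_lock = PTHREAD_MUTEX_INITIALIZER;

        static unsigned int detach_inherited_objcgs(void)
        {
            /* Stand-in: walk and detach the objcg list; needs the lock. */
            return 4;   /* pretend 4 pages' worth of kernel memory remained */
        }

        static void uncharge_kernel_pages(unsigned int nr_pages)
        {
            /* Stand-in: page counter uncharge; does not touch the objcg list. */
            printf("uncharged %u pages outside the lock\n", nr_pages);
        }

        int main(void)
        {
            unsigned int nr_pages;

            pthread_mutex_lock(&css_set_lock);
            nr_pages = detach_inherited_objcgs();  /* the list is what the lock guards */
            pthread_mutex_unlock(&css_set_lock);

            uncharge_kernel_pages(nr_pages);       /* moved out from under the lock */
            return 0;
        }
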
    • mm: memcontrol: simplify the logic of objcg pinning memcg · 9838354e
      Committed by Muchun Song
      The obj_cgroup_release() and memcg_reparent_objcgs() are serialized by the
      css_set_lock.  We do not need to care about objcg->memcg being released in
      the process of obj_cgroup_release().  So there is no need to pin memcg
      before releasing objcg.  Remove that pinning logic to simplify the code.
      
      There are only two places that modify objcg->memcg.  One is the
      initialization of objcg->memcg in memcg_online_kmem(), the other is objcg
      reparenting in memcg_reparent_objcgs().  It is also impossible for the two
      to run in parallel.  So xchg() is unnecessary and it is enough
      to use WRITE_ONCE().
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-7-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9838354e
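
      A small userspace model of the simplification: because the two writers are
      serialized by css_set_lock, a plain WRITE_ONCE()-style store suffices and no
      xchg() or memcg pinning is needed. The macros and structs below are stand-ins,
      not the memcg code.

        #include <stdio.h>

        /* Userspace stand-ins for the kernel's WRITE_ONCE()/READ_ONCE(). */
        #define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
        #define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

        struct mem_cgroup { const char *name; };
        struct obj_cgroup { struct mem_cgroup *memcg; };

        /* Both writers below are serialized by css_set_lock in the kernel, so a
         * plain store is sufficient; no xchg(), no pinning of the old memcg. */
        static void objcg_init(struct obj_cgroup *objcg, struct mem_cgroup *memcg)
        {
            WRITE_ONCE(objcg->memcg, memcg);      /* memcg_online_kmem() path */
        }

        static void objcg_reparent(struct obj_cgroup *objcg, struct mem_cgroup *parent)
        {
            WRITE_ONCE(objcg->memcg, parent);     /* memcg_reparent_objcgs() path */
        }

        int main(void)
        {
            struct mem_cgroup child = { "child" }, parent = { "parent" };
            struct obj_cgroup objcg = { 0 };

            objcg_init(&objcg, &child);
            objcg_reparent(&objcg, &parent);
            printf("objcg now points at %s\n", READ_ONCE(objcg.memcg)->name);
            return 0;
        }
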
    • mm: memcontrol: remove the pgdat parameter of mem_cgroup_page_lruvec · a984226f
      Committed by Muchun Song
      All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page) as
      the 2nd parameter to it (except isolate_migratepages_block()).  But for
      isolate_migratepages_block(), page_pgdat(page) is also equal to the local
      variable @pgdat.  So mem_cgroup_page_lruvec() does not need the pgdat
      parameter.  Just remove it to simplify the code.
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-4-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a984226f
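
      A sketch of the interface simplification with stand-in types: the node data is
      derivable from the page itself, so the extra parameter carries no information.

        #include <stdio.h>

        struct pglist_data { int node_id; };
        struct lruvec      { struct pglist_data *pgdat; };
        struct page        { struct pglist_data *pgdat; struct lruvec *lruvec; };

        /* Stand-in for page_pgdat(page). */
        static struct pglist_data *page_pgdat(struct page *page)
        {
            return page->pgdat;
        }

        /* Before: mem_cgroup_page_lruvec(page, pgdat), where every caller passed
         * page_pgdat(page) anyway.  After: derive it internally. */
        static struct lruvec *mem_cgroup_page_lruvec(struct page *page)
        {
            struct pglist_data *pgdat = page_pgdat(page);  /* was a parameter */

            (void)pgdat;    /* in the kernel this selects the per-node lruvec;
                             * unused in this sketch */
            return page->lruvec;
        }

        int main(void)
        {
            struct pglist_data node0 = { 0 };
            struct lruvec lv = { &node0 };
            struct page p = { &node0, &lv };

            printf("lruvec node = %d\n", mem_cgroup_page_lruvec(&p)->pgdat->node_id);
            return 0;
        }
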
    • mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm · 2884b6b7
      Committed by Muchun Song
      When mm is NULL, we do not need to hold the RCU lock or call css_tryget for
      the root memcg.  We also do not need to check !mm on every iteration of the
      while loop.  So bail out early when !mm.
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-3-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2884b6b7
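
      The shape of the change in a self-contained sketch with stub types; the point
      is returning the root memcg up front rather than re-checking !mm inside the
      retry loop.

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdio.h>

        struct mem_cgroup { const char *name; int refs; };
        struct mm_struct  { struct mem_cgroup *memcg; };

        static struct mem_cgroup root_memcg = { "root", 1 };

        /* Stand-in for css_tryget(): may fail if the memcg is being destroyed. */
        static bool memcg_tryget(struct mem_cgroup *memcg)
        {
            memcg->refs++;
            return true;
        }

        static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
        {
            struct mem_cgroup *memcg;

            if (!mm)                 /* bail out early: no RCU, no tryget loop */
                return &root_memcg;

            /* rcu_read_lock() would bracket this loop in the kernel. */
            do {
                memcg = mm->memcg ? mm->memcg : &root_memcg;
            } while (!memcg_tryget(memcg));

            return memcg;
        }

        int main(void)
        {
            printf("!mm charges to %s\n", get_mem_cgroup_from_mm(NULL)->name);
            return 0;
        }
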
    • mm: memcontrol: fix page charging in page replacement · 8dc87c7d
      Committed by Muchun Song
      Patch series "memcontrol code cleanup and simplification", v3.
      
      This patch (of 8):
      
      The pages aren't accounted at the root level, so do not charge the page to
      the root memcg in page replacement.  Since we do not display the value
      (mem_cgroup_usage), there shouldn't be any actual problem, but there is a
      WARN_ON_ONCE in page_counter_cancel() that could conceivably trigger, so it
      is better to fix it.
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20210417043538.9793-2-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8dc87c7d
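
      A minimal sketch of the intent with stand-in counters: since pages are never
      accounted at the root level, the replacement path should not charge the root
      memcg at all.

        #include <stdbool.h>
        #include <stdio.h>

        struct page_counter { long usage; };
        struct mem_cgroup   { const char *name; struct page_counter memory; };

        static struct mem_cgroup root_memcg = { "root", { 0 } };

        static bool mem_cgroup_is_root(struct mem_cgroup *memcg)
        {
            return memcg == &root_memcg;
        }

        /* Model of charging the new page during replacement: the root memcg keeps
         * no page counter state, so charging it would only create the imbalance
         * that page_counter_cancel() later warns about. */
        static void charge_replacement_page(struct mem_cgroup *memcg, long nr_pages)
        {
            if (mem_cgroup_is_root(memcg))
                return;                       /* not accounted at the root level */
            memcg->memory.usage += nr_pages;
        }

        int main(void)
        {
            struct mem_cgroup child = { "child", { 0 } };

            charge_replacement_page(&root_memcg, 1);
            charge_replacement_page(&child, 1);
            printf("root=%ld child=%ld\n", root_memcg.memory.usage,
                   child.memory.usage);
            return 0;
        }
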
    • mm: memcontrol: fix root_mem_cgroup charging · c5c8b16b
      Committed by Muchun Song
      The below scenario can cause the page counters of the root_mem_cgroup to
      be out of balance.
      
      CPU0:                                   CPU1:
      
      objcg = get_obj_cgroup_from_current()
      obj_cgroup_charge_pages(objcg)
                                              memcg_reparent_objcgs()
                                                  // reparent to root_mem_cgroup
                                                  WRITE_ONCE(iter->memcg, parent)
          // memcg == root_mem_cgroup
          memcg = get_mem_cgroup_from_objcg(objcg)
          // do not charge to the root_mem_cgroup
          try_charge(memcg)
      
      obj_cgroup_uncharge_pages(objcg)
          memcg = get_mem_cgroup_from_objcg(objcg)
          // uncharge from the root_mem_cgroup
          refill_stock(memcg)
              drain_stock(memcg)
                  page_counter_uncharge(&memcg->memory)
      
      get_obj_cgroup_from_current() never returns a root_mem_cgroup's objcg, so
      we never explicitly charge the root_mem_cgroup.  And it's not going to
      change.  It's all about a race when we got an obj_cgroup pointing at some
      non-root memcg, but before we were able to charge it, the cgroup was gone,
      objcg was reparented to the root and so we're skipping the charging.  Then
      we store the objcg pointer and later use it to uncharge the root_mem_cgroup.
      
      This can cause the page counter to be less than the actual value.  Since we
      do not display the value (mem_cgroup_usage), there shouldn't be any actual
      problem, but there is a WARN_ON_ONCE in page_counter_cancel() that could
      conceivably trigger, so it is better to fix it.
      
      Link: https://lkml.kernel.org/r/20210425075410.19255-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c5c8b16b
    • mm: memcg/slab: create a new set of kmalloc-cg-<n> caches · 494c1dfe
      Committed by Waiman Long
      There are currently two problems in the way the objcg pointer array
      (memcg_data) in the page structure is being allocated and freed.
      
      On its allocation, it is possible that the allocated objcg pointer
      array comes from the same slab that requires memory accounting. If this
      happens, the slab will never become empty again as there is at least
      one object left (the obj_cgroup array) in the slab.
      
      When it is freed, the objcg pointer array object may be the last one
      in its slab and hence cause kfree() to be called again. With the
      right workload, the slab cache may be set up in a way that allows the
      recursive kfree() calling loop to nest deep enough to cause a kernel
      stack overflow and panic the system.
      
      One way to solve this problem is to split the kmalloc-<n> caches
      (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n>
      (KMALLOC_NORMAL) caches for unaccounted objects only and a new set of
      kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All
      the other caches can still allow a mix of accounted and unaccounted
      objects.
      
      With this change, all the objcg pointer array objects will come from
      KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So
      both the recursive kfree() problem and non-freeable slab problem are
      gone.
      
      Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have
      mixed accounted and unaccounted objects, this will slightly reduce the
      number of objcg pointer arrays that need to be allocated and save a bit
      of memory. On the other hand, creating a new set of kmalloc caches does
      have the effect of reducing cache utilization. So it is probably a wash.
      
      The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and
      KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches()
      will include the newly added caches without change.
      
      [vbabka@suse.cz: don't create kmalloc-cg caches with cgroup.memory=nokmem]
        Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com
      [akpm@linux-foundation.org: un-fat-finger v5 delta creation]
      [longman@redhat.com: disable cache merging for KMALLOC_NORMAL caches]
        Link: https://lkml.kernel.org/r/20210505200610.13943-4-longman@redhat.com
      
      Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com
      Link: https://lkml.kernel.org/r/20210505200610.13943-3-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      [longman@redhat.com: fix for CONFIG_ZONE_DMA=n]
      Suggested-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      494c1dfe
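
      A compact model of the cache-set split: accounted (__GFP_ACCOUNT) requests get
      the KMALLOC_CGROUP set, while DMA and reclaimable selection stay as before and
      everything else uses KMALLOC_NORMAL. The flag values and the selection code are
      illustrative, not copied from the kernel; only the enum ordering (KMALLOC_CGROUP
      between NORMAL and RECLAIM) follows the description above.

        #include <stdio.h>

        /* Illustrative GFP bits (not the kernel's actual values). */
        #define __GFP_DMA          0x01u
        #define __GFP_RECLAIMABLE  0x02u
        #define __GFP_ACCOUNT      0x04u

        /* KMALLOC_CGROUP sits between NORMAL and RECLAIM so a single loop over
         * the enum picks up the new cache set as well. */
        enum kmalloc_cache_type {
            KMALLOC_NORMAL,     /* unaccounted objects only          */
            KMALLOC_CGROUP,     /* accounted (__GFP_ACCOUNT) objects */
            KMALLOC_RECLAIM,
            KMALLOC_DMA,
            NR_KMALLOC_TYPES
        };

        static enum kmalloc_cache_type kmalloc_type(unsigned int flags)
        {
            if (flags & __GFP_DMA)
                return KMALLOC_DMA;
            if (flags & __GFP_RECLAIMABLE)
                return KMALLOC_RECLAIM;
            if (flags & __GFP_ACCOUNT)
                return KMALLOC_CGROUP;
            return KMALLOC_NORMAL;
        }

        int main(void)
        {
            printf("plain kmalloc   -> type %d\n", kmalloc_type(0));
            printf("GFP_ACCOUNT     -> type %d\n", kmalloc_type(__GFP_ACCOUNT));
            printf("GFP_RECLAIMABLE -> type %d\n", kmalloc_type(__GFP_RECLAIMABLE));
            return 0;
        }
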
    • mm: memcg/slab: properly set up gfp flags for objcg pointer array · 41eb5df1
      Committed by Waiman Long
      Patch series "mm: memcg/slab: Fix objcg pointer array handling problem", v4.
      
      Since the merging of the new slab memory controller in v5.9, the page
      structure stores a pointer to objcg pointer array for slab pages.  When
      the slab has no used objects, it can be freed in free_slab() which will
      call kfree() to free the objcg pointer array in
      memcg_alloc_page_obj_cgroups().  If it happens that the objcg pointer
      array is the last used object in its slab, that slab may then be freed
      which may cause kfree() to be called again.
      
      With the right workload, the slab cache may be set up in a way that allows
      the recursive kfree() calling loop to nest deep enough to cause a kernel
      stack overflow and panic the system.  In fact, we have a reproducer that
      can cause kernel stack overflow on a s390 system involving kmalloc-rcl-256
      and kmalloc-rcl-128 slabs with the following kfree() loop recursively
      called 74 times:
      
        [ 285.520739] [<000000000ec432fc>] kfree+0x4bc/0x560
        [ 285.520740] [<000000000ec43466>] __free_slab+0xc6/0x228
        [ 285.520741] [<000000000ec41fc2>] __slab_free+0x3c2/0x3e0
        [ 285.520742] [<000000000ec432fc>] kfree+0x4bc/0x560
          :
      
      While investigating this issue, I also found an issue on the allocation
      side.  If the objcg pointer array happens to come from the same slab or a
      circular dependency linkage is formed with multiple slabs, those affected
      slabs can never be freed again.
      
      This patch series addresses these two issues by introducing a new set of
      kmalloc-cg-<n> caches split from kmalloc-<n> caches.  The new set will
      only contain non-reclaimable and non-dma objects that are accounted in
      memory cgroups, whereas the old set is now for unaccounted objects only.
      By making this split, all the objcg pointer arrays will come from the
      kmalloc-<n> caches, but those caches will never hold any objcg pointer
      array.  As a result, the deeply nested kfree() and unfreeable slab
      problems are now gone.
      
      This patch (of 4):
      
      Since the merging of the new slab memory controller in v5.9, the page
      structure may store a pointer to obj_cgroup pointer array for slab pages.
      Currently, only the __GFP_ACCOUNT bit is masked off.  However, the array
      is not readily reclaimable and doesn't need to come from the DMA buffer.
      So those GFP bits should be masked off as well.
      
      Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure
      that it is consistently applied no matter where it is called.
      
      Link: https://lkml.kernel.org/r/20210505200610.13943-1-longman@redhat.com
      Link: https://lkml.kernel.org/r/20210505200610.13943-2-longman@redhat.com
      Fixes: 286e04b8 ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41eb5df1
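
      A short sketch of the flag clearing done in one central place at array
      allocation time. The bit values are made up and the mask name is illustrative;
      the idea is simply that the objcg pointer array must not be accounted, is not
      reclaimable, and need not come from DMA memory.

        #include <stdio.h>

        /* Illustrative GFP bits (not the kernel's actual values). */
        #define __GFP_DMA          0x01u
        #define __GFP_RECLAIMABLE  0x02u
        #define __GFP_ACCOUNT      0x04u
        #define GFP_KERNEL         0x08u

        /* Bits that must never leak into the objcg pointer array allocation. */
        #define OBJCGS_CLEAR_MASK  (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)

        static unsigned int objcg_array_gfp(unsigned int slab_gfp)
        {
            /* Done in one place (the array allocator) so every caller gets the
             * same sanitized flags. */
            return slab_gfp & ~OBJCGS_CLEAR_MASK;
        }

        int main(void)
        {
            unsigned int slab_gfp = GFP_KERNEL | __GFP_ACCOUNT | __GFP_RECLAIMABLE;

            printf("slab flags  = 0x%x\n", slab_gfp);
            printf("array flags = 0x%x\n", objcg_array_gfp(slab_gfp));
            return 0;
        }
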
    • mm/memcg: optimize user context object stock access · 55927114
      Committed by Waiman Long
      Most kmem_cache_alloc() calls are from user context.  With instrumentation
      enabled, the measured amount of kmem_cache_alloc() calls from non-task
      context was about 0.01% of the total.
      
      The irq disable/enable sequence used in this case to access content from
      object stock is slow.  To optimize for user context access, there are now
      two sets of object stocks (in the new obj_stock structure) for task
      context and interrupt context access respectively.
      
      The task context object stock can be accessed after disabling preemption,
      which is cheap in a non-preempt kernel.  The interrupt context object stock
      can only be accessed after disabling interrupts.  User context code can
      access the interrupt object stock, but not vice versa.
      
      The downside of this change is that more data is stored in local object
      stocks and not reflected in the charge counter and the vmstat arrays.
      However, this is a small price to pay for better performance.
      
      [longman@redhat.com: fix potential uninitialized variable warning]
        Link: https://lkml.kernel.org/r/20210526193602.8742-1-longman@redhat.com
      [akpm@linux-foundation.org: coding style fixes]
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-5-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55927114
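
      A rough single-CPU, userspace model of the split-stock idea (stand-in types, no
      real preemption or irq handling): the common task-context path uses its own
      stock behind a preemption-disable-style section, while interrupt context uses a
      separate stock behind an irq-disable-style section.

        #include <stdbool.h>
        #include <stdio.h>

        struct obj_stock {
            unsigned int cached_bytes;
        };

        /* Per-CPU stock with separate caches for task and interrupt context,
         * so the common (task) path can avoid irq disable/enable. */
        struct memcg_stock_pcp {
            struct obj_stock task_obj;   /* protected by disabling preemption */
            struct obj_stock irq_obj;    /* protected by disabling interrupts */
        };

        static struct memcg_stock_pcp stock;

        static struct obj_stock *get_obj_stock(bool in_task)
        {
            if (in_task) {
                /* preempt_disable() would go here: cheap on non-preempt kernels */
                return &stock.task_obj;
            }
            /* local_irq_save() would go here: the expensive, rare path */
            return &stock.irq_obj;
        }

        static void consume_obj_stock(bool in_task, unsigned int bytes)
        {
            struct obj_stock *os = get_obj_stock(in_task);

            os->cached_bytes += bytes;
            /* matching preempt_enable()/local_irq_restore() would go here */
        }

        int main(void)
        {
            consume_obj_stock(true, 64);    /* ~99.99% of calls: task context */
            consume_obj_stock(false, 32);   /* rare: interrupt context        */
            printf("task=%u irq=%u\n", stock.task_obj.cached_bytes,
                   stock.irq_obj.cached_bytes);
            return 0;
        }
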
    • mm/memcg: improve refill_obj_stock() performance · 5387c904
      Committed by Waiman Long
      There are two issues with the current refill_obj_stock() code.  First of
      all, when nr_bytes exceeds PAGE_SIZE, it calls drain_obj_stock() to
      atomically flush out the remaining bytes to the obj_cgroup, clear
      cached_objcg and do an obj_cgroup_put().  It is likely that the same
      obj_cgroup will be used again, which leads to another drain_obj_stock()
      and obj_cgroup_get() as well as atomically retrieving the available bytes
      from the obj_cgroup.  That is costly.  Instead, we should just uncharge the
      excess pages, reduce the stock bytes and be done with it.  The
      drain_obj_stock() function should only be called when the obj_cgroup
      changes.
      
      Secondly, when charging an object of size not less than a page in
      obj_cgroup_charge(), it is possible that the remaining bytes to be
      refilled to the stock will overflow a page and cause refill_obj_stock() to
      uncharge 1 page.  To avoid the additional uncharge in this case, a new
      allow_uncharge flag is added to refill_obj_stock() which will be set to
      false when called from obj_cgroup_charge() so that an uncharge_pages()
      call won't be issued right after a charge_pages() call unless the objcg
      changes.
      
      A multithreaded kmalloc+kfree microbenchmark was run on a 2-socket 48-core
      96-thread x86-64 system with 96 testing threads.  Before this patch, the
      total number of kilo kmalloc+kfree operations per second done for a 4k
      large object by all the testing threads was 4,304 kops/s (cgroup v1) and
      8,478 kops/s (cgroup v2).  After applying this patch, the numbers were
      4,731 (cgroup v1) and 418,142 (cgroup v2) respectively.  This
      represents a performance improvement of 1.10X (cgroup v1) and 49.3X
      (cgroup v2).
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-4-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5387c904
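
      A simplified model of the two changes, with stand-in types, simplified
      signatures, and byte accounting only: draining is limited to the case where the
      cached objcg actually changes, and excess whole pages are only uncharged when
      the caller allows it.

        #include <stdbool.h>
        #include <stdio.h>

        #define PAGE_SIZE  4096u
        #define PAGE_SHIFT 12u

        struct obj_cgroup { const char *name; };

        /* Per-CPU byte stock for one cached objcg (stand-in). */
        struct obj_stock {
            struct obj_cgroup *cached_objcg;
            unsigned int nr_bytes;
        };

        static void uncharge_pages(struct obj_cgroup *objcg, unsigned int nr_pages)
        {
            printf("uncharge %u page(s) from %s\n", nr_pages, objcg->name);
        }

        /* Only drain when the cached objcg actually changes, and only push excess
         * whole pages back when the caller allows it (the plain uncharge path). */
        static void refill_obj_stock(struct obj_stock *stock, struct obj_cgroup *objcg,
                                     unsigned int nr_bytes, bool allow_uncharge)
        {
            unsigned int nr_pages = 0;

            if (stock->cached_objcg != objcg) {
                /* a drain_obj_stock() equivalent would flush the old objcg here */
                stock->cached_objcg = objcg;
                stock->nr_bytes = 0;
                allow_uncharge = true;        /* new objcg: uncharging is fine */
            }
            stock->nr_bytes += nr_bytes;

            if (allow_uncharge && stock->nr_bytes > PAGE_SIZE) {
                nr_pages = stock->nr_bytes >> PAGE_SHIFT;
                stock->nr_bytes &= PAGE_SIZE - 1;
            }
            if (nr_pages)
                uncharge_pages(objcg, nr_pages);
        }

        int main(void)
        {
            struct obj_cgroup cg = { "demo" };
            struct obj_stock stock = { 0 };

            refill_obj_stock(&stock, &cg, 3000, false); /* charge path           */
            refill_obj_stock(&stock, &cg, 3000, false); /* excess kept, no uncharge */
            refill_obj_stock(&stock, &cg, 3000, true);  /* uncharge path: flush  */
            printf("leftover bytes = %u\n", stock.nr_bytes);
            return 0;
        }
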
    • mm/memcg: cache vmstat data in percpu memcg_stock_pcp · 68ac5b3c
      Committed by Waiman Long
      Before the new slab memory controller with per-object byte charging,
      charging and vmstat data updates happened only when new slab pages were
      allocated or freed.  Now they are done with every kmem_cache_alloc() and
      kmem_cache_free().  This causes additional overhead for workloads that
      generate a lot of alloc and free calls.
      
      The memcg_stock_pcp is used to cache byte charge for a specific obj_cgroup
      to reduce that overhead.  To reduce it further, this patch caches the
      vmstat data in the memcg_stock_pcp structure as well, until a page-size
      worth of updates accumulates or other cached data change.  Caching the
      vmstat data in the per-cpu stock replaces two writes to non-hot cachelines
      (memcg-specific and memcg-lruvec-specific vmstat data) with a write to a
      hot local stock cacheline.
      
      On a 2-socket Cascade Lake server with instrumentation enabled and this
      patch applied, about 20% (634400 out of 3243830) of mod_objcg_state() calls
      after initial boot led to an actual call to __mod_objcg_state().  During a
      parallel kernel build, the figure was about 17% (24329265 out of
      142512465).  So caching the
      vmstat data reduces the number of calls to __mod_objcg_state() by more
      than 80%.
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-3-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68ac5b3c
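
      A sketch of the batching idea only, with stand-in types, simplified signatures
      and thresholds: vmstat deltas accumulate in the hot per-cpu stock and are
      written back to the shared counters when the cached objcg changes or a delta
      exceeds a page's worth.

        #include <stdio.h>
        #include <stdlib.h>

        #define PAGE_SIZE 4096

        struct obj_cgroup { const char *name; };

        /* Per-CPU stock extended with cached vmstat deltas (stand-in layout). */
        struct memcg_stock_pcp {
            struct obj_cgroup *cached_objcg;
            int nr_slab_reclaimable_b;
            int nr_slab_unreclaimable_b;
        };

        static int flush_count;

        /* Stand-in for __mod_objcg_state(): the expensive write to shared,
         * non-hot cachelines. */
        static void __mod_objcg_state(struct obj_cgroup *objcg, int idx, int delta)
        {
            flush_count++;
            printf("flush idx=%d delta=%d for %s\n", idx, delta, objcg->name);
        }

        static void flush_cached(struct memcg_stock_pcp *stock)
        {
            if (stock->nr_slab_reclaimable_b)
                __mod_objcg_state(stock->cached_objcg, 0, stock->nr_slab_reclaimable_b);
            if (stock->nr_slab_unreclaimable_b)
                __mod_objcg_state(stock->cached_objcg, 1, stock->nr_slab_unreclaimable_b);
            stock->nr_slab_reclaimable_b = stock->nr_slab_unreclaimable_b = 0;
        }

        /* Hot path: usually just adjusts a local cacheline. */
        static void mod_objcg_state(struct memcg_stock_pcp *stock,
                                    struct obj_cgroup *objcg, int reclaimable, int delta)
        {
            int *cached;

            if (stock->cached_objcg != objcg) {    /* other cached data changed */
                flush_cached(stock);
                stock->cached_objcg = objcg;
            }
            cached = reclaimable ? &stock->nr_slab_reclaimable_b
                                 : &stock->nr_slab_unreclaimable_b;
            *cached += delta;
            if (abs(*cached) > PAGE_SIZE)          /* a page's worth accumulated */
                flush_cached(stock);
        }

        int main(void)
        {
            struct obj_cgroup cg = { "demo" };
            struct memcg_stock_pcp stock = { 0 };
            int i;

            for (i = 0; i < 100; i++)
                mod_objcg_state(&stock, &cg, 0, 64);  /* 100 updates ...       */
            printf("flushes = %d\n", flush_count);     /* ... far fewer flushes */
            return 0;
        }
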
    • mm/memcg: move mod_objcg_state() to memcontrol.c · fdbcb2a6
      Committed by Waiman Long
      Patch series "mm/memcg: Reduce kmemcache memory accounting overhead", v6.
      
      With the recent introduction of the new slab memory controller, we
      eliminate the need for having separate kmemcaches for each memory cgroup
      and reduce overall kernel memory usage.  However, we also add additional
      memory accounting overhead to each call of kmem_cache_alloc() and
      kmem_cache_free().
      
      Workloads that require a lot of kmemcache allocations and de-allocations
      may experience a performance regression, as illustrated in [1] and [2].
      
      A simple kernel module that performs a repeated loop of 100,000,000
      kmem_cache_alloc() and kmem_cache_free() calls on either a small 32-byte
      object or a big 4k object at module init time, with a batch size of 4 (4
      kmalloc's followed by 4 kfree's), is used for benchmarking.  The
      benchmarking tool was run on a kernel based on linux-next-20210419.  The
      test was run on a Cascade Lake server with turbo-boosting disabled to
      reduce run-to-run variation.
      
      The small object test exercises mainly the object stock charging and
      vmstat update code paths.  The large object test also exercises the
      refill_obj_stock() and __memcg_kmem_charge()/__memcg_kmem_uncharge() code
      paths.
      
      With memory accounting disabled, the run time was 3.130s for both the
      small object and big object tests.
      
      With memory accounting enabled, both cgroup v1 and v2 showed similar
      results in the small object test.  The performance results of the large
      object test, however, differed between cgroup v1 and v2.
      
      The execution times with the application of various patches in the
      patchset were:
      
        Applied patches   Run time   Accounting overhead   %age 1   %age 2
        ---------------   --------   -------------------   ------   ------
      
        Small 32-byte object:
             None          11.634s         8.504s          100.0%   271.7%
              1-2           9.425s         6.295s           74.0%   201.1%
              1-3           9.708s         6.578s           77.4%   210.2%
              1-4           8.062s         4.932s           58.0%   157.6%
      
        Large 4k object (v2):
             None          22.107s        18.977s          100.0%   606.3%
              1-2          20.960s        17.830s           94.0%   569.6%
              1-3          14.238s        11.108s           58.5%   354.9%
              1-4          11.329s         8.199s           43.2%   261.9%
      
        Large 4k object (v1):
             None          36.807s        33.677s          100.0%  1075.9%
              1-2          36.648s        33.518s           99.5%  1070.9%
              1-3          22.345s        19.215s           57.1%   613.9%
              1-4          18.662s        15.532s           46.1%   496.2%
      
        N.B. %age 1 = overhead/unpatched overhead
             %age 2 = overhead/accounting disabled time
      
      Patch 2 (vmstat data stock caching) helps in both the small object test
      and the large v2 object test. It doesn't help much in the v1 big object
      test.
      
      Patch 3 (refill_obj_stock improvement) doesn't help the small object test
      but offers a significant performance improvement for the large object test
      (both v1 and v2).
      
      Patch 4 (eliminating irq disable/enable) helps in all test cases.
      
      To test for the extreme case, a multi-threaded kmalloc/kfree
      microbenchmark was run on the 2-socket 48-core 96-thread system with
      96 testing threads in the same memcg doing kmalloc+kfree of a 4k object
      with accounting enabled for 10s. The total numbers of kmalloc+kfree
      operations done, in kilo operations per second (kops/s), were as follows:
      
        Applied patches   v1 kops/s   v1 change   v2 kops/s   v2 change
        ---------------   ---------   ---------   ---------   ---------
             None           3,520        1.00X      6,242        1.00X
              1-2           4,304        1.22X      8,478        1.36X
              1-3           4,731        1.34X    418,142       66.99X
              1-4           4,587        1.30X    438,838       70.30X
      
      With memory accounting disabled, the kmalloc/kfree rate was 1,481,291
      kops/s. This test shows how significant the memory accounting overhead
      can be in some extreme situations.
      
      For this multithreaded test, the improvement from patch 2 mainly
      comes from the conditional atomic xchg of objcg->nr_charged_bytes in
      mod_objcg_state(). By using an unconditional xchg, the operation rates
      were similar to the unpatched kernel.
      
      Patch 3 eliminates the single highly contended cacheline of
      objcg->nr_charged_bytes for cgroup v2 leading to a huge performance
      improvement. Cgroup v1, however, still has another highly contended
      cacheline in the shared page counter &memcg->kmem. So the improvement
      is only modest.
      
      Patch 4 helps in cgroup v2, but performs worse in cgroup v1 as
      eliminating the irq_disable/irq_enable overhead seems to aggravate the
      cacheline contention.
      
      [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
      [2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/
      
      This patch (of 4):
      
      mod_objcg_state() is moved from mm/slab.h to mm/memcontrol.c so that
      further optimization can be done to it in later patches without exposing
      unnecessary details to other mm components.
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-1-longman@redhat.com
      Link: https://lkml.kernel.org/r/20210506150007.16288-2-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdbcb2a6
  2. 06 Jun 2021, 2 commits
  3. 07 May 2021, 1 commit
  4. 06 May 2021, 3 commits
  5. 01 May 2021, 14 commits
  6. 14 Mar 2021, 1 commit
  7. 25 Feb 2021, 5 commits