1. 27 9月, 2021 4 次提交
  2. 24 9月, 2021 1 次提交
    • S
      memcg: flush lruvec stats in the refault · 1f828223
      Shakeel Butt 提交于
      Prior to the commit 7e1c0d6f ("memcg: switch lruvec stats to rstat")
      and the commit aa48e47e ("memcg: infrastructure to flush memcg
      stats"), each lruvec memcg stats can be off by (nr_cgroups * nr_cpus *
      32) at worst and for unbounded amount of time.  The commit aa48e47e
      moved the lruvec stats to rstat infrastructure and the commit
      7e1c0d6f bounded the error for all the lruvec stats to (nr_cpus *
      32) at worst for at most 2 seconds.  More specifically it decoupled the
      number of stats and the number of cgroups from the error rate.
      
      However this reduction in error comes with the cost of triggering the
      slowpath of stats update more frequently.  Previously in the slowpath
      the kernel adds the stats up the memcg tree.  After aa48e47e, the
      kernel triggers the asyn lruvec stats flush through queue_work().  This
      causes regression reports from 0day kernel bot [1] as well as from
      phoronix test suite [2].
      
      We tried two options to fix the regression:
      
       1) Increase the threshold to trigger the slowpath in lruvec stats
          update codepath from 32 to 512.
      
       2) Remove the slowpath from lruvec stats update codepath and instead
          flush the stats in the page refault codepath. The assumption is that
          the kernel timely flush the stats, so, the update tree would be
          small in the refault codepath to not cause the preformance impact.
      
      Following are the results of will-it-scale/page_fault[1|2|3] benchmark
      on four settings i.e.  (1) 5.15-rc1 as baseline (2) 5.15-rc1 with
      aa48e47e and 7e1c0d6f reverted (3) 5.15-rc1 with option-1
      (4) 5.15-rc1 with option-2.
      
        test       (1)      (2)               (3)               (4)
        pg_f1   368563   406277 (10.23%)   399693  (8.44%)   416398 (12.97%)
        pg_f2   338399   372133  (9.96%)   369180  (9.09%)   381024 (12.59%)
        pg_f3   500853   575399 (14.88%)   570388 (13.88%)   576083 (15.02%)
      
      From the above result, it seems like the option-2 not only solves the
      regression but also improves the performance for at least these
      benchmarks.
      
      Feng Tang (intel) ran the aim7 benchmark with these two options and
      confirms that option-1 reduces the regression but option-2 removes the
      regression.
      
      Michael Larabel (phoronix) ran multiple benchmarks with these options
      and reported the results at [3] and it shows for most benchmarks
      option-2 removes the regression introduced by the commit aa48e47e
      ("memcg: infrastructure to flush memcg stats").
      
      Based on the experiment results, this patch proposed the option-2 as the
      solution to resolve the regression.
      
      Link: https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020 [1]
      Link: https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress [2]
      Link: https://openbenchmarking.org/result/2109226-DEBU-LINUX5104 [3]
      Fixes: aa48e47e ("memcg: infrastructure to flush memcg stats")
      Signed-off-by: NShakeel Butt <shakeelb@google.com>
      Tested-by: NMichael Larabel <Michael@phoronix.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hillf Danton <hdanton@sina.com>,
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>,
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f828223
  3. 04 9月, 2021 12 次提交
  4. 18 8月, 2021 1 次提交
  5. 14 8月, 2021 1 次提交
  6. 31 7月, 2021 1 次提交
  7. 20 7月, 2021 1 次提交
    • V
      memcg: enable accounting for IP address and routing-related objects · 6126891c
      Vasily Averin 提交于
      An netadmin inside container can use 'ip a a' and 'ip r a'
      to assign a large number of ipv4/ipv6 addresses and routing entries
      and force kernel to allocate megabytes of unaccounted memory
      for long-lived per-netdevice related kernel objects:
      'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
      'struct rt6_info', 'struct fib_rules' and ip_fib caches.
      
      These objects can be manually removed, though usually they lives
      in memory till destroy of its net namespace.
      
      It makes sense to account for them to restrict the host's memory
      consumption from inside the memcg-limited container.
      
      One of such objects is the 'struct fib6_node' mostly allocated in
      net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
      
       write_lock_bh(&table->tb6_lock);
       err = fib6_add(&table->tb6_root, rt, info, mxc);
       write_unlock_bh(&table->tb6_lock);
      
      In this case it is not enough to simply add SLAB_ACCOUNT to corresponding
      kmem cache. The proper memory cgroup still cannot be found due to the
      incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
      
      Obsoleted in_interrupt() does not describe real execution context properly.
      >From include/linux/preempt.h:
      
       The following macros are deprecated and should not be used in new code:
       in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled
      
      To verify the current execution context new macro should be used instead:
       in_task()	- We're in task context
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6126891c
  8. 02 7月, 2021 2 次提交
    • A
      mm: remove special swap entry functions · af5cdaf8
      Alistair Popple 提交于
      Patch series "Add support for SVM atomics in Nouveau", v11.
      
      Introduction
      ============
      
      Some devices have features such as atomic PTE bits that can be used to
      implement atomic access to system memory.  To support atomic operations to
      a shared virtual memory page such a device needs access to that page which
      is exclusive of the CPU.  This series introduces a mechanism to
      temporarily unmap pages granting exclusive access to a device.
      
      These changes are required to support OpenCL atomic operations in Nouveau
      to shared virtual memory (SVM) regions allocated with the
      CL_MEM_SVM_ATOMICS clSVMAlloc flag.  A more complete description of the
      OpenCL SVM feature is available at
      https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
      OpenCL_API.html#_shared_virtual_memory .
      
      Implementation
      ==============
      
      Exclusive device access is implemented by adding a new swap entry type
      (SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry.  The main
      difference is that on fault the original entry is immediately restored by
      the fault handler instead of waiting.
      
      Restoring the entry triggers calls to MMU notifers which allows a device
      driver to revoke the atomic access permission from the GPU prior to the
      CPU finalising the entry.
      
      Patches
      =======
      
      Patches 1 & 2 refactor existing migration and device private entry
      functions.
      
      Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
      functionality into separate functions - try_to_migrate_one() and
      try_to_munlock_one().
      
      Patch 5 renames some existing code but does not introduce functionality.
      
      Patch 6 is a small clean-up to swap entry handling in copy_pte_range().
      
      Patch 7 contains the bulk of the implementation for device exclusive
      memory.
      
      Patch 8 contains some additions to the HMM selftests to ensure everything
      works as expected.
      
      Patch 9 is a cleanup for the Nouveau SVM implementation.
      
      Patch 10 contains the implementation of atomic access for the Nouveau
      driver.
      
      Testing
      =======
      
      This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
      which checks that GPU atomic accesses to system memory are atomic.
      Without this series the test fails as there is no way of write-protecting
      the page mapping which results in the device clobbering CPU writes.  For
      reference the test is available at
      https://ozlabs.org/~apopple/opencl_svm_atomics/
      
      Further testing has been performed by adding support for testing exclusive
      access to the hmm-tests kselftests.
      
      This patch (of 10):
      
      Remove multiple similar inline functions for dealing with different types
      of special swap entries.
      
      Both migration and device private swap entries use the swap offset to
      store a pfn.  Instead of multiple inline functions to obtain a struct page
      for each swap entry type use a common function pfn_swap_entry_to_page().
      Also open-code the various entry_to_pfn() functions as this results is
      shorter code that is easier to understand.
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
      Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.comSigned-off-by: NAlistair Popple <apopple@nvidia.com>
      Reviewed-by: NRalph Campbell <rcampbell@nvidia.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af5cdaf8
    • M
      mm/memcontrol.c: fix kerneldoc comment for mem_cgroup_calculate_protection · 05395718
      Mel Gorman 提交于
      make W=1 generates the following warning for mem_cgroup_calculate_protection
      
        mm/memcontrol.c:6468: warning: expecting prototype for mem_cgroup_protected(). Prototype was for mem_cgroup_calculate_protection() instead
      
      Commit 45c7f7e1 ("mm, memcg: decouple e{low,min} state mutations from
      protection checks") changed the function definition but not the associated
      kerneldoc comment.
      
      Link: https://lkml.kernel.org/r/20210520084809.8576-7-mgorman@techsingularity.net
      Fixes: 45c7f7e1 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NYang Shi <shy828301@gmail.com>
      Acked-by: NChris Down <chris@chrisdown.name>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05395718
  9. 30 6月, 2021 14 次提交
    • D
      loop: charge i/o to mem and blk cg · c74d40e8
      Dan Schatzberg 提交于
      The current code only associates with the existing blkcg when aio is used
      to access the backing file.  This patch covers all types of i/o to the
      backing file and also associates the memcg so if the backing file is on
      tmpfs, memory is charged appropriately.
      
      This patch also exports cgroup_get_e_css and int_active_memcg so it can be
      used by the loop module.
      
      Link: https://lkml.kernel.org/r/20210610173944.1203706-4-schatzberg.dan@gmail.comSigned-off-by: NDan Schatzberg <schatzberg.dan@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NJens Axboe <axboe@kernel.dk>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c74d40e8
    • D
      mm: charge active memcg when no mm is set · 04f94e3f
      Dan Schatzberg 提交于
      set_active_memcg() worked for kernel allocations but was silently ignored
      for user pages.
      
      This patch establishes a precedence order for who gets charged:
      
      1. If there is a memcg associated with the page already, that memcg is
         charged. This happens during swapin.
      
      2. If an explicit mm is passed, mm->memcg is charged. This happens
         during page faults, which can be triggered in remote VMs (eg gup).
      
      3. Otherwise consult the current process context. If there is an
         active_memcg, use that. Otherwise, current->mm->memcg.
      
      Previously, if a NULL mm was passed to mem_cgroup_charge (case 3) it would
      always charge the root cgroup.  Now it looks up the active_memcg first
      (falling back to charging the root cgroup if not set).
      
      Link: https://lkml.kernel.org/r/20210610173944.1203706-3-schatzberg.dan@gmail.comSigned-off-by: NDan Schatzberg <schatzberg.dan@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NChris Down <chris@chrisdown.name>
      Acked-by: NJens Axboe <axboe@kernel.dk>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Reviewed-by: NMichal Koutný <mkoutny@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04f94e3f
    • M
      mm: memcontrol: move obj_cgroup_uncharge_pages() out of css_set_lock · 271dd6b1
      Muchun Song 提交于
      The css_set_lock is used to guard the list of inherited objcgs.  So there
      is no need to uncharge kernel memory under css_set_lock.  Just move it out
      of the lock.
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-8-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      271dd6b1
    • M
      mm: memcontrol: simplify the logic of objcg pinning memcg · 9838354e
      Muchun Song 提交于
      The obj_cgroup_release() and memcg_reparent_objcgs() are serialized by the
      css_set_lock.  We do not need to care about objcg->memcg being released in
      the process of obj_cgroup_release().  So there is no need to pin memcg
      before releasing objcg.  Remove those pinning logic to simplfy the code.
      
      There are only two places that modifies the objcg->memcg.  One is the
      initialization to objcg->memcg in the memcg_online_kmem(), another is
      objcgs reparenting in the memcg_reparent_objcgs().  It is also impossible
      for the two to run in parallel.  So xchg() is unnecessary and it is enough
      to use WRITE_ONCE().
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-7-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9838354e
    • M
      mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec · a984226f
      Muchun Song 提交于
      All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page) as
      the 2nd parameter to it (except isolate_migratepages_block()).  But for
      isolate_migratepages_block(), the page_pgdat(page) is also equal to the
      local variable of @pgdat.  So mem_cgroup_page_lruvec() do not need the
      pgdat parameter.  Just remove it to simplify the code.
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-4-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a984226f
    • M
      mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm · 2884b6b7
      Muchun Song 提交于
      When mm is NULL, we do not need to hold rcu lock and call css_tryget for
      the root memcg.  And we also do not need to check !mm in every loop of
      while.  So bail out early when !mm.
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-3-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2884b6b7
    • M
      mm: memcontrol: fix page charging in page replacement · 8dc87c7d
      Muchun Song 提交于
      Patch series "memcontrol code cleanup and simplification", v3.
      
      This patch (of 8):
      
      The pages aren't accounted at the root level, so do not charge the page to
      the root memcg in page replacement.  Although we do not display the value
      (mem_cgroup_usage) so there shouldn't be any actual problem, but there is
      a WARN_ON_ONCE in the page_counter_cancel().  Who knows if it will
      trigger?  So it is better to fix it.
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20210417043538.9793-2-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8dc87c7d
    • M
      mm: memcontrol: fix root_mem_cgroup charging · c5c8b16b
      Muchun Song 提交于
      The below scenario can cause the page counters of the root_mem_cgroup to
      be out of balance.
      
      CPU0:                                   CPU1:
      
      objcg = get_obj_cgroup_from_current()
      obj_cgroup_charge_pages(objcg)
                                              memcg_reparent_objcgs()
                                                  // reparent to root_mem_cgroup
                                                  WRITE_ONCE(iter->memcg, parent)
          // memcg == root_mem_cgroup
          memcg = get_mem_cgroup_from_objcg(objcg)
          // do not charge to the root_mem_cgroup
          try_charge(memcg)
      
      obj_cgroup_uncharge_pages(objcg)
          memcg = get_mem_cgroup_from_objcg(objcg)
          // uncharge from the root_mem_cgroup
          refill_stock(memcg)
              drain_stock(memcg)
                  page_counter_uncharge(&memcg->memory)
      
      get_obj_cgroup_from_current() never returns a root_mem_cgroup's objcg, so
      we never explicitly charge the root_mem_cgroup.  And it's not going to
      change.  It's all about a race when we got an obj_cgroup pointing at some
      non-root memcg, but before we were able to charge it, the cgroup was gone,
      objcg was reparented to the root and so we're skipping the charging.  Then
      we store the objcg pointer and later use to uncharge the root_mem_cgroup.
      
      This can cause the page counter to be less than the actual value.
      Although we do not display the value (mem_cgroup_usage) so there shouldn't
      be any actual problem, but there is a WARN_ON_ONCE in the
      page_counter_cancel().  Who knows if it will trigger?  So it is better to
      fix it.
      
      Link: https://lkml.kernel.org/r/20210425075410.19255-1-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c5c8b16b
    • W
      mm: memcg/slab: create a new set of kmalloc-cg-<n> caches · 494c1dfe
      Waiman Long 提交于
      There are currently two problems in the way the objcg pointer array
      (memcg_data) in the page structure is being allocated and freed.
      
      On its allocation, it is possible that the allocated objcg pointer
      array comes from the same slab that requires memory accounting. If this
      happens, the slab will never become empty again as there is at least
      one object left (the obj_cgroup array) in the slab.
      
      When it is freed, the objcg pointer array object may be the last one
      in its slab and hence causes kfree() to be called again. With the
      right workload, the slab cache may be set up in a way that allows the
      recursive kfree() calling loop to nest deep enough to cause a kernel
      stack overflow and panic the system.
      
      One way to solve this problem is to split the kmalloc-<n> caches
      (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n>
      (KMALLOC_NORMAL) caches for unaccounted objects only and a new set of
      kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All
      the other caches can still allow a mix of accounted and unaccounted
      objects.
      
      With this change, all the objcg pointer array objects will come from
      KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So
      both the recursive kfree() problem and non-freeable slab problem are
      gone.
      
      Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have
      mixed accounted and unaccounted objects, this will slightly reduce the
      number of objcg pointer arrays that need to be allocated and save a bit
      of memory. On the other hand, creating a new set of kmalloc caches does
      have the effect of reducing cache utilization. So it is properly a wash.
      
      The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and
      KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches()
      will include the newly added caches without change.
      
      [vbabka@suse.cz: don't create kmalloc-cg caches with cgroup.memory=nokmem]
        Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com
      [akpm@linux-foundation.org: un-fat-finger v5 delta creation]
      [longman@redhat.com: disable cache merging for KMALLOC_NORMAL caches]
        Link: https://lkml.kernel.org/r/20210505200610.13943-4-longman@redhat.com
      
      Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com
      Link: https://lkml.kernel.org/r/20210505200610.13943-3-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Suggested-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      [longman@redhat.com: fix for CONFIG_ZONE_DMA=n]
      Suggested-by: NRoman Gushchin <guro@fb.com>
      Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      494c1dfe
    • W
      mm: memcg/slab: properly set up gfp flags for objcg pointer array · 41eb5df1
      Waiman Long 提交于
      Patch series "mm: memcg/slab: Fix objcg pointer array handling problem", v4.
      
      Since the merging of the new slab memory controller in v5.9, the page
      structure stores a pointer to objcg pointer array for slab pages.  When
      the slab has no used objects, it can be freed in free_slab() which will
      call kfree() to free the objcg pointer array in
      memcg_alloc_page_obj_cgroups().  If it happens that the objcg pointer
      array is the last used object in its slab, that slab may then be freed
      which may caused kfree() to be called again.
      
      With the right workload, the slab cache may be set up in a way that allows
      the recursive kfree() calling loop to nest deep enough to cause a kernel
      stack overflow and panic the system.  In fact, we have a reproducer that
      can cause kernel stack overflow on a s390 system involving kmalloc-rcl-256
      and kmalloc-rcl-128 slabs with the following kfree() loop recursively
      called 74 times:
      
        [ 285.520739] [<000000000ec432fc>] kfree+0x4bc/0x560 [ 285.520740]
      [<000000000ec43466>] __free_slab+0xc6/0x228 [ 285.520741]
      [<000000000ec41fc2>] __slab_free+0x3c2/0x3e0 [ 285.520742]
      [<000000000ec432fc>] kfree+0x4bc/0x560 : While investigating this issue, I
      also found an issue on the allocation side.  If the objcg pointer array
      happen to come from the same slab or a circular dependency linkage is
      formed with multiple slabs, those affected slabs can never be freed again.
      
      This patch series addresses these two issues by introducing a new set of
      kmalloc-cg-<n> caches split from kmalloc-<n> caches.  The new set will
      only contain non-reclaimable and non-dma objects that are accounted in
      memory cgroups whereas the old set are now for unaccounted objects only.
      By making this split, all the objcg pointer arrays will come from the
      kmalloc-<n> caches, but those caches will never hold any objcg pointer
      array.  As a result, deeply nested kfree() call and the unfreeable slab
      problems are now gone.
      
      This patch (of 4):
      
      Since the merging of the new slab memory controller in v5.9, the page
      structure may store a pointer to obj_cgroup pointer array for slab pages.
      Currently, only the __GFP_ACCOUNT bit is masked off.  However, the array
      is not readily reclaimable and doesn't need to come from the DMA buffer.
      So those GFP bits should be masked off as well.
      
      Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure
      that it is consistently applied no matter where it is called.
      
      Link: https://lkml.kernel.org/r/20210505200610.13943-1-longman@redhat.com
      Link: https://lkml.kernel.org/r/20210505200610.13943-2-longman@redhat.com
      Fixes: 286e04b8 ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages")
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41eb5df1
    • W
      mm/memcg: optimize user context object stock access · 55927114
      Waiman Long 提交于
      Most kmem_cache_alloc() calls are from user context.  With instrumentation
      enabled, the measured amount of kmem_cache_alloc() calls from non-task
      context was about 0.01% of the total.
      
      The irq disable/enable sequence used in this case to access content from
      object stock is slow.  To optimize for user context access, there are now
      two sets of object stocks (in the new obj_stock structure) for task
      context and interrupt context access respectively.
      
      The task context object stock can be accessed after disabling preemption
      which is cheap in non-preempt kernel.  The interrupt context object stock
      can only be accessed after disabling interrupt.  User context code can
      access interrupt object stock, but not vice versa.
      
      The downside of this change is that there are more data stored in local
      object stocks and not reflected in the charge counter and the vmstat
      arrays.  However, this is a small price to pay for better performance.
      
      [longman@redhat.com: fix potential uninitialized variable warning]
        Link: https://lkml.kernel.org/r/20210526193602.8742-1-longman@redhat.com
      [akpm@linux-foundation.org: coding style fixes]
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-5-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      55927114
    • W
      mm/memcg: improve refill_obj_stock() performance · 5387c904
      Waiman Long 提交于
      There are two issues with the current refill_obj_stock() code.  First of
      all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to
      atomically flush out remaining bytes to obj_cgroup, clear cached_objcg and
      do a obj_cgroup_put().  It is likely that the same obj_cgroup will be used
      again which leads to another call to drain_obj_stock() and
      obj_cgroup_get() as well as atomically retrieve the available byte from
      obj_cgroup.  That is costly.  Instead, we should just uncharge the excess
      pages, reduce the stock bytes and be done with it.  The drain_obj_stock()
      function should only be called when obj_cgroup changes.
      
      Secondly, when charging an object of size not less than a page in
      obj_cgroup_charge(), it is possible that the remaining bytes to be
      refilled to the stock will overflow a page and cause refill_obj_stock() to
      uncharge 1 page.  To avoid the additional uncharge in this case, a new
      allow_uncharge flag is added to refill_obj_stock() which will be set to
      false when called from obj_cgroup_charge() so that an uncharge_pages()
      call won't be issued right after a charge_pages() call unless the objcg
      changes.
      
      A multithreaded kmalloc+kfree microbenchmark on a 2-socket 48-core
      96-thread x86-64 system with 96 testing threads were run.  Before this
      patch, the total number of kilo kmalloc+kfree operations done for a 4k
      large object by all the testing threads per second were 4,304 kops/s
      (cgroup v1) and 8,478 kops/s (cgroup v2).  After applying this patch, the
      number were 4,731 (cgroup v1) and 418,142 (cgroup v2) respectively.  This
      represents a performance improvement of 1.10X (cgroup v1) and 49.3X
      (cgroup v2).
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-4-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5387c904
    • W
      mm/memcg: cache vmstat data in percpu memcg_stock_pcp · 68ac5b3c
      Waiman Long 提交于
      Before the new slab memory controller with per object byte charging,
      charging and vmstat data update happen only when new slab pages are
      allocated or freed.  Now they are done with every kmem_cache_alloc() and
      kmem_cache_free().  This causes additional overhead for workloads that
      generate a lot of alloc and free calls.
      
      The memcg_stock_pcp is used to cache byte charge for a specific obj_cgroup
      to reduce that overhead.  To further reducing it, this patch makes the
      vmstat data cached in the memcg_stock_pcp structure as well until it
      accumulates a page size worth of update or when other cached data change.
      Caching the vmstat data in the per-cpu stock eliminates two writes to
      non-hot cachelines for memcg specific as well as memcg-lruvecs specific
      vmstat data by a write to a hot local stock cacheline.
      
      On a 2-socket Cascade Lake server with instrumentation enabled and this
      patch applied, it was found that about 20% (634400 out of 3243830) of the
      time when mod_objcg_state() is called leads to an actual call to
      __mod_objcg_state() after initial boot.  When doing parallel kernel build,
      the figure was about 17% (24329265 out of 142512465).  So caching the
      vmstat data reduces the number of calls to __mod_objcg_state() by more
      than 80%.
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-3-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68ac5b3c
    • W
      mm/memcg: move mod_objcg_state() to memcontrol.c · fdbcb2a6
      Waiman Long 提交于
      Patch series "mm/memcg: Reduce kmemcache memory accounting overhead", v6.
      
      With the recent introduction of the new slab memory controller, we
      eliminate the need for having separate kmemcaches for each memory cgroup
      and reduce overall kernel memory usage.  However, we also add additional
      memory accounting overhead to each call of kmem_cache_alloc() and
      kmem_cache_free().
      
      For workloads that require a lot of kmemcache allocations and
      de-allocations, they may experience performance regression as illustrated
      in [1] and [2].
      
      A simple kernel module that performs repeated loop of 100,000,000
      kmem_cache_alloc() and kmem_cache_free() of either a small 32-byte object
      or a big 4k object at module init time with a batch size of 4 (4 kmalloc's
      followed by 4 kfree's) is used for benchmarking.  The benchmarking tool
      was run on a kernel based on linux-next-20210419.  The test was run on a
      CascadeLake server with turbo-boosting disable to reduce run-to-run
      variation.
      
      The small object test exercises mainly the object stock charging and
      vmstat update code paths.  The large object test also exercises the
      refill_obj_stock() and __memcg_kmem_charge()/__memcg_kmem_uncharge() code
      paths.
      
      With memory accounting disabled, the run time was 3.130s with both small
      object big object tests.
      
      With memory accounting enabled, both cgroup v1 and v2 showed similar
      results in the small object test.  The performance results of the large
      object test, however, differed between cgroup v1 and v2.
      
      The execution times with the application of various patches in the
      patchset were:
      
        Applied patches   Run time   Accounting overhead   %age 1   %age 2
        ---------------   --------   -------------------   ------   ------
      
        Small 32-byte object:
             None          11.634s         8.504s          100.0%   271.7%
              1-2           9.425s         6.295s           74.0%   201.1%
              1-3           9.708s         6.578s           77.4%   210.2%
              1-4           8.062s         4.932s           58.0%   157.6%
      
        Large 4k object (v2):
             None          22.107s        18.977s          100.0%   606.3%
              1-2          20.960s        17.830s           94.0%   569.6%
              1-3          14.238s        11.108s           58.5%   354.9%
              1-4          11.329s         8.199s           43.2%   261.9%
      
        Large 4k object (v1):
             None          36.807s        33.677s          100.0%  1075.9%
              1-2          36.648s        33.518s           99.5%  1070.9%
              1-3          22.345s        19.215s           57.1%   613.9%
              1-4          18.662s        15.532s           46.1%   496.2%
      
        N.B. %age 1 = overhead/unpatched overhead
             %age 2 = overhead/accounting disabled time
      
      Patch 2 (vmstat data stock caching) helps in both the small object test
      and the large v2 object test. It doesn't help much in v1 big object test.
      
      Patch 3 (refill_obj_stock improvement) does help the small object test
      but offer significant performance improvement for the large object test
      (both v1 and v2).
      
      Patch 4 (eliminating irq disable/enable) helps in all test cases.
      
      To test for the extreme case, a multi-threaded kmalloc/kfree
      microbenchmark was run on the 2-socket 48-core 96-thread system with
      96 testing threads in the same memcg doing kmalloc+kfree of a 4k object
      with accounting enabled for 10s. The total number of kmalloc+kfree done
      in kilo operations per second (kops/s) were as follows:
      
        Applied patches   v1 kops/s   v1 change   v2 kops/s   v2 change
        ---------------   ---------   ---------   ---------   ---------
             None           3,520        1.00X      6,242        1.00X
              1-2           4,304        1.22X      8,478        1.36X
              1-3           4,731        1.34X    418,142       66.99X
              1-4           4,587        1.30X    438,838       70.30X
      
      With memory accounting disabled, the kmalloc/kfree rate was 1,481,291
      kop/s. This test shows how significant the memory accouting overhead
      can be in some extreme situations.
      
      For this multithreaded test, the improvement from patch 2 mainly
      comes from the conditional atomic xchg of objcg->nr_charged_bytes in
      mod_objcg_state(). By using an unconditional xchg, the operation rates
      were similar to the unpatched kernel.
      
      Patch 3 elminates the single highly contended cacheline of
      objcg->nr_charged_bytes for cgroup v2 leading to a huge performance
      improvement. Cgroup v1, however, still has another highly contended
      cacheline in the shared page counter &memcg->kmem. So the improvement
      is only modest.
      
      Patch 4 helps in cgroup v2, but performs worse in cgroup v1 as
      eliminating the irq_disable/irq_enable overhead seems to aggravate the
      cacheline contention.
      
      [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
      [2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/
      
      This patch (of 4):
      
      mod_objcg_state() is moved from mm/slab.h to mm/memcontrol.c so that
      further optimization can be done to it in later patches without exposing
      unnecessary details to other mm components.
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-1-longman@redhat.com
      Link: https://lkml.kernel.org/r/20210506150007.16288-2-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fdbcb2a6
  10. 06 6月, 2021 2 次提交
  11. 07 5月, 2021 1 次提交