1. 27 September 2022, 2 commits
    • mm: multi-gen LRU: exploit locality in rmap · 018ee47f
      Committed by Yu Zhao
      Searching the rmap for PTEs mapping each page on an LRU list (to test and
      clear the accessed bit) can be expensive because pages from different VMAs
      (PA space) are not cache friendly to the rmap (VA space).  For workloads
      mostly using mapped pages, searching the rmap can incur the highest CPU
      cost in the reclaim path.
      
      This patch exploits spatial locality to reduce the trips into the rmap. 
      When shrink_page_list() walks the rmap and finds a young PTE, a new
      function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent
      PTEs.  On finding another young PTE, it clears the accessed bit and
      updates the gen counter of the page mapped by this PTE to
      (max_seq%MAX_NR_GENS)+1.
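
      As an illustration only (not the kernel code), the following standalone
      C model sketches the effect described above; the structures and the
      look_around() helper are made up for this sketch:

        /* Illustrative userspace model of the look-around step: a window of
         * at most BITS_PER_LONG PTE slots around the faulting PTE is scanned,
         * young entries get their accessed bit cleared and the page they map
         * is promoted to generation (max_seq % MAX_NR_GENS) + 1.
         */
        #include <stdio.h>

        #define BITS_PER_LONG 64
        #define MAX_NR_GENS   4

        struct page_model { int gen; };    /* stand-in for the gen bits in folio->flags */
        struct pte_model  { int young; struct page_model *page; };

        static void look_around(struct pte_model *window, int n, unsigned long max_seq)
        {
                int new_gen = (int)(max_seq % MAX_NR_GENS) + 1;

                for (int i = 0; i < n && i < BITS_PER_LONG; i++) {
                        if (!window[i].young)
                                continue;
                        window[i].young = 0;            /* "clear the accessed bit" */
                        window[i].page->gen = new_gen;  /* promote to the youngest generation */
                }
        }

        int main(void)
        {
                struct page_model pages[4] = { {1}, {1}, {2}, {1} };
                struct pte_model window[4] = {
                        {1, &pages[0]}, {0, &pages[1]}, {1, &pages[2]}, {0, &pages[3]},
                };

                look_around(window, 4, 7);              /* max_seq = 7 -> new_gen = 4 */
                for (int i = 0; i < 4; i++)
                        printf("pte %d: young=%d gen=%d\n", i, window[i].young, pages[i].gen);
                return 0;
        }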
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): no change
      
        Single workload:
          memcached (anon): +[3, 5]%
                      Ops/sec      KB/sec
            patch1-6: 1106168.46   43025.04
            patch1-7: 1147696.57   44640.29
      
        Configurations:
          no change
      
      Client benchmark results:
        kswapd profiles:
          patch1-6
            39.03%  lzo1x_1_do_compress (real work)
            18.47%  page_vma_mapped_walk (overhead)
             6.74%  _raw_spin_unlock_irq
             3.97%  do_raw_spin_lock
             2.49%  ptep_clear_flush
             2.48%  anon_vma_interval_tree_iter_first
             1.92%  folio_referenced_one
             1.88%  __zram_bvec_write
             1.48%  memmove
             1.31%  vma_interval_tree_iter_next
      
          patch1-7
            48.16%  lzo1x_1_do_compress (real work)
             8.20%  page_vma_mapped_walk (overhead)
             7.06%  _raw_spin_unlock_irq
             2.92%  ptep_clear_flush
             2.53%  __zram_bvec_write
             2.11%  do_raw_spin_lock
             2.02%  memmove
             1.93%  lru_gen_look_around
             1.56%  free_unref_page_list
             1.40%  memset
      
        Configurations:
          no change
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-8-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Barry Song <baohua@kernel.org>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      018ee47f
    • mm: multi-gen LRU: groundwork · ec1c86b2
      Committed by Yu Zhao
      Evictable pages are divided into multiple generations for each lruvec.
      The youngest generation number is stored in lrugen->max_seq for both
      anon and file types as they are aged on an equal footing. The oldest
      generation numbers are stored in lrugen->min_seq[] separately for anon
      and file types as clean file pages can be evicted regardless of swap
      constraints. These three variables are monotonically increasing.
      
      Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
      in order to fit into the gen counter in folio->flags. Each truncated
      generation number is an index to lrugen->lists[]. The sliding window
      technique is used to track at least MIN_NR_GENS and at most
      MAX_NR_GENS generations. The gen counter stores a value within [1,
      MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
      stores 0.
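
      For illustration, a standalone C sketch of the mapping described above,
      with made-up helper names (with MAX_NR_GENS = 4, the counter needs
      order_base_2(4 + 1) = 3 bits):

        #include <stdio.h>

        #define MAX_NR_GENS 4

        /* index into lrugen->lists[] for a given sequence number */
        static unsigned int gen_from_seq(unsigned long seq)
        {
                return seq % MAX_NR_GENS;
        }

        /* value kept in the page's gen counter: 0 = not on a multi-gen list,
         * 1..MAX_NR_GENS = on the list of that truncated generation
         */
        static unsigned int gen_counter(unsigned long seq, int on_lru)
        {
                return on_lru ? gen_from_seq(seq) + 1 : 0;
        }

        int main(void)
        {
                for (unsigned long seq = 0; seq < 6; seq++)
                        printf("seq=%lu list index=%u counter=%u\n",
                               seq, gen_from_seq(seq), gen_counter(seq, 1));
                return 0;
        }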
      
      There are two conceptually independent procedures: "the aging", which
      produces young generations, and "the eviction", which consumes old
      generations.  They form a closed-loop system, i.e., "the page reclaim". 
      Both procedures can be invoked from userspace for the purposes of working
      set estimation and proactive reclaim.  These techniques are commonly used
      to optimize job scheduling (bin packing) in data centers [1][2].
      
      To avoid confusion, the terms "hot" and "cold" will be applied to the
      multi-gen LRU, as a new convention; the terms "active" and "inactive" will
      be applied to the active/inactive LRU, as usual.
      
      The protection of hot pages and the selection of cold pages are based
      on page access channels and patterns. There are two access channels:
      one through page tables and the other through file descriptors. The
      protection of the former channel is by design stronger because:
      1. The uncertainty in determining the access patterns of the former
         channel is higher due to the approximation of the accessed bit.
      2. The cost of evicting the former channel is higher due to the TLB
         flushes required and the likelihood of encountering the dirty bit.
      3. The penalty of underprotecting the former channel is higher because
         applications usually do not prepare themselves for major page
         faults like they do for blocked I/O. E.g., GUI applications
         commonly use dedicated I/O threads to avoid blocking rendering
         threads.
      
      There are also two access patterns: one with temporal locality and the
      other without.  For the reasons listed above, the former channel is
      assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
      present; the latter channel is assumed to follow the latter pattern unless
      outlying refaults have been observed [3][4].
      
      The next patch will address the "outlying refaults".  Three macros, i.e.,
      LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
      this patch to make the entire patchset less diffy.
      
      A page is added to the youngest generation on faulting.  The aging needs
      to check the accessed bit at least twice before handing this page over to
      the eviction.  The first check takes care of the accessed bit set on the
      initial fault; the second check makes sure this page has not been used
      since then.  This protocol, AKA second chance, requires a minimum of two
      generations, hence MIN_NR_GENS.
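
      A toy C model of this second-chance protocol, assuming MIN_NR_GENS == 2;
      the structure and helper below are illustrative only:

        #include <stdio.h>

        #define MIN_NR_GENS 2

        struct page_model {
                int accessed;      /* the accessed bit, set again on every reference */
                int idle_checks;   /* checks that found the bit clear since it was last set */
        };

        /* one aging check of the accessed bit; returns 1 once the page may be evicted */
        static int age_page(struct page_model *p)
        {
                if (p->accessed) {              /* first check clears the fault-time bit */
                        p->accessed = 0;
                        p->idle_checks = 0;
                        return 0;
                }
                /* later checks: the page has not been used since the bit was cleared */
                return ++p->idle_checks >= MIN_NR_GENS - 1;
        }

        int main(void)
        {
                struct page_model p = { .accessed = 1, .idle_checks = 0 };

                for (int pass = 1; pass <= 3; pass++)
                        printf("aging pass %d: evictable=%d\n", pass, age_page(&p));
                return 0;
        }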
      
      [1] https://dl.acm.org/doi/10.1145/3297858.3304053
      [2] https://dl.acm.org/doi/10.1145/3503222.3507731
      [3] https://lwn.net/Articles/495543/
      [4] https://lwn.net/Articles/815342/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ec1c86b2
  2. 12 September 2022, 2 commits
  3. 30 July 2022, 4 commits
  4. 18 July 2022, 2 commits
  5. 04 July 2022, 1 commit
    • mm: memcontrol: introduce mem_cgroup_ino() and mem_cgroup_get_from_ino() · c15187a4
      Committed by Roman Gushchin
      Patch series "mm: introduce shrinker debugfs interface", v5.
      
      The only existing debugging mechanism is a couple of tracepoints in
      do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end.  They
      aren't covering everything though: shrinkers which report 0 objects will
      never show up, and there is no support for memcg-aware shrinkers.
      Shrinkers are identified by their scan function, which is not always
      enough (e.g., it is hard to tell which super block's shrinker it is when
      only "super_cache_scan" is visible).
      
      To provide better visibility and debugging options for memory shrinkers,
      this patchset introduces a /sys/kernel/debug/shrinker interface, to some
      extent similar to /sys/kernel/slab.
      
      For each shrinker registered in the system a directory is created.  For
      now, the directory contains only a "scan" file, which allows getting the
      number of managed objects for each memory cgroup (for memcg-aware
      shrinkers) and each NUMA node (for NUMA-aware shrinkers on a NUMA
      machine).  Other interfaces might be added in the future.
      
      To make debugging more pleasant, the patchset also names all shrinkers, so
      that debugfs entries can have meaningful names.
      
      
      This patch (of 5):
      
      Shrinker debugfs requires a way to represent memory cgroups without using
      full paths, both for displaying information and getting input from a user.
      
      Cgroup inode number is a perfect way, already used by bpf.
      
      This commit adds a couple of helper functions which will be used to handle
      memcg-aware shrinkers.
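
      For reference, from userspace the identifier referred to above is just
      the inode number of the cgroup's directory; a small sketch, assuming a
      cgroup v2 hierarchy mounted at /sys/fs/cgroup (the path is an example):

        #include <stdio.h>
        #include <sys/stat.h>

        int main(int argc, char **argv)
        {
                const char *path = argc > 1 ? argv[1] : "/sys/fs/cgroup";
                struct stat st;

                if (stat(path, &st) != 0) {
                        perror("stat");
                        return 1;
                }
                /* this inode number identifies the cgroup without using its full path */
                printf("%s: inode %llu\n", path, (unsigned long long)st.st_ino);
                return 0;
        }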
      
      Link: https://lkml.kernel.org/r/20220601032227.4076670-1-roman.gushchin@linux.dev
      Link: https://lkml.kernel.org/r/20220601032227.4076670-2-roman.gushchin@linux.dev
      Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c15187a4
  6. 17 June 2022, 3 commits
  7. 20 May 2022, 1 commit
    • zswap: memcg accounting · f4840ccf
      Committed by Johannes Weiner
      Applications can currently escape their cgroup memory containment when
      zswap is enabled.  This patch adds per-cgroup tracking and limiting of
      zswap backend memory to rectify this.
      
      The existing cgroup2 memory.stat file is extended to show zswap statistics
      analogous to what's in meminfo and vmstat.  Furthermore, two new control
      files, memory.zswap.current and memory.zswap.max, are added to allow
      tuning zswap usage on a per-workload basis.  This is important since not
      all workloads benefit from zswap equally; some even suffer compared to
      disk swap when memory contents don't compress well.  The optimal size of
      the zswap pool, and the threshold for writeback, also depends on the size
      of the workload's warm set.
      
      The implementation doesn't use a traditional page_counter transaction. 
      zswap is unconventional as a memory consumer in that we only know the
      amount of memory to charge once expensive compression has occurred.  If
      zswap is disabled or the limit is already exceeded, we obviously don't want
      to compress page upon page only to reject them all.  Instead, the limit is
      checked against current usage, then we compress and charge.  This allows
      some limit overrun, but not enough to matter in practice.
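
      A toy model of this check-then-compress-then-charge order (illustrative
      only, not the kernel implementation; compress_page() is a made-up
      stand-in for the real compression step):

        #include <stdio.h>
        #include <stdbool.h>

        struct zswap_model {
                unsigned long current_bytes;   /* analogous to memory.zswap.current */
                unsigned long max_bytes;       /* analogous to memory.zswap.max */
        };

        /* hypothetical stand-in for compression: returns the compressed size */
        static unsigned long compress_page(unsigned long page_size)
        {
                return page_size / 3;
        }

        static bool zswap_store_model(struct zswap_model *z, unsigned long page_size)
        {
                if (z->current_bytes >= z->max_bytes)   /* 1. cheap limit check first */
                        return false;
                unsigned long compressed = compress_page(page_size);  /* 2. expensive work */
                z->current_bytes += compressed;         /* 3. charge afterwards (small overrun possible) */
                return true;
        }

        int main(void)
        {
                struct zswap_model z = { .current_bytes = 0, .max_bytes = 4096 };

                for (int i = 0; i < 5; i++)
                        printf("store %d: %s (usage %lu/%lu)\n", i,
                               zswap_store_model(&z, 4096) ? "accepted" : "rejected",
                               z.current_bytes, z.max_bytes);
                return 0;
        }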
      
      [hannes@cmpxchg.org: fix for CONFIG_SLOB builds]
        Link: https://lkml.kernel.org/r/YnwD14zxYjUJPc2w@cmpxchg.org
      [hannes@cmpxchg.org: opt out of cgroups v1]
        Link: https://lkml.kernel.org/r/Yn6it9mBYFA+/lTb@cmpxchg.org
      Link: https://lkml.kernel.org/r/20220510152847.230957-7-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f4840ccf
  8. 14 May 2022, 1 commit
  9. 13 May 2022, 2 commits
  10. 10 May 2022, 1 commit
  11. 30 April 2022, 1 commit
    • memcg: introduce per-memcg reclaim interface · 94968384
      Committed by Shakeel Butt
      This patch series adds a memory.reclaim proactive reclaim interface.
      The rationale behind the interface and how it works are in the first
      patch.
      
      
      This patch (of 4):
      
      Introduce a memcg interface to trigger memory reclaim on a memory cgroup.
      
      Use case: Proactive Reclaim
      ---------------------------
      
      A userspace proactive reclaimer can continuously probe the memcg to
      reclaim a small amount of memory.  This gives more accurate and up-to-date
      workingset estimation as the LRUs are continuously sorted and can
      potentially provide more deterministic memory overcommit behavior.  The
      memory overcommit controller can provide more proactive response to the
      changing behavior of the running applications instead of being reactive.
      
      A userspace reclaimer's purpose in this case is not a complete replacement
      for kswapd or direct reclaim, it is to proactively identify memory savings
      opportunities and reclaim some amount of cold pages set by the policy to
      free up the memory for more demanding jobs or scheduling new jobs.
      
      A user space proactive reclaimer is used in Google data centers. 
      Additionally, Meta's TMO paper recently referenced a very similar
      interface used for user space proactive reclaim:
      https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
      
      Benefits of a user space reclaimer:
      -----------------------------------
      
      1) More flexible on who should be charged for the cpu of the memory
         reclaim.  For proactive reclaim, it makes more sense to be centralized.
      
      2) More flexible on dedicating the resources (like cpu).  The memory
         overcommit controller can balance the cost between the cpu usage and
         the memory reclaimed.
      
      3) Provides a way to the applications to keep their LRUs sorted, so,
         under memory pressure better reclaim candidates are selected.  This
         also gives more accurate and uptodate notion of working set for an
         application.
      
      Why is memory.high not enough?
      ------------------------------
      
      - memory.high can be used to trigger reclaim in a memcg and can
        potentially be used for proactive reclaim.  However there is a big
        downside in using memory.high.  It can potentially introduce high
        reclaim stalls in the target application as the allocations from the
        processes or the threads of the application can hit the temporary
        memory.high limit.
      
      - Userspace proactive reclaimers usually use feedback loops to decide
        how much memory to proactively reclaim from a workload.  The metrics
        used for this are usually either refaults or PSI, and these metrics will
        become messy if the application gets throttled by hitting the high
        limit.
      
      - memory.high is a stateful interface, if the userspace proactive
        reclaimer crashes for any reason while triggering reclaim it can leave
        the application in a bad state.
      
      - If a workload is rapidly expanding, setting memory.high to proactively
        reclaim memory can result in actually reclaiming more memory than
        intended.
      
      The benefits of such interface and shortcomings of existing interface were
      further discussed in this RFC thread:
      https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
      
      Interface:
      ----------
      
      Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
      trigger reclaim in the target memory cgroup.
      
      The interface is introduced as a nested-keyed file to allow for future
      optional arguments to be easily added to configure the behavior of
      reclaim.
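
      A minimal userspace usage sketch, equivalent to the echo above (the
      cgroup path is an example):

        #include <stdio.h>

        int main(void)
        {
                /* example cgroup; adjust to the target memcg's path */
                const char *path = "/sys/fs/cgroup/example.slice/memory.reclaim";
                FILE *f = fopen(path, "w");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                /* same as: echo 10M > memory.reclaim */
                if (fputs("10M\n", f) == EOF || fclose(f) == EOF) {
                        perror("write");
                        return 1;
                }
                return 0;
        }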
      
      Possible Extensions:
      --------------------
      
      - This interface can be extended with an additional parameter or flags
        to allow specifying one or more types of memory to reclaim from (e.g.
        file, anon, ..).
      
      - The interface can also be extended with a node mask to reclaim from
        specific nodes. This has use cases for reclaim-based demotion in memory
        tiering systems.
      
      - A similar per-node interface can also be added to support proactive
        reclaim and reclaim-based demotion in systems without memcg.
      
      - Add a timeout parameter to make it easier for user space to call the
        interface without worrying about being blocked for an undefined amount
        of time.
      
      For now, let's keep things simple by adding the basic functionality.
      
      [yosryahmed@google.com: worked on versions v2 onwards, refreshed to
      current master, updated commit message based on recent
      discussions and use cases]
      Link: https://lkml.kernel.org/r/20220425190040.2475377-1-yosryahmed@google.com
      Link: https://lkml.kernel.org/r/20220425190040.2475377-2-yosryahmed@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Co-developed-by: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Chen Wandun <chenwandun@huawei.com>
      Cc: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      94968384
  12. 29 April 2022, 8 commits
  13. 22 April 2022, 1 commit
    • memcg: sync flush only if periodic flush is delayed · 9b301615
      Committed by Shakeel Butt
      Daniel Dao has reported [1] a regression on workloads that may trigger a
      lot of refaults (anon and file).  The underlying issue is that flushing
      rstat is expensive.  Although rstat flushes are batched with (nr_cpus *
      MEMCG_BATCH) stat updates, it seems that there are workloads which
      genuinely do more stat updates than the batch value within a short amount
      of time.  Since the rstat flush can happen in performance-critical
      codepaths like page faults, such workloads can suffer greatly.
      
      This patch fixes this regression by making the rstat flushing
      conditional in the performance critical codepaths.  More specifically,
      the kernel relies on the async periodic rstat flusher to flush the stats
      and only if the periodic flusher is delayed by more than twice the
      amount of its normal time window then the kernel allows rstat flushing
      from the performance critical codepaths.
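
      A sketch of that condition with illustrative names, assuming the
      2-second periodic flush window implied by the "4-second-old" figure
      below:

        #include <stdio.h>
        #include <stdbool.h>

        #define PERIODIC_FLUSH_WINDOW_SEC 2   /* assumed periodic flush interval */

        /* allow a synchronous flush from a hot path only when the periodic
         * flusher is late by more than twice its normal window
         */
        static bool should_flush_synchronously(long now, long last_periodic_flush)
        {
                return now - last_periodic_flush > 2 * PERIODIC_FLUSH_WINDOW_SEC;
        }

        int main(void)
        {
                long last = 10;   /* the periodic flusher last ran at t=10s */

                for (long now = 10; now <= 16; now += 2)
                        printf("t=%lds: sync flush %s\n", now,
                               should_flush_synchronously(now, last) ? "allowed" : "skipped");
                return 0;
        }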
      
      Now the question: what are the side effects of this change?  The worst
      that can happen is that the refault codepath will see 4-second-old lruvec
      stats and may cause false (or missed) activations of the refaulted page,
      which may under- or overestimate the workingset size.  That is not very
      concerning, as the kernel can already miss or do false activations.
      
      There are two more codepaths whose flushing behavior is not changed by
      this patch, and we may need to come back to them in the future.  One is
      the writeback stats used by dirty throttling, and the second is the
      deactivation heuristic in reclaim.  For now we keep an eye on them; if
      regressions due to these codepaths are reported, we will reevaluate.
      
      Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
      Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com
      Fixes: 1f828223 ("memcg: flush lruvec stats in the refault")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Daniel Dao <dqminh@cloudflare.com>
      Tested-by: Ivan Babrou <ivan@cloudflare.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Frank Hofmann <fhofmann@cloudflare.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b301615
  14. 23 March 2022, 11 commits
    • memcg: do not tweak node in alloc_mem_cgroup_per_node_info · 8c9bb398
      Committed by Wei Yang
      alloc_mem_cgroup_per_node_info() is called for each possible node, and
      this used to be a problem because !node_online nodes didn't have the
      appropriate data structure allocated.  This has changed with "mm: handle
      uninitialized numa nodes gracefully", so we can drop the special casing
      here.
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-7-mhocko@kernel.org
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Rafael Aquini <raquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8c9bb398
    • mm: memcontrol: fix cannot alloc the maximum memcg ID · be740503
      Committed by Muchun Song
      idr_alloc() does not include the @max ID, so in the current
      implementation the maximum memcg ID is 65534 instead of 65535.  This
      looks like a bug, so fix it.
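
      A tiny illustration of the off-by-one, using the numbers above and a
      hypothetical allocator whose upper bound is exclusive, as the text says
      idr_alloc()'s is:

        #include <stdio.h>

        #define MEM_CGROUP_ID_MAX 65535   /* illustrative value from the text */

        /* model of allocating from [1, end): the largest ID ever handed out */
        static int largest_id(int end_exclusive)
        {
                return end_exclusive - 1;
        }

        int main(void)
        {
                printf("bound %d     -> largest ID %d\n",
                       MEM_CGROUP_ID_MAX, largest_id(MEM_CGROUP_ID_MAX));
                printf("bound %d + 1 -> largest ID %d\n",
                       MEM_CGROUP_ID_MAX, largest_id(MEM_CGROUP_ID_MAX + 1));
                return 0;
        }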
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-15-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be740503
    • mm: memcontrol: reuse memory cgroup ID for kmem ID · f9c69d63
      Committed by Muchun Song
      There are two idrs used by the memory cgroup code: one for the kmem ID
      and another for the memory cgroup ID.  The maximum ID of both is 64Ki,
      and either of them can limit the total number of memory cgroups.  We can
      simply reuse the memory cgroup ID for the kmem ID to simplify the code.
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-14-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9c69d63
    • mm: list_lru: replace linear array with xarray · bbca91cc
      Committed by Muchun Song
      If we run 10k containers in the system, the size of
      list_lru_memcg->lrus can be ~96KB per list_lru.  When we decrease the
      number of containers, the array is not shrunk, which is not scalable.
      The xarray is a good choice for this case: we can save a lot of memory
      when there are tens of thousands of containers in the system, and using
      an xarray also lets us remove the array-resizing logic, which simplifies
      the code.
      
      [akpm@linux-foundation.org: remove unused local]
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-13-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbca91cc
    • mm: list_lru: rename memcg_drain_all_list_lrus to memcg_reparent_list_lrus · 1f391eb2
      Committed by Muchun Song
      The purpose of memcg_drain_all_list_lrus() is reparenting list_lrus; it
      is very similar to memcg_reparent_objcgs().  Rename it to
      memcg_reparent_list_lrus() so that the name is more consistent with
      memcg_reparent_objcgs().
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-12-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f391eb2
    • mm: list_lru: allocate list_lru_one only when needed · 5abc1e37
      Committed by Muchun Song
      On our server, we found a suspected memory leak: kmalloc-32 consumes
      more than 6GB of memory, while the other kmem_caches consume less than
      2GB of memory.
      
      After our in-depth analysis, the memory consumption of the kmalloc-32
      slab cache turns out to be caused by list_lru_one allocations.
      
        crash> p memcg_nr_cache_ids
        memcg_nr_cache_ids = $2 = 24574
      
      memcg_nr_cache_ids is very large, and the memory consumption of each
      list_lru can be calculated with the following formula.
      
        num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
      
      There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
      
        crash> list super_blocks | wc -l
        952
      
      Every mount registers 2 list_lrus, one for the inodes and another for
      the dentries, and there are 952 super_blocks, so the total memory is 952
      * 2 * 3 MB (~5.6GB).  But the number of memory cgroups is less than 500,
      so I guess more than 12286 containers have been deployed on this machine
      (I do not know why there are so many containers; it may be a user bug, or
      the user may really want to do that), and memcg_nr_cache_ids has not been
      reduced to a suitable value.  This wastes a lot of memory.
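
      Working through the arithmetic above with the quoted numbers:

        #include <stdio.h>

        int main(void)
        {
                unsigned long num_numa_nodes     = 4;
                unsigned long memcg_nr_cache_ids = 24574;
                unsigned long object_size        = 32;    /* kmalloc-32 */
                unsigned long super_blocks       = 952;
                unsigned long lrus_per_sb        = 2;     /* inode + dentry */

                unsigned long per_lru = num_numa_nodes * memcg_nr_cache_ids * object_size;
                unsigned long total   = per_lru * super_blocks * lrus_per_sb;

                printf("per list_lru: %lu bytes (~%.1f MB)\n", per_lru, per_lru / 1048576.0);
                printf("total       : %lu bytes (~%.1f GB)\n", total, total / 1073741824.0);
                return 0;
        }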
      
      Now the infrastructure for dynamic list_lru_one allocation is ready, so
      remove statically allocated memory code to save memory.
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-11-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5abc1e37
    • mm: memcontrol: move memcg_online_kmem() to mem_cgroup_css_online() · da0efe30
      Committed by Muchun Song
      It simplifies the code to move memcg_online_kmem() to
      mem_cgroup_css_online(); then we do not need to set ->kmemcg_id to -1 to
      indicate that the memcg is offline.  In the next patch, ->kmemcg_id will
      be used to synchronize list_lru reparenting, which requires ->kmemcg_id
      not to change.
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-10-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da0efe30
    • mm: introduce kmem_cache_alloc_lru · 88f2ef73
      Committed by Muchun Song
      We currently allocate scope for every memcg to be able to be tracked on
      every superblock instantiated in the system, regardless of whether that
      superblock is even accessible to that memcg.
      
      These huge memcg counts come from container hosts where memcgs are
      confined to just a small subset of the total number of superblocks
      instantiated at any given point in time.
      
      For these systems with huge container counts, list_lru does not need the
      capability of tracking every memcg on every superblock.  What it comes
      down to is adding the memcg to the list_lru at the first insert.  So
      introduce kmem_cache_alloc_lru to allocate objects together with their
      list_lru.  In a later patch, we will convert all inode and dentry
      allocations from kmem_cache_alloc to kmem_cache_alloc_lru.
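
      A generic userspace sketch of the allocate-on-first-insert idea (not the
      kernel list_lru code; the array bound and helper names are made up):

        #include <stdio.h>
        #include <stdlib.h>

        #define NR_MEMCGS 8   /* illustrative bound */

        struct lru_one { long nr_items; };

        /* all entries stay NULL until the corresponding memcg inserts something */
        static struct lru_one *per_memcg_lru[NR_MEMCGS];

        static int lru_add(int memcg_id)
        {
                if (!per_memcg_lru[memcg_id]) {   /* first insert for this memcg */
                        per_memcg_lru[memcg_id] = calloc(1, sizeof(struct lru_one));
                        if (!per_memcg_lru[memcg_id])
                                return -1;
                }
                per_memcg_lru[memcg_id]->nr_items++;
                return 0;
        }

        int main(void)
        {
                lru_add(3);
                lru_add(3);
                for (int i = 0; i < NR_MEMCGS; i++)
                        printf("memcg %d: %s, items=%ld\n", i,
                               per_memcg_lru[i] ? "allocated" : "not allocated",
                               per_memcg_lru[i] ? per_memcg_lru[i]->nr_items : 0);
                return 0;
        }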
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-3-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88f2ef73
    • mm/memcg: disable migration instead of preemption in drain_all_stock(). · 0790ed62
      Committed by Sebastian Andrzej Siewior
      Before the for-each-CPU loop, preemption is disabled so that
      drain_local_stock() can be invoked directly instead of scheduling a
      worker.  Ensuring that drain_local_stock() has completed on the local CPU
      is not a correctness problem.  It _could_ be that the charging path will
      be forced to reclaim memory because cached charges are still waiting for
      their draining.
      
      Disabling preemption before invoking drain_local_stock() is problematic
      on PREEMPT_RT due to the sleeping locks involved.  To ensure that no CPU
      migration happens across for_each_online_cpu() it is enough to use
      migrate_disable(), which disables migration and keeps the context
      preemptible so that a sleeping lock can be acquired.  A race with CPU
      hotplug is not a problem because pcp data is not going away.  In the
      worst case we just schedule draining of an empty stock.
      
      Use migrate_disable() instead of get_cpu() around the
      for_each_online_cpu() loop.
      
      Link: https://lkml.kernel.org/r/20220226204144.1008339-7-bigeasy@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel test robot <oliver.sang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0790ed62
    • mm/memcg: protect memcg_stock with a local_lock_t · 56751146
      Committed by Sebastian Andrzej Siewior
      The members of the per-CPU structure memcg_stock_pcp are protected by
      disabling interrupts.  This does not work on PREEMPT_RT because it
      creates atomic context in which actions are performed that require a
      preemptible context.  One example is obj_cgroup_release().
      
      The IRQ-disable sections can be replaced with local_lock_t, which
      preserves the explicit disabling of interrupts while keeping the code
      preemptible on PREEMPT_RT.
      
      drain_obj_stock() drops a reference on obj_cgroup, which leads to an
      invocation of obj_cgroup_release() if it is the last object.  This in
      turn leads to recursive locking of the local_lock_t.  To avoid this,
      obj_cgroup_release() is invoked outside of the locked section.
      
      obj_cgroup_uncharge_pages() can be invoked with the local_lock_t
      acquired and without it.  This will later lead to a recursion in
      refill_stock().  To avoid the locking recursion, provide
      obj_cgroup_uncharge_pages_locked(), which uses the locked version of
      refill_stock().
      
       - Replace disabling interrupts for memcg_stock with a local_lock_t.
      
       - Let drain_obj_stock() return the old struct obj_cgroup which is
         passed to obj_cgroup_put() outside of the locked section.
      
       - Provide obj_cgroup_uncharge_pages_locked() which uses the locked
         version of refill_stock() to avoid recursive locking in
         drain_obj_stock().
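
      A generic pthreads illustration (not kernel code) of the pattern above:
      the drain step detaches and returns the old object while the lock is
      held, and the final reference drop happens only after the lock is
      released, so the release path cannot re-enter the same lock:

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct obj { int refcount; };

        static pthread_mutex_t stock_lock = PTHREAD_MUTEX_INITIALIZER;
        static struct obj *cached_obj;   /* protected by stock_lock */

        /* may do heavy release work; must never be called with stock_lock held */
        static void obj_put(struct obj *o)
        {
                if (o && --o->refcount == 0) {
                        printf("released outside the locked section\n");
                        free(o);
                }
        }

        /* detach the cached object under the lock and hand it to the caller */
        static struct obj *drain_stock(void)
        {
                struct obj *old = cached_obj;
                cached_obj = NULL;
                return old;
        }

        int main(void)
        {
                cached_obj = calloc(1, sizeof(*cached_obj));
                if (!cached_obj)
                        return 1;
                cached_obj->refcount = 1;

                pthread_mutex_lock(&stock_lock);
                struct obj *old = drain_stock();   /* no obj_put() while the lock is held */
                pthread_mutex_unlock(&stock_lock);

                obj_put(old);                      /* the final put happens lock-free */
                return 0;
        }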
      
      Link: https://lkml.kernel.org/r/20220209014709.GA26885@xsang-OptiPlex-9020
      Link: https://lkml.kernel.org/r/20220226204144.1008339-6-bigeasy@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56751146
    • mm/memcg: opencode the inner part of obj_cgroup_uncharge_pages() in drain_obj_stock() · af9a3b69
      Committed by Johannes Weiner
      Provide the inner part of refill_stock() as __refill_stock() without
      disabling interrupts.  This eases the integration of local_lock_t where
      recursive locking must be avoided.
      
      Open code obj_cgroup_uncharge_pages() in drain_obj_stock() and use
      __refill_stock().  The caller of drain_obj_stock() already disables
      interrupts.
      
      [bigeasy@linutronix.de: patch body around Johannes' diff]
      
      Link: https://lkml.kernel.org/r/20220226204144.1008339-5-bigeasy@linutronix.de
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: kernel test robot <oliver.sang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af9a3b69