  1. 04 Jul 2022, 1 commit
    • mm: memcontrol: introduce mem_cgroup_ino() and mem_cgroup_get_from_ino() · c15187a4
      Authored by Roman Gushchin
      Patch series "mm: introduce shrinker debugfs interface", v5.
      
      The only existing debugging mechanism is a couple of tracepoints in
      do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end.  They
      aren't covering everything, though: shrinkers which report 0 objects
      will never show up, and there is no support for memcg-aware shrinkers.
      Shrinkers are identified by their scan function, which is not always
      enough (e.g. it is hard to guess which super block's shrinker it is
      when all you have is "super_cache_scan").
      
      To provide better visibility and debugging options for memory shrinkers,
      this patchset introduces a /sys/kernel/debug/shrinker interface, to
      some extent similar to /sys/kernel/slab.
      
      For each shrinker registered in the system, a directory is created.  As
      of now, the directory will contain only a "scan" file, which allows
      getting the number of managed objects for each memory cgroup (for
      memcg-aware shrinkers) and each numa node (for numa-aware shrinkers on
      a numa machine).  Other interfaces might be added in the future.
      
      To make debugging more pleasant, the patchset also names all shrinkers, so
      that debugfs entries can have meaningful names.
      
      
      This patch (of 5):
      
      Shrinker debugfs requires a way to represent memory cgroups without using
      full paths, both for displaying information and getting input from a user.
      
      The cgroup inode number is a perfect fit for this, and it is already
      used by bpf.
      
      This commit adds a couple of helper functions which will be used to handle
      memcg-aware shrinkers.
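
      As a rough illustration, the pair of helpers could look like the sketch
      below, modeled on the upstream memcontrol code; the exact config guards
      and error handling in the real patch may differ:

        /* Return the cgroup inode number backing this memcg (0 for NULL). */
        static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
        {
                return memcg ? cgroup_ino(memcg->css.cgroup) : 0;
        }

        /* Look up a memcg by the inode number of its cgroup directory. */
        struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
        {
                struct cgroup *cgrp;
                struct cgroup_subsys_state *css;
                struct mem_cgroup *memcg;

                cgrp = cgroup_get_from_id(ino);        /* takes a reference */
                if (!cgrp)
                        return ERR_PTR(-ENOENT);

                css = cgroup_get_e_css(cgrp, &memory_cgrp_subsys);
                memcg = css ? container_of(css, struct mem_cgroup, css)
                            : ERR_PTR(-ENOENT);

                cgroup_put(cgrp);
                return memcg;
        }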
      
      Link: https://lkml.kernel.org/r/20220601032227.4076670-1-roman.gushchin@linux.dev
      Link: https://lkml.kernel.org/r/20220601032227.4076670-2-roman.gushchin@linux.dev
      Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 17 Jun 2022, 3 commits
  3. 20 May 2022, 1 commit
    • zswap: memcg accounting · f4840ccf
      Authored by Johannes Weiner
      Applications can currently escape their cgroup memory containment when
      zswap is enabled.  This patch adds per-cgroup tracking and limiting of
      zswap backend memory to rectify this.
      
      The existing cgroup2 memory.stat file is extended to show zswap statistics
      analogous to what's in meminfo and vmstat.  Furthermore, two new control
      files, memory.zswap.current and memory.zswap.max, are added to allow
      tuning zswap usage on a per-workload basis.  This is important since not
      all workloads benefit from zswap equally; some even suffer compared to
      disk swap when memory contents don't compress well.  The optimal size
      of the zswap pool and the threshold for writeback also depend on the
      size of the workload's warm set.
      
      The implementation doesn't use a traditional page_counter transaction.
      zswap is unconventional as a memory consumer in that we only know the
      amount of memory to charge once expensive compression has occurred.  If
      zswap is disabled or the limit is already exceeded, we obviously don't
      want to compress page upon page only to reject them all.  Instead, the
      limit is checked against current usage, then we compress and charge.
      This allows some limit overrun, but not enough to matter in practice.
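
      A sketch of that check-compress-charge ordering is below; the helper
      names mirror the obj_cgroup zswap hooks this patch adds, while
      zswap_compress() is a hypothetical stand-in for the compression step.
      Illustrative only, not the literal patch:

        static bool zswap_store_sketch(struct page *page,
                                       struct obj_cgroup *objcg)
        {
                unsigned int dlen;

                /* 1. Check the limit against current usage before doing
                 * any expensive work.
                 */
                if (objcg && !obj_cgroup_may_zswap(objcg))
                        return false;

                /* 2. Compress; only now is the charge amount known. */
                dlen = zswap_compress(page);

                /* 3. Charge the compressed size; the window between check
                 * and charge allows a small limit overrun.
                 */
                if (objcg)
                        obj_cgroup_charge_zswap(objcg, dlen);

                return true;
        }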
      
      [hannes@cmpxchg.org: fix for CONFIG_SLOB builds]
        Link: https://lkml.kernel.org/r/YnwD14zxYjUJPc2w@cmpxchg.org
      [hannes@cmpxchg.org: opt out of cgroups v1]
        Link: https://lkml.kernel.org/r/Yn6it9mBYFA+/lTb@cmpxchg.org
      Link: https://lkml.kernel.org/r/20220510152847.230957-7-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. 14 May 2022, 1 commit
  5. 13 May 2022, 2 commits
  6. 10 May 2022, 1 commit
  7. 30 Apr 2022, 1 commit
    • memcg: introduce per-memcg reclaim interface · 94968384
      Authored by Shakeel Butt
      This patch series adds a memory.reclaim proactive reclaim interface.
      The rationale behind the interface and how it works are in the first
      patch.
      
      
      This patch (of 4):
      
      Introduce a memcg interface to trigger memory reclaim on a memory cgroup.
      
      Use case: Proactive Reclaim
      ---------------------------
      
      A userspace proactive reclaimer can continuously probe the memcg to
      reclaim a small amount of memory.  This gives more accurate and up-to-date
      workingset estimation as the LRUs are continuously sorted and can
      potentially provide more deterministic memory overcommit behavior.  The
      memory overcommit controller can provide a more proactive response to
      the changing behavior of the running applications instead of being
      reactive.
      
      A userspace reclaimer's purpose in this case is not to be a complete
      replacement for kswapd or direct reclaim; it is to proactively identify
      memory-saving opportunities and reclaim some amount of cold pages, as
      set by the policy, to free up memory for more demanding jobs or for
      scheduling new jobs.
      
      A user space proactive reclaimer is used in Google data centers. 
      Additionally, Meta's TMO paper recently referenced a very similar
      interface used for user space proactive reclaim:
      https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
      
      Benefits of a user space reclaimer:
      -----------------------------------
      
      1) More flexibility in deciding who should be charged for the cpu time
         of the memory reclaim.  For proactive reclaim, it makes more sense
         for this to be centralized.
      
      2) More flexibility in dedicating resources (like cpu).  The memory
         overcommit controller can balance the cost between the cpu usage and
         the memory reclaimed.
      
      3) Provides a way for applications to keep their LRUs sorted, so that
         better reclaim candidates are selected under memory pressure.  This
         also gives a more accurate and up-to-date notion of the working set
         for an application.
      
      Why is memory.high not enough?
      ------------------------------
      
      - memory.high can be used to trigger reclaim in a memcg and can
        potentially be used for proactive reclaim.  However, there is a big
        downside to using memory.high.  It can potentially introduce high
        reclaim stalls in the target application as the allocations from the
        processes or the threads of the application can hit the temporary
        memory.high limit.
      
      - Userspace proactive reclaimers usually use feedback loops to decide
        how much memory to proactively reclaim from a workload.  The metrics
        used for this are usually either refaults or PSI, and these metrics will
        become messy if the application gets throttled by hitting the high
        limit.
      
      - memory.high is a stateful interface; if the userspace proactive
        reclaimer crashes for any reason while triggering reclaim, it can
        leave the application in a bad state.
      
      - If a workload is rapidly expanding, setting memory.high to proactively
        reclaim memory can result in actually reclaiming more memory than
        intended.
      
      The benefits of such interface and shortcomings of existing interface were
      further discussed in this RFC thread:
      https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
      
      Interface:
      ----------
      
      Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
      trigger reclaim in the target memory cgroup.
      
      The interface is introduced as a nested-keyed file to allow for future
      optional arguments to be easily added to configure the behavior of
      reclaim.
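
      As a rough sketch, the write handler behind memory.reclaim could look
      like the following, modeled on the existing memcg cftype handlers; the
      retry policy and names may differ from the actual patch:

        static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
                                      size_t nbytes, loff_t off)
        {
                struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
                unsigned int nr_retries = MAX_RECLAIM_RETRIES;
                unsigned long nr_to_reclaim, nr_reclaimed = 0;
                int err;

                buf = strstrip(buf);
                err = page_counter_memparse(buf, "", &nr_to_reclaim);
                if (err)
                        return err;

                while (nr_reclaimed < nr_to_reclaim) {
                        unsigned long reclaimed;

                        if (signal_pending(current))
                                return -EINTR;

                        /* Ask the reclaim machinery for the remainder. */
                        reclaimed = try_to_free_mem_cgroup_pages(memcg,
                                        nr_to_reclaim - nr_reclaimed,
                                        GFP_KERNEL, true);

                        if (!reclaimed && !nr_retries--)
                                return -EAGAIN;

                        nr_reclaimed += reclaimed;
                }

                return nbytes;
        }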
      
      Possible Extensions:
      --------------------
      
      - This interface can be extended with an additional parameter or flags
        to allow specifying one or more types of memory to reclaim from (e.g.
        file, anon, ..).
      
      - The interface can also be extended with a node mask to reclaim from
        specific nodes. This has use cases for reclaim-based demotion in memory
        tiering systems.
      
      - A similar per-node interface can also be added to support proactive
        reclaim and reclaim-based demotion in systems without memcg.
      
      - Add a timeout parameter to make it easier for user space to call the
        interface without worrying about being blocked for an undefined amount
        of time.
      
      For now, let's keep things simple by adding the basic functionality.
      
      [yosryahmed@google.com: worked on versions v2 onwards, refreshed to
      current master, updated commit message based on recent
      discussions and use cases]
      Link: https://lkml.kernel.org/r/20220425190040.2475377-1-yosryahmed@google.com
      Link: https://lkml.kernel.org/r/20220425190040.2475377-2-yosryahmed@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Co-developed-by: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Chen Wandun <chenwandun@huawei.com>
      Cc: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  8. 29 Apr 2022, 8 commits
  9. 22 Apr 2022, 1 commit
    • memcg: sync flush only if periodic flush is delayed · 9b301615
      Authored by Shakeel Butt
      Daniel Dao has reported [1] a regression on workloads that may trigger a
      lot of refaults (anon and file).  The underlying issue is that flushing
      rstat is expensive.  Although rstat flushes are batched with (nr_cpus *
      MEMCG_BATCH) stat updates, it seems there are workloads which genuinely
      do more stat updates than the batch value within a short amount of
      time.  Since rstat flushing can happen in performance-critical
      codepaths like page faults, such workloads can suffer greatly.
      
      This patch fixes this regression by making the rstat flushing
      conditional in the performance-critical codepaths.  More specifically,
      the kernel relies on the async periodic rstat flusher to flush the
      stats, and only if the periodic flusher is delayed by more than twice
      its normal time window does the kernel allow rstat flushing from the
      performance-critical codepaths.
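
      A minimal sketch of that gate, with constant and helper names modeled
      on the upstream memcontrol code (illustrative, not the literal patch):

        /* The periodic flusher runs every FLUSH_TIME (2s). */
        #define FLUSH_TIME (2UL * HZ)

        static u64 flush_next_time;

        static void __mem_cgroup_flush_stats(void)
        {
                /* ... the expensive rstat flush itself ... */
                flush_next_time = jiffies_64 + 2 * FLUSH_TIME;
        }

        /* Called from performance-critical paths such as refaults. */
        void mem_cgroup_flush_stats_delayed(void)
        {
                /* Flush synchronously only if more than 2*FLUSH_TIME has
                 * passed since the last flush, i.e. the periodic flusher
                 * is at least one full window late.
                 */
                if (time_after64(jiffies_64, flush_next_time))
                        __mem_cgroup_flush_stats();
        }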
      
      Now the question: what are the side-effects of this change?  The worst
      that can happen is that the refault codepath will see 4-second-old
      lruvec stats and may cause false (or missed) activations of the
      refaulted page, which may under- or overestimate the workingset size.
      Though that is not very concerning, as the kernel can already miss or
      do false activations.
      
      There are two more codepaths whose flushing behavior is not changed by
      this patch, and we may need to come back to them in the future.  One is
      the writeback stats used by dirty throttling, and the second is the
      deactivation heuristic in reclaim.  For now we are keeping an eye on
      them; if there are reports of regressions due to these codepaths, we
      will reevaluate then.
      
      Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
      Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com
      Fixes: 1f828223 ("memcg: flush lruvec stats in the refault")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Daniel Dao <dqminh@cloudflare.com>
      Tested-by: Ivan Babrou <ivan@cloudflare.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Frank Hofmann <fhofmann@cloudflare.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 23 Mar 2022, 21 commits