- 04 7月, 2022 1 次提交
-
-
由 Roman Gushchin 提交于
Patch series "mm: introduce shrinker debugfs interface", v5. The only existing debugging mechanism is a couple of tracepoints in do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't covering everything though: shrinkers which report 0 objects will never show up, there is no support for memcg-aware shrinkers. Shrinkers are identified by their scan function, which is not always enough (e.g. hard to guess which super block's shrinker it is having only "super_cache_scan"). To provide a better visibility and debug options for memory shrinkers this patchset introduces a /sys/kernel/debug/shrinker interface, to some extent similar to /sys/kernel/slab. For each shrinker registered in the system a directory is created. As now, the directory will contain only a "scan" file, which allows to get the number of managed objects for each memory cgroup (for memcg-aware shrinkers) and each numa node (for numa-aware shrinkers on a numa machine). Other interfaces might be added in the future. To make debugging more pleasant, the patchset also names all shrinkers, so that debugfs entries can have meaningful names. This patch (of 5): Shrinker debugfs requires a way to represent memory cgroups without using full paths, both for displaying information and getting input from a user. Cgroup inode number is a perfect way, already used by bpf. This commit adds a couple of helper functions which will be used to handle memcg-aware shrinkers. Link: https://lkml.kernel.org/r/20220601032227.4076670-1-roman.gushchin@linux.dev Link: https://lkml.kernel.org/r/20220601032227.4076670-2-roman.gushchin@linux.devSigned-off-by: NRoman Gushchin <roman.gushchin@linux.dev> Acked-by: NMuchun Song <songmuchun@bytedance.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
- 17 6月, 2022 3 次提交
-
-
由 Roman Gushchin 提交于
Currently mem_cgroup_from_obj() is not working properly with objects allocated using vmalloc(). It creates problems in some cases, when it's called for static objects belonging to modules or generally allocated using vmalloc(). This patch makes mem_cgroup_from_obj() safe to be called on objects allocated using vmalloc(). It also introduces mem_cgroup_from_slab_obj(), which is a faster version to use in places when we know the object is either a slab object or a generic slab page (e.g. when adding an object to a lru list). Link: https://lkml.kernel.org/r/20220610180310.1725111-1-roman.gushchin@linux.devSuggested-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NRoman Gushchin <roman.gushchin@linux.dev> Tested-by: NLinux Kernel Functional Testing <lkft@linaro.org> Acked-by: NShakeel Butt <shakeelb@google.com> Tested-by: NVasily Averin <vvs@openvz.org> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMuchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Qian Cai <quic_qiancai@quicinc.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Qi Zheng 提交于
There are already statistics of {pgscan,pgsteal}_kswapd and {pgscan,pgsteal}_direct of memcg event here, but now only the sum of the two is displayed in memory.stat of cgroup v2. In order to obtain more accurate information during monitoring and debugging, and to align with the display in /proc/vmstat, it better to display {pgscan,pgsteal}_kswapd and {pgscan,pgsteal}_direct separately. Also, for forward compatibility, we still display pgscan and pgsteal items so that it won't break existing applications. [zhengqi.arch@bytedance.com: add comment for memcg_vm_event_stat (suggested by Michal)] Link: https://lkml.kernel.org/r/20220606154028.55030-1-zhengqi.arch@bytedance.com [zhengqi.arch@bytedance.com: fix the doc, thanks to Johannes] Link: https://lkml.kernel.org/r/20220607064803.79363-1-zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/20220604082209.55174-1-zhengqi.arch@bytedance.comSigned-off-by: NQi Zheng <zhengqi.arch@bytedance.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NRoman Gushchin <roman.gushchin@linux.dev> Acked-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NShakeel Butt <shakeelb@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Yang Yang 提交于
There is no slabinfo.py in tools/cgroup, but has memcg_slabinfo.py instead. Link: https://lkml.kernel.org/r/20220610024451.744135-1-yang.yang29@zte.com.cnSigned-off-by: NYang Yang <yang.yang29@zte.com.cn> Reviewed-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NRoman Gushchin <roman.gushchin@linux.dev> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
- 20 5月, 2022 1 次提交
-
-
由 Johannes Weiner 提交于
Applications can currently escape their cgroup memory containment when zswap is enabled. This patch adds per-cgroup tracking and limiting of zswap backend memory to rectify this. The existing cgroup2 memory.stat file is extended to show zswap statistics analogous to what's in meminfo and vmstat. Furthermore, two new control files, memory.zswap.current and memory.zswap.max, are added to allow tuning zswap usage on a per-workload basis. This is important since not all workloads benefit from zswap equally; some even suffer compared to disk swap when memory contents don't compress well. The optimal size of the zswap pool, and the threshold for writeback, also depends on the size of the workload's warm set. The implementation doesn't use a traditional page_counter transaction. zswap is unconventional as a memory consumer in that we only know the amount of memory to charge once expensive compression has occurred. If zwap is disabled or the limit is already exceeded we obviously don't want to compress page upon page only to reject them all. Instead, the limit is checked against current usage, then we compress and charge. This allows some limit overrun, but not enough to matter in practice. [hannes@cmpxchg.org: fix for CONFIG_SLOB builds] Link: https://lkml.kernel.org/r/YnwD14zxYjUJPc2w@cmpxchg.org [hannes@cmpxchg.org: opt out of cgroups v1] Link: https://lkml.kernel.org/r/Yn6it9mBYFA+/lTb@cmpxchg.org Link: https://lkml.kernel.org/r/20220510152847.230957-7-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
- 14 5月, 2022 1 次提交
-
-
由 Ganesan Rajagopal 提交于
We run a lot of automated tests when building our software and run into OOM scenarios when the tests run unbounded. v1 memcg exports memcg->watermark as "memory.max_usage_in_bytes" in sysfs. We use this metric to heuristically limit the number of tests that can run in parallel based on per test historical data. This metric is currently not exported for v2 memcg and there is no other easy way of getting this information. getrusage() syscall returns "ru_maxrss" which can be used as an approximation but that's the max RSS of a single child process across all children instead of the aggregated max for all child processes. The only work around is to periodically poll "memory.current" but that's not practical for short-lived one-off cgroups. Hence, expose memcg->watermark as "memory.peak" for v2 memcg. Link: https://lkml.kernel.org/r/20220507050916.GA13577@us192.sjc.aristanetworks.comSigned-off-by: NGanesan Rajagopal <rganesan@arista.com> Acked-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NRoman Gushchin <roman.gushchin@linux.dev> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: NMichal Koutný <mkoutny@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
- 13 5月, 2022 2 次提交
-
-
由 Matthew Wilcox (Oracle) 提交于
This removes an assumption that a large folio is HPAGE_PMD_NR pages in size. Link: https://lkml.kernel.org/r/20220504182857.4013401-8-willy@infradead.orgSigned-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Peter Xu 提交于
This patch still does not use pte marker in any way, however it teaches the core mm about the pte marker idea. For example, handle_pte_marker() is introduced that will parse and handle all the pte marker faults. Many of the places are more about commenting it up - so that we know there's the possibility of pte marker showing up, and why we don't need special code for the cases. [peterx@redhat.com: userfaultfd.c needs swapops.h] Link: https://lkml.kernel.org/r/YmRlVj3cdizYJsr0@xz-m1.local Link: https://lkml.kernel.org/r/20220405014833.14015-1-peterx@redhat.comSigned-off-by: NPeter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
- 10 5月, 2022 1 次提交
-
-
由 NeilBrown 提交于
Patch series "MM changes to improve swap-over-NFS support". Assorted improvements for swap-via-filesystem. This is a resend of these patches, rebased on current HEAD. The only substantial changes is that swap_dirty_folio has replaced swap_set_page_dirty. Currently swap-via-fs (SWP_FS_OPS) doesn't work for any filesystem. It has previously worked for NFS but that broke a few releases back. This series changes to use a new ->swap_rw rather than ->readpage and ->direct_IO. It also makes other improvements. There is a companion series already in linux-next which fixes various issues with NFS. Once both series land, a final patch is needed which changes NFS over to use ->swap_rw. This patch (of 10): Many functions declared in include/linux/swap.h are only used within mm/ Create a new "mm/swap.h" and move some of these declarations there. Remove the redundant 'extern' from the function declarations. [akpm@linux-foundation.org: mm/memory-failure.c needs mm/swap.h] Link: https://lkml.kernel.org/r/164859751830.29473.5309689752169286816.stgit@noble.brown Link: https://lkml.kernel.org/r/164859778120.29473.11725907882296224053.stgit@noble.brownSigned-off-by: NNeilBrown <neilb@suse.de> Reviewed-by: NChristoph Hellwig <hch@lst.de> Tested-by: NDavid Howells <dhowells@redhat.com> Tested-by: NGeert Uytterhoeven <geert+renesas@glider.be> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
- 30 4月, 2022 1 次提交
-
-
由 Shakeel Butt 提交于
This patch series adds a memory.reclaim proactive reclaim interface. The rationale behind the interface and how it works are in the first patch. This patch (of 4): Introduce a memcg interface to trigger memory reclaim on a memory cgroup. Use case: Proactive Reclaim --------------------------- A userspace proactive reclaimer can continuously probe the memcg to reclaim a small amount of memory. This gives more accurate and up-to-date workingset estimation as the LRUs are continuously sorted and can potentially provide more deterministic memory overcommit behavior. The memory overcommit controller can provide more proactive response to the changing behavior of the running applications instead of being reactive. A userspace reclaimer's purpose in this case is not a complete replacement for kswapd or direct reclaim, it is to proactively identify memory savings opportunities and reclaim some amount of cold pages set by the policy to free up the memory for more demanding jobs or scheduling new jobs. A user space proactive reclaimer is used in Google data centers. Additionally, Meta's TMO paper recently referenced a very similar interface used for user space proactive reclaim: https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 Benefits of a user space reclaimer: ----------------------------------- 1) More flexible on who should be charged for the cpu of the memory reclaim. For proactive reclaim, it makes more sense to be centralized. 2) More flexible on dedicating the resources (like cpu). The memory overcommit controller can balance the cost between the cpu usage and the memory reclaimed. 3) Provides a way to the applications to keep their LRUs sorted, so, under memory pressure better reclaim candidates are selected. This also gives more accurate and uptodate notion of working set for an application. Why memory.high is not enough? ------------------------------ - memory.high can be used to trigger reclaim in a memcg and can potentially be used for proactive reclaim. However there is a big downside in using memory.high. It can potentially introduce high reclaim stalls in the target application as the allocations from the processes or the threads of the application can hit the temporary memory.high limit. - Userspace proactive reclaimers usually use feedback loops to decide how much memory to proactively reclaim from a workload. The metrics used for this are usually either refaults or PSI, and these metrics will become messy if the application gets throttled by hitting the high limit. - memory.high is a stateful interface, if the userspace proactive reclaimer crashes for any reason while triggering reclaim it can leave the application in a bad state. - If a workload is rapidly expanding, setting memory.high to proactively reclaim memory can result in actually reclaiming more memory than intended. The benefits of such interface and shortcomings of existing interface were further discussed in this RFC thread: https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ Interface: ---------- Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to trigger reclaim in the target memory cgroup. The interface is introduced as a nested-keyed file to allow for future optional arguments to be easily added to configure the behavior of reclaim. Possible Extensions: -------------------- - This interface can be extended with an additional parameter or flags to allow specifying one or more types of memory to reclaim from (e.g. file, anon, ..). - The interface can also be extended with a node mask to reclaim from specific nodes. This has use cases for reclaim-based demotion in memory tiering systens. - A similar per-node interface can also be added to support proactive reclaim and reclaim-based demotion in systems without memcg. - Add a timeout parameter to make it easier for user space to call the interface without worrying about being blocked for an undefined amount of time. For now, let's keep things simple by adding the basic functionality. [yosryahmed@google.com: worked on versions v2 onwards, refreshed to current master, updated commit message based on recent discussions and use cases] Link: https://lkml.kernel.org/r/20220425190040.2475377-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20220425190040.2475377-2-yosryahmed@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Co-developed-by: NYosry Ahmed <yosryahmed@google.com> Signed-off-by: NYosry Ahmed <yosryahmed@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NWei Xu <weixugc@google.com> Acked-by: NRoman Gushchin <roman.gushchin@linux.dev> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Thelen <gthelen@google.com> Cc: Chen Wandun <chenwandun@huawei.com> Cc: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
- 29 4月, 2022 8 次提交
-
-
由 Lu Jialin 提交于
There is no use for the private value, __OOM_TYPE and OOM notifier OOM_CONTROL. Therefore remove them to make the code clean. Link: https://lkml.kernel.org/r/20220421122755.40899-1-lujialin4@huawei.comSigned-off-by: NLu Jialin <lujialin4@huawei.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <roman.gushchin@linux.dev> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Lu Jialin 提交于
cgroup_memory_noswap is only used in mm/memcontrol.c, therefore just make it static, and remove export in include/linux/memcontrol.h Link: https://lkml.kernel.org/r/20220421124736.62180-1-lujialin4@huawei.comSigned-off-by: NLu Jialin <lujialin4@huawei.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <roman.gushchin@linux.dev> Reviewed-by: NMuchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Wei Yang 提交于
After commit bef8620c ("mm: memcg: deprecate the non-hierarchical mode"), we won't have a NULL parent except root_mem_cgroup. And this case is handled when (memcg == root). Link: https://lkml.kernel.org/r/20220403020833.26164-1-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NRoman Gushchin <roman.gushchin@linux.dev> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Wei Yang 提交于
For each round-trip, we assign generation on first invocation and compare it on subsequent invocations. Let's move them together to make it more self-explaining. Also this reduce a check on prev. [hannes@cmpxchg.org: better comment to explain reclaim model] Link: https://lkml.kernel.org/r/20220330234719.18340-4-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRoman Gushchin <roman.gushchin@linux.dev> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Wei Yang 提交于
During mem_cgroup_iter, there are two ways to get iteration position: reclaim vs non-reclaim mode. Let's do it explicitly for reclaim vs non-reclaim mode. Link: https://lkml.kernel.org/r/20220330234719.18340-3-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com> Reviewed-by: NRoman Gushchin <roman.gushchin@linux.dev> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Wei Yang 提交于
Patch series "mm/memcg: some cleanup for mem_cgroup_iter()", v2. No functional change, try to make it more readable. This patch (of 3): Instead of resetting memcg when css is either not verified or not got reference, we can set it after these process. No functional change, just simplified the code a little. Link: https://lkml.kernel.org/r/20220330234719.18340-1-richard.weiyang@gmail.com Link: https://lkml.kernel.org/r/20220330234719.18340-2-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRoman Gushchin <roman.gushchin@linux.dev> Cc: Michal Hocko <mhocko@suse.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Wei Yang 提交于
When mz is not NULL, it means mz can either come from mem_cgroup_largest_soft_limit_node or __mem_cgroup_largest_soft_limit_node. And both of them have removed this node by __mem_cgroup_remove_exceeded(). Not necessary to call __mem_cgroup_remove_exceeded() again. [mhocko@suse.com: refine changelog] Link: https://lkml.kernel.org/r/20220314233030.12334-1-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Miaohe Lin 提交于
The local variable nr_scanned is unneeded as mem_cgroup_soft_reclaim always does *total_scanned += nr_scanned. So we can pass total_scanned directly to the mem_cgroup_soft_reclaim to simplify the code and save some cpu cycles of adding nr_scanned to total_scanned. Link: https://lkml.kernel.org/r/20220328114144.53389-1-linmiaohe@huawei.comSigned-off-by: NMiaohe Lin <linmiaohe@huawei.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NRoman Gushchin <roman.gushchin@linux.dev> Reviewed-by: NWei Yang <richard.weiyang@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
- 22 4月, 2022 1 次提交
-
-
由 Shakeel Butt 提交于
Daniel Dao has reported [1] a regression on workloads that may trigger a lot of refaults (anon and file). The underlying issue is that flushing rstat is expensive. Although rstat flush are batched with (nr_cpus * MEMCG_BATCH) stat updates, it seems like there are workloads which genuinely do stat updates larger than batch value within short amount of time. Since the rstat flush can happen in the performance critical codepaths like page faults, such workload can suffer greatly. This patch fixes this regression by making the rstat flushing conditional in the performance critical codepaths. More specifically, the kernel relies on the async periodic rstat flusher to flush the stats and only if the periodic flusher is delayed by more than twice the amount of its normal time window then the kernel allows rstat flushing from the performance critical codepaths. Now the question: what are the side-effects of this change? The worst that can happen is the refault codepath will see 4sec old lruvec stats and may cause false (or missed) activations of the refaulted page which may under-or-overestimate the workingset size. Though that is not very concerning as the kernel can already miss or do false activations. There are two more codepaths whose flushing behavior is not changed by this patch and we may need to come to them in future. One is the writeback stats used by dirty throttling and second is the deactivation heuristic in the reclaim. For now keeping an eye on them and if there is report of regression due to these codepaths, we will reevaluate then. Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1] Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com Fixes: 1f828223 ("memcg: flush lruvec stats in the refault") Signed-off-by: NShakeel Butt <shakeelb@google.com> Reported-by: NDaniel Dao <dqminh@cloudflare.com> Tested-by: NIvan Babrou <ivan@cloudflare.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Frank Hofmann <fhofmann@cloudflare.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 3月, 2022 21 次提交
-
-
由 Wei Yang 提交于
alloc_mem_cgroup_per_node_info is allocated for each possible node and this used to be a problem because !node_online nodes didn't have appropriate data structure allocated. This has changed by "mm: handle uninitialized numa nodes gracefully" so we can drop the special casing here. Link: https://lkml.kernel.org/r/20220127085305.20890-7-mhocko@kernel.orgSigned-off-by: NWei Yang <richard.weiyang@gmail.com> Signed-off-by: NMichal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Alexey Makhalov <amakhalov@vmware.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Nico Pache <npache@redhat.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rafael Aquini <raquini@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
The idr_alloc() does not include @max ID. So in the current implementation, the maximum memcg ID is 65534 instead of 65535. It seems a bug. So fix this. Link: https://lkml.kernel.org/r/20220228122126.37293-15-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
There are two idrs being used by memory cgroup, one is for kmem ID, another is for memory cgroup ID. The maximum ID of both is 64Ki. Both of them can limit the total number of memory cgroups. Actually, we can reuse memory cgroup ID for kmem ID to simplify the code. Link: https://lkml.kernel.org/r/20220228122126.37293-14-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
If we run 10k containers in the system, the size of the list_lru_memcg->lrus can be ~96KB per list_lru. When we decrease the number containers, the size of the array will not be shrinked. It is not scalable. The xarray is a good choice for this case. We can save a lot of memory when there are tens of thousands continers in the system. If we use xarray, we also can remove the logic code of resizing array, which can simplify the code. [akpm@linux-foundation.org: remove unused local] Link: https://lkml.kernel.org/r/20220228122126.37293-13-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
The purpose of the memcg_drain_all_list_lrus() is list_lrus reparenting. It is very similar to memcg_reparent_objcgs(). Rename it to memcg_reparent_list_lrus() so that the name can more consistent with memcg_reparent_objcgs(). Link: https://lkml.kernel.org/r/20220228122126.37293-12-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
In our server, we found a suspected memory leak problem. The kmalloc-32 consumes more than 6GB of memory. Other kmem_caches consume less than 2GB memory. After our in-depth analysis, the memory consumption of kmalloc-32 slab cache is the cause of list_lru_one allocation. crash> p memcg_nr_cache_ids memcg_nr_cache_ids = $2 = 24574 memcg_nr_cache_ids is very large and memory consumption of each list_lru can be calculated with the following formula. num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32) There are 4 numa nodes in our system, so each list_lru consumes ~3MB. crash> list super_blocks | wc -l 952 Every mount will register 2 list lrus, one is for inode, another is for dentry. There are 952 super_blocks. So the total memory is 952 * 2 * 3 MB (~5.6GB). But the number of memory cgroup is less than 500. So I guess more than 12286 containers have been deployed on this machine (I do not know why there are so many containers, it may be a user's bug or the user really want to do that). And memcg_nr_cache_ids has not been reduced to a suitable value. This can waste a lot of memory. Now the infrastructure for dynamic list_lru_one allocation is ready, so remove statically allocated memory code to save memory. Link: https://lkml.kernel.org/r/20220228122126.37293-11-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
It will simplify the code if moving memcg_online_kmem() to mem_cgroup_css_online() and do not need to set ->kmemcg_id to -1 to indicate the memcg is offline. In the next patch, ->kmemcg_id will be used to sync list lru reparenting which requires not to change ->kmemcg_id. Link: https://lkml.kernel.org/r/20220228122126.37293-10-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NRoman Gushchin <roman.gushchin@linux.dev> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
We currently allocate scope for every memcg to be able to tracked on every superblock instantiated in the system, regardless of whether that superblock is even accessible to that memcg. These huge memcg counts come from container hosts where memcgs are confined to just a small subset of the total number of superblocks that instantiated at any given point in time. For these systems with huge container counts, list_lru does not need the capability of tracking every memcg on every superblock. What it comes down to is that adding the memcg to the list_lru at the first insert. So introduce kmem_cache_alloc_lru to allocate objects and its list_lru. In the later patch, we will convert all inode and dentry allocation from kmem_cache_alloc to kmem_cache_alloc_lru. Link: https://lkml.kernel.org/r/20220228122126.37293-3-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Before the for-each-CPU loop, preemption is disabled so that so that drain_local_stock() can be invoked directly instead of scheduling a worker. Ensuring that drain_local_stock() completed on the local CPU is not correctness problem. It _could_ be that the charging path will be forced to reclaim memory because cached charges are still waiting for their draining. Disabling preemption before invoking drain_local_stock() is problematic on PREEMPT_RT due to the sleeping locks involved. To ensure that no CPU migrations happens across for_each_online_cpu() it is enouhg to use migrate_disable() which disables migration and keeps context preemptible to a sleeping lock can be acquired. A race with CPU hotplug is not a problem because pcp data is not going away. In the worst case we just schedule draining of an empty stock. Use migrate_disable() instead of get_cpu() around the for_each_online_cpu() loop. Link: https://lkml.kernel.org/r/20220226204144.1008339-7-bigeasy@linutronix.deSigned-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: kernel test robot <oliver.sang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Waiman Long <longman@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
The members of the per-CPU structure memcg_stock_pcp are protected by disabling interrupts. This is not working on PREEMPT_RT because it creates atomic context in which actions are performed which require preemptible context. One example is obj_cgroup_release(). The IRQ-disable sections can be replaced with local_lock_t which preserves the explicit disabling of interrupts while keeps the code preemptible on PREEMPT_RT. drain_obj_stock() drops a reference on obj_cgroup which leads to an invocat= ion of obj_cgroup_release() if it is the last object. This in turn leads to recursive locking of the local_lock_t. To avoid this, obj_cgroup_release() = is invoked outside of the locked section. obj_cgroup_uncharge_pages() can be invoked with the local_lock_t acquired a= nd without it. This will lead later to a recursion in refill_stock(). To avoid the locking recursion provide obj_cgroup_uncharge_pages_locked() which uses the locked version of refill_stock(). - Replace disabling interrupts for memcg_stock with a local_lock_t. - Let drain_obj_stock() return the old struct obj_cgroup which is passed to obj_cgroup_put() outside of the locked section. - Provide obj_cgroup_uncharge_pages_locked() which uses the locked version of refill_stock() to avoid recursive locking in drain_obj_stock(). Link: https://lkml.kernel.org/r/20220209014709.GA26885@xsang-OptiPlex-9020 Link: https://lkml.kernel.org/r/20220226204144.1008339-6-bigeasy@linutronix.deSigned-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Reported-by: Nkernel test robot <oliver.sang@intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Waiman Long <longman@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Provide the inner part of refill_stock() as __refill_stock() without disabling interrupts. This eases the integration of local_lock_t where recursive locking must be avoided. Open code obj_cgroup_uncharge_pages() in drain_obj_stock() and use __refill_stock(). The caller of drain_obj_stock() already disables interrupts. [bigeasy@linutronix.de: patch body around Johannes' diff] Link: https://lkml.kernel.org/r/20220226204144.1008339-5-bigeasy@linutronix.deSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NRoman Gushchin <guro@fb.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Waiman Long <longman@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
The per-CPU counter are modified with the non-atomic modifier. The consistency is ensured by disabling interrupts for the update. On non PREEMPT_RT configuration this works because acquiring a spinlock_t typed lock with the _irq() suffix disables interrupts. On PREEMPT_RT configurations the RMW operation can be interrupted. Another problem is that mem_cgroup_swapout() expects to be invoked with disabled interrupts because the caller has to acquire a spinlock_t which is acquired with disabled interrupts. Since spinlock_t never disables interrupts on PREEMPT_RT the interrupts are never disabled at this point. The code is never called from in_irq() context on PREEMPT_RT therefore disabling preemption during the update is sufficient on PREEMPT_RT. The sections which explicitly disable interrupts can remain on PREEMPT_RT because the sections remain short and they don't involve sleeping locks (memcg_check_events() is doing nothing on PREEMPT_RT). Disable preemption during update of the per-CPU variables which do not explicitly disable interrupts. Link: https://lkml.kernel.org/r/20220226204144.1008339-4-bigeasy@linutronix.deSigned-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: NRoman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: kernel test robot <oliver.sang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Waiman Long <longman@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
During the integration of PREEMPT_RT support, the code flow around memcg_check_events() resulted in `twisted code'. Moving the code around and avoiding then would then lead to an additional local-irq-save section within memcg_check_events(). While looking better, it adds a local-irq-save section to code flow which is usually within an local-irq-off block on non-PREEMPT_RT configurations. The threshold event handler is a deprecated memcg v1 feature. Instead of trying to get it to work under PREEMPT_RT just disable it. There should be no users on PREEMPT_RT. From that perspective it makes even less sense to get it to work under PREEMPT_RT while having zero users. Make memory.soft_limit_in_bytes and cgroup.event_control return -EOPNOTSUPP on PREEMPT_RT. Make an empty memcg_check_events() and memcg_write_event_control() which return only -EOPNOTSUPP on PREEMPT_RT. Document that the two knobs are disabled on PREEMPT_RT. Link: https://lkml.kernel.org/r/20220226204144.1008339-3-bigeasy@linutronix.deSuggested-by: NMichal Hocko <mhocko@kernel.org> Suggested-by: NMichal Koutný <mkoutny@suse.com> Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: NRoman Gushchin <guro@fb.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Waiman Long <longman@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Patch series "mm/memcg: Address PREEMPT_RT problems instead of disabling it", v5. This series aims to address the memcg related problem on PREEMPT_RT. I tested them on CONFIG_PREEMPT and CONFIG_PREEMPT_RT with the tools/testing/selftests/cgroup/* tests and I haven't observed any regressions (other than the lockdep report that is already there). This patch (of 6): The optimisation is based on a micro benchmark where local_irq_save() is more expensive than a preempt_disable(). There is no evidence that it is visible in a real-world workload and there are CPUs where the opposite is true (local_irq_save() is cheaper than preempt_disable()). Based on micro benchmarks, the optimisation makes sense on PREEMPT_NONE where preempt_disable() is optimized away. There is no improvement with PREEMPT_DYNAMIC since the preemption counter is always available. The optimization makes also the PREEMPT_RT integration more complicated since most of the assumption are not true on PREEMPT_RT. Revert the optimisation since it complicates the PREEMPT_RT integration and the improvement is hardly visible. [bigeasy@linutronix.de: patch body around Michal's diff] Link: https://lkml.kernel.org/r/20220226204144.1008339-1-bigeasy@linutronix.de Link: https://lore.kernel.org/all/YgOGkXXCrD%2F1k+p4@dhcp22.suse.cz Link: https://lkml.kernel.org/r/YdX+INO9gQje6d0S@linutronix.de Link: https://lkml.kernel.org/r/20220226204144.1008339-2-bigeasy@linutronix.deSigned-off-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: NRoman Gushchin <guro@fb.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Waiman Long <longman@redhat.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Randy Dunlap 提交于
__setup() handlers should return 1 if the command line option is handled and 0 if not (or maybe never return 0; it just pollutes init's environment). The only reason that this particular __setup handler does not pollute init's environment is that the setup string contains a '.', as in "cgroup.memory". This causes init/main.c::unknown_boottoption() to consider it to be an "Unused module parameter" and ignore it. (This is for parsing of loadable module parameters any time after kernel init.) Otherwise the string "cgroup.memory=whatever" would be added to init's environment strings. Instead of relying on this '.' quirk, just return 1 to indicate that the boot option has been handled. Note that there is no warning message if someone enters: cgroup.memory=anything_invalid Link: https://lkml.kernel.org/r/20220222005811.10672-1-rdunlap@infradead.org Fixes: f7e1cb6e ("mm: memcontrol: account socket memory in unified hierarchy memory controller") Signed-off-by: NRandy Dunlap <rdunlap@infradead.org> Reported-by: NIgor Zhbanov <i.zhbanov@omprussia.ru> Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru Reviewed-by: NMichal Koutný <mkoutny@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Shakeel Butt 提交于
The high limit is used to throttle the workload without invoking the oom-killer. Recently we tried to use the high limit to right size our internal workloads. More specifically dynamically adjusting the limits of the workload without letting the workload get oom-killed. However due to the limitation of the implementation of high limit enforcement, we observed the mechanism fails for some real workloads. The high limit is enforced on return-to-userspace i.e. the kernel let the usage goes over the limit and when the execution returns to userspace, the high reclaim is triggered and the process can get throttled as well. However this mechanism fails for workloads which do large allocations in a single kernel entry e.g. applications that mlock() a large chunk of memory in a single syscall. Such applications bypass the high limit and can trigger the oom-killer. To make high limit enforcement more robust, this patch makes the limit enforcement synchronous only if the accumulated overcharge becomes larger than MEMCG_CHARGE_BATCH. So, most of the allocations would still be throttled on the return-to-userspace path but only the extreme allocations which accumulates large amount of overcharge without returning to the userspace will be throttled synchronously. The value MEMCG_CHARGE_BATCH is a bit arbitrary but most of other places in the memcg codebase uses this constant therefore for now uses the same one. Link: https://lkml.kernel.org/r/20220211064917.2028469-5-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NRoman Gushchin <guro@fb.com> Acked-by: NChris Down <chris@chrisdown.name> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Shakeel Butt 提交于
Currently the kernel force charges the allocations which have __GFP_HIGH flag without triggering the memory reclaim. __GFP_HIGH indicates that the caller is high priority and since commit 869712fd ("mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges") the kernel lets such allocations do force charging. Please note that __GFP_ATOMIC has been replaced by __GFP_HIGH. __GFP_HIGH does not tell if the caller can block or can trigger reclaim. There are separate checks to determine that. So, there is no need to skip reclaiming for __GFP_HIGH allocations. So, handle __GFP_HIGH together with __GFP_NOFAIL which also does force charging. Please note that this is a noop change as there are no __GFP_HIGH allocators in the kernel which also have __GFP_ACCOUNT (or SLAB_ACCOUNT) and does not allow reclaim for now. Link: https://lkml.kernel.org/r/20220211064917.2028469-3-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NRoman Gushchin <guro@fb.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Chris Down <chris@chrisdown.name> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Shakeel Butt 提交于
Patch series "memcg: robust enforcement of memory.high", v2. Due to the semantics of memory.high enforcement i.e. throttle the workload without oom-kill, we are trying to use it for right sizing the workloads in our production environment. However we observed the mechanism fails for some specific applications which does big chunck of allocations in a single syscall. The reason behind this failure is due to the limitation of the memory.high enforcement's current implementation. This patch series solves this issue by enforcing the memory.high synchronously if the current process has accumulated a large amount of high overcharge. This patch (of 4): The function mem_cgroup_oom returns enum which has four possible values but the caller does not care about such values and only cares if the return value is OOM_SUCCESS or not. So, remove the enum altogether and make mem_cgroup_oom returns a simple bool. Link: https://lkml.kernel.org/r/20220211064917.2028469-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20220211064917.2028469-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NRoman Gushchin <guro@fb.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Chris Down <chris@chrisdown.name> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yang 提交于
kzalloc_node() would set data to 0, so it's not necessary to set it again. Link: https://lkml.kernel.org/r/20220201004643.8391-1-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com> Reviewed-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NMike Rapoport <rppt@linux.ibm.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yosry Ahmed 提交于
Currently memcg stats show several types of kernel memory: kernel stack, page tables, sock, vmalloc, and slab. However, there are other allocations with __GFP_ACCOUNT (or supersets such as GFP_KERNEL_ACCOUNT) that are not accounted in any of those stats, a few examples are: - various kvm allocations (e.g. allocated pages to create vcpus) - io_uring - tmp_page in pipes during pipe_write() - bpf ringbuffers - unix sockets Keeping track of the total kernel memory is essential for the ease of migration from cgroup v1 to v2 as there are large discrepancies between v1's kmem.usage_in_bytes and the sum of the available kernel memory stats in v2. Adding separate memcg stats for all __GFP_ACCOUNT kernel allocations is an impractical maintenance burden as there a lot of those all over the kernel code, with more use cases likely to show up in the future. Therefore, add a "kernel" memcg stat that is analogous to kmem page counter, with added benefits such as using rstat infrastructure which aggregates stats more efficiently. Additionally, this provides a lighter alternative in case the legacy kmem is deprecated in the future [yosryahmed@google.com: v2] Link: https://lkml.kernel.org/r/20220203193856.972500-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20220201200823.3283171-1-yosryahmed@google.comSigned-off-by: NYosry Ahmed <yosryahmed@google.com> Acked-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Shakeel Butt 提交于
Replace the deprecated in_interrupt() with !in_task() because in_interrupt() returns true for BH disabled even if the call happens in the task context. in_task() is the right interface to differentiate task context from NMI, hard IRQ and softirq contexts. Link: https://lkml.kernel.org/r/20220127162636.3461256-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-