- 07 4月, 2022 2 次提交
-
-
由 Lu Jialin 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4X0YD?from=project-issue CVE: NA -------- Export "memory.events" and "memory.events.local" from cgroupv2 to cgroupv1. There are some differences between v2 and v1: 1)events of MEMCG_OOM_GROUP_KILL is not included in cgroupv1. Because, there is no member of memory.oom.group. 2)events of MEMCG_MAX is represented with "limit_in_bytes" in cgroupv1 instead of memory.max 3)event of oom_kill is include in memory.oom_control. make oom_kill include its descendants' events and add oom_kill_local include its oom_kill event only. Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Liu Shixin 提交于
hulk inclusion category: feature bugzilla: 186182, https://gitee.com/openeuler/kernel/issues/I4UOJI CVE: NA -------------------------------- Support to print rootfs files and tmpfs files that having pages charged in given memory cgroup. The files infomations can be printed through interface "memory.memfs_files_info" or printed when OOM is triggered. In order not to flush memory logs, we limit the maximum number of files to be printed when oom through interface "max_print_files_in_oom". And in order to filter out small files, we limit the minimum size of files that can be printed through interface "size_threshold". Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 20 3月, 2022 1 次提交
-
-
由 Liu Shixin 提交于
hulk inclusion category: bugfix bugzilla: 46904 https://gitee.com/openeuler/kernel/issues/I4Y0XO -------------------------------- When all processes in the memory cgroup are finished, some memory may still be occupied such as file cache. Use mem_cgroup_force_empty to reclaim these pages that charged in the memory cgroup before merging all pages. Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 19 1月, 2022 5 次提交
-
-
由 Liu Shixin 提交于
hulk inclusion category: feature bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG CVE: NA -------------------------------- Add new interface "dhugetlb.disable_normal_pages" to disable the allocation of normal pages from a hpool. This makes dynamic hugetlb more flexible. Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Liu Shixin 提交于
hulk inclusion category: feature bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG CVE: NA -------------------------------- Add function to alloc page from dhugetlb_pool. When process is bound to a mem_cgroup configured with dhugtlb_pool, alloc page from dhugetlb_pool firstly. If there is no page in dhugetlb_pool, fallback to alloc page from buddy system. As the process will alloc pages from dhugetlb_pool in the mem_cgroup, process is not allowed to migrate to other mem_cgroup. Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Liu Shixin 提交于
hulk inclusion category: feature bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG CVE: NA -------------------------------- Add two interfaces in mem_cgroup to configure the count of 1G/2M hugepages in dhugetlb_pool. Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Liu Shixin 提交于
hulk inclusion category: feature bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG CVE: NA -------------------------------- Dynamic hugetlb is a self-developed feature based on the hugetlb and memcontrol. It supports to split huge page dynamically in a memory cgroup. There is a new structure dhugetlb_pool in every mem_cgroup to manage the pages configured to the mem_cgroup. For the mem_cgroup configured with dhugetlb_pool, processes in the mem_cgroup will preferentially use the pages in dhugetlb_pool. Dynamic hugetlb supports three types of pages, including 1G/2M huge pages and 4K pages. For the mem_cgroup configured with dhugetlb_pool, processes will be limited to alloc 1G/2M huge pages only from dhugetlb_pool. But there is no such constraint for 4K pages. If there are insufficient 4K pages in the dhugetlb_pool, pages can also be allocated from buddy system. So before using dynamic hugetlb, user must know how many huge pages they need. Usage: 1. Add 'dynamic_hugetlb=on' in cmdline to enable dynamic hugetlb feature. 2. Prealloc some 1G hugepages through hugetlb. 3. Create a mem_cgroup and configure dhugetlb_pool to mem_cgroup. 4. Configure the count of 1G/2M hugepages, and the remaining pages in dhugetlb_pool will be used as basic pages. 5. Bound a process to mem_cgroup. then the memory for it will be allocated from dhugetlb_pool. This patch add the corresponding structure dhugetlb_pool for dynamic hugetlb feature, the interface 'dhugetlb.nr_pages' in mem_cgroup to configure dhugetlb_pool and the cmdline 'dynamic_hugetlb=on' to enable dynamic hugetlb feature. Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Liu Shixin 提交于
hulk inclusion category: feature bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG CVE: NA -------------------------------- There are several functions that will be used in next patches for dynamic hugetlb feature. Declare them. No functional changes. Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 07 1月, 2022 4 次提交
-
-
由 Lu Jialin 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK?from=project-issue CVE: NA -------- This patch adds a default-false static key to disable memcg kswapd feature. User can enable by set memcg_kswapd in cmdline. Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: Nweiyang wang <wangweiyang2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Lu Jialin 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK?from=project-issue CVE: NA -------- Since memory.high reclaim is sync whether is in interrupt, it could do more work than direct reclaim, i.e. write out dirty page, etc. So, add PF_KSWAPD flag, so that current_is_kswapd() would return true for memcg kswapd. Memcg kswapd should stop when usage of memcg fit the memcg kswapd stop flag. When the userland sets the memcg->memory.max, the stop_flag is (memcg->memory.high - memcg->memory.max * 10 / 1000), which is similar with global kswapd. Otherwise, the stop_flag is (memcg->memory.high - memcg->memory.high / 6), which is similar with most difference between watermark_low and watermark_high. And, memcg kswapd should not break memory.low protection for now. Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Nweiyang wang <wangweiyang2@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Lu Jialin 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK?from=project-issue CVE: NA -------- Export memory.high from cgroupv2 to cgroupv1. Therefore, when the usage of the memcg is larger than memory.high, some pages will be reclaimed before return to userland, which will throttle the process. Only export memory.high number in mem_cgroup_legacy_files and move related functions in front of mem_cgroup_legacy_files. There is no need to other changes. Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: Nweiyang wang <wangweiyang2@huawei.com> Reviewed-by: Nweiyang wang <wangweiyang2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Lu Jialin 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK?from=project-issue CVE: NA -------- Export memcg.min and memcg.low from cgroupv2 to cgroupv1, in order to reduce the negtive impact between cgroups when the system memory is insufficient. Only export memory.{min/low} numbers in mem_cgroup_legacy_files and move related functions in front of mem_cgroup_legacy_files. There is no need to other changes. Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: Nweiyang wang <wangweiyang2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 30 12月, 2021 1 次提交
-
-
由 Weilong Chen 提交于
ascend inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4K2U5 CVE: NA ------------------------------------------------- Support disable oom-killer, and report oom events to bbox vm.enable_oom_killer: 0: disable oom killer 1: enable oom killer (default,compatible with mainline) Signed-off-by: NWeilong Chen <chenweilong@huawei.com> Signed-off-by: NZhang Jian <zhangjian210@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 06 12月, 2021 1 次提交
-
-
由 Vasily Averin 提交于
stable inclusion from stable-5.10.80 commit 74293225f50391620aaef3507ebd6fd17e0003e1 bugzilla: 185821 https://gitee.com/openeuler/kernel/issues/I4L7CG Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=74293225f50391620aaef3507ebd6fd17e0003e1 -------------------------------- commit a4ebf1b6 upstream. Memory cgroup charging allows killed or exiting tasks to exceed the hard limit. It is assumed that the amount of the memory charged by those tasks is bound and most of the memory will get released while the task is exiting. This is resembling a heuristic for the global OOM situation when tasks get access to memory reserves. There is no global memory shortage at the memcg level so the memcg heuristic is more relieved. The above assumption is overly optimistic though. E.g. vmalloc can scale to really large requests and the heuristic would allow that. We used to have an early break in the vmalloc allocator for killed tasks but this has been reverted by commit b8c8a338 ("Revert "vmalloc: back off when the current task is killed""). There are likely other similar code paths which do not check for fatal signals in an allocation&charge loop. Also there are some kernel objects charged to a memcg which are not bound to a process life time. It has been observed that it is not really hard to trigger these bypasses and cause global OOM situation. One potential way to address these runaways would be to limit the amount of excess (similar to the global OOM with limited oom reserves). This is certainly possible but it is not really clear how much of an excess is desirable and still protects from global OOMs as that would have to consider the overall memcg configuration. This patch is addressing the problem by removing the heuristic altogether. Bypass is only allowed for requests which either cannot fail or where the failure is not desirable while excess should be still limited (e.g. atomic requests). Implementation wise a killed or dying task fails to charge if it has passed the OOM killer stage. That should give all forms of reclaim chance to restore the limit before the failure (ENOMEM) and tell the caller to back off. In addition, this patch renames should_force_charge() helper to task_is_dying() because now its use is not associated witch forced charging. This patch depends on pagefault_out_of_memory() to not trigger out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM and cause a global OOM killer. Link: https://lkml.kernel.org/r/8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.comSigned-off-by: NVasily Averin <vvs@virtuozzo.com> Suggested-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Shakeel Butt <shakeelb@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NChen Jun <chenjun102@huawei.com> Reviewed-by: NWeilong Chen <chenweilong@huawei.com> Acked-by: NWeilong Chen <chenweilong@huawei.com> Signed-off-by: NChen Jun <chenjun102@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 30 11月, 2021 11 次提交
-
-
由 Shakeel Butt 提交于
mainline inclusion from mainline-v5.16-rc1 commit fd25a9e0 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fd25a9e0e23b --------------------------------------------------------------------- The memcg stats can be flushed in multiple context and potentially in parallel too. For example multiple parallel user space readers for memcg stats will contend on the rstat locks with each other. There is no need for that. We just need one flusher and everyone else can benefit. In addition after aa48e47e ("memcg: infrastructure to flush memcg stats") the kernel periodically flush the memcg stats from the root, so, the other flushers will potentially have much less work to do. Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Michal Koutný" <mkoutny@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Shakeel Butt 提交于
mainline inclusion from mainline-v5.16-rc1 commit 11192d9c category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11192d9c124d ---------------------------------------------------------------------- At the moment, the kernel flushes the memcg stats on every refault and also on every reclaim iteration. Although rstat maintains per-cpu update tree but on the flush the kernel still has to go through all the cpu rstat update tree to check if there is anything to flush. This patch adds the tracking on the stats update side to make flush side more clever by skipping the flush if there is no update. The stats update codepath is very sensitive performance wise for many workloads and benchmarks. So, we can not follow what the commit aa48e47e ("memcg: infrastructure to flush memcg stats") did which was triggering async flush through queue_work() and caused a lot performance regression reports. That got reverted by the commit 1f828223 ("memcg: flush lruvec stats in the refault"). In this patch we kept the stats update codepath very minimal and let the stats reader side to flush the stats only when the updates are over a specific threshold. For now the threshold is (nr_cpus * CHARGE_BATCH). To evaluate the impact of this patch, an 8 GiB tmpfs file is created on a system with swap-on-zram and the file was pushed to swap through memory.force_empty interface. On reading the whole file, the memcg stat flush in the refault code path is triggered. With this patch, we observed 63% reduction in the read time of 8 GiB file. Link: https://lkml.kernel.org/r/20211001190040.48086-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Reviewed-by: N"Michal Koutný" <mkoutny@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Shakeel Butt 提交于
mainline inclusion from mainline-v5.15-rc3 commit 1f828223 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f828223b799 ----------------------------------------------------------------------- Prior to the commit 7e1c0d6f ("memcg: switch lruvec stats to rstat") and the commit aa48e47e ("memcg: infrastructure to flush memcg stats"), each lruvec memcg stats can be off by (nr_cgroups * nr_cpus * 32) at worst and for unbounded amount of time. The commit aa48e47e moved the lruvec stats to rstat infrastructure and the commit 7e1c0d6f bounded the error for all the lruvec stats to (nr_cpus * 32) at worst for at most 2 seconds. More specifically it decoupled the number of stats and the number of cgroups from the error rate. However this reduction in error comes with the cost of triggering the slowpath of stats update more frequently. Previously in the slowpath the kernel adds the stats up the memcg tree. After aa48e47e, the kernel triggers the asyn lruvec stats flush through queue_work(). This causes regression reports from 0day kernel bot [1] as well as from phoronix test suite [2]. We tried two options to fix the regression: 1) Increase the threshold to trigger the slowpath in lruvec stats update codepath from 32 to 512. 2) Remove the slowpath from lruvec stats update codepath and instead flush the stats in the page refault codepath. The assumption is that the kernel timely flush the stats, so, the update tree would be small in the refault codepath to not cause the preformance impact. Following are the results of will-it-scale/page_fault[1|2|3] benchmark on four settings i.e. (1) 5.15-rc1 as baseline (2) 5.15-rc1 with aa48e47e and 7e1c0d6f reverted (3) 5.15-rc1 with option-1 (4) 5.15-rc1 with option-2. test (1) (2) (3) (4) pg_f1 368563 406277 (10.23%) 399693 (8.44%) 416398 (12.97%) pg_f2 338399 372133 (9.96%) 369180 (9.09%) 381024 (12.59%) pg_f3 500853 575399 (14.88%) 570388 (13.88%) 576083 (15.02%) From the above result, it seems like the option-2 not only solves the regression but also improves the performance for at least these benchmarks. Feng Tang (intel) ran the aim7 benchmark with these two options and confirms that option-1 reduces the regression but option-2 removes the regression. Michael Larabel (phoronix) ran multiple benchmarks with these options and reported the results at [3] and it shows for most benchmarks option-2 removes the regression introduced by the commit aa48e47e ("memcg: infrastructure to flush memcg stats"). Based on the experiment results, this patch proposed the option-2 as the solution to resolve the regression. Link: https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020 [1] Link: https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress [2] Link: https://openbenchmarking.org/result/2109226-DEBU-LINUX5104 [3] Fixes: aa48e47e ("memcg: infrastructure to flush memcg stats") Signed-off-by: NShakeel Butt <shakeelb@google.com> Tested-by: NMichael Larabel <Michael@phoronix.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hillf Danton <hdanton@sina.com>, Cc: Michal Koutný <mkoutny@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org>, Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Shakeel Butt 提交于
mainline inclusion from mainline-v5.15-rc1 commit aa48e47e category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=aa48e47e3906 ------------------------------------------------------ At the moment memcg stats are read in four contexts: 1. memcg stat user interfaces 2. dirty throttling 3. page fault 4. memory reclaim Currently the kernel flushes the stats for first two cases. Flushing the stats for remaining two casese may have performance impact. Always flushing the memcg stats on the page fault code path may negatively impacts the performance of the applications. In addition flushing in the memory reclaim code path, though treated as slowpath, can become the source of contention for the global lock taken for stat flushing because when system or memcg is under memory pressure, many tasks may enter the reclaim path. This patch uses following mechanisms to solve these challenges: 1. Periodically flush the stats from root memcg every 2 seconds. This will time limit the out of sync stats. 2. Asynchronously flush the stats after fixed number of stat updates. In the worst case the stat can be out of sync by O(nr_cpus * BATCH) for 2 seconds. 3. For avoiding thundering herd to flush the stats particularly from the memory reclaim context, introduce memcg local spinlock and let only one flusher active at a time. This could have been done through cgroup_rstat_lock lock but that lock is used by other subsystem and for userspace reading memcg stats. So, it is better to keep flushers introduced by this patch decoupled from cgroup_rstat_lock. However we would have to use irqsafe version of rstat flush but that is fine as this code path will be flushing for whole tree and do the work for everyone. No one will be waiting for that worker. [shakeelb@google.com: fix sleep-in-wrong context bug] Link: https://lkml.kernel.org/r/20210716212137.1391164-2-shakeelb@google.com Link: https://lkml.kernel.org/r/20210714013948.270662-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Tested-by: NMarek Szyprowski <m.szyprowski@samsung.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflict: include/linux/memcontrol.h Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Shakeel Butt 提交于
mainline inclusion from mainline-v5.15-rc1 commit 7e1c0d6f category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7e1c0d6f58207 ------------------------------------------------------------------------- The commit 2d146aa3 ("mm: memcontrol: switch to rstat") switched memcg stats to rstat infrastructure but skipped the conversion of the lruvec stats as such stats are read in the performance critical code paths and flushing stats may have impacted the performances of the applications. This patch converts the lruvec stats to rstat and later patches add mechanisms to keep the performance impact to minimum. The rstat conversion comes with the price i.e. memory cost. Effectively this patch reverts the savings done by the commit f3344adf ("mm: memcontrol: optimize per-lruvec stats counter memory usage"). However this cost is justified due to negative impact of the inaccurate lruvec stats on many heuristics. One such case is reported in [1]. The memory reclaim code is filled with plethora of heuristics and many of those heuristics reads the lruvec stats. So, inaccurate stats can make such heuristics ineffective. [1] reports the impact of inaccurate lruvec stats on the "cache trim mode" heuristic. Inaccurate lruvec stats can impact the deactivation and aging anon heuristics as well. [1] https://lore.kernel.org/linux-mm/20210311004449.1170308-1-ying.huang@intel.com/ Link: https://lkml.kernel.org/r/20210716212137.1391164-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20210714013948.270662-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Koutný <mkoutny@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Johannes Weiner 提交于
mainline inclusion from mainline-v5.14-rc4 commit 30def935 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=30def93565e5b ------------------------------------------------------------------------ Dan Carpenter reports: The patch 2d146aa3: "mm: memcontrol: switch to rstat" from Apr 29, 2021, leads to the following static checker warning: kernel/cgroup/rstat.c:200 cgroup_rstat_flush() warn: sleeping in atomic context mm/memcontrol.c 3572 static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) 3573 { 3574 unsigned long val; 3575 3576 if (mem_cgroup_is_root(memcg)) { 3577 cgroup_rstat_flush(memcg->css.cgroup); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This is from static analysis and potentially a false positive. The problem is that mem_cgroup_usage() is called from __mem_cgroup_threshold() which holds an rcu_read_lock(). And the cgroup_rstat_flush() function can sleep. 3578 val = memcg_page_state(memcg, NR_FILE_PAGES) + 3579 memcg_page_state(memcg, NR_ANON_MAPPED); 3580 if (swap) 3581 val += memcg_page_state(memcg, MEMCG_SWAP); 3582 } else { 3583 if (!swap) 3584 val = page_counter_read(&memcg->memory); 3585 else 3586 val = page_counter_read(&memcg->memsw); 3587 } 3588 return val; 3589 } __mem_cgroup_threshold() indeed holds the rcu lock. In addition, the thresholding code is invoked during stat changes, and those contexts have irqs disabled as well. If the lock breaking occurs inside the flush function, it will result in a sleep from an atomic context. Use the irqsafe flushing variant in mem_cgroup_usage() to fix this. Link: https://lkml.kernel.org/r/20210726150019.251820-1-hannes@cmpxchg.org Fixes: 2d146aa3 ("mm: memcontrol: switch to rstat") Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reported-by: NDan Carpenter <dan.carpenter@oracle.com> Acked-by: NChris Down <chris@chrisdown.name> Reviewed-by: NRik van Riel <riel@surriel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Johannes Weiner 提交于
mainline inclusion from mainline-v5.13-rc1 commit 2cd21c89 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cd21c89800c ----------------------------------------------------------------------- There are two functions to flush the per-cpu data of an lruvec into the rest of the cgroup tree: when the cgroup is being freed, and when a CPU disappears during hotplug. The difference is whether all CPUs or just one is being collected, but the rest of the flushing code is the same. Merge them into one function and share the common code. Link: https://lkml.kernel.org/r/20210209163304.77088-8-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NRoman Gushchin <guro@fb.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Johannes Weiner 提交于
mainline inclusion from mainline-v5.13-rc1 commit 2d146aa3 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2d146aa3aa84 ---------------------------------------------------------------------- Replace the memory controller's custom hierarchical stats code with the generic rstat infrastructure provided by the cgroup core. The current implementation does batched upward propagation from the write side (i.e. as stats change). The per-cpu batches introduce an error, which is multiplied by the number of subgroups in a tree. In systems with many CPUs and sizable cgroup trees, the error can be large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32 subgroups results in an error of up to 128M per stat item). This can entirely swallow allocation bursts inside a workload that the user is expecting to see reflected in the statistics. In the past, we've done read-side aggregation, where a memory.stat read would have to walk the entire subtree and add up per-cpu counts. This became problematic with lazily-freed cgroups: we could have large subtrees where most cgroups were entirely idle. Hence the switch to change-driven upward propagation. Unfortunately, it needed to trade accuracy for speed due to the write side being so hot. Rstat combines the best of both worlds: from the write side, it cheaply maintains a queue of cgroups that have pending changes, so that the read side can do selective tree aggregation. This way the reported stats will always be precise and recent as can be, while the aggregation can skip over potentially large numbers of idle cgroups. The way rstat works is that it implements a tree for tracking cgroups with pending local changes, as well as a flush function that walks the tree upwards. The controller then drives this by 1) telling rstat when a local cgroup stat changes (e.g. mod_memcg_state) and 2) when a flush is required to get uptodate hierarchy stats for a given subtree (e.g. when memory.stat is read). The controller also provides a flush callback that is called during the rstat flush walk for each cgroup and aggregates its local per-cpu counters and propagates them upwards. This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT + NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward aggregation. It removes 3 words from the per-cpu data. It eliminates memcg_exact_page_state(), since memcg_page_state() is now exact. [akpm@linux-foundation.org: merge fix] [hannes@cmpxchg.org: fix a sleep in atomic section problem] Link: https://lkml.kernel.org/r/20210315234100.64307-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20210209163304.77088-7-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRoman Gushchin <guro@fb.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NMichal Koutný <mkoutny@suse.com> Acked-by: NBalbir Singh <bsingharora@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflict: include/linux/memcontrol.h Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Johannes Weiner 提交于
mainline inclusion from mainline-v5.13-rc1 commit a18e6e6e category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a18e6e6e150a ------------------------------------------------------------------- There are no users outside of the memory controller itself. The rest of the kernel cares either about node or lruvec stats. Link: https://lkml.kernel.org/r/20210209163304.77088-4-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NRoman Gushchin <guro@fb.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NMichal Koutný <mkoutny@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflict: include/linux/memcontrol.h Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Johannes Weiner 提交于
mainline inclusion from mainline-v5.13-rc1 commit a3747b53 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a3747b53b1771 ---------------------------------------------------- No need to encapsulate a simple struct member access. Link: https://lkml.kernel.org/r/20210209163304.77088-3-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NRoman Gushchin <guro@fb.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NMichal Koutný <mkoutny@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflict: mm/memcontrol.c Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Johannes Weiner 提交于
mainline inclusion from mainline-5.13-rc1 commit a3d4c05a category: feature bugzilla:185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a3d4c05a447486 --------------------------------------------------- Patch series "mm: memcontrol: switch to rstat", v3. This series converts memcg stats tracking to the streamlined rstat infrastructure provided by the cgroup core code. rstat is already used by the CPU controller and the IO controller. This change is motivated by recent accuracy problems in memcg's custom stats code, as well as the benefits of sharing common infra with other controllers. The current memcg implementation does batched tree aggregation on the write side: local stat changes are cached in per-cpu counters, which are then propagated upward in batches when a threshold (32 pages) is exceeded. This is cheap, but the error introduced by the lazy upward propagation adds up: 32 pages times CPUs times cgroups in the subtree. We've had complaints from service owners that the stats do not reliably track and react to allocation behavior as expected, sometimes swallowing the results of entire test applications. The original memcg stat implementation used to do tree aggregation exclusively on the read side: local stats would only ever be tracked in per-cpu counters, and a memory.stat read would iterate the entire subtree and sum those counters up. This didn't keep up with the times: - Cgroup trees are much bigger now. We switched to lazily-freed cgroups, where deleted groups would hang around until their remaining page cache has been reclaimed. This can result in large subtrees that are expensive to walk, while most of the groups are idle and their statistics don't change much anymore. - Automated monitoring increased. With the proliferation of userspace oom killing, proactive reclaim, and higher-resolution logging of workload trends in general, top-level stat files are polled at least once a second in many deployments. - The lifetime of cgroups got shorter. Where most cgroup setups in the past would have a few large policy-oriented cgroups for everything running on the system, newer cgroup deployments tend to create one group per application - which gets deleted again as the processes exit. An aggregation scheme that doesn't retain child data inside the parents loses event history of the subtree. Rstat addresses all three of those concerns through intelligent, persistent read-side aggregation. As statistics change at the local level, rstat tracks - on a per-cpu basis - only those parts of a subtree that have changes pending and require aggregation. The actual aggregation occurs on the colder read side - which can now skip over (potentially large) numbers of recently idle cgroups. === The test_kmem cgroup selftest is currently failing due to excessive cumulative vmstat drift from 100 subgroups: ok 1 test_kmem_basic memory.current = 8810496 slab + anon + file + kernel_stack = 17074568 slab = 6101384 anon = 946176 file = 0 kernel_stack = 10027008 not ok 2 test_kmem_memcg_deletion ok 3 test_kmem_proc_kpagecgroup ok 4 test_kmem_kernel_stacks ok 5 test_kmem_dead_cgroups ok 6 test_percpu_basic As you can see, memory.stat items far exceed memory.current. The kernel stack alone is bigger than all of charged memory. That's because the memory of the test has been uncharged from memory.current, but the negative vmstat deltas are still sitting in the percpu caches. The test at this time isn't even counting percpu, pagetables etc. yet, which would further contribute to the error. The last patch in the series updates the test to include them - as well as reduces the vmstat tolerances in general to only expect page_counter batching. With all patches applied, the (now more stringent) test succeeds: ok 1 test_kmem_basic ok 2 test_kmem_memcg_deletion ok 3 test_kmem_proc_kpagecgroup ok 4 test_kmem_kernel_stacks ok 5 test_kmem_dead_cgroups ok 6 test_percpu_basic === A kernel build test confirms that overhead is comparable. Two kernels are built simultaneously in a nested tree with several idle siblings: root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16) `- build-b (defconfig, make -j16) `- idle-1 `- ... `- idle-9 During the builds, kernelbuild/memory.stat is read once a second. A perf diff shows that the changes in cycle distribution is minimal. Top 10 kernel symbols: 0.09% +0.08% [kernel.kallsyms] [k] __mod_memcg_lruvec_state 0.00% +0.06% [kernel.kallsyms] [k] cgroup_rstat_updated 0.08% -0.05% [kernel.kallsyms] [k] __mod_memcg_state.part.0 0.16% -0.04% [kernel.kallsyms] [k] release_pages 0.00% +0.03% [kernel.kallsyms] [k] __count_memcg_events 0.01% +0.03% [kernel.kallsyms] [k] mem_cgroup_charge_statistics.constprop.0 0.10% -0.02% [kernel.kallsyms] [k] get_mem_cgroup_from_mm 0.05% -0.02% [kernel.kallsyms] [k] mem_cgroup_update_lru_size 0.57% +0.01% [kernel.kallsyms] [k] asm_exc_page_fault === The on-demand aggregated stats are now fully accurate: $ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \ grep -e inactive_file /sys/fs/cgroup/memory.stat vanilla: patched: nr_inactive_file 1574105088 nr_inactive_file 1027801088 inactive_file 1577410560 inactive_file 1027801088 === This patch (of 8): The memcg hotunplug callback erroneously flushes counts on the local CPU, not the counts of the CPU going away; those counts will be lost. Flush the CPU that is actually going away. Also simplify the code a bit by using mod_memcg_state() and count_memcg_events() instead of open-coding the upward flush - this is comparable to how vmstat.c handles hotunplug flushing. Link: https://lkml.kernel.org/r/20210209163304.77088-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20210209163304.77088-2-hannes@cmpxchg.org Fixes: a983b5eb ("mm: memcontrol: fix excessive complexity in memory.stat reporting") Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NMichal Koutný <mkoutny@suse.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 15 11月, 2021 1 次提交
-
-
由 Baolin Wang 提交于
mainline inclusion from mainline-5.15-rc1 commit 37bc3cb9 category: bugfix bugzilla: 180690 https://gitee.com/openeuler/kernel/issues/I4DDEL Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=37bc3cb9bbef86d1ddbbc789e55b588c8a2cac26 -------------------------------- Since commit c843966c ("mm: allow swappiness that prefers reclaiming anon over the file workingset") has expended the swappiness value to make swap to be preferred in some systems. We should also change the memcg swappiness restriction to allow memcg swap-preferred. Link: https://lkml.kernel.org/r/d77469b90c45c49953ccbc51e54a1d465bc18f70.1627626255.git.baolin.wang@linux.alibaba.com Fixes: c843966c ("mm: allow swappiness that prefers reclaiming anon over the file workingset") Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/memcontrol.c Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: NChen Jun <chenjun102@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 30 10月, 2021 7 次提交
-
-
由 Muchun Song 提交于
mainline inclusion from mainline-v5.13-rc1 commit f1286fae category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4C0GB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f1286fae540697e0b4713a8262f4aab5cf65f1c5 ---------------------------------------------------------------------- There is only one user of __memcg_kmem_charge(), so manually inline __memcg_kmem_charge() to obj_cgroup_charge_pages(). Similarly manually inline __memcg_kmem_uncharge() into obj_cgroup_uncharge_pages() and call obj_cgroup_uncharge_pages() in obj_cgroup_release(). This is just code cleanup without any functionality changes. Link: https://lkml.kernel.org/r/20210319163821.20704-7-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/memcontrol.c Signed-off-by: NChen Huang <chenhuang5@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Muchun Song 提交于
mainline inclusion from mainline-v5.13-rc1 commit b4e0b68f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4C0GB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b4e0b68fbd9d1fd7e31cbe8adca3ad6cf556e2ee ---------------------------------------------------------------------- Since Roman's series "The new cgroup slab memory controller" applied. All slab objects are charged via the new APIs of obj_cgroup. The new APIs introduce a struct obj_cgroup to charge slab objects. It prevents long-living objects from pinning the original memory cgroup in the memory. But there are still some corner objects (e.g. allocations larger than order-1 page on SLUB) which are not charged via the new APIs. Those objects (include the pages which are allocated from buddy allocator directly) are charged as kmem pages which still hold a reference to the memory cgroup. We want to reuse the obj_cgroup APIs to charge the kmem pages. If we do that, we should store an object cgroup pointer to page->memcg_data for the kmem pages. Finally, page->memcg_data will have 3 different meanings. 1) For the slab pages, page->memcg_data points to an object cgroups vector. 2) For the kmem pages (exclude the slab pages), page->memcg_data points to an object cgroup. 3) For the user pages (e.g. the LRU pages), page->memcg_data points to a memory cgroup. We do not change the behavior of page_memcg() and page_memcg_rcu(). They are also suitable for LRU pages and kmem pages. Why? Because memory allocations pinning memcgs for a long time - it exists at a larger scale and is causing recurring problems in the real world: page cache doesn't get reclaimed for a long time, or is used by the second, third, fourth, ... instance of the same job that was restarted into a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory, and make page reclaim very inefficient. We can convert LRU pages and most other raw memcg pins to the objcg direction to fix this problem, and then the page->memcg will always point to an object cgroup pointer. At that time, LRU pages and kmem pages will be treated the same. The implementation of page_memcg() will remove the kmem page check. This patch aims to charge the kmem pages by using the new APIs of obj_cgroup. Finally, the page->memcg_data of the kmem page points to an object cgroup. We can use the __page_objcg() to get the object cgroup associated with a kmem page. Or we can use page_memcg() to get the memory cgroup associated with a kmem page, but caller must ensure that the returned memcg won't be released (e.g. acquire the rcu_read_lock or css_set_lock). Link: https://lkml.kernel.org/r/20210401030141.37061-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210319163821.20704-6-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> [songmuchun@bytedance.com: fix forget to obtain the ref to objcg in split_page_memcg] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: include/linux/memcontrol.h mm/memcontrol.c Signed-off-by: NChen Huang <chenhuang5@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Muchun Song 提交于
mainline inclusion from mainline-v5.13-rc1 commit 7ab345a8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4C0GB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7ab345a8973017c89a1be87b6c8722d1fee1fd95 ---------------------------------------------------------------------- Just like assignment to ug->memcg, we only need to update ug->dummy_page if memcg changed. So move it to there. This is a very small optimization. Link: https://lkml.kernel.org/r/20210319163821.20704-5-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChen Huang <chenhuang5@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Muchun Song 提交于
mainline inclusion from mainline-v5.13-rc1 commit e74d2259 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4C0GB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e74d225910ec3a9999f06934afa068b6a30babf8 ---------------------------------------------------------------------- We know that the unit of slab object charging is bytes, the unit of kmem page charging is PAGE_SIZE. If we want to reuse obj_cgroup APIs to charge the kmem pages, we should pass PAGE_SIZE (as third parameter) to obj_cgroup_charge(). Because the size is already PAGE_SIZE, we can skip touch the objcg stock. And obj_cgroup_{un}charge_pages() are introduced to charge in units of page level. In the latter patch, we also can reuse those two helpers to charge or uncharge a number of kernel pages to a object cgroup. This is just a code movement without any functional changes. Link: https://lkml.kernel.org/r/20210319163821.20704-3-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NRoman Gushchin <guro@fb.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChen Huang <chenhuang5@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Roman Gushchin 提交于
mainline inclusion from mainline-v5.11-rc1 commit 18b2db3b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4C0GB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18b2db3b0385226b71cb3288474fa5a6e4a45474 ---------------------------------------------------------------------- PageKmemcg flag is currently defined as a page type (like buddy, offline, table and guard). Semantically it means that the page was accounted as a kernel memory by the page allocator and has to be uncharged on the release. As a side effect of defining the flag as a page type, the accounted page can't be mapped to userspace (look at page_has_type() and comments above). In particular, this blocks the accounting of vmalloc-backed memory used by some bpf maps, because these maps do map the memory to userspace. One option is to fix it by complicating the access to page->mapcount, which provides some free bits for page->page_type. But it's way better to move this flag into page->memcg_data flags. Indeed, the flag makes no sense without enabled memory cgroups and memory cgroup pointer set in particular. This commit replaces PageKmemcg() and __SetPageKmemcg() with PageMemcgKmem() and an open-coded OR operation setting the memcg pointer with the MEMCG_DATA_KMEM bit. __ClearPageKmemcg() can be simple deleted, as the whole memcg_data is zeroed at once. As a bonus, on !CONFIG_MEMCG build the PageMemcgKmem() check will be compiled out. Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Link: https://lkml.kernel.org/r/20201027001657.3398190-5-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-5-guro@fb.com Conflicts: mm/memcontrol.c Signed-off-by: NChen Huang <chenhuang5@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Roman Gushchin 提交于
mainline inclusion from mainline-v5.11-rc1 commit 270c6a71 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4C0GB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=270c6a71460e12b07b1dcadf7457ff95b6c6e8f4 ---------------------------------------------------------------------- To gather all direct accesses to struct page's memcg_data field in one place, let's introduce 3 new helpers to use in the slab accounting code: struct obj_cgroup **page_objcgs(struct page *page); struct obj_cgroup **page_objcgs_check(struct page *page); bool set_page_objcgs(struct page *page, struct obj_cgroup **objcgs); They are similar to the corresponding API for generic pages, except that the setter can return false, indicating that the value has been already set from a different thread. Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20201027001657.3398190-3-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-3-guro@fb.comSigned-off-by: NChen Huang <chenhuang5@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Roman Gushchin 提交于
mainline inclusion from mainline-v5.11-rc1 commit bcfe06bf category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4C0GB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bcfe06bf2622f7c4899468e427683aec49070687 ---------------------------------------------------------------------- Patch series "mm: allow mapping accounted kernel pages to userspace", v6. Currently a non-slab kernel page which has been charged to a memory cgroup can't be mapped to userspace. The underlying reason is simple: PageKmemcg flag is defined as a page type (like buddy, offline, etc), so it takes a bit from a page->mapped counter. Pages with a type set can't be mapped to userspace. But in general the kmemcg flag has nothing to do with mapping to userspace. It only means that the page has been accounted by the page allocator, so it has to be properly uncharged on release. Some bpf maps are mapping the vmalloc-based memory to userspace, and their memory can't be accounted because of this implementation detail. This patchset removes this limitation by moving the PageKmemcg flag into one of the free bits of the page->mem_cgroup pointer. Also it formalizes accesses to the page->mem_cgroup and page->obj_cgroups using new helpers, adds several checks and removes a couple of obsolete functions. As the result the code became more robust with fewer open-coded bit tricks. This patch (of 4): Currently there are many open-coded reads of the page->mem_cgroup pointer, as well as a couple of read helpers, which are barely used. It creates an obstacle on a way to reuse some bits of the pointer for storing additional bits of information. In fact, we already do this for slab pages, where the last bit indicates that a pointer has an attached vector of objcg pointers instead of a regular memcg pointer. This commits uses 2 existing helpers and introduces a new helper to converts all read sides to calls of these helpers: struct mem_cgroup *page_memcg(struct page *page); struct mem_cgroup *page_memcg_rcu(struct page *page); struct mem_cgroup *page_memcg_check(struct page *page); page_memcg_check() is intended to be used in cases when the page can be a slab page and have a memcg pointer pointing at objcg vector. It does check the lowest bit, and if set, returns NULL. page_memcg() contains a VM_BUG_ON_PAGE() check for the page not being a slab page. To make sure nobody uses a direct access, struct page's mem_cgroup/obj_cgroups is converted to unsigned long memcg_data. Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com Conflicts: mm/memcontrol.c Signed-off-by: NChen Huang <chenhuang5@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 13 10月, 2021 1 次提交
-
-
由 Waiman Long 提交于
stable inclusion from stable-5.10.50 commit 5458985533ba8860d6f0b08304ae90b1d7ec970a bugzilla: 174522 https://gitee.com/openeuler/kernel/issues/I4DNFY Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=5458985533ba8860d6f0b08304ae90b1d7ec970a -------------------------------- [ Upstream commit 41eb5df1 ] Patch series "mm: memcg/slab: Fix objcg pointer array handling problem", v4. Since the merging of the new slab memory controller in v5.9, the page structure stores a pointer to objcg pointer array for slab pages. When the slab has no used objects, it can be freed in free_slab() which will call kfree() to free the objcg pointer array in memcg_alloc_page_obj_cgroups(). If it happens that the objcg pointer array is the last used object in its slab, that slab may then be freed which may caused kfree() to be called again. With the right workload, the slab cache may be set up in a way that allows the recursive kfree() calling loop to nest deep enough to cause a kernel stack overflow and panic the system. In fact, we have a reproducer that can cause kernel stack overflow on a s390 system involving kmalloc-rcl-256 and kmalloc-rcl-128 slabs with the following kfree() loop recursively called 74 times: [ 285.520739] [<000000000ec432fc>] kfree+0x4bc/0x560 [ 285.520740] [<000000000ec43466>] __free_slab+0xc6/0x228 [ 285.520741] [<000000000ec41fc2>] __slab_free+0x3c2/0x3e0 [ 285.520742] [<000000000ec432fc>] kfree+0x4bc/0x560 : While investigating this issue, I also found an issue on the allocation side. If the objcg pointer array happen to come from the same slab or a circular dependency linkage is formed with multiple slabs, those affected slabs can never be freed again. This patch series addresses these two issues by introducing a new set of kmalloc-cg-<n> caches split from kmalloc-<n> caches. The new set will only contain non-reclaimable and non-dma objects that are accounted in memory cgroups whereas the old set are now for unaccounted objects only. By making this split, all the objcg pointer arrays will come from the kmalloc-<n> caches, but those caches will never hold any objcg pointer array. As a result, deeply nested kfree() call and the unfreeable slab problems are now gone. This patch (of 4): Since the merging of the new slab memory controller in v5.9, the page structure may store a pointer to obj_cgroup pointer array for slab pages. Currently, only the __GFP_ACCOUNT bit is masked off. However, the array is not readily reclaimable and doesn't need to come from the DMA buffer. So those GFP bits should be masked off as well. Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure that it is consistently applied no matter where it is called. Link: https://lkml.kernel.org/r/20210505200610.13943-1-longman@redhat.com Link: https://lkml.kernel.org/r/20210505200610.13943-2-longman@redhat.com Fixes: 286e04b8 ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages") Signed-off-by: NWaiman Long <longman@redhat.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NChen Jun <chenjun102@huawei.com> Acked-by: NWeilong Chen <chenweilong@huawei.com> Signed-off-by: NChen Jun <chenjun102@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 26 9月, 2021 3 次提交
-
-
由 Yang Shi 提交于
mainline inclusion from mainline-v5.13-rc1 commit a178015c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H CVE: NA ------------------------------------------------- Now shrinker's nr_deferred is per memcg for memcg aware shrinkers, add to parent's corresponding nr_deferred when memcg offline. Link: https://lkml.kernel.org/r/20210311190845.9708-13-shy828301@gmail.comSigned-off-by: NYang Shi <shy828301@gmail.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NKirill Tkhai <ktkhai@virtuozzo.com> Acked-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChen Wandun <chenwandun@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Yang Shi 提交于
mainline inclusion from mainline-v5.13-rc1 commit e4262c4f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H CVE: NA ------------------------------------------------- The following patch is going to add nr_deferred into shrinker_map, the change will make shrinker_map not only include map anymore, so rename it to "memcg_shrinker_info". And this should make the patch adding nr_deferred cleaner and readable and make review easier. Also remove the "memcg_" prefix. Link: https://lkml.kernel.org/r/20210311190845.9708-7-shy828301@gmail.comSigned-off-by: NYang Shi <shy828301@gmail.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NKirill Tkhai <ktkhai@virtuozzo.com> Acked-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChen Wandun <chenwandun@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Yang Shi 提交于
mainline inclusion from mainline-v5.13-rc1 commit 2bfd3637 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H CVE: NA ------------------------------------------------- The shrinker map management is not purely memcg specific, it is at the intersection between memory cgroup and shrinkers. It's allocation and assignment of a structure, and the only memcg bit is the map is being stored in a memcg structure. So move the shrinker_maps handling code into vmscan.c for tighter integration with shrinker code, and remove the "memcg_" prefix. There is no functional change. Link: https://lkml.kernel.org/r/20210311190845.9708-3-shy828301@gmail.comSigned-off-by: NYang Shi <shy828301@gmail.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NKirill Tkhai <ktkhai@virtuozzo.com> Acked-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/memcontrol.c Signed-off-by: NChen Wandun <chenwandun@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 14 7月, 2021 3 次提交
-
-
由 Muchun Song 提交于
mainline inclusion from mainline-5.12-rc1 commit f3344adf category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZVOM CVE: NA ------------------------------------------------- The vmstat threshold is 32 (MEMCG_CHARGE_BATCH), Actually the threshold can be as big as MEMCG_CHARGE_BATCH * PAGE_SIZE. It still fits into s32. So introduce struct batched_lruvec_stat to optimize memory usage. The size of struct lruvec_stat is 304 bytes on 64 bit systems. As it is a per-cpu structure. So with this patch, we can save 304 / 2 * ncpu bytes per-memcg per-node where ncpu is the number of the possible CPU. If there are c memory cgroup (include dying cgroup) and n NUMA node in the system. Finally, we can save (152 * ncpu * c * n) bytes. [akpm@linux-foundation.org: fix typo in comment] Link: https://lkml.kernel.org/r/20201210042121.39665-1-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <guro@fb.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Chris Down <chris@chrisdown.name> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChengyang Fan <cy.fan@huawei.com> Reviewed-by: chenwandun<chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Alex Shi 提交于
mainline inclusion from mainline-v5.11-rc1 commit 6168d0da category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZF7C?from=project-issue CVE: NA -------------------------------------- This patch moves per node lru_lock into lruvec, thus bring a lru_lock for each of memcg per node. So on a large machine, each of memcg don't have to suffer from per node pgdat->lru_lock competition. They could go fast with their self lru_lock. After move memcg charge before lru inserting, page isolation could serialize page's memcg, then per memcg lruvec lock is stable and could replace per node lru lock. In isolate_migratepages_block(), compact_unlock_should_abort and lock_page_lruvec_irqsave are open coded to work with compact_control. Also add a debug func in locking which may give some clues if there are sth out of hands. Daniel Jordan's testing show 62% improvement on modified readtwice case on his 2P * 10 core * 2 HT broadwell box. https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/ Hugh Dickins helped on the patch polish, thanks! [alex.shi@linux.alibaba.com: fix comment typo] Link: https://lkml.kernel.org/r/5b085715-292a-4b43-50b3-d73dc90d1de5@linux.alibaba.com [alex.shi@linux.alibaba.com: use page_memcg()] Link: https://lkml.kernel.org/r/5a4c2b72-7ee8-2478-fc0e-85eb83aafec4@linux.alibaba.com Link: https://lkml.kernel.org/r/1604566549-62481-18-git-send-email-alex.shi@linux.alibaba.comSigned-off-by: NAlex Shi <alex.shi@linux.alibaba.com> Acked-by: NHugh Dickins <hughd@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Rong Chen <rong.a.chen@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttilä <mika.penttila@nextfour.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/memcontrol.c Signed-off-by: NJing Xiangfeng <jingxiangfeng@huawei.com> Reviewed-by: Nchenwandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Alex Shi 提交于
mainline inclusion from mainline-v5.11-rc1 commit 20ad50d6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZF7C?from=project-issue CVE: NA -------------------------------------- Add a debug checking in lock_page_memcg, then we could get alarm if anything wrong here. Link: https://lkml.kernel.org/r/1604566549-62481-9-git-send-email-alex.shi@linux.alibaba.comSuggested-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com> Acked-by: NHugh Dickins <hughd@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "Chen, Rong A" <rong.a.chen@intel.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttilä <mika.penttila@nextfour.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NJing Xiangfeng <jingxiangfeng@huawei.com> Reviewed-by: Nchenwandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-