提交 · 4c0bc9e5dee72dec5e2181e2e408fbbefe7751a3 · openeuler / Kernel

31 1月, 2023 1 次提交

mm/vmpressure: fix data-race with memcg->socket_pressure · bbc0f30d

由 Yuanzheng Song 提交于 1月 31, 2023

mainline inclusion
from mainline-v5.16-rc1
commit 7e6ec49c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6AW65
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7e6ec49c18988f1b8dab0677271dafde5f8d9a43

--------------------------------

When reading memcg->socket_pressure in mem_cgroup_under_socket_pressure()
and writing memcg->socket_pressure in vmpressure() at the same time, the
following data-race occurs:

  BUG: KCSAN: data-race in __sk_mem_reduce_allocated / vmpressure

  write to 0xffff8881286f4938 of 8 bytes by task 24550 on cpu 3:
   vmpressure+0x218/0x230 mm/vmpressure.c:307
   shrink_node_memcgs+0x2b9/0x410 mm/vmscan.c:2658
   shrink_node+0x9d2/0x11d0 mm/vmscan.c:2769
   shrink_zones+0x29f/0x470 mm/vmscan.c:2972
   do_try_to_free_pages+0x193/0x6e0 mm/vmscan.c:3027
   try_to_free_mem_cgroup_pages+0x1c0/0x3f0 mm/vmscan.c:3345
   reclaim_high mm/memcontrol.c:2440 [inline]
   mem_cgroup_handle_over_high+0x18b/0x4d0 mm/memcontrol.c:2624
   tracehook_notify_resume include/linux/tracehook.h:197 [inline]
   exit_to_user_mode_loop kernel/entry/common.c:164 [inline]
   exit_to_user_mode_prepare+0x110/0x170 kernel/entry/common.c:191
   syscall_exit_to_user_mode+0x16/0x30 kernel/entry/common.c:266
   ret_from_fork+0x15/0x30 arch/x86/entry/entry_64.S:289

  read to 0xffff8881286f4938 of 8 bytes by interrupt on cpu 1:
   mem_cgroup_under_socket_pressure include/linux/memcontrol.h:1483 [inline]
   sk_under_memory_pressure include/net/sock.h:1314 [inline]
   __sk_mem_reduce_allocated+0x1d2/0x270 net/core/sock.c:2696
   __sk_mem_reclaim+0x44/0x50 net/core/sock.c:2711
   sk_mem_reclaim include/net/sock.h:1490 [inline]
   ......
   net_rx_action+0x17a/0x480 net/core/dev.c:6864
   __do_softirq+0x12c/0x2af kernel/softirq.c:298
   run_ksoftirqd+0x13/0x20 kernel/softirq.c:653
   smpboot_thread_fn+0x33f/0x510 kernel/smpboot.c:165
   kthread+0x1fc/0x220 kernel/kthread.c:292
   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296

Fix it by using READ_ONCE() and WRITE_ONCE() to read and write
memcg->socket_pressure.

Link: https://lkml.kernel.org/r/20211025082843.671690-1-songyuanzheng@huawei.comSigned-off-by: NYuanzheng Song <songyuanzheng@huawei.com>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alex Shi <alexs@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NCai Xinchen <caixinchen1@huawei.com>
Reviewed-by: NWang Weiyang <wangweiyang2@huawei.com>
Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>

bbc0f30d

23 11月, 2022 1 次提交

cgroup: support cgroup writeback on cgroupv1 · 644547a9

由 Lu Jialin 提交于 11月 04, 2022

hulkl inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZG61

-------------------------------

In cgroupv1, cgroup writeback is not supproted for two problems:
1) Blkcg_css and memcg_css are mounted on different cgroup trees.
   Therefore, blkcg_css cannot be found according to a certain memcg_css.
2) Buffer I/O is worked by kthread, which is in the root_blkcg.
   Therefore, blkcg cannot limit wbps and wiops of buffer I/O.

We solve the two problems to support cgroup writeback on cgroupv1.
1) A memcg is attached to the blkcg_root css when the memcg was created.
2) We add a member "wb_blkio_ino" in mem_cgroup_legacy_files.
   User can attach a memcg to a cerntain blkcg through echo the file
   inode of the blkcg into the wb_blkio of the memcg.
3) inode_cgwb_enabled() return true when memcg and io are both mounted
   on cgroupv2 or both on cgroupv1.
4) Buffer I/O can find a blkcg according to its memcg.

Thus, a memcg can find a certain blkcg, and cgroup writeback can be
supported on cgroupv1.
Signed-off-by: NLu Jialin <lujialin4@huawei.com>

644547a9

09 8月, 2022 2 次提交

memcg: enable memcg async reclaim · 078057a1

由 Lu Jialin 提交于 8月 09, 2022

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK
CVE: NA

-------------------------------

Enable memcg async reclaim when memcg usage is larger than memory_high *
memcg->high_async_ratio / 10; if memcg usage is larger than memory_high *
(memcg->high_async_ratio - 1)/ 10, the reclaim pages is the diff of
memcg usage and memory_high * (memcg->high_async_ratio - 1)/ 10;
else reclaim pages is MEMCG_CHARGE_BATCH;
The default memcg->high_async_ratio is 0; when memcg->high_async_ratio
is 0, memcg async reclaim is disabled;

The situation when enable memcg async reclaim is
1) try_charge;
2) reset memory_high
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

078057a1

Revert "memcg: Add static key for memcg kswapd" · c544b5bd

由 Lu Jialin 提交于 8月 09, 2022

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK
CVE: NA

-------------------------------

This reverts commit 84355bcc.
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

c544b5bd

23 5月, 2022 1 次提交

mm: memcg: synchronize objcg lists with a dedicated spinlock · 9ba3dc33

由 Roman Gushchin 提交于 5月 23, 2022

stable inclusion
from stable-v5.10.102
commit 8c8385972ea96adeb9b678c9390beaa4d94c4aae
bugzilla: https://gitee.com/openeuler/kernel/issues/I567K6

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8c8385972ea96adeb9b678c9390beaa4d94c4aae

--------------------------------

commit 0764db9b upstream.

Alexander reported a circular lock dependency revealed by the mmap1 ltp
test:

  LOCKDEP_CIRCULAR (suite: ltp, case: mtest06 (mmap1))
          WARNING: possible circular locking dependency detected
          5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1 Not tainted
          ------------------------------------------------------
          mmap1/202299 is trying to acquire lock:
          00000001892c0188 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x4a/0xe0
          but task is already holding lock:
          00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
          which lock already depends on the new lock.
          the existing dependency chain (in reverse order) is:
          -> #1 (&sighand->siglock){-.-.}-{2:2}:
                 __lock_acquire+0x604/0xbd8
                 lock_acquire.part.0+0xe2/0x238
                 lock_acquire+0xb0/0x200
                 _raw_spin_lock_irqsave+0x6a/0xd8
                 __lock_task_sighand+0x90/0x190
                 cgroup_freeze_task+0x2e/0x90
                 cgroup_migrate_execute+0x11c/0x608
                 cgroup_update_dfl_csses+0x246/0x270
                 cgroup_subtree_control_write+0x238/0x518
                 kernfs_fop_write_iter+0x13e/0x1e0
                 new_sync_write+0x100/0x190
                 vfs_write+0x22c/0x2d8
                 ksys_write+0x6c/0xf8
                 __do_syscall+0x1da/0x208
                 system_call+0x82/0xb0
          -> #0 (css_set_lock){..-.}-{2:2}:
                 check_prev_add+0xe0/0xed8
                 validate_chain+0x736/0xb20
                 __lock_acquire+0x604/0xbd8
                 lock_acquire.part.0+0xe2/0x238
                 lock_acquire+0xb0/0x200
                 _raw_spin_lock_irqsave+0x6a/0xd8
                 obj_cgroup_release+0x4a/0xe0
                 percpu_ref_put_many.constprop.0+0x150/0x168
                 drain_obj_stock+0x94/0xe8
                 refill_obj_stock+0x94/0x278
                 obj_cgroup_charge+0x164/0x1d8
                 kmem_cache_alloc+0xac/0x528
                 __sigqueue_alloc+0x150/0x308
                 __send_signal+0x260/0x550
                 send_signal+0x7e/0x348
                 force_sig_info_to_task+0x104/0x180
                 force_sig_fault+0x48/0x58
                 __do_pgm_check+0x120/0x1f0
                 pgm_check_handler+0x11e/0x180
          other info that might help us debug this:
           Possible unsafe locking scenario:
                 CPU0                    CPU1
                 ----                    ----
            lock(&sighand->siglock);
                                         lock(css_set_lock);
                                         lock(&sighand->siglock);
            lock(css_set_lock);
           *** DEADLOCK ***
          2 locks held by mmap1/202299:
           #0: 00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
           #1: 00000001892ad560 (rcu_read_lock){....}-{1:2}, at: percpu_ref_put_many.constprop.0+0x0/0x168
          stack backtrace:
          CPU: 15 PID: 202299 Comm: mmap1 Not tainted 5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1
          Hardware name: IBM 3906 M04 704 (LPAR)
          Call Trace:
            dump_stack_lvl+0x76/0x98
            check_noncircular+0x136/0x158
            check_prev_add+0xe0/0xed8
            validate_chain+0x736/0xb20
            __lock_acquire+0x604/0xbd8
            lock_acquire.part.0+0xe2/0x238
            lock_acquire+0xb0/0x200
            _raw_spin_lock_irqsave+0x6a/0xd8
            obj_cgroup_release+0x4a/0xe0
            percpu_ref_put_many.constprop.0+0x150/0x168
            drain_obj_stock+0x94/0xe8
            refill_obj_stock+0x94/0x278
            obj_cgroup_charge+0x164/0x1d8
            kmem_cache_alloc+0xac/0x528
            __sigqueue_alloc+0x150/0x308
            __send_signal+0x260/0x550
            send_signal+0x7e/0x348
            force_sig_info_to_task+0x104/0x180
            force_sig_fault+0x48/0x58
            __do_pgm_check+0x120/0x1f0
            pgm_check_handler+0x11e/0x180
          INFO: lockdep is turned off.

In this example a slab allocation from __send_signal() caused a
refilling and draining of a percpu objcg stock, resulted in a releasing
of another non-related objcg.  Objcg release path requires taking the
css_set_lock, which is used to synchronize objcg lists.

This can create a circular dependency with the sighandler lock, which is
taken with the locked css_set_lock by the freezer code (to freeze a
task).

In general it seems that using css_set_lock to synchronize objcg lists
makes any slab allocations and deallocation with the locked css_set_lock
and any intervened locks risky.

To fix the problem and make the code more robust let's stop using
css_set_lock to synchronize objcg lists and use a new dedicated spinlock
instead.

Link: https://lkml.kernel.org/r/Yfm1IHmoGdyUR81T@carbon.dhcp.thefacebook.com
Fixes: bf4f0599 ("mm: memcg/slab: obj_cgroup API")
Signed-off-by: NRoman Gushchin <guro@fb.com>
Reported-by: NAlexander Egorenkov <egorenar@linux.ibm.com>
Tested-by: NAlexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: NWaiman Long <longman@redhat.com>
Acked-by: NTejun Heo <tj@kernel.org>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Reviewed-by: NJeremy Linton <jeremy.linton@arm.com>
Tested-by: NJeremy Linton <jeremy.linton@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYu Liao <liaoyu15@huawei.com>
Reviewed-by: NWei Li <liwei391@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

9ba3dc33

07 4月, 2022 2 次提交

memcg: Fix inconsistent oom event behavior for OOM_MEMCG_KILL · 07c863f7

由 Lu Jialin 提交于 4月 07, 2022

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4X0YD?from=project-issue
CVE: NA

--------

Since memory.event is fully supported in cgroupv1, the problem of inconsistent
oom event behavior for OOM_MEMCG_KILL occurs again.
We fix the problem by add a new condition to support the event adding
continue. Therefore, there are two condition:
1) memcg is not root memcg;
2) the memcg is root memcg and the event is OOM_MEMCG_KILL of cgroupv1
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

07c863f7

memcg: Export memory.events and memory.events.local from cgroupv2 to cgroupv1 · 8f39ae66

由 Lu Jialin 提交于 4月 07, 2022

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4X0YD?from=project-issue
CVE: NA

--------

Export "memory.events" and "memory.events.local" from cgroupv2 to
cgroupv1.

There are some differences between v2 and v1:

1)events of MEMCG_OOM_GROUP_KILL is not included in cgroupv1. Because,
there is no member of memory.oom.group.

2)events of MEMCG_MAX is represented with "limit_in_bytes" in cgroupv1 instead
of memory.max

3)event of oom_kill is include in memory.oom_control. make oom_kill include
its descendants' events and add oom_kill_local include its oom_kill event only.
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

8f39ae66

20 3月, 2022 1 次提交

mm/dynamic_hugetlb: use mem_cgroup_force_empty to reclaim pages · 94c749a3

由 Liu Shixin 提交于 3月 20, 2022

hulk inclusion
category: bugfix
bugzilla: 46904 https://gitee.com/openeuler/kernel/issues/I4Y0XO

--------------------------------

When all processes in the memory cgroup are finished, some memory may still be
occupied such as file cache. Use mem_cgroup_force_empty to reclaim these pages
that charged in the memory cgroup before merging all pages.
Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

94c749a3

19 1月, 2022 2 次提交

mm/dynamic_hugetlb: establish the dynamic hugetlb feature framework · a8a836a3

由 Liu Shixin 提交于 1月 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

Dynamic hugetlb is a self-developed feature based on the hugetlb and memcontrol.
It supports to split huge page dynamically in a memory cgroup. There is a new structure
dhugetlb_pool in every mem_cgroup to manage the pages configured to the mem_cgroup.
For the mem_cgroup configured with dhugetlb_pool, processes in the mem_cgroup will
preferentially use the pages in dhugetlb_pool.

Dynamic hugetlb supports three types of pages, including 1G/2M huge pages and 4K pages.
For the mem_cgroup configured with dhugetlb_pool, processes will be limited to alloc
1G/2M huge pages only from dhugetlb_pool. But there is no such constraint for 4K pages.
If there are insufficient 4K pages in the dhugetlb_pool, pages can also be allocated from
buddy system. So before using dynamic hugetlb, user must know how many huge pages they
need.

Usage:
1. Add 'dynamic_hugetlb=on' in cmdline to enable dynamic hugetlb feature.
2. Prealloc some 1G hugepages through hugetlb.
3. Create a mem_cgroup and configure dhugetlb_pool to mem_cgroup.
4. Configure the count of 1G/2M hugepages, and the remaining pages in dhugetlb_pool will
be used as basic pages.
5. Bound a process to mem_cgroup. then the memory for it will be allocated from dhugetlb_pool.

This patch add the corresponding structure dhugetlb_pool for dynamic hugetlb feature,
the interface 'dhugetlb.nr_pages' in mem_cgroup to configure dhugetlb_pool and the cmdline
'dynamic_hugetlb=on' to enable dynamic hugetlb feature.
Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

a8a836a3

mm: declare several functions · 98ecb3cd

由 Liu Shixin 提交于 1月 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

There are several functions that will be used in next patches for
dynamic hugetlb feature. Declare them.

No functional changes.
Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

98ecb3cd

07 1月, 2022 1 次提交

memcg: Add static key for memcg kswapd · 84355bcc

由 Lu Jialin 提交于 1月 07, 2022

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK?from=project-issue
CVE: NA

--------

This patch adds a default-false static key to disable memcg kswapd
feature. User can enable by set memcg_kswapd in cmdline.
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: Nweiyang wang <wangweiyang2@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

84355bcc

31 12月, 2021 1 次提交

kabi: reserve space for memcg related structures · 283fde80

由 Lu Jialin 提交于 12月 31, 2021

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GII8?from=project-issue
CVE: NA

--------

We reserve some fields beforehand for memcg related structures prone
to change, therefore, we can hot add/change features of memcg with this enhancement.

After reserving, normally cache does not matter as the reserved fields
are not accessed at all.

--------
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: Nweiyang wang <wangweiyang2@huawei.com>
Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

283fde80

30 11月, 2021 6 次提交

mm, memcg: remove unused functions · 20894dfe

由 Miaohe Lin 提交于 11月 30, 2021

mainline inclusion
from mainline-v5.15-rc1
commit bec49c06
category: feature
bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bec49c067c67

-----------------------------------------------------------------------

Since commit 2d146aa3 ("mm: memcontrol: switch to rstat"), last user
of memcg_stat_item_in_bytes() is gone.  And since commit fa40d1ee
("mm: vmscan: memcontrol: remove mem_cgroup_select_victim_node()"), only
the declaration of mem_cgroup_select_victim_node() is remained here.
Remove them.

Link: https://lkml.kernel.org/r/20210807082835.61281-2-linmiaohe@huawei.comSigned-off-by: NMiaohe Lin <linmiaohe@huawei.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Acked-by: NRoman Gushchin <guro@fb.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

20894dfe

memcg: infrastructure to flush memcg stats · 94bb1084

由 Shakeel Butt 提交于 11月 30, 2021

mainline inclusion
from mainline-v5.15-rc1
commit aa48e47e
category: feature
bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=aa48e47e3906

------------------------------------------------------

At the moment memcg stats are read in four contexts:

1. memcg stat user interfaces
2. dirty throttling
3. page fault
4. memory reclaim

Currently the kernel flushes the stats for first two cases.  Flushing the
stats for remaining two casese may have performance impact.  Always
flushing the memcg stats on the page fault code path may negatively
impacts the performance of the applications.  In addition flushing in the
memory reclaim code path, though treated as slowpath, can become the
source of contention for the global lock taken for stat flushing because
when system or memcg is under memory pressure, many tasks may enter the
reclaim path.

This patch uses following mechanisms to solve these challenges:

1. Periodically flush the stats from root memcg every 2 seconds.  This
   will time limit the out of sync stats.

2. Asynchronously flush the stats after fixed number of stat updates.
   In the worst case the stat can be out of sync by O(nr_cpus * BATCH) for
   2 seconds.

3. For avoiding thundering herd to flush the stats particularly from
   the memory reclaim context, introduce memcg local spinlock and let only
   one flusher active at a time.  This could have been done through
   cgroup_rstat_lock lock but that lock is used by other subsystem and for
   userspace reading memcg stats.  So, it is better to keep flushers
   introduced by this patch decoupled from cgroup_rstat_lock.  However we
   would have to use irqsafe version of rstat flush but that is fine as
   this code path will be flushing for whole tree and do the work for
   everyone.  No one will be waiting for that worker.

[shakeelb@google.com: fix sleep-in-wrong context bug]
  Link: https://lkml.kernel.org/r/20210716212137.1391164-2-shakeelb@google.com

Link: https://lkml.kernel.org/r/20210714013948.270662-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
Tested-by: NMarek Szyprowski <m.szyprowski@samsung.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Conflict:
	include/linux/memcontrol.h
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

94bb1084

memcg: switch lruvec stats to rstat · f43d14df

由 Shakeel Butt 提交于 11月 30, 2021

mainline inclusion
from mainline-v5.15-rc1
commit 7e1c0d6f
category: feature
bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7e1c0d6f58207

-------------------------------------------------------------------------

The commit 2d146aa3 ("mm: memcontrol: switch to rstat") switched memcg
stats to rstat infrastructure but skipped the conversion of the lruvec
stats as such stats are read in the performance critical code paths and
flushing stats may have impacted the performances of the applications.
This patch converts the lruvec stats to rstat and later patches add
mechanisms to keep the performance impact to minimum.

The rstat conversion comes with the price i.e.  memory cost.  Effectively
this patch reverts the savings done by the commit f3344adf ("mm:
memcontrol: optimize per-lruvec stats counter memory usage").  However
this cost is justified due to negative impact of the inaccurate lruvec
stats on many heuristics.  One such case is reported in [1].

The memory reclaim code is filled with plethora of heuristics and many of
those heuristics reads the lruvec stats.  So, inaccurate stats can make
such heuristics ineffective.  [1] reports the impact of inaccurate lruvec
stats on the "cache trim mode" heuristic.  Inaccurate lruvec stats can
impact the deactivation and aging anon heuristics as well.

[1] https://lore.kernel.org/linux-mm/20210311004449.1170308-1-ying.huang@intel.com/

Link: https://lkml.kernel.org/r/20210716212137.1391164-1-shakeelb@google.com
Link: https://lkml.kernel.org/r/20210714013948.270662-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

f43d14df

mm: memcontrol: switch to rstat · 0030a2ab

由 Johannes Weiner 提交于 11月 30, 2021

mainline inclusion
from mainline-v5.13-rc1
commit 2d146aa3
category: feature
bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2d146aa3aa84

----------------------------------------------------------------------

Replace the memory controller's custom hierarchical stats code with the
generic rstat infrastructure provided by the cgroup core.

The current implementation does batched upward propagation from the
write side (i.e.  as stats change).  The per-cpu batches introduce an
error, which is multiplied by the number of subgroups in a tree.  In
systems with many CPUs and sizable cgroup trees, the error can be large
enough to confuse users (e.g.  32 batch pages * 32 CPUs * 32 subgroups
results in an error of up to 128M per stat item).  This can entirely
swallow allocation bursts inside a workload that the user is expecting
to see reflected in the statistics.

In the past, we've done read-side aggregation, where a memory.stat read
would have to walk the entire subtree and add up per-cpu counts.  This
became problematic with lazily-freed cgroups: we could have large
subtrees where most cgroups were entirely idle.  Hence the switch to
change-driven upward propagation.  Unfortunately, it needed to trade
accuracy for speed due to the write side being so hot.

Rstat combines the best of both worlds: from the write side, it cheaply
maintains a queue of cgroups that have pending changes, so that the read
side can do selective tree aggregation.  This way the reported stats
will always be precise and recent as can be, while the aggregation can
skip over potentially large numbers of idle cgroups.

The way rstat works is that it implements a tree for tracking cgroups
with pending local changes, as well as a flush function that walks the
tree upwards.  The controller then drives this by 1) telling rstat when
a local cgroup stat changes (e.g.  mod_memcg_state) and 2) when a flush
is required to get uptodate hierarchy stats for a given subtree (e.g.
when memory.stat is read).  The controller also provides a flush
callback that is called during the rstat flush walk for each cgroup and
aggregates its local per-cpu counters and propagates them upwards.

This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
aggregation.  It removes 3 words from the per-cpu data.  It eliminates
memcg_exact_page_state(), since memcg_page_state() is now exact.

[akpm@linux-foundation.org: merge fix]
[hannes@cmpxchg.org: fix a sleep in atomic section problem]
  Link: https://lkml.kernel.org/r/20210315234100.64307-1-hannes@cmpxchg.org

Link: https://lkml.kernel.org/r/20210209163304.77088-7-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NRoman Gushchin <guro@fb.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Reviewed-by: NMichal Koutný <mkoutny@suse.com>
Acked-by: NBalbir Singh <bsingharora@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Conflict:
	include/linux/memcontrol.h
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

0030a2ab

mm: memcontrol: privatize memcg_page_state query functions · 06876391

由 Johannes Weiner 提交于 11月 30, 2021

mainline inclusion
from mainline-v5.13-rc1
commit a18e6e6e
category: feature
bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a18e6e6e150a

-------------------------------------------------------------------

There are no users outside of the memory controller itself. The rest
of the kernel cares either about node or lruvec stats.

Link: https://lkml.kernel.org/r/20210209163304.77088-4-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Reviewed-by: NRoman Gushchin <guro@fb.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NMichal Koutný <mkoutny@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Conflict:
	include/linux/memcontrol.h
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

06876391

mm: memcontrol: kill mem_cgroup_nodeinfo() · 82f0ea81

由 Johannes Weiner 提交于 11月 30, 2021

mainline inclusion
from mainline-v5.13-rc1
commit a3747b53
category: feature
bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a3747b53b1771

----------------------------------------------------

No need to encapsulate a simple struct member access.

Link: https://lkml.kernel.org/r/20210209163304.77088-3-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Reviewed-by: NRoman Gushchin <guro@fb.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NMichal Koutný <mkoutny@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Conflict:
	mm/memcontrol.c
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

82f0ea81

30 10月, 2021 6 次提交

mm: memcontrol: move PageMemcgKmem to the scope of CONFIG_MEMCG_KMEM · 6c27b037

由 Muchun Song 提交于 10月 30, 2021

mainline inclusion
from mainline-v5.13-rc1
commit bd290e1e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4C0GB
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bd290e1e75d8a8b2d87031b63db56ae165677870

----------------------------------------------------------------------

The page only can be marked as kmem when CONFIG_MEMCG_KMEM is enabled.
So move PageMemcgKmem() to the scope of the CONFIG_MEMCG_KMEM.

As a bonus, on !CONFIG_MEMCG_KMEM build some code can be compiled out.

Link: https://lkml.kernel.org/r/20210319163821.20704-8-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
Acked-by: NRoman Gushchin <guro@fb.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChen Huang <chenhuang5@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: NChen Wandun <chenwandun@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

6c27b037

mm: memcontrol: use obj_cgroup APIs to charge kmem pages · 63472377

由 Muchun Song 提交于 10月 30, 2021

mainline inclusion
from mainline-v5.13-rc1
commit b4e0b68f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4C0GB
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b4e0b68fbd9d1fd7e31cbe8adca3ad6cf556e2ee

----------------------------------------------------------------------

Since Roman's series "The new cgroup slab memory controller" applied.
All slab objects are charged via the new APIs of obj_cgroup.  The new
APIs introduce a struct obj_cgroup to charge slab objects.  It prevents
long-living objects from pinning the original memory cgroup in the
memory.  But there are still some corner objects (e.g.  allocations
larger than order-1 page on SLUB) which are not charged via the new
APIs.  Those objects (include the pages which are allocated from buddy
allocator directly) are charged as kmem pages which still hold a
reference to the memory cgroup.

We want to reuse the obj_cgroup APIs to charge the kmem pages.  If we do
that, we should store an object cgroup pointer to page->memcg_data for
the kmem pages.

Finally, page->memcg_data will have 3 different meanings.

  1) For the slab pages, page->memcg_data points to an object cgroups
     vector.

  2) For the kmem pages (exclude the slab pages), page->memcg_data
     points to an object cgroup.

  3) For the user pages (e.g. the LRU pages), page->memcg_data points
     to a memory cgroup.

We do not change the behavior of page_memcg() and page_memcg_rcu().  They
are also suitable for LRU pages and kmem pages.  Why?

Because memory allocations pinning memcgs for a long time - it exists at a
larger scale and is causing recurring problems in the real world: page
cache doesn't get reclaimed for a long time, or is used by the second,
third, fourth, ...  instance of the same job that was restarted into a new
cgroup every time.  Unreclaimable dying cgroups pile up, waste memory, and
make page reclaim very inefficient.

We can convert LRU pages and most other raw memcg pins to the objcg
direction to fix this problem, and then the page->memcg will always point
to an object cgroup pointer.  At that time, LRU pages and kmem pages will
be treated the same.  The implementation of page_memcg() will remove the
kmem page check.

This patch aims to charge the kmem pages by using the new APIs of
obj_cgroup.  Finally, the page->memcg_data of the kmem page points to an
object cgroup.  We can use the __page_objcg() to get the object cgroup
associated with a kmem page.  Or we can use page_memcg() to get the memory
cgroup associated with a kmem page, but caller must ensure that the
returned memcg won't be released (e.g.  acquire the rcu_read_lock or
css_set_lock).

  Link: https://lkml.kernel.org/r/20210401030141.37061-1-songmuchun@bytedance.com

Link: https://lkml.kernel.org/r/20210319163821.20704-6-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NRoman Gushchin <guro@fb.com>
Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
[songmuchun@bytedance.com: fix forget to obtain the ref to objcg in split_page_memcg]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

Conflicts:
	include/linux/memcontrol.h
	mm/memcontrol.c
Signed-off-by: NChen Huang <chenhuang5@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: NChen Wandun <chenwandun@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

63472377

mm: Convert page kmemcg type to a page memcg flag · 492cf0b0

由 Roman Gushchin 提交于 10月 30, 2021

mainline inclusion
from mainline-v5.11-rc1
commit 18b2db3b
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4C0GB
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18b2db3b0385226b71cb3288474fa5a6e4a45474

----------------------------------------------------------------------

PageKmemcg flag is currently defined as a page type (like buddy, offline,
table and guard).  Semantically it means that the page was accounted as a
kernel memory by the page allocator and has to be uncharged on the
release.

As a side effect of defining the flag as a page type, the accounted page
can't be mapped to userspace (look at page_has_type() and comments above).
In particular, this blocks the accounting of vmalloc-backed memory used
by some bpf maps, because these maps do map the memory to userspace.

One option is to fix it by complicating the access to page->mapcount,
which provides some free bits for page->page_type.

But it's way better to move this flag into page->memcg_data flags.
Indeed, the flag makes no sense without enabled memory cgroups and memory
cgroup pointer set in particular.

This commit replaces PageKmemcg() and __SetPageKmemcg() with
PageMemcgKmem() and an open-coded OR operation setting the memcg pointer
with the MEMCG_DATA_KMEM bit.  __ClearPageKmemcg() can be simple deleted,
as the whole memcg_data is zeroed at once.

As a bonus, on !CONFIG_MEMCG build the PageMemcgKmem() check will be
compiled out.
Signed-off-by: NRoman Gushchin <guro@fb.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Link: https://lkml.kernel.org/r/20201027001657.3398190-5-guro@fb.com
Link: https://lore.kernel.org/bpf/20201201215900.3569844-5-guro@fb.com

Conflicts:
	mm/memcontrol.c
Signed-off-by: NChen Huang <chenhuang5@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: NChen Wandun <chenwandun@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

492cf0b0

mm: Introduce page memcg flags · bfc245bc

由 Roman Gushchin 提交于 10月 30, 2021

mainline inclusion
from mainline-v5.11-rc1
commit 87944e29
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4C0GB
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=87944e2992bd28098c6806086a1e96bb4d0e502b

----------------------------------------------------------------------

The lowest bit in page->memcg_data is used to distinguish between struct
memory_cgroup pointer and a pointer to a objcgs array.  All checks and
modifications of this bit are open-coded.

Let's formalize it using page memcg flags, defined in enum
page_memcg_data_flags.

Additional flags might be added later.
Signed-off-by: NRoman Gushchin <guro@fb.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Link: https://lkml.kernel.org/r/20201027001657.3398190-4-guro@fb.com
Link: https://lore.kernel.org/bpf/20201201215900.3569844-4-guro@fb.comSigned-off-by: NChen Huang <chenhuang5@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: NChen Wandun <chenwandun@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

bfc245bc

mm: memcontrol/slab: Use helpers to access slab page's memcg_data · dbaf3609

由 Roman Gushchin 提交于 10月 30, 2021

mainline inclusion
from mainline-v5.11-rc1
commit 270c6a71
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4C0GB
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=270c6a71460e12b07b1dcadf7457ff95b6c6e8f4

----------------------------------------------------------------------

To gather all direct accesses to struct page's memcg_data field in one
place, let's introduce 3 new helpers to use in the slab accounting code:

  struct obj_cgroup **page_objcgs(struct page *page);
  struct obj_cgroup **page_objcgs_check(struct page *page);
  bool set_page_objcgs(struct page *page, struct obj_cgroup **objcgs);

They are similar to the corresponding API for generic pages, except that
the setter can return false, indicating that the value has been already
set from a different thread.
Signed-off-by: NRoman Gushchin <guro@fb.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20201027001657.3398190-3-guro@fb.com
Link: https://lore.kernel.org/bpf/20201201215900.3569844-3-guro@fb.comSigned-off-by: NChen Huang <chenhuang5@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: NChen Wandun <chenwandun@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

dbaf3609

mm: memcontrol: Use helpers to read page's memcg data · d1b942b7

由 Roman Gushchin 提交于 10月 30, 2021

mainline inclusion
from mainline-v5.11-rc1
commit bcfe06bf
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4C0GB
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bcfe06bf2622f7c4899468e427683aec49070687

----------------------------------------------------------------------

Patch series "mm: allow mapping accounted kernel pages to userspace", v6.

Currently a non-slab kernel page which has been charged to a memory cgroup
can't be mapped to userspace.  The underlying reason is simple: PageKmemcg
flag is defined as a page type (like buddy, offline, etc), so it takes a
bit from a page->mapped counter.  Pages with a type set can't be mapped to
userspace.

But in general the kmemcg flag has nothing to do with mapping to
userspace.  It only means that the page has been accounted by the page
allocator, so it has to be properly uncharged on release.

Some bpf maps are mapping the vmalloc-based memory to userspace, and their
memory can't be accounted because of this implementation detail.

This patchset removes this limitation by moving the PageKmemcg flag into
one of the free bits of the page->mem_cgroup pointer.  Also it formalizes
accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
adds several checks and removes a couple of obsolete functions.  As the
result the code became more robust with fewer open-coded bit tricks.

This patch (of 4):

Currently there are many open-coded reads of the page->mem_cgroup pointer,
as well as a couple of read helpers, which are barely used.

It creates an obstacle on a way to reuse some bits of the pointer for
storing additional bits of information.  In fact, we already do this for
slab pages, where the last bit indicates that a pointer has an attached
vector of objcg pointers instead of a regular memcg pointer.

This commits uses 2 existing helpers and introduces a new helper to
converts all read sides to calls of these helpers:
  struct mem_cgroup *page_memcg(struct page *page);
  struct mem_cgroup *page_memcg_rcu(struct page *page);
  struct mem_cgroup *page_memcg_check(struct page *page);

page_memcg_check() is intended to be used in cases when the page can be a
slab page and have a memcg pointer pointing at objcg vector.  It does
check the lowest bit, and if set, returns NULL.  page_memcg() contains a
VM_BUG_ON_PAGE() check for the page not being a slab page.

To make sure nobody uses a direct access, struct page's
mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
Signed-off-by: NRoman Gushchin <guro@fb.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com

Conflicts:
	mm/memcontrol.c
Signed-off-by: NChen Huang <chenhuang5@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: NChen Wandun <chenwandun@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

d1b942b7

19 10月, 2021 1 次提交

mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim · d10dc503

由 Johannes Weiner 提交于 10月 19, 2021

stable inclusion
from stable-5.10.61
commit 8132fc2bf4b70d19f666e82e7ce28dc871ed3b92
bugzilla: 177029 https://gitee.com/openeuler/kernel/issues/I4EAXD

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8132fc2bf4b70d19f666e82e7ce28dc871ed3b92

--------------------------------

[ Upstream commit f56ce412 ]

We've noticed occasional OOM killing when memory.low settings are in
effect for cgroups.  This is unexpected and undesirable as memory.low is
supposed to express non-OOMing memory priorities between cgroups.

The reason for this is proportional memory.low reclaim.  When cgroups
are below their memory.low threshold, reclaim passes them over in the
first round, and then retries if it couldn't find pages anywhere else.
But when cgroups are slightly above their memory.low setting, page scan
force is scaled down and diminished in proportion to the overage, to the
point where it can cause reclaim to fail as well - only in that case we
currently don't retry, and instead trigger OOM.

To fix this, hook proportional reclaim into the same retry logic we have
in place for when cgroups are skipped entirely.  This way if reclaim
fails and some cgroups were scanned with diminished pressure, we'll try
another full-force cycle before giving up and OOMing.

[akpm@linux-foundation.org: coding-style fixes]

Link: https://lkml.kernel.org/r/20210817180506.220056-1-hannes@cmpxchg.org
Fixes: 9783aa99 ("mm, memcg: proportional memory.{low,min} reclaim")
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Reported-by: NLeon Yang <lnyng@fb.com>
Reviewed-by: NRik van Riel <riel@surriel.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NRoman Gushchin <guro@fb.com>
Acked-by: NChris Down <chris@chrisdown.name>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>		[5.4+]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

d10dc503

26 9月, 2021 4 次提交

mm: memcontrol: reparent nr_deferred when memcg offline · 3db7d696

由 Yang Shi 提交于 9月 04, 2021

mainline inclusion
from mainline-v5.13-rc1
commit a178015c
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

Now shrinker's nr_deferred is per memcg for memcg aware shrinkers, add
to parent's corresponding nr_deferred when memcg offline.

Link: https://lkml.kernel.org/r/20210311190845.9708-13-shy828301@gmail.comSigned-off-by: NYang Shi <shy828301@gmail.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Acked-by: NKirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: NRoman Gushchin <guro@fb.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChen Wandun <chenwandun@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

3db7d696

mm: vmscan: add per memcg shrinker nr_deferred · 951a268b

由 Yang Shi 提交于 9月 04, 2021

mainline inclusion
from mainline-v5.13-rc1
commit 3c6f17e6
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

Currently the number of deferred objects are per shrinker, but some
slabs, for example, vfs inode/dentry cache are per memcg, this would
result in poor isolation among memcgs.

The deferred objects typically are generated by __GFP_NOFS allocations,
one memcg with excessive __GFP_NOFS allocations may blow up deferred
objects, then other innocent memcgs may suffer from over shrink,
excessive reclaim latency, etc.

For example, two workloads run in memcgA and memcgB respectively,
workload in B is vfs heavy workload.  Workload in A generates excessive
deferred objects, then B's vfs cache might be hit heavily (drop half of
caches) by B's limit reclaim or global reclaim.

We observed this hit in our production environment which was running vfs
heavy workload shown as the below tracing log:

  <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
  nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
  cache items 246404277 delta 31345 total_scan 123202138
  <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
  nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
  last shrinker return val 123186855

The vfs cache and page cache ratio was 10:1 on this machine, and half of
caches were dropped.  This also resulted in significant amount of page
caches were dropped due to inodes eviction.

Make nr_deferred per memcg for memcg aware shrinkers would solve the
unfairness and bring better isolation.

The following patch will add nr_deferred to parent memcg when memcg
offline.  To preserve nr_deferred when reparenting memcgs to root, root
memcg needs shrinker_info allocated too.

When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the
shrinker's nr_deferred would be used.  And non memcg aware shrinkers use
shrinker's nr_deferred all the time.

Link: https://lkml.kernel.org/r/20210311190845.9708-10-shy828301@gmail.comSigned-off-by: NYang Shi <shy828301@gmail.com>
Acked-by: NRoman Gushchin <guro@fb.com>
Acked-by: NKirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChen Wandun <chenwandun@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

951a268b

mm: memcontrol: rename shrinker_map to shrinker_info · 4496c7bf

由 Yang Shi 提交于 9月 04, 2021

mainline inclusion
from mainline-v5.13-rc1
commit e4262c4f
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

The following patch is going to add nr_deferred into shrinker_map, the
change will make shrinker_map not only include map anymore, so rename it
to "memcg_shrinker_info".  And this should make the patch adding
nr_deferred cleaner and readable and make review easier.  Also remove the
"memcg_" prefix.

Link: https://lkml.kernel.org/r/20210311190845.9708-7-shy828301@gmail.comSigned-off-by: NYang Shi <shy828301@gmail.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Acked-by: NKirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: NRoman Gushchin <guro@fb.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChen Wandun <chenwandun@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

4496c7bf

mm: vmscan: consolidate shrinker_maps handling code · dcf6ec43

由 Yang Shi 提交于 9月 04, 2021

mainline inclusion
from mainline-v5.13-rc1
commit 2bfd3637
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

The shrinker map management is not purely memcg specific, it is at the
intersection between memory cgroup and shrinkers.  It's allocation and
assignment of a structure, and the only memcg bit is the map is being
stored in a memcg structure.  So move the shrinker_maps handling code
into vmscan.c for tighter integration with shrinker code, and remove the
"memcg_" prefix.  There is no functional change.

Link: https://lkml.kernel.org/r/20210311190845.9708-3-shy828301@gmail.comSigned-off-by: NYang Shi <shy828301@gmail.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Acked-by: NKirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: NRoman Gushchin <guro@fb.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Conflicts:
	mm/memcontrol.c
Signed-off-by: NChen Wandun <chenwandun@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

dcf6ec43

14 7月, 2021 3 次提交

mm: memcontrol: optimize per-lruvec stats counter memory usage · ed38a086

由 Muchun Song 提交于 7月 09, 2021

mainline inclusion
from mainline-5.12-rc1
commit f3344adf
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZVOM
CVE: NA

-------------------------------------------------
The vmstat threshold is 32 (MEMCG_CHARGE_BATCH), Actually the threshold
can be as big as MEMCG_CHARGE_BATCH * PAGE_SIZE.  It still fits into s32.
So introduce struct batched_lruvec_stat to optimize memory usage.

The size of struct lruvec_stat is 304 bytes on 64 bit systems.  As it is a
per-cpu structure.  So with this patch, we can save 304 / 2 * ncpu bytes
per-memcg per-node where ncpu is the number of the possible CPU.  If there
are c memory cgroup (include dying cgroup) and n NUMA node in the system.
Finally, we can save (152 * ncpu * c * n) bytes.

[akpm@linux-foundation.org: fix typo in comment]

Link: https://lkml.kernel.org/r/20201210042121.39665-1-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Chris Down <chris@chrisdown.name>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChengyang Fan <cy.fan@huawei.com>
Reviewed-by: chenwandun<chenwandun@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

ed38a086

mm/lru: introduce relock_page_lruvec() · 5f36b2bf

由 Alexander Duyck 提交于 7月 07, 2021

mainline inclusion
from mainline-v5.11-rc1
commit 2a5e4e34
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZF7C?from=project-issue
CVE: NA

--------------------------------------

Add relock_page_lruvec() to replace repeated same code, no functional
change.

When testing for relock we can avoid the need for RCU locking if we simply
compare the page pgdat and memcg pointers versus those that the lruvec is
holding.  By doing this we can avoid the extra pointer walks and accesses
of the memory cgroup.

In addition we can avoid the checks entirely if lruvec is currently NULL.

[alex.shi@linux.alibaba.com: use page_memcg()]
  Link: https://lkml.kernel.org/r/66d8e79d-7ec6-bfbc-1c82-bf32db3ae5b7@linux.alibaba.com

Link: https://lkml.kernel.org/r/1604566549-62481-19-git-send-email-alex.shi@linux.alibaba.comSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com>
Acked-by: NHugh Dickins <hughd@google.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Chen, Rong A" <rong.a.chen@intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NJing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Nchenwandun <chenwandun@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

5f36b2bf

mm/lru: replace pgdat lru_lock with lruvec lock · 3c09b74d

由 Alex Shi 提交于 7月 07, 2021

mainline inclusion
from mainline-v5.11-rc1
commit 6168d0da
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZF7C?from=project-issue
CVE: NA

--------------------------------------

This patch moves per node lru_lock into lruvec, thus bring a lru_lock for
each of memcg per node.  So on a large machine, each of memcg don't have
to suffer from per node pgdat->lru_lock competition.  They could go fast
with their self lru_lock.

After move memcg charge before lru inserting, page isolation could
serialize page's memcg, then per memcg lruvec lock is stable and could
replace per node lru lock.

In isolate_migratepages_block(), compact_unlock_should_abort and
lock_page_lruvec_irqsave are open coded to work with compact_control.
Also add a debug func in locking which may give some clues if there are
sth out of hands.

Daniel Jordan's testing show 62% improvement on modified readtwice case on
his 2P * 10 core * 2 HT broadwell box.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/

Hugh Dickins helped on the patch polish, thanks!

[alex.shi@linux.alibaba.com: fix comment typo]
  Link: https://lkml.kernel.org/r/5b085715-292a-4b43-50b3-d73dc90d1de5@linux.alibaba.com
[alex.shi@linux.alibaba.com: use page_memcg()]
  Link: https://lkml.kernel.org/r/5a4c2b72-7ee8-2478-fc0e-85eb83aafec4@linux.alibaba.com

Link: https://lkml.kernel.org/r/1604566549-62481-18-git-send-email-alex.shi@linux.alibaba.comSigned-off-by: NAlex Shi <alex.shi@linux.alibaba.com>
Acked-by: NHugh Dickins <hughd@google.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Rong Chen <rong.a.chen@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

Conflicts:
	mm/memcontrol.c
Signed-off-by: NJing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Nchenwandun <chenwandun@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

3c09b74d

08 7月, 2021 2 次提交

memcg: Add static key for memcg priority · ce7fa1af

由 Jing Xiangfeng 提交于 7月 06, 2021

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZN3O
CVE: NA

--------------------------------------

This patch adds a default-false static key to disable memcg priority
feature. If you want to enable it by writing 1:

echo 1 > /proc/sys/vm/memcg_qos_enable
Signed-off-by: NJing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: NLiu Shixin <liushixin2@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

ce7fa1af

memcg: support priority for oom · 4da32073

由 Jing Xiangfeng 提交于 7月 06, 2021

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZN3O
CVE: NA

--------------------------------------

We first kill the process from the low priority memcg if OOM occurs.
If the process is not found, then fallback to normal handle.
Signed-off-by: NJing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: NLiu Shixin <liushixin2@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

4da32073

19 4月, 2021 1 次提交

mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument · 4e00be7c

由 Zhou Guanghui 提交于 4月 07, 2021

stable inclusion
from stable-5.10.27
commit 6143a1d193e9ecc18250516594655c5a6fbc3a7b
bugzilla: 51493

--------------------------------

commit be6c8982 upstream.

Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly pass
in page number argument.

In this way, the interface name is more common and can be used by
potential users.  In addition, the complete info(memcg and flag) of the
memcg needs to be set to the tail pages.

Link: https://lkml.kernel.org/r/20210304074053.65527-2-zhouguanghui1@huawei.comSigned-off-by: NZhou Guanghui <zhouguanghui1@huawei.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NZi Yan <ziy@nvidia.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Tianhong Ding <dingtianhong@huawei.com>
Cc: Weilong Chen <chenweilong@huawei.com>
Cc: Rui Xiang <rui.xiang@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NHugh Dickins <hughd@google.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: N  Weilong Chen <chenweilong@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

4e00be7c

27 11月, 2020 1 次提交

mm: memcg: relayout structure mem_cgroup to avoid cache interference · 4df91062

由 Feng Tang 提交于 11月 25, 2020

0day reported one -22.7% regression for will-it-scale page_fault2
case [1] on a 4 sockets 144 CPU platform, and bisected to it to be
caused by Waiman's optimization (commit bd0b230f) of saving one
'struct page_counter' space for 'struct mem_cgroup'.

Initially we thought it was due to the cache alignment change introduced
by the patch, but further debug shows that it is due to some hot data
members ('vmstats_local', 'vmstats_percpu', 'vmstats') sit in 2 adjacent
cacheline (2N and 2N+1 cacheline), and when adjacent cache line prefetch
is enabled, it triggers an "extended level" of cache false sharing for
2 adjacent cache lines.

So exchange the 2 member blocks, while keeping mostly the original
cache alignment, which can restore and even enhance the performance,
and save 64 bytes of space for 'struct mem_cgroup' (from 2880 to 2816,
with 0day's default RHEL-8.3 kernel config)

[1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/

Fixes: bd0b230f ("mm/memcg: unify swap and memsw page counters")
Reported-by: Nkernel test robot <rong.a.chen@intel.com>
Signed-off-by: NFeng Tang <feng.tang@intel.com>
Acked-by: NWaiman Long <longman@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4df91062

15 11月, 2020 1 次提交

mm: memcontrol: fix missing wakeup polling thread · 8b21ca02

由 Muchun Song 提交于 11月 13, 2020

When we poll the swap.events, we can miss being woken up when the swap
event occurs.  Because we didn't notify.

Fixes: f3a53a3a ("mm, memcontrol: implement memory.swap.events")
Signed-off-by: NMuchun Song <songmuchun@bytedance.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20201105161936.98312-1-songmuchun@bytedance.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8b21ca02

19 10月, 2020 1 次提交

mm: kmem: enable kernel memcg accounting from interrupt contexts · 4127c650

由 Roman Gushchin 提交于 10月 17, 2020

If a memcg to charge can be determined (using remote charging API), there
are no reasons to exclude allocations made from an interrupt context from
the accounting.

Such allocations will pass even if the resulting memcg size will exceed
the hard limit, but it will affect the application of the memory pressure
and an inability to put the workload under the limit will eventually
trigger the OOM.

To use active_memcg() helper, memcg_kmem_bypass() is moved back to
memcontrol.c.
Signed-off-by: NRoman Gushchin <guro@fb.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200827225843.1270629-5-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4127c650

14 10月, 2020 1 次提交

mm/memcg: unify swap and memsw page counters · bd0b230f

由 Waiman Long 提交于 10月 13, 2020

The swap page counter is v2 only while memsw is v1 only.  As v1 and v2
controllers cannot be active at the same time, there is no point to keep
both swap and memsw page counters in mem_cgroup.  The previous patch has
made sure that memsw page counter is updated and accessed only when in v1
code paths.  So it is now safe to alias the v1 memsw page counter to v2
swap page counter.  This saves 14 long's in the size of mem_cgroup.  This
is a saving of 112 bytes for 64-bit archs.

While at it, also document which page counters are used in v1 and/or v2.
Signed-off-by: NWaiman Long <longman@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Link: https://lkml.kernel.org/r/20200914024452.19167-4-longman@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bd0b230f

15 8月, 2020 1 次提交

mm/memcontrol: fix a data race in scan count · e0e3f42f

由 Qian Cai 提交于 8月 14, 2020

struct mem_cgroup_per_node mz.lru_zone_size[zone_idx][lru] could be
accessed concurrently as noticed by KCSAN,

 BUG: KCSAN: data-race in lruvec_lru_size / mem_cgroup_update_lru_size

 write to 0xffff9c804ca285f8 of 8 bytes by task 50951 on cpu 12:
  mem_cgroup_update_lru_size+0x11c/0x1d0
  mem_cgroup_update_lru_size at mm/memcontrol.c:1266
  isolate_lru_pages+0x6a9/0xf30
  shrink_active_list+0x123/0xcc0
  shrink_lruvec+0x8fd/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 read to 0xffff9c804ca285f8 of 8 bytes by task 50964 on cpu 95:
  lruvec_lru_size+0xbb/0x270
  mem_cgroup_get_zone_lru_size at include/linux/memcontrol.h:536
  (inlined by) lruvec_lru_size at mm/vmscan.c:326
  shrink_lruvec+0x1d0/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_current+0xa6/0x120
  alloc_slab_page+0x3b1/0x540
  allocate_slab+0x70/0x660
  new_slab+0x46/0x70
  ___slab_alloc+0x4ad/0x7d0
  __slab_alloc+0x43/0x70
  kmem_cache_alloc+0x2c3/0x420
  getname_flags+0x4c/0x230
  getname+0x22/0x30
  do_sys_openat2+0x205/0x3b0
  do_sys_open+0x9a/0xf0
  __x64_sys_openat+0x62/0x80
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 95 PID: 50964 Comm: cc1 Tainted: G        W  O L    5.5.0-next-20200204+ #6
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

The write is under lru_lock, but the read is done as lockless.  The scan
count is used to determine how aggressively the anon and file LRU lists
should be scanned.  Load tearing could generate an inefficient heuristic,
so fix it by adding READ_ONCE() for the read.
Signed-off-by: NQian Cai <cai@lca.pw>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200206034945.2481-1-cai@lca.pwSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e0e3f42f

openeuler / Kernel 大约 2 年 前同步成功

openeuler / Kernel
大约 2 年前同步成功