- 09 February 2023, 1 commit
-
-
Submitted by Tetsuo Handa
mainline inclusion from mainline-v6.0-rc1 commit 68aaee14 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6ADCF CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=68aaee147e597b495622b7c9038e5922c7c61f57 -------------------------------- syzbot is reporting GFP_KERNEL allocation with oom_lock held when reporting memcg OOM [1]. If this allocation triggers the global OOM situation then the system can livelock because the GFP_KERNEL allocation with oom_lock held cannot trigger the global OOM killer because __alloc_pages_may_oom() fails to hold oom_lock. Fix this problem by removing the allocation from memory_stat_format() completely, and pass static buffer when calling from memcg OOM path. Note that the caller holding filesystem lock was the trigger for syzbot to report this locking dependency. Doing GFP_KERNEL allocation with filesystem lock held can deadlock the system even without involving OOM situation. Link: https://syzkaller.appspot.com/bug?extid=2d2aeadc6ce1e1f11d45 [1] Link: https://lkml.kernel.org/r/86afb39f-8c65-bec2-6cfc-c5e3cd600c0b@I-love.SAKURA.ne.jp Fixes: c8713d0b ("mm: memcontrol: dump memory.stat during cgroup OOM") Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Nsyzbot <syzbot+2d2aeadc6ce1e1f11d45@syzkaller.appspotmail.com> Suggested-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> conflicts: mm/memcontrol.c Signed-off-by: NCai Xinchen <caixinchen1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: NWang Weiyang <wangweiyang2@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com> (cherry picked from commit ee2d7b76)
-
- 01 December 2022, 1 commit
-
-
Submitted by Jian Zhang
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I63SDZ ------------------------------- On Ascend, we use tmp hugepage and disable the OOM killer. When an OOM occurs and, after some time, memory becomes sufficient for the process again, the process will not return to running normally. In this case, we must use OOM recover to let the process run. Signed-off-by: Jian Zhang <zhangjian210@huawei.com>
-
- 23 November 2022, 1 commit
-
-
Submitted by Lu Jialin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZG61 ------------------------------- In cgroup v1, cgroup writeback is not supported because of two problems: 1) blkcg_css and memcg_css are mounted on different cgroup trees, so a blkcg_css cannot be found from a given memcg_css. 2) Buffer I/O is performed by a kthread, which is in the root_blkcg, so blkcg cannot limit the wbps and wiops of buffer I/O. We solve these two problems to support cgroup writeback on cgroup v1: 1) A memcg is attached to the blkcg_root css when the memcg is created. 2) We add a member "wb_blkio_ino" to mem_cgroup_legacy_files. Users can attach a memcg to a certain blkcg by echoing the file inode of the blkcg into the wb_blkio file of the memcg. 3) inode_cgwb_enabled() returns true when memcg and io are both mounted on cgroup v2 or both on cgroup v1. 4) Buffer I/O can find a blkcg according to its memcg. Thus, a memcg can find a certain blkcg, and cgroup writeback can be supported on cgroup v1. Signed-off-by: Lu Jialin <lujialin4@huawei.com>
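A minimal usage sketch of the interface described above. The memcg-side file name (memory.wb_blkio), mount points and group names are illustrative assumptions; only the idea of echoing the blkcg directory inode into the memcg file comes from the patch description:
/sys/fs/cgroup # stat -c %i blkio/mygrp                              # inode number of the target blkcg directory (assumed layout)
/sys/fs/cgroup # echo <that-inode> > memory/mygrp/memory.wb_blkio    # bind the memcg to that blkcg for buffer I/O writeback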
-
- 15 November 2022, 1 commit
-
-
Submitted by Wang Wensheng
ascend inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I612UG CVE: NA -------------------------------- This kernel parameter is used for Ascend scenarios and enables all the needed options at once. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
-
- 07 September 2022, 2 commits
-
-
Submitted by Lu Jialin
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK CVE: NA -------- This patch fixes reading memory.high_async_ratio: after this patch, when the user cats memory.high_async_ratio, the correct value is shown. Show case:
/sys/fs/cgroup/test # cat memory.high_async_ratio
0
/sys/fs/cgroup/test # echo 90 > memory.high_async_ratio
/sys/fs/cgroup/test # cat memory.high_async_ratio
90
/sys/fs/cgroup/test # echo 85 > memory.high_async_ratio
/sys/fs/cgroup/test # cat memory.high_async_ratio
85
Signed-off-by: Lu Jialin <lujialin4@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Lu Jialin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK CVE: NA -------- This patch changes HIGH_ASYNC_RATIO_BASE from 10 to 100 and HIGH_ASYNC_RATIO_GAP from 1 to 10. After this patch, users can set high_async_ratio from 0 to 99, which allows finer-grained control of memcg async reclaim. If high_async_ratio is smaller than 10, try to reclaim all the pages of the memcg. Signed-off-by: Lu Jialin <lujialin4@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 09 August 2022, 4 commits
-
-
Submitted by Lu Jialin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK CVE: NA ------------------------------- Users can set high_async_ratio from 0 to 9; memcg high async reclaim starts when memcg usage is larger than memory.high * high_async_ratio / 10. Signed-off-by: Lu Jialin <lujialin4@huawei.com> Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Lu Jialin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK CVE: NA ------------------------------- Enable memcg async reclaim when memcg usage is larger than memory_high * memcg->high_async_ratio / 10. If memcg usage is larger than memory_high * (memcg->high_async_ratio - 1) / 10, the number of pages to reclaim is the difference between memcg usage and memory_high * (memcg->high_async_ratio - 1) / 10; otherwise MEMCG_CHARGE_BATCH pages are reclaimed. The default memcg->high_async_ratio is 0, and when it is 0, memcg async reclaim is disabled. Memcg async reclaim is triggered 1) in try_charge and 2) when memory_high is reset. Signed-off-by: Lu Jialin <lujialin4@huawei.com> Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
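A worked example of the thresholds above, with illustrative numbers (memory.high = 1000 MB, high_async_ratio = 8):
async reclaim starts when usage > 1000 MB * 8 / 10 = 800 MB
at usage = 850 MB: 850 MB > 1000 MB * (8 - 1) / 10 = 700 MB, so the reclaim target is 850 - 700 = 150 MB
otherwise only MEMCG_CHARGE_BATCH pages are reclaimed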
-
Submitted by Lu Jialin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK CVE: NA ------------------------------- This reverts commit 1496d67c. Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Lu Jialin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK CVE: NA ------------------------------- This reverts commit 84355bcc. Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 06 July 2022, 1 commit
-
-
Submitted by Randy Dunlap
stable inclusion from stable-v5.10.110 commit 81a04b9a32e40876dd41909542f1b23560cb99d3 bugzilla: https://gitee.com/openeuler/kernel/issues/I574AL Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=81a04b9a32e40876dd41909542f1b23560cb99d3 -------------------------------- commit 460a79e1 upstream. __setup() handlers should return 1 if the command line option is handled and 0 if not (or maybe never return 0; it just pollutes init's environment). The only reason that this particular __setup handler does not pollute init's environment is that the setup string contains a '.', as in "cgroup.memory". This causes init/main.c::unknown_boottoption() to consider it to be an "Unused module parameter" and ignore it. (This is for parsing of loadable module parameters any time after kernel init.) Otherwise the string "cgroup.memory=whatever" would be added to init's environment strings. Instead of relying on this '.' quirk, just return 1 to indicate that the boot option has been handled. Note that there is no warning message if someone enters: cgroup.memory=anything_invalid Link: https://lkml.kernel.org/r/20220222005811.10672-1-rdunlap@infradead.org Fixes: f7e1cb6e ("mm: memcontrol: account socket memory in unified hierarchy memory controller") Signed-off-by: NRandy Dunlap <rdunlap@infradead.org> Reported-by: NIgor Zhbanov <i.zhbanov@omprussia.ru> Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru Reviewed-by: NMichal Koutný <mkoutny@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYu Liao <liaoyu15@huawei.com> Reviewed-by: NWei Li <liwei391@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 22 June 2022, 1 commit
-
-
Submitted by tatataeki
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4MC3F CVE: NA ---------------------------------- Multiple operations on cgroups in cgroup v1 are related to the status of the cgroup. The status of the current cgroup can be displayed in cgroupv2, but it cannot be displayed in cgroup v1, so the cgroup.flag_stat member is added in memory cgroup to display the status of the current cgroup and sub-cgroups. Testing result: List the status of user.slice [root@test user.slice]#cat memory.flag_stat NO_REF 0 ONLINE 1 RELEASED 0 VISIBLE 1 DYING 0 CHILD_NO_REF 0 CHILD_ONLINE 1 CHILD_RELEASED 0 CHILD_VISIBLE 1 CHILD_DYING 0 Create a new cgroup in user.slice [root@test user.slice]#mkdir user-test List the current status of user.slice after operation above [root@test user.slice]#cat memory.flag_stat NO_REF 0 ONLINE 1 RELEASED 0 VISIBLE 1 DYING 0 CHILD_NO_REF 0 CHILD_ONLINE 2 CHILD_RELEASED 0 CHILD_VISIBLE 2 CHILD_DYING 0 Signed-off-by: Ntatataeki <shengzeyu19_98@163.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 07 June 2022, 2 commits
-
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I545DF CVE: NA -------------------------------- Introduce a per-memcg reclaim interface for cgroup v1, and disable memory reclaim for the root memcg. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
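An illustrative use of the interface on a cgroup v1 memory hierarchy; the mount point and group name are assumptions, and the root-memcg write is expected to be rejected since reclaim is disabled there:
/sys/fs/cgroup/memory/test # echo 10M > memory.reclaim      # proactively reclaim about 10 MiB from this memcg
/sys/fs/cgroup/memory # echo 10M > memory.reclaim           # root memcg: expected to fail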
-
Submitted by Shakeel Butt
mainline inclusion from mainline-v5.19-rc1 commit 94968384 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I545DF CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=94968384dde15d48263bfc59d280cd71b1259d8c -------------------------------- This patch series adds a memory.reclaim proactive reclaim interface. The rationale behind the interface and how it works are in the first patch. This patch (of 4): Introduce a memcg interface to trigger memory reclaim on a memory cgroup. Use case: Proactive Reclaim --------------------------- A userspace proactive reclaimer can continuously probe the memcg to reclaim a small amount of memory. This gives more accurate and up-to-date workingset estimation as the LRUs are continuously sorted and can potentially provide more deterministic memory overcommit behavior. The memory overcommit controller can provide more proactive response to the changing behavior of the running applications instead of being reactive. A userspace reclaimer's purpose in this case is not a complete replacement for kswapd or direct reclaim, it is to proactively identify memory savings opportunities and reclaim some amount of cold pages set by the policy to free up the memory for more demanding jobs or scheduling new jobs. A user space proactive reclaimer is used in Google data centers. Additionally, Meta's TMO paper recently referenced a very similar interface used for user space proactive reclaim: https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 Benefits of a user space reclaimer: ----------------------------------- 1) More flexible on who should be charged for the cpu of the memory reclaim. For proactive reclaim, it makes more sense to be centralized. 2) More flexible on dedicating the resources (like cpu). The memory overcommit controller can balance the cost between the cpu usage and the memory reclaimed. 3) Provides a way to the applications to keep their LRUs sorted, so, under memory pressure better reclaim candidates are selected. This also gives more accurate and uptodate notion of working set for an application. Why memory.high is not enough? ------------------------------ - memory.high can be used to trigger reclaim in a memcg and can potentially be used for proactive reclaim. However there is a big downside in using memory.high. It can potentially introduce high reclaim stalls in the target application as the allocations from the processes or the threads of the application can hit the temporary memory.high limit. - Userspace proactive reclaimers usually use feedback loops to decide how much memory to proactively reclaim from a workload. The metrics used for this are usually either refaults or PSI, and these metrics will become messy if the application gets throttled by hitting the high limit. - memory.high is a stateful interface, if the userspace proactive reclaimer crashes for any reason while triggering reclaim it can leave the application in a bad state. - If a workload is rapidly expanding, setting memory.high to proactively reclaim memory can result in actually reclaiming more memory than intended. The benefits of such interface and shortcomings of existing interface were further discussed in this RFC thread: https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ Interface: ---------- Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to trigger reclaim in the target memory cgroup. 
The interface is introduced as a nested-keyed file to allow for future optional arguments to be easily added to configure the behavior of reclaim. Possible Extensions: -------------------- - This interface can be extended with an additional parameter or flags to allow specifying one or more types of memory to reclaim from (e.g. file, anon, ..). - The interface can also be extended with a node mask to reclaim from specific nodes. This has use cases for reclaim-based demotion in memory tiering systens. - A similar per-node interface can also be added to support proactive reclaim and reclaim-based demotion in systems without memcg. - Add a timeout parameter to make it easier for user space to call the interface without worrying about being blocked for an undefined amount of time. For now, let's keep things simple by adding the basic functionality. [yosryahmed@google.com: worked on versions v2 onwards, refreshed to current master, updated commit message based on recent discussions and use cases] Link: https://lkml.kernel.org/r/20220425190040.2475377-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20220425190040.2475377-2-yosryahmed@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Co-developed-by: NYosry Ahmed <yosryahmed@google.com> Signed-off-by: NYosry Ahmed <yosryahmed@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NWei Xu <weixugc@google.com> Acked-by: NRoman Gushchin <roman.gushchin@linux.dev> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Thelen <gthelen@google.com> Cc: Chen Wandun <chenwandun@huawei.com> Cc: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NChen Wandun <chenwandun@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 23 May 2022, 1 commit
-
-
Submitted by Roman Gushchin
stable inclusion from stable-v5.10.102 commit 8c8385972ea96adeb9b678c9390beaa4d94c4aae bugzilla: https://gitee.com/openeuler/kernel/issues/I567K6 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8c8385972ea96adeb9b678c9390beaa4d94c4aae -------------------------------- commit 0764db9b upstream. Alexander reported a circular lock dependency revealed by the mmap1 ltp test: LOCKDEP_CIRCULAR (suite: ltp, case: mtest06 (mmap1)) WARNING: possible circular locking dependency detected 5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1 Not tainted ------------------------------------------------------ mmap1/202299 is trying to acquire lock: 00000001892c0188 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x4a/0xe0 but task is already holding lock: 00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&sighand->siglock){-.-.}-{2:2}: __lock_acquire+0x604/0xbd8 lock_acquire.part.0+0xe2/0x238 lock_acquire+0xb0/0x200 _raw_spin_lock_irqsave+0x6a/0xd8 __lock_task_sighand+0x90/0x190 cgroup_freeze_task+0x2e/0x90 cgroup_migrate_execute+0x11c/0x608 cgroup_update_dfl_csses+0x246/0x270 cgroup_subtree_control_write+0x238/0x518 kernfs_fop_write_iter+0x13e/0x1e0 new_sync_write+0x100/0x190 vfs_write+0x22c/0x2d8 ksys_write+0x6c/0xf8 __do_syscall+0x1da/0x208 system_call+0x82/0xb0 -> #0 (css_set_lock){..-.}-{2:2}: check_prev_add+0xe0/0xed8 validate_chain+0x736/0xb20 __lock_acquire+0x604/0xbd8 lock_acquire.part.0+0xe2/0x238 lock_acquire+0xb0/0x200 _raw_spin_lock_irqsave+0x6a/0xd8 obj_cgroup_release+0x4a/0xe0 percpu_ref_put_many.constprop.0+0x150/0x168 drain_obj_stock+0x94/0xe8 refill_obj_stock+0x94/0x278 obj_cgroup_charge+0x164/0x1d8 kmem_cache_alloc+0xac/0x528 __sigqueue_alloc+0x150/0x308 __send_signal+0x260/0x550 send_signal+0x7e/0x348 force_sig_info_to_task+0x104/0x180 force_sig_fault+0x48/0x58 __do_pgm_check+0x120/0x1f0 pgm_check_handler+0x11e/0x180 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&sighand->siglock); lock(css_set_lock); lock(&sighand->siglock); lock(css_set_lock); *** DEADLOCK *** 2 locks held by mmap1/202299: #0: 00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180 #1: 00000001892ad560 (rcu_read_lock){....}-{1:2}, at: percpu_ref_put_many.constprop.0+0x0/0x168 stack backtrace: CPU: 15 PID: 202299 Comm: mmap1 Not tainted 5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1 Hardware name: IBM 3906 M04 704 (LPAR) Call Trace: dump_stack_lvl+0x76/0x98 check_noncircular+0x136/0x158 check_prev_add+0xe0/0xed8 validate_chain+0x736/0xb20 __lock_acquire+0x604/0xbd8 lock_acquire.part.0+0xe2/0x238 lock_acquire+0xb0/0x200 _raw_spin_lock_irqsave+0x6a/0xd8 obj_cgroup_release+0x4a/0xe0 percpu_ref_put_many.constprop.0+0x150/0x168 drain_obj_stock+0x94/0xe8 refill_obj_stock+0x94/0x278 obj_cgroup_charge+0x164/0x1d8 kmem_cache_alloc+0xac/0x528 __sigqueue_alloc+0x150/0x308 __send_signal+0x260/0x550 send_signal+0x7e/0x348 force_sig_info_to_task+0x104/0x180 force_sig_fault+0x48/0x58 __do_pgm_check+0x120/0x1f0 pgm_check_handler+0x11e/0x180 INFO: lockdep is turned off. In this example a slab allocation from __send_signal() caused a refilling and draining of a percpu objcg stock, resulted in a releasing of another non-related objcg. Objcg release path requires taking the css_set_lock, which is used to synchronize objcg lists. 
This can create a circular dependency with the sighandler lock, which is taken with the locked css_set_lock by the freezer code (to freeze a task). In general it seems that using css_set_lock to synchronize objcg lists makes any slab allocations and deallocation with the locked css_set_lock and any intervened locks risky. To fix the problem and make the code more robust let's stop using css_set_lock to synchronize objcg lists and use a new dedicated spinlock instead. Link: https://lkml.kernel.org/r/Yfm1IHmoGdyUR81T@carbon.dhcp.thefacebook.com Fixes: bf4f0599 ("mm: memcg/slab: obj_cgroup API") Signed-off-by: NRoman Gushchin <guro@fb.com> Reported-by: NAlexander Egorenkov <egorenar@linux.ibm.com> Tested-by: NAlexander Egorenkov <egorenar@linux.ibm.com> Reviewed-by: NWaiman Long <longman@redhat.com> Acked-by: NTejun Heo <tj@kernel.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NJeremy Linton <jeremy.linton@arm.com> Tested-by: NJeremy Linton <jeremy.linton@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYu Liao <liaoyu15@huawei.com> Reviewed-by: NWei Li <liwei391@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 07 April 2022, 2 commits
-
-
Submitted by Lu Jialin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4X0YD?from=project-issue CVE: NA -------- Export "memory.events" and "memory.events.local" from cgroup v2 to cgroup v1. There are some differences between v2 and v1: 1) Events of MEMCG_OOM_GROUP_KILL are not included in cgroup v1, because there is no memory.oom.group member. 2) Events of MEMCG_MAX are represented with "limit_in_bytes" in cgroup v1 instead of memory.max. 3) The oom_kill event is included in memory.oom_control; make oom_kill include its descendants' events and add oom_kill_local, which includes only the memcg's own oom_kill events. Signed-off-by: Lu Jialin <lujialin4@huawei.com> Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
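An illustrative read of the exported files on cgroup v1; the mount point is an assumption and the field set shown follows the differences listed above rather than verified output:
/sys/fs/cgroup/memory/test # cat memory.events
low 0
high 0
limit_in_bytes 0
oom 0
oom_kill 0
/sys/fs/cgroup/memory/test # cat memory.events.local        # same fields, but counting only this memcg's own events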
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: 186182, https://gitee.com/openeuler/kernel/issues/I4UOJI CVE: NA -------------------------------- Support printing the rootfs and tmpfs files that have pages charged to a given memory cgroup. The file information can be printed through the interface "memory.memfs_files_info" or printed when an OOM is triggered. In order not to flood the logs, we limit the maximum number of files printed on OOM through the interface "max_print_files_in_oom". And in order to filter out small files, we limit the minimum size of files that can be printed through the interface "size_threshold". Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
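A usage sketch of the three interfaces named above. memory.memfs_files_info is taken from the description; where the max_print_files_in_oom and size_threshold knobs live is an assumption:
/sys/fs/cgroup/memory/test # cat memory.memfs_files_info    # dump rootfs/tmpfs files with pages charged to this memcg
# echo 100 > max_print_files_in_oom                         # cap the number of files dumped on OOM (location assumed)
# echo 1048576 > size_threshold                             # only report files of at least 1 MiB (location assumed)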
-
- 20 March 2022, 1 commit
-
-
Submitted by Liu Shixin
hulk inclusion category: bugfix bugzilla: 46904 https://gitee.com/openeuler/kernel/issues/I4Y0XO -------------------------------- When all processes in the memory cgroup have finished, some memory may still be occupied, such as file cache. Use mem_cgroup_force_empty to reclaim the pages charged to the memory cgroup before merging all pages. Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 19 January 2022, 5 commits
-
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG CVE: NA -------------------------------- Add new interface "dhugetlb.disable_normal_pages" to disable the allocation of normal pages from a hpool. This makes dynamic hugetlb more flexible. Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
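A minimal sketch, assuming the file sits in the memcg directory alongside the other dhugetlb interfaces:
/sys/fs/cgroup/memory/test # echo 1 > dhugetlb.disable_normal_pages   # stop handing out normal (4K) pages from this hpool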
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG CVE: NA -------------------------------- Add a function to allocate pages from a dhugetlb_pool. When a process is bound to a mem_cgroup configured with a dhugetlb_pool, pages are allocated from the dhugetlb_pool first. If there is no page in the dhugetlb_pool, fall back to allocating pages from the buddy system. As the process allocates pages from the dhugetlb_pool of its mem_cgroup, the process is not allowed to migrate to another mem_cgroup. Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG CVE: NA -------------------------------- Add two interfaces in mem_cgroup to configure the count of 1G/2M hugepages in dhugetlb_pool. Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG CVE: NA -------------------------------- Dynamic hugetlb is a self-developed feature based on hugetlb and memcontrol. It supports splitting huge pages dynamically in a memory cgroup. There is a new structure dhugetlb_pool in every mem_cgroup to manage the pages configured to the mem_cgroup. For a mem_cgroup configured with a dhugetlb_pool, processes in the mem_cgroup will preferentially use the pages in the dhugetlb_pool. Dynamic hugetlb supports three types of pages, including 1G/2M huge pages and 4K pages. For a mem_cgroup configured with a dhugetlb_pool, processes are limited to allocating 1G/2M huge pages only from the dhugetlb_pool, but there is no such constraint for 4K pages. If there are insufficient 4K pages in the dhugetlb_pool, pages can also be allocated from the buddy system. So before using dynamic hugetlb, users must know how many huge pages they need. Usage: 1. Add 'dynamic_hugetlb=on' to the cmdline to enable the dynamic hugetlb feature. 2. Prealloc some 1G hugepages through hugetlb. 3. Create a mem_cgroup and configure a dhugetlb_pool for the mem_cgroup. 4. Configure the count of 1G/2M hugepages; the remaining pages in the dhugetlb_pool will be used as basic pages. 5. Bind a process to the mem_cgroup; its memory will then be allocated from the dhugetlb_pool. This patch adds the corresponding structure dhugetlb_pool for the dynamic hugetlb feature, the interface 'dhugetlb.nr_pages' in mem_cgroup to configure the dhugetlb_pool, and the cmdline option 'dynamic_hugetlb=on' to enable the feature. Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
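The usage steps above written out as an illustrative walkthrough; the pool size, the group name and the value format written to dhugetlb.nr_pages are assumptions, while the hugetlb sysfs path and the cgroup v1 tasks file are standard:
# 1. boot with dynamic_hugetlb=on on the kernel command line
# 2. preallocate 1G huge pages through hugetlb
echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# 3./4. create a mem_cgroup and configure its dhugetlb pool
mkdir /sys/fs/cgroup/memory/test
echo 4 > /sys/fs/cgroup/memory/test/dhugetlb.nr_pages       # value format is an assumption
# 5. bind a process to the mem_cgroup; its memory is then served from the pool
echo $$ > /sys/fs/cgroup/memory/test/tasks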
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG CVE: NA -------------------------------- There are several functions that will be used in next patches for dynamic hugetlb feature. Declare them. No functional changes. Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 07 January 2022, 4 commits
-
-
Submitted by Lu Jialin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK?from=project-issue CVE: NA -------- This patch adds a default-false static key that disables the memcg kswapd feature. Users can enable it by setting memcg_kswapd on the kernel cmdline. Signed-off-by: Lu Jialin <lujialin4@huawei.com> Reviewed-by: weiyang wang <wangweiyang2@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Lu Jialin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK?from=project-issue CVE: NA -------- Since memory.high reclaim is synchronous and not in interrupt context, it can do more work than direct reclaim, i.e. write out dirty pages, etc. So add the PF_KSWAPD flag, so that current_is_kswapd() returns true for memcg kswapd. Memcg kswapd should stop when the usage of the memcg reaches the memcg kswapd stop flag. When userland sets memcg->memory.max, the stop_flag is (memcg->memory.high - memcg->memory.max * 10 / 1000), which is similar to global kswapd. Otherwise, the stop_flag is (memcg->memory.high - memcg->memory.high / 6), which is similar to the typical difference between watermark_low and watermark_high. Also, memcg kswapd should not break memory.low protection for now. Signed-off-by: Lu Jialin <lujialin4@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: weiyang wang <wangweiyang2@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
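A worked example of the stop thresholds described above, with illustrative numbers:
if userland has set memory.max = 10 GB: stop_flag = memory.high - 10 GB * 10 / 1000 = memory.high - 100 MB
otherwise, with memory.high = 6 GB: stop_flag = 6 GB - 6 GB / 6 = 5 GB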
-
Submitted by Lu Jialin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK?from=project-issue CVE: NA -------- Export memory.high from cgroup v2 to cgroup v1. Therefore, when the usage of the memcg is larger than memory.high, some pages will be reclaimed before returning to userland, which throttles the process. Only export the memory.high entry in mem_cgroup_legacy_files and move the related functions in front of mem_cgroup_legacy_files; no other changes are needed. Signed-off-by: Lu Jialin <lujialin4@huawei.com> Reviewed-by: weiyang wang <wangweiyang2@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
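An illustrative use on cgroup v1, mirroring the cgroup v2 semantics (mount point and value are examples):
/sys/fs/cgroup/memory/test # echo 4G > memory.high          # reclaim above 4 GiB before returning to userland
/sys/fs/cgroup/memory/test # cat memory.high
4294967296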
-
Submitted by Lu Jialin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK?from=project-issue CVE: NA -------- Export memory.min and memory.low from cgroup v2 to cgroup v1, in order to reduce the negative impact between cgroups when system memory is insufficient. Only export the memory.{min,low} entries in mem_cgroup_legacy_files and move the related functions in front of mem_cgroup_legacy_files; no other changes are needed. Signed-off-by: Lu Jialin <lujialin4@huawei.com> Reviewed-by: weiyang wang <wangweiyang2@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
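And the protection files, used the same way as in the previous example (paths and values illustrative):
/sys/fs/cgroup/memory/test # echo 1G > memory.min           # memory below this is not reclaimed
/sys/fs/cgroup/memory/test # echo 2G > memory.low           # best-effort protection below this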
-
- 30 December 2021, 1 commit
-
-
Submitted by Weilong Chen
ascend inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4K2U5 CVE: NA ------------------------------------------------- Support disabling the OOM killer, and report OOM events to bbox. vm.enable_oom_killer: 0: disable the OOM killer; 1: enable the OOM killer (default, compatible with mainline). Signed-off-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zhang Jian <zhangjian210@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
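A usage sketch of the sysctl named above:
echo 0 > /proc/sys/vm/enable_oom_killer                     # disable the OOM killer; OOM events are reported to bbox
echo 1 > /proc/sys/vm/enable_oom_killer                     # default, compatible with mainline behaviour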
-
- 06 December 2021, 1 commit
-
-
Submitted by Vasily Averin
stable inclusion from stable-5.10.80 commit 74293225f50391620aaef3507ebd6fd17e0003e1 bugzilla: 185821 https://gitee.com/openeuler/kernel/issues/I4L7CG Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=74293225f50391620aaef3507ebd6fd17e0003e1 -------------------------------- commit a4ebf1b6 upstream. Memory cgroup charging allows killed or exiting tasks to exceed the hard limit. It is assumed that the amount of the memory charged by those tasks is bound and most of the memory will get released while the task is exiting. This is resembling a heuristic for the global OOM situation when tasks get access to memory reserves. There is no global memory shortage at the memcg level so the memcg heuristic is more relieved. The above assumption is overly optimistic though. E.g. vmalloc can scale to really large requests and the heuristic would allow that. We used to have an early break in the vmalloc allocator for killed tasks but this has been reverted by commit b8c8a338 ("Revert "vmalloc: back off when the current task is killed""). There are likely other similar code paths which do not check for fatal signals in an allocation&charge loop. Also there are some kernel objects charged to a memcg which are not bound to a process life time. It has been observed that it is not really hard to trigger these bypasses and cause global OOM situation. One potential way to address these runaways would be to limit the amount of excess (similar to the global OOM with limited oom reserves). This is certainly possible but it is not really clear how much of an excess is desirable and still protects from global OOMs as that would have to consider the overall memcg configuration. This patch is addressing the problem by removing the heuristic altogether. Bypass is only allowed for requests which either cannot fail or where the failure is not desirable while excess should be still limited (e.g. atomic requests). Implementation wise a killed or dying task fails to charge if it has passed the OOM killer stage. That should give all forms of reclaim chance to restore the limit before the failure (ENOMEM) and tell the caller to back off. In addition, this patch renames should_force_charge() helper to task_is_dying() because now its use is not associated witch forced charging. This patch depends on pagefault_out_of_memory() to not trigger out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM and cause a global OOM killer. Link: https://lkml.kernel.org/r/8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.comSigned-off-by: NVasily Averin <vvs@virtuozzo.com> Suggested-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Shakeel Butt <shakeelb@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NChen Jun <chenjun102@huawei.com> Reviewed-by: NWeilong Chen <chenweilong@huawei.com> Acked-by: NWeilong Chen <chenweilong@huawei.com> Signed-off-by: NChen Jun <chenjun102@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 30 November 2021, 11 commits
-
-
Submitted by Shakeel Butt
mainline inclusion from mainline-v5.16-rc1 commit fd25a9e0 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fd25a9e0e23b --------------------------------------------------------------------- The memcg stats can be flushed in multiple context and potentially in parallel too. For example multiple parallel user space readers for memcg stats will contend on the rstat locks with each other. There is no need for that. We just need one flusher and everyone else can benefit. In addition after aa48e47e ("memcg: infrastructure to flush memcg stats") the kernel periodically flush the memcg stats from the root, so, the other flushers will potentially have much less work to do. Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Michal Koutný" <mkoutny@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Shakeel Butt
mainline inclusion from mainline-v5.16-rc1 commit 11192d9c category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11192d9c124d ---------------------------------------------------------------------- At the moment, the kernel flushes the memcg stats on every refault and also on every reclaim iteration. Although rstat maintains per-cpu update tree but on the flush the kernel still has to go through all the cpu rstat update tree to check if there is anything to flush. This patch adds the tracking on the stats update side to make flush side more clever by skipping the flush if there is no update. The stats update codepath is very sensitive performance wise for many workloads and benchmarks. So, we can not follow what the commit aa48e47e ("memcg: infrastructure to flush memcg stats") did which was triggering async flush through queue_work() and caused a lot performance regression reports. That got reverted by the commit 1f828223 ("memcg: flush lruvec stats in the refault"). In this patch we kept the stats update codepath very minimal and let the stats reader side to flush the stats only when the updates are over a specific threshold. For now the threshold is (nr_cpus * CHARGE_BATCH). To evaluate the impact of this patch, an 8 GiB tmpfs file is created on a system with swap-on-zram and the file was pushed to swap through memory.force_empty interface. On reading the whole file, the memcg stat flush in the refault code path is triggered. With this patch, we observed 63% reduction in the read time of 8 GiB file. Link: https://lkml.kernel.org/r/20211001190040.48086-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Reviewed-by: N"Michal Koutný" <mkoutny@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Shakeel Butt
mainline inclusion from mainline-v5.15-rc3 commit 1f828223 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f828223b799 ----------------------------------------------------------------------- Prior to the commit 7e1c0d6f ("memcg: switch lruvec stats to rstat") and the commit aa48e47e ("memcg: infrastructure to flush memcg stats"), each lruvec memcg stats can be off by (nr_cgroups * nr_cpus * 32) at worst and for unbounded amount of time. The commit aa48e47e moved the lruvec stats to rstat infrastructure and the commit 7e1c0d6f bounded the error for all the lruvec stats to (nr_cpus * 32) at worst for at most 2 seconds. More specifically it decoupled the number of stats and the number of cgroups from the error rate. However this reduction in error comes with the cost of triggering the slowpath of stats update more frequently. Previously in the slowpath the kernel adds the stats up the memcg tree. After aa48e47e, the kernel triggers the asyn lruvec stats flush through queue_work(). This causes regression reports from 0day kernel bot [1] as well as from phoronix test suite [2]. We tried two options to fix the regression: 1) Increase the threshold to trigger the slowpath in lruvec stats update codepath from 32 to 512. 2) Remove the slowpath from lruvec stats update codepath and instead flush the stats in the page refault codepath. The assumption is that the kernel timely flush the stats, so, the update tree would be small in the refault codepath to not cause the preformance impact. Following are the results of will-it-scale/page_fault[1|2|3] benchmark on four settings i.e. (1) 5.15-rc1 as baseline (2) 5.15-rc1 with aa48e47e and 7e1c0d6f reverted (3) 5.15-rc1 with option-1 (4) 5.15-rc1 with option-2. test (1) (2) (3) (4) pg_f1 368563 406277 (10.23%) 399693 (8.44%) 416398 (12.97%) pg_f2 338399 372133 (9.96%) 369180 (9.09%) 381024 (12.59%) pg_f3 500853 575399 (14.88%) 570388 (13.88%) 576083 (15.02%) From the above result, it seems like the option-2 not only solves the regression but also improves the performance for at least these benchmarks. Feng Tang (intel) ran the aim7 benchmark with these two options and confirms that option-1 reduces the regression but option-2 removes the regression. Michael Larabel (phoronix) ran multiple benchmarks with these options and reported the results at [3] and it shows for most benchmarks option-2 removes the regression introduced by the commit aa48e47e ("memcg: infrastructure to flush memcg stats"). Based on the experiment results, this patch proposed the option-2 as the solution to resolve the regression. 
Link: https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020 [1] Link: https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress [2] Link: https://openbenchmarking.org/result/2109226-DEBU-LINUX5104 [3] Fixes: aa48e47e ("memcg: infrastructure to flush memcg stats") Signed-off-by: NShakeel Butt <shakeelb@google.com> Tested-by: NMichael Larabel <Michael@phoronix.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hillf Danton <hdanton@sina.com>, Cc: Michal Koutný <mkoutny@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org>, Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Shakeel Butt
mainline inclusion from mainline-v5.15-rc1 commit aa48e47e category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=aa48e47e3906 ------------------------------------------------------ At the moment memcg stats are read in four contexts: 1. memcg stat user interfaces 2. dirty throttling 3. page fault 4. memory reclaim Currently the kernel flushes the stats for first two cases. Flushing the stats for remaining two casese may have performance impact. Always flushing the memcg stats on the page fault code path may negatively impacts the performance of the applications. In addition flushing in the memory reclaim code path, though treated as slowpath, can become the source of contention for the global lock taken for stat flushing because when system or memcg is under memory pressure, many tasks may enter the reclaim path. This patch uses following mechanisms to solve these challenges: 1. Periodically flush the stats from root memcg every 2 seconds. This will time limit the out of sync stats. 2. Asynchronously flush the stats after fixed number of stat updates. In the worst case the stat can be out of sync by O(nr_cpus * BATCH) for 2 seconds. 3. For avoiding thundering herd to flush the stats particularly from the memory reclaim context, introduce memcg local spinlock and let only one flusher active at a time. This could have been done through cgroup_rstat_lock lock but that lock is used by other subsystem and for userspace reading memcg stats. So, it is better to keep flushers introduced by this patch decoupled from cgroup_rstat_lock. However we would have to use irqsafe version of rstat flush but that is fine as this code path will be flushing for whole tree and do the work for everyone. No one will be waiting for that worker. [shakeelb@google.com: fix sleep-in-wrong context bug] Link: https://lkml.kernel.org/r/20210716212137.1391164-2-shakeelb@google.com Link: https://lkml.kernel.org/r/20210714013948.270662-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Tested-by: NMarek Szyprowski <m.szyprowski@samsung.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflict: include/linux/memcontrol.h Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Shakeel Butt
mainline inclusion from mainline-v5.15-rc1 commit 7e1c0d6f category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7e1c0d6f58207 ------------------------------------------------------------------------- The commit 2d146aa3 ("mm: memcontrol: switch to rstat") switched memcg stats to rstat infrastructure but skipped the conversion of the lruvec stats as such stats are read in the performance critical code paths and flushing stats may have impacted the performances of the applications. This patch converts the lruvec stats to rstat and later patches add mechanisms to keep the performance impact to minimum. The rstat conversion comes with the price i.e. memory cost. Effectively this patch reverts the savings done by the commit f3344adf ("mm: memcontrol: optimize per-lruvec stats counter memory usage"). However this cost is justified due to negative impact of the inaccurate lruvec stats on many heuristics. One such case is reported in [1]. The memory reclaim code is filled with plethora of heuristics and many of those heuristics reads the lruvec stats. So, inaccurate stats can make such heuristics ineffective. [1] reports the impact of inaccurate lruvec stats on the "cache trim mode" heuristic. Inaccurate lruvec stats can impact the deactivation and aging anon heuristics as well. [1] https://lore.kernel.org/linux-mm/20210311004449.1170308-1-ying.huang@intel.com/ Link: https://lkml.kernel.org/r/20210716212137.1391164-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20210714013948.270662-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Koutný <mkoutny@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Johannes Weiner
mainline inclusion from mainline-v5.14-rc4 commit 30def935 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=30def93565e5b ------------------------------------------------------------------------ Dan Carpenter reports: The patch 2d146aa3: "mm: memcontrol: switch to rstat" from Apr 29, 2021, leads to the following static checker warning: kernel/cgroup/rstat.c:200 cgroup_rstat_flush() warn: sleeping in atomic context mm/memcontrol.c 3572 static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) 3573 { 3574 unsigned long val; 3575 3576 if (mem_cgroup_is_root(memcg)) { 3577 cgroup_rstat_flush(memcg->css.cgroup); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This is from static analysis and potentially a false positive. The problem is that mem_cgroup_usage() is called from __mem_cgroup_threshold() which holds an rcu_read_lock(). And the cgroup_rstat_flush() function can sleep. 3578 val = memcg_page_state(memcg, NR_FILE_PAGES) + 3579 memcg_page_state(memcg, NR_ANON_MAPPED); 3580 if (swap) 3581 val += memcg_page_state(memcg, MEMCG_SWAP); 3582 } else { 3583 if (!swap) 3584 val = page_counter_read(&memcg->memory); 3585 else 3586 val = page_counter_read(&memcg->memsw); 3587 } 3588 return val; 3589 } __mem_cgroup_threshold() indeed holds the rcu lock. In addition, the thresholding code is invoked during stat changes, and those contexts have irqs disabled as well. If the lock breaking occurs inside the flush function, it will result in a sleep from an atomic context. Use the irqsafe flushing variant in mem_cgroup_usage() to fix this. Link: https://lkml.kernel.org/r/20210726150019.251820-1-hannes@cmpxchg.org Fixes: 2d146aa3 ("mm: memcontrol: switch to rstat") Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reported-by: NDan Carpenter <dan.carpenter@oracle.com> Acked-by: NChris Down <chris@chrisdown.name> Reviewed-by: NRik van Riel <riel@surriel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Johannes Weiner
mainline inclusion from mainline-v5.13-rc1 commit 2cd21c89 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cd21c89800c ----------------------------------------------------------------------- There are two functions to flush the per-cpu data of an lruvec into the rest of the cgroup tree: when the cgroup is being freed, and when a CPU disappears during hotplug. The difference is whether all CPUs or just one is being collected, but the rest of the flushing code is the same. Merge them into one function and share the common code. Link: https://lkml.kernel.org/r/20210209163304.77088-8-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NRoman Gushchin <guro@fb.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Johannes Weiner
mainline inclusion from mainline-v5.13-rc1 commit 2d146aa3 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2d146aa3aa84 ---------------------------------------------------------------------- Replace the memory controller's custom hierarchical stats code with the generic rstat infrastructure provided by the cgroup core. The current implementation does batched upward propagation from the write side (i.e. as stats change). The per-cpu batches introduce an error, which is multiplied by the number of subgroups in a tree. In systems with many CPUs and sizable cgroup trees, the error can be large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32 subgroups results in an error of up to 128M per stat item). This can entirely swallow allocation bursts inside a workload that the user is expecting to see reflected in the statistics. In the past, we've done read-side aggregation, where a memory.stat read would have to walk the entire subtree and add up per-cpu counts. This became problematic with lazily-freed cgroups: we could have large subtrees where most cgroups were entirely idle. Hence the switch to change-driven upward propagation. Unfortunately, it needed to trade accuracy for speed due to the write side being so hot. Rstat combines the best of both worlds: from the write side, it cheaply maintains a queue of cgroups that have pending changes, so that the read side can do selective tree aggregation. This way the reported stats will always be precise and recent as can be, while the aggregation can skip over potentially large numbers of idle cgroups. The way rstat works is that it implements a tree for tracking cgroups with pending local changes, as well as a flush function that walks the tree upwards. The controller then drives this by 1) telling rstat when a local cgroup stat changes (e.g. mod_memcg_state) and 2) when a flush is required to get uptodate hierarchy stats for a given subtree (e.g. when memory.stat is read). The controller also provides a flush callback that is called during the rstat flush walk for each cgroup and aggregates its local per-cpu counters and propagates them upwards. This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT + NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward aggregation. It removes 3 words from the per-cpu data. It eliminates memcg_exact_page_state(), since memcg_page_state() is now exact. [akpm@linux-foundation.org: merge fix] [hannes@cmpxchg.org: fix a sleep in atomic section problem] Link: https://lkml.kernel.org/r/20210315234100.64307-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20210209163304.77088-7-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRoman Gushchin <guro@fb.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NMichal Koutný <mkoutny@suse.com> Acked-by: NBalbir Singh <bsingharora@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflict: include/linux/memcontrol.h Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Johannes Weiner
mainline inclusion from mainline-v5.13-rc1 commit a18e6e6e category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a18e6e6e150a ------------------------------------------------------------------- There are no users outside of the memory controller itself. The rest of the kernel cares either about node or lruvec stats. Link: https://lkml.kernel.org/r/20210209163304.77088-4-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NRoman Gushchin <guro@fb.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NMichal Koutný <mkoutny@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflict: include/linux/memcontrol.h Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Johannes Weiner
mainline inclusion from mainline-v5.13-rc1 commit a3747b53 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a3747b53b1771 ---------------------------------------------------- No need to encapsulate a simple struct member access. Link: https://lkml.kernel.org/r/20210209163304.77088-3-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NRoman Gushchin <guro@fb.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NMichal Koutný <mkoutny@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflict: mm/memcontrol.c Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Johannes Weiner
mainline inclusion from mainline-5.13-rc1 commit a3d4c05a category: feature bugzilla:185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a3d4c05a447486 --------------------------------------------------- Patch series "mm: memcontrol: switch to rstat", v3. This series converts memcg stats tracking to the streamlined rstat infrastructure provided by the cgroup core code. rstat is already used by the CPU controller and the IO controller. This change is motivated by recent accuracy problems in memcg's custom stats code, as well as the benefits of sharing common infra with other controllers. The current memcg implementation does batched tree aggregation on the write side: local stat changes are cached in per-cpu counters, which are then propagated upward in batches when a threshold (32 pages) is exceeded. This is cheap, but the error introduced by the lazy upward propagation adds up: 32 pages times CPUs times cgroups in the subtree. We've had complaints from service owners that the stats do not reliably track and react to allocation behavior as expected, sometimes swallowing the results of entire test applications. The original memcg stat implementation used to do tree aggregation exclusively on the read side: local stats would only ever be tracked in per-cpu counters, and a memory.stat read would iterate the entire subtree and sum those counters up. This didn't keep up with the times: - Cgroup trees are much bigger now. We switched to lazily-freed cgroups, where deleted groups would hang around until their remaining page cache has been reclaimed. This can result in large subtrees that are expensive to walk, while most of the groups are idle and their statistics don't change much anymore. - Automated monitoring increased. With the proliferation of userspace oom killing, proactive reclaim, and higher-resolution logging of workload trends in general, top-level stat files are polled at least once a second in many deployments. - The lifetime of cgroups got shorter. Where most cgroup setups in the past would have a few large policy-oriented cgroups for everything running on the system, newer cgroup deployments tend to create one group per application - which gets deleted again as the processes exit. An aggregation scheme that doesn't retain child data inside the parents loses event history of the subtree. Rstat addresses all three of those concerns through intelligent, persistent read-side aggregation. As statistics change at the local level, rstat tracks - on a per-cpu basis - only those parts of a subtree that have changes pending and require aggregation. The actual aggregation occurs on the colder read side - which can now skip over (potentially large) numbers of recently idle cgroups. === The test_kmem cgroup selftest is currently failing due to excessive cumulative vmstat drift from 100 subgroups: ok 1 test_kmem_basic memory.current = 8810496 slab + anon + file + kernel_stack = 17074568 slab = 6101384 anon = 946176 file = 0 kernel_stack = 10027008 not ok 2 test_kmem_memcg_deletion ok 3 test_kmem_proc_kpagecgroup ok 4 test_kmem_kernel_stacks ok 5 test_kmem_dead_cgroups ok 6 test_percpu_basic As you can see, memory.stat items far exceed memory.current. The kernel stack alone is bigger than all of charged memory. That's because the memory of the test has been uncharged from memory.current, but the negative vmstat deltas are still sitting in the percpu caches. 
The test at this time isn't even counting percpu, pagetables etc. yet, which would further contribute to the error. The last patch in the series updates the test to include them - as well as reduces the vmstat tolerances in general to only expect page_counter batching. With all patches applied, the (now more stringent) test succeeds: ok 1 test_kmem_basic ok 2 test_kmem_memcg_deletion ok 3 test_kmem_proc_kpagecgroup ok 4 test_kmem_kernel_stacks ok 5 test_kmem_dead_cgroups ok 6 test_percpu_basic === A kernel build test confirms that overhead is comparable. Two kernels are built simultaneously in a nested tree with several idle siblings: root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16) `- build-b (defconfig, make -j16) `- idle-1 `- ... `- idle-9 During the builds, kernelbuild/memory.stat is read once a second. A perf diff shows that the changes in cycle distribution is minimal. Top 10 kernel symbols: 0.09% +0.08% [kernel.kallsyms] [k] __mod_memcg_lruvec_state 0.00% +0.06% [kernel.kallsyms] [k] cgroup_rstat_updated 0.08% -0.05% [kernel.kallsyms] [k] __mod_memcg_state.part.0 0.16% -0.04% [kernel.kallsyms] [k] release_pages 0.00% +0.03% [kernel.kallsyms] [k] __count_memcg_events 0.01% +0.03% [kernel.kallsyms] [k] mem_cgroup_charge_statistics.constprop.0 0.10% -0.02% [kernel.kallsyms] [k] get_mem_cgroup_from_mm 0.05% -0.02% [kernel.kallsyms] [k] mem_cgroup_update_lru_size 0.57% +0.01% [kernel.kallsyms] [k] asm_exc_page_fault === The on-demand aggregated stats are now fully accurate: $ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \ grep -e inactive_file /sys/fs/cgroup/memory.stat vanilla: patched: nr_inactive_file 1574105088 nr_inactive_file 1027801088 inactive_file 1577410560 inactive_file 1027801088 === This patch (of 8): The memcg hotunplug callback erroneously flushes counts on the local CPU, not the counts of the CPU going away; those counts will be lost. Flush the CPU that is actually going away. Also simplify the code a bit by using mod_memcg_state() and count_memcg_events() instead of open-coding the upward flush - this is comparable to how vmstat.c handles hotunplug flushing. Link: https://lkml.kernel.org/r/20210209163304.77088-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20210209163304.77088-2-hannes@cmpxchg.org Fixes: a983b5eb ("mm: memcontrol: fix excessive complexity in memory.stat reporting") Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NMichal Koutný <mkoutny@suse.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-