1. 09 Feb, 2023 1 commit
  2. 01 Dec, 2022 1 commit
  3. 23 Nov, 2022 1 commit
    • cgroup: support cgroup writeback on cgroupv1 · 644547a9
      Lu Jialin authored
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZG61
      
      -------------------------------
      
      In cgroupv1, cgroup writeback is not supported, due to two problems:
      1) Blkcg_css and memcg_css are mounted on different cgroup trees.
         Therefore, a blkcg_css cannot be found from a given memcg_css.
      2) Buffer I/O is performed by a kthread, which lives in the root_blkcg.
         Therefore, blkcg cannot limit the wbps and wiops of buffer I/O.
      
      We solve these two problems to support cgroup writeback on cgroupv1:
      1) A memcg is attached to the blkcg_root css when the memcg is created.
      2) A member "wb_blkio_ino" is added to mem_cgroup_legacy_files.
         Users can attach a memcg to a certain blkcg by echoing the inode
         number of the blkcg directory into the memcg's wb_blkio_ino file.
      3) inode_cgwb_enabled() returns true when the memory and io controllers
         are both mounted on cgroupv2 or both on cgroupv1.
      4) Buffer I/O can find a blkcg according to its memcg.
      
      Thus, a memcg can find a certain blkcg, and cgroup writeback can be
      supported on cgroupv1.
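      A minimal usage sketch of the flow above, assuming the new file is
      exposed as memory.wb_blkio_ino, with an illustrative group name and
      inode number (both v1 controllers mounted at their conventional
      paths; none of these names are confirmed by the patch text):
      
      [root@test ~]# mkdir /sys/fs/cgroup/blkio/wbgrp /sys/fs/cgroup/memory/wbgrp
      [root@test ~]# stat -c %i /sys/fs/cgroup/blkio/wbgrp
      1234
      [root@test ~]# echo 1234 > /sys/fs/cgroup/memory/wbgrp/memory.wb_blkio_ino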
      Signed-off-by: Lu Jialin <lujialin4@huawei.com>
  4. 15 Nov, 2022 1 commit
  5. 07 Sep, 2022 2 commits
  6. 09 Aug, 2022 4 commits
  7. 06 Jul, 2022 1 commit
  8. 22 Jun, 2022 1 commit
    • mm: memcontrol: add the flag_stat file · 5cda4079
      tatataeki authored
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4MC3F
      CVE: NA
      
      ----------------------------------
      
      Several operations on cgroups in cgroup v1 depend on the status of
      the cgroup. The status of the current cgroup can be displayed in
      cgroup v2, but not in cgroup v1, so a flag_stat file is added to
      the memory cgroup to display the status of the current cgroup and
      its sub-cgroups.
      
      Testing result:
      List the status of user.slice
      [root@test user.slice]# cat memory.flag_stat
      NO_REF 0
      ONLINE 1
      RELEASED 0
      VISIBLE 1
      DYING 0
      CHILD_NO_REF 0
      CHILD_ONLINE 1
      CHILD_RELEASED 0
      CHILD_VISIBLE 1
      CHILD_DYING 0
      
      Create a new cgroup in user.slice
      [root@test user.slice]# mkdir user-test
      
      List the current status of user.slice after operation above
      [root@test user.slice]# cat memory.flag_stat
      NO_REF 0
      ONLINE 1
      RELEASED 0
      VISIBLE 1
      DYING 0
      CHILD_NO_REF 0
      CHILD_ONLINE 2
      CHILD_RELEASED 0
      CHILD_VISIBLE 2
      CHILD_DYING 0
      Signed-off-by: tatataeki <shengzeyu19_98@163.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  9. 07 Jun, 2022 2 commits
    • memcg: introduce per-memcg reclaim interface for cgroup v1 · f698ccf8
      Chen Wandun authored
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I545DF
      CVE: NA
      
      --------------------------------
      
      Introduce a per-memcg reclaim interface for cgroup v1, and
      disable memory reclaim for the root memcg.
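      A minimal sketch of the v1 usage, assuming the backported file keeps
      the upstream name memory.reclaim, the v1 memory controller is
      mounted at /sys/fs/cgroup/memory, and a child group named job
      exists (all names are illustrative assumptions):
      
      [root@test ~]# echo 50M > /sys/fs/cgroup/memory/job/memory.reclaim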
      Signed-off-by: Chen Wandun <chenwandun@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • memcg: introduce per-memcg reclaim interface · 50b7afef
      Shakeel Butt authored
      mainline inclusion
      from mainline-v5.19-rc1
      commit 94968384
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I545DF
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=94968384dde15d48263bfc59d280cd71b1259d8c
      
      --------------------------------
      
      This patch series adds a memory.reclaim proactive reclaim interface.
      The rationale behind the interface and how it works are in the first
      patch.
      
      This patch (of 4):
      
      Introduce a memcg interface to trigger memory reclaim on a memory cgroup.
      
      Use case: Proactive Reclaim
      ---------------------------
      
      A userspace proactive reclaimer can continuously probe the memcg to
      reclaim a small amount of memory.  This gives more accurate and up-to-date
      workingset estimation as the LRUs are continuously sorted and can
      potentially provide more deterministic memory overcommit behavior.  The
      memory overcommit controller can provide more proactive response to the
      changing behavior of the running applications instead of being reactive.
      
      A userspace reclaimer's purpose in this case is not to completely
      replace kswapd or direct reclaim; it is to proactively identify
      memory savings opportunities and reclaim some amount of cold pages,
      as set by the policy, to free up memory for more demanding jobs or
      for scheduling new jobs.
      
      A user space proactive reclaimer is used in Google data centers.
      Additionally, Meta's TMO paper recently referenced a very similar
      interface used for user space proactive reclaim:
      https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
      
      Benefits of a user space reclaimer:
      -----------------------------------
      
      1) More flexibility over who should be charged for the cpu time of
         memory reclaim.  For proactive reclaim, it makes more sense for
         this to be centralized.
      
      2) More flexible on dedicating the resources (like cpu).  The memory
         overcommit controller can balance the cost between the cpu usage and
         the memory reclaimed.
      
      3) Provides a way for applications to keep their LRUs sorted, so
         better reclaim candidates are selected under memory pressure.
         This also gives a more accurate and up-to-date notion of the
         working set for an application.
      
      Why is memory.high not enough?
      ------------------------------
      
      - memory.high can be used to trigger reclaim in a memcg and can
        potentially be used for proactive reclaim.  However there is a big
        downside in using memory.high.  It can potentially introduce high
        reclaim stalls in the target application as the allocations from the
        processes or the threads of the application can hit the temporary
        memory.high limit.
      
      - Userspace proactive reclaimers usually use feedback loops to decide
        how much memory to proactively reclaim from a workload.  The metrics
        used for this are usually either refaults or PSI, and these metrics will
        become messy if the application gets throttled by hitting the high
        limit.
      
      - memory.high is a stateful interface, if the userspace proactive
        reclaimer crashes for any reason while triggering reclaim it can leave
        the application in a bad state.
      
      - If a workload is rapidly expanding, setting memory.high to proactively
        reclaim memory can result in actually reclaiming more memory than
        intended.
      
      The benefits of such interface and shortcomings of existing interface were
      further discussed in this RFC thread:
      https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
      
      Interface:
      ----------
      
      Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
      trigger reclaim in the target memory cgroup.
      
      The interface is introduced as a nested-keyed file to allow for future
      optional arguments to be easily added to configure the behavior of
      reclaim.
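      A minimal sketch of the basic usage on a cgroup v2 hierarchy,
      assuming it is mounted at /sys/fs/cgroup and a child group named
      workload exists (both names are illustrative); the write reports
      success only if the requested amount could be reclaimed:
      
      [root@test ~]# echo 10M > /sys/fs/cgroup/workload/memory.reclaim
      [root@test ~]# echo $?
      0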
      
      Possible Extensions:
      --------------------
      
      - This interface can be extended with an additional parameter or flags
        to allow specifying one or more types of memory to reclaim from (e.g.
        file, anon, ..).
      
      - The interface can also be extended with a node mask to reclaim from
        specific nodes. This has use cases for reclaim-based demotion in
        memory tiering systems.
      
      - A similar per-node interface can also be added to support proactive
        reclaim and reclaim-based demotion in systems without memcg.
      
      - Add a timeout parameter to make it easier for user space to call the
        interface without worrying about being blocked for an undefined amount
        of time.
      
      For now, let's keep things simple by adding the basic functionality.
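      Purely for illustration, a hypothetical nested-keyed invocation of
      the extensions sketched above; the type= argument does not exist in
      this patch and is shown only to convey the nested-keyed format:
      
      [root@test ~]# echo "10M type=file" > /sys/fs/cgroup/workload/memory.reclaim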
      
      [yosryahmed@google.com: worked on versions v2 onwards, refreshed to
      current master, updated commit message based on recent
      discussions and use cases]
      Link: https://lkml.kernel.org/r/20220425190040.2475377-1-yosryahmed@google.com
      Link: https://lkml.kernel.org/r/20220425190040.2475377-2-yosryahmed@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Co-developed-by: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Chen Wandun <chenwandun@huawei.com>
      Cc: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Chen Wandun <chenwandun@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  10. 23 May, 2022 1 commit
    • mm: memcg: synchronize objcg lists with a dedicated spinlock · 9ba3dc33
      Roman Gushchin authored
      stable inclusion
      from stable-v5.10.102
      commit 8c8385972ea96adeb9b678c9390beaa4d94c4aae
      bugzilla: https://gitee.com/openeuler/kernel/issues/I567K6
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8c8385972ea96adeb9b678c9390beaa4d94c4aae
      
      --------------------------------
      
      commit 0764db9b upstream.
      
      Alexander reported a circular lock dependency revealed by the mmap1 ltp
      test:
      
        LOCKDEP_CIRCULAR (suite: ltp, case: mtest06 (mmap1))
                WARNING: possible circular locking dependency detected
                5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1 Not tainted
                ------------------------------------------------------
                mmap1/202299 is trying to acquire lock:
                00000001892c0188 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x4a/0xe0
                but task is already holding lock:
                00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
                which lock already depends on the new lock.
                the existing dependency chain (in reverse order) is:
                -> #1 (&sighand->siglock){-.-.}-{2:2}:
                       __lock_acquire+0x604/0xbd8
                       lock_acquire.part.0+0xe2/0x238
                       lock_acquire+0xb0/0x200
                       _raw_spin_lock_irqsave+0x6a/0xd8
                       __lock_task_sighand+0x90/0x190
                       cgroup_freeze_task+0x2e/0x90
                       cgroup_migrate_execute+0x11c/0x608
                       cgroup_update_dfl_csses+0x246/0x270
                       cgroup_subtree_control_write+0x238/0x518
                       kernfs_fop_write_iter+0x13e/0x1e0
                       new_sync_write+0x100/0x190
                       vfs_write+0x22c/0x2d8
                       ksys_write+0x6c/0xf8
                       __do_syscall+0x1da/0x208
                       system_call+0x82/0xb0
                -> #0 (css_set_lock){..-.}-{2:2}:
                       check_prev_add+0xe0/0xed8
                       validate_chain+0x736/0xb20
                       __lock_acquire+0x604/0xbd8
                       lock_acquire.part.0+0xe2/0x238
                       lock_acquire+0xb0/0x200
                       _raw_spin_lock_irqsave+0x6a/0xd8
                       obj_cgroup_release+0x4a/0xe0
                       percpu_ref_put_many.constprop.0+0x150/0x168
                       drain_obj_stock+0x94/0xe8
                       refill_obj_stock+0x94/0x278
                       obj_cgroup_charge+0x164/0x1d8
                       kmem_cache_alloc+0xac/0x528
                       __sigqueue_alloc+0x150/0x308
                       __send_signal+0x260/0x550
                       send_signal+0x7e/0x348
                       force_sig_info_to_task+0x104/0x180
                       force_sig_fault+0x48/0x58
                       __do_pgm_check+0x120/0x1f0
                       pgm_check_handler+0x11e/0x180
                other info that might help us debug this:
                 Possible unsafe locking scenario:
                       CPU0                    CPU1
                       ----                    ----
                  lock(&sighand->siglock);
                                               lock(css_set_lock);
                                               lock(&sighand->siglock);
                  lock(css_set_lock);
                 *** DEADLOCK ***
                2 locks held by mmap1/202299:
                 #0: 00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
                 #1: 00000001892ad560 (rcu_read_lock){....}-{1:2}, at: percpu_ref_put_many.constprop.0+0x0/0x168
                stack backtrace:
                CPU: 15 PID: 202299 Comm: mmap1 Not tainted 5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1
                Hardware name: IBM 3906 M04 704 (LPAR)
                Call Trace:
                  dump_stack_lvl+0x76/0x98
                  check_noncircular+0x136/0x158
                  check_prev_add+0xe0/0xed8
                  validate_chain+0x736/0xb20
                  __lock_acquire+0x604/0xbd8
                  lock_acquire.part.0+0xe2/0x238
                  lock_acquire+0xb0/0x200
                  _raw_spin_lock_irqsave+0x6a/0xd8
                  obj_cgroup_release+0x4a/0xe0
                  percpu_ref_put_many.constprop.0+0x150/0x168
                  drain_obj_stock+0x94/0xe8
                  refill_obj_stock+0x94/0x278
                  obj_cgroup_charge+0x164/0x1d8
                  kmem_cache_alloc+0xac/0x528
                  __sigqueue_alloc+0x150/0x308
                  __send_signal+0x260/0x550
                  send_signal+0x7e/0x348
                  force_sig_info_to_task+0x104/0x180
                  force_sig_fault+0x48/0x58
                  __do_pgm_check+0x120/0x1f0
                  pgm_check_handler+0x11e/0x180
                INFO: lockdep is turned off.
      
      In this example, a slab allocation from __send_signal() caused a
      refill and drain of a percpu objcg stock, resulting in the release
      of another, unrelated objcg.  The objcg release path requires taking
      css_set_lock, which is used to synchronize objcg lists.
      
      This can create a circular dependency with the sighandler lock, which is
      taken with the locked css_set_lock by the freezer code (to freeze a
      task).
      
      In general, using css_set_lock to synchronize objcg lists makes any
      slab allocation or deallocation risky when performed with
      css_set_lock, or any lock taken under it, held.
      
      To fix the problem and make the code more robust let's stop using
      css_set_lock to synchronize objcg lists and use a new dedicated spinlock
      instead.
      
      Link: https://lkml.kernel.org/r/Yfm1IHmoGdyUR81T@carbon.dhcp.thefacebook.com
      Fixes: bf4f0599 ("mm: memcg/slab: obj_cgroup API")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
      Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com>
      Reviewed-by: Waiman Long <longman@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Jeremy Linton <jeremy.linton@arm.com>
      Tested-by: Jeremy Linton <jeremy.linton@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yu Liao <liaoyu15@huawei.com>
      Reviewed-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  11. 07 Apr, 2022 2 commits
  12. 20 Mar, 2022 1 commit
  13. 19 Jan, 2022 5 commits
  14. 07 Jan, 2022 4 commits
  15. 30 Dec, 2021 1 commit
  16. 06 Dec, 2021 1 commit
    • memcg: prohibit unconditional exceeding the limit of dying tasks · 8937d4d7
      Vasily Averin authored
      stable inclusion
      from stable-5.10.80
      commit 74293225f50391620aaef3507ebd6fd17e0003e1
      bugzilla: 185821 https://gitee.com/openeuler/kernel/issues/I4L7CG
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=74293225f50391620aaef3507ebd6fd17e0003e1
      
      --------------------------------
      
      commit a4ebf1b6 upstream.
      
      Memory cgroup charging allows killed or exiting tasks to exceed the hard
      limit.  It is assumed that the amount of the memory charged by those
      tasks is bound and most of the memory will get released while the task
      is exiting.  This resembles the heuristic for the global OOM
      situation, where tasks get access to memory reserves.  There is no
      global memory shortage at the memcg level, so the memcg heuristic is
      more relaxed.
      
      The above assumption is overly optimistic though.  E.g.  vmalloc can
      scale to really large requests and the heuristic would allow that.  We
      used to have an early break in the vmalloc allocator for killed tasks
      but this has been reverted by commit b8c8a338 ("Revert "vmalloc:
      back off when the current task is killed"").  There are likely other
      similar code paths which do not check for fatal signals in an
      allocation&charge loop.  Also there are some kernel objects charged to a
      memcg which are not bound to a process life time.
      
      It has been observed that it is not really hard to trigger these
      bypasses and cause global OOM situation.
      
      One potential way to address these runaways would be to limit the amount
      of excess (similar to the global OOM with limited oom reserves).  This
      is certainly possible but it is not really clear how much of an excess
      is desirable and still protects from global OOMs as that would have to
      consider the overall memcg configuration.
      
      This patch is addressing the problem by removing the heuristic
      altogether.  Bypass is only allowed for requests which either cannot
      fail or where the failure is not desirable while excess should be still
      limited (e.g.  atomic requests).  Implementation wise a killed or dying
      task fails to charge if it has passed the OOM killer stage.  That should
      give all forms of reclaim chance to restore the limit before the failure
      (ENOMEM) and tell the caller to back off.
      
      In addition, this patch renames the should_force_charge() helper to
      task_is_dying(), because its use is no longer associated with forced
      charging.
      
      This patch depends on pagefault_out_of_memory() to not trigger
      out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM
      and cause a global OOM killer.
      
      Link: https://lkml.kernel.org/r/8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Reviewed-by: Weilong Chen <chenweilong@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  17. 30 Nov, 2021 11 commits