1. 17 Apr, 2020 4 commits
    • memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event · 61a600d3
      Chunguang Xu committed
      commit 7d36665a5886c27ca4c4d0afd3ecc50b400f3587 upstream.
      
      When an eventfd that monitors multiple memory thresholds of a cgroup is
      closed, the kernel deletes all events related to that eventfd.  If
      another eventfd starts monitoring a memory threshold of the same cgroup
      before all of those events are deleted, the kernel crashes:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000004
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        PGD 800000033058e067 P4D 800000033058e067 PUD 3355ce067 PMD 0
        Oops: 0002 [#1] SMP PTI
        CPU: 2 PID: 14012 Comm: kworker/2:6 Kdump: loaded Not tainted 5.6.0-rc4 #3
        Hardware name: LENOVO 20AWS01K00/20AWS01K00, BIOS GLET70WW (2.24 ) 05/21/2014
        Workqueue: events memcg_event_remove
        RIP: 0010:__mem_cgroup_usage_unregister_event+0xb3/0x190
        RSP: 0018:ffffb47e01c4fe18 EFLAGS: 00010202
        RAX: 0000000000000001 RBX: ffff8bb223a8a000 RCX: 0000000000000001
        RDX: 0000000000000001 RSI: ffff8bb22fb83540 RDI: 0000000000000001
        RBP: ffffb47e01c4fe48 R08: 0000000000000000 R09: 0000000000000010
        R10: 000000000000000c R11: 071c71c71c71c71c R12: ffff8bb226aba880
        R13: ffff8bb223a8a480 R14: 0000000000000000 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff8bb242680000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000004 CR3: 000000032c29c003 CR4: 00000000001606e0
        Call Trace:
          memcg_event_remove+0x32/0x90
          process_one_work+0x172/0x380
          worker_thread+0x49/0x3f0
          kthread+0xf8/0x130
          ret_from_fork+0x35/0x40
        CR2: 0000000000000004
      
      We can reproduce this problem as follows:
      
      1. Create a new cgroup subdirectory and a new eventfd, then monitor
         multiple memory thresholds of the cgroup through this eventfd.
      
      2. Close this eventfd.  __mem_cgroup_usage_unregister_event() will be
         called multiple times to delete all events related to this
         eventfd.
      
      The first time __mem_cgroup_usage_unregister_event() is called, the
      kernel clears all items related to this eventfd in
      thresholds->primary.
      
      Since there is currently only one eventfd, thresholds->primary becomes
      empty, so the kernel sets thresholds->primary and thresholds->spare
      to NULL.  If, at this point, the user creates a new eventfd and monitors
      a memory threshold of this cgroup, the kernel re-initializes
      thresholds->primary.
      
      When __mem_cgroup_usage_unregister_event() is then called for the
      second time, thresholds->primary is no longer empty, so the kernel
      accesses thresholds->spare.  But thresholds->spare is NULL, which
      triggers the crash.
      
      In general, the longer it takes to delete all events related to this
      eventfd, the easier it is to trigger this problem.
      
      The solution is to check whether the thresholds associated with the
      eventfd have been cleared when deleting the event.  If so, we do nothing.
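
The guard described above can be sketched in userspace C. This is a minimal model, not the kernel code: `struct thresholds`, `MAX_ENTRIES` and `unregister_event()` are hypothetical names, and the spare array is reduced to a pointer that becomes NULL once the last entry is removed.

```c
#include <stddef.h>

#define MAX_ENTRIES 8

/* Hypothetical, simplified model of one cgroup's threshold state. */
struct thresholds {
    int eventfds[MAX_ENTRIES]; /* owner id of each registered threshold */
    int size;                  /* number of valid entries in eventfds[] */
    int *spare;                /* scratch array; NULL once everything was removed */
};

/*
 * Remove every threshold owned by eventfd.
 * Returns the number of entries removed; 0 means there was nothing to do.
 */
int unregister_event(struct thresholds *t, int eventfd)
{
    int i, kept = 0, removed = 0;

    for (i = 0; i < t->size; i++) {
        if (t->eventfds[i] == eventfd)
            removed++;
    }

    /* The guard: nothing belongs to this eventfd any more, so return
     * before the spare array would be touched. */
    if (removed == 0)
        return 0;

    /* Compact the array, keeping only entries of other eventfds. */
    for (i = 0; i < t->size; i++) {
        if (t->eventfds[i] != eventfd)
            t->eventfds[kept++] = t->eventfds[i];
    }
    t->size = kept;
    if (kept == 0)
        t->spare = NULL; /* models the kernel dropping primary and spare */
    return removed;
}
```

A second call for the same eventfd, made after another eventfd has registered, returns 0 and leaves the state alone.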
      
      [akpm@linux-foundation.org: fix comment, per Kirill]
      Fixes: 907860ed ("cgroups: make cftype.unregister_event() void-returning")
      Signed-off-by: Chunguang Xu <brookxu@tencent.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/077a6f67-aefa-4591-efec-f2f3af2b0b02@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • net: memcg: late association of sock to memcg · b8386bd4
      Shakeel Butt committed
      [ Upstream commit d752a4986532cb6305dfd5290a614cde8072769d ]
      
      If a TCP socket is allocated in IRQ context, or cloned in IRQ context
      from an unassociated socket (i.e. one not associated with a memcg), it
      will remain unassociated for its whole life.  Almost half of the TCP
      sockets created on the system are created in IRQ context, so memory used
      by such sockets will not be accounted by the memcg.
      
      This issue is more widespread in cgroup v1 where network memory
      accounting is opt-in but it can happen in cgroup v2 if the source socket
      for the cloning was created in root memcg.
      
      To fix the issue, associate the socket at accept() time, in process
      context, and then force-charge the memory buffer already used and
      reserved by the socket.
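
As a rough illustration of late association, the sketch below models a socket that stays unassociated when allocated in IRQ context and is associated and force-charged at accept time. All names (`struct memcg`, `struct sock`, `sock_alloc_assoc()`, `sock_accept_assoc()`) are hypothetical simplifications, not the kernel's structures or functions.

```c
#include <stddef.h>

/* Hypothetical, simplified memcg and socket; not the kernel's structures. */
struct memcg {
    long charged; /* bytes accounted to this group */
};

struct sock {
    struct memcg *memcg; /* NULL while the socket is unassociated */
    long buffered;       /* bytes already used or reserved by the socket */
};

/* Allocation path: in IRQ context the interrupted task is unrelated,
 * so the socket is deliberately left unassociated. */
void sock_alloc_assoc(struct sock *sk, struct memcg *current_memcg, int in_irq)
{
    sk->memcg = in_irq ? NULL : current_memcg;
}

/* accept() path: runs in process context, so the caller's memcg is known.
 * Associate late and force-charge what the socket already holds. */
void sock_accept_assoc(struct sock *sk, struct memcg *current_memcg)
{
    if (sk->memcg || !current_memcg)
        return; /* already associated, or nothing to associate with */

    sk->memcg = current_memcg;
    current_memcg->charged += sk->buffered; /* forced: no limit check */
}
```

Calling `sock_accept_assoc()` twice charges the buffer only once, because the second call sees an already associated socket.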
      
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • cgroup: memcg: net: do not associate sock with unrelated cgroup · 0d6c7d19
      Shakeel Butt committed
      [ Upstream commit e876ecc67db80dfdb8e237f71e5b43bb88ae549c ]
      
      We are testing network memory accounting in our setup and noticed
      inconsistent network memory usage: often the network usage of unrelated
      cgroups correlates with the testing workload.  On further inspection, it
      seems that mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in
      IRQ context, especially for cgroup v1.
      
      mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in IRQ context,
      and they more or less assume that this can only happen from
      sk_clone_lock() and that the source sock object is already associated
      with a cgroup.  However, in cgroup v1, where network memory accounting is
      opt-in, the source sock can be unassociated with any cgroup, and the
      newly cloned sock can get associated with the unrelated cgroup that was
      interrupted.
      
      Cgroup v2 can also suffer if the source sock object was created by a
      process in the root cgroup or if sk_alloc() is called in IRQ context.
      The fix is to just do nothing in interrupt context.
      
      WARNING: Please note that about half of the TCP sockets are allocated
      from IRQ context, so memory used by such sockets will not be
      accounted by the memcg.
      
      The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
      
      CPU: 70 PID: 12720 Comm: ssh Tainted:  5.6.0-smp-DEV #1
      Hardware name: ...
      Call Trace:
       <IRQ>
       dump_stack+0x57/0x75
       mem_cgroup_sk_alloc+0xe9/0xf0
       sk_clone_lock+0x2a7/0x420
       inet_csk_clone_lock+0x1b/0x110
       tcp_create_openreq_child+0x23/0x3b0
       tcp_v6_syn_recv_sock+0x88/0x730
       tcp_check_req+0x429/0x560
       tcp_v6_rcv+0x72d/0xa40
       ip6_protocol_deliver_rcu+0xc9/0x400
       ip6_input+0x44/0xd0
       ? ip6_protocol_deliver_rcu+0x400/0x400
       ip6_rcv_finish+0x71/0x80
       ipv6_rcv+0x5b/0xe0
       ? ip6_sublist_rcv+0x2e0/0x2e0
       process_backlog+0x108/0x1e0
       net_rx_action+0x26b/0x460
       __do_softirq+0x104/0x2a6
       do_softirq_own_stack+0x2a/0x40
       </IRQ>
       do_softirq.part.19+0x40/0x50
       __local_bh_enable_ip+0x51/0x60
       ip6_finish_output2+0x23d/0x520
       ? ip6table_mangle_hook+0x55/0x160
       __ip6_finish_output+0xa1/0x100
       ip6_finish_output+0x30/0xd0
       ip6_output+0x73/0x120
       ? __ip6_finish_output+0x100/0x100
       ip6_xmit+0x2e3/0x600
       ? ipv6_anycast_cleanup+0x50/0x50
       ? inet6_csk_route_socket+0x136/0x1e0
       ? skb_free_head+0x1e/0x30
       inet6_csk_xmit+0x95/0xf0
       __tcp_transmit_skb+0x5b4/0xb20
       __tcp_send_ack.part.60+0xa3/0x110
       tcp_send_ack+0x1d/0x20
       tcp_rcv_state_process+0xe64/0xe80
       ? tcp_v6_connect+0x5d1/0x5f0
       tcp_v6_do_rcv+0x1b1/0x3f0
       ? tcp_v6_do_rcv+0x1b1/0x3f0
       __release_sock+0x7f/0xd0
       release_sock+0x30/0xa0
       __inet_stream_connect+0x1c3/0x3b0
       ? prepare_to_wait+0xb0/0xb0
       inet_stream_connect+0x3b/0x60
       __sys_connect+0x101/0x120
       ? __sys_getsockopt+0x11b/0x140
       __x64_sys_connect+0x1a/0x20
       do_syscall_64+0x51/0x200
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 2d758073 ("mm: memcontrol: consolidate cgroup socket tracking")
      Fixes: d979a39d ("cgroup: duplicate cgroup reference when cloning sockets")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/memcontrol.c: lost css_put in memcg_expand_shrinker_maps() · a26df351
      Vasily Averin committed
      mainline inclusion
      from mainline-5.6-rc3
      commit 75866af62b439859d5146b7093ceb6b482852683
      category: bugfix
      bugzilla: 30952
      CVE: NA
      
      -------------------------------------------------
      for_each_mem_cgroup() increases the css reference counter of the memory
      cgroup, so mem_cgroup_iter_break() must be called if the walk is cancelled.
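
The reference rule can be illustrated with a small userspace model. This is a sketch under assumptions: `struct group`, `iter_next()`, `iter_break()` and `walk()` are hypothetical stand-ins for the css reference counting done by the real iterator.

```c
#include <stddef.h>

/* Hypothetical refcounted group; the kernel uses css references instead. */
struct group {
    int refs;
};

/* Advance the walk: drop the reference on prev, take one on the next
 * group. Returns NULL, holding nothing, when the walk is complete. */
struct group *iter_next(struct group *groups, int n, struct group *prev)
{
    int next = prev ? (int)(prev - groups) + 1 : 0;

    if (prev)
        prev->refs--;
    if (next >= n)
        return NULL;
    groups[next].refs++;
    return &groups[next];
}

/* Counterpart of mem_cgroup_iter_break(): release the reference that a
 * cancelled walk is still holding. */
void iter_break(struct group *pos)
{
    if (pos)
        pos->refs--;
}

/* Walk the groups and bail out at index fail_at (pass -1 to never bail).
 * Returns 0 on a full walk, -1 if cancelled. */
int walk(struct group *groups, int n, int fail_at)
{
    struct group *g;
    int i = 0;

    for (g = iter_next(groups, n, NULL); g; g = iter_next(groups, n, g)) {
        if (i++ == fail_at) {
            iter_break(g); /* without this, g->refs would leak */
            return -1;
        }
    }
    return 0;
}
```

After either a complete or a cancelled walk, every `refs` counter is back at its starting value; dropping the `iter_break()` call leaves one group with a leaked reference.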
      
      Link: http://lkml.kernel.org/r/c98414fb-7e1f-da0f-867a-9340ec4bd30b@virtuozzo.com
      Fixes: 0a4465d3 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit 75866af62b439859d5146b7093ceb6b482852683)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  2. 27 Dec, 2019 13 commits
    • mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() · 0baaf02e
      Roman Gushchin committed
      commit 00d484f354d85845991b40141d40ba9e5eb60faf upstream.
      
      We've encountered an RCU stall in get_mem_cgroup_from_mm():
      
        rcu: INFO: rcu_sched self-detected stall on CPU
        rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
        (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
        <...>
        RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
        <...>
         __memcg_kmem_charge+0x55/0x140
         __alloc_pages_nodemask+0x267/0x320
         pipe_write+0x1ad/0x400
         new_sync_write+0x127/0x1c0
         __kernel_write+0x4f/0xf0
         dump_emit+0x91/0xc0
         writenote+0xa0/0xc0
         elf_core_dump+0x11af/0x1430
         do_coredump+0xc65/0xee0
         get_signal+0x132/0x7c0
         do_signal+0x36/0x640
         exit_to_usermode_loop+0x61/0xd0
         do_syscall_64+0xd4/0x100
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The problem is caused by an exiting task which is associated with an
      offline memcg.  We're iterating over and over in the do {} while
      (!css_tryget_online()) loop, but obviously the memcg won't become online
      and the exiting task won't be migrated to a live memcg.
      
      Let's fix it by switching from css_tryget_online() to css_tryget().
      
      As css_tryget_online() cannot guarantee that the memcg won't go offline,
      the check is usually useless, except in some rare cases, for example when
      it determines whether something should be presented to a user.
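
The difference between the two helpers can be modelled in a few lines. This is only a sketch: `struct css` here is a hypothetical two-field stand-in, and the `max_tries` bound exists solely so the example terminates, whereas the loop described above has no such bound.

```c
#include <stdbool.h>

/* Hypothetical, simplified css: a refcount plus an online flag. */
struct css {
    int refs;     /* 0 means the object is being destroyed */
    bool online;
};

/* Take a reference as long as the object is still alive. */
bool tryget(struct css *c)
{
    if (c->refs <= 0)
        return false;
    c->refs++;
    return true;
}

/* Take a reference only if the object is also online. */
bool tryget_online(struct css *c)
{
    return c->online && tryget(c);
}

/*
 * Model of the retry loop: repeat until get() succeeds.
 * Returns the number of attempts needed, or -1 if max_tries ran out.
 */
int get_with_retry(struct css *c, bool (*get)(struct css *), int max_tries)
{
    int tries;

    for (tries = 1; tries <= max_tries; tries++) {
        if (get(c))
            return tries;
    }
    return -1;
}
```

For an offline but still referenced group, the `tryget_online` variant never succeeds however many attempts are allowed, while the `tryget` variant succeeds on the first attempt.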
      
      A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
      css_tryget() instead of css_tryget_online() in task_get_css()").
      
      Johannes:
      
      : The bug aside, it doesn't matter whether the cgroup is online for the
      : callers.  It used to matter when offlining needed to evacuate all charges
      : from the memcg, and so needed to prevent new ones from showing up, but we
      : don't care now.
      
      Link: http://lkml.kernel.org/r/20191106225131.3543616-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Shakeel Butt <shakeeb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges · 0f728582
      Johannes Weiner committed
      commit 869712fd3de5a90b7ba23ae1272278cddc66b37b upstream.
      
      While upgrading from 4.16 to 5.2, we noticed these allocation errors in
      the log of the new kernel:
      
        SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
          cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
          node 0: slabs: 5, objs: 170, free: 0
      
              slab_out_of_memory+1
              ___slab_alloc+969
              __slab_alloc+14
              kmem_cache_alloc+346
              inet_twsk_alloc+60
              tcp_time_wait+46
              tcp_fin+206
              tcp_data_queue+2034
              tcp_rcv_state_process+784
              tcp_v6_do_rcv+405
              __release_sock+118
              tcp_close+385
              inet_release+46
              __sock_release+55
              sock_close+17
              __fput+170
              task_work_run+127
              exit_to_usermode_loop+191
              do_syscall_64+212
              entry_SYSCALL_64_after_hwframe+68
      
      accompanied by an increase in machines going completely radio silent
      under memory pressure.
      
      One thing that changed since 4.16 is e699e2c6 ("net, mm: account
      sock objects to kmemcg"), which made these slab caches subject to cgroup
      memory accounting and control.
      
      The problem with that is that cgroups, unlike the page allocator, do not
      maintain dedicated atomic reserves.  As a cgroup's usage hovers at its
      limit, atomic allocations - such as done during network rx - can fail
      consistently for extended periods of time.  The kernel is not able to
      operate under these conditions.
      
      We don't want to revert the culprit patch, because it indeed tracks a
      potentially substantial amount of memory used by a cgroup.
      
      We also don't want to implement dedicated atomic reserves for cgroups.
      There is no point in keeping a fixed margin of unused bytes in the
      cgroup's memory budget to accommodate a consumer that is impossible to
      predict - we'd be wasting memory and get into configuration headaches,
      not unlike what we have going with min_free_kbytes.  We do this for
      physical mem because we have to, but cgroups are an accounting game.
      
      Instead, account these privileged allocations to the cgroup, but let
      them bypass the configured limit if they have to.  This way, we get the
      benefits of accounting the consumed memory and have it exert pressure on
      the rest of the cgroup, but like with the page allocator, we shift the
      burden of reclaiming on behalf of atomic allocations onto the regular
      allocations that can block.
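
A minimal sketch of this policy, assuming a hypothetical `struct counter` and `try_charge()` rather than the kernel's page counters: regular charges stop at the limit, while atomic charges are still accounted but may exceed it.

```c
#include <stdbool.h>

/* Hypothetical page counter; a stand-in for a cgroup's memory budget. */
struct counter {
    long usage;
    long limit;
};

/*
 * Charge nr pages. Regular charges fail at the limit (the kernel would
 * first try to reclaim, which this sketch omits). Atomic charges cannot
 * block to reclaim, so they are accounted and allowed past the limit.
 */
bool try_charge(struct counter *c, long nr, bool atomic)
{
    if (c->usage + nr <= c->limit || atomic) {
        c->usage += nr;
        return true;
    }
    return false;
}
```

Because the overshoot is added to `usage`, later regular charges keep failing until usage drops again, which is how the pressure lands on allocations that can block.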
      
      Link: http://lkml.kernel.org/r/20191022233708.365764-1-hannes@cmpxchg.org
      Fixes: e699e2c6 ("net, mm: account sock objects to kmemcg")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/memcontrol: update lruvec counters in mem_cgroup_move_account · 36c32dd4
      Konstantin Khlebnikov committed
      mainline inclusion
      from mainline-5.4-rc4
      commit ae8af4388db002bbd1df78ecee7ca31cee78e964
      category: bugfix
      bugzilla: 24064
      CVE: NA
      
      -------------------------------------------------
      
      Mapped, dirty and writeback pages are also counted in per-lruvec stats.
      These counters need updating when a page is moved between cgroups.
      
      Currently nobody is *consuming* the lruvec versions of these counters, so
      there is no user-visible effect.
      
      Link: http://lkml.kernel.org/r/157112699975.7360.1062614888388489788.stgit@buzz
      Fixes: 00f3ca2c ("mm: memcontrol: per-lruvec stats infrastructure")
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Chenwandun <chenwandun@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks() · cd637046
      Tetsuo Handa committed
      mainline inclusion
      from mainline-5.3-rc1
      commit f168a9a5
      category: bugfix
      bugzilla: 20545
      CVE: NA
      
      -------------------------------------------------
      
      Since commit c03cd773 ("cgroup: Include dying leaders with live
      threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
      mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
      only one thread from each thread group.
      
      [penguin-kernel@I-love.SAKURA.ne.jp: remove thread group leader check in oom_evaluate_task()]
        Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Link: http://lkml.kernel.org/r/c763afc8-f0ae-756a-56a7-395f625b95fc@i-love.sakura.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: tongtiangen <tongtiangen@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • memcg, kmem: do not fail __GFP_NOFAIL charges · 421ceaf7
      Michal Hocko committed
      commit e55d9d9bfb69405bd7615c0f8d229d8fafb3e9b8 upstream.
      
      Thomas has noticed the following NULL ptr dereference when using cgroup
      v1 kmem limit:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      PGD 0
      P4D 0
      Oops: 0000 [#1] PREEMPT SMP PTI
      CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
      Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
      RIP: 0010:create_empty_buffers+0x24/0x100
      Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
      RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
      RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
      RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
      R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
      FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
      Call Trace:
       create_page_buffers+0x4d/0x60
       __block_write_begin_int+0x8e/0x5a0
       ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
       ? jbd2__journal_start+0xd7/0x1f0
       ext4_da_write_begin+0x112/0x3d0
       generic_perform_write+0xf1/0x1b0
       ? file_update_time+0x70/0x140
       __generic_file_write_iter+0x141/0x1a0
       ext4_file_write_iter+0xef/0x3b0
       __vfs_write+0x17e/0x1e0
       vfs_write+0xa5/0x1a0
       ksys_write+0x57/0xd0
       do_syscall_64+0x55/0x160
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Tetsuo then noticed that this is because __memcg_kmem_charge_memcg()
      fails a __GFP_NOFAIL charge when the kmem limit is reached.  This is
      wrong behavior because nofail allocations are not allowed to fail.  The
      normal charge path simply forces the charge even if that means crossing
      the limit.  Kmem accounting should do the same.
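
The intended behavior can be sketched as follows. The names are hypothetical (`struct memcg` here has just three fields, and `kmem_charge()` is not the kernel function); the sketch assumes the overall charge has already been forced, and only models what happens when the separate kmem limit is hit.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical counters for overall memory and for kernel memory. */
struct memcg {
    long mem_usage;
    long kmem_usage;
    long kmem_limit;
};

/*
 * Charge nr pages of kernel memory. The overall charge is assumed to
 * succeed, as the normal charge path forces it.
 * Returns 0 on success or -ENOMEM.
 */
int kmem_charge(struct memcg *cg, long nr, bool nofail)
{
    cg->mem_usage += nr;

    if (cg->kmem_usage + nr > cg->kmem_limit && !nofail) {
        cg->mem_usage -= nr; /* undo the overall charge */
        return -ENOMEM;
    }

    /* Within the limit, or a nofail caller that must not see failure. */
    cg->kmem_usage += nr;
    return 0;
}
```

A nofail caller therefore never sees `-ENOMEM`; the price is that `kmem_usage` can end up above `kmem_limit`.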
      
      Link: http://lkml.kernel.org/r/20190906125608.32129-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
      Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • memcg, kmem: deprecate kmem.limit_in_bytes · a1721a1b
      Michal Hocko committed
      mainline inclusion
      from mainline-5.3
      commit 0158115f
      category: doc
      bugzilla: 23416
      CVE: NA
      
      -------------------------------------------------
      
      The cgroup v1 memcg controller has exposed a dedicated kmem limit to
      users, which turned out to be a really bad idea because there are paths
      which cannot shrink the kernel memory usage enough to get below the limit
      (e.g. because the accounted memory is not reclaimable).  There are cases
      when failure is not even allowed (e.g. __GFP_NOFAIL).  This means that
      the kmem limit is in excess of the hard limit without any way to shrink,
      and thus completely useless.  The OOM killer cannot be invoked to handle
      the situation because that would lead to premature OOM killing.
      
      As a result, many places might see ENOMEM returned from kmalloc, resulting
      in unexpected errors, e.g. a global OOM kill when there is a lot of free
      memory, because ENOMEM is translated into VM_FAULT_OOM in the #PF path and
      pagefault_out_of_memory() would therefore invoke the OOM killer.
      
      Please note that kernel memory is still accounted to the overall limit
      along with user memory, so removing the kmem-specific limit should still
      allow kernel memory consumption to be contained.  Unlike the kmem limit,
      though, the overall limit invokes memory reclaim and targeted memcg OOM
      killing if necessary.
      
      Start the deprecation process by crying to the kernel log.  Let's see
      whether there are relevant use cases, and simply return EINVAL in the
      second stage if nobody complains within a few releases.
      
      [akpm@linux-foundation.org: tweak documentation text]
      Link: http://lkml.kernel.org/r/20190911151612.GI4023@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • memcg: localize memcg_kmem_enabled() check · 0f878ed1
      Shakeel Butt committed
      mainline inclusion
      from mainline-5.1-rc1
      commit 60cd4bcd62384cfa1e5890cebacccf08b3161156
      category: bugfix
      bugzilla: 21077
      CVE: NA
      
      ------------------------------------------------
      
      Move the memcg_kmem_enabled() checks into the memcg kmem charge/uncharge
      functions, so that users don't have to check that condition explicitly.
      
      This is purely a code cleanup patch without any functional change.  Only
      the order of checks in memcg_charge_slab() can potentially change, but
      functionally it will be the same.  This should not matter as
      memcg_charge_slab() is not in the hot path.
      
      Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/memcontrol.c: fix use after free in mem_cgroup_iter() · 721f8a64
      Miles Chen committed
      commit 54a83d6bcbf8f4700013766b974bf9190d40b689 upstream.
      
      This patch is sent to report a use-after-free in mem_cgroup_iter()
      that persists after merging commit be2657752e9e ("mm: memcg: fix use
      after free in mem_cgroup_iter()").
      
      I work with the Android kernel trees (4.9 and 4.14), and commit
      be2657752e9e ("mm: memcg: fix use after free in mem_cgroup_iter()") has
      been merged into them.  However, I can still observe the use-after-free
      issues addressed by commit be2657752e9e (on low-end devices, a few times
      this month).
      
      backtrace:
              css_tryget <- crash here
              mem_cgroup_iter
              shrink_node
              shrink_zones
              do_try_to_free_pages
              try_to_free_pages
              __perform_reclaim
              __alloc_pages_direct_reclaim
              __alloc_pages_slowpath
              __alloc_pages_nodemask
      
      To debug, I poisoned mem_cgroup before freeing it:
      
        static void __mem_cgroup_free(struct mem_cgroup *memcg)
        {
              int node;

              for_each_node(node)
                      free_mem_cgroup_per_node_info(memcg, node);
              free_percpu(memcg->stat);
        +     /* poison memcg before freeing it */
        +     memset(memcg, 0x78, sizeof(struct mem_cgroup));
              kfree(memcg);
        }
      
      The coredump shows the position=0xdbbc2a00 is freed.
      
        (gdb) p/x ((struct mem_cgroup_per_node *)0xe5009e00)->iter[8]
        $13 = {position = 0xdbbc2a00, generation = 0x2efd}
      
        0xdbbc2a00:     0xdbbc2e00      0x00000000      0xdbbc2800      0x00000100
        0xdbbc2a10:     0x00000200      0x78787878      0x00026218      0x00000000
        0xdbbc2a20:     0xdcad6000      0x00000001      0x78787800      0x00000000
        0xdbbc2a30:     0x78780000      0x00000000      0x0068fb84      0x78787878
        0xdbbc2a40:     0x78787878      0x78787878      0x78787878      0xe3fa5cc0
        0xdbbc2a50:     0x78787878      0x78787878      0x00000000      0x00000000
        0xdbbc2a60:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a70:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a80:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a90:     0x00000001      0x00000000      0x00000000      0x00100000
        0xdbbc2aa0:     0x00000001      0xdbbc2ac8      0x00000000      0x00000000
        0xdbbc2ab0:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2ac0:     0x00000000      0x00000000      0xe5b02618      0x00001000
        0xdbbc2ad0:     0x00000000      0x78787878      0x78787878      0x78787878
        0xdbbc2ae0:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2af0:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b00:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b10:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b20:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b30:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b40:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b50:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b60:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b70:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b80:     0x78787878      0x78787878      0x00000000      0x78787878
        0xdbbc2b90:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2ba0:     0x78787878      0x78787878      0x78787878      0x78787878
      
      In the reclaim path, try_to_free_pages() does not set up
      sc.target_mem_cgroup, and sc is passed to do_try_to_free_pages(), ...,
      shrink_node().
      
      In mem_cgroup_iter(), root is set to root_mem_cgroup because
      sc->target_mem_cgroup is NULL.  It is possible to assign a memcg to
      root_mem_cgroup.nodeinfo.iter in mem_cgroup_iter().
      
              try_to_free_pages
              	struct scan_control sc = {...}, target_mem_cgroup is 0x0;
              do_try_to_free_pages
              shrink_zones
              shrink_node
              	 mem_cgroup *root = sc->target_mem_cgroup;
              	 memcg = mem_cgroup_iter(root, NULL, &reclaim);
              mem_cgroup_iter()
              	if (!root)
              		root = root_mem_cgroup;
              	...
      
              	css = css_next_descendant_pre(css, &root->css);
              	memcg = mem_cgroup_from_css(css);
              	cmpxchg(&iter->position, pos, memcg);
      
      My device uses memcg non-hierarchical mode.  When we release a memcg:
      invalidate_reclaim_iterators() reaches only dead_memcg and its parents.
      If non-hierarchical mode is used, invalidate_reclaim_iterators() never
      reaches root_mem_cgroup.
      
        static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
        {
              struct mem_cgroup *memcg = dead_memcg;
      
              for (; memcg; memcg = parent_mem_cgroup(memcg))
              ...
        }
      
      So the use after free scenario looks like:
      
        CPU1						CPU2
      
        try_to_free_pages
        do_try_to_free_pages
        shrink_zones
        shrink_node
        mem_cgroup_iter()
            if (!root)
            	root = root_mem_cgroup;
            ...
            css = css_next_descendant_pre(css, &root->css);
            memcg = mem_cgroup_from_css(css);
            cmpxchg(&iter->position, pos, memcg);
      
              				invalidate_reclaim_iterators(memcg);
              				...
              				__mem_cgroup_free()
              					kfree(memcg);
      
        try_to_free_pages
        do_try_to_free_pages
        shrink_zones
        shrink_node
        mem_cgroup_iter()
            if (!root)
            	root = root_mem_cgroup;
            ...
            mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
            iter = &mz->iter[reclaim->priority];
            pos = READ_ONCE(iter->position);
            css_tryget(&pos->css) <- use after free
      
      To avoid this, we should also invalidate root_mem_cgroup.nodeinfo.iter
      in invalidate_reclaim_iterators().
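
      The idea can be sketched in userspace as below.  This is only a
      simplified model, not the kernel code: struct model_memcg, its single
      iter_position field and both function names are hypothetical stand-ins
      (the real iterators are kept per node and per priority).  It shows the
      walk from the dying group through its parents, followed by a separate
      pass over the root when the walk did not end there.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct mem_cgroup. */
struct model_memcg {
	struct model_memcg *parent;        /* NULL when there is no hierarchy link */
	struct model_memcg *iter_position; /* cached reclaim iterator position */
};

/* Drop the cached position in 'from' if it still points at 'dead'. */
void model_invalidate_one(struct model_memcg *from, struct model_memcg *dead)
{
	if (from->iter_position == dead)
		from->iter_position = NULL;
}

/*
 * Walk from the dying group up through its parents, then handle the root
 * separately when the walk did not reach it (non-hierarchical mode).
 */
void model_invalidate_iterators(struct model_memcg *dead,
				struct model_memcg *root)
{
	struct model_memcg *memcg = dead;
	struct model_memcg *last;

	do {
		model_invalidate_one(memcg, dead);
		last = memcg;
	} while ((memcg = memcg->parent));

	if (last != root)
		model_invalidate_one(root, dead);
}
```

      With a child that has no parent link (the non-hierarchical case), a
      stale root.iter_position pointing at the child is cleared by the final
      pass, which is the case the parent walk alone misses.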
      
      [cai@lca.pw: fix -Wparentheses compilation warning]
        Link: http://lkml.kernel.org/r/1564580753-17531-1-git-send-email-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190730015729.4406-1-miles.chen@mediatek.com
      Fixes: 5ac8fb31 ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
      Signed-off-by: Miles Chen <miles.chen@mediatek.com>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      721f8a64
    • Y
      cgroup: disable kernel memory accounting for all memory cgroups by default · fe02ebb4
      Yang Yingliang committed
      hulk inclusion
      category: bugfix
      bugzilla: 18665
      CVE: NA
      -------------------
      
      The kernel memory accounting for all memory cgroups is
      not stable now, it could lead to a kmem.usage refcount leak.
      It's used as a debug feature for now, so disable it by
      default. We can use the following command line options to enable
      or disable it: cgroup.memory=kmem or cgroup.memory=nokmem.
      
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fe02ebb4
    • G
      mm: writeback: use exact memcg dirty counts · 80f97e37
      Greg Thelen committed
      mainline inclusion
      from mainline-5.1-rc3
      commit 0b3d6e6f2dd0
      category: bugfix
      bugzilla: 13657
      CVE: NA
      
      -------------------------------------------------
      
      Since commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting") memcg dirty and writeback counters are managed
      as:
      
       1) per-memcg per-cpu values in range of [-32..32]
      
       2) per-memcg atomic counter
      
      When a per-cpu counter cannot fit in [-32..32] it's flushed to the
      atomic.  Stat readers only check the atomic.  Thus readers such as
      balance_dirty_pages() may see a nontrivial error margin: 32 pages per
      cpu.
      
      Assuming 100 cpus:
         4k x86 page_size:  13 MiB error per memcg
        64k ppc page_size: 200 MiB error per memcg
      
      Considering that dirty+writeback are used together for some decisions the
      errors double.
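
      The figures above follow from ncpus * 32 pages * page_size.  A small
      helper makes the arithmetic explicit; the function name and its
      parameters are illustrative only and do not appear in the patch.

```c
/*
 * Worst-case number of bytes that a reader of the atomic counter alone
 * can miss: every cpu holds a full, unflushed batch of pages.
 */
unsigned long long stat_error_bytes(unsigned int ncpus,
				    unsigned int batch_pages,
				    unsigned long long page_size)
{
	return (unsigned long long)ncpus * batch_pages * page_size;
}
```

      For 100 cpus and a batch of 32 pages this gives 13107200 bytes
      (12.5 MiB, quoted above as 13 MiB) with 4k pages and 209715200 bytes
      (200 MiB) with 64k pages.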
      
      This inaccuracy can lead to undeserved oom kills.  One nasty case is
      when all per-cpu counters hold positive values offsetting an atomic
      negative value (i.e.  per_cpu[*]=32, atomic=n_cpu*-32).
      balance_dirty_pages() only consults the atomic and does not consider
      throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
      13..200 MiB range then there's absolutely no dirty throttling, which
      burdens vmscan with only dirty+writeback pages thus resorting to oom
      kill.
      
      It could be argued that tiny containers are not supported, but it's more
      subtle.  It's the amount of space available for file lru that matters.
      If a container has memory.max-200MiB of non reclaimable memory, then it
      will also suffer such oom kills on a 100 cpu machine.
      
      The following test reliably ooms without this patch.  This patch avoids
      oom kills.
      
        $ cat test
        mount -t cgroup2 none /dev/cgroup
        cd /dev/cgroup
        echo +io +memory > cgroup.subtree_control
        mkdir test
        cd test
        echo 10M > memory.max
        (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
        (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)
      
        $ cat memcg-writeback-stress.c
        /*
         * Dirty pages from all but one cpu.
         * Clean pages from the non dirtying cpu.
         * This is to stress per cpu counter imbalance.
         * On a 100 cpu machine:
         * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
         * - per memcg atomic is -99*32 pages
         * - thus the complete dirty limit: sum of all counters 0
         * - balance_dirty_pages() only sees atomic count -99*32 pages, which
         *   it max()s to 0.
         * - So a workload can dirty -99*32 pages before balance_dirty_pages()
         *   cares.
         */
        #define _GNU_SOURCE
        #include <err.h>
        #include <fcntl.h>
        #include <sched.h>
        #include <stdlib.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <sys/sysinfo.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        static char *buf;
        static int bufSize;
      
        static void set_affinity(int cpu)
        {
        	cpu_set_t affinity;
      
        	CPU_ZERO(&affinity);
        	CPU_SET(cpu, &affinity);
        	if (sched_setaffinity(0, sizeof(affinity), &affinity))
        		err(1, "sched_setaffinity");
        }
      
        static void dirty_on(int output_fd, int cpu)
        {
        	int i, wrote;
      
        	set_affinity(cpu);
        	for (i = 0; i < 32; i++) {
        		for (wrote = 0; wrote < bufSize; ) {
        			int ret = write(output_fd, buf+wrote, bufSize-wrote);
        			if (ret == -1)
        				err(1, "write");
        			wrote += ret;
        		}
        	}
        }
      
        int main(int argc, char **argv)
        {
        	int cpu, flush_cpu = 1, output_fd;
        	const char *output;
      
        	if (argc != 2)
        		errx(1, "usage: output_file");
      
        	output = argv[1];
        	bufSize = getpagesize();
        	buf = malloc(getpagesize());
        	if (buf == NULL)
        		errx(1, "malloc failed");
      
        	output_fd = open(output, O_CREAT|O_RDWR, 0644);
        	if (output_fd == -1)
        		err(1, "open(%s)", output);
      
        	for (cpu = 0; cpu < get_nprocs(); cpu++) {
        		if (cpu != flush_cpu)
        			dirty_on(output_fd, cpu);
        	}
      
        	set_affinity(flush_cpu);
        	if (fsync(output_fd))
        		err(1, "fsync(%s)", output);
        	if (close(output_fd))
        		err(1, "close(%s)", output);
        	free(buf);
        }
      
      Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
      collect exact per memcg counters.  This avoids the aforementioned oom
      kills.
      
      This does not affect the overhead of memory.stat, which still reads the
      single atomic counter.
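
      The difference between the cheap and the exact read can be modelled in
      userspace as follows.  This is a sketch under stated assumptions, not
      the patch itself: struct model_stat and both function names are
      hypothetical, and a plain long stands in for the per-memcg atomic.

```c
#include <stddef.h>

/*
 * Hypothetical model of one memcg statistic: a shared counter plus the
 * per-cpu deltas that have not been flushed into it yet.
 */
struct model_stat {
	long shared;        /* stands in for the per-memcg atomic */
	const long *percpu; /* unflushed per-cpu deltas */
	size_t ncpus;
};

/* Cheap read: shared counter only, clamped at zero. */
unsigned long model_stat_approx(const struct model_stat *s)
{
	return s->shared < 0 ? 0 : (unsigned long)s->shared;
}

/* Exact read: fold in every per-cpu delta before clamping. */
unsigned long model_stat_exact(const struct model_stat *s)
{
	long sum = s->shared;
	size_t i;

	for (i = 0; i < s->ncpus; i++)
		sum += s->percpu[i];
	return sum < 0 ? 0 : (unsigned long)sum;
}
```

      With shared = -64 and per-cpu deltas of 32, 32 and 10, the cheap read
      reports 0 pages while the exact read reports 10.  The exact read costs
      one pass over the cpus, which is why it is reserved for the throttling
      paths and not used for memory.stat.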
      
      Why not use percpu_counter? memcg already handles cpus going offline, so
      no need for that overhead from percpu_counter.  And the percpu_counter
      spinlocks are more heavyweight than is required.
      
      It probably also makes sense to use exact dirty and writeback counters
      in memcg oom reports.  But that is saved for later.
      
      Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.16+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      80f97e37
    • T
      memcg: killed threads should not invoke memcg OOM killer · 6e7e294c
      Tetsuo Handa committed
      [ Upstream commit 7775face207922ea62a4e96b9cd45abfdc7b9840 ]
      
      If a memory cgroup contains a single process with many threads
      (including different process groups sharing the mm) then it is possible
      to trigger a race in which the oom killer finds no oom eligible tasks
      and complains about it in the log, which is both annoying and
      confusing because there is no actual problem.  The race looks as
      follows:
      
      P1				oom_reaper		P2
      try_charge						try_charge
        mem_cgroup_out_of_memory
          mutex_lock(oom_lock)
            out_of_memory
              oom_kill_process(P1,P2)
               wake_oom_reaper
          mutex_unlock(oom_lock)
          				oom_reap_task
      							  mutex_lock(oom_lock)
      							    select_bad_process # no victim
      
      The problem is more visible with many threads.
      
      Fix this by checking for fatal_signal_pending from
      mem_cgroup_out_of_memory when the oom_lock is already held.
      
      The oom bypass is safe because we do the same early in the try_charge
      path already.  The situation might have changed in the meantime.  It
      should be safe to check for fatal_signal_pending and tsk_is_oom_victim
      but for better code readability abstract the current charge bypass
      condition into should_force_charge and reuse it from that path.
      
      Link: http://lkml.kernel.org/r/01370f70-e1f6-ebe4-b95e-0df21a0bc15e@i-love.sakura.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6e7e294c
    • R
      mm: handle no memcg case in memcg_kmem_charge() properly · c84d2f61
      Roman Gushchin committed
      mainline inclusion
      from mainline-4.20
      commit e68599a3c3ad
      category: bugfix
      bugzilla: 5751
      CVE: NA
      
      -------------------------------------------------
      
      Mike Galbraith reported a regression caused by the commit 9b6f7e163cd0
      ("mm: rework memcg kernel stack accounting") on a system with
      "cgroup_disable=memory" boot option: the system panics with the following
      stack trace:
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8
        PGD 0 P4D 0
        Oops: 0002 [#1] PREEMPT SMP PTI
        CPU: 0 PID: 1 Comm: systemd Not tainted 4.19.0-preempt+ #410
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fed4
        RIP: 0010:page_counter_try_charge+0x22/0xc0
        Code: 41 5d c3 c3 0f 1f 40 00 0f 1f 44 00 00 48 85 ff 0f 84 a7 00 00 00 41 56 48 89 f8 49 89 fe 49
        Call Trace:
         try_charge+0xcb/0x780
         memcg_kmem_charge_memcg+0x28/0x80
         memcg_kmem_charge+0x8b/0x1d0
         copy_process.part.41+0x1ca/0x2070
         _do_fork+0xd7/0x3d0
         do_syscall_64+0x5a/0x180
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The problem occurs because get_mem_cgroup_from_current() returns the NULL
      pointer if the memory controller is disabled.  Let's check if this is the case
      at the beginning of memcg_kmem_charge() and just return 0 if
      mem_cgroup_disabled() returns true.  This is how we handle this case in
      many other places in the memory controller code.
      
      Link: http://lkml.kernel.org/r/20181029215123.17830-1-guro@fb.com
      Fixes: 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Rik van Riel <riel@surriel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Jing xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c84d2f61
    • M
      memcg, oom: notify on oom killer invocation from the charge path · 2276dadc
      Michal Hocko committed
      commit 7056d3a37d2c6aaaab10c13e8e69adc67ec1fc65 upstream.
      
      Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via
      eventfd anymore.  The reason is that 29ef680a ("memcg, oom: move
      out_of_memory back to the charge path") has moved the oom handling back to
      the charge path.  While doing so the notification was left behind in
      mem_cgroup_oom_synchronize.
      
      Fix the issue by replicating the oom hierarchy locking and the
      notification.
      
      Link: http://lkml.kernel.org/r/20181224091107.18354-1-mhocko@kernel.org
      Fixes: 29ef680a ("memcg, oom: move out_of_memory back to the charge path")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Burt Holzman <burt@fnal.gov>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2276dadc
  3. 05 Sep, 2018 1 commit
    • J
      mm: memcontrol: print proper OOM header when no eligible victim left · 3100dab2
      Johannes Weiner committed
      When the memcg OOM killer runs out of killable tasks, it currently
      prints a WARN with no further OOM context.  This has caused some user
      confusion.
      
      Warnings indicate a kernel problem.  In a reported case, however, the
      situation was triggered by a nonsensical memcg configuration (hard limit
      set to 0).  But without any VM context this wasn't obvious from the
      report, and it took some back and forth on the mailing list to identify
      what is actually a trivial issue.
      
      Handle this OOM condition like we handle it in the global OOM killer:
      dump the full OOM context and tell the user we ran out of tasks.
      
      This way the user can identify misconfigurations easily by themselves
      and rectify the problem - without having to go through the hassle of
      running into an obscure but unsettling warning, finding the appropriate
      kernel mailing list and waiting for a kernel developer to remote-analyze
      that the memcg configuration caused this.
      
      If users cannot make sense of why the OOM killer was triggered or why it
      failed, they will still report it to the mailing list, we know that from
      experience.  So in case there is an actual kernel bug causing this,
      kernel developers will very likely hear about it.
      
      Link: http://lkml.kernel.org/r/20180821160406.22578-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3100dab2
  4. 23 Aug, 2018 2 commits
    • R
      mm, oom: introduce memory.oom.group · 3d8b38eb
      Roman Gushchin committed
      For some workloads an intervention from the OOM killer can be painful.
      Killing a random task can bring the workload into an inconsistent state.
      
      Historically, there are two common solutions for this
      problem:
      1) enabling panic_on_oom,
      2) using a userspace daemon to monitor OOMs and kill
         all outstanding processes.
      
      Both approaches have their downsides: rebooting on each OOM is an obvious
      waste of capacity, and handling all in userspace is tricky and requires a
      userspace agent, which will monitor all cgroups for OOMs.
      
      In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
      the necessity of enabling panic_on_oom.  Also, it can simplify the cgroup
      management for userspace applications.
      
      This commit introduces a new knob for cgroup v2 memory controller:
      memory.oom.group.  The knob determines whether the cgroup should be
      treated as an indivisible workload by the OOM killer.  If set, all tasks
      belonging to the cgroup or to its descendants (if the memory cgroup is not
      a leaf cgroup) are killed together or not at all.
      
      To determine which cgroup has to be killed, we traverse the cgroup
      hierarchy from the victim task's cgroup up to the OOMing cgroup (or root),
      looking for the highest-level cgroup with memory.oom.group set.
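
      The traversal can be sketched in userspace like this.  It is a minimal
      model only: struct model_cgroup and model_oom_group_target() are
      hypothetical names, and a NULL oom_domain is assumed to mean a global
      OOM, so the walk runs all the way up to the root.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified cgroup node. */
struct model_cgroup {
	struct model_cgroup *parent; /* NULL for the root */
	bool oom_group;              /* models memory.oom.group */
};

/*
 * Walk from the victim's cgroup towards the OOMing cgroup and return the
 * highest-level cgroup on that path with oom_group set, or NULL if none.
 */
struct model_cgroup *model_oom_group_target(struct model_cgroup *victim,
					    struct model_cgroup *oom_domain)
{
	struct model_cgroup *target = NULL;
	struct model_cgroup *cg;

	for (cg = victim; cg; cg = cg->parent) {
		if (cg->oom_group)
			target = cg; /* keep overwriting: highest match wins */
		if (cg == oom_domain)
			break;       /* do not look above the OOMing cgroup */
	}
	return target;
}
```

      A NULL result means no group on the path asked for group kill, in which
      case only the selected victim task would be killed.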
      
      Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
      an exception and are never killed.
      
      This patch doesn't change the OOM victim selection algorithm.
      
      Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d8b38eb
    • S
      memcg: reduce memcg tree traversals for stats collection · 8de7ecc6
      Shakeel Butt committed
      Currently cgroup-v1's memcg_stat_show traverses the memcg tree ~17 times
      to collect the stats while cgroup-v2's memory_stat_show traverses the
      memcg tree thrice.  On a large machine, a couple thousand memcgs is very
      normal and if the churn is high and memcgs stick around due to several
      reasons, tens of thousands of nodes in memcg tree can exist.  This patch
      has refactored and shared the stat collection code between cgroup-v1 and
      cgroup-v2 and has reduced the tree traversal to just one.
      
      I ran a simple benchmark which reads the root_mem_cgroup's stat file
      1000 times in the presence of 2500 memcgs on cgroup-v1. The results are:
      
      Without the patch:
      $ time ./read-root-stat-1000-times
      
      real    0m1.663s
      user    0m0.000s
      sys     0m1.660s
      
      With the patch:
      $ time ./read-root-stat-1000-times
      
      real    0m0.468s
      user    0m0.000s
      sys     0m0.467s
      
      Link: http://lkml.kernel.org/r/20180724224635.143944-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Bruce Merry <bmerry@ska.ac.za>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8de7ecc6
  5. 18 Aug, 2018 10 commits
    • K
      mm/vmscan.c: clear shrinker bit if there are no objects related to memcg · f90280d6
      Kirill Tkhai committed
      To avoid further unneeded calls of do_shrink_slab() for shrinkers which
      no longer have any charged objects in a memcg, their bits have to
      be cleared.
      
      This patch introduces a lockless mechanism to do that without racing
      against a parallel list lru add.  After do_shrink_slab() returns
      SHRINK_EMPTY the first time, we clear the bit and call it once again.
      Then we restore the bit, if the new return value is different.
      
      Note, that single smp_mb__after_atomic() in shrink_slab_memcg() covers
      two situations:
      
      1)list_lru_add()     shrink_slab_memcg
          list_add_tail()    for_each_set_bit() <--- read bit
                               do_shrink_slab() <--- missed list update (no barrier)
          <MB>                 <MB>
          set_bit()            do_shrink_slab() <--- seen list update
      
      This situation, when the first do_shrink_slab() sees the set bit but
      doesn't see the list update (i.e., a race with the first element being
      queued), is rare.  So we don't add <MB> before the first call of
      do_shrink_slab(), to avoid slowing down the generic case.  The second
      call is needed anyway, as seen in (2) below.
      
      2)list_lru_add()      shrink_slab_memcg()
          list_add_tail()     ...
          set_bit()           ...
        ...                   for_each_set_bit()
        do_shrink_slab()        do_shrink_slab()
          clear_bit()           ...
        ...                     ...
        list_lru_add()          ...
          list_add_tail()       clear_bit()
          <MB>                  <MB>
          set_bit()             do_shrink_slab()
      
      The barriers guarantee that the second do_shrink_slab() in the right
      side task sees the list update if it really cleared the bit.  This case
      is drawn in the code comment.
      
      [Results/performance of the patchset]
      
      After the whole patchset is applied, the test below shows a significant
      increase in performance:
      
        $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
        $mkdir /sys/fs/cgroup/memory/ct
        $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
            $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
      			    echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
      			    mkdir -p s/$i; mount -t tmpfs $i s/$i;
      			    touch s/$i/file; done
      
      Then, 5 sequential calls of drop caches:
      
        $time echo 3 > /proc/sys/vm/drop_caches
      
      1)Before:
        0.00user 13.78system 0:13.78elapsed 99%CPU
        0.00user 5.59system 0:05.60elapsed 99%CPU
        0.00user 5.48system 0:05.48elapsed 99%CPU
        0.00user 8.35system 0:08.35elapsed 99%CPU
        0.00user 8.34system 0:08.35elapsed 99%CPU
      
      2)After
        0.00user 1.10system 0:01.10elapsed 99%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
      
      The results show that performance increases by a factor of at least 548.
      
      Shakeel Butt tested this patchset with fork-bomb on his configuration:
      
       > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
       > file containing few KiBs on corresponding mount. Then in a separate
       > memcg of 200 MiB limit ran a fork-bomb.
       >
       > I ran the "perf record -ag -- sleep 60" and below are the results:
       >
       > Without the patch series:
       > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
       > +  36.40%            fb.sh  [kernel.kallsyms]    [k] shrink_slab
       > +  18.97%            fb.sh  [kernel.kallsyms]    [k] list_lru_count_one
       > +   6.75%            fb.sh  [kernel.kallsyms]    [k] super_cache_count
       > +   0.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
       > +   0.44%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
       > +   0.27%            fb.sh  [kernel.kallsyms]    [k] up_read
       > +   0.21%            fb.sh  [kernel.kallsyms]    [k] osq_lock
       > +   0.13%            fb.sh  [kernel.kallsyms]    [k] shmem_unused_huge_count
       > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
       > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node
       >
       > With the patch series:
       > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
       > +  47.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
       > +  30.72%            fb.sh  [kernel.kallsyms]    [k] up_read
       > +   9.51%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
       > +   1.69%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
       > +   1.35%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_protected
       > +   1.05%            fb.sh  [kernel.kallsyms]    [k] queued_spin_lock_slowpath
       > +   0.85%            fb.sh  [kernel.kallsyms]    [k] _raw_spin_lock
       > +   0.78%            fb.sh  [kernel.kallsyms]    [k] lruvec_lru_size
       > +   0.57%            fb.sh  [kernel.kallsyms]    [k] shrink_node
       > +   0.54%            fb.sh  [kernel.kallsyms]    [k] queue_work_on
       > +   0.46%            fb.sh  [kernel.kallsyms]    [k] shrink_slab_memcg
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f90280d6
    • K
      mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance · fae91d6d
      Kirill Tkhai committed
      Introduce set_shrinker_bit() function to set shrinker-related bit in
      memcg shrinker bitmap, and set the bit after the first item is added and
      in case of reparenting destroyed memcg's items.
      
      This will allow the next patch to call shrinkers only when they have
      charged objects at the moment, and so improve shrink_slab()
      performance.
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112557572.4097.17315791419810749985.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063065671.1818.15914674956134687268.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fae91d6d
    • K
      mm/memcontrol.c: export mem_cgroup_is_root() · dfd2f10c
      Kirill Tkhai committed
      This will be used in the next patch.
      
      Link: http://lkml.kernel.org/r/153063064347.1818.1987011484100392706.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dfd2f10c
    • K
      mm/list_lru: pass dst_memcg argument to memcg_drain_list_lru_node() · 9bec5c35
      Kirill Tkhai committed
      This is just a refactoring so that the next patches have the dst_memcg
      pointer available in memcg_drain_list_lru_node().
      
      Link: http://lkml.kernel.org/r/153063062118.1818.2761273817739499749.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9bec5c35
    • K
      mm, memcg: assign memcg-aware shrinkers bitmap to memcg · 0a4465d3
      Kirill Tkhai committed
      Imagine a big node with many cpus, memory cgroups and containers.
      Suppose we have 200 containers, each with 10 mounts and 10 cgroups, and
      no container's tasks touch another container's mounts.  If there is
      intensive page writing and global reclaim happens, a writing task has to
      iterate over all memcgs to shrink slab before it is able to go to
      shrink_page_list().
      
      Iteration over all the memcg slabs is very expensive: the task has to
      visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
      2000 memcgs, the total calls are 2000 * 2000 = 4000000.
      
      So, the shrinker makes 4 million do_shrink_slab() calls just to try to
      isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcgs via
      shrink_page_list().  I've observed a node spending almost 100% of its
      time in the kernel, uselessly iterating over already-shrunk slabs.
      
      This patch adds a bitmap of memcg-aware shrinkers to memcg.  The size of
      the bitmap depends on bitmap_nr_ids, and during the memcg's life it is
      maintained to be big enough to fit bitmap_nr_ids shrinkers.  Every bit in
      the map corresponds to a shrinker id.
      
      The next patches will keep a bit set only for a memcg that is really
      charged.  This will allow shrink_slab() to improve its performance
      significantly.  See the last patch for the numbers.
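
As a rough illustration of the idea, here is a minimal user-space sketch in C, not the kernel implementation. The names `memcg_shrinker_map`, `map_set`, `map_test` and `shrinker_calls` are hypothetical; the only figure taken from the message above is the 2000 shrinkers. Each memcg carries one bit per shrinker id, and a reclaim pass would call only the shrinkers whose bit is set.

```c
#include <limits.h>
#include <stddef.h>

#define NR_SHRINKERS 2000 /* 200 containers * 10 mounts, as in the example */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MAP_WORDS ((NR_SHRINKERS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical stand-in for a per-memcg bitmap of shrinker ids. */
struct memcg_shrinker_map {
    unsigned long bits[MAP_WORDS];
};

/* Mark shrinker `id` as having objects charged to this memcg. */
static void map_set(struct memcg_shrinker_map *map, size_t id)
{
    map->bits[id / BITS_PER_WORD] |= 1UL << (id % BITS_PER_WORD);
}

/* Return nonzero if the bit for shrinker `id` is set. */
static int map_test(const struct memcg_shrinker_map *map, size_t id)
{
    return (int)((map->bits[id / BITS_PER_WORD] >> (id % BITS_PER_WORD)) & 1UL);
}

/* Count how many shrinkers a reclaim pass would call for one memcg. */
static size_t shrinker_calls(const struct memcg_shrinker_map *map)
{
    size_t id, calls = 0;

    for (id = 0; id < NR_SHRINKERS; id++)
        if (map_test(map, id))
            calls++;
    return calls;
}
```

Under the assumption that each memcg ends up with bits set only for its own container's 10 mounts, a full pass would make about 2000 * 10 = 20000 calls instead of 2000 * 2000 = 4000000.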
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
      [ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
        Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
      Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a4465d3
    • K
      mm/memcontrol.c: move up for_each_mem_cgroup{, _tree} defines · b05706f1
      Kirill Tkhai committed
      The next patch requires these defines to be above their current
      position, so move them up to the declarations.
      
      Link: http://lkml.kernel.org/r/153063055665.1818.5200425793649695598.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b05706f1
    • K
      mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CONFIG_SLOB · 84c07d11
      Kirill Tkhai committed
      Introduce a new config option, which replaces the repeated
      CONFIG_MEMCG && !CONFIG_SLOB pattern.  The next patches add a little more
      memcg+kmem related code, so let's keep the defines clearer.
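
The effect on call sites can be sketched with plain preprocessor logic. This is an illustration only: in the kernel the derived symbol would come from Kconfig rather than from a `#define`, the two input symbols are set by hand here, and `memcg_kmem_built_in()` is a hypothetical helper.

```c
/* Assumption: normally provided by the build system; set by hand here. */
#define CONFIG_MEMCG 1
/* CONFIG_SLOB is deliberately left undefined. */

/* Derived symbol, standing in for what the new config option provides. */
#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
#define CONFIG_MEMCG_KMEM 1
#endif

/* Call sites test one symbol instead of repeating the two-symbol pattern. */
static int memcg_kmem_built_in(void)
{
#ifdef CONFIG_MEMCG_KMEM
    return 1;
#else
    return 0;
#endif
}
```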
      
      Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84c07d11
    • M
      memcg, oom: move out_of_memory back to the charge path · 29ef680a
      Michal Hocko committed
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") has changed the ENOMEM semantic of memcg charges.
      Rather than invoking the oom killer from the charging context it delays
      the oom killer to the page fault path (pagefault_out_of_memory).  This
      in turn means that many users (e.g.  slab or g-u-p) will get ENOMEM when
      the corresponding memcg hits the hard limit and the memcg is OOM.
      This behavior is inconsistent with the !memcg case, where the oom killer
      is invoked from the allocation context and the allocator keeps retrying
      until it succeeds.
      
      The difference in the behavior is user visible.  mmap(MAP_POPULATE)
      might result in not fully populated ranges while the mmap return code
      doesn't tell that to the userspace.  Random syscalls might fail with
      ENOMEM etc.
      
      The primary motivation for the different memcg oom semantic was
      deadlock avoidance.  Things have changed since then, though.  We have an
      async oom teardown by the oom reaper now and so we do not have to rely
      on the victim to tear down its memory anymore.  Therefore we can return
      to the original semantic as long as the memcg oom killer is not handed
      over to userspace.
      
      There is still one thing to be careful about here though.  If the oom
      killer is not able to make any forward progress - e.g.  because there is
      no eligible task to kill - then we have to bail out of the charge path
      to prevent the same class of deadlocks.  We have basically two options
      here.  Either we fail the charge with ENOMEM or force the charge and
      allow overcharge.  The first option has been considered more harmful
      than useful because rare inconsistencies in the ENOMEM behavior are hard
      to test for and error prone.  This is basically the same reason why the
      page allocator doesn't fail allocations under such conditions.  The
      latter might allow runaways but those should be really unlikely unless
      somebody misconfigures the system, e.g. by allowing tasks to migrate
      away from the memcg to a different unlimited memcg with
      move_charge_at_immigrate disabled.
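
The decision flow described above can be sketched as a small user-space model in C. It is a simplification under stated assumptions, not the kernel charge path: `memcg_model`, `model_oom_kill()` and `model_try_charge()` are hypothetical names, and the "OOM kill" is reduced to subtracting a victim's pages from the usage counter.

```c
#include <stdbool.h>

enum charge_result { CHARGE_OK, CHARGE_FORCED };

/* Hypothetical model of a memcg's accounting state. */
struct memcg_model {
    long usage;    /* pages currently charged */
    long limit;    /* hard limit in pages */
    long killable; /* pages a kill would free; 0 means no eligible victim */
};

/* Model an OOM kill: free the victim's pages, report forward progress. */
static bool model_oom_kill(struct memcg_model *m)
{
    if (m->killable <= 0)
        return false;
    m->usage -= m->killable;
    m->killable = 0;
    return true;
}

/* Charge nr_pages; invoke the OOM killer from the charge path when needed. */
static enum charge_result model_try_charge(struct memcg_model *m, long nr_pages)
{
    for (;;) {
        if (m->usage + nr_pages <= m->limit) {
            m->usage += nr_pages;
            return CHARGE_OK;
        }
        if (!model_oom_kill(m)) {
            /* No forward progress: force the charge and allow overcharge. */
            m->usage += nr_pages;
            return CHARGE_FORCED;
        }
        /* The kill freed memory, so retry the charge. */
    }
}
```

The model never returns an ENOMEM-like failure, which mirrors the choice of the second option (overcharge) over the first.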
      
      Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29ef680a
    • S
      fs, mm: account buffer_head to kmemcg · f745c6f5
      Shakeel Butt committed
      The buffer_head can consume a significant amount of system memory and is
      directly related to the amount of page cache.  In our production
      environment we have observed that a lot of machines are spending a
      significant amount of memory as buffer_head, and this cannot be left as
      system memory overhead.
      
      Charging buffer_head is not as simple as adding __GFP_ACCOUNT to the
      allocation.  The buffer_heads can be allocated in a memcg different from
      the memcg of the page for which buffer_heads are being allocated.  One
      concrete example is memory reclaim.  Reclaim can trigger I/O of
      pages of any memcg on the system.  So, the right way to charge
      buffer_head is to extract the memcg from the page for which buffer_heads
      are being allocated and then use the targeted memcg charging API.
      
      [shakeelb@google.com: use __GFP_ACCOUNT for directed memcg charging]
        Link: http://lkml.kernel.org/r/20180702220208.213380-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180627191250.209150-3-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f745c6f5
    • S
      fs: fsnotify: account fsnotify metadata to kmemcg · d46eb14b
      Shakeel Butt committed
      Patch series "Directed kmem charging", v8.
      
      The Linux kernel's memory cgroup allows limiting the memory usage of the
      jobs running on the system to provide isolation between the jobs.  All
      the kernel memory allocated in the context of the job and marked with
      __GFP_ACCOUNT will also be included in the memory usage and be limited
      by the job's limit.
      
      The kernel memory can only be charged to the memcg of the process in
      whose context kernel memory was allocated.  However there are cases
      where the allocated kernel memory should be charged to a memcg
      different from the current process's memcg.  This patch series
      contains two such concrete use-cases, i.e.  fsnotify and buffer_head.
      
      The fsnotify event objects can consume a lot of system memory for large
      or unlimited queues if there is either no or slow listener.  The events
      are allocated in the context of the event producer.  However they should
      be charged to the event consumer.  Similarly the buffer_head objects can
      be allocated in a memcg different from the memcg of the page for which
      buffer_head objects are being allocated.
      
      To solve this issue, this patch series introduces a mechanism to charge
      kernel memory to a given memcg.  In the case of fsnotify events, the
      memcg of the consumer can be used for charging, and for buffer_head, the
      memcg of the page can be charged.  For directed charging, the caller can
      use the scope API memalloc_[un]use_memcg() to specify the memcg to charge
      for all the __GFP_ACCOUNT allocations within the scope.
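
The scope idea can be modelled in a few lines of user-space C. This is a hedged sketch, not the kernel API: `memcg_acct`, `model_use_memcg()`, `model_unuse_memcg()` and `model_charge()` are hypothetical stand-ins, and the per-task state is reduced to a single global because the sketch is single-threaded.

```c
#include <stddef.h>

/* Hypothetical accounting record for one memcg. */
struct memcg_acct {
    long charged; /* bytes charged so far */
};

/* Per-task state in a real kernel; a plain global in this sketch. */
static struct memcg_acct *active_memcg;

/* Open a scope: accounted allocations are charged to `m` until it is closed. */
static void model_use_memcg(struct memcg_acct *m)
{
    active_memcg = m;
}

/* Close the scope: charging falls back to the current task's memcg. */
static void model_unuse_memcg(void)
{
    active_memcg = NULL;
}

/* Charge an accounted allocation to the scoped memcg if set, else current's. */
static void model_charge(struct memcg_acct *current_memcg, long bytes)
{
    struct memcg_acct *target = active_memcg ? active_memcg : current_memcg;

    target->charged += bytes;
}
```

In this model an event producer would wrap its allocation in `model_use_memcg(&listener)` and `model_unuse_memcg()`, so the bytes land on the listener even though the producer is the running task.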
      
      This patch (of 2):
      
      A lot of memory can be consumed by the events generated for the huge or
      unlimited queues if there is either no or slow listener.  This can cause
      system level memory pressure or OOMs.  So, it's better to account the
      fsnotify kmem caches to the memcg of the listener.
      
      However the listener can be in a different memcg than the memcg of the
      producer and these allocations happen in the context of the event
      producer.  This patch introduces remote memcg charging API which the
      producer can use to charge the allocations to the memcg of the listener.
      
      There are seven fsnotify kmem caches, and among them allocations from
      dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
      inotify_inode_mark_cachep happen in the context of a syscall from the
      listener.  So, SLAB_ACCOUNT is enough for these caches.
      
      The objects from fsnotify_mark_connector_cachep are not accounted as
      they are small compared to the notification mark or events and it is
      unclear whom to account connector to since it is shared by all events
      attached to the inode.
      
      The allocations from the event caches happen in the context of the event
      producer.  For such caches we will need to remote charge the allocations
      to the listener's memcg.  Thus we save the memcg reference in the
      fsnotify_group structure of the listener.
      
      This patch also reorders the members of fsnotify_group to fill the
      holes, so that the structure size stays the same, at least for 64-bit
      builds, even with the additional member.
      
      [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
        Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d46eb14b
  6. 03 Aug, 2018 1 commit
  7. 22 Jul, 2018 1 commit
    • J
      mm: memcg: fix use after free in mem_cgroup_iter() · 9f15bde6
      Jing Xia committed
      It was reported that a kernel crash happened in mem_cgroup_iter(), which
      can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.
      
      Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
      ......
      Call trace:
        mem_cgroup_iter+0x2e0/0x6d4
        shrink_zone+0x8c/0x324
        balance_pgdat+0x450/0x640
        kswapd+0x130/0x4b8
        kthread+0xe8/0xfc
        ret_from_fork+0x10/0x20
      
        mem_cgroup_iter():
            ......
            if (css_tryget(css))    <-- crash here
      	    break;
            ......
      
      The crash occurs because mem_cgroup_iter() uses the memcg object whose
      pointer is stored in iter->position, which has already been freed and
      filled with POISON_FREE (0x6b).
      
      The root cause of the use-after-free is that
      invalidate_reclaim_iterators() fails to reset iter->position to NULL
      when the css of the memcg is released in non-hierarchical mode.
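
The general pattern behind this class of bug can be shown with a short user-space sketch in C. It is not the actual fix, which concerns which iterators get visited in non-hierarchical mode; `memcg_node`, `reclaim_iter_model` and `model_invalidate_iters()` are hypothetical names. The point is only that every cached pointer to an object must be cleared before that object's memory is released.

```c
#include <stddef.h>

/* Hypothetical stand-in for a memcg object. */
struct memcg_node {
    int id;
};

/* Hypothetical stand-in for a reclaim iterator with a cached position. */
struct reclaim_iter_model {
    struct memcg_node *position; /* last visited memcg, may be NULL */
};

/* Clear every cached pointer to `dead` before its memory is freed. */
static void model_invalidate_iters(struct reclaim_iter_model *iters, size_t n,
                                   const struct memcg_node *dead)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (iters[i].position == dead)
            iters[i].position = NULL;
}
```

If any iterator is skipped by such a walk, its `position` keeps pointing at freed memory, which matches the poisoned address seen in the trace above.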
      
      Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
      Fixes: 6df38689 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
      Signed-off-by: Jing Xia <jing.xia.mail@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <chunyan.zhang@unisoc.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f15bde6
  8. 09 Jul, 2018 1 commit
  9. 15 Jun, 2018 2 commits
  10. 08 Jun, 2018 5 commits