1. 08 Jul 2021, 2 commits
  2. 03 Jun 2021, 1 commit
    • mm: memcontrol: slab: fix obtain a reference to a freeing memcg · 197d839d
      Muchun Song committed
      stable inclusion
      from stable-5.10.37
      commit 31df8bc4d3feca9f9c6b2cd06fd64a111ae1a0e6
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 9f38f03a ]
      
      Patch series "Use obj_cgroup APIs to charge kmem pages", v5.
      
      Since Roman's series "The new cgroup slab memory controller" was
      applied, all slab objects have been charged via the new obj_cgroup
      APIs.  The new APIs introduce a struct obj_cgroup to charge slab
      objects, which prevents long-lived objects from pinning the original
      memory cgroup in memory.  But there are still some corner-case
      objects (e.g.  allocations larger than an order-1 page on SLUB) which
      are not charged with the new APIs.  Those objects (including pages
      allocated directly from the buddy allocator) are charged as kmem
      pages, which still hold a reference to the memory cgroup.
      
      E.g.  we know that the kernel stack is charged as kmem pages because
      the kernel stack can be larger than 2 pages (e.g.  16KB on x86_64 or
      arm64).  If we create a thread (suppose its stack is charged to
      memory cgroup A) and then move it from memory cgroup A to memory
      cgroup B, the thread's kernel stack still holds a reference to memory
      cgroup A, so the thread pins memory cgroup A in memory even after
      cgroup A is removed.  The following script demonstrates the scenario:
      after it runs, the system has accumulated 500 dying cgroups.  (This
      is not a real-world issue, just a script showing that large kmallocs
      are charged as kmem pages which can pin a memory cgroup in memory.)
      
      	#!/bin/bash
      
      	cat /proc/cgroups | grep memory
      
      	cd /sys/fs/cgroup/memory
      	echo 1 > memory.move_charge_at_immigrate
      
      	for i in {1..500}
      	do
      		mkdir kmem_test
      		echo $$ > kmem_test/cgroup.procs
      		sleep 3600 &
      		echo $$ > cgroup.procs
      		echo `cat kmem_test/cgroup.procs` > cgroup.procs
      		rmdir kmem_test
      	done
      
      	cat /proc/cgroups | grep memory
      
      This patchset makes those kmem pages drop their reference to the
      memory cgroup by using the obj_cgroup APIs.  With it applied, the
      number of dying cgroups no longer increases when the above test
      script is run.
      
      This patch (of 7):
      
      rcu_read_lock()/rcu_read_unlock() can only guarantee that the memcg
      will not be freed; it cannot guarantee that the css_get() on the
      memcg (done in refill_stock() when the cached memcg changes)
      succeeds, because the reference count may already have dropped to
      zero.
      
        rcu_read_lock()
        memcg = obj_cgroup_memcg(old)
        __memcg_kmem_uncharge(memcg)
            refill_stock(memcg)
                if (stock->cached != memcg)
                    // css_get can change the ref counter from 0 back to 1.
                    css_get(&memcg->css)
        rcu_read_unlock()
      
      This fix is very similar to commit:
      
        eefbfa7f ("mm: memcg/slab: fix use after free in obj_cgroup_charge")
      
      Fix this by taking a reference to the memcg before it is passed to
      __memcg_kmem_uncharge().
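The safe pattern the fix relies on is css_tryget(): only take a reference if the count has not already hit zero, instead of blindly incrementing it.  Below is a minimal single-threaded user-space sketch of that idea in plain C; `struct obj` and its helpers are hypothetical stand-ins for struct mem_cgroup and the css_* APIs, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy object standing in for struct mem_cgroup; refcount 0 means "being freed". */
struct obj {
	int refcount;
};

/*
 * tryget: take a reference only if the object is still live.  Mirrors
 * css_tryget(): it never resurrects a refcount that has reached zero.
 */
static bool obj_tryget(struct obj *o)
{
	if (o->refcount == 0)
		return false;	/* lost the race with the freer */
	o->refcount++;
	return true;
}

static void obj_put(struct obj *o)
{
	o->refcount--;
}

/*
 * The buggy pattern did the equivalent of an unconditional refcount++
 * (css_get), which can take a dead object from 0 back to 1.  The fix is
 * to tryget *before* using the object, and bail out on failure.
 */
static bool use_object_safely(struct obj *o)
{
	if (!obj_tryget(o))
		return false;	/* object already freeing; do not touch it */
	/* ... the __memcg_kmem_uncharge()-style work would go here ... */
	obj_put(o);
	return true;
}
```

In the kernel, the failure case additionally reloads the memcg pointer under rcu_read_lock() and retries, since the objcg may have been reparented to a live ancestor.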
      
      Link: https://lkml.kernel.org/r/20210319163821.20704-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20210319163821.20704-2-songmuchun@bytedance.com
      Fixes: 3de7d4f2 ("mm: memcg/slab: optimize objcg stock draining")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  3. 19 Apr 2021, 2 commits
  4. 14 Apr 2021, 1 commit
  5. 09 Apr 2021, 3 commits
  6. 09 Mar 2021, 1 commit
    • Revert "mm: memcontrol: avoid workload stalls when lowering memory.high" · c0daedd0
      Johannes Weiner committed
      stable inclusion
      from stable-5.10.16
      commit dd0a41bc17bb9e934e401246ab2f8d269a49c6cf
      bugzilla: 48168
      
      --------------------------------
      
      commit e82553c1 upstream.
      
      This reverts commit 536d3bf2, as it can
      cause writers to memory.high to get stuck in the kernel forever,
      performing page reclaim and consuming excessive amounts of CPU cycles.
      
      Before the patch, a write to memory.high would first put the new limit
      in place for the workload, and then reclaim the requested delta.  After
      the patch, the kernel tries to reclaim the delta before putting the new
      limit into place, in order to not overwhelm the workload with a sudden,
      large excess over the limit.  However, if reclaim is actively racing
      with new allocations from the uncurbed workload, it can keep the write()
      working inside the kernel indefinitely.
      
      This is causing problems in Facebook production.  A privileged
      system-level daemon that adjusts memory.high for various workloads
      running on a host can get unexpectedly stuck in the kernel and
      essentially turn into a sort of involuntary kswapd for one of the
      workloads.  We've observed such a daemon busy-spinning in a write() for
      minutes at a time, neglecting its other duties on the system, and
      expending privileged system resources on behalf of a workload.
      
      To remedy this, we have first considered changing the reclaim logic to
      break out after a couple of loops - whether the workload has converged
      to the new limit or not - and bound the write() call this way.  However,
      the root cause that inspired the sequence change in the first place has
      been fixed through other means, and so a revert back to the proven
      limit-setting sequence, also used by memory.max, is preferable.
      
      The sequence was changed to avoid extreme latencies in the workload when
      the limit was lowered: the sudden, large excess created by the limit
      lowering would erroneously trigger the penalty sleeping code that is
      meant to throttle excessive growth from below.  Allocating threads could
      end up sleeping long after the write() had already reclaimed the delta
      for which they were being punished.
      
      However, erroneous throttling also caused problems in other scenarios at
      around the same time.  This resulted in commit b3ff9291 ("mm, memcg:
      reclaim more aggressively before high allocator throttling"), included
      in the same release as the offending commit.  When allocating threads
      now encounter large excess caused by a racing write() to memory.high,
      instead of entering punitive sleeps, they will simply be tasked with
      helping reclaim down the excess, and will be held no longer than it
      takes to accomplish that.  This is in line with regular limit
      enforcement - i.e.  if the workload allocates up against or over an
      otherwise unchanged limit from below.
      
      With the patch breaking userspace, and the root cause addressed by other
      means already, revert it again.
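The restored ordering (publish the new limit first, then reclaim the delta with a bounded number of attempts) can be sketched in plain, single-threaded C.  This is a toy model, not the kernel code: `toy_memcg`, `toy_reclaim()` and the retry constant stand in for page_counter_set_high(), try_to_free_mem_cgroup_pages() and MAX_RECLAIM_RETRIES.

```c
#include <assert.h>

#define TOY_RECLAIM_RETRIES 5	/* bound the work done inside write() */

struct toy_memcg {
	long usage;	/* current consumption, in pages */
	long high;	/* memory.high limit, in pages */
};

/* Stand-in for try_to_free_mem_cgroup_pages(): reclaims up to nr pages. */
static long toy_reclaim(struct toy_memcg *m, long nr)
{
	long reclaimed = nr < m->usage ? nr : m->usage;

	m->usage -= reclaimed;
	return reclaimed;
}

/*
 * memory.high write, in the order memory.max also uses:
 * 1) put the new limit in place immediately, so allocating threads in
 *    the workload are charged against it right away;
 * 2) reclaim the excess, but give up after a bounded number of tries
 *    instead of chasing a racing workload in the kernel forever.
 */
static void toy_write_high(struct toy_memcg *m, long new_high)
{
	int retries = TOY_RECLAIM_RETRIES;

	m->high = new_high;	/* limit takes effect before any reclaim */
	while (m->usage > m->high && retries--) {
		if (!toy_reclaim(m, m->usage - m->high))
			break;	/* no forward progress; stop looping */
	}
}
```

The key property is that the writer's loop is bounded: once the limit is published, any residual excess is handled by the workload's own allocators, not by the write() caller.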
      
      Link: https://lkml.kernel.org/r/20210122184341.292461-1-hannes@cmpxchg.org
      Fixes: 536d3bf2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tejun Heo <tj@kernel.org>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: <stable@vger.kernel.org>	[5.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
  7. 08 Feb 2021, 1 commit
    • mm: memcg/slab: optimize objcg stock draining · 8385f0f9
      Roman Gushchin committed
      stable inclusion
      from stable-5.10.11
      commit 26f54dac15640c65ec69867e182de7be708ea389
      bugzilla: 47621
      
      --------------------------------
      
      commit 3de7d4f2 upstream.
      
      Imran Khan reported a 16% regression in hackbench results caused by the
      commit f2fe7b09 ("mm: memcg/slab: charge individual slab objects
      instead of pages").  The regression is noticeable in the case of
      consecutive allocations of several relatively large slab objects, e.g.
      skb's.  As soon as the amount of stocked bytes exceeds PAGE_SIZE,
      drain_obj_stock() and __memcg_kmem_uncharge() are called, and it leads
      to a number of atomic operations in page_counter_uncharge().
      
      The corresponding call graph is below (provided by Imran Khan):
      
        |__alloc_skb
        |    |
        |    |__kmalloc_reserve.isra.61
        |    |    |
        |    |    |__kmalloc_node_track_caller
        |    |    |    |
        |    |    |    |slab_pre_alloc_hook.constprop.88
        |    |    |     obj_cgroup_charge
        |    |    |    |    |
        |    |    |    |    |__memcg_kmem_charge
        |    |    |    |    |    |
        |    |    |    |    |    |page_counter_try_charge
        |    |    |    |    |
        |    |    |    |    |refill_obj_stock
        |    |    |    |    |    |
        |    |    |    |    |    |drain_obj_stock.isra.68
        |    |    |    |    |    |    |
        |    |    |    |    |    |    |__memcg_kmem_uncharge
        |    |    |    |    |    |    |    |
        |    |    |    |    |    |    |    |page_counter_uncharge
        |    |    |    |    |    |    |    |    |
        |    |    |    |    |    |    |    |    |page_counter_cancel
        |    |    |    |
        |    |    |    |
        |    |    |    |__slab_alloc
        |    |    |    |    |
        |    |    |    |    |___slab_alloc
        |    |    |    |    |
        |    |    |    |slab_post_alloc_hook
      
      Instead of directly uncharging the accounted kernel memory, it's
      possible to refill the generic page-sized per-cpu stock instead.  It's a
      much faster operation, especially on a default hierarchy.  As a bonus,
      __memcg_kmem_uncharge_page() will also get faster, so the freeing of
      page-sized kernel allocations (e.g.  large kmallocs) will become faster.
      
      A similar change has been done earlier for the socket memory by the
      commit 475d0487 ("mm: memcontrol: use per-cpu stocks for socket
      memory uncharging").
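Stripped of the per-cpu and atomic machinery, the optimization is a batching scheme: instead of paying an atomic page_counter update per drain, park uncharged pages in a cached stock and touch the shared counter only when the stock exceeds a batch size.  A single-threaded toy sketch in plain C (all names hypothetical; plain longs instead of atomics and per-cpu variables):

```c
#include <assert.h>

#define STOCK_BATCH 64	/* flush threshold, like MEMCG_CHARGE_BATCH */

static long global_counter;	/* stands in for the shared page_counter  */
static long cpu_stock;		/* stands in for the per-cpu cached pages */
static long expensive_ops;	/* counts "atomic" shared-counter updates */

/* Expensive path: one shared-counter update per call. */
static void counter_uncharge(long nr)
{
	global_counter -= nr;
	expensive_ops++;
}

/*
 * Cheap path: park uncharged pages in the local stock.  The stock can
 * be consumed by later charges, and is flushed to the shared counter
 * only once it grows past the batch size.
 */
static void refill_stock(long nr)
{
	cpu_stock += nr;
	if (cpu_stock > STOCK_BATCH) {
		counter_uncharge(cpu_stock);
		cpu_stock = 0;
	}
}
```

Uncharging 100 pages one at a time costs 100 shared-counter updates on the direct path, but only one via the stock, which is the effect measured in the hackbench numbers above.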
      
      Link: https://lkml.kernel.org/r/20210106042239.2860107-1-guro@fb.com
      Fixes: f2fe7b09 ("mm: memcg/slab: charge individual slab objects instead of pages")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Imran Khan <imran.f.khan@oracle.com>
      Tested-by: Imran Khan <imran.f.khan@oracle.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
  8. 12 Jan 2021, 2 commits
  9. 23 Nov 2020, 1 commit
  10. 03 Nov 2020, 2 commits
  11. 19 Oct 2020, 5 commits
  12. 14 Oct 2020, 11 commits
  13. 27 Sep 2020, 1 commit
  14. 25 Sep 2020, 1 commit
  15. 06 Sep 2020, 1 commit
    • memcg: fix use-after-free in uncharge_batch · f1796544
      Michal Hocko committed
      syzbot has reported a use-after-free in the uncharge_batch path:
      
        BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
        BUG: KASAN: use-after-free in atomic64_sub_return include/asm-generic/atomic-instrumented.h:970 [inline]
        BUG: KASAN: use-after-free in atomic_long_sub_return include/asm-generic/atomic-long.h:113 [inline]
        BUG: KASAN: use-after-free in page_counter_cancel mm/page_counter.c:54 [inline]
        BUG: KASAN: use-after-free in page_counter_uncharge+0x3d/0xc0 mm/page_counter.c:155
        Write of size 8 at addr ffff8880371c0148 by task syz-executor.0/9304
      
        CPU: 0 PID: 9304 Comm: syz-executor.0 Not tainted 5.8.0-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
          __dump_stack lib/dump_stack.c:77 [inline]
          dump_stack+0x1f0/0x31e lib/dump_stack.c:118
          print_address_description+0x66/0x620 mm/kasan/report.c:383
          __kasan_report mm/kasan/report.c:513 [inline]
          kasan_report+0x132/0x1d0 mm/kasan/report.c:530
          check_memory_region_inline mm/kasan/generic.c:183 [inline]
          check_memory_region+0x2b5/0x2f0 mm/kasan/generic.c:192
          instrument_atomic_write include/linux/instrumented.h:71 [inline]
          atomic64_sub_return include/asm-generic/atomic-instrumented.h:970 [inline]
          atomic_long_sub_return include/asm-generic/atomic-long.h:113 [inline]
          page_counter_cancel mm/page_counter.c:54 [inline]
          page_counter_uncharge+0x3d/0xc0 mm/page_counter.c:155
          uncharge_batch+0x6c/0x350 mm/memcontrol.c:6764
          uncharge_page+0x115/0x430 mm/memcontrol.c:6796
          uncharge_list mm/memcontrol.c:6835 [inline]
          mem_cgroup_uncharge_list+0x70/0xe0 mm/memcontrol.c:6877
          release_pages+0x13a2/0x1550 mm/swap.c:911
          tlb_batch_pages_flush mm/mmu_gather.c:49 [inline]
          tlb_flush_mmu_free mm/mmu_gather.c:242 [inline]
          tlb_flush_mmu+0x780/0x910 mm/mmu_gather.c:249
          tlb_finish_mmu+0xcb/0x200 mm/mmu_gather.c:328
          exit_mmap+0x296/0x550 mm/mmap.c:3185
          __mmput+0x113/0x370 kernel/fork.c:1076
          exit_mm+0x4cd/0x550 kernel/exit.c:483
          do_exit+0x576/0x1f20 kernel/exit.c:793
          do_group_exit+0x161/0x2d0 kernel/exit.c:903
          get_signal+0x139b/0x1d30 kernel/signal.c:2743
          arch_do_signal+0x33/0x610 arch/x86/kernel/signal.c:811
          exit_to_user_mode_loop kernel/entry/common.c:135 [inline]
          exit_to_user_mode_prepare+0x8d/0x1b0 kernel/entry/common.c:166
          syscall_exit_to_user_mode+0x5e/0x1a0 kernel/entry/common.c:241
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Commit 1a3e1f40 ("mm: memcontrol: decouple reference counting from
      page accounting") reworked the memcg lifetime to be bound to the
      struct page rather than to charges.  It also removed css_put_many()
      from uncharge_batch(), and that is causing the above splat.
      
      uncharge_batch() is supposed to uncharge accumulated charges for all
      pages freed from the same memcg.  The queuing is done by uncharge_page
      which however drops the memcg reference after it adds charges to the
      batch.  If the current page happens to be the last one holding the
      reference for its memcg then the memcg is OK to go and the next page to
      be freed will trigger batched uncharge which needs to access the memcg
      which is gone already.
      
      Fix the issue by taking a reference for the memcg in the current batch.
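In outline, the fix gives the batch its own reference: take one when a memcg becomes the batch's current memcg, and drop it only after the flush has finished using the memcg.  A single-threaded toy sketch of that ownership rule in plain C follows; `struct toy_memcg` and the helpers are hypothetical stand-ins for the real css_get()/css_put() and struct uncharge_gather.

```c
#include <assert.h>
#include <stddef.h>

struct toy_memcg {
	int refcount;
	long uncharged;	/* pages uncharged against this memcg */
};

struct uncharge_gather {
	struct toy_memcg *memcg;	/* batch's current memcg */
	long nr_pages;			/* charges accumulated so far */
};

static void toy_css_get(struct toy_memcg *m) { m->refcount++; }
static void toy_css_put(struct toy_memcg *m) { m->refcount--; }

/* Flush the batch: this *uses* the memcg, so it must still be alive. */
static void toy_uncharge_batch(struct uncharge_gather *ug)
{
	ug->memcg->uncharged += ug->nr_pages;
	toy_css_put(ug->memcg);		/* drop the batch's own reference */
	ug->memcg = NULL;
	ug->nr_pages = 0;
}

/*
 * Queue one page.  When a memcg starts a new batch, the batch takes its
 * own reference, so each page's reference can be dropped immediately,
 * even if one of those pages was the last holder.
 */
static void toy_uncharge_page(struct uncharge_gather *ug,
			      struct toy_memcg *memcg)
{
	if (ug->memcg != memcg) {
		if (ug->memcg)
			toy_uncharge_batch(ug);
		ug->memcg = memcg;
		toy_css_get(memcg);	/* the fix: batch pins the memcg */
	}
	ug->nr_pages++;
	toy_css_put(memcg);		/* page's own reference goes away */
}
```

Without the batch's reference, the two `toy_css_put()` calls from the pages could take the count to zero before the flush runs, which is exactly the window KASAN caught above.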
      
      Fixes: 1a3e1f40 ("mm: memcontrol: decouple reference counting from page accounting")
      Reported-by: syzbot+b305848212deec86eabe@syzkaller.appspotmail.com
      Reported-by: syzbot+b5ea6fb6f139c8b9482b@syzkaller.appspotmail.com
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: https://lkml.kernel.org/r/20200820090341.GC5033@dhcp22.suse.cz
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 15 Aug 2020, 1 commit
  17. 14 Aug 2020, 1 commit
    • mm: memcontrol: fix warning when allocating the root cgroup · 9f457179
      Johannes Weiner committed
      Commit 3e38e0aa ("mm: memcg: charge memcg percpu memory to the
      parent cgroup") adds memory tracking to the memcg kernel structures
      themselves to make cgroups liable for the memory they are consuming
      through the allocation of child groups (which can be significant).
      
      This code is a bit awkward as it's spread out through several functions:
      The outermost function does memalloc_use_memcg(parent) to set up
      current->active_memcg, which designates which cgroup to charge, and the
      inner functions pass GFP_ACCOUNT to request charging for specific
      allocations.  To make sure this dependency is satisfied at all times -
      to make sure we don't randomly charge whoever is calling the functions -
      the inner functions warn on !current->active_memcg.
      
      However, this triggers a false warning when the root memcg itself is
      allocated.  No parent exists in this case, and so current->active_memcg
      is rightfully NULL.  It's a false positive, not indicative of a bug.
      
      Delete the warnings for now, we can revisit this later.
      
      Fixes: 3e38e0aa ("mm: memcg: charge memcg percpu memory to the parent cgroup")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 13 Aug 2020, 3 commits