1. 22 Jul, 2018 1 commit
    • mm: memcg: fix use after free in mem_cgroup_iter() · 9f15bde6
      Jing Xia committed
      It was reported that a kernel crash happened in mem_cgroup_iter(), which
      can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.
      
      Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
      ......
      Call trace:
        mem_cgroup_iter+0x2e0/0x6d4
        shrink_zone+0x8c/0x324
        balance_pgdat+0x450/0x640
        kswapd+0x130/0x4b8
        kthread+0xe8/0xfc
        ret_from_fork+0x10/0x20
      
        mem_cgroup_iter():
            ......
            if (css_tryget(css))    <-- crash here
      	    break;
            ......
      
      The crashing reason is that mem_cgroup_iter() uses the memcg object whose
      pointer is stored in iter->position, which has been freed before and
      filled with POISON_FREE(0x6b).
      
      And the root cause of the use-after-free issue is that
      invalidate_reclaim_iterators() fails to reset the value of iter->position
      to NULL when the css of the memcg is released in non-hierarchical mode.
      
      Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
      Fixes: 6df38689 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
      Signed-off-by: Jing Xia <jing.xia.mail@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <chunyan.zhang@unisoc.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
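The root cause described above can be sketched with a toy user-space model. This is not the kernel code: the structures and function names below are invented for illustration, and the "fixed" variant only shows one plausible way to make sure the root's cached position is cleared when the parent walk never reaches the root, as happens in non-hierarchical mode.

```c
#include <stddef.h>

/* Toy stand-in for a memcg; all names here are hypothetical. */
struct toy_memcg {
    struct toy_memcg *parent;        /* NULL at the root, or when hierarchy is off */
    struct toy_memcg *iter_position; /* models the cached iter->position */
};

/* Drop the cached position in 'from' if it points at the dying memcg. */
static void toy_invalidate_one(struct toy_memcg *from,
                               const struct toy_memcg *dead)
{
    if (from->iter_position == dead)
        from->iter_position = NULL;
}

/* Problematic variant: only the dying memcg and its ancestors are visited. */
static void toy_invalidate_ancestors_only(struct toy_memcg *dead)
{
    for (struct toy_memcg *m = dead; m != NULL; m = m->parent)
        toy_invalidate_one(m, dead);
}

/* Safer variant: if the walk did not end at the root, visit the root too. */
static void toy_invalidate_with_root(struct toy_memcg *dead,
                                     struct toy_memcg *root)
{
    struct toy_memcg *last = NULL;

    for (struct toy_memcg *m = dead; m != NULL; m = m->parent) {
        toy_invalidate_one(m, dead);
        last = m;
    }
    if (last != root)
        toy_invalidate_one(root, dead);
}
```

In this model, a child without a parent link leaves the root's cached pointer dangling after the first variant runs, which corresponds to the stale iter->position that mem_cgroup_iter() later dereferences.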
  2. 15 Jun, 2018 2 commits
  3. 08 Jun, 2018 11 commits
  4. 26 May, 2018 1 commit
  5. 21 Apr, 2018 1 commit
    • mm: memcg: add __GFP_NOWARN in __memcg_schedule_kmem_cache_create() · c892fd82
      Minchan Kim committed
      If there is heavy memory pressure, page allocation with __GFP_NOWAIT
      fails easily even though it is an order-0 request.  I got the warning
      below 9 times during a normal boot.
      
           <snip >: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
           .. snip ..
           Call trace:
             dump_backtrace+0x0/0x4
             dump_stack+0xa4/0xc0
             warn_alloc+0xd4/0x15c
             __alloc_pages_nodemask+0xf88/0x10fc
             alloc_slab_page+0x40/0x18c
             new_slab+0x2b8/0x2e0
             ___slab_alloc+0x25c/0x464
             __kmalloc+0x394/0x498
             memcg_kmem_get_cache+0x114/0x2b8
             kmem_cache_alloc+0x98/0x3e8
             mmap_region+0x3bc/0x8c0
             do_mmap+0x40c/0x43c
             vm_mmap_pgoff+0x15c/0x1e4
             sys_mmap+0xb0/0xc8
             el0_svc_naked+0x24/0x28
           Mem-Info:
           active_anon:17124 inactive_anon:193 isolated_anon:0
            active_file:7898 inactive_file:712955 isolated_file:55
            unevictable:0 dirty:27 writeback:18 unstable:0
            slab_reclaimable:12250 slab_unreclaimable:23334
            mapped:19310 shmem:212 pagetables:816 bounce:0
            free:36561 free_pcp:1205 free_cma:35615
           Node 0 active_anon:68496kB inactive_anon:772kB active_file:31592kB inactive_file:2851820kB unevictable:0kB isolated(anon):0kB isolated(file):220kB mapped:77240kB dirty:108kB writeback:72kB shmem:848kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
           DMA free:142188kB min:3056kB low:3820kB high:4584kB active_anon:10052kB inactive_anon:12kB active_file:312kB inactive_file:1412620kB unevictable:0kB writepending:0kB present:1781412kB managed:1604728kB mlocked:0kB slab_reclaimable:3592kB slab_unreclaimable:876kB kernel_stack:400kB pagetables:52kB bounce:0kB free_pcp:1436kB local_pcp:124kB free_cma:142492kB
           lowmem_reserve[]: 0 1842 1842
           Normal free:4056kB min:4172kB low:5212kB high:6252kB active_anon:58376kB inactive_anon:760kB active_file:31348kB inactive_file:1439040kB unevictable:0kB writepending:180kB present:2000636kB managed:1923688kB mlocked:0kB slab_reclaimable:45408kB slab_unreclaimable:92460kB kernel_stack:9680kB pagetables:3212kB bounce:0kB free_pcp:3392kB local_pcp:688kB free_cma:0kB
           lowmem_reserve[]: 0 0 0
           DMA: 0*4kB 0*8kB 1*16kB (C) 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB (C) 0*1024kB 1*2048kB (C) 34*4096kB (C) = 142096kB
           Normal: 228*4kB (UMEH) 172*8kB (UMH) 23*16kB (UH) 24*32kB (H) 5*64kB (H) 1*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3872kB
           721350 total pagecache pages
           0 pages in swap cache
           Swap cache stats: add 0, delete 0, find 0/0
           Free swap  = 0kB
           Total swap = 0kB
           945512 pages RAM
           0 pages HighMem/MovableOnly
           63408 pages reserved
           51200 pages cma reserved
      
      __memcg_schedule_kmem_cache_create() tries to create a shadow slab cache
      and the worker allocation failure is not really critical because we will
      retry on the next kmem charge.  We might miss some charges but that
      shouldn't be critical.  The excessive allocation failure report is not
      very helpful.
      
      [mhocko@kernel.org: changelog update]
      Link: http://lkml.kernel.org/r/20180418022912.248417-1-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 12 Apr, 2018 4 commits
  7. 29 Mar, 2018 1 commit
  8. 12 Feb, 2018 1 commit
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds committed
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
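The note about POLL* and EPOLL* values can be checked from user space on a Linux build host. The sketch below is only an illustration: the table and helper names are invented, it looks at the user-space header values (not the in-kernel ones the commit touches), and it covers only five basic flags that are available without extra feature macros.

```c
#include <poll.h>
#include <stddef.h>
#include <sys/epoll.h>

/* One POLL and EPOLL pair to compare; the struct is hypothetical. */
struct flag_pair {
    const char *name;
    unsigned int poll_value;
    unsigned int epoll_value;
};

/* A small subset of the flags handled by the replacement script. */
static const struct flag_pair basic_flags[] = {
    { "IN",  POLLIN,  EPOLLIN  },
    { "OUT", POLLOUT, EPOLLOUT },
    { "PRI", POLLPRI, EPOLLPRI },
    { "ERR", POLLERR, EPOLLERR },
    { "HUP", POLLHUP, EPOLLHUP },
};

/* Count how many pairs have different numeric values on this platform. */
static size_t count_mismatches(const struct flag_pair *pairs, size_t n)
{
    size_t mismatches = 0;

    for (size_t i = 0; i < n; i++) {
        if (pairs[i].poll_value != pairs[i].epoll_value)
            mismatches++;
    }
    return mismatches;
}
```

Passing `basic_flags` and its element count to `count_mismatches()` reports how many of these pairs disagree on the build platform. Extending the table to the remaining flags from the script would be the way to look for the architecture-specific differences the commit message warns about.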
  9. 07 Feb, 2018 2 commits
  10. 03 Feb, 2018 1 commit
    • Revert "defer call to mem_cgroup_sk_alloc()" · edbe69ef
      Roman Gushchin committed
      This patch effectively reverts commit 9f1c2674 ("net: memcontrol:
      defer call to mem_cgroup_sk_alloc()").
      
      Moving mem_cgroup_sk_alloc() to the inet_csk_accept() completely breaks
      memcg socket memory accounting, as packets received before memcg
      pointer initialization are not accounted and are causing refcounting
      underflow on socket release.
      
      Actually the free-after-use problem was fixed by
      commit c0576e39 ("net: call cgroup_sk_alloc() earlier in
      sk_clone_lock()") for the cgroup pointer.
      
      So, let's revert it and call mem_cgroup_sk_alloc() just before
      cgroup_sk_alloc(). This is safe, as we hold a reference to the socket
      we're cloning, and it holds a reference to the memcg.
      
      Also, let's drop BUG_ON(mem_cgroup_is_root()) check from
      mem_cgroup_sk_alloc(). I see no reasons why bumping the root
      memcg counter is a good reason to panic, and there are no realistic
      ways to hit it.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
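The accounting breakage described above can be pictured with a deliberately simplified model. Everything below is hypothetical (the types, the field names, and the page-based bookkeeping); it only illustrates how skipping the charge while the memcg pointer is still unset, and then uncharging everything on release, drives a counter below zero.

```c
#include <stddef.h>

/* Toy counter standing in for a memcg's socket memory accounting. */
struct toy_memcg {
    long charged_pages;
};

/* Toy socket: memcg stays NULL until the (possibly deferred) initialization. */
struct toy_sock {
    struct toy_memcg *memcg;
    long rmem_pages;
};

/* Receive path: memory is charged only if the memcg pointer is already set. */
static void toy_sock_receive(struct toy_sock *sk, long pages)
{
    sk->rmem_pages += pages;
    if (sk->memcg != NULL)
        sk->memcg->charged_pages += pages;
}

/* Release path: everything the socket holds is uncharged. */
static void toy_sock_release(struct toy_sock *sk)
{
    if (sk->memcg != NULL)
        sk->memcg->charged_pages -= sk->rmem_pages;
    sk->rmem_pages = 0;
}
```

If the memcg pointer is assigned only after some data has been received, the release path subtracts pages that were never added, which is the underflow the revert is meant to avoid by initializing the pointer early.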
  11. 01 Feb, 2018 5 commits
  12. 30 Nov, 2017 1 commit
  13. 28 Nov, 2017 1 commit
  14. 16 Nov, 2017 1 commit
  15. 10 Oct, 2017 1 commit
  16. 04 Oct, 2017 2 commits
    • mm/memcg: avoid page count check for zone device · 3f2eb028
      Jérôme Glisse committed
      Fix for 4.14: zone device pages always have an elevated refcount of one,
      and thus the page count sanity check in uncharge_page() is inappropriate
      for them.
      
      [mhocko@suse.com: nano-optimize VM_BUG_ON in uncharge_page]
      Link: http://lkml.kernel.org/r/20170914190011.5217-1-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: remove hotplug locking from try_charge · 72f0184c
      Michal Hocko committed
      The following lockdep splat has been noticed during LTP testing
      
        ======================================================
        WARNING: possible circular locking dependency detected
        4.13.0-rc3-next-20170807 #12 Not tainted
        ------------------------------------------------------
        a.out/4771 is trying to acquire lock:
         (cpu_hotplug_lock.rw_sem){++++++}, at: [<ffffffff812b4668>] drain_all_stock.part.35+0x18/0x140
      
        but task is already holding lock:
         (&mm->mmap_sem){++++++}, at: [<ffffffff8106eb35>] __do_page_fault+0x175/0x530
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (&mm->mmap_sem){++++++}:
               lock_acquire+0xc9/0x230
               __might_fault+0x70/0xa0
               _copy_to_user+0x23/0x70
               filldir+0xa7/0x110
               xfs_dir2_sf_getdents.isra.10+0x20c/0x2c0 [xfs]
               xfs_readdir+0x1fa/0x2c0 [xfs]
               xfs_file_readdir+0x30/0x40 [xfs]
               iterate_dir+0x17a/0x1a0
               SyS_getdents+0xb0/0x160
               entry_SYSCALL_64_fastpath+0x1f/0xbe
      
        -> #2 (&type->i_mutex_dir_key#3){++++++}:
               lock_acquire+0xc9/0x230
               down_read+0x51/0xb0
               lookup_slow+0xde/0x210
               walk_component+0x160/0x250
               link_path_walk+0x1a6/0x610
               path_openat+0xe4/0xd50
               do_filp_open+0x91/0x100
               file_open_name+0xf5/0x130
               filp_open+0x33/0x50
               kernel_read_file_from_path+0x39/0x80
               _request_firmware+0x39f/0x880
               request_firmware_direct+0x37/0x50
               request_microcode_fw+0x64/0xe0
               reload_store+0xf7/0x180
               dev_attr_store+0x18/0x30
               sysfs_kf_write+0x44/0x60
               kernfs_fop_write+0x113/0x1a0
               __vfs_write+0x37/0x170
               vfs_write+0xc7/0x1c0
               SyS_write+0x58/0xc0
               do_syscall_64+0x6c/0x1f0
               return_from_SYSCALL_64+0x0/0x7a
      
        -> #1 (microcode_mutex){+.+.+.}:
               lock_acquire+0xc9/0x230
               __mutex_lock+0x88/0x960
               mutex_lock_nested+0x1b/0x20
               microcode_init+0xbb/0x208
               do_one_initcall+0x51/0x1a9
               kernel_init_freeable+0x208/0x2a7
               kernel_init+0xe/0x104
               ret_from_fork+0x2a/0x40
      
        -> #0 (cpu_hotplug_lock.rw_sem){++++++}:
               __lock_acquire+0x153c/0x1550
               lock_acquire+0xc9/0x230
               cpus_read_lock+0x4b/0x90
               drain_all_stock.part.35+0x18/0x140
               try_charge+0x3ab/0x6e0
               mem_cgroup_try_charge+0x7f/0x2c0
               shmem_getpage_gfp+0x25f/0x1050
               shmem_fault+0x96/0x200
               __do_fault+0x1e/0xa0
               __handle_mm_fault+0x9c3/0xe00
               handle_mm_fault+0x16e/0x380
               __do_page_fault+0x24a/0x530
               do_page_fault+0x30/0x80
               page_fault+0x28/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          cpu_hotplug_lock.rw_sem --> &type->i_mutex_dir_key#3 --> &mm->mmap_sem
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&mm->mmap_sem);
                                       lock(&type->i_mutex_dir_key#3);
                                       lock(&mm->mmap_sem);
          lock(cpu_hotplug_lock.rw_sem);
      
         *** DEADLOCK ***
      
        2 locks held by a.out/4771:
         #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8106eb35>] __do_page_fault+0x175/0x530
         #1:  (percpu_charge_mutex){+.+...}, at: [<ffffffff812b4c97>] try_charge+0x397/0x6e0
      
      The problem is very similar to the one fixed by commit a459eeb7
      ("mm, page_alloc: do not depend on cpu hotplug locks inside the
      allocator").  We are taking hotplug locks while we can be sitting on top
      of basically arbitrary locks.  This just calls for problems.
      
      We can get rid of {get,put}_online_cpus, fortunately.  We do not have to
      be worried about races with memory hotplug because drain_local_stock,
      which is called from both the WQ draining and the memory hotplug
      contexts, is always operating on the local cpu stock with IRQs disabled.
      
      The only thing to be careful about is that the target memcg doesn't
      vanish while we are still in drain_all_stock so take a reference on it.
      
      Link: http://lkml.kernel.org/r/20170913090023.28322-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Artem Savkov <asavkov@redhat.com>
      Tested-by: Artem Savkov <asavkov@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 09 Sep, 2017 4 commits
    • mem/memcg: cache rightmost node · fa90b2fd
      Davidlohr Bueso committed
      Such that we can optimize __mem_cgroup_largest_soft_limit_node().  The
      only overhead is the extra footprint for the cached pointer, but this
      should not be an issue for mem_cgroup_tree_per_node.
      
      [dave@stgolabs.net: brain fart #2]
        Link: http://lkml.kernel.org/r/20170731160114.GE21328@linux-80c1.suse
      Link: http://lkml.kernel.org/r/20170719014603.19029-17-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
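The idea of caching the rightmost node can be sketched with a plain, unbalanced binary search tree. This is an assumption-laden illustration, not the kernel's rbtree code: the types and function names are made up, rebalancing is left out, and removal (which would have to move the cached pointer to the predecessor) is omitted.

```c
#include <stddef.h>

/* Toy tree node keyed by an "excess" value; names are hypothetical. */
struct toy_node {
    unsigned long key;
    struct toy_node *left;
    struct toy_node *right;
};

/* Tree with one extra pointer: the cached rightmost (largest) node. */
struct toy_tree {
    struct toy_node *root;
    struct toy_node *rightmost;
};

/* Insert a node; it becomes the cached rightmost only if the descent
 * never turned left on the way down. */
static void toy_insert(struct toy_tree *tree, struct toy_node *node)
{
    struct toy_node **link = &tree->root;
    int is_rightmost = 1;

    node->left = NULL;
    node->right = NULL;
    while (*link != NULL) {
        if (node->key < (*link)->key) {
            link = &(*link)->left;
            is_rightmost = 0;
        } else {
            link = &(*link)->right;
        }
    }
    *link = node;
    if (is_rightmost)
        tree->rightmost = node;
}

/* Constant-time lookup of the largest node via the cached pointer. */
static struct toy_node *toy_largest(const struct toy_tree *tree)
{
    return tree->rightmost;
}
```

The trade-off matches the commit message: one extra pointer per tree buys a lookup of the largest element without walking down the right spine each time.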
    • mm: memcontrol: use per-cpu stocks for socket memory uncharging · 475d0487
      Roman Gushchin committed
      We've observed a noticeable performance overhead on some hosts with
      significant network traffic when socket memory accounting is enabled.
      
      Perf top shows that socket memory uncharging path is hot:
        2.13%  [kernel]                [k] page_counter_cancel
        1.14%  [kernel]                [k] __sk_mem_reduce_allocated
        1.14%  [kernel]                [k] _raw_spin_lock
        0.87%  [kernel]                [k] _raw_spin_lock_irqsave
        0.84%  [kernel]                [k] tcp_ack
        0.84%  [kernel]                [k] ixgbe_poll
        0.83%  < workload >
        0.82%  [kernel]                [k] enqueue_entity
        0.68%  [kernel]                [k] __fget
        0.68%  [kernel]                [k] tcp_delack_timer_handler
        0.67%  [kernel]                [k] __schedule
        0.60%  < workload >
        0.59%  [kernel]                [k] __inet6_lookup_established
        0.55%  [kernel]                [k] __switch_to
        0.55%  [kernel]                [k] menu_select
        0.54%  libc-2.20.so            [.] __memcpy_avx_unaligned
      
      To address this issue, the existing per-cpu stock infrastructure can be
      used.
      
      refill_stock() can be called from mem_cgroup_uncharge_skmem() to move
      charge to a per-cpu stock instead of calling atomic
      page_counter_uncharge().
      
      To prevent the uncontrolled growth of per-cpu stocks, refill_stock()
      will explicitly drain the cached charge, if the cached value exceeds
      CHARGE_BATCH.
      
      This allows us to significantly optimize the load:
        1.21%  [kernel]                [k] _raw_spin_lock
        1.01%  [kernel]                [k] ixgbe_poll
        0.92%  [kernel]                [k] _raw_spin_lock_irqsave
        0.90%  [kernel]                [k] enqueue_entity
        0.86%  [kernel]                [k] tcp_ack
        0.85%  < workload >
        0.74%  perf-11120.map          [.] 0x000000000061bf24
        0.73%  [kernel]                [k] __schedule
        0.67%  [kernel]                [k] __fget
        0.63%  [kernel]                [k] __inet6_lookup_established
        0.62%  [kernel]                [k] menu_select
        0.59%  < workload >
        0.59%  [kernel]                [k] __switch_to
        0.57%  libc-2.20.so            [.] __memcpy_avx_unaligned
      
      Link: http://lkml.kernel.org/r/20170829100150.4580-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
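The batching behaviour described above (cache uncharged pages in a stock, and drain once the cached amount exceeds CHARGE_BATCH) can be modelled in a few lines. This is a single-threaded sketch with invented names; the batch size of 32 is an assumption made for the example, and the real per-cpu and atomic details are left out.

```c
#include <stddef.h>

/* Assumed batch size for this sketch only. */
#define TOY_CHARGE_BATCH 32U

/* Stand-in for the shared page counter that is expensive to update. */
struct toy_counter {
    unsigned long usage;
};

/* Stand-in for one CPU's stock: pages still charged to 'cached'. */
struct toy_stock {
    struct toy_counter *cached;
    unsigned int nr_pages;
};

/* Return the cached pages to the shared counter in one update. */
static void toy_drain_stock(struct toy_stock *stock)
{
    if (stock->cached != NULL && stock->nr_pages > 0)
        stock->cached->usage -= stock->nr_pages;
    stock->nr_pages = 0;
    stock->cached = NULL;
}

/* Uncharge path: park pages in the stock instead of touching the shared
 * counter, and drain only when the stock grows past the batch size. */
static void toy_refill_stock(struct toy_stock *stock,
                             struct toy_counter *counter,
                             unsigned int nr_pages)
{
    if (stock->cached != counter) {
        toy_drain_stock(stock);
        stock->cached = counter;
    }
    stock->nr_pages += nr_pages;
    if (stock->nr_pages > TOY_CHARGE_BATCH)
        toy_drain_stock(stock);
}
```

In this model most uncharges only bump a local field, and the shared counter is written once per batch, which is the effect the perf profiles above are meant to show.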
    • mm/device-public-memory: device memory cache coherent with CPU · df6ad698
      Jérôme Glisse committed
      Platforms with an advanced system bus (like CAPI or CCIX) allow device
      memory to be accessible from the CPU in a cache-coherent fashion.  Add a
      new type of ZONE_DEVICE to represent such memory.  The use cases are the
      same as for the un-addressable device memory but without all the corner
      cases.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol: support MEMORY_DEVICE_PRIVATE · c733a828
      Jérôme Glisse committed
      HMM pages (private or public device pages) are ZONE_DEVICE pages and thus
      need special handling when it comes to LRU or refcount.  This patch makes
      sure that memcontrol properly handles them when it encounters them.  Those
      pages are used like regular pages in a process address space, either as
      anonymous pages or as file-backed pages.  So from the memcg point of view
      we want to handle them like regular pages, for now at least.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-11-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>