1. 18 Aug, 2018 1 commit
    • S
      fs: fsnotify: account fsnotify metadata to kmemcg · d46eb14b
      Committed by Shakeel Butt
      Patch series "Directed kmem charging", v8.
      
      The Linux kernel's memory cgroup allows limiting the memory usage of the
      jobs running on the system to provide isolation between the jobs.  All
      the kernel memory allocated in the context of the job and marked with
      __GFP_ACCOUNT will also be included in the memory usage and be limited
      by the job's limit.
      
      The kernel memory can only be charged to the memcg of the process in
      whose context kernel memory was allocated.  However, there are cases
      where the allocated kernel memory should be charged to a memcg
      different from the current process's memcg.  This patch series
      contains two such concrete use-cases, i.e. fsnotify and buffer_head.
      
      The fsnotify event objects can consume a lot of system memory for large
      or unlimited queues if the listener is absent or slow.  The events
      are allocated in the context of the event producer.  However, they should
      be charged to the event consumer.  Similarly, the buffer_head objects can
      be allocated in a memcg different from the memcg of the page for which
      buffer_head objects are being allocated.
      
      To solve this issue, this patch series introduces a mechanism to charge
      kernel memory to a given memcg.  In the case of fsnotify events, the memcg
      of the consumer can be used for charging and for buffer_head, the memcg
      of the page can be charged.  For directed charging, the caller can use
      the scope API memalloc_[un]use_memcg() to specify the memcg to charge
      for all the __GFP_ACCOUNT allocations within the scope.
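
The scope idea described above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct memcg`, `SKETCH_ACCOUNT`, `sketch_use_memcg()`, `sketch_unuse_memcg()` and `sketch_alloc()` are hypothetical stand-ins, not the kernel's definitions, and the changelog does not show the real signatures of memalloc_[un]use_memcg().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a memory cgroup. */
struct memcg {
    const char *name;
    size_t charged;                 /* bytes charged so far */
};

#define SKETCH_ACCOUNT 0x1u         /* stand-in for __GFP_ACCOUNT */

static struct memcg *current_memcg; /* memcg of the "running task" */
static struct memcg *active_memcg;  /* override installed by the scope */

/* Open a scope: accounted allocations are charged to @memcg. */
static void sketch_use_memcg(struct memcg *memcg)
{
    assert(active_memcg == NULL);   /* nesting is not modeled here */
    active_memcg = memcg;
}

/* Close the scope: charging falls back to the current task's memcg. */
static void sketch_unuse_memcg(void)
{
    active_memcg = NULL;
}

/* Allocate and, if the accounting flag is set, charge the chosen memcg. */
static void *sketch_alloc(size_t size, unsigned int flags)
{
    void *p = malloc(size);

    if (p && (flags & SKETCH_ACCOUNT)) {
        struct memcg *target = active_memcg ? active_memcg : current_memcg;

        if (target)
            target->charged += size;
    }
    return p;
}
```

In this sketch a producer would call `sketch_use_memcg(&listener)` before allocating an event and `sketch_unuse_memcg()` afterwards, so that only the accounted allocations inside the scope are charged to the listener.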
      
      This patch (of 2):
      
      A lot of memory can be consumed by the events generated for huge or
      unlimited queues if the listener is absent or slow.  This can cause
      system-level memory pressure or OOMs.  So, it's better to account the
      fsnotify kmem caches to the memcg of the listener.
      
      However, the listener can be in a different memcg than the
      producer, and these allocations happen in the context of the event
      producer.  This patch introduces a remote memcg charging API which the
      producer can use to charge the allocations to the memcg of the listener.
      
      There are seven fsnotify kmem caches, and among them allocations from
      dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
      inotify_inode_mark_cachep happen in the context of a syscall from the
      listener.  So, SLAB_ACCOUNT is enough for these caches.
      
      The objects from fsnotify_mark_connector_cachep are not accounted, as
      they are small compared to the notification marks or events and it is
      unclear to whom the connector should be accounted, since it is shared
      by all events attached to the inode.
      
      The allocations from the event caches happen in the context of the event
      producer.  For such caches we will need to remote charge the allocations
      to the listener's memcg.  Thus we save the memcg reference in the
      fsnotify_group structure of the listener.
      
      This patch also rearranges the members of fsnotify_group, filling
      existing holes so that the structure size stays the same, at least
      for 64-bit builds, even with the additional member.
      
      [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
        Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 03 Aug, 2018 1 commit
  3. 22 Jul, 2018 1 commit
    • J
      mm: memcg: fix use after free in mem_cgroup_iter() · 9f15bde6
      Committed by Jing Xia
      It was reported that a kernel crash happened in mem_cgroup_iter(), which
      can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.
      
      Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
      ......
      Call trace:
        mem_cgroup_iter+0x2e0/0x6d4
        shrink_zone+0x8c/0x324
        balance_pgdat+0x450/0x640
        kswapd+0x130/0x4b8
        kthread+0xe8/0xfc
        ret_from_fork+0x10/0x20
      
        mem_cgroup_iter():
            ......
            if (css_tryget(css))    <-- crash here
      	    break;
            ......
      
      The crash occurs because mem_cgroup_iter() uses the memcg object whose
      pointer is stored in iter->position, which has already been freed and
      filled with POISON_FREE (0x6b).
      
      The root cause of the use-after-free is that
      invalidate_reclaim_iterators() fails to reset the value of iter->position
      to NULL when the css of the memcg is released in non-hierarchical mode.
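
The invariant behind this fix can be sketched in userspace C: every cached iterator position that points at an object must be cleared before that object is freed. This is a simplified illustration, not the kernel code. `struct node`, `struct iter`, `invalidate_iterators()` and `iter_resume()` are hypothetical names, and the flat array of iterators is an assumption that ignores how the kernel actually organizes its reclaim iterators.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITERS 4

/* Hypothetical object that iterators may remember between calls. */
struct node {
    int id;
};

/* An iterator caches the last visited object so it can resume later. */
struct iter {
    struct node *position;
};

static struct iter iters[MAX_ITERS];

/*
 * Must run before @dead is freed: clear every cached position that
 * still refers to it. Returns how many iterators were reset.
 */
static int invalidate_iterators(const struct node *dead)
{
    int cleared = 0;

    assert(dead != NULL);
    for (size_t i = 0; i < MAX_ITERS; i++) {
        if (iters[i].position == dead) {
            iters[i].position = NULL;
            cleared++;
        }
    }
    return cleared;
}

/* Resume from the cached position, or restart from @first if it was reset. */
static struct node *iter_resume(const struct iter *it, struct node *first)
{
    return it->position ? it->position : first;
}
```

If any iterator is skipped during invalidation, a later `iter_resume()` would hand back a dangling pointer, which is the class of bug the changelog describes.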
      
      Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
      Fixes: 6df38689 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
      Signed-off-by: Jing Xia <jing.xia.mail@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <chunyan.zhang@unisoc.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 09 Jul, 2018 1 commit
  5. 15 Jun, 2018 2 commits
  6. 08 Jun, 2018 11 commits
  7. 26 May, 2018 1 commit
  8. 21 Apr, 2018 1 commit
    • M
      mm: memcg: add __GFP_NOWARN in __memcg_schedule_kmem_cache_create() · c892fd82
      Committed by Minchan Kim
      Under heavy memory pressure, page allocation with __GFP_NOWAIT
      fails easily even though it is an order-0 request.  I got the warning
      below 9 times during a normal boot.
      
           <snip >: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
           .. snip ..
           Call trace:
             dump_backtrace+0x0/0x4
             dump_stack+0xa4/0xc0
             warn_alloc+0xd4/0x15c
             __alloc_pages_nodemask+0xf88/0x10fc
             alloc_slab_page+0x40/0x18c
             new_slab+0x2b8/0x2e0
             ___slab_alloc+0x25c/0x464
             __kmalloc+0x394/0x498
             memcg_kmem_get_cache+0x114/0x2b8
             kmem_cache_alloc+0x98/0x3e8
             mmap_region+0x3bc/0x8c0
             do_mmap+0x40c/0x43c
             vm_mmap_pgoff+0x15c/0x1e4
             sys_mmap+0xb0/0xc8
             el0_svc_naked+0x24/0x28
           Mem-Info:
           active_anon:17124 inactive_anon:193 isolated_anon:0
            active_file:7898 inactive_file:712955 isolated_file:55
            unevictable:0 dirty:27 writeback:18 unstable:0
            slab_reclaimable:12250 slab_unreclaimable:23334
            mapped:19310 shmem:212 pagetables:816 bounce:0
            free:36561 free_pcp:1205 free_cma:35615
           Node 0 active_anon:68496kB inactive_anon:772kB active_file:31592kB inactive_file:2851820kB unevictable:0kB isolated(anon):0kB isolated(file):220kB mapped:77240kB dirty:108kB writeback:72kB shmem:848kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
           DMA free:142188kB min:3056kB low:3820kB high:4584kB active_anon:10052kB inactive_anon:12kB active_file:312kB inactive_file:1412620kB unevictable:0kB writepending:0kB present:1781412kB managed:1604728kB mlocked:0kB slab_reclaimable:3592kB slab_unreclaimable:876kB kernel_stack:400kB pagetables:52kB bounce:0kB free_pcp:1436kB local_pcp:124kB free_cma:142492kB
           lowmem_reserve[]: 0 1842 1842
           Normal free:4056kB min:4172kB low:5212kB high:6252kB active_anon:58376kB inactive_anon:760kB active_file:31348kB inactive_file:1439040kB unevictable:0kB writepending:180kB present:2000636kB managed:1923688kB mlocked:0kB slab_reclaimable:45408kB slab_unreclaimable:92460kB kernel_stack:9680kB pagetables:3212kB bounce:0kB free_pcp:3392kB local_pcp:688kB free_cma:0kB
           lowmem_reserve[]: 0 0 0
           DMA: 0*4kB 0*8kB 1*16kB (C) 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB (C) 0*1024kB 1*2048kB (C) 34*4096kB (C) = 142096kB
           Normal: 228*4kB (UMEH) 172*8kB (UMH) 23*16kB (UH) 24*32kB (H) 5*64kB (H) 1*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3872kB
           721350 total pagecache pages
           0 pages in swap cache
           Swap cache stats: add 0, delete 0, find 0/0
           Free swap  = 0kB
           Total swap = 0kB
           945512 pages RAM
           0 pages HighMem/MovableOnly
           63408 pages reserved
           51200 pages cma reserved
      
      __memcg_schedule_kmem_cache_create() tries to create a shadow slab cache
      and the worker allocation failure is not really critical because we will
      retry on the next kmem charge.  We might miss some charges but that
      shouldn't be critical.  The excessive allocation failure report is not
      very helpful.
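
The effect of a no-warn flag on a best-effort allocation can be sketched in userspace C. This is a toy model under stated assumptions: `SKETCH_NOWARN`, `sketch_try_alloc()` and the `simulate_failure` parameter are hypothetical and exist only for illustration; they are not kernel interfaces.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define SKETCH_NOWARN 0x1u          /* stand-in for __GFP_NOWARN */

static unsigned int warnings_emitted;

/*
 * Best-effort allocation: on failure, report it unless the caller said
 * the failure is harmless. @simulate_failure forces the failure path so
 * the behavior can be exercised deterministically.
 */
static void *sketch_try_alloc(size_t size, unsigned int flags,
                              int simulate_failure)
{
    void *p;

    assert(size > 0);
    p = simulate_failure ? NULL : malloc(size);
    if (!p && !(flags & SKETCH_NOWARN)) {
        warnings_emitted++;
        fprintf(stderr, "sketch: allocation failure, size %zu\n", size);
    }
    return p;
}
```

A caller that can simply retry later, as the changelog says the cache-creation path does on the next kmem charge, would pass `SKETCH_NOWARN` and treat a NULL return as "try again next time".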
      
      [mhocko@kernel.org: changelog update]
      Link: http://lkml.kernel.org/r/20180418022912.248417-1-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 12 Apr, 2018 4 commits
  10. 29 Mar, 2018 1 commit
  11. 12 Feb, 2018 1 commit
    • L
      vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Committed by Linus Torvalds
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
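
Whether the userspace POLL* and EPOLL* values agree on a given machine can be checked with a small C program. This is a sketch, assuming a Linux libc that provides `<poll.h>` and `<sys/epoll.h>`; `struct flag_pair`, `flag_pairs` and `count_mismatches()` are hypothetical names, and NVAL is left out because a matching EPOLL constant may not be exposed by the libc headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1   /* lets glibc's <poll.h> expose POLLMSG and POLLRDHUP */
#endif
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/epoll.h>

/* One POLL and EPOLL constant pair sharing the same suffix. */
struct flag_pair {
    const char *suffix;
    unsigned int poll_value;
    unsigned int epoll_value;
};

static const struct flag_pair flag_pairs[] = {
    { "IN",  POLLIN,  EPOLLIN  },
    { "OUT", POLLOUT, EPOLLOUT },
    { "PRI", POLLPRI, EPOLLPRI },
    { "ERR", POLLERR, EPOLLERR },
    { "HUP", POLLHUP, EPOLLHUP },
#if defined(POLLRDNORM) && defined(POLLRDBAND) && \
    defined(POLLWRNORM) && defined(POLLWRBAND)
    { "RDNORM", POLLRDNORM, EPOLLRDNORM },
    { "RDBAND", POLLRDBAND, EPOLLRDBAND },
    { "WRNORM", POLLWRNORM, EPOLLWRNORM },
    { "WRBAND", POLLWRBAND, EPOLLWRBAND },
#endif
#ifdef POLLMSG
    { "MSG", POLLMSG, EPOLLMSG },
#endif
#ifdef POLLRDHUP
    { "RDHUP", POLLRDHUP, EPOLLRDHUP },
#endif
};

/* Count pairs whose values differ; print each one when @verbose is set. */
static int count_mismatches(int verbose)
{
    size_t n = sizeof(flag_pairs) / sizeof(flag_pairs[0]);
    int mismatches = 0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        const struct flag_pair *fp = &flag_pairs[i];

        if (fp->poll_value != fp->epoll_value) {
            mismatches++;
            if (verbose)
                printf("POLL%s=0x%x but EPOLL%s=0x%x\n", fp->suffix,
                       fp->poll_value, fp->suffix, fp->epoll_value);
        }
    }
    return mismatches;
}
```

Calling `count_mismatches(1)` from a `main()` prints any differing pairs for the architecture the program was built for; the result depends on the target and on the libc headers in use.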
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 07 Feb, 2018 2 commits
  13. 03 Feb, 2018 1 commit
    • R
      Revert "defer call to mem_cgroup_sk_alloc()" · edbe69ef
      Committed by Roman Gushchin
      This patch effectively reverts commit 9f1c2674 ("net: memcontrol:
      defer call to mem_cgroup_sk_alloc()").
      
      Moving mem_cgroup_sk_alloc() to the inet_csk_accept() completely breaks
      memcg socket memory accounting, as packets received before memcg
      pointer initialization are not accounted and are causing refcounting
      underflow on socket release.
      
      Actually, the use-after-free problem was fixed by
      commit c0576e39 ("net: call cgroup_sk_alloc() earlier in
      sk_clone_lock()") for the cgroup pointer.
      
      So, let's revert it and call mem_cgroup_sk_alloc() just before
      cgroup_sk_alloc(). This is safe, as we hold a reference to the socket
      we're cloning, and it holds a reference to the memcg.
      
      Also, let's drop the BUG_ON(mem_cgroup_is_root()) check from
      mem_cgroup_sk_alloc(). I see no reason why bumping the root
      memcg counter should cause a panic, and there are no realistic
      ways to hit it.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 01 Feb, 2018 5 commits
  15. 30 Nov, 2017 1 commit
  16. 28 Nov, 2017 1 commit
  17. 16 Nov, 2017 1 commit
  18. 10 Oct, 2017 1 commit
  19. 04 Oct, 2017 2 commits
    • J
      mm/memcg: avoid page count check for zone device · 3f2eb028
      Committed by Jérôme Glisse
      Fix for 4.14: zone device pages always have an elevated refcount of one,
      so the page count sanity check in uncharge_page() is inappropriate for
      them.
      
      [mhocko@suse.com: nano-optimize VM_BUG_ON in uncharge_page]
      Link: http://lkml.kernel.org/r/20170914190011.5217-1-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm, memcg: remove hotplug locking from try_charge · 72f0184c
      Committed by Michal Hocko
      The following lockdep splat has been noticed during LTP testing:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        4.13.0-rc3-next-20170807 #12 Not tainted
        ------------------------------------------------------
        a.out/4771 is trying to acquire lock:
         (cpu_hotplug_lock.rw_sem){++++++}, at: [<ffffffff812b4668>] drain_all_stock.part.35+0x18/0x140
      
        but task is already holding lock:
         (&mm->mmap_sem){++++++}, at: [<ffffffff8106eb35>] __do_page_fault+0x175/0x530
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (&mm->mmap_sem){++++++}:
               lock_acquire+0xc9/0x230
               __might_fault+0x70/0xa0
               _copy_to_user+0x23/0x70
               filldir+0xa7/0x110
               xfs_dir2_sf_getdents.isra.10+0x20c/0x2c0 [xfs]
               xfs_readdir+0x1fa/0x2c0 [xfs]
               xfs_file_readdir+0x30/0x40 [xfs]
               iterate_dir+0x17a/0x1a0
               SyS_getdents+0xb0/0x160
               entry_SYSCALL_64_fastpath+0x1f/0xbe
      
        -> #2 (&type->i_mutex_dir_key#3){++++++}:
               lock_acquire+0xc9/0x230
               down_read+0x51/0xb0
               lookup_slow+0xde/0x210
               walk_component+0x160/0x250
               link_path_walk+0x1a6/0x610
               path_openat+0xe4/0xd50
               do_filp_open+0x91/0x100
               file_open_name+0xf5/0x130
               filp_open+0x33/0x50
               kernel_read_file_from_path+0x39/0x80
               _request_firmware+0x39f/0x880
               request_firmware_direct+0x37/0x50
               request_microcode_fw+0x64/0xe0
               reload_store+0xf7/0x180
               dev_attr_store+0x18/0x30
               sysfs_kf_write+0x44/0x60
               kernfs_fop_write+0x113/0x1a0
               __vfs_write+0x37/0x170
               vfs_write+0xc7/0x1c0
               SyS_write+0x58/0xc0
               do_syscall_64+0x6c/0x1f0
               return_from_SYSCALL_64+0x0/0x7a
      
        -> #1 (microcode_mutex){+.+.+.}:
               lock_acquire+0xc9/0x230
               __mutex_lock+0x88/0x960
               mutex_lock_nested+0x1b/0x20
               microcode_init+0xbb/0x208
               do_one_initcall+0x51/0x1a9
               kernel_init_freeable+0x208/0x2a7
               kernel_init+0xe/0x104
               ret_from_fork+0x2a/0x40
      
        -> #0 (cpu_hotplug_lock.rw_sem){++++++}:
               __lock_acquire+0x153c/0x1550
               lock_acquire+0xc9/0x230
               cpus_read_lock+0x4b/0x90
               drain_all_stock.part.35+0x18/0x140
               try_charge+0x3ab/0x6e0
               mem_cgroup_try_charge+0x7f/0x2c0
               shmem_getpage_gfp+0x25f/0x1050
               shmem_fault+0x96/0x200
               __do_fault+0x1e/0xa0
               __handle_mm_fault+0x9c3/0xe00
               handle_mm_fault+0x16e/0x380
               __do_page_fault+0x24a/0x530
               do_page_fault+0x30/0x80
               page_fault+0x28/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          cpu_hotplug_lock.rw_sem --> &type->i_mutex_dir_key#3 --> &mm->mmap_sem
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&mm->mmap_sem);
                                       lock(&type->i_mutex_dir_key#3);
                                       lock(&mm->mmap_sem);
          lock(cpu_hotplug_lock.rw_sem);
      
         *** DEADLOCK ***
      
        2 locks held by a.out/4771:
         #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8106eb35>] __do_page_fault+0x175/0x530
         #1:  (percpu_charge_mutex){+.+...}, at: [<ffffffff812b4c97>] try_charge+0x397/0x6e0
      
      The problem is very similar to the one fixed by commit a459eeb7
      ("mm, page_alloc: do not depend on cpu hotplug locks inside the
      allocator").  We are taking hotplug locks while we can be sitting on top
      of basically arbitrary locks.  This just calls for problems.
      
      We can get rid of {get,put}_online_cpus, fortunately.  We do not have to
      be worried about races with memory hotplug because drain_local_stock,
      which is called from both the WQ draining and the memory hotplug
      contexts, is always operating on the local cpu stock with IRQs disabled.
      
      The only thing to be careful about is that the target memcg doesn't
      vanish while we are still in drain_all_stock so take a reference on it.
      
      Link: http://lkml.kernel.org/r/20170913090023.28322-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Artem Savkov <asavkov@redhat.com>
      Tested-by: Artem Savkov <asavkov@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 09 Sep, 2017 1 commit