1. 08 Jun 2018 (2 commits)
  2. 26 May 2018 (1 commit)
  3. 21 Apr 2018 (1 commit)
    • mm: memcg: add __GFP_NOWARN in __memcg_schedule_kmem_cache_create() · c892fd82
      Minchan Kim authored
      If there is heavy memory pressure, page allocation with __GFP_NOWAIT
      fails easily even for an order-0 request.  I got the warning below 9
      times during a normal boot.
      
           <snip >: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
           .. snip ..
           Call trace:
             dump_backtrace+0x0/0x4
             dump_stack+0xa4/0xc0
             warn_alloc+0xd4/0x15c
             __alloc_pages_nodemask+0xf88/0x10fc
             alloc_slab_page+0x40/0x18c
             new_slab+0x2b8/0x2e0
             ___slab_alloc+0x25c/0x464
             __kmalloc+0x394/0x498
             memcg_kmem_get_cache+0x114/0x2b8
             kmem_cache_alloc+0x98/0x3e8
             mmap_region+0x3bc/0x8c0
             do_mmap+0x40c/0x43c
             vm_mmap_pgoff+0x15c/0x1e4
             sys_mmap+0xb0/0xc8
             el0_svc_naked+0x24/0x28
           Mem-Info:
           active_anon:17124 inactive_anon:193 isolated_anon:0
            active_file:7898 inactive_file:712955 isolated_file:55
            unevictable:0 dirty:27 writeback:18 unstable:0
            slab_reclaimable:12250 slab_unreclaimable:23334
            mapped:19310 shmem:212 pagetables:816 bounce:0
            free:36561 free_pcp:1205 free_cma:35615
           Node 0 active_anon:68496kB inactive_anon:772kB active_file:31592kB inactive_file:2851820kB unevictable:0kB isolated(anon):0kB isolated(file):220kB mapped:77240kB dirty:108kB writeback:72kB shmem:848kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
           DMA free:142188kB min:3056kB low:3820kB high:4584kB active_anon:10052kB inactive_anon:12kB active_file:312kB inactive_file:1412620kB unevictable:0kB writepending:0kB present:1781412kB managed:1604728kB mlocked:0kB slab_reclaimable:3592kB slab_unreclaimable:876kB kernel_stack:400kB pagetables:52kB bounce:0kB free_pcp:1436kB local_pcp:124kB free_cma:142492kB
           lowmem_reserve[]: 0 1842 1842
           Normal free:4056kB min:4172kB low:5212kB high:6252kB active_anon:58376kB inactive_anon:760kB active_file:31348kB inactive_file:1439040kB unevictable:0kB writepending:180kB present:2000636kB managed:1923688kB mlocked:0kB slab_reclaimable:45408kB slab_unreclaimable:92460kB kernel_stack:9680kB pagetables:3212kB bounce:0kB free_pcp:3392kB local_pcp:688kB free_cma:0kB
           lowmem_reserve[]: 0 0 0
           DMA: 0*4kB 0*8kB 1*16kB (C) 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB (C) 0*1024kB 1*2048kB (C) 34*4096kB (C) = 142096kB
           Normal: 228*4kB (UMEH) 172*8kB (UMH) 23*16kB (UH) 24*32kB (H) 5*64kB (H) 1*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3872kB
           721350 total pagecache pages
           0 pages in swap cache
           Swap cache stats: add 0, delete 0, find 0/0
           Free swap  = 0kB
           Total swap = 0kB
           945512 pages RAM
           0 pages HighMem/MovableOnly
           63408 pages reserved
           51200 pages cma reserved
      
      __memcg_schedule_kmem_cache_create() tries to create a shadow slab
      cache, and a failure to allocate the worker is not really critical
      because we will retry on the next kmem charge.  We might miss some
      charges, but that shouldn't be critical.  The flood of allocation
      failure reports, on the other hand, is not very helpful.
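
      A minimal sketch of the resulting allocation site, reconstructed from
      the subject line and the description above (helper names follow the
      surrounding memcontrol.c code of that era; details may differ):

          static void __memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
                                                         struct kmem_cache *cachep)
          {
                  struct memcg_kmem_cache_create_work *cw;

                  /* best effort under GFP_NOWAIT: suppress the warning, since
                   * failure is benign -- we retry on the next kmem charge */
                  cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
                  if (!cw)
                          return;

                  css_get(&memcg->css);
                  cw->memcg = memcg;
                  cw->cachep = cachep;
                  INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
                  queue_work(memcg_kmem_cache_wq, &cw->work);
          }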
      
      [mhocko@kernel.org: changelog update]
      Link: http://lkml.kernel.org/r/20180418022912.248417-1-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 12 Apr 2018 (4 commits)
  5. 29 Mar 2018 (1 commit)
  6. 12 Feb 2018 (1 commit)
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds authored
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 07 Feb 2018 (2 commits)
  8. 03 Feb 2018 (1 commit)
    • Revert "defer call to mem_cgroup_sk_alloc()" · edbe69ef
      Roman Gushchin authored
      This patch effectively reverts commit 9f1c2674 ("net: memcontrol:
      defer call to mem_cgroup_sk_alloc()").
      
      Moving mem_cgroup_sk_alloc() to inet_csk_accept() completely breaks
      memcg socket memory accounting, as packets received before memcg
      pointer initialization are not accounted and are causing refcounting
      underflow on socket release.
      
      Actually, the use-after-free problem was fixed by
      commit c0576e39 ("net: call cgroup_sk_alloc() earlier in
      sk_clone_lock()") for the cgroup pointer.
      
      So, let's revert it and call mem_cgroup_sk_alloc() just before
      cgroup_sk_alloc(). This is safe, as we hold a reference to the socket
      we're cloning, and it holds a reference to the memcg.
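
      A sketch of the call ordering this describes in sk_clone_lock()
      (illustrative; the surrounding clone code is omitted):

          /* The clone holds a reference to the parent socket, which in
           * turn holds a reference to the memcg, so no extra pinning is
           * needed for this to be safe. */
          mem_cgroup_sk_alloc(newsk);
          cgroup_sk_alloc(&newsk->sk_cgrp_data);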
      
      Also, let's drop the BUG_ON(mem_cgroup_is_root()) check from
      mem_cgroup_sk_alloc().  I see no reason why bumping the root
      memcg counter would be grounds to panic, and there is no realistic
      way to hit it.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 01 Feb 2018 (5 commits)
  10. 30 Nov 2017 (1 commit)
  11. 28 Nov 2017 (1 commit)
  12. 16 Nov 2017 (1 commit)
  13. 10 Oct 2017 (1 commit)
  14. 04 Oct 2017 (2 commits)
    • mm/memcg: avoid page count check for zone device · 3f2eb028
      Jérôme Glisse authored
      Fix for 4.14: zone device pages always have an elevated refcount of
      one, and thus the page count sanity check in uncharge_page() is
      inappropriate for them.
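
      A sketch of the relaxed assertion, assuming the check is simply skipped
      for ZONE_DEVICE pages (condensed; the real uncharge_page() does more
      work around it):

          /* ZONE_DEVICE pages keep a permanent extra reference, so a
           * plain page_count() assertion would always trip for them */
          VM_BUG_ON_PAGE(!is_zone_device_page(page) && page_count(page), page);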
      
      [mhocko@suse.com: nano-optimize VM_BUG_ON in uncharge_page]
      Link: http://lkml.kernel.org/r/20170914190011.5217-1-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: remove hotplug locking from try_charge · 72f0184c
      Michal Hocko authored
      The following lockdep splat has been noticed during LTP testing
      
        ======================================================
        WARNING: possible circular locking dependency detected
        4.13.0-rc3-next-20170807 #12 Not tainted
        ------------------------------------------------------
        a.out/4771 is trying to acquire lock:
         (cpu_hotplug_lock.rw_sem){++++++}, at: [<ffffffff812b4668>] drain_all_stock.part.35+0x18/0x140
      
        but task is already holding lock:
         (&mm->mmap_sem){++++++}, at: [<ffffffff8106eb35>] __do_page_fault+0x175/0x530
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (&mm->mmap_sem){++++++}:
               lock_acquire+0xc9/0x230
               __might_fault+0x70/0xa0
               _copy_to_user+0x23/0x70
               filldir+0xa7/0x110
               xfs_dir2_sf_getdents.isra.10+0x20c/0x2c0 [xfs]
               xfs_readdir+0x1fa/0x2c0 [xfs]
               xfs_file_readdir+0x30/0x40 [xfs]
               iterate_dir+0x17a/0x1a0
               SyS_getdents+0xb0/0x160
               entry_SYSCALL_64_fastpath+0x1f/0xbe
      
        -> #2 (&type->i_mutex_dir_key#3){++++++}:
               lock_acquire+0xc9/0x230
               down_read+0x51/0xb0
               lookup_slow+0xde/0x210
               walk_component+0x160/0x250
               link_path_walk+0x1a6/0x610
               path_openat+0xe4/0xd50
               do_filp_open+0x91/0x100
               file_open_name+0xf5/0x130
               filp_open+0x33/0x50
               kernel_read_file_from_path+0x39/0x80
               _request_firmware+0x39f/0x880
               request_firmware_direct+0x37/0x50
               request_microcode_fw+0x64/0xe0
               reload_store+0xf7/0x180
               dev_attr_store+0x18/0x30
               sysfs_kf_write+0x44/0x60
               kernfs_fop_write+0x113/0x1a0
               __vfs_write+0x37/0x170
               vfs_write+0xc7/0x1c0
               SyS_write+0x58/0xc0
               do_syscall_64+0x6c/0x1f0
               return_from_SYSCALL_64+0x0/0x7a
      
        -> #1 (microcode_mutex){+.+.+.}:
               lock_acquire+0xc9/0x230
               __mutex_lock+0x88/0x960
               mutex_lock_nested+0x1b/0x20
               microcode_init+0xbb/0x208
               do_one_initcall+0x51/0x1a9
               kernel_init_freeable+0x208/0x2a7
               kernel_init+0xe/0x104
               ret_from_fork+0x2a/0x40
      
        -> #0 (cpu_hotplug_lock.rw_sem){++++++}:
               __lock_acquire+0x153c/0x1550
               lock_acquire+0xc9/0x230
               cpus_read_lock+0x4b/0x90
               drain_all_stock.part.35+0x18/0x140
               try_charge+0x3ab/0x6e0
               mem_cgroup_try_charge+0x7f/0x2c0
               shmem_getpage_gfp+0x25f/0x1050
               shmem_fault+0x96/0x200
               __do_fault+0x1e/0xa0
               __handle_mm_fault+0x9c3/0xe00
               handle_mm_fault+0x16e/0x380
               __do_page_fault+0x24a/0x530
               do_page_fault+0x30/0x80
               page_fault+0x28/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          cpu_hotplug_lock.rw_sem --> &type->i_mutex_dir_key#3 --> &mm->mmap_sem
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&mm->mmap_sem);
                                       lock(&type->i_mutex_dir_key#3);
                                       lock(&mm->mmap_sem);
          lock(cpu_hotplug_lock.rw_sem);
      
         *** DEADLOCK ***
      
        2 locks held by a.out/4771:
         #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8106eb35>] __do_page_fault+0x175/0x530
         #1:  (percpu_charge_mutex){+.+...}, at: [<ffffffff812b4c97>] try_charge+0x397/0x6e0
      
      The problem is very similar to the one fixed by commit a459eeb7
      ("mm, page_alloc: do not depend on cpu hotplug locks inside the
      allocator").  We are taking hotplug locks while we can be sitting on top
      of basically arbitrary locks.  This just calls for problems.
      
      We can get rid of {get,put}_online_cpus, fortunately.  We do not have to
      be worried about races with memory hotplug because drain_local_stock,
      which is called from both the WQ draining and the memory hotplug
      contexts, is always operating on the local cpu stock with IRQs disabled.
      
      The only thing to be careful about is that the target memcg doesn't
      vanish while we are still in drain_all_stock so take a reference on it.
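
      A sketch of the resulting drain_all_stock() shape, per the description
      above (the per-cpu work-scheduling details are elided):

          static void drain_all_stock(struct mem_cgroup *root_memcg)
          {
                  int cpu;

                  if (!mutex_trylock(&percpu_charge_mutex))
                          return;
                  /*
                   * Pin the memcg instead of taking cpu_hotplug_lock; races
                   * with hotplug are harmless because drain_local_stock()
                   * always operates on the local cpu stock with IRQs disabled.
                   */
                  css_get(&root_memcg->css);
                  for_each_online_cpu(cpu) {
                          struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
                          /* test the cached memcg and schedule the per-cpu
                           * drain work, as before */
                  }
                  css_put(&root_memcg->css);
                  mutex_unlock(&percpu_charge_mutex);
          }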
      
      Link: http://lkml.kernel.org/r/20170913090023.28322-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Artem Savkov <asavkov@redhat.com>
      Tested-by: Artem Savkov <asavkov@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 09 Sep 2017 (6 commits)
    • mem/memcg: cache rightmost node · fa90b2fd
      Davidlohr Bueso authored
      Such that we can optimize __mem_cgroup_largest_soft_limit_node().  The
      only overhead is the extra footprint for the cached pointer, but this
      should not be an issue for mem_cgroup_tree_per_node.
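
      A minimal sketch of the caching scheme, following the usual pattern of
      maintaining a cached rb-tree extreme at insert/erase time (assumed
      shape; names follow memcontrol.c):

          struct mem_cgroup_tree_per_node {
                  struct rb_root rb_root;
                  struct rb_node *rb_rightmost;   /* the one-pointer footprint */
                  spinlock_t lock;
          };

          /* on insert: the new node is rightmost iff we never branched left */
          if (rightmost)
                  mctz->rb_rightmost = &mz->tree_node;

          /* on erase: demote the cache before unlinking the node */
          if (&mz->tree_node == mctz->rb_rightmost)
                  mctz->rb_rightmost = rb_prev(&mz->tree_node);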
      
      [dave@stgolabs.net: brain fart #2]
        Link: http://lkml.kernel.org/r/20170731160114.GE21328@linux-80c1.suse
      Link: http://lkml.kernel.org/r/20170719014603.19029-17-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: use per-cpu stocks for socket memory uncharging · 475d0487
      Roman Gushchin authored
      We've seen quite noticeable performance overhead on some hosts with
      significant network traffic when socket memory accounting is enabled.
      
      Perf top shows that the socket memory uncharging path is hot:
        2.13%  [kernel]                [k] page_counter_cancel
        1.14%  [kernel]                [k] __sk_mem_reduce_allocated
        1.14%  [kernel]                [k] _raw_spin_lock
        0.87%  [kernel]                [k] _raw_spin_lock_irqsave
        0.84%  [kernel]                [k] tcp_ack
        0.84%  [kernel]                [k] ixgbe_poll
        0.83%  < workload >
        0.82%  [kernel]                [k] enqueue_entity
        0.68%  [kernel]                [k] __fget
        0.68%  [kernel]                [k] tcp_delack_timer_handler
        0.67%  [kernel]                [k] __schedule
        0.60%  < workload >
        0.59%  [kernel]                [k] __inet6_lookup_established
        0.55%  [kernel]                [k] __switch_to
        0.55%  [kernel]                [k] menu_select
        0.54%  libc-2.20.so            [.] __memcpy_avx_unaligned
      
      To address this issue, the existing per-cpu stock infrastructure can be
      used.
      
      refill_stock() can be called from mem_cgroup_uncharge_skmem() to move
      charge to a per-cpu stock instead of calling atomic
      page_counter_uncharge().
      
      To prevent the uncontrolled growth of per-cpu stocks, refill_stock()
      will explicitly drain the cached charge, if the cached value exceeds
      CHARGE_BATCH.
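
      A sketch of the two pieces described above (irq/locking details and the
      accompanying stats updates are omitted):

          void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg,
                                         unsigned int nr_pages)
          {
                  /* was: page_counter_uncharge(&memcg->memory, nr_pages); */
                  refill_stock(memcg, nr_pages);
          }

          static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
          {
                  struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);

                  if (stock->cached != memcg) {
                          drain_stock(stock);
                          stock->cached = memcg;
                  }
                  stock->nr_pages += nr_pages;

                  /* keep the per-cpu cache bounded */
                  if (stock->nr_pages > CHARGE_BATCH)
                          drain_stock(stock);
          }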
      
      This significantly optimizes the load:
        1.21%  [kernel]                [k] _raw_spin_lock
        1.01%  [kernel]                [k] ixgbe_poll
        0.92%  [kernel]                [k] _raw_spin_lock_irqsave
        0.90%  [kernel]                [k] enqueue_entity
        0.86%  [kernel]                [k] tcp_ack
        0.85%  < workload >
        0.74%  perf-11120.map          [.] 0x000000000061bf24
        0.73%  [kernel]                [k] __schedule
        0.67%  [kernel]                [k] __fget
        0.63%  [kernel]                [k] __inet6_lookup_established
        0.62%  [kernel]                [k] menu_select
        0.59%  < workload >
        0.59%  [kernel]                [k] __switch_to
        0.57%  libc-2.20.so            [.] __memcpy_avx_unaligned
      
      Link: http://lkml.kernel.org/r/20170829100150.4580-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/device-public-memory: device memory cache coherent with CPU · df6ad698
      Jérôme Glisse authored
      Platforms with an advanced system bus (like CAPI or CCIX) allow device
      memory to be accessed by the CPU in a cache-coherent fashion.  Add a new
      type of ZONE_DEVICE to represent such memory.  The use cases are the same
      as for un-addressable device memory, but without all the corner cases.
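
      A sketch of the resulting memory-type distinction (names as introduced
      by this series; the exact enum layout is illustrative):

          enum memory_type {
                  MEMORY_DEVICE_HOST = 0,
                  MEMORY_DEVICE_PRIVATE,  /* un-addressable device memory */
                  MEMORY_DEVICE_PUBLIC,   /* CPU-addressable, cache coherent */
          };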
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol: support MEMORY_DEVICE_PRIVATE · c733a828
      Jérôme Glisse authored
      HMM pages (private or public device pages) are ZONE_DEVICE pages and thus
      need special handling when it comes to the lru or refcounting.  This patch
      makes sure that memcontrol handles them properly when it encounters them.
      Those pages are used like regular pages in a process address space, either
      as anonymous pages or as file-backed pages, so from the memcg point of
      view we want to handle them like regular pages, for now at least.
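
      For context, a sketch of the predicates such handling relies on (shape
      per this series; shown here as an assumption, not a quote of the patch):

          static inline bool is_device_private_page(const struct page *page)
          {
                  return is_zone_device_page(page) &&
                         page->pgmap->type == MEMORY_DEVICE_PRIVATE;
          }

          static inline bool is_device_public_page(const struct page *page)
          {
                  return is_zone_device_page(page) &&
                         page->pgmap->type == MEMORY_DEVICE_PUBLIC;
          }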
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-11-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol: allow to uncharge page without using page->lru field · a9d5adee
      Jérôme Glisse authored
      HMM pages (private or public device pages) are ZONE_DEVICE pages, and
      thus the page->lru fields of those pages cannot be used.  This patch
      rearranges the uncharge path so that a single page can be uncharged
      without modifying the lru field of the struct page.
      
      There is no change to the memcontrol logic; it is the same as it was
      before this patch.
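
      A sketch of the rearranged shape, assuming a small gather struct so that
      per-page stats are accumulated and flushed in a batch instead of walking
      a list threaded through page->lru (field list abridged and illustrative):

          struct uncharge_gather {
                  struct mem_cgroup *memcg;
                  unsigned long nr_anon;
                  unsigned long nr_file;
                  unsigned long pgpgout;
                  struct page *dummy_page;
          };

          /* single-page uncharge: no page->lru involvement */
          uncharge_gather_clear(&ug);
          uncharge_page(page, &ug);
          uncharge_batch(&ug);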
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-10-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: check pmd migration entry in common path · 84c3fc4e
      Zi Yan authored
      When THP migration is being used, memory management code needs to handle
      pmd migration entries properly.  This patch uses !pmd_present() or
      is_swap_pmd() (depending on whether pmd_none() needs separate code or
      not) to check for pmd migration entries at the places where a pmd entry
      is present.
      
      Since pmd-related code uses split_huge_page(), split_huge_pmd(),
      pmd_trans_huge(), pmd_trans_unstable(), or
      pmd_none_or_trans_huge_or_clear_bad(), this patch:
      
      1. adds pmd migration entry split code in split_huge_pmd(),
      
      2. takes care of pmd migration entries whenever pmd_trans_huge() is present,
      
      3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.
      
      Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
      is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
      them.
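
      For reference, a sketch of the helper and the check pattern used at the
      pmd-present sites, as described above (the call-site body is
      illustrative):

          static inline int is_swap_pmd(pmd_t pmd)
          {
                  return !pmd_none(pmd) && !pmd_present(pmd);
          }

          /* at a site that previously handled only huge pmds: */
          if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
                  spinlock_t *ptl = pmd_lock(mm, pmd);
                  if (unlikely(is_swap_pmd(*pmd))) {
                          /* pmd migration entry: wait for it or split it,
                           * as the call site requires */
                  }
                  spin_unlock(ptl);
          }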
      
      After this commit, a pmd entry can be:
      1. pointing to a pte page,
      2. is_swap_pmd(),
      3. pmd_trans_huge(),
      4. pmd_devmap(), or
      5. pmd_none().
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 07 Sep 2017 (6 commits)
  17. 19 Aug 2017 (1 commit)
    • mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback() · 739f79fc
      Johannes Weiner authored
      Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
      to update the memcg stats:
      
          BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
          IP: test_clear_page_writeback+0x12e/0x2c0
          [...]
          RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
          Call Trace:
           <IRQ>
           end_page_writeback+0x47/0x70
           f2fs_write_end_io+0x76/0x180 [f2fs]
           bio_endio+0x9f/0x120
           blk_update_request+0xa8/0x2f0
           scsi_end_request+0x39/0x1d0
           scsi_io_completion+0x211/0x690
           scsi_finish_command+0xd9/0x120
           scsi_softirq_done+0x127/0x150
           __blk_mq_complete_request_remote+0x13/0x20
           flush_smp_call_function_queue+0x56/0x110
           generic_smp_call_function_single_interrupt+0x13/0x30
           smp_call_function_single_interrupt+0x27/0x40
           call_function_single_interrupt+0x89/0x90
          RIP: 0010:native_safe_halt+0x6/0x10
      
          (gdb) l *(test_clear_page_writeback+0x12e)
          0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
          614		mod_node_page_state(page_pgdat(page), idx, val);
          615		if (mem_cgroup_disabled() || !page->mem_cgroup)
          616			return;
          617		mod_memcg_state(page->mem_cgroup, idx, val);
          618		pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
          619		this_cpu_add(pn->lruvec_stat->count[idx], val);
          620	}
          621
          622	unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
          623							gfp_t gfp_mask,
      
      The issue is that writeback doesn't hold a page reference and the page
      might get freed after PG_writeback is cleared (and the mapping is
      unlocked) in test_clear_page_writeback().  The stat functions looking up
      the page's node or zone are safe, as those attributes are static across
      allocation and free cycles.  But page->mem_cgroup is not, and it will
      get cleared if we race with truncation or migration.
      
      It appears this race window has been around for a while, but less likely
      to trigger when the memcg stats were updated first thing after
      PG_writeback is cleared.  Recent changes reshuffled this code to update
      the global node stats before the memcg ones, though, stretching the race
      window out to an extent where people can reproduce the problem.
      
      Update test_clear_page_writeback() to look up and pin page->mem_cgroup
      before clearing PG_writeback, then not use that pointer afterward.  It
      is a partial revert of 62cccb8c ("mm: simplify lock_page_memcg()")
      but leaves the pageref-holding callsites that aren't affected alone.
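
      A sketch of the fixed sequence, per the description (lock_page_memcg()
      now returns the pinned memcg and a matching __unlock_page_memcg() takes
      it back; the stats-update body is abridged):

          struct mem_cgroup *memcg;

          memcg = lock_page_memcg(page);  /* looks up and pins page->mem_cgroup */
          /* ... clear PG_writeback and update the writeback stats against
           * 'memcg', never dereferencing page->mem_cgroup again, since the
           * page may be freed or migrated once PG_writeback is clear ... */
          __unlock_page_memcg(memcg);     /* drop the pin taken above */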
      
      Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org
      Fixes: 62cccb8c ("mm: simplify lock_page_memcg()")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Tested-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Reported-by: Bradley Bolen <bradleybolen@gmail.com>
      Tested-by: Brad Bolen <bradleybolen@gmail.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 21 Jul 2017 (1 commit)
    • cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS · bc2fb7ed
      Tejun Heo authored
      css_task_iter currently always walks all tasks.  With the scheduled
      cgroup v2 thread support, the iterator would need to handle multiple
      types of iteration.  As a preparation, add @flags to
      css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
      is not specified, it walks all tasks as before.  When asserted, the
      iterator only walks the group leaders.
      
      For now, the only user of the flag is cgroup v2 "cgroup.procs" file
      which no longer needs to skip non-leader tasks in cgroup_procs_next().
      Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
      v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
      cgroup" but "list all thread group id's with any threads in the
      cgroup".
      
      While at it, update cgroup_procs_show() to use task_pid_vnr() instead
      of task_tgid_vnr().  As the iteration guarantees that the function
      only sees group leaders, this doesn't change the output and will allow
      sharing the function for thread iteration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
  19. 11 Jul 2017 (2 commits)
    • mm, memcg: fix potential undefined behavior in mem_cgroup_event_ratelimit() · 6a1a8b80
      Michal Hocko authored
      Alice has reported the following UBSAN splat:
      
        UBSAN: Undefined behaviour in mm/memcontrol.c:661:17
        signed integer overflow:
        -2147483644 - 2147483525 cannot be represented in type 'long int'
        CPU: 1 PID: 11758 Comm: mybibtex2filena Tainted: P           O 4.9.25-gentoo #4
        Hardware name: XXXXXX, BIOS YYYYYY
        Call Trace:
          dump_stack+0x59/0x87
          ubsan_epilogue+0xe/0x40
          handle_overflow+0xbb/0xf0
          __ubsan_handle_sub_overflow+0x12/0x20
          memcg_check_events.isra.36+0x223/0x360
          mem_cgroup_commit_charge+0x55/0x140
          wp_page_copy+0x34e/0xb80
          do_wp_page+0x1e6/0x1300
          handle_mm_fault+0x88b/0x1990
          __do_page_fault+0x2de/0x8a0
          do_page_fault+0x1a/0x20
          error_code+0x67/0x6c
      
      The reason is that we subtract two signed types.  Let's fix this by
      truly mimicking time_after() and casting the result of the subtraction.
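
      A before/after sketch of the fix (the names here are illustrative):

          unsigned long val, next;   /* per-cpu event count and target */

          /* before: the subtraction is done in signed arithmetic and can
           * overflow, which is undefined behavior */
          if ((long)next - (long)val < 0)
                  target_hit();

          /* after: subtract in unsigned arithmetic (well-defined modulo
           * wrap-around), then cast the result, as time_after() does */
          if ((long)(next - val) < 0)
                  target_hit();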
      
      Link: http://lkml.kernel.org/r/20170616150057.GQ30580@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Alice Ferrazzi <alicef@gentoo.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol: exclude @root from checks in mem_cgroup_low · 34c81057
      Sean Christopherson authored
      Make @root exclusive in mem_cgroup_low; it is never considered low when
      looked at directly and is not checked when traversing the tree.  In
      effect, @root is handled identically to how root_mem_cgroup was
      previously handled by mem_cgroup_low.
      
      If @root is not excluded from the checks, a cgroup underneath @root will
      never be considered low during targeted reclaim of @root, e.g.  due to
      memory.current > memory.high, unless @root is misconfigured to have
      memory.low > memory.high.
      
      Excluding @root enables using memory.low to prioritize memory usage
      between cgroups within a subtree of the hierarchy that is limited by
      memory.high or memory.max, e.g.  when ROOT owns @root's controls but
      delegates the @root directory to a USER so that USER can create and
      administer children of @root.
      
      For example, given cgroup A with children B and C:
      
          A
         / \
        B   C
      
      and
      
        1. A/memory.current > A/memory.high
        2. A/B/memory.current < A/B/memory.low
        3. A/C/memory.current >= A/C/memory.low
      
      As 'A' is high, i.e.  triggers reclaim from 'A', and 'B' is low, we
      should reclaim from 'C' until 'A' is no longer high or until we can no
      longer reclaim from 'C'.  If 'A', i.e.  @root, isn't excluded by
      mem_cgroup_low when reclaiming from 'A', then 'B' won't be considered low
      and we will reclaim indiscriminately from both 'B' and 'C'.
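
      A sketch of the exclusive-@root semantics (structure per the description
      above; the real function carries additional bookkeeping):

          static bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
          {
                  if (!root)
                          root = root_mem_cgroup;

                  /* @root itself is never low, and the walk stops below it */
                  if (memcg == root)
                          return false;

                  for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
                          if (page_counter_read(&memcg->memory) >= memcg->low)
                                  return false;
                  }
                  return true;
          }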
      
      Here is the test I used to confirm the bug and the patch.
      
      20:00:55@sjchrist-vm ? ~ $ cat ~/.bin/memcg_low_test
      #!/bin/bash
      
      x62mb=$((62<<20))
      x66mb=$((66<<20))
      x94mb=$((94<<20))
      x98mb=$((98<<20))
      
      setup() {
          set -e
      
          if [[ -n $DEBUG ]]; then
              set -x
          fi
      
          trap teardown EXIT HUP INT TERM
      
          if [[ ! -e /mnt/1gb.swap ]]; then
              sudo fallocate -l 1G /mnt/1gb.swap > /dev/null
              sudo mkswap /mnt/1gb.swap > /dev/null
          fi
          if ! swapon --show=NAME | grep -q "/mnt/1gb.swap"; then
              sudo swapon /mnt/1gb.swap
          fi
      
          if [[ ! -e /cgroup/cgroup.controllers ]]; then
              sudo mount -t cgroup2 none /cgroup
          fi
      
          grep -q memory /cgroup/cgroup.controllers
      
          sudo sh -c "echo '+memory' > /cgroup/cgroup.subtree_control"
      
          sudo mkdir /cgroup/A && sudo chown $USER:$USER /cgroup/A
          sudo sh -c "echo '+memory' > /cgroup/A/cgroup.subtree_control"
          sudo sh -c "echo '96m' > /cgroup/A/memory.high"
      
          mkdir /cgroup/A/0
          mkdir /cgroup/A/1
      
          echo 64m > /cgroup/A/0/memory.low
      }
      
      teardown() {
          set +e
      
          trap - EXIT HUP INT TERM
      
          if [[ -z $1 ]]; then
              printf "\n"
              printf "%0.s*" {1..35}
              printf "\nFAILED!\n\n"
              tail /cgroup/A/**/memory.current
              printf "%0.s*" {1..35}
              printf "\n\n"
          fi
      
          ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %
      
          sleep 2
      
          if [[ -e /cgroup/A/0 ]]; then
              rmdir /cgroup/A/0
          fi
          if [[ -e /cgroup/A/1 ]]; then
              rmdir /cgroup/A/1
          fi
          if [[ -e /cgroup/A ]]; then
              sudo rmdir /cgroup/A
          fi
      }
      
      stress_test() {
          sudo sh -c "echo $$ > /cgroup/A/$1/cgroup.procs"
          stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &
      
          sudo sh -c "echo $$ > /cgroup/A/$2/cgroup.procs"
          stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &
      
          sudo sh -c "echo $$ > /cgroup/cgroup.procs"
      
          sleep 1
      
          # A/0 should be consuming more memory than A/1
          [[ $(cat /cgroup/A/0/memory.current) -ge $(cat /cgroup/A/1/memory.current) ]]
      
          # A/0 should be consuming ~64mb
          [[ $(cat /cgroup/A/0/memory.current) -ge $x62mb ]] && [[ $(cat /cgroup/A/0/memory.current) -le $x66mb ]]
      
          # A should cumulatively be consuming ~96mb
          [[ $(cat /cgroup/A/memory.current) -ge $x94mb ]] && [[ $(cat /cgroup/A/memory.current) -le $x98mb ]]
      
          # Stop the stressors
          ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %
      }
      
      teardown 1
      setup
      
      for ((i=1;i<=$1;i++)); do
          printf "ITERATION $i of $1 - stress_test 0 1"
          stress_test 0 1
          printf "\x1b[2K\r"
      
          printf "ITERATION $i of $1 - stress_test 1 0"
          stress_test 1 0
          printf "\x1b[2K\r"
      
          printf "ITERATION $i of $1 - PASSED\n"
      done
      
      teardown 1
      
      echo PASSED!
      
      20:11:26@sjchrist-vm ? ~ $ memcg_low_test 10
      
      Link: http://lkml.kernel.org/r/1496434412-21005-1-git-send-email-sean.j.christopherson@intel.com
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>