1. 18 8月, 2018 40 次提交
    • M
      mm, oom: remove sleep from under oom_lock · 9bfe5ded
      Michal Hocko 提交于
      Tetsuo has pointed out that since 27ae357f ("mm, oom: fix concurrent
      munlock and oom reaper unmap, v3") we have a strong synchronization
      between the oom_killer and victim's exiting because both have to take
      the oom_lock.  Therefore the original heuristic to sleep for a short
      time in out_of_memory doesn't serve the original purpose.
      
      Moreover Tetsuo has noticed that the short sleep can be more harmful
      than actually useful.  Hammering the system with many processes can lead
      to a starvation when the task holding the oom_lock can block for a long
      time (minutes) and block any further progress because the oom_reaper
      depends on the oom_lock as well.
      
      Drop the short sleep from out_of_memory when we hold the lock.  Keep the
      sleep when the trylock fails to throttle the concurrent OOM paths a bit.
      This should be solved in a more reasonable way (e.g.  sleep proportional
      to the time spent in the active reclaiming etc.) but this is much more
      complex thing to achieve.  This is a quick fixup to remove a stale code.
      
      Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9bfe5ded
    • M
      kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous() · d834c5ab
      Marek Szyprowski 提交于
      The CMA memory allocator doesn't support standard gfp flags for memory
      allocation, so there is no point having it as a parameter for
      dma_alloc_from_contiguous() function.  Replace it by a boolean no_warn
      argument, which covers all the underlaying cma_alloc() function
      supports.
      
      This will help to avoid giving false feeling that this function supports
      standard gfp flags and callers can pass __GFP_ZERO to get zeroed buffer,
      what has already been an issue: see commit dd65a941 ("arm64:
      dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag").
      
      Link: http://lkml.kernel.org/r/20180709122020eucas1p21a71b092975cb4a3b9954ffc63f699d1~-sqUFoa-h2939329393eucas1p2Y@eucas1p2.samsung.comSigned-off-by: NMarek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: NMichał Nazarewicz <mina86@mina86.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d834c5ab
    • M
      mm/cma: remove unsupported gfp_mask parameter from cma_alloc() · 65182029
      Marek Szyprowski 提交于
      cma_alloc() doesn't really support gfp flags other than __GFP_NOWARN, so
      convert gfp_mask parameter to boolean no_warn parameter.
      
      This will help to avoid giving false feeling that this function supports
      standard gfp flags and callers can pass __GFP_ZERO to get zeroed buffer,
      what has already been an issue: see commit dd65a941 ("arm64:
      dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag").
      
      Link: http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqTPJKij2939229392eucas1p2j@eucas1p2.samsung.comSigned-off-by: NMarek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMichał Nazarewicz <mina86@mina86.com>
      Acked-by: NLaura Abbott <labbott@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65182029
    • R
      Revert "mm: always flush VMA ranges affected by zap_page_range" · 50c150f2
      Rik van Riel 提交于
      There was a bug in Linux that could cause madvise (and mprotect?) system
      calls to return to userspace without the TLB having been flushed for all
      the pages involved.
      
      This could happen when multiple threads of a process made simultaneous
      madvise and/or mprotect calls.
      
      This was noticed in the summer of 2017, at which time two solutions
      were created:
      
        56236a59 ("mm: refactor TLB gathering API")
        99baac21 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
      and
        4647706e ("mm: always flush VMA ranges affected by zap_page_range")
      
      We need only one of these solutions, and the former appears to be a
      little more efficient than the latter, so revert that one.
      
      This reverts 4647706e ("mm: always flush VMA ranges affected by
      zap_page_range")
      
      Link: http://lkml.kernel.org/r/20180706131019.51e3a5f0@imladris.surriel.comSigned-off-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      50c150f2
    • B
      mm/sparse: optimize memmap allocation during sparse_init() · c98aff64
      Baoquan He 提交于
      In sparse_init(), two temporary pointer arrays, usemap_map and map_map
      are allocated with the size of NR_MEM_SECTIONS.  They are used to store
      each memory section's usemap and mem map if marked as present.  With the
      help of these two arrays, continuous memory chunk is allocated for
      usemap and memmap for memory sections on one node.  This avoids too many
      memory fragmentations.  Like below diagram, '1' indicates the present
      memory section, '0' means absent one.  The number 'n' could be much
      smaller than NR_MEM_SECTIONS on most of systems.
      
        |1|1|1|1|0|0|0|0|1|1|0|0|...|1|0||1|0|...|1||0|1|...|0|
        -------------------------------------------------------
         0 1 2 3         4 5         i   i+1     n-1   n
      
      If we fail to populate the page tables to map one section's memmap, its
      ->section_mem_map will be cleared finally to indicate that it's not
      present.  After use, these two arrays will be released at the end of
      sparse_init().
      
      In 4-level paging mode, each array costs 4M which can be ignorable.
      While in 5-level paging, they costs 256M each, 512M altogether.  Kdump
      kernel Usually only reserves very few memory, e.g 256M.  So, even thouth
      they are temporarily allocated, still not acceptable.
      
      In fact, there's no need to allocate them with the size of
      NR_MEM_SECTIONS.  Since the ->section_mem_map clearing has been deferred
      to the last, the number of present memory sections are kept the same
      during sparse_init() until we finally clear out the memory section's
      ->section_mem_map if its usemap or memmap is not correctly handled.
      Thus in the middle whenever for_each_present_section_nr() loop is taken,
      the i-th present memory section is always the same one.
      
      Here only allocate usemap_map and map_map with the size of
      'nr_present_sections'.  For the i-th present memory section, install its
      usemap and memmap to usemap_map[i] and mam_map[i] during allocation.
      Then in the last for_each_present_section_nr() loop which clears the
      failed memory section's ->section_mem_map, fetch usemap and memmap from
      usemap_map[] and map_map[] array and set them into mem_section[]
      accordingly.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20180628062857.29658-5-bhe@redhat.comSigned-off-by: NBaoquan He <bhe@redhat.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oscar Salvador <osalvador@techadventures.net>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c98aff64
    • B
      mm/sparse.c: add a new parameter 'data_unit_size' for alloc_usemap_and_memmap · 9258631b
      Baoquan He 提交于
      It's used to pass the size of map data unit into
      alloc_usemap_and_memmap, and is preparation for next patch.
      
      Link: http://lkml.kernel.org/r/20180228032657.32385-4-bhe@redhat.comSigned-off-by: NBaoquan He <bhe@redhat.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9258631b
    • B
      mm/sparsemem.c: defer the ms->section_mem_map clearing · 07a34a8c
      Baoquan He 提交于
      In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, system
      will allocate one continuous memory chunk for mem maps on one node and
      populate the relevant page tables to map memory section one by one.  If
      fail to populate for a certain mem section, print warning and its
      ->section_mem_map will be cleared to cancel the marking of being
      present.  Like this, the number of mem sections marked as present could
      become less during sparse_init() execution.
      
      Here just defer the ms->section_mem_map clearing if failed to populate
      its page tables until the last for_each_present_section_nr() loop.  This
      is in preparation for later optimizing the mem map allocation.
      
      [akpm@linux-foundation.org: remove now-unused local `ms', per Oscar]
      Link: http://lkml.kernel.org/r/20180228032657.32385-3-bhe@redhat.comSigned-off-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NDave Hansen <dave.hansen@intel.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07a34a8c
    • B
      mm/sparse.c: add a static variable nr_present_sections · f2fc10e0
      Baoquan He 提交于
      Patch series "mm/sparse: Optimize memmap allocation during
      sparse_init()", v6.
      
      In sparse_init(), two temporary pointer arrays, usemap_map and map_map
      are allocated with the size of NR_MEM_SECTIONS.  They are used to store
      each memory section's usemap and mem map if marked as present.  In
      5-level paging mode, this will cost 512M memory though they will be
      released at the end of sparse_init().  System with few memory, like
      kdump kernel which usually only has about 256M, will fail to boot
      because of allocation failure if CONFIG_X86_5LEVEL=y.
      
      In this patchset, optimize the memmap allocation code to only use
      usemap_map and map_map with the size of nr_present_sections.  This makes
      kdump kernel boot up with normal crashkernel='' setting when
      CONFIG_X86_5LEVEL=y.
      
      This patch (of 5):
      
      nr_present_sections is used to record how many memory sections are
      marked as present during system boot up, and will be used in the later
      patch.
      
      Link: http://lkml.kernel.org/r/20180228032657.32385-2-bhe@redhat.comSigned-off-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NDave Hansen <dave.hansen@intel.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f2fc10e0
    • K
      mm: use special value SHRINKER_REGISTERING instead of list_empty() check · 7e010df5
      Kirill Tkhai 提交于
      The patch introduces a special value SHRINKER_REGISTERING to use instead
      of list_empty() to differ a registering shrinker from unregistered
      shrinker.  Why we need that at all?
      
      Shrinker registration is split in two parts.  The first one is
      prealloc_shrinker(), which allocates shrinker memory and reserves ID in
      shrinker_idr.  This function can fail.  The second is
      register_shrinker_prepared(), and it finalizes the registration.  This
      function actually makes shrinker available to be used from
      shrink_slab(), and it can't fail.
      
      One shrinker may be based on more then one LRU lists.  So, we never
      clear the bit in memcg shrinker maps, when (one of) corresponding LRU
      list becomes empty, since other LRU lists may be not empty.  See
      superblock shrinker for example: it is based on two LRU lists:
      s_inode_lru and s_dentry_lru.  We do not want to clear shrinker bit,
      when there are no inodes in s_inode_lru, as s_dentry_lru may contain
      dentries.
      
      Instead of that, we use special algorithm to detect shrinkers having no
      elements at all its LRU lists, and this is made in shrink_slab_memcg().
      See the comment in this function for the details.
      
      Also, in shrink_slab_memcg() we clear shrinker bit in the map, when we
      meet unregistered shrinker (bit is set, while there is no a shrinker in
      IDR).  Otherwise, we would have done that at the moment of shrinker
      unregistration for all memcgs (and this looks worse, since iteration
      over all memcg may take much time).  Also this would have imposed
      restrictions on shrinker unregistration order for its users: they would
      have had to guarantee, there are no new elements after
      unregister_shrinker() (otherwise, a new added element would have set a
      bit).
      
      So, if we meet a set bit in map and no shrinker in IDR when we're
      iterating over the map in shrink_slab_memcg(), this means the
      corresponding shrinker is unregistered, and we must clear the bit.
      
      Another case is shrinker registration.  We want two things there:
      
      1) do_shrink_slab() can be called only for completely registered
         shrinkers;
      
      2) shrinker internal lists may be populated in any order with
         register_shrinker_prepared() (let's talk on the example with sb).  Both
         of:
      
        a)list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu0]
          memcg_set_shrinker_bit();                               [cpu0]
          ...
          register_shrinker_prepared();                           [cpu1]
      
        and
      
        b)register_shrinker_prepared();                           [cpu0]
          ...
          list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu1]
          memcg_set_shrinker_bit();                               [cpu1]
      
         are legitimate.  We don't want to impose restriction here and to
         force people to use only (b) variant.  We don't want to force people to
         care, there is no elements in LRU lists before the shrinker is
         completely registered.  Internal users of LRU lists and shrinker code
         are two different subsystems, and they have to be closed in themselves
         each other.
      
      In (a) case we have the bit set before shrinker is completely
      registered.  We don't want do_shrink_slab() is called at this moment, so
      we have to detect such the registering shrinkers.
      
      Before this patch list_empty() (shrinker is not linked to the list)
      check was used for that.  So, in (a) there could be a bit set, but we
      don't call do_shrink_slab() unless shrinker is linked to the list.  It's
      just an indicator, I just overloaded linking to the list.
      
      This was not the best solution, since it's better not to touch the
      shrinker memory from shrink_slab_memcg() before it's completely
      registered (this also will be useful in the future to make shrink_slab()
      completely lockless).
      
      So, this patch introduces better way to detect registering shrinker,
      which allows not to dereference shrinker memory.  It's just a ~0UL
      value, which we insert into the IDR during ID allocation.  After
      shrinker is ready to be used, we insert actual shrinker pointer in the
      IDR, and it becomes available to shrink_slab_memcg().
      
      We can't use NULL instead of this new value for this purpose as:
      shrink_slab_memcg() already uses NULL to detect unregistered shrinkers,
      and we don't want the function sees NULL and clears the bit, otherwise
      (a) won't work.
      
      This is the only thing the patch makes: the better way to detect
      registering shrinker.  Nothing else this patch makes.
      
      Also this gives a better assembler, but it's minor side of the patch:
      
      Before:
        callq  <idr_find>
        mov    %rax,%r15
        test   %rax,%rax
        je     <shrink_slab_memcg+0x1d5>
        mov    0x20(%rax),%rax
        lea    0x20(%r15),%rdx
        cmp    %rax,%rdx
        je     <shrink_slab_memcg+0xbd>
        mov    0x8(%rsp),%edx
        mov    %r15,%rsi
        lea    0x10(%rsp),%rdi
        callq  <do_shrink_slab>
      
      After:
        callq  <idr_find>
        mov    %rax,%r15
        lea    -0x1(%rax),%rax
        cmp    $0xfffffffffffffffd,%rax
        ja     <shrink_slab_memcg+0x1cd>
        mov    0x8(%rsp),%edx
        mov    %r15,%rsi
        lea    0x10(%rsp),%rdi
        callq  ffffffff810cefd0 <do_shrink_slab>
      
      [ktkhai@virtuozzo.com: add #ifdef CONFIG_MEMCG_KMEM around idr_replace()]
        Link: http://lkml.kernel.org/r/758b8fec-7573-47eb-b26a-7b2847ae7b8c@virtuozzo.com
      Link: http://lkml.kernel.org/r/153355467546.11522.4518015068123480218.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7e010df5
    • K
      mm/vmscan.c: move check for SHRINKER_NUMA_AWARE to do_shrink_slab() · ac7fb3ad
      Kirill Tkhai 提交于
      In case of shrink_slab_memcg() we do not zero nid, when shrinker is not
      numa-aware.  This is not a real problem, since currently all memcg-aware
      shrinkers are numa-aware too (we have two: super_block shrinker and
      workingset shrinker), but something may change in the future.
      
      Link: http://lkml.kernel.org/r/153320759911.18959.8842396230157677671.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac7fb3ad
    • K
      mm/vmscan.c: clear shrinker bit if there are no objects related to memcg · f90280d6
      Kirill Tkhai 提交于
      To avoid further unneed calls of do_shrink_slab() for shrinkers, which
      already do not have any charged objects in a memcg, their bits have to
      be cleared.
      
      This patch introduces a lockless mechanism to do that without races
      without parallel list lru add.  After do_shrink_slab() returns
      SHRINK_EMPTY the first time, we clear the bit and call it once again.
      Then we restore the bit, if the new return value is different.
      
      Note, that single smp_mb__after_atomic() in shrink_slab_memcg() covers
      two situations:
      
      1)list_lru_add()     shrink_slab_memcg
          list_add_tail()    for_each_set_bit() <--- read bit
                               do_shrink_slab() <--- missed list update (no barrier)
          <MB>                 <MB>
          set_bit()            do_shrink_slab() <--- seen list update
      
      This situation, when the first do_shrink_slab() sees set bit, but it
      doesn't see list update (i.e., race with the first element queueing), is
      rare.  So we don't add <MB> before the first call of do_shrink_slab()
      instead of this to do not slow down generic case.  Also, it's need the
      second call as seen in below in (2).
      
      2)list_lru_add()      shrink_slab_memcg()
          list_add_tail()     ...
          set_bit()           ...
        ...                   for_each_set_bit()
        do_shrink_slab()        do_shrink_slab()
          clear_bit()           ...
        ...                     ...
        list_lru_add()          ...
          list_add_tail()       clear_bit()
          <MB>                  <MB>
          set_bit()             do_shrink_slab()
      
      The barriers guarantee that the second do_shrink_slab() in the right
      side task sees list update if really cleared the bit.  This case is
      drawn in the code comment.
      
      [Results/performance of the patchset]
      
      After the whole patchset applied the below test shows signify increase
      of performance:
      
        $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
        $mkdir /sys/fs/cgroup/memory/ct
        $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
            $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
      			    echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
      			    mkdir -p s/$i; mount -t tmpfs $i s/$i;
      			    touch s/$i/file; done
      
      Then, 5 sequential calls of drop caches:
      
        $time echo 3 > /proc/sys/vm/drop_caches
      
      1)Before:
        0.00user 13.78system 0:13.78elapsed 99%CPU
        0.00user 5.59system 0:05.60elapsed 99%CPU
        0.00user 5.48system 0:05.48elapsed 99%CPU
        0.00user 8.35system 0:08.35elapsed 99%CPU
        0.00user 8.34system 0:08.35elapsed 99%CPU
      
      2)After
        0.00user 1.10system 0:01.10elapsed 99%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
      
      The results show the performance increases at least in 548 times.
      
      Shakeel Butt tested this patchset with fork-bomb on his configuration:
      
       > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
       > file containing few KiBs on corresponding mount. Then in a separate
       > memcg of 200 MiB limit ran a fork-bomb.
       >
       > I ran the "perf record -ag -- sleep 60" and below are the results:
       >
       > Without the patch series:
       > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
       > +  36.40%            fb.sh  [kernel.kallsyms]    [k] shrink_slab
       > +  18.97%            fb.sh  [kernel.kallsyms]    [k] list_lru_count_one
       > +   6.75%            fb.sh  [kernel.kallsyms]    [k] super_cache_count
       > +   0.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
       > +   0.44%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
       > +   0.27%            fb.sh  [kernel.kallsyms]    [k] up_read
       > +   0.21%            fb.sh  [kernel.kallsyms]    [k] osq_lock
       > +   0.13%            fb.sh  [kernel.kallsyms]    [k] shmem_unused_huge_count
       > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
       > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node
       >
       > With the patch series:
       > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
       > +  47.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
       > +  30.72%            fb.sh  [kernel.kallsyms]    [k] up_read
       > +   9.51%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
       > +   1.69%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
       > +   1.35%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_protected
       > +   1.05%            fb.sh  [kernel.kallsyms]    [k] queued_spin_lock_slowpath
       > +   0.85%            fb.sh  [kernel.kallsyms]    [k] _raw_spin_lock
       > +   0.78%            fb.sh  [kernel.kallsyms]    [k] lruvec_lru_size
       > +   0.57%            fb.sh  [kernel.kallsyms]    [k] shrink_node
       > +   0.54%            fb.sh  [kernel.kallsyms]    [k] queue_work_on
       > +   0.46%            fb.sh  [kernel.kallsyms]    [k] shrink_slab_memcg
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f90280d6
    • K
      mm: add SHRINK_EMPTY shrinker methods return value · 9b996468
      Kirill Tkhai 提交于
      We need to distinguish the situations when shrinker has very small
      amount of objects (see vfs_pressure_ratio() called from
      super_cache_count()), and when it has no objects at all.  Currently, in
      the both of these cases, shrinker::count_objects() returns 0.
      
      The patch introduces new SHRINK_EMPTY return value, which will be used
      for "no objects at all" case.  It's is a refactoring mostly, as
      SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
      patch, and all the magic will happen in further.
      
      Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9b996468
    • V
      mm/vmscan.c: generalize shrink_slab() calls in shrink_node() · aeed1d32
      Vladimir Davydov 提交于
      The patch makes shrink_slab() be called for root_mem_cgroup in the same
      way as it's called for the rest of cgroups.  This simplifies the logic
      and improves the readability.
      
      [ktkhai@virtuozzo.com: wrote changelog]
      Link: http://lkml.kernel.org/r/153063068338.1818.11496084754797453962.stgit@localhost.localdomainSigned-off-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aeed1d32
    • K
      mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab() · b0dedc49
      Kirill Tkhai 提交于
      Using the preparations made in previous patches, in case of memcg
      shrink, we may avoid shrinkers, which are not set in memcg's shrinkers
      bitmap.  To do that, we separate iterations over memcg-aware and
      !memcg-aware shrinkers, and memcg-aware shrinkers are chosen via
      for_each_set_bit() from the bitmap.  In case of big nodes, having many
      isolated environments, this gives significant performance growth.  See
      next patches for the details.
      
      Note that the patch does not respect to empty memcg shrinkers, since we
      never clear the bitmap bits after we set it once.  Their shrinkers will
      be called again, with no shrinked objects as result.  This functionality
      is provided by next patches.
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112558507.4097.12713813335683345488.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063066653.1818.976035462801487910.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b0dedc49
    • K
      mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance · fae91d6d
      Kirill Tkhai 提交于
      Introduce set_shrinker_bit() function to set shrinker-related bit in
      memcg shrinker bitmap, and set the bit after the first item is added and
      in case of reparenting destroyed memcg's items.
      
      This will allow next patch to make shrinkers be called only, in case of
      they have charged objects at the moment, and to improve shrink_slab()
      performance.
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112557572.4097.17315791419810749985.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063065671.1818.15914674956134687268.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fae91d6d
    • K
      mm/memcontrol.c: export mem_cgroup_is_root() · dfd2f10c
      Kirill Tkhai 提交于
      This will be used in next patch.
      
      Link: http://lkml.kernel.org/r/153063064347.1818.1987011484100392706.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dfd2f10c
    • K
      mm/list_lru.c: pass lru argument to memcg_drain_list_lru_node() · 3b82c4dc
      Kirill Tkhai 提交于
      This is just refactoring to allow next patches to have lru pointer in
      memcg_drain_list_lru_node().
      
      Link: http://lkml.kernel.org/r/153063063164.1818.55009531386089350.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b82c4dc
    • K
      mm/list_lru: pass dst_memcg argument to memcg_drain_list_lru_node() · 9bec5c35
      Kirill Tkhai 提交于
      This is just refactoring to allow the next patches to have dst_memcg
      pointer in memcg_drain_list_lru_node().
      
      Link: http://lkml.kernel.org/r/153063062118.1818.2761273817739499749.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9bec5c35
    • K
      mm/list_lru.c: add memcg argument to list_lru_from_kmem() · 44bd4a47
      Kirill Tkhai 提交于
      This is just refactoring to allow the next patches to have memcg pointer
      in list_lru_from_kmem().
      
      Link: http://lkml.kernel.org/r/153063060664.1818.9541345386733498582.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      44bd4a47
    • K
      fs: propagate shrinker::id to list_lru · c92e8e10
      Kirill Tkhai 提交于
      Add list_lru::shrinker_id field and populate it by registered shrinker
      id.
      
      This will be used to set correct bit in memcg shrinkers map by lru code
      in next patches, after there appeared the first related to memcg element
      in list_lru.
      
      Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c92e8e10
    • K
      fs/super.c: refactor alloc_super() · 2b3648a6
      Kirill Tkhai 提交于
      Do two list_lru_init_memcg() calls after prealloc_super().
      destroy_unused_super() in fail path is OK with this.  Next patch needs
      such the order.
      
      Link: http://lkml.kernel.org/r/153063058712.1818.3382490999719078571.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b3648a6
    • K
      mm/workingset.c: refactor workingset_init() · 39887653
      Kirill Tkhai 提交于
      Use prealloc_shrinker()/register_shrinker_prepared() instead of
      register_shrinker().  This will be used in next patch.
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112550112.4097.16606173020912323761.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063057666.1818.17625951186610808734.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      39887653
    • K
      mm, memcg: assign memcg-aware shrinkers bitmap to memcg · 0a4465d3
      Kirill Tkhai 提交于
      Imagine a big node with many cpus, memory cgroups and containers.  Let
      we have 200 containers, every container has 10 mounts, and 10 cgroups.
      All container tasks don't touch foreign containers mounts.  If there is
      intensive pages write, and global reclaim happens, a writing task has to
      iterate over all memcgs to shrink slab, before it's able to go to
      shrink_page_list().
      
      Iteration over all the memcg slabs is very expensive: the task has to
      visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
      2000 memcgs, the total calls are 2000 * 2000 = 4000000.
      
      So, the shrinker makes 4 million do_shrink_slab() calls just to try to
      isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcg via
      shrink_page_list().  I've observed a node spending almost 100% in
      kernel, making useless iteration over already shrinked slab.
      
      This patch adds bitmap of memcg-aware shrinkers to memcg.  The size of
      the bitmap depends on bitmap_nr_ids, and during memcg life it's
      maintained to be enough to fit bitmap_nr_ids shrinkers.  Every bit in
      the map is related to corresponding shrinker id.
      
      Next patches will maintain set bit only for really charged memcg.  This
      will allow shrink_slab() to increase its performance in significant way.
      See the last patch for the numbers.
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
      [ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
        Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
      Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a4465d3
    • K
      mm/memcontrol.c: move up for_each_mem_cgroup{, _tree} defines · b05706f1
      Kirill Tkhai 提交于
      Next patch requires these defines are above their current position, so
      here they are moved to declarations.
      
      Link: http://lkml.kernel.org/r/153063055665.1818.5200425793649695598.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b05706f1
    • K
      mm: assign id to every memcg-aware shrinker · b4c2b231
      Kirill Tkhai 提交于
      Introduce shrinker::id number, which is used to enumerate memcg-aware
      shrinkers.  The number start from 0, and the code tries to maintain it
      as small as possible.
      
      This will be used to represent a memcg-aware shrinkers in memcg
      shrinkers map.
      
      Since all memcg-aware shrinkers are based on list_lru, which is
      per-memcg in case of !CONFIG_MEMCG_KMEM only, the new functionality will
      be under this config option.
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112546435.4097.10607140323811756557.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063054586.1818.6041047871606697364.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4c2b231
    • K
      mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CONFIG_SLOB · 84c07d11
      Kirill Tkhai 提交于
      Introduce new config option, which is used to replace repeating
      CONFIG_MEMCG && !CONFIG_SLOB pattern.  Next patches add a little more
      memcg+kmem related code, so let's keep the defines more clearly.
      
      Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84c07d11
    • K
      mm/list_lru.c: combine code under the same define · e0295238
      Kirill Tkhai 提交于
      Patch series "Improve shrink_slab() scalability (old complexity was O(n^2), new is O(n))", v8.
      
      This patcheset solves the problem with slow shrink_slab() occuring on
      the machines having many shrinkers and memory cgroups (i.e., with many
      containers).  The problem is complexity of shrink_slab() is O(n^2) and
      it grows too fast with the growth of containers numbers.
      
      Let us have 200 containers, and every container has 10 mounts and 10
      cgroups.  All container tasks are isolated, and they don't touch foreign
      containers mounts.
      
      In case of global reclaim, a task has to iterate all over the memcgs and
      to call all the memcg-aware shrinkers for all of them.  This means, the
      task has to visit 200 * 10 = 2000 shrinkers for every memcg, and since
      there are 2000 memcgs, the total calls of do_shrink_slab() are 2000 *
      2000 = 4000000.
      
      4 million calls are not a number operations, which can takes 1 cpu
      cycle.  E.g., super_cache_count() accesses at least two lists, and makes
      arifmetical calculations.  Even, if there are no charged objects, we do
      these calculations, and replaces cpu caches by read memory.  I observed
      nodes spending almost 100% time in kernel, in case of intensive writing
      and global reclaim.  The writer consumes pages fast, but it's need to
      shrink_slab() before the reclaimer reached shrink pages function (and
      frees SWAP_CLUSTER_MAX pages).  Even if there is no writing, the
      iterations just waste the time, and slows reclaim down.
      
      Let's see the small test below:
      
        $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
        $mkdir /sys/fs/cgroup/memory/ct
        $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
        $for i in `seq 0 4000`;
                do mkdir /sys/fs/cgroup/memory/ct/$i;
                echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
                mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file;
        done
      
      Then, let's see drop caches time (5 sequential calls):
      
        $time echo 3 > /proc/sys/vm/drop_caches
      
        0.00user 13.78system 0:13.78elapsed 99%CPU
        0.00user 5.59system 0:05.60elapsed 99%CPU
        0.00user 5.48system 0:05.48elapsed 99%CPU
        0.00user 8.35system 0:08.35elapsed 99%CPU
        0.00user 8.34system 0:08.35elapsed 99%CPU
      
      The last four calls don't actually shrink anything.  So, the iterations
      over slab shrinkers take 5.48 seconds.  Not so good for scalability.
      
      The patchset solves the problem by making shrink_slab() of O(n)
      complexity.  There are following functional actions:
      
      1) Assign id to every registered memcg-aware shrinker.
      
      2) Maintain per-memcgroup bitmap of memcg-aware shrinkers, and set a
         shrinker-related bit after the first element is added to lru list
         (also, when removed child memcg elements are reparanted).
      
      3) Split memcg-aware shrinkers and !memcg-aware shrinkers, and call a
         shrinker if its bit is set in memcg's shrinker bitmap.  (Also, there is
         a functionality to clear the bit, after last element is shrinked).
      
      This gives significant performance increase.  The result after patchset
      is applied:
      
        $time echo 3 > /proc/sys/vm/drop_caches
      
        0.00user 1.10system 0:01.10elapsed 99%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
      
      The results show the performance increases at least in 548 times.
      
      So, the patchset makes shrink_slab() of less complexity and improves the
      performance in such types of load I pointed.  This will give a profit in
      case of !global reclaim case, since there also will be less
      do_shrink_slab() calls.
      
      This patch (of 17):
      
      These two pairs of blocks of code are under the same #ifdef #else
      #endif.
      
      Link: http://lkml.kernel.org/r/153063052519.1818.9393587113056959488.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e0295238
    • M
      mm/memblock.c: replace u64 with phys_addr_t where appropriate · a36aab89
      Mike Rapoport 提交于
      Most functions in memblock already use phys_addr_t to represent a
      physical address with __memblock_free_late() being an exception.
      
      This patch replaces u64 with phys_addr_t in __memblock_free_late() and
      switches several format strings from %llx to %pa to avoid casting from
      phys_addr_t to u64.
      
      Link: http://lkml.kernel.org/r/1530637506-1256-1-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a36aab89
    • O
      mm/sparse.c: make sparse_init_one_section void and remove check · 4e40987f
      Oscar Salvador 提交于
      sparse_init_one_section() is being called from two sites: sparse_init()
      and sparse_add_one_section().  The former calls it from a
      for_each_present_section_nr() loop, and the latter marks the section as
      present before calling it.  This means that when
      sparse_init_one_section() gets called, we already know that the section
      is present.  So there is no point to double check that in the function.
      
      This removes the check and makes the function void.
      
      [ross.zwisler@linux.intel.com: fix error path in sparse_add_one_section]
        Link: http://lkml.kernel.org/r/20180706190658.6873-1-ross.zwisler@linux.intel.com
      [ross.zwisler@linux.intel.com: simplification suggested by Oscar]
        Link: http://lkml.kernel.org/r/20180706223358.742-1-ross.zwisler@linux.intel.com
      Link: http://lkml.kernel.org/r/20180702154325.12196-1-osalvador@techadventures.netSigned-off-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e40987f
    • M
      memcg, oom: move out_of_memory back to the charge path · 29ef680a
      Michal Hocko 提交于
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") has changed the ENOMEM semantic of memcg charges.
      Rather than invoking the oom killer from the charging context it delays
      the oom killer to the page fault path (pagefault_out_of_memory).  This
      in turn means that many users (e.g.  slab or g-u-p) will get ENOMEM when
      the corresponding memcg hits the hard limit and the memcg is is OOM.
      This is behavior is inconsistent with !memcg case where the oom killer
      is invoked from the allocation context and the allocator keeps retrying
      until it succeeds.
      
      The difference in the behavior is user visible.  mmap(MAP_POPULATE)
      might result in not fully populated ranges while the mmap return code
      doesn't tell that to the userspace.  Random syscalls might fail with
      ENOMEM etc.
      
      The primary motivation of the different memcg oom semantic was the
      deadlock avoidance.  Things have changed since then, though.  We have an
      async oom teardown by the oom reaper now and so we do not have to rely
      on the victim to tear down its memory anymore.  Therefore we can return
      to the original semantic as long as the memcg oom killer is not handed
      over to the users space.
      
      There is still one thing to be careful about here though.  If the oom
      killer is not able to make any forward progress - e.g.  because there is
      no eligible task to kill - then we have to bail out of the charge path
      to prevent from same class of deadlocks.  We have basically two options
      here.  Either we fail the charge with ENOMEM or force the charge and
      allow overcharge.  The first option has been considered more harmful
      than useful because rare inconsistencies in the ENOMEM behavior is hard
      to test for and error prone.  Basically the same reason why the page
      allocator doesn't fail allocations under such conditions.  The later
      might allow runaways but those should be really unlikely unless somebody
      misconfigures the system.  E.g.  allowing to migrate tasks away from the
      memcg to a different unlimited memcg with move_charge_at_immigrate
      disabled.
      
      Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29ef680a
    • M
      mm: make DEFERRED_STRUCT_PAGE_INIT explicitly depend on SPARSEMEM · d39f8fb4
      Mike Rapoport 提交于
      The deferred memory initialization relies on section definitions, e.g
      PAGES_PER_SECTION, that are only available when CONFIG_SPARSEMEM=y on
      most architectures.
      
      Initially DEFERRED_STRUCT_PAGE_INIT depended on explicit
      ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT configuration option, but since
      the commit 2e3ca40f ("mm: relax deferred struct page
      requirements") this requirement was relaxed and now it is possible to
      enable DEFERRED_STRUCT_PAGE_INIT on architectures that support
      DISCONTINGMEM and NO_BOOTMEM which causes build failures.
      
      For instance, setting SMP=y and DEFERRED_STRUCT_PAGE_INIT=y on arc
      causes the following build failure:
      
          CC      mm/page_alloc.o
        mm/page_alloc.c: In function 'update_defer_init':
        mm/page_alloc.c:321:14: error: 'PAGES_PER_SECTION'
        undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
              (pfn & (PAGES_PER_SECTION - 1)) == 0) {
                      ^~~~~~~~~~~~~~~~~
                      USEC_PER_SEC
        mm/page_alloc.c:321:14: note: each undeclared identifier is reported only once for each function it appears in
        In file included from include/linux/cache.h:5:0,
                         from include/linux/printk.h:9,
                         from include/linux/kernel.h:14,
                         from include/asm-generic/bug.h:18,
                         from arch/arc/include/asm/bug.h:32,
                         from include/linux/bug.h:5,
                         from include/linux/mmdebug.h:5,
                         from include/linux/mm.h:9,
                         from mm/page_alloc.c:18:
        mm/page_alloc.c: In function 'deferred_grow_zone':
        mm/page_alloc.c:1624:52: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
          unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
                                                            ^
        include/uapi/linux/kernel.h:11:47: note: in definition of macro '__ALIGN_KERNEL_MASK'
         #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
                                                       ^~~~
        include/linux/kernel.h:58:22: note: in expansion of macro '__ALIGN_KERNEL'
         #define ALIGN(x, a)  __ALIGN_KERNEL((x), (a))
                              ^~~~~~~~~~~~~~
        mm/page_alloc.c:1624:34: note: in expansion of macro 'ALIGN'
          unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
                                          ^~~~~
        In file included from include/asm-generic/bug.h:18:0,
                         from arch/arc/include/asm/bug.h:32,
                         from include/linux/bug.h:5,
                         from include/linux/mmdebug.h:5,
                         from include/linux/mm.h:9,
                         from mm/page_alloc.c:18:
        mm/page_alloc.c: In function 'free_area_init_node':
        mm/page_alloc.c:6379:50: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
          pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                                          ^
        include/linux/kernel.h:812:22: note: in definition of macro '__typecheck'
           (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                              ^
        include/linux/kernel.h:836:24: note: in expansion of macro '__safe_cmp'
          __builtin_choose_expr(__safe_cmp(x, y), \
                                ^~~~~~~~~~
        include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp'
         #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
                                   ^~~~~~~~~~~~~
        mm/page_alloc.c:6379:29: note: in expansion of macro 'min_t'
          pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                     ^~~~~
        include/linux/kernel.h:836:2: error: first argument to '__builtin_choose_expr' not a constant
          __builtin_choose_expr(__safe_cmp(x, y), \
          ^
        include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp'
         #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
                                   ^~~~~~~~~~~~~
        mm/page_alloc.c:6379:29: note: in expansion of macro 'min_t'
          pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                     ^~~~~
        scripts/Makefile.build:317: recipe for target 'mm/page_alloc.o' failed
      
      Let's make the DEFERRED_STRUCT_PAGE_INIT explicitly depend on SPARSEMEM
      as the systems that support DISCONTIGMEM do not seem to have that huge
      amounts of memory that would make DEFERRED_STRUCT_PAGE_INIT relevant.
      
      Link: http://lkml.kernel.org/r/1530279308-24988-1-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: NRandy Dunlap <rdunlap@infradead.org>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d39f8fb4
    • A
      kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN · 0207df4f
      Andrey Ryabinin 提交于
      KASAN learns about hotadded memory via the memory hotplug notifier.
      devm_memremap_pages() intentionally skips calling memory hotplug
      notifiers.  So KASAN doesn't know anything about new memory added by
      devm_memremap_pages().  This causes a crash when KASAN tries to access
      non-existent shadow memory:
      
       BUG: unable to handle kernel paging request at ffffed0078000000
       RIP: 0010:check_memory_region+0x82/0x1e0
       Call Trace:
        memcpy+0x1f/0x50
        pmem_do_bvec+0x163/0x720
        pmem_make_request+0x305/0xac0
        generic_make_request+0x54f/0xcf0
        submit_bio+0x9c/0x370
        submit_bh_wbc+0x4c7/0x700
        block_read_full_page+0x5ef/0x870
        do_read_cache_page+0x2b8/0xb30
        read_dev_sector+0xbd/0x3f0
        read_lba.isra.0+0x277/0x670
        efi_partition+0x41a/0x18f0
        check_partition+0x30d/0x5e9
        rescan_partitions+0x18c/0x840
        __blkdev_get+0x859/0x1060
        blkdev_get+0x23f/0x810
        __device_add_disk+0x9c8/0xde0
        pmem_attach_disk+0x9a8/0xf50
        nvdimm_bus_probe+0xf3/0x3c0
        driver_probe_device+0x493/0xbd0
        bus_for_each_drv+0x118/0x1b0
        __device_attach+0x1cd/0x2b0
        bus_probe_device+0x1ac/0x260
        device_add+0x90d/0x1380
        nd_async_device_register+0xe/0x50
        async_run_entry_fn+0xc3/0x5d0
        process_one_work+0xa0a/0x1810
        worker_thread+0x87/0xe80
        kthread+0x2d7/0x390
        ret_from_fork+0x3a/0x50
      
      Add kasan_add_zero_shadow()/kasan_remove_zero_shadow() - post mm_init()
      interface to map/unmap kasan_zero_page at requested virtual addresses.
      And use it to add/remove the shadow memory for hotplugged/unplugged
      device memory.
      
      Link: http://lkml.kernel.org/r/20180629164932.740-1-aryabinin@virtuozzo.com
      Fixes: 41e94a85 ("add devm_memremap_pages")
      Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Tested-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0207df4f
    • S
      mm: thp: pass correct vm_flags to hugepage_vma_check() · 50f8b92f
      Song Liu 提交于
      khugepaged_enter_vma_merge() passes a stale vma->vm_flags to
      hugepage_vma_check().  The argument vm_flags contains the latest value.
      Therefore, it is necessary to pass this vm_flags into
      hugepage_vma_check().
      
      With this bug, madvise(MADV_HUGEPAGE) for mmap files in shmem fails to
      put memory in huge pages.  Here is an example of failed madvise():
      
         /* mount /dev/shm with huge=advise:
          *     mount -o remount,huge=advise /dev/shm */
         /* create file /dev/shm/huge */
         #define HUGE_FILE "/dev/shm/huge"
      
         fd = open(HUGE_FILE, O_RDONLY);
         ptr = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
         ret = madvise(ptr, FILE_SIZE, MADV_HUGEPAGE);
      
      madvise() will return 0, but this memory region is never put in huge
      page (check from /proc/meminfo: ShmemHugePages).
      
      Link: http://lkml.kernel.org/r/20180629181752.792831-1-songliubraving@fb.com
      Fixes: 02b75dc8160d ("mm: thp: register mm for khugepaged when merging vma for shmem")
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      50f8b92f
    • A
      mm/fadvise.c: fix signed overflow UBSAN complaint · a718e28f
      Andrey Ryabinin 提交于
      Signed integer overflow is undefined according to the C standard.  The
      overflow in ksys_fadvise64_64() is deliberate, but since it is signed
      overflow, UBSAN complains:
      
      	UBSAN: Undefined behaviour in mm/fadvise.c:76:10
      	signed integer overflow:
      	4 + 9223372036854775805 cannot be represented in type 'long long int'
      
      Use unsigned types to do math.  Unsigned overflow is defined so UBSAN
      will not complain about it.  This patch doesn't change generated code.
      
      [akpm@linux-foundation.org: add comment explaining the casts]
      Link: http://lkml.kernel.org/r/20180629184453.7614-1-aryabinin@virtuozzo.comSigned-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: <icytxw@gmail.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a718e28f
    • C
      mm/swap_slots.c: make swap_slots_cache_mutex and swap_slots_cache_enable_mutex static · 31f21da1
      Colin Ian King 提交于
      The mutexes swap_slots_cache_mutex and swap_slots_cache_enable_mutex are
      local to the source and do not need to be in global scope, so make them
      static.
      
      Cleans up sparse warnings:
        symbol 'swap_slots_cache_mutex' was not declared. Should it be static?
        symbol 'swap_slots_cache_enable_mutex' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20180624182536.4937-1-colin.king@canonical.comSigned-off-by: NColin Ian King <colin.king@canonical.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31f21da1
    • C
      mm/zsmalloc.c: make several functions and a struct static · 4d0a5402
      Colin Ian King 提交于
      The functions zs_page_isolate, zs_page_migrate, zs_page_putback,
      lock_zspage, trylock_zspage and structure zsmalloc_aops are local to
      source and do not need to be in global scope, so make them static.
      
      Cleans up sparse warnings:
        symbol 'zs_page_isolate' was not declared. Should it be static?
        symbol 'zs_page_migrate' was not declared. Should it be static?
        symbol 'zs_page_putback' was not declared. Should it be static?
        symbol 'zsmalloc_aops' was not declared. Should it be static?
        symbol 'lock_zspage' was not declared. Should it be static?
        symbol 'trylock_zspage' was not declared. Should it be static?
      
      [arnd@arndb.de: hide unused lock_zspage]
        Link: http://lkml.kernel.org/r/20180706130924.3891230-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/20180624213322.13776-1-colin.king@canonical.comSigned-off-by: NColin Ian King <colin.king@canonical.com>
      Reviewed-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d0a5402
    • G
      mm/page-writeback.c: update stale account_page_redirty() comment · dcfe4df3
      Greg Thelen 提交于
      Commit 93f78d88 ("writeback: move backing_dev_info->bdi_stat[] into
      bdi_writeback") replaced BDI_DIRTIED with WB_DIRTIED in
      account_page_redirty().  Update comment to track that change.
      
        BDI_DIRTIED => WB_DIRTIED
        BDI_WRITTEN => WB_WRITTEN
      
      Link: http://lkml.kernel.org/r/20180625171526.173483-1-gthelen@google.comSigned-off-by: NGreg Thelen <gthelen@google.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dcfe4df3
    • S
      fs, mm: account buffer_head to kmemcg · f745c6f5
      Shakeel Butt 提交于
      The buffer_head can consume a significant amount of system memory and is
      directly related to the amount of page cache.  In our production
      environment we have observed that a lot of machines are spending a
      significant amount of memory as buffer_head and can not be left as
      system memory overhead.
      
      Charging buffer_head is not as simple as adding __GFP_ACCOUNT to the
      allocation.  The buffer_heads can be allocated in a memcg different from
      the memcg of the page for which buffer_heads are being allocated.  One
      concrete example is memory reclaim.  The reclaim can trigger I/O of
      pages of any memcg on the system.  So, the right way to charge
      buffer_head is to extract the memcg from the page for which buffer_heads
      are being allocated and then use targeted memcg charging API.
      
      [shakeelb@google.com: use __GFP_ACCOUNT for directed memcg charging]
        Link: http://lkml.kernel.org/r/20180702220208.213380-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180627191250.209150-3-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f745c6f5
    • S
      fs: fsnotify: account fsnotify metadata to kmemcg · d46eb14b
      Shakeel Butt 提交于
      Patch series "Directed kmem charging", v8.
      
      The Linux kernel's memory cgroup allows limiting the memory usage of the
      jobs running on the system to provide isolation between the jobs.  All
      the kernel memory allocated in the context of the job and marked with
      __GFP_ACCOUNT will also be included in the memory usage and be limited
      by the job's limit.
      
      The kernel memory can only be charged to the memcg of the process in
      whose context kernel memory was allocated.  However there are cases
      where the allocated kernel memory should be charged to the memcg
      different from the current processes's memcg.  This patch series
      contains two such concrete use-cases i.e.  fsnotify and buffer_head.
      
      The fsnotify event objects can consume a lot of system memory for large
      or unlimited queues if there is either no or slow listener.  The events
      are allocated in the context of the event producer.  However they should
      be charged to the event consumer.  Similarly the buffer_head objects can
      be allocated in a memcg different from the memcg of the page for which
      buffer_head objects are being allocated.
      
      To solve this issue, this patch series introduces mechanism to charge
      kernel memory to a given memcg.  In case of fsnotify events, the memcg
      of the consumer can be used for charging and for buffer_head, the memcg
      of the page can be charged.  For directed charging, the caller can use
      the scope API memalloc_[un]use_memcg() to specify the memcg to charge
      for all the __GFP_ACCOUNT allocations within the scope.
      
      This patch (of 2):
      
      A lot of memory can be consumed by the events generated for the huge or
      unlimited queues if there is either no or slow listener.  This can cause
      system level memory pressure or OOMs.  So, it's better to account the
      fsnotify kmem caches to the memcg of the listener.
      
      However the listener can be in a different memcg than the memcg of the
      producer and these allocations happen in the context of the event
      producer.  This patch introduces remote memcg charging API which the
      producer can use to charge the allocations to the memcg of the listener.
      
      There are seven fsnotify kmem caches and among them allocations from
      dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
      inotify_inode_mark_cachep happens in the context of syscall from the
      listener.  So, SLAB_ACCOUNT is enough for these caches.
      
      The objects from fsnotify_mark_connector_cachep are not accounted as
      they are small compared to the notification mark or events and it is
      unclear whom to account connector to since it is shared by all events
      attached to the inode.
      
      The allocations from the event caches happen in the context of the event
      producer.  For such caches we will need to remote charge the allocations
      to the listener's memcg.  Thus we save the memcg reference in the
      fsnotify_group structure of the listener.
      
      This patch has also moved the members of fsnotify_group to keep the size
      same, at least for 64 bit build, even with additional member by filling
      the holes.
      
      [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
        Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d46eb14b
    • R
      mm: introduce mem_cgroup_put() helper · dc0b5864
      Roman Gushchin 提交于
      Introduce the mem_cgroup_put() helper, which helps to eliminate guarding
      memcg css release with "#ifdef CONFIG_MEMCG" in multiple places.
      
      Link: http://lkml.kernel.org/r/20180623000600.5818-2-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@kernel.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc0b5864