1. 01 10月, 2013 8 次提交
  2. 25 9月, 2013 9 次提交
  3. 13 9月, 2013 23 次提交
    • C
      mm/Kconfig: add MMU dependency for MIGRATION. · de32a817
      Chen Gang 提交于
      MIGRATION must depend on MMU, or allmodconfig for the nommu sh
      architecture fails to build:
      
          CC      mm/migrate.o
        mm/migrate.c: In function 'remove_migration_pte':
        mm/migrate.c:134:3: error: implicit declaration of function 'pmd_trans_huge' [-Werror=implicit-function-declaration]
           if (pmd_trans_huge(*pmd))
           ^
        mm/migrate.c:149:2: error: implicit declaration of function 'is_swap_pte' [-Werror=implicit-function-declaration]
          if (!is_swap_pte(pte))
          ^
        ...
      
      Also let CMA depend on MMU, or when NOMMU, if we select CMA, it will
      select MIGRATION by force.
      Signed-off-by: NChen Gang <gang.chen@asianux.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de32a817
    • D
      mm, thp: count thp_fault_fallback anytime thp fault fails · 17766dde
      David Rientjes 提交于
      Currently, thp_fault_fallback in vmstat only gets incremented if a
      hugepage allocation fails.  If current's memcg hits its limit or the page
      fault handler returns an error, it is incorrectly accounted as a
      successful thp_fault_alloc.
      
      Count thp_fault_fallback anytime the page fault handler falls back to
      using regular pages and only count thp_fault_alloc when a hugepage has
      actually been faulted.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      17766dde
    • K
      thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page() · c0292554
      Kirill A. Shutemov 提交于
      do_huge_pmd_anonymous_page() has copy-pasted piece of handle_mm_fault()
      to handle fallback path.
      
      Let's consolidate code back by introducing VM_FAULT_FALLBACK return
      code.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NHillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0292554
    • K
      thp: do_huge_pmd_anonymous_page() cleanup · 128ec037
      Kirill A. Shutemov 提交于
      Minor cleanup: unindent most code of the fucntion by inverting one
      condition.  It's preparation for the next patch.
      
      No functional changes.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NHillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      128ec037
    • K
      thp: move maybe_pmd_mkwrite() out of mk_huge_pmd() · 3122359a
      Kirill A. Shutemov 提交于
      It's confusing that mk_huge_pmd() has semantics different from mk_pte() or
      mk_pmd().  I spent some time on debugging issue cased by this
      inconsistency.
      
      Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust prototype
      to match mk_pte().
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3122359a
    • K
      mm: cleanup add_to_page_cache_locked() · 66a0c8ee
      Kirill A. Shutemov 提交于
      Make add_to_page_cache_locked() cleaner:
      
       - unindent most code of the function by inverting one condition;
       - streamline code no-error path;
       - move insert error path outside normal code path;
       - call radix_tree_preload_end() earlier;
      
      No functional changes.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      66a0c8ee
    • K
      thp: account anon transparent huge pages into NR_ANON_PAGES · 3cd14fcd
      Kirill A. Shutemov 提交于
      We use NR_ANON_PAGES as base for reporting AnonPages to user.  There's
      not much sense in not accounting transparent huge pages there, but add
      them on printing to user.
      
      Let's account transparent huge pages in NR_ANON_PAGES in the first place.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Ning Qu <quning@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3cd14fcd
    • K
      truncate: drop 'oldsize' truncate_pagecache() parameter · 7caef267
      Kirill A. Shutemov 提交于
      truncate_pagecache() doesn't care about old size since commit
      cedabed4 ("vfs: Fix vmtruncate() regression").  Let's drop it.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7caef267
    • C
      mm: make lru_add_drain_all() selective · 5fbc4616
      Chris Metcalf 提交于
      make lru_add_drain_all() only selectively interrupt the cpus that have
      per-cpu free pages that can be drained.
      
      This is important in nohz mode where calling mlockall(), for example,
      otherwise will interrupt every core unnecessarily.
      
      This is important on workloads where nohz cores are handling 10 Gb traffic
      in userspace.  Those CPUs do not enter the kernel and place pages into LRU
      pagevecs and they really, really don't want to be interrupted, or they
      drop packets on the floor.
      Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5fbc4616
    • S
      memcg: add per cgroup writeback pages accounting · 3ea67d06
      Sha Zhengju 提交于
      Add memcg routines to count writeback pages, later dirty pages will also
      be accounted.
      
      After Kame's commit 89c06bd5 ("memcg: use new logic for page stat
      accounting"), we can use 'struct page' flag to test page state instead
      of per page_cgroup flag.  But memcg has a feature to move a page from a
      cgroup to another one and may have race between "move" and "page stat
      accounting".  So in order to avoid the race we have designed a new lock:
      
               mem_cgroup_begin_update_page_stat()
               modify page information        -->(a)
               mem_cgroup_update_page_stat()  -->(b)
               mem_cgroup_end_update_page_stat()
      
      It requires both (a) and (b)(writeback pages accounting) to be pretected
      in mem_cgroup_{begin/end}_update_page_stat().  It's full no-op for
      !CONFIG_MEMCG, almost no-op if memcg is disabled (but compiled in), rcu
      read lock in the most cases (no task is moving), and spin_lock_irqsave
      on top in the slow path.
      
      There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
      And the lock order is:
      	--> memcg->move_lock
      	  --> mapping->tree_lock
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NGreg Thelen <gthelen@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ea67d06
    • S
      memcg: check for proper lock held in mem_cgroup_update_page_stat · 658b72c5
      Sha Zhengju 提交于
      We should call mem_cgroup_begin_update_page_stat() before
      mem_cgroup_update_page_stat() to get proper locks, however the latter
      doesn't do any checking that we use proper locking, which would be hard.
      Suggested by Michal Hock we could at least test for rcu_read_lock_held()
      because RCU is held if !mem_cgroup_disabled().
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NGreg Thelen <gthelen@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      658b72c5
    • S
      memcg: remove MEMCG_NR_FILE_MAPPED · 68b4876d
      Sha Zhengju 提交于
      While accounting memcg page stat, it's not worth to use
      MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the
      complexity and presumed performance overhead.  We can use
      MEM_CGROUP_STAT_FILE_MAPPED directly.
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NFengguang Wu <fengguang.wu@intel.com>
      Reviewed-by: NGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68b4876d
    • S
      memcg: rename RESOURCE_MAX to RES_COUNTER_MAX · 6de5a8bf
      Sha Zhengju 提交于
      RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX.
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Signed-off-by: NQiang Huang <h.huangqiang@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Jeff Liu <jeff.liu@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6de5a8bf
    • J
      mm: memcg: do not trap chargers with full callstack on OOM · 3812c8c8
      Johannes Weiner 提交于
      The memcg OOM handling is incredibly fragile and can deadlock.  When a
      task fails to charge memory, it invokes the OOM killer and loops right
      there in the charge code until it succeeds.  Comparably, any other task
      that enters the charge path at this point will go to a waitqueue right
      then and there and sleep until the OOM situation is resolved.  The problem
      is that these tasks may hold filesystem locks and the mmap_sem; locks that
      the selected OOM victim may need to exit.
      
      For example, in one reported case, the task invoking the OOM killer was
      about to charge a page cache page during a write(), which holds the
      i_mutex.  The OOM killer selected a task that was just entering truncate()
      and trying to acquire the i_mutex:
      
      OOM invoking task:
        mem_cgroup_handle_oom+0x241/0x3b0
        mem_cgroup_cache_charge+0xbe/0xe0
        add_to_page_cache_locked+0x4c/0x140
        add_to_page_cache_lru+0x22/0x50
        grab_cache_page_write_begin+0x8b/0xe0
        ext3_write_begin+0x88/0x270
        generic_file_buffered_write+0x116/0x290
        __generic_file_aio_write+0x27c/0x480
        generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
        do_sync_write+0xea/0x130
        vfs_write+0xf3/0x1f0
        sys_write+0x51/0x90
        system_call_fastpath+0x18/0x1d
      
      OOM kill victim:
        do_truncate+0x58/0xa0              # takes i_mutex
        do_last+0x250/0xa30
        path_openat+0xd7/0x440
        do_filp_open+0x49/0xa0
        do_sys_open+0x106/0x240
        sys_open+0x20/0x30
        system_call_fastpath+0x18/0x1d
      
      The OOM handling task will retry the charge indefinitely while the OOM
      killed task is not releasing any resources.
      
      A similar scenario can happen when the kernel OOM killer for a memcg is
      disabled and a userspace task is in charge of resolving OOM situations.
      In this case, ALL tasks that enter the OOM path will be made to sleep on
      the OOM waitqueue and wait for userspace to free resources or increase
      the group's limit.  But a userspace OOM handler is prone to deadlock
      itself on the locks held by the waiting tasks.  For example one of the
      sleeping tasks may be stuck in a brk() call with the mmap_sem held for
      writing but the userspace handler, in order to pick an optimal victim,
      may need to read files from /proc/<pid>, which tries to acquire the same
      mmap_sem for reading and deadlocks.
      
      This patch changes the way tasks behave after detecting a memcg OOM and
      makes sure nobody loops or sleeps with locks held:
      
      1. When OOMing in a user fault, invoke the OOM killer and restart the
         fault instead of looping on the charge attempt.  This way, the OOM
         victim can not get stuck on locks the looping task may hold.
      
      2. When OOMing in a user fault but somebody else is handling it
         (either the kernel OOM killer or a userspace handler), don't go to
         sleep in the charge context.  Instead, remember the OOMing memcg in
         the task struct and then fully unwind the page fault stack with
         -ENOMEM.  pagefault_out_of_memory() will then call back into the
         memcg code to check if the -ENOMEM came from the memcg, and then
         either put the task to sleep on the memcg's OOM waitqueue or just
         restart the fault.  The OOM victim can no longer get stuck on any
         lock a sleeping task may hold.
      
      Debugged by Michal Hocko.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NazurIt <azurit@pobox.sk>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3812c8c8
    • J
      mm: memcg: rework and document OOM waiting and wakeup · fb2a6fc5
      Johannes Weiner 提交于
      The memcg OOM handler open-codes a sleeping lock for OOM serialization
      (trylock, wait, repeat) because the required locking is so specific to
      memcg hierarchies.  However, it would be nice if this construct would be
      clearly recognizable and not be as obfuscated as it is right now.  Clean
      up as follows:
      
      1. Remove the return value of mem_cgroup_oom_unlock()
      
      2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().
      
      3. Pull the prepare_to_wait() out of the memcg_oom_lock scope.  This
         makes it more obvious that the task has to be on the waitqueue
         before attempting to OOM-trylock the hierarchy, to not miss any
         wakeups before going to sleep.  It just didn't matter until now
         because it was all lumped together into the global memcg_oom_lock
         spinlock section.
      
      4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
         It is proctected by the hierarchical OOM-lock.
      
      5. The memcg_oom_lock spinlock is only required to propagate the OOM
         lock in any given hierarchy atomically.  Restrict its scope to
         mem_cgroup_oom_(trylock|unlock).
      
      6. Do not wake up the waitqueue unconditionally at the end of the
         function.  Only the lockholder has to wake up the next in line
         after releasing the lock.
      
         Note that the lockholder kicks off the OOM-killer, which in turn
         leads to wakeups from the uncharges of the exiting task.  But a
         contender is not guaranteed to see them if it enters the OOM path
         after the OOM kills but before the lockholder releases the lock.
         Thus there has to be an explicit wakeup after releasing the lock.
      
      7. Put the OOM task on the waitqueue before marking the hierarchy as
         under OOM as that is the point where we start to receive wakeups.
         No point in listening before being on the waitqueue.
      
      8. Likewise, unmark the hierarchy before finishing the sleep, for
         symmetry.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fb2a6fc5
    • J
      mm: memcg: enable memcg OOM killer only for user faults · 519e5247
      Johannes Weiner 提交于
      System calls and kernel faults (uaccess, gup) can handle an out of memory
      situation gracefully and just return -ENOMEM.
      
      Enable the memcg OOM killer only for user faults, where it's really the
      only option available.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      519e5247
    • A
      memcg: trivial cleanups · f894ffa8
      Andrew Morton 提交于
      Clean up some mess made by the "Soft limit rework" series, and a few other
      things.
      
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f894ffa8
    • M
      memcg, vmscan: do not fall into reclaim-all pass too quickly · e975de99
      Michal Hocko 提交于
      shrink_zone starts with soft reclaim pass first and then falls back to
      regular reclaim if nothing has been scanned.  This behavior is natural
      but there is a catch.  Memcg iterators, when used with the reclaim
      cookie, are designed to help to prevent from over reclaim by
      interleaving reclaimers (per node-zone-priority) so the tree walk might
      miss many (even all) nodes in the hierarchy e.g.  when there are direct
      reclaimers racing with each other or with kswapd in the global case or
      multiple allocators reaching the limit for the target reclaim case.  To
      make it even more complicated, targeted reclaim doesn't do the whole
      tree walk because it stops reclaiming once it reclaims sufficient pages.
      As a result groups over the limit might be missed, thus nothing is
      scanned, and reclaim would fall back to the reclaim all mode.
      
      This patch checks for the incomplete tree walk in shrink_zone.  If no
      group has been visited and the hierarchy is soft reclaimable then we
      must have missed some groups, in which case the __shrink_zone is called
      again.  This doesn't guarantee there will be some progress of course
      because the current reclaimer might be still racing with others but it
      would at least give a chance to start the walk without a big risk of
      reclaim latencies.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e975de99
    • M
      memcg: track all children over limit in the root · 1be171d6
      Michal Hocko 提交于
      Children in soft limit excess are currently tracked up the hierarchy in
      memcg->children_in_excess.  Nevertheless there still might exist tons of
      groups that are not in hierarchy relation to the root cgroup (e.g.  all
      first level groups if root_mem_cgroup->use_hierarchy == false).
      
      As the whole tree walk has to be done when the iteration starts at
      root_mem_cgroup the iterator should be able to skip the walk if there is
      no child above the limit without iterating them.  This can be done
      easily if the root tracks all children rather than only hierarchical
      children.  This is done by this patch which updates root_mem_cgroup
      children_in_excess if root_mem_cgroup->use_hierarchy == false so the
      root knows about all children in excess.
      
      Please note that this is not an issue for inner memcgs which have
      use_hierarchy == false because then only the single group is visited so
      no special optimization is necessary.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1be171d6
    • M
      memcg, vmscan: do not attempt soft limit reclaim if it would not scan anything · e839b6a1
      Michal Hocko 提交于
      mem_cgroup_should_soft_reclaim controls whether soft reclaim pass is
      done and it always says yes currently.  Memcg iterators are clever to
      skip nodes that are not soft reclaimable quite efficiently but
      mem_cgroup_should_soft_reclaim can be more clever and do not start the
      soft reclaim pass at all if it knows that nothing would be scanned
      anyway.
      
      In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for
      the target group of the reclaim and allow the pass only if the whole
      subtree wouldn't be skipped.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e839b6a1
    • M
      memcg: track children in soft limit excess to improve soft limit · 7d910c05
      Michal Hocko 提交于
      Soft limit reclaim has to check the whole reclaim hierarchy while doing
      the first pass of the reclaim.  This leads to a higher system time which
      can be visible especially when there are many groups in the hierarchy.
      
      This patch adds a per-memcg counter of children in excess.  It also
      restores MEM_CGROUP_TARGET_SOFTLIMIT into mem_cgroup_event_ratelimit for a
      proper batching.
      
      If a group crosses soft limit for the first time it increases parent's
      children_in_excess up the hierarchy.  The similarly if a group gets below
      the limit it will decrease the counter.  The transition phase is recorded
      in soft_contributed flag.
      
      mem_cgroup_soft_reclaim_eligible then uses this information to better
      decide whether to skip the node or the whole subtree.  The rule is simple.
       Skip the node with a children in excess or skip the whole subtree
      otherwise.
      
      This has been tested by a stream IO (dd if=/dev/zero of=file with
      4*MemTotal size) which is quite sensitive to overhead during reclaim.  The
      load is running in a group with soft limit set to 0 and without any limit.
       Apart from that there was a hierarchy with ~500, 2k and 8k groups (two
      groups on each level) without any pages in them.  base denotes to the
      kernel on which the whole series is based on, rework is the kernel before
      this patch and reworkoptim is with this patch applied:
      
      * Run with soft limit set to 0
      Elapsed
      0-0-limit/base: min: 88.21 max: 94.61 avg: 91.73 std: 2.65 runs: 3
      0-0-limit/rework: min: 76.05 [86.2%] max: 79.08 [83.6%] avg: 77.84 [84.9%] std: 1.30 runs: 3
      0-0-limit/reworkoptim: min: 77.98 [88.4%] max: 80.36 [84.9%] avg: 78.92 [86.0%] std: 1.03 runs: 3
      System
      0.5k-0-limit/base: min: 34.86 max: 36.42 avg: 35.89 std: 0.73 runs: 3
      0.5k-0-limit/rework: min: 43.26 [124.1%] max: 48.95 [134.4%] avg: 46.09 [128.4%] std: 2.32 runs: 3
      0.5k-0-limit/reworkoptim: min: 46.98 [134.8%] max: 50.98 [140.0%] avg: 48.49 [135.1%] std: 1.77 runs: 3
      Elapsed
      0.5k-0-limit/base: min: 88.50 max: 97.52 avg: 93.92 std: 3.90 runs: 3
      0.5k-0-limit/rework: min: 75.92 [85.8%] max: 78.45 [80.4%] avg: 77.34 [82.3%] std: 1.06 runs: 3
      0.5k-0-limit/reworkoptim: min: 75.79 [85.6%] max: 79.37 [81.4%] avg: 77.55 [82.6%] std: 1.46 runs: 3
      System
      2k-0-limit/base: min: 34.57 max: 37.65 avg: 36.34 std: 1.30 runs: 3
      2k-0-limit/rework: min: 64.17 [185.6%] max: 68.20 [181.1%] avg: 66.21 [182.2%] std: 1.65 runs: 3
      2k-0-limit/reworkoptim: min: 49.78 [144.0%] max: 52.99 [140.7%] avg: 51.00 [140.3%] std: 1.42 runs: 3
      Elapsed
      2k-0-limit/base: min: 92.61 max: 97.83 avg: 95.03 std: 2.15 runs: 3
      2k-0-limit/rework: min: 78.33 [84.6%] max: 84.08 [85.9%] avg: 81.09 [85.3%] std: 2.35 runs: 3
      2k-0-limit/reworkoptim: min: 75.72 [81.8%] max: 78.57 [80.3%] avg: 76.73 [80.7%] std: 1.30 runs: 3
      System
      8k-0-limit/base: min: 39.78 max: 42.09 avg: 41.09 std: 0.97 runs: 3
      8k-0-limit/rework: min: 200.86 [504.9%] max: 265.42 [630.6%] avg: 241.80 [588.5%] std: 29.06 runs: 3
      8k-0-limit/reworkoptim: min: 53.70 [135.0%] max: 54.89 [130.4%] avg: 54.43 [132.5%] std: 0.52 runs: 3
      Elapsed
      8k-0-limit/base: min: 95.11 max: 98.61 avg: 96.81 std: 1.43 runs: 3
      8k-0-limit/rework: min: 246.96 [259.7%] max: 331.47 [336.1%] avg: 301.32 [311.2%] std: 38.52 runs: 3
      8k-0-limit/reworkoptim: min: 76.79 [80.7%] max: 81.71 [82.9%] avg: 78.97 [81.6%] std: 2.05 runs: 3
      
      System time is increased by 30-40% but it is reduced a lot comparing to
      kernel without this patch.  The higher time can be explained by the fact
      that the original soft reclaim scanned at priority 0 so it was much more
      effective for this workload (which is basically touch once and writeback).
       The Elapsed time looks better though (~20%).
      
      * Run with no soft limit set
      System
      0-no-limit/base: min: 42.18 max: 50.38 avg: 46.44 std: 3.36 runs: 3
      0-no-limit/rework: min: 40.57 [96.2%] max: 47.04 [93.4%] avg: 43.82 [94.4%] std: 2.64 runs: 3
      0-no-limit/reworkoptim: min: 40.45 [95.9%] max: 45.28 [89.9%] avg: 42.10 [90.7%] std: 2.25 runs: 3
      Elapsed
      0-no-limit/base: min: 75.97 max: 78.21 avg: 76.87 std: 0.96 runs: 3
      0-no-limit/rework: min: 75.59 [99.5%] max: 80.73 [103.2%] avg: 77.64 [101.0%] std: 2.23 runs: 3
      0-no-limit/reworkoptim: min: 77.85 [102.5%] max: 82.42 [105.4%] avg: 79.64 [103.6%] std: 1.99 runs: 3
      System
      0.5k-no-limit/base: min: 44.54 max: 46.93 avg: 46.12 std: 1.12 runs: 3
      0.5k-no-limit/rework: min: 42.09 [94.5%] max: 46.16 [98.4%] avg: 43.92 [95.2%] std: 1.69 runs: 3
      0.5k-no-limit/reworkoptim: min: 42.47 [95.4%] max: 45.67 [97.3%] avg: 44.06 [95.5%] std: 1.31 runs: 3
      Elapsed
      0.5k-no-limit/base: min: 78.26 max: 81.49 avg: 79.65 std: 1.36 runs: 3
      0.5k-no-limit/rework: min: 77.01 [98.4%] max: 80.43 [98.7%] avg: 78.30 [98.3%] std: 1.52 runs: 3
      0.5k-no-limit/reworkoptim: min: 76.13 [97.3%] max: 77.87 [95.6%] avg: 77.18 [96.9%] std: 0.75 runs: 3
      System
      2k-no-limit/base: min: 62.96 max: 69.14 avg: 66.14 std: 2.53 runs: 3
      2k-no-limit/rework: min: 76.01 [120.7%] max: 81.06 [117.2%] avg: 78.17 [118.2%] std: 2.12 runs: 3
      2k-no-limit/reworkoptim: min: 62.57 [99.4%] max: 66.10 [95.6%] avg: 64.53 [97.6%] std: 1.47 runs: 3
      Elapsed
      2k-no-limit/base: min: 76.47 max: 84.22 avg: 79.12 std: 3.60 runs: 3
      2k-no-limit/rework: min: 89.67 [117.3%] max: 93.26 [110.7%] avg: 91.10 [115.1%] std: 1.55 runs: 3
      2k-no-limit/reworkoptim: min: 76.94 [100.6%] max: 79.21 [94.1%] avg: 78.45 [99.2%] std: 1.07 runs: 3
      System
      8k-no-limit/base: min: 104.74 max: 151.34 avg: 129.21 std: 19.10 runs: 3
      8k-no-limit/rework: min: 205.23 [195.9%] max: 285.94 [188.9%] avg: 258.98 [200.4%] std: 38.01 runs: 3
      8k-no-limit/reworkoptim: min: 161.16 [153.9%] max: 184.54 [121.9%] avg: 174.52 [135.1%] std: 9.83 runs: 3
      Elapsed
      8k-no-limit/base: min: 125.43 max: 181.00 avg: 154.81 std: 22.80 runs: 3
      8k-no-limit/rework: min: 254.05 [202.5%] max: 355.67 [196.5%] avg: 321.46 [207.6%] std: 47.67 runs: 3
      8k-no-limit/reworkoptim: min: 193.77 [154.5%] max: 222.72 [123.0%] avg: 210.18 [135.8%] std: 12.13 runs: 3
      
      Both System and Elapsed are in stdev with the base kernel for all
      configurations except for 8k where both System and Elapsed are up by 35%.
      I do not have a good explanation for this because there is no soft reclaim
      pass going on as no group is above the limit which is checked in
      mem_cgroup_should_soft_reclaim.
      
      Then I have tested kernel build with the same configuration to see the
      behavior with a more general behavior.
      
      * Soft limit set to 0 for the build
      System
      0-0-limit/base: min: 242.70 max: 245.17 avg: 243.85 std: 1.02 runs: 3
      0-0-limit/rework min: 237.86 [98.0%] max: 240.22 [98.0%] avg: 239.00 [98.0%] std: 0.97 runs: 3
      0-0-limit/reworkoptim: min: 241.11 [99.3%] max: 243.53 [99.3%] avg: 242.01 [99.2%] std: 1.08 runs: 3
      Elapsed
      0-0-limit/base: min: 348.48 max: 360.86 avg: 356.04 std: 5.41 runs: 3
      0-0-limit/rework min: 286.95 [82.3%] max: 290.26 [80.4%] avg: 288.27 [81.0%] std: 1.43 runs: 3
      0-0-limit/reworkoptim: min: 286.55 [82.2%] max: 289.00 [80.1%] avg: 287.69 [80.8%] std: 1.01 runs: 3
      System
      0.5k-0-limit/base: min: 251.77 max: 254.41 avg: 252.70 std: 1.21 runs: 3
      0.5k-0-limit/rework min: 286.44 [113.8%] max: 289.30 [113.7%] avg: 287.60 [113.8%] std: 1.23 runs: 3
      0.5k-0-limit/reworkoptim: min: 252.18 [100.2%] max: 253.16 [99.5%] avg: 252.62 [100.0%] std: 0.41 runs: 3
      Elapsed
      0.5k-0-limit/base: min: 347.83 max: 353.06 avg: 350.04 std: 2.21 runs: 3
      0.5k-0-limit/rework min: 290.19 [83.4%] max: 295.62 [83.7%] avg: 293.12 [83.7%] std: 2.24 runs: 3
      0.5k-0-limit/reworkoptim: min: 293.91 [84.5%] max: 294.87 [83.5%] avg: 294.29 [84.1%] std: 0.42 runs: 3
      System
      2k-0-limit/base: min: 263.05 max: 271.52 avg: 267.94 std: 3.58 runs: 3
      2k-0-limit/rework min: 458.99 [174.5%] max: 468.31 [172.5%] avg: 464.45 [173.3%] std: 3.97 runs: 3
      2k-0-limit/reworkoptim: min: 267.10 [101.5%] max: 279.38 [102.9%] avg: 272.78 [101.8%] std: 5.05 runs: 3
      Elapsed
      2k-0-limit/base: min: 372.33 max: 379.32 avg: 375.47 std: 2.90 runs: 3
      2k-0-limit/rework min: 334.40 [89.8%] max: 339.52 [89.5%] avg: 337.44 [89.9%] std: 2.20 runs: 3
      2k-0-limit/reworkoptim: min: 301.47 [81.0%] max: 319.19 [84.1%] avg: 307.90 [82.0%] std: 8.01 runs: 3
      System
      8k-0-limit/base: min: 320.50 max: 332.10 avg: 325.46 std: 4.88 runs: 3
      8k-0-limit/rework min: 1115.76 [348.1%] max: 1165.66 [351.0%] avg: 1132.65 [348.0%] std: 23.34 runs: 3
      8k-0-limit/reworkoptim: min: 403.75 [126.0%] max: 409.22 [123.2%] avg: 406.16 [124.8%] std: 2.28 runs: 3
      Elapsed
      8k-0-limit/base: min: 475.48 max: 585.19 avg: 525.54 std: 45.30 runs: 3
      8k-0-limit/rework min: 616.25 [129.6%] max: 625.90 [107.0%] avg: 620.68 [118.1%] std: 3.98 runs: 3
      8k-0-limit/reworkoptim: min: 420.18 [88.4%] max: 428.28 [73.2%] avg: 423.05 [80.5%] std: 3.71 runs: 3
      
      Apart from 8k the system time is comparable with the base kernel while
      Elapsed is up to 20% better with all configurations.
      
      * No soft limit set
      System
      0-no-limit/base: min: 234.76 max: 237.42 avg: 236.25 std: 1.11 runs: 3
      0-no-limit/rework min: 233.09 [99.3%] max: 238.65 [100.5%] avg: 236.09 [99.9%] std: 2.29 runs: 3
      0-no-limit/reworkoptim: min: 236.12 [100.6%] max: 240.53 [101.3%] avg: 237.94 [100.7%] std: 1.88 runs: 3
      Elapsed
      0-no-limit/base: min: 288.52 max: 295.42 avg: 291.29 std: 2.98 runs: 3
      0-no-limit/rework min: 283.17 [98.1%] max: 284.33 [96.2%] avg: 283.78 [97.4%] std: 0.48 runs: 3
      0-no-limit/reworkoptim: min: 288.50 [100.0%] max: 290.79 [98.4%] avg: 289.78 [99.5%] std: 0.95 runs: 3
      System
      0.5k-no-limit/base: min: 286.51 max: 293.23 avg: 290.21 std: 2.78 runs: 3
      0.5k-no-limit/rework min: 291.69 [101.8%] max: 294.38 [100.4%] avg: 292.97 [101.0%] std: 1.10 runs: 3
      0.5k-no-limit/reworkoptim: min: 277.05 [96.7%] max: 288.76 [98.5%] avg: 284.17 [97.9%] std: 5.11 runs: 3
      Elapsed
      0.5k-no-limit/base: min: 294.94 max: 298.92 avg: 296.47 std: 1.75 runs: 3
      0.5k-no-limit/rework min: 292.55 [99.2%] max: 294.21 [98.4%] avg: 293.55 [99.0%] std: 0.72 runs: 3
      0.5k-no-limit/reworkoptim: min: 294.41 [99.8%] max: 301.67 [100.9%] avg: 297.78 [100.4%] std: 2.99 runs: 3
      System
      2k-no-limit/base: min: 443.41 max: 466.66 avg: 457.66 std: 10.19 runs: 3
      2k-no-limit/rework min: 490.11 [110.5%] max: 516.02 [110.6%] avg: 501.42 [109.6%] std: 10.83 runs: 3
      2k-no-limit/reworkoptim: min: 435.25 [98.2%] max: 458.11 [98.2%] avg: 446.73 [97.6%] std: 9.33 runs: 3
      Elapsed
      2k-no-limit/base: min: 330.85 max: 333.75 avg: 332.52 std: 1.23 runs: 3
      2k-no-limit/rework min: 343.06 [103.7%] max: 349.59 [104.7%] avg: 345.95 [104.0%] std: 2.72 runs: 3
      2k-no-limit/reworkoptim: min: 330.01 [99.7%] max: 333.92 [100.1%] avg: 332.22 [99.9%] std: 1.64 runs: 3
      System
      8k-no-limit/base: min: 1175.64 max: 1259.38 avg: 1222.39 std: 34.88 runs: 3
      8k-no-limit/rework min: 1226.31 [104.3%] max: 1241.60 [98.6%] avg: 1233.74 [100.9%] std: 6.25 runs: 3
      8k-no-limit/reworkoptim: min: 1023.45 [87.1%] max: 1056.74 [83.9%] avg: 1038.92 [85.0%] std: 13.69 runs: 3
      Elapsed
      8k-no-limit/base: min: 613.36 max: 619.60 avg: 616.47 std: 2.55 runs: 3
      8k-no-limit/rework min: 627.56 [102.3%] max: 642.33 [103.7%] avg: 633.44 [102.8%] std: 6.39 runs: 3
      8k-no-limit/reworkoptim: min: 545.89 [89.0%] max: 555.36 [89.6%] avg: 552.06 [89.6%] std: 4.37 runs: 3
      
      and these numbers look good as well.  System time is around 100%
      (suprisingly better for the 8k case) and Elapsed is copies that trend.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d910c05
    • M
      memcg: enhance memcg iterator to support predicates · de57780d
      Michal Hocko 提交于
      The caller of the iterator might know that some nodes or even subtrees
      should be skipped but there is no way to tell iterators about that so the
      only choice left is to let iterators to visit each node and do the
      selection outside of the iterating code.  This, however, doesn't scale
      well with hierarchies with many groups where only few groups are
      interesting.
      
      This patch adds mem_cgroup_iter_cond variant of the iterator with a
      callback which gets called for every visited node.  There are three
      possible ways how the callback can influence the walk.  Either the node is
      visited, it is skipped but the tree walk continues down the tree or the
      whole subtree of the current group is skipped.
      
      [hughd@google.com: fix memcg-less page reclaim]
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de57780d
    • M
      vmscan, memcg: do softlimit reclaim also for targeted reclaim · a5b7c87f
      Michal Hocko 提交于
      Soft reclaim has been done only for the global reclaim (both background
      and direct).  Since "memcg: integrate soft reclaim tighter with zone
      shrinking code" there is no reason for this limitation anymore as the soft
      limit reclaim doesn't use any special code paths and it is a part of the
      zone shrinking code which is used by both global and targeted reclaims.
      
      From the semantic point of view it is natural to consider soft limit
      before touching all groups in the hierarchy tree which is touching the
      hard limit because soft limit tells us where to push back when there is a
      memory pressure.  It is not important whether the pressure comes from the
      limit or imbalanced zones.
      
      This patch simply enables soft reclaim unconditionally in
      mem_cgroup_should_soft_reclaim so it is enabled for both global and
      targeted reclaim paths.  mem_cgroup_soft_reclaim_eligible needs to learn
      about the root of the reclaim to know where to stop checking soft limit
      state of parents up the hierarchy.  Say we have
      
      A (over soft limit)
       \
        B (below s.l., hit the hard limit)
       / \
      C   D (below s.l.)
      
      B is the source of the outside memory pressure now for D but we shouldn't
      soft reclaim it because it is behaving well under B subtree and we can
      still reclaim from C (pressumably it is over the limit).
      mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the
      hierarchy at B (root of the memory pressure).
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NGlauber Costa <glommer@openvz.org>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5b7c87f