1. 03 1月, 2014 1 次提交
  2. 13 12月, 2013 3 次提交
    • J
      mm: memcg: do not allow task about to OOM kill to bypass the limit · 1f14c1ac
      Johannes Weiner 提交于
      Commit 49426420 ("mm: memcg: handle non-error OOM situations more
      gracefully") allowed tasks that already entered a memcg OOM condition to
      bypass the memcg limit on subsequent allocation attempts hoping this
      would expedite finishing the page fault and executing the kill.
      
      David Rientjes is worried that this breaks memcg isolation guarantees
      and since there is no evidence that the bypass actually speeds up fault
      processing just change it so that these subsequent charge attempts fail
      outright.  The notable exception being __GFP_NOFAIL charges which are
      required to bypass the limit regardless.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-bt: David Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f14c1ac
    • J
      mm: memcg: fix race condition between memcg teardown and swapin · 96f1c58d
      Johannes Weiner 提交于
      There is a race condition between a memcg being torn down and a swapin
      triggered from a different memcg of a page that was recorded to belong
      to the exiting memcg on swapout (with CONFIG_MEMCG_SWAP extension).  The
      result is unreclaimable pages pointing to dead memcgs, which can lead to
      anything from endless loops in later memcg teardown (the page is charged
      to all hierarchical parents but is not on any LRU list) or crashes from
      following the dangling memcg pointer.
      
      Memcgs with tasks in them can not be torn down and usually charges don't
      show up in memcgs without tasks.  Swapin with the CONFIG_MEMCG_SWAP
      extension is the notable exception because it charges the cgroup that
      was recorded as owner during swapout, which may be empty and in the
      process of being torn down when a task in another memcg triggers the
      swapin:
      
        teardown:                 swapin:
      
                                  lookup_swap_cgroup_id()
                                  rcu_read_lock()
                                  mem_cgroup_lookup()
                                  css_tryget()
                                  rcu_read_unlock()
        disable css_tryget()
        call_rcu()
          offline_css()
            reparent_charges()
                                  res_counter_charge() (hierarchical!)
                                  css_put()
                                    css_free()
                                  pc->mem_cgroup = dead memcg
                                  add page to dead lru
      
      Add a final reparenting step into css_free() to make sure any such raced
      charges are moved out of the memcg before it's finally freed.
      
      In the longer term it would be cleaner to have the css_tryget() and the
      res_counter charge under the same RCU lock section so that the charge
      reparenting is deferred until the last charge whose tryget succeeded is
      visible.  But this will require more invasive changes that will be
      harder to evaluate and backport into stable, so better defer them to a
      separate change set.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96f1c58d
    • J
      mm: memcg: do not declare OOM from __GFP_NOFAIL allocations · a0d8b00a
      Johannes Weiner 提交于
      Commit 84235de3 ("fs: buffer: move allocation failure loop into the
      allocator") started recognizing __GFP_NOFAIL in memory cgroups but
      forgot to disable the OOM killer.
      
      Any task that does not fail allocation will also not enter the OOM
      completion path.  So don't declare an OOM state in this case or it'll be
      leaked and the task be able to bypass the limit until the next
      userspace-triggered page fault cleans up the OOM state.
      Reported-by: NWilliam Dauchy <wdauchy@gmail.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[3.12.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0d8b00a
  3. 15 11月, 2013 1 次提交
    • K
      mm, thp: change pmd_trans_huge_lock() to return taken lock · bf929152
      Kirill A. Shutemov 提交于
      With split ptlock it's important to know which lock
      pmd_trans_huge_lock() took.  This patch adds one more parameter to the
      function to return the lock.
      
      In most places migration to new api is trivial.  Exception is
      move_huge_pmd(): we need to take two locks if pmd tables are different.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NAlex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf929152
  4. 13 11月, 2013 5 次提交
  5. 02 11月, 2013 1 次提交
    • G
      memcg: remove incorrect underflow check · 6920a1bd
      Greg Thelen 提交于
      When a memcg is deleted mem_cgroup_reparent_charges() moves charged
      memory to the parent memcg.  As of v3.11-9444-g3ea67d06 "memcg: add per
      cgroup writeback pages accounting" there's bad pointer read.  The goal
      was to check for counter underflow.  The counter is a per cpu counter
      and there are two problems with the code:
      
       (1) per cpu access function isn't used, instead a naked pointer is used
           which easily causes oops.
       (2) the check doesn't sum all cpus
      
      Test:
        $ cd /sys/fs/cgroup/memory
        $ mkdir x
        $ echo 3 > /proc/sys/vm/drop_caches
        $ (echo $BASHPID >> x/tasks && exec cat) &
        [1] 7154
        $ grep ^mapped x/memory.stat
        mapped_file 53248
        $ echo 7154 > tasks
        $ rmdir x
        <OOPS>
      
      The fix is to remove the check.  It's currently dangerous and isn't
      worth fixing it to use something expensive, such as
      percpu_counter_sum(), for each reparented page.  __this_cpu_read() isn't
      enough to fix this because there's no guarantees of the current cpus
      count.  The only guarantees is that the sum of all per-cpu counter is >=
      nr_pages.
      
      Fixes: 3ea67d06 ("memcg: add per cgroup writeback pages accounting")
      Reported-and-tested-by: NFlavio Leitner <fbl@redhat.com>
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Reviewed-by: NSha Zhengju <handai.szj@taobao.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6920a1bd
  6. 01 11月, 2013 3 次提交
  7. 31 10月, 2013 1 次提交
    • G
      memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting · 5e8cfc3c
      Greg Thelen 提交于
      As of commit 3ea67d06 ("memcg: add per cgroup writeback pages
      accounting") memcg counter errors are possible when moving charged
      memory to a different memcg.  Charge movement occurs when processing
      writes to memory.force_empty, moving tasks to a memcg with
      memcg.move_charge_at_immigrate=1, or memcg deletion.
      
      An example showing error after memory.force_empty:
      
        $ cd /sys/fs/cgroup/memory
        $ mkdir x
        $ rm /data/tmp/file
        $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) &
        [1] 13600
        $ grep ^mapped x/memory.stat
        mapped_file 1048576
        $ echo 13600 > tasks
        $ echo 1 > x/memory.force_empty
        $ grep ^mapped x/memory.stat
        mapped_file 4503599627370496
      
      mapped_file should end with 0.
        4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages
        1048576          == 0x10,0000           == 0x100 pages
      
      This issue only affects the source memcg on 64 bit machines; the
      destination memcg counters are correct.  So the rmdir case is not too
      important because such counters are soon disappearing with the entire
      memcg.  But the memcg.force_empty and memory.move_charge_at_immigrate=1
      cases are larger problems as the bogus counters are visible for the
      (possibly long) remaining life of the source memcg.
      
      The problem is due to memcg use of __this_cpu_from(.., -nr_pages), which
      is subtly wrong because it subtracts the unsigned int nr_pages (either
      -1 or -512 for THP) from a signed long percpu counter.  When
      nr_pages=-1, -nr_pages=0xffffffff.  On 64 bit machines stat->count[idx]
      is signed 64 bit.  So memcg's attempt to simply decrement a count (e.g.
      from 1 to 0) boils down to:
      
        long count = 1
        unsigned int nr_pages = 1
        count += -nr_pages  /* -nr_pages == 0xffff,ffff */
        count is now 0x1,0000,0000 instead of 0
      
      The fix is to subtract the unsigned page count rather than adding its
      negation.  This only works once "percpu: fix this_cpu_sub() subtrahend
      casting for unsigneds" is applied to fix this_cpu_sub().
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5e8cfc3c
  8. 22 10月, 2013 1 次提交
  9. 17 10月, 2013 3 次提交
    • J
      fs: buffer: move allocation failure loop into the allocator · 84235de3
      Johannes Weiner 提交于
      Buffer allocation has a very crude indefinite loop around waking the
      flusher threads and performing global NOFS direct reclaim because it can
      not handle allocation failures.
      
      The most immediate problem with this is that the allocation may fail due
      to a memory cgroup limit, where flushers + direct reclaim might not make
      any progress towards resolving the situation at all.  Because unlike the
      global case, a memory cgroup may not have any cache at all, only
      anonymous pages but no swap.  This situation will lead to a reclaim
      livelock with insane IO from waking the flushers and thrashing unrelated
      filesystem cache in a tight loop.
      
      Use __GFP_NOFAIL allocations for buffers for now.  This makes sure that
      any looping happens in the page allocator, which knows how to
      orchestrate kswapd, direct reclaim, and the flushers sensibly.  It also
      allows memory cgroups to detect allocations that can't handle failure
      and will allow them to ultimately bypass the limit if reclaim can not
      make progress.
      Reported-by: NazurIt <azurit@pobox.sk>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84235de3
    • J
      mm: memcg: handle non-error OOM situations more gracefully · 49426420
      Johannes Weiner 提交于
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") assumed that only a few places that can trigger a
      memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
      readahead.  But there are many more and it's impractical to annotate
      them all.
      
      First of all, we don't want to invoke the OOM killer when the failed
      allocation is gracefully handled, so defer the actual kill to the end of
      the fault handling as well.  This simplifies the code quite a bit for
      added bonus.
      
      Second, since a failed allocation might not be the abrupt end of the
      fault, the memcg OOM handler needs to be re-entrant until the fault
      finishes for subsequent allocation attempts.  If an allocation is
      attempted after the task already OOMed, allow it to bypass the limit so
      that it can quickly finish the fault and invoke the OOM killer.
      Reported-by: NazurIt <azurit@pobox.sk>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49426420
    • D
      mm, memcg: protect mem_cgroup_read_events for cpu hotplug · 9c567512
      David Rientjes 提交于
      for_each_online_cpu() needs the protection of {get,put}_online_cpus() so
      cpu_online_mask doesn't change during the iteration.
      
      cpu_hotplug.lock is held while a cpu is going down, it's a coarse lock
      that is used kernel-wide to synchronize cpu hotplug activity.  Memcg has
      a cpu hotplug notifier, called while there may not be any cpu hotplug
      refcounts, which drains per-cpu event counts to memcg->nocpu_base.events
      to maintain a cumulative event count as cpus disappear.  Without
      get_online_cpus() in mem_cgroup_read_events(), it's possible to account
      for the event count on a dying cpu twice, and this value may be
      significantly large.
      
      In fact, all memcg->pcp_counter_lock use should be nested by
      {get,put}_online_cpus().
      
      This fixes that issue and ensures the reported statistics are not vastly
      over-reported during cpu hotplug.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c567512
  10. 25 9月, 2013 7 次提交
  11. 24 9月, 2013 4 次提交
  12. 13 9月, 2013 10 次提交
    • S
      memcg: add per cgroup writeback pages accounting · 3ea67d06
      Sha Zhengju 提交于
      Add memcg routines to count writeback pages, later dirty pages will also
      be accounted.
      
      After Kame's commit 89c06bd5 ("memcg: use new logic for page stat
      accounting"), we can use 'struct page' flag to test page state instead
      of per page_cgroup flag.  But memcg has a feature to move a page from a
      cgroup to another one and may have race between "move" and "page stat
      accounting".  So in order to avoid the race we have designed a new lock:
      
               mem_cgroup_begin_update_page_stat()
               modify page information        -->(a)
               mem_cgroup_update_page_stat()  -->(b)
               mem_cgroup_end_update_page_stat()
      
      It requires both (a) and (b)(writeback pages accounting) to be pretected
      in mem_cgroup_{begin/end}_update_page_stat().  It's full no-op for
      !CONFIG_MEMCG, almost no-op if memcg is disabled (but compiled in), rcu
      read lock in the most cases (no task is moving), and spin_lock_irqsave
      on top in the slow path.
      
      There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
      And the lock order is:
      	--> memcg->move_lock
      	  --> mapping->tree_lock
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NGreg Thelen <gthelen@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ea67d06
    • S
      memcg: check for proper lock held in mem_cgroup_update_page_stat · 658b72c5
      Sha Zhengju 提交于
      We should call mem_cgroup_begin_update_page_stat() before
      mem_cgroup_update_page_stat() to get proper locks, however the latter
      doesn't do any checking that we use proper locking, which would be hard.
      Suggested by Michal Hock we could at least test for rcu_read_lock_held()
      because RCU is held if !mem_cgroup_disabled().
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NGreg Thelen <gthelen@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      658b72c5
    • S
      memcg: remove MEMCG_NR_FILE_MAPPED · 68b4876d
      Sha Zhengju 提交于
      While accounting memcg page stat, it's not worth to use
      MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the
      complexity and presumed performance overhead.  We can use
      MEM_CGROUP_STAT_FILE_MAPPED directly.
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NFengguang Wu <fengguang.wu@intel.com>
      Reviewed-by: NGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68b4876d
    • S
      memcg: rename RESOURCE_MAX to RES_COUNTER_MAX · 6de5a8bf
      Sha Zhengju 提交于
      RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX.
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Signed-off-by: NQiang Huang <h.huangqiang@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Jeff Liu <jeff.liu@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6de5a8bf
    • J
      mm: memcg: do not trap chargers with full callstack on OOM · 3812c8c8
      Johannes Weiner 提交于
      The memcg OOM handling is incredibly fragile and can deadlock.  When a
      task fails to charge memory, it invokes the OOM killer and loops right
      there in the charge code until it succeeds.  Comparably, any other task
      that enters the charge path at this point will go to a waitqueue right
      then and there and sleep until the OOM situation is resolved.  The problem
      is that these tasks may hold filesystem locks and the mmap_sem; locks that
      the selected OOM victim may need to exit.
      
      For example, in one reported case, the task invoking the OOM killer was
      about to charge a page cache page during a write(), which holds the
      i_mutex.  The OOM killer selected a task that was just entering truncate()
      and trying to acquire the i_mutex:
      
      OOM invoking task:
        mem_cgroup_handle_oom+0x241/0x3b0
        mem_cgroup_cache_charge+0xbe/0xe0
        add_to_page_cache_locked+0x4c/0x140
        add_to_page_cache_lru+0x22/0x50
        grab_cache_page_write_begin+0x8b/0xe0
        ext3_write_begin+0x88/0x270
        generic_file_buffered_write+0x116/0x290
        __generic_file_aio_write+0x27c/0x480
        generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
        do_sync_write+0xea/0x130
        vfs_write+0xf3/0x1f0
        sys_write+0x51/0x90
        system_call_fastpath+0x18/0x1d
      
      OOM kill victim:
        do_truncate+0x58/0xa0              # takes i_mutex
        do_last+0x250/0xa30
        path_openat+0xd7/0x440
        do_filp_open+0x49/0xa0
        do_sys_open+0x106/0x240
        sys_open+0x20/0x30
        system_call_fastpath+0x18/0x1d
      
      The OOM handling task will retry the charge indefinitely while the OOM
      killed task is not releasing any resources.
      
      A similar scenario can happen when the kernel OOM killer for a memcg is
      disabled and a userspace task is in charge of resolving OOM situations.
      In this case, ALL tasks that enter the OOM path will be made to sleep on
      the OOM waitqueue and wait for userspace to free resources or increase
      the group's limit.  But a userspace OOM handler is prone to deadlock
      itself on the locks held by the waiting tasks.  For example one of the
      sleeping tasks may be stuck in a brk() call with the mmap_sem held for
      writing but the userspace handler, in order to pick an optimal victim,
      may need to read files from /proc/<pid>, which tries to acquire the same
      mmap_sem for reading and deadlocks.
      
      This patch changes the way tasks behave after detecting a memcg OOM and
      makes sure nobody loops or sleeps with locks held:
      
      1. When OOMing in a user fault, invoke the OOM killer and restart the
         fault instead of looping on the charge attempt.  This way, the OOM
         victim can not get stuck on locks the looping task may hold.
      
      2. When OOMing in a user fault but somebody else is handling it
         (either the kernel OOM killer or a userspace handler), don't go to
         sleep in the charge context.  Instead, remember the OOMing memcg in
         the task struct and then fully unwind the page fault stack with
         -ENOMEM.  pagefault_out_of_memory() will then call back into the
         memcg code to check if the -ENOMEM came from the memcg, and then
         either put the task to sleep on the memcg's OOM waitqueue or just
         restart the fault.  The OOM victim can no longer get stuck on any
         lock a sleeping task may hold.
      
      Debugged by Michal Hocko.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NazurIt <azurit@pobox.sk>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3812c8c8
    • J
      mm: memcg: rework and document OOM waiting and wakeup · fb2a6fc5
      Johannes Weiner 提交于
      The memcg OOM handler open-codes a sleeping lock for OOM serialization
      (trylock, wait, repeat) because the required locking is so specific to
      memcg hierarchies.  However, it would be nice if this construct would be
      clearly recognizable and not be as obfuscated as it is right now.  Clean
      up as follows:
      
      1. Remove the return value of mem_cgroup_oom_unlock()
      
      2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().
      
      3. Pull the prepare_to_wait() out of the memcg_oom_lock scope.  This
         makes it more obvious that the task has to be on the waitqueue
         before attempting to OOM-trylock the hierarchy, to not miss any
         wakeups before going to sleep.  It just didn't matter until now
         because it was all lumped together into the global memcg_oom_lock
         spinlock section.
      
      4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
         It is proctected by the hierarchical OOM-lock.
      
      5. The memcg_oom_lock spinlock is only required to propagate the OOM
         lock in any given hierarchy atomically.  Restrict its scope to
         mem_cgroup_oom_(trylock|unlock).
      
      6. Do not wake up the waitqueue unconditionally at the end of the
         function.  Only the lockholder has to wake up the next in line
         after releasing the lock.
      
         Note that the lockholder kicks off the OOM-killer, which in turn
         leads to wakeups from the uncharges of the exiting task.  But a
         contender is not guaranteed to see them if it enters the OOM path
         after the OOM kills but before the lockholder releases the lock.
         Thus there has to be an explicit wakeup after releasing the lock.
      
      7. Put the OOM task on the waitqueue before marking the hierarchy as
         under OOM as that is the point where we start to receive wakeups.
         No point in listening before being on the waitqueue.
      
      8. Likewise, unmark the hierarchy before finishing the sleep, for
         symmetry.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fb2a6fc5
    • J
      mm: memcg: enable memcg OOM killer only for user faults · 519e5247
      Johannes Weiner 提交于
      System calls and kernel faults (uaccess, gup) can handle an out of memory
      situation gracefully and just return -ENOMEM.
      
      Enable the memcg OOM killer only for user faults, where it's really the
      only option available.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      519e5247
    • A
      memcg: trivial cleanups · f894ffa8
      Andrew Morton 提交于
      Clean up some mess made by the "Soft limit rework" series, and a few other
      things.
      
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f894ffa8
    • M
      memcg: track all children over limit in the root · 1be171d6
      Michal Hocko 提交于
      Children in soft limit excess are currently tracked up the hierarchy in
      memcg->children_in_excess.  Nevertheless there still might exist tons of
      groups that are not in hierarchy relation to the root cgroup (e.g.  all
      first level groups if root_mem_cgroup->use_hierarchy == false).
      
      As the whole tree walk has to be done when the iteration starts at
      root_mem_cgroup the iterator should be able to skip the walk if there is
      no child above the limit without iterating them.  This can be done
      easily if the root tracks all children rather than only hierarchical
      children.  This is done by this patch which updates root_mem_cgroup
      children_in_excess if root_mem_cgroup->use_hierarchy == false so the
      root knows about all children in excess.
      
      Please note that this is not an issue for inner memcgs which have
      use_hierarchy == false because then only the single group is visited so
      no special optimization is necessary.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1be171d6
    • M
      memcg, vmscan: do not attempt soft limit reclaim if it would not scan anything · e839b6a1
      Michal Hocko 提交于
      mem_cgroup_should_soft_reclaim controls whether soft reclaim pass is
      done and it always says yes currently.  Memcg iterators are clever to
      skip nodes that are not soft reclaimable quite efficiently but
      mem_cgroup_should_soft_reclaim can be more clever and do not start the
      soft reclaim pass at all if it knows that nothing would be scanned
      anyway.
      
      In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for
      the target group of the reclaim and allow the pass only if the whole
      subtree wouldn't be skipped.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e839b6a1