1. 07 8月, 2014 1 次提交
  2. 31 1月, 2014 1 次提交
    • D
      mm, oom: base root bonus on current usage · 778c14af
      David Rientjes 提交于
      A 3% of system memory bonus is sometimes too excessive in comparison to
      other processes.
      
      With commit a63d83f4 ("oom: badness heuristic rewrite"), the OOM
      killer tries to avoid killing privileged tasks by subtracting 3% of
      overall memory (system or cgroup) from their per-task consumption.  But
      as a result, all root tasks that consume less than 3% of overall memory
      are considered equal, and so it only takes 33+ privileged tasks pushing
      the system out of memory for the OOM killer to do something stupid and
      kill dhclient or other root-owned processes.  For example, on a 32G
      machine it can't tell the difference between the 1M agetty and the 10G
      fork bomb member.
      
      The changelog describes this 3% boost as the equivalent to the global
      overcommit limit being 3% higher for privileged tasks, but this is not
      the same as discounting 3% of overall memory from _every privileged task
      individually_ during OOM selection.
      
      Replace the 3% of system memory bonus with a 3% of current memory usage
      bonus.
      
      By giving root tasks a bonus that is proportional to their actual size,
      they remain comparable even when relatively small.  In the example
      above, the OOM killer will discount the 1M agetty's 256 badness points
      down to 179, and the 10G fork bomb's 262144 points down to 183500 points
      and make the right choice, instead of discounting both to 0 and killing
      agetty because it's first in the task list.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      778c14af
  3. 24 1月, 2014 1 次提交
  4. 22 1月, 2014 3 次提交
  5. 15 11月, 2013 1 次提交
    • K
      mm: convert mm->nr_ptes to atomic_long_t · e1f56c89
      Kirill A. Shutemov 提交于
      With split page table lock for PMD level we can't hold mm->page_table_lock
      while updating nr_ptes.
      
      Let's convert it to atomic_long_t to avoid races.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NAlex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1f56c89
  6. 17 10月, 2013 1 次提交
    • J
      mm: memcg: handle non-error OOM situations more gracefully · 49426420
      Johannes Weiner 提交于
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") assumed that only a few places that can trigger a
      memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
      readahead.  But there are many more and it's impractical to annotate
      them all.
      
      First of all, we don't want to invoke the OOM killer when the failed
      allocation is gracefully handled, so defer the actual kill to the end of
      the fault handling as well.  This simplifies the code quite a bit for
      added bonus.
      
      Second, since a failed allocation might not be the abrupt end of the
      fault, the memcg OOM handler needs to be re-entrant until the fault
      finishes for subsequent allocation attempts.  If an allocation is
      attempted after the task already OOMed, allow it to bypass the limit so
      that it can quickly finish the fault and invoke the OOM killer.
      Reported-by: NazurIt <azurit@pobox.sk>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49426420
  7. 13 9月, 2013 1 次提交
    • J
      mm: memcg: do not trap chargers with full callstack on OOM · 3812c8c8
      Johannes Weiner 提交于
      The memcg OOM handling is incredibly fragile and can deadlock.  When a
      task fails to charge memory, it invokes the OOM killer and loops right
      there in the charge code until it succeeds.  Comparably, any other task
      that enters the charge path at this point will go to a waitqueue right
      then and there and sleep until the OOM situation is resolved.  The problem
      is that these tasks may hold filesystem locks and the mmap_sem; locks that
      the selected OOM victim may need to exit.
      
      For example, in one reported case, the task invoking the OOM killer was
      about to charge a page cache page during a write(), which holds the
      i_mutex.  The OOM killer selected a task that was just entering truncate()
      and trying to acquire the i_mutex:
      
      OOM invoking task:
        mem_cgroup_handle_oom+0x241/0x3b0
        mem_cgroup_cache_charge+0xbe/0xe0
        add_to_page_cache_locked+0x4c/0x140
        add_to_page_cache_lru+0x22/0x50
        grab_cache_page_write_begin+0x8b/0xe0
        ext3_write_begin+0x88/0x270
        generic_file_buffered_write+0x116/0x290
        __generic_file_aio_write+0x27c/0x480
        generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
        do_sync_write+0xea/0x130
        vfs_write+0xf3/0x1f0
        sys_write+0x51/0x90
        system_call_fastpath+0x18/0x1d
      
      OOM kill victim:
        do_truncate+0x58/0xa0              # takes i_mutex
        do_last+0x250/0xa30
        path_openat+0xd7/0x440
        do_filp_open+0x49/0xa0
        do_sys_open+0x106/0x240
        sys_open+0x20/0x30
        system_call_fastpath+0x18/0x1d
      
      The OOM handling task will retry the charge indefinitely while the OOM
      killed task is not releasing any resources.
      
      A similar scenario can happen when the kernel OOM killer for a memcg is
      disabled and a userspace task is in charge of resolving OOM situations.
      In this case, ALL tasks that enter the OOM path will be made to sleep on
      the OOM waitqueue and wait for userspace to free resources or increase
      the group's limit.  But a userspace OOM handler is prone to deadlock
      itself on the locks held by the waiting tasks.  For example one of the
      sleeping tasks may be stuck in a brk() call with the mmap_sem held for
      writing but the userspace handler, in order to pick an optimal victim,
      may need to read files from /proc/<pid>, which tries to acquire the same
      mmap_sem for reading and deadlocks.
      
      This patch changes the way tasks behave after detecting a memcg OOM and
      makes sure nobody loops or sleeps with locks held:
      
      1. When OOMing in a user fault, invoke the OOM killer and restart the
         fault instead of looping on the charge attempt.  This way, the OOM
         victim can not get stuck on locks the looping task may hold.
      
      2. When OOMing in a user fault but somebody else is handling it
         (either the kernel OOM killer or a userspace handler), don't go to
         sleep in the charge context.  Instead, remember the OOMing memcg in
         the task struct and then fully unwind the page fault stack with
         -ENOMEM.  pagefault_out_of_memory() will then call back into the
         memcg code to check if the -ENOMEM came from the memcg, and then
         either put the task to sleep on the memcg's OOM waitqueue or just
         restart the fault.  The OOM victim can no longer get stuck on any
         lock a sleeping task may hold.
      
      Debugged by Michal Hocko.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NazurIt <azurit@pobox.sk>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3812c8c8
  8. 15 7月, 2013 1 次提交
  9. 24 2月, 2013 1 次提交
    • S
      memcg, oom: provide more precise dump info while memcg oom happening · 58cf188e
      Sha Zhengju 提交于
      Currently when a memcg oom is happening the oom dump messages is still
      global state and provides few useful info for users.  This patch prints
      more pointed memcg page statistics for memcg-oom and take hierarchy into
      consideration:
      
      Based on Michal's advice, we take hierarchy into consideration: supppose
      we trigger an OOM on A's limit
      
              root_memcg
                  |
                  A (use_hierachy=1)
                 / \
                B   C
                |
                D
      then the printed info will be:
      
        Memory cgroup stats for /A:...
        Memory cgroup stats for /A/B:...
        Memory cgroup stats for /A/C:...
        Memory cgroup stats for /A/B/D:...
      
      Following are samples of oom output:
      
      (1) Before change:
      
          mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0
          mal-80 cpuset=/ mems_allowed=0
          Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10
          Call Trace:
           [<ffffffff8167fbfb>] dump_header+0x83/0x1ca
           ..... (call trace)
           [<ffffffff8168a818>] page_fault+0x28/0x30
                                   <<<<<<<<<<<<<<<<<<<<< memcg specific information
          Task in /A/B/D killed as a result of limit of /A
          memory: usage 101376kB, limit 101376kB, failcnt 57
          memory+swap: usage 101376kB, limit 101376kB, failcnt 0
          kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
                                   <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat
          Mem-Info:
          Node 0 DMA per-cpu:
          CPU    0: hi:    0, btch:   1 usd:   0
          ......
          CPU    3: hi:    0, btch:   1 usd:   0
          Node 0 DMA32 per-cpu:
          CPU    0: hi:  186, btch:  31 usd: 173
          ......
          CPU    3: hi:  186, btch:  31 usd: 130
                                   <<<<<<<<<<<<<<<<<<<<< print global page state
          active_anon:92963 inactive_anon:40777 isolated_anon:0
           active_file:33027 inactive_file:51718 isolated_file:0
           unevictable:0 dirty:3 writeback:0 unstable:0
           free:729995 slab_reclaimable:6897 slab_unreclaimable:6263
           mapped:20278 shmem:35971 pagetables:5885 bounce:0
           free_cma:0
                                   <<<<<<<<<<<<<<<<<<<<< print per zone page state
          Node 0 DMA free:15836kB ... all_unreclaimable? no
          lowmem_reserve[]: 0 3175 3899 3899
          Node 0 DMA32 free:2888564kB ... all_unrelaimable? no
          lowmem_reserve[]: 0 0 724 724
          lowmem_reserve[]: 0 0 0 0
          Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB
          Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB
          120710 total pagecache pages
          0 pages in swap cache
                                   <<<<<<<<<<<<<<<<<<<<< print global swap cache stat
          Swap cache stats: add 0, delete 0, find 0/0
          Free swap  = 499708kB
          Total swap = 499708kB
          1040368 pages RAM
          58678 pages reserved
          169065 pages shared
          173632 pages non-shared
          [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
          [ 2693]     0  2693     6005     1324      17        0             0 god
          [ 2754]     0  2754     6003     1320      16        0             0 god
          [ 2811]     0  2811     5992     1304      18        0             0 god
          [ 2874]     0  2874     6005     1323      18        0             0 god
          [ 2935]     0  2935     8720     7742      21        0             0 mal-30
          [ 2976]     0  2976    21520    17577      42        0             0 mal-80
          Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child
          Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB
      
      We can see that messages dumped by show_free_areas() are longsome and can
      provide so limited info for memcg that just happen oom.
      
      (2) After change
          mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
          mal-80 cpuset=/ mems_allowed=0
          Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10
          Call Trace:
           [<ffffffff8167fd0b>] dump_header+0x83/0x1d1
           .......(call trace)
           [<ffffffff8168a918>] page_fault+0x28/0x30
          Task in /A/B/D killed as a result of limit of /A
                                   <<<<<<<<<<<<<<<<<<<<< memcg specific information
          memory: usage 102400kB, limit 102400kB, failcnt 140
          memory+swap: usage 102400kB, limit 102400kB, failcnt 0
          kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
          Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB
          [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
          [ 2260]     0  2260     6006     1325      18        0             0 god
          [ 2383]     0  2383     6003     1319      17        0             0 god
          [ 2503]     0  2503     6004     1321      18        0             0 god
          [ 2622]     0  2622     6004     1321      16        0             0 god
          [ 2695]     0  2695     8720     7741      22        0             0 mal-30
          [ 2704]     0  2704    21520    17839      43        0             0 mal-80
          Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child
          Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB
      
      This version provides more pointed info for memcg in "Memory cgroup stats
      for XXX" section.
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58cf188e
  10. 13 12月, 2012 3 次提交
  11. 12 12月, 2012 3 次提交
    • D
      mm, oom: fix race when specifying a thread as the oom origin · e1e12d2f
      David Rientjes 提交于
      test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to
      specify that current should be killed first if an oom condition occurs in
      between the two calls.
      
      The usage is
      
      	short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
      	...
      	compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);
      
      to store the thread's oom_score_adj, temporarily change it to the maximum
      score possible, and then restore the old value if it is still the same.
      
      This happens to still be racy, however, if the user writes
      OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls.
      The compare_swap_oom_score_adj() will then incorrectly reset the old value
      prior to the write of OOM_SCORE_ADJ_MAX.
      
      To fix this, introduce a new oom_flags_t member in struct signal_struct
      that will be used for per-thread oom killer flags.  KSM and swapoff can
      now use a bit in this member to specify that threads should be killed
      first in oom conditions without playing around with oom_score_adj.
      
      This also allows the correct oom_score_adj to always be shown when reading
      /proc/pid/oom_score.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1e12d2f
    • D
      mm, oom: change type of oom_score_adj to short · a9c58b90
      David Rientjes 提交于
      The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
      so this range can be represented by the signed short type with no
      functional change.  The extra space this frees up in struct signal_struct
      will be used for per-thread oom kill flags in the next patch.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9c58b90
    • D
      mm, oom: allow exiting threads to have access to memory reserves · 9ff4868e
      David Rientjes 提交于
      Exiting threads, those with PF_EXITING set, can pagefault and require
      memory before they can make forward progress.  This happens, for instance,
      when a process must fault task->robust_list, a userspace structure, before
      detaching its memory.
      
      These threads also aren't guaranteed to get access to memory reserves
      unless oom killed or killed from userspace.  The oom killer won't grant
      memory reserves if other threads are also exiting other than current and
      stalling at the same point.  This prevents needlessly killing processes
      when others are already exiting.
      
      Instead of special casing all the possible situations between PF_EXITING
      getting set and a thread detaching its mm where it may allocate memory,
      which probably wouldn't get updated when a change is made to the exit
      path, the solution is to give all exiting threads access to memory
      reserves if they call the oom killer.  This allows them to quickly
      allocate, detach its mm, and free the memory it represents.
      
      Summary of Luigi's bug report:
      
      : He had an oom condition where threads were faulting on task->robust_list
      : and repeatedly called the oom killer but it would defer killing a thread
      : because it saw other PF_EXITING threads.  This can happen anytime we need
      : to allocate memory after setting PF_EXITING and before detaching our mm;
      : if there are other threads in the same state then the oom killer won't do
      : anything unless one of them happens to be killed from userspace.
      :
      : So instead of only deferring for PF_EXITING and !task->robust_list, it's
      : better to just give them access to memory reserves to prevent a potential
      : livelock so that any other faults that may be introduced in the future in
      : the exit path don't cause the same problem (and hopefully we don't allow
      : too many of those!).
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Tested-by: NLuigi Semenzato <semenzato@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ff4868e
  12. 09 10月, 2012 1 次提交
  13. 01 8月, 2012 8 次提交
  14. 21 6月, 2012 2 次提交
  15. 09 6月, 2012 1 次提交
  16. 30 5月, 2012 1 次提交
  17. 03 5月, 2012 1 次提交
  18. 24 3月, 2012 1 次提交
  19. 22 3月, 2012 6 次提交
  20. 13 1月, 2012 2 次提交