1. 15 Nov, 2013 1 commit
    • mm: convert mm->nr_ptes to atomic_long_t · e1f56c89
      Kirill A. Shutemov committed
      With split page table lock for PMD level we can't hold mm->page_table_lock
      while updating nr_ptes.
      
      Let's convert it to atomic_long_t to avoid races.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
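The conversion described in this commit can be illustrated outside the kernel with C11 atomics. This is only a minimal userspace sketch: `struct mm_sketch` and the `mm_sketch_*` helpers are hypothetical names invented for illustration, and `atomic_long` from `<stdatomic.h>` stands in for the kernel's `atomic_long_t`. The point is that the counter can be updated without holding a shared lock.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct mm_struct; only the counter is modelled. */
struct mm_sketch {
	atomic_long nr_ptes;	/* plays the role of mm->nr_ptes */
};

static void mm_sketch_init(struct mm_sketch *mm)
{
	atomic_init(&mm->nr_ptes, 0);
}

/* Account one newly allocated page table; no lock is required. */
static void mm_sketch_inc_ptes(struct mm_sketch *mm)
{
	atomic_fetch_add(&mm->nr_ptes, 1);
}

/* Account one freed page table; the counter must not go negative. */
static void mm_sketch_dec_ptes(struct mm_sketch *mm)
{
	long prev = atomic_fetch_sub(&mm->nr_ptes, 1);

	assert(prev > 0);
}

static long mm_sketch_read_ptes(struct mm_sketch *mm)
{
	return atomic_load(&mm->nr_ptes);
}
```

Because each update is a single atomic read-modify-write, two threads holding different split PMD locks can adjust the counter concurrently without losing an increment.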
  2. 17 Oct, 2013 1 commit
    • mm: memcg: handle non-error OOM situations more gracefully · 49426420
      Johannes Weiner committed
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") assumed that only a few places that can trigger a
      memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
      readahead.  But there are many more and it's impractical to annotate
      them all.
      
      First of all, we don't want to invoke the OOM killer when the failed
      allocation is gracefully handled, so defer the actual kill to the end of
      the fault handling as well.  This simplifies the code quite a bit as an
      added bonus.
      
      Second, since a failed allocation might not be the abrupt end of the
      fault, the memcg OOM handler needs to be re-entrant until the fault
      finishes for subsequent allocation attempts.  If an allocation is
      attempted after the task already OOMed, allow it to bypass the limit so
      that it can quickly finish the fault and invoke the OOM killer.
      Reported-by: azurIt <azurit@pobox.sk>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 13 Sep, 2013 1 commit
    • mm: memcg: do not trap chargers with full callstack on OOM · 3812c8c8
      Johannes Weiner committed
      The memcg OOM handling is incredibly fragile and can deadlock.  When a
      task fails to charge memory, it invokes the OOM killer and loops right
      there in the charge code until it succeeds.  Comparably, any other task
      that enters the charge path at this point will go to a waitqueue right
      then and there and sleep until the OOM situation is resolved.  The problem
      is that these tasks may hold filesystem locks and the mmap_sem; locks that
      the selected OOM victim may need to exit.
      
      For example, in one reported case, the task invoking the OOM killer was
      about to charge a page cache page during a write(), which holds the
      i_mutex.  The OOM killer selected a task that was just entering truncate()
      and trying to acquire the i_mutex:
      
      OOM invoking task:
        mem_cgroup_handle_oom+0x241/0x3b0
        mem_cgroup_cache_charge+0xbe/0xe0
        add_to_page_cache_locked+0x4c/0x140
        add_to_page_cache_lru+0x22/0x50
        grab_cache_page_write_begin+0x8b/0xe0
        ext3_write_begin+0x88/0x270
        generic_file_buffered_write+0x116/0x290
        __generic_file_aio_write+0x27c/0x480
        generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
        do_sync_write+0xea/0x130
        vfs_write+0xf3/0x1f0
        sys_write+0x51/0x90
        system_call_fastpath+0x18/0x1d
      
      OOM kill victim:
        do_truncate+0x58/0xa0              # takes i_mutex
        do_last+0x250/0xa30
        path_openat+0xd7/0x440
        do_filp_open+0x49/0xa0
        do_sys_open+0x106/0x240
        sys_open+0x20/0x30
        system_call_fastpath+0x18/0x1d
      
      The OOM handling task will retry the charge indefinitely while the OOM
      killed task is not releasing any resources.
      
      A similar scenario can happen when the kernel OOM killer for a memcg is
      disabled and a userspace task is in charge of resolving OOM situations.
      In this case, ALL tasks that enter the OOM path will be made to sleep on
      the OOM waitqueue and wait for userspace to free resources or increase
      the group's limit.  But a userspace OOM handler is prone to deadlock
      itself on the locks held by the waiting tasks.  For example one of the
      sleeping tasks may be stuck in a brk() call with the mmap_sem held for
      writing but the userspace handler, in order to pick an optimal victim,
      may need to read files from /proc/<pid>, which tries to acquire the same
      mmap_sem for reading and deadlocks.
      
      This patch changes the way tasks behave after detecting a memcg OOM and
      makes sure nobody loops or sleeps with locks held:
      
      1. When OOMing in a user fault, invoke the OOM killer and restart the
         fault instead of looping on the charge attempt.  This way, the OOM
         victim can not get stuck on locks the looping task may hold.
      
      2. When OOMing in a user fault but somebody else is handling it
         (either the kernel OOM killer or a userspace handler), don't go to
         sleep in the charge context.  Instead, remember the OOMing memcg in
         the task struct and then fully unwind the page fault stack with
         -ENOMEM.  pagefault_out_of_memory() will then call back into the
         memcg code to check if the -ENOMEM came from the memcg, and then
         either put the task to sleep on the memcg's OOM waitqueue or just
         restart the fault.  The OOM victim can no longer get stuck on any
         lock a sleeping task may hold.
      
      Debugged by Michal Hocko.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: azurIt <azurit@pobox.sk>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
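The "record the OOMing memcg, unwind, then handle it at the end of the fault" idea from this commit can be modelled in a few lines of userspace C. This is a toy sketch under loud assumptions: every type and function here (`memcg_sketch`, `task_sketch`, `try_charge`, `handle_fault`, `oom_synchronize`) is hypothetical and only mimics the control flow; it is not the kernel's code, and `locks_held` is just a counter standing in for i_mutex or mmap_sem.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct memcg_sketch {
	long usage;
	long limit;
	bool under_oom;		/* set once the OOM has been handled */
};

struct task_sketch {
	struct memcg_sketch *oom_memcg;	/* memcg remembered on a failed charge */
	int locks_held;			/* stand-in for held kernel locks */
};

/* Charge attempt: never loops or sleeps; on failure it only records the memcg. */
static int try_charge(struct task_sketch *t, struct memcg_sketch *cg, long pages)
{
	if (cg->usage + pages <= cg->limit) {
		cg->usage += pages;
		return 0;
	}
	t->oom_memcg = cg;
	return -ENOMEM;
}

/* Fault path: takes a lock, tries to charge, and always unwinds the lock. */
static int handle_fault(struct task_sketch *t, struct memcg_sketch *cg)
{
	int ret;

	t->locks_held++;
	ret = try_charge(t, cg, 1);
	t->locks_held--;
	return ret;
}

/*
 * End-of-fault hook: returns true if the -ENOMEM came from a memcg.
 * It runs only after the fault stack is unwound, so no locks are held.
 */
static bool oom_synchronize(struct task_sketch *t)
{
	struct memcg_sketch *cg = t->oom_memcg;

	if (cg == NULL)
		return false;
	assert(t->locks_held == 0);
	cg->under_oom = true;	/* stand-in for killing a victim or waiting */
	t->oom_memcg = NULL;
	return true;
}
```

The design choice the sketch tries to show is that the decision to kill or sleep is moved out of the charge path, so whatever the task does next happens with `locks_held == 0`.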
  4. 15 Jul, 2013 1 commit
  5. 24 Feb, 2013 1 commit
    • memcg, oom: provide more precise dump info while memcg oom happening · 58cf188e
      Sha Zhengju committed
      Currently, when a memcg oom happens, the oom dump messages still describe
      global state and provide little useful information to users.  This patch
      prints more targeted memcg page statistics for a memcg oom.
      
      Based on Michal's advice, the hierarchy is taken into consideration.
      Suppose we trigger an OOM on A's limit:
      
              root_memcg
                  |
                  A (use_hierarchy=1)
                 / \
                B   C
                |
                D
      then the printed info will be:
      
        Memory cgroup stats for /A:...
        Memory cgroup stats for /A/B:...
        Memory cgroup stats for /A/C:...
        Memory cgroup stats for /A/B/D:...
      
      Following are samples of oom output:
      
      (1) Before change:
      
          mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0
          mal-80 cpuset=/ mems_allowed=0
          Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10
          Call Trace:
           [<ffffffff8167fbfb>] dump_header+0x83/0x1ca
           ..... (call trace)
           [<ffffffff8168a818>] page_fault+0x28/0x30
                                   <<<<<<<<<<<<<<<<<<<<< memcg specific information
          Task in /A/B/D killed as a result of limit of /A
          memory: usage 101376kB, limit 101376kB, failcnt 57
          memory+swap: usage 101376kB, limit 101376kB, failcnt 0
          kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
                                   <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat
          Mem-Info:
          Node 0 DMA per-cpu:
          CPU    0: hi:    0, btch:   1 usd:   0
          ......
          CPU    3: hi:    0, btch:   1 usd:   0
          Node 0 DMA32 per-cpu:
          CPU    0: hi:  186, btch:  31 usd: 173
          ......
          CPU    3: hi:  186, btch:  31 usd: 130
                                   <<<<<<<<<<<<<<<<<<<<< print global page state
          active_anon:92963 inactive_anon:40777 isolated_anon:0
           active_file:33027 inactive_file:51718 isolated_file:0
           unevictable:0 dirty:3 writeback:0 unstable:0
           free:729995 slab_reclaimable:6897 slab_unreclaimable:6263
           mapped:20278 shmem:35971 pagetables:5885 bounce:0
           free_cma:0
                                   <<<<<<<<<<<<<<<<<<<<< print per zone page state
          Node 0 DMA free:15836kB ... all_unreclaimable? no
          lowmem_reserve[]: 0 3175 3899 3899
          Node 0 DMA32 free:2888564kB ... all_unreclaimable? no
          lowmem_reserve[]: 0 0 724 724
          lowmem_reserve[]: 0 0 0 0
          Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB
          Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB
          120710 total pagecache pages
          0 pages in swap cache
                                   <<<<<<<<<<<<<<<<<<<<< print global swap cache stat
          Swap cache stats: add 0, delete 0, find 0/0
          Free swap  = 499708kB
          Total swap = 499708kB
          1040368 pages RAM
          58678 pages reserved
          169065 pages shared
          173632 pages non-shared
          [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
          [ 2693]     0  2693     6005     1324      17        0             0 god
          [ 2754]     0  2754     6003     1320      16        0             0 god
          [ 2811]     0  2811     5992     1304      18        0             0 god
          [ 2874]     0  2874     6005     1323      18        0             0 god
          [ 2935]     0  2935     8720     7742      21        0             0 mal-30
          [ 2976]     0  2976    21520    17577      42        0             0 mal-80
          Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child
          Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB
      
      We can see that the messages dumped by show_free_areas() are lengthy and
      provide very little information about the memcg that just hit oom.
      
      (2) After change
          mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
          mal-80 cpuset=/ mems_allowed=0
          Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10
          Call Trace:
           [<ffffffff8167fd0b>] dump_header+0x83/0x1d1
           .......(call trace)
           [<ffffffff8168a918>] page_fault+0x28/0x30
          Task in /A/B/D killed as a result of limit of /A
                                   <<<<<<<<<<<<<<<<<<<<< memcg specific information
          memory: usage 102400kB, limit 102400kB, failcnt 140
          memory+swap: usage 102400kB, limit 102400kB, failcnt 0
          kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
          Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB
          [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
          [ 2260]     0  2260     6006     1325      18        0             0 god
          [ 2383]     0  2383     6003     1319      17        0             0 god
          [ 2503]     0  2503     6004     1321      18        0             0 god
          [ 2622]     0  2622     6004     1321      16        0             0 god
          [ 2695]     0  2695     8720     7741      22        0             0 mal-30
          [ 2704]     0  2704    21520    17839      43        0             0 mal-80
          Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child
          Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB
      
      This version provides more targeted information for the memcg in the
      "Memory cgroup stats for XXX" section.
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 13 Dec, 2012 3 commits
  7. 12 Dec, 2012 3 commits
    • mm, oom: fix race when specifying a thread as the oom origin · e1e12d2f
      David Rientjes committed
      test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to
      specify that current should be killed first if an oom condition occurs in
      between the two calls.
      
      The usage is
      
      	short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
      	...
      	compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);
      
      to store the thread's oom_score_adj, temporarily change it to the maximum
      score possible, and then restore the old value if it is still the same.
      
      This happens to still be racy, however, if the user writes
      OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls.
      The compare_swap_oom_score_adj() will then incorrectly reset the old value
      prior to the write of OOM_SCORE_ADJ_MAX.
      
      To fix this, introduce a new oom_flags_t member in struct signal_struct
      that will be used for per-thread oom killer flags.  KSM and swapoff can
      now use a bit in this member to specify that threads should be killed
      first in oom conditions without playing around with oom_score_adj.
      
      This also allows the correct oom_score_adj to always be shown when reading
      /proc/pid/oom_score.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
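The flag-based approach from this commit can be sketched as follows. All names are hypothetical (`signal_sketch`, `OOM_FLAG_ORIGIN_SKETCH`, and the helpers), and the sketch is single-threaded and unsynchronized, so it only shows the data layout: the "kill me first" marker lives in its own flags member, which means a concurrent user write to oom_score_adj is never overwritten when the marker is cleared.

```c
#include <assert.h>
#include <stdbool.h>

#define OOM_SCORE_ADJ_MAX_SKETCH	1000	/* mirrors the documented maximum */
#define OOM_FLAG_ORIGIN_SKETCH		0x1u	/* hypothetical flag bit */

/* Hypothetical stand-in for the relevant part of struct signal_struct. */
struct signal_sketch {
	short oom_score_adj;
	unsigned int oom_flags;
};

/* Mark the task as the preferred oom victim without touching oom_score_adj. */
static void set_oom_origin(struct signal_sketch *sig)
{
	sig->oom_flags |= OOM_FLAG_ORIGIN_SKETCH;
}

static void clear_oom_origin(struct signal_sketch *sig)
{
	sig->oom_flags &= ~OOM_FLAG_ORIGIN_SKETCH;
}

static bool is_oom_origin(const struct signal_sketch *sig)
{
	return (sig->oom_flags & OOM_FLAG_ORIGIN_SKETCH) != 0;
}

/* Models a write to /proc/pid/oom_score_adj from userspace. */
static void user_write_adj(struct signal_sketch *sig, short val)
{
	assert(val >= -1000 && val <= 1000);
	sig->oom_score_adj = val;
}
```

If userspace writes the maximum value between `set_oom_origin()` and `clear_oom_origin()`, the written value survives, which is exactly the case the save-and-restore scheme got wrong.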
    • mm, oom: change type of oom_score_adj to short · a9c58b90
      David Rientjes committed
      The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
      so this range can be represented by the signed short type with no
      functional change.  The extra space this frees up in struct signal_struct
      will be used for per-thread oom kill flags in the next patch.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: allow exiting threads to have access to memory reserves · 9ff4868e
      David Rientjes committed
      Exiting threads, those with PF_EXITING set, can pagefault and require
      memory before they can make forward progress.  This happens, for instance,
      when a process must fault task->robust_list, a userspace structure, before
      detaching its memory.
      
      These threads also aren't guaranteed to get access to memory reserves
      unless oom killed or killed from userspace.  The oom killer won't grant
      memory reserves if other threads are also exiting other than current and
      stalling at the same point.  This prevents needlessly killing processes
      when others are already exiting.
      
      Instead of special casing all the possible situations between PF_EXITING
      getting set and a thread detaching its mm where it may allocate memory,
      which probably wouldn't get updated when a change is made to the exit
      path, the solution is to give all exiting threads access to memory
      reserves if they call the oom killer.  This allows them to quickly
      allocate, detach their mm, and free the memory it represents.
      
      Summary of Luigi's bug report:
      
      : He had an oom condition where threads were faulting on task->robust_list
      : and repeatedly called the oom killer but it would defer killing a thread
      : because it saw other PF_EXITING threads.  This can happen anytime we need
      : to allocate memory after setting PF_EXITING and before detaching our mm;
      : if there are other threads in the same state then the oom killer won't do
      : anything unless one of them happens to be killed from userspace.
      :
      : So instead of only deferring for PF_EXITING and !task->robust_list, it's
      : better to just give them access to memory reserves to prevent a potential
      : livelock so that any other faults that may be introduced in the future in
      : the exit path don't cause the same problem (and hopefully we don't allow
      : too many of those!).
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Tested-by: Luigi Semenzato <semenzato@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 09 Oct, 2012 1 commit
  9. 01 Aug, 2012 8 commits
  10. 21 Jun, 2012 2 commits
  11. 09 Jun, 2012 1 commit
  12. 30 May, 2012 1 commit
  13. 03 May, 2012 1 commit
  14. 24 Mar, 2012 1 commit
  15. 22 Mar, 2012 6 commits
  16. 13 Jan, 2012 2 commits
  17. 11 Jan, 2012 1 commit
    • tracepoint: add tracepoints for debugging oom_score_adj · 43d2b113
      KAMEZAWA Hiroyuki committed
      oom_score_adj is used to guard processes from the OOM killer.  One
      problem is that it is inherited at fork().  When a daemon sets
      oom_score_adj and creates children, it is hard to know where the value
      was set.
      
      This patch adds three tracepoints that are useful for debugging:
        - creating new task
        - renaming a task (exec)
        - set oom_score_adj
      
      To debug, users need to enable some of these tracepoints.  Filtering may be
      useful, for example:
      
      # EVENT=/sys/kernel/debug/tracing/events/task/
      # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
      # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
      # echo 1 > $EVENT/enable
      # EVENT=/sys/kernel/debug/tracing/events/oom/
      # echo 1 > $EVENT/enable
      
      The output will look like this:
      # grep oom /sys/kernel/debug/tracing/trace
      bash-7699  [007] d..3  5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
      bash-7699  [007] ...1  5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
      ls-7729  [003] ...2  5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
      bash-7699  [002] ...1  5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
      grep-7730  [007] ...2  5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 21 Dec, 2011 1 commit
    • oom: fix integer overflow of points in oom_badness · ff05b6f7
      Frantisek Hrbata committed
      An integer overflow will happen on 64bit archs if task's sum of rss,
      swapents and nr_ptes exceeds (2^31)/1000 value.  This was introduced by
      commit
      
      f755a042 oom: use pte pages in OOM score
      
      where the oom score computation was split into several steps.  It is no
      longer computed as one expression in unsigned long (rss, swapents and
      nr_ptes are unsigned long) whose result, assigned to points (int), is in
      the range 1..1000.  So there can be an int overflow while computing
      
      	points *= 1000;
      
      and points may end up negative, meaning the oom score for a mem hog task
      will be one:
      
      	if (points <= 0)
      		return 1;
      
      For example:
      [ 3366]     0  3366 35390480 24303939   5       0             0 oom01
      Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child
      
      Here the oom01 process consumes more than 24303939(rss)*4096~=92GB physical
      memory, but its oom score is one.
      
      In this situation the mem hog task is skipped and oom killer kills another and
      most probably innocent task with oom score greater than one.
      
      The points variable should be of type long instead of int to prevent the
      int overflow.
      Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>		[2.6.36+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
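The arithmetic behind this overflow can be checked with a small userspace sketch. The helper names are hypothetical. Because signed overflow is undefined behaviour in C, the sketch does not overflow a real `int`; instead it computes the product in 64 bits and then derives the value a 32-bit two's-complement `int` would typically end up holding. Only the rss figure from the example above is used, with the other terms assumed to be zero.

```c
#include <assert.h>
#include <stdint.h>

/* Value a 32-bit two's-complement int would hold after wrapping modulo 2^32. */
static int64_t wrap_to_int32(int64_t v)
{
	uint32_t low = (uint32_t)v;	/* reduction modulo 2^32 is well defined */

	return low > INT32_MAX ? (int64_t)low - 4294967296LL : (int64_t)low;
}

/* "points *= 1000" as it behaves when points is wide enough (64-bit). */
static int64_t scale_points_wide(int64_t pages)
{
	assert(pages >= 0);
	return pages * 1000;
}

/* The same step as it would come out if points were a 32-bit int. */
static int64_t scale_points_narrow(int64_t pages)
{
	return wrap_to_int32(scale_points_wide(pages));
}
```

With `pages = 24303939` the wide result is 24303939000, which is above 2^31, and the narrow result is negative. That is the situation in which the `points <= 0` check turns the memory hog's score into one.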
  19. 22 Nov, 2011 1 commit
    • freezer: rename thaw_process() to __thaw_task() and simplify the implementation · a5be2d0d
      Tejun Heo committed
      thaw_process() now has only internal users - system and cgroup
      freezers.  Remove the unnecessary return value, rename, unexport and
      collapse __thaw_process() into it.  This will help further updates to
      the freezer code.
      
      -v3: oom_kill grew a use of thaw_process() while this patch was
           pending.  Convert it to use __thaw_task() for now.  In the longer
           term, this should be handled by allowing tasks to die if killed
           even if it's frozen.
      
      -v2: minor style update as suggested by Matt.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
  20. 16 Nov, 2011 1 commit
  21. 01 Nov, 2011 2 commits
    • oom: fix race while temporarily setting current's oom_score_adj · 43362a49
      David Rientjes committed
      test_set_oom_score_adj() was introduced in 72788c38 ("oom: replace
      PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
      current's oom_score_adj for ksm and swapoff without requiring an
      additional per-process flag.
      
      Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
      then reinstate the previous value is racy since it's possible that
      userspace can set the value to something else itself before the old value
      is reinstated.  That results in userspace setting current's oom_score_adj
      to a different value and then the kernel immediately setting it back to
      its previous value without notification.
      
      To fix this, a new compare_swap_oom_score_adj() function is introduced
      with the same semantics as the compare and swap CAS instruction, or
      CMPXCHG on x86.  It is used to reinstate the previous value of
      oom_score_adj if and only if the present value is the same as the old
      value.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
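The compare-and-swap semantics this commit describes can be sketched with C11 atomics. Note the substitution: this sketch uses `atomic_exchange()` and `atomic_compare_exchange_strong()` from `<stdatomic.h>` on a plain `_Atomic int`, which is an assumption made for illustration and not a statement about how the kernel functions are implemented. The `*_sketch` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Store new_val and return the value that was there before. */
static int test_set_adj_sketch(_Atomic int *adj, int new_val)
{
	assert(new_val >= -1000 && new_val <= 1000);
	return atomic_exchange(adj, new_val);
}

/*
 * Write new_val only if *adj still equals old_val.
 * Returns true if the swap happened, false if someone changed the value.
 */
static bool compare_swap_adj_sketch(_Atomic int *adj, int old_val, int new_val)
{
	int expected = old_val;

	return atomic_compare_exchange_strong(adj, &expected, new_val);
}
```

A typical sequence would save the result of `test_set_adj_sketch(&adj, 1000)`, do the work, and then call `compare_swap_adj_sketch(&adj, 1000, saved)`. If userspace stored a different value in the meantime, the swap fails and the user's value is left in place.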
    • oom: remove oom_disable_count · c9f01245
      David Rientjes committed
      This removes mm->oom_disable_count entirely since it's unnecessary and
      currently buggy.  The counter was intended to be per-process but it's
      currently decremented in the exit path for each thread that exits, causing
      it to underflow.
      
      The count was originally intended to prevent oom killing threads that
      share memory with threads that cannot be killed since it doesn't lead to
      future memory freeing.  The counter could be fixed to represent all
      threads sharing the same mm, but it's better to remove the count since:
      
       - it is possible that the OOM_DISABLE thread sharing memory with the
         victim is waiting on that thread to exit and will actually cause
         future memory freeing, and
      
       - there is no guarantee that a thread is disabled from oom killing just
         because another thread sharing its mm is oom disabled.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>