1. 03 5月, 2012 1 次提交
  2. 24 3月, 2012 1 次提交
  3. 22 3月, 2012 6 次提交
  4. 13 1月, 2012 2 次提交
  5. 11 1月, 2012 1 次提交
    • K
      tracepoint: add tracepoints for debugging oom_score_adj · 43d2b113
      KAMEZAWA Hiroyuki 提交于
      oom_score_adj is used for guarding processes from OOM-Killer.  One of
      problem is that it's inherited at fork().  When a daemon set oom_score_adj
      and make children, it's hard to know where the value is set.
      
      This patch adds some tracepoints useful for debugging. This patch adds
      3 trace points.
        - creating new task
        - renaming a task (exec)
        - set oom_score_adj
      
      To debug, users need to enable some trace pointer. Maybe filtering is useful as
      
      # EVENT=/sys/kernel/debug/tracing/events/task/
      # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
      # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
      # echo 1 > $EVENT/enable
      # EVENT=/sys/kernel/debug/tracing/events/oom/
      # echo 1 > $EVENT/enable
      
      output will be like this.
      # grep oom /sys/kernel/debug/tracing/trace
      bash-7699  [007] d..3  5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
      bash-7699  [007] ...1  5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
      ls-7729  [003] ...2  5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
      bash-7699  [002] ...1  5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
      grep-7730  [007] ...2  5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      43d2b113
  6. 21 12月, 2011 1 次提交
    • F
      oom: fix integer overflow of points in oom_badness · ff05b6f7
      Frantisek Hrbata 提交于
      An integer overflow will happen on 64bit archs if task's sum of rss,
      swapents and nr_ptes exceeds (2^31)/1000 value.  This was introduced by
      commit
      
      f755a042 oom: use pte pages in OOM score
      
      where the oom score computation was divided into several steps and it's no
      longer computed as one expression in unsigned long(rss, swapents, nr_pte
      are unsigned long), where the result value assigned to points(int) is in
      range(1..1000).  So there could be an int overflow while computing
      
      176          points *= 1000;
      
      and points may have negative value. Meaning the oom score for a mem hog task
      will be one.
      
      196          if (points <= 0)
      197                  return 1;
      
      For example:
      [ 3366]     0  3366 35390480 24303939   5       0             0 oom01
      Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child
      
      Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical
      memory, but it's oom score is one.
      
      In this situation the mem hog task is skipped and oom killer kills another and
      most probably innocent task with oom score greater than one.
      
      The points variable should be of type long instead of int to prevent the
      int overflow.
      Signed-off-by: NFrantisek Hrbata <fhrbata@redhat.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>		[2.6.36+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff05b6f7
  7. 22 11月, 2011 1 次提交
    • T
      freezer: rename thaw_process() to __thaw_task() and simplify the implementation · a5be2d0d
      Tejun Heo 提交于
      thaw_process() now has only internal users - system and cgroup
      freezers.  Remove the unnecessary return value, rename, unexport and
      collapse __thaw_process() into it.  This will help further updates to
      the freezer code.
      
      -v3: oom_kill grew a use of thaw_process() while this patch was
           pending.  Convert it to use __thaw_task() for now.  In the longer
           term, this should be handled by allowing tasks to die if killed
           even if it's frozen.
      
      -v2: minor style update as suggested by Matt.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      a5be2d0d
  8. 16 11月, 2011 1 次提交
  9. 01 11月, 2011 4 次提交
  10. 31 10月, 2011 1 次提交
  11. 02 8月, 2011 1 次提交
  12. 26 7月, 2011 1 次提交
  13. 23 6月, 2011 1 次提交
  14. 25 5月, 2011 1 次提交
    • D
      oom: replace PF_OOM_ORIGIN with toggling oom_score_adj · 72788c38
      David Rientjes 提交于
      There's a kernel-wide shortage of per-process flags, so it's always
      helpful to trim one when possible without incurring a significant penalty.
       It's even more important when you're planning on adding a per- process
      flag yourself, which I plan to do shortly for transparent hugepages.
      
      PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a
      tendency to allocate large amounts of memory and should be preferred for
      killing over other tasks.  We'd rather immediately kill the task making
      the errant syscall rather than penalizing an innocent task.
      
      This patch removes PF_OOM_ORIGIN since its behavior is equivalent to
      setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX.
      
      The process's old oom_score_adj is stored and then set to
      OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN.  The old
      value is then reinstated when the process should no longer be considered a
      high priority for oom killing.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72788c38
  15. 29 4月, 2011 1 次提交
  16. 15 4月, 2011 1 次提交
  17. 25 3月, 2011 1 次提交
  18. 24 3月, 2011 1 次提交
  19. 23 3月, 2011 4 次提交
    • D
      oom: suppress nodes that are not allowed from meminfo on oom kill · ddd588b5
      David Rientjes 提交于
      The oom killer is extremely verbose for machines with a large number of
      cpus and/or nodes.  This verbosity can often be harmful if it causes other
      important messages to be scrolled from the kernel log and incurs a
      signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT >
      8.
      
      This patch causes only memory information to be displayed for nodes that
      are allowed by current's cpuset when dumping the VM state.  Information
      for all other nodes is irrelevant to the oom condition; we don't care if
      there's an abundance of memory elsewhere if we can't access it.
      
      This only affects the behavior of dumping memory information when an oom
      is triggered.  Other dumps, such as for sysrq+m, still display the
      unfiltered form when using the existing show_mem() interface.
      
      Additionally, the per-cpu pageset statistics are extremely verbose in oom
      killer output, so it is now suppressed.  This removes
      
      	nodes_weight(current->mems_allowed) * (1 + nr_cpus)
      
      lines from the oom killer output.
      
      Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed
      nodes.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ddd588b5
    • D
      oom: avoid deferring oom killer if exiting task is being traced · edd45544
      David Rientjes 提交于
      The oom killer naturally defers killing anything if it finds an eligible
      task that is already exiting and has yet to detach its ->mm.  This avoids
      unnecessarily killing tasks when one is already in the exit path and may
      free enough memory that the oom killer is no longer needed.  This is
      detected by PF_EXITING since threads that have already detached its ->mm
      are no longer considered at all.
      
      The problem with always deferring when a thread is PF_EXITING, however, is
      that it may never actually exit when being traced, specifically if another
      task is tracing it with PTRACE_O_TRACEEXIT.  The oom killer does not want
      to defer in this case since there is no guarantee that thread will ever
      exit without intervention.
      
      This patch will now only defer the oom killer when a thread is PF_EXITING
      and no ptracer has stopped its progress in the exit path.  It also ensures
      that a child is sacrificed for the chosen parent only if it has a
      different ->mm as the comment implies: this ensures that the thread group
      leader is always targeted appropriately.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NOleg Nesterov <oleg@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: <stable@kernel.org>		[2.6.38.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      edd45544
    • A
      oom: skip zombies when iterating tasklist · 30e2b41f
      Andrey Vagin 提交于
      We shouldn't defer oom killing if a thread has already detached its ->mm
      and still has TIF_MEMDIE set.  Memory needs to be freed, so find kill
      other threads that pin the same ->mm or find another task to kill.
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.38.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      30e2b41f
    • D
      oom: prevent unnecessary oom kills or kernel panics · 3a5dda7a
      David Rientjes 提交于
      This patch prevents unnecessary oom kills or kernel panics by reverting
      two commits:
      
      	495789a5 (oom: make oom_score to per-process value)
      	cef1d352 (oom: multi threaded process coredump don't make deadlock)
      
      First, 495789a5 (oom: make oom_score to per-process value) ignores the
      fact that all threads in a thread group do not necessarily exit at the
      same time.
      
      It is imperative that select_bad_process() detect threads that are in the
      exit path, specifically those with PF_EXITING set, to prevent needlessly
      killing additional tasks.  If a process is oom killed and the thread group
      leader exits, select_bad_process() cannot detect the other threads that
      are PF_EXITING by iterating over only processes.  Thus, it currently
      chooses another task unnecessarily for oom kill or panics the machine when
      nothing else is eligible.
      
      By iterating over threads instead, it is possible to detect threads that
      are exiting and nominate them for oom kill so they get access to memory
      reserves.
      
      Second, cef1d352 (oom: multi threaded process coredump don't make
      deadlock) erroneously avoids making the oom killer a no-op when an
      eligible thread other than current isfound to be exiting.  We want to
      detect this situation so that we may allow that exiting thread time to
      exit and free its memory; if it is able to exit on its own, that should
      free memory so current is no loner oom.  If it is not able to exit on its
      own, the oom killer will nominate it for oom kill which, in this case,
      only means it will get access to memory reserves.
      
      Without this change, it is easy for the oom killer to unnecessarily target
      tasks when all threads of a victim don't exit before the thread group
      leader or, in the worst case, panic the machine.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: <stable@kernel.org>		[2.6.38.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a5dda7a
  20. 15 3月, 2011 2 次提交
  21. 27 10月, 2010 2 次提交
    • D
      oom: kill all threads sharing oom killed task's mm · 1e99bad0
      David Rientjes 提交于
      It's necessary to kill all threads that share an oom killed task's mm if
      the goal is to lead to future memory freeing.
      
      This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill
      doesn't kill vfork parent (or child)) since it is obsoleted.
      
      It's now guaranteed that any task passed to oom_kill_task() does not share
      an mm with any thread that is unkillable.  Thus, we're safe to issue a
      SIGKILL to any thread sharing the same mm.
      
      This is especially necessary to solve an mm->mmap_sem livelock issue
      whereas an oom killed thread must acquire the lock in the exit path while
      another thread is holding it in the page allocator while trying to
      allocate memory itself (and will preempt the oom killer since a task was
      already killed).  Since tasks with pending fatal signals are now granted
      access to memory reserves, the thread holding the lock may quickly
      allocate and release the lock so that the oom killed task may exit.
      
      This mainly is for threads that are cloned with CLONE_VM but not
      CLONE_THREAD, so they are in a different thread group.  Non-NPTL threads
      exist in the wild and this change is necessary to prevent the livelock in
      such cases.  We care more about preventing the livelock than incurring the
      additional tasklist in the oom killer when a task has been killed.
      Systems that are sufficiently large to not want the tasklist scan in the
      oom killer in the first place already have the option of enabling
      /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for
      that purpose.
      
      This code had existed in the oom killer for over eight years dating back
      to the 2.4 kernel.
      
      [akpm@linux-foundation.org: add nice comment]
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e99bad0
    • D
      oom: avoid killing a task if a thread sharing its mm cannot be killed · e18641e1
      David Rientjes 提交于
      The oom killer's goal is to kill a memory-hogging task so that it may
      exit, free its memory, and allow the current context to allocate the
      memory that triggered it in the first place.  Thus, killing a task is
      pointless if other threads sharing its mm cannot be killed because of its
      /proc/pid/oom_adj or /proc/pid/oom_score_adj value.
      
      This patch checks whether any other thread sharing p->mm has an
      oom_score_adj of OOM_SCORE_ADJ_MIN.  If so, the thread cannot be killed
      and oom_badness(p) returns 0, meaning it's unkillable.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e18641e1
  22. 23 9月, 2010 2 次提交
    • D
      oom: filter unkillable tasks from tasklist dump · e85bfd3a
      David Rientjes 提交于
      /proc/sys/vm/oom_dump_tasks is enabled by default, so it's necessary to
      limit as much information as possible that it should emit.
      
      The tasklist dump should be filtered to only those tasks that are eligible
      for oom kill.  This is already done for memcg ooms, but this patch extends
      it to both cpuset and mempolicy ooms as well as init.
      
      In addition to suppressing irrelevant information, this also reduces
      confusion since users currently don't know which tasks in the tasklist
      aren't eligible for kill (such as those attached to cpusets or bound to
      mempolicies with a disjoint set of mems or nodes, respectively) since that
      information is not shown.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e85bfd3a
    • D
      oom: always return a badness score of non-zero for eligible tasks · f19e8aa1
      David Rientjes 提交于
      A task's badness score is roughly a proportion of its rss and swap
      compared to the system's capacity.  The scale ranges from 0 to 1000 with
      the highest score chosen for kill.  Thus, this scale operates on a
      resolution of 0.1% of RAM + swap.  Admin tasks are also given a 3% bonus,
      so the badness score of an admin task using 3% of memory, for example,
      would still be 0.
      
      It's possible that an exceptionally large number of tasks will combine to
      exhaust all resources but never have a single task that uses more than
      0.1% of RAM and swap (or 3.0% for admin tasks).
      
      This patch ensures that the badness score of any eligible task is never 0
      so the machine doesn't unnecessarily panic because it cannot find a task
      to kill.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f19e8aa1
  23. 21 8月, 2010 3 次提交