1. 10 8月, 2010 3 次提交
    • D
      oom: badness heuristic rewrite · a63d83f4
      David Rientjes 提交于
      This a complete rewrite of the oom killer's badness() heuristic which is
      used to determine which task to kill in oom conditions.  The goal is to
      make it as simple and predictable as possible so the results are better
      understood and we end up killing the task which will lead to the most
      memory freeing while still respecting the fine-tuning from userspace.
      
      Instead of basing the heuristic on mm->total_vm for each task, the task's
      rss and swap space is used instead.  This is a better indication of the
      amount of memory that will be freeable if the oom killed task is chosen
      and subsequently exits.  This helps specifically in cases where KDE or
      GNOME is chosen for oom kill on desktop systems instead of a memory
      hogging task.
      
      The baseline for the heuristic is a proportion of memory that each task is
      currently using in memory plus swap compared to the amount of "allowable"
      memory.  "Allowable," in this sense, means the system-wide resources for
      unconstrained oom conditions, the set of mempolicy nodes, the mems
      attached to current's cpuset, or a memory controller's limit.  The
      proportion is given on a scale of 0 (never kill) to 1000 (always kill),
      roughly meaning that if a task has a badness() score of 500 that the task
      consumes approximately 50% of allowable memory resident in RAM or in swap
      space.
      
      The proportion is always relative to the amount of "allowable" memory and
      not the total amount of RAM systemwide so that mempolicies and cpusets may
      operate in isolation; they shall not need to know the true size of the
      machine on which they are running if they are bound to a specific set of
      nodes or mems, respectively.
      
      Root tasks are given 3% extra memory just like __vm_enough_memory()
      provides in LSMs.  In the event of two tasks consuming similar amounts of
      memory, it is generally better to save root's task.
      
      Because of the change in the badness() heuristic's baseline, it is also
      necessary to introduce a new user interface to tune it.  It's not possible
      to redefine the meaning of /proc/pid/oom_adj with a new scale since the
      ABI cannot be changed for backward compatability.  Instead, a new tunable,
      /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
      be used to polarize the heuristic such that certain tasks are never
      considered for oom kill while others may always be considered.  The value
      is added directly into the badness() score so a value of -500, for
      example, means to discount 50% of its memory consumption in comparison to
      other tasks either on the system, bound to the mempolicy, in the cpuset,
      or sharing the same memory controller.
      
      /proc/pid/oom_adj is changed so that its meaning is rescaled into the
      units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
      these per-task tunables will rescale the value of the other to an
      equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
      a bitshift on the badness score, it now shares the same linear growth as
      /proc/pid/oom_score_adj but with different granularity.  This is required
      so the ABI is not broken with userspace applications and allows oom_adj to
      be deprecated for future removal.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a63d83f4
    • A
      oom: move badness() declaration into oom.h · 74bcbf40
      Andrew Morton 提交于
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      74bcbf40
    • K
      oom: /proc/<pid>/oom_score treat kernel thread honestly · 26ebc984
      KOSAKI Motohiro 提交于
      If a kernel thread is using use_mm(), badness() returns a positive value.
      This is not a big issue because caller take care of it correctly.  But
      there is one exception, /proc/<pid>/oom_score calls badness() directly and
      doesn't care that the task is a regular process.
      
      Another example, /proc/1/oom_score return !0 value.  But it's unkillable.
      This incorrectness makes administration a little confusing.
      
      This patch fixes it.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26ebc984
  2. 28 5月, 2010 2 次提交
  3. 28 4月, 2010 1 次提交
  4. 09 4月, 2010 1 次提交
    • A
      procfs: Kill BKL in llseek on proc base · 87df8424
      Arnd Bergmann 提交于
      We don't use the BKL elsewhere, so use generic_file_llseek
      so we can avoid default_llseek taking the BKL.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      [restore proc_fdinfo_file_operations as non-seekable]
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      87df8424
  5. 01 4月, 2010 1 次提交
    • O
      oom: fix the unsafe usage of badness() in proc_oom_score() · b95c35e7
      Oleg Nesterov 提交于
      proc_oom_score(task) has a reference to task_struct, but that is all.
      If this task was already released before we take tasklist_lock
      
      	- we can't use task->group_leader, it points to nowhere
      
      	- it is not safe to call badness() even if this task is
      	  ->group_leader, has_intersects_mems_allowed() assumes
      	  it is safe to iterate over ->thread_group list.
      
      	- even worse, badness() can hit ->signal == NULL
      
      Add the pid_alive() check to ensure __unhash_process() was not called.
      
      Also, use "task" instead of task->group_leader. badness() should return
      the same result for any sub-thread. Currently this is not true, but
      this should be changed anyway.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b95c35e7
  6. 30 3月, 2010 1 次提交
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo 提交于
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  7. 04 3月, 2010 1 次提交
  8. 25 2月, 2010 1 次提交
    • P
      vfs: Apply lockdep-based checking to rcu_dereference() uses · 7dc52157
      Paul E. McKenney 提交于
      Add lockdep-ified RCU primitives to alloc_fd(), files_fdtable()
      and fcheck_files().
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      LKML-Reference: <1266887105-1528-8-git-send-email-paulmck@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      7dc52157
  9. 19 2月, 2010 1 次提交
  10. 14 1月, 2010 1 次提交
    • A
      fix autofs/afs/etc. magic mountpoint breakage · 86acdca1
      Al Viro 提交于
      We end up trying to kfree() nd.last.name on open("/mnt/tmp", O_CREAT)
      if /mnt/tmp is an autofs direct mount.  The reason is that nd.last_type
      is bogus here; we want LAST_BIND for everything of that kind and we
      get LAST_NORM left over from finding parent directory.
      
      So make sure that it *is* set properly; set to LAST_BIND before
      doing ->follow_link() - for normal symlinks it will be changed
      by __vfs_follow_link() and everything else needs it set that way.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      86acdca1
  11. 16 12月, 2009 2 次提交
  12. 12 11月, 2009 1 次提交
    • S
      pidns: fix a leak in /proc dentries and inodes with pid namespaces. · 29f12ca3
      Sukadev Bhattiprolu 提交于
      Daniel Lezcano reported a leak in 'struct pid' and 'struct pid_namespace'
      that is discussed in:
      
      	http://lkml.org/lkml/2009/10/2/159.
      
      To summarize the thread, when container-init is terminated, it sets the
      PF_EXITING flag, zaps other processes in the container and waits to reap
      them.  As a part of reaping, the container-init should flush any /proc
      dentries associated with the processes.  But because the container-init is
      itself exiting and the following PF_EXITING check, the dentries are not
      flushed, resulting in leak in /proc inodes and dentries.
      
      This fix reverts the commit 7766755a ("Fix /proc dcache deadlock
      in do_exit") which introduced the check for PF_EXITING.  At the time of
      the commit, shrink_dcache_parent() flushed dentries from other filesystems
      also and could have caused a deadlock which the commit fixed.  But as
      pointed out by Eric Biederman, after commit 0feae5c4,
      shrink_dcache_parent() no longer affects other filesystems.  So reverting
      the commit is now safe.
      
      As pointed out by Jan Kara, the leak is not as critical since the
      unclaimed space will be reclaimed under memory pressure or by:
      
      	echo 3 > /proc/sys/vm/drop_caches
      
      But since this check is no longer required, its best to remove it.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Reported-by: NDaniel Lezcano <dlezcano@fr.ibm.com>
      Acked-by: NEric W. Biederman <ebiederm@xmission.com>
      Acked-by: NJan Kara <jack@ucw.cz>
      Cc: Andrea Arcangeli <andrea@cpushare.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29f12ca3
  13. 23 9月, 2009 3 次提交
  14. 22 9月, 2009 3 次提交
  15. 19 8月, 2009 1 次提交
    • K
      mm: revert "oom: move oom_adj value" · 0753ba01
      KOSAKI Motohiro 提交于
      The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to
      the mm_struct.  It was a very good first step for sanitize OOM.
      
      However Paul Menage reported the commit makes regression to his job
      scheduler.  Current OOM logic can kill OOM_DISABLED process.
      
      Why? His program has the code of similar to the following.
      
      	...
      	set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
      	...
      	if (vfork() == 0) {
      		set_oom_adj(0); /* Invoked child can be killed */
      		execve("foo-bar-cmd");
      	}
      	....
      
      vfork() parent and child are shared the same mm_struct.  then above
      set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also
      change oom_adj for vfork() parent.  Then, vfork() parent (job scheduler)
      lost OOM immune and it was killed.
      
      Actually, fork-setting-exec idiom is very frequently used in userland program.
      We must not break this assumption.
      
      Then, this patch revert commit 2ff05b2b and related commit.
      
      Reverted commit list
      ---------------------
      - commit 2ff05b2b (oom: move oom_adj value from task_struct to mm_struct)
      - commit 4d8b9135 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
      - commit 81236810 (oom: only oom kill exiting tasks with attached memory)
      - commit 933b787b (mm: copy over oom_adj value at fork time)
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0753ba01
  16. 10 8月, 2009 5 次提交
  17. 24 6月, 2009 1 次提交
    • O
      mm_for_maps: simplify, use ptrace_may_access() · a893a84e
      Oleg Nesterov 提交于
      It would be nice to kill __ptrace_may_access(). It requires task_lock(),
      but this lock is only needed to read mm->flags in the middle.
      
      Convert mm_for_maps() to use ptrace_may_access(), this also simplifies
      the code a little bit.
      
      Also, we do not need to take ->mmap_sem in advance. In fact I think
      mm_for_maps() should not play with ->mmap_sem at all, the caller should
      take this lock.
      
      With or without this patch, without ->cred_guard_mutex held we can race
      with exec() and get the new ->mm but check old creds.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NSerge Hallyn <serue@us.ibm.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      a893a84e
  18. 17 6月, 2009 1 次提交
    • D
      oom: move oom_adj value from task_struct to mm_struct · 2ff05b2b
      David Rientjes 提交于
      The per-task oom_adj value is a characteristic of its mm more than the
      task itself since it's not possible to oom kill any thread that shares the
      mm.  If a task were to be killed while attached to an mm that could not be
      freed because another thread were set to OOM_DISABLE, it would have
      needlessly been terminated since there is no potential for future memory
      freeing.
      
      This patch moves oomkilladj (now more appropriately named oom_adj) from
      struct task_struct to struct mm_struct.  This requires task_lock() on a
      task to check its oom_adj value to protect against exec, but it's already
      necessary to take the lock when dereferencing the mm to find the total VM
      size for the badness heuristic.
      
      This fixes a livelock if the oom killer chooses a task and another thread
      sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
      because oom_kill_task() repeatedly returns 1 and refuses to kill the
      chosen task while select_bad_process() will repeatedly choose the same
      task during the next retry.
      
      Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
      oom_kill_task() to check for threads sharing the same memory will be
      removed in the next patch in this series where it will no longer be
      necessary.
      
      Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
      these threads are immune from oom killing already.  They simply report an
      oom_adj value of OOM_DISABLE.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ff05b2b
  19. 29 5月, 2009 1 次提交
  20. 11 5月, 2009 1 次提交
  21. 05 5月, 2009 1 次提交
  22. 17 4月, 2009 1 次提交
  23. 01 4月, 2009 1 次提交
  24. 29 3月, 2009 1 次提交
    • H
      fix setuid sometimes wouldn't · 7c2c7d99
      Hugh Dickins 提交于
      check_unsafe_exec() also notes whether the fs_struct is being
      shared by more threads than will get killed by the exec, and if so
      sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
      But /proc/<pid>/cwd and /proc/<pid>/root lookups make transient
      use of get_fs_struct(), which also raises that sharing count.
      
      This might occasionally cause a setuid program not to change euid,
      in the same way as happened with files->count (check_unsafe_exec
      also looks at sighand->count, but /proc doesn't raise that one).
      
      We'd prefer exec not to unshare fs_struct: so fix this in procfs,
      replacing get_fs_struct() by get_fs_path(), which does path_get
      while still holding task_lock, instead of raising fs->count.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: stable@kernel.org
      ___
      
       fs/proc/base.c |   50 +++++++++++++++--------------------------------
       1 file changed, 16 insertions(+), 34 deletions(-)
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c2c7d99
  25. 28 3月, 2009 1 次提交
  26. 18 3月, 2009 1 次提交
    • L
      Avoid 64-bit "switch()" statements on 32-bit architectures · ee568b25
      Linus Torvalds 提交于
      Commit ee6f779b ("filp->f_pos not
      correctly updated in proc_task_readdir") changed the proc code to use
      filp->f_pos directly, rather than through a temporary variable.  In the
      process, that caused the operations to be done on the full 64 bits, even
      though the offset is never that big.
      
      That's all fine and dandy per se, but for some unfathomable reason gcc
      generates absolutely horrid code when using 64-bit values in switch()
      statements.  To the point of actually calling out to gcc helper
      functions like __cmpdi2 rather than just doing the trivial comparisons
      directly the way gcc does for normal compares.  At which point we get
      link failures, because we really don't want to support that kind of
      crazy code.
      
      Fix this by just casting the f_pos value to "unsigned long", which
      is plenty big enough for /proc, and avoids the gcc code generation issue.
      Reported-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Zhang Le <r0bertz@gentoo.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee568b25
  27. 16 3月, 2009 1 次提交
    • Z
      filp->f_pos not correctly updated in proc_task_readdir · ee6f779b
      Zhang Le 提交于
      filp->f_pos only get updated at the end of the function. Thus d_off of those
      dirents who are in the middle will be 0, and this will cause a problem in
      glibc's readdir implementation, specifically endless loop. Because when overflow
      occurs, f_pos will be set to next dirent to read, however it will be 0, unless
      the next one is the last one. So it will start over again and again.
      
      There is a sample program in man 2 gendents. This is the output of the program
      running on a multithread program's task dir before this patch is applied:
      
        $ ./a.out /proc/3807/task
        --------------- nread=128 ---------------
        i-node#  file type  d_reclen  d_off   d_name
          506442  directory    16          1  .
          506441  directory    16          0  ..
          506443  directory    16          0  3807
          506444  directory    16          0  3809
          506445  directory    16          0  3812
          506446  directory    16          0  3861
          506447  directory    16          0  3862
          506448  directory    16          8  3863
      
      This is the output after this patch is applied
      
        $ ./a.out /proc/3807/task
        --------------- nread=128 ---------------
        i-node#  file type  d_reclen  d_off   d_name
          506442  directory    16          1  .
          506441  directory    16          2  ..
          506443  directory    16          3  3807
          506444  directory    16          4  3809
          506445  directory    16          5  3812
          506446  directory    16          6  3861
          506447  directory    16          7  3862
          506448  directory    16          8  3863
      Signed-off-by: NZhang Le <r0bertz@gentoo.org>
      Acked-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee6f779b
  28. 06 1月, 2009 1 次提交