- 11 8月, 2010 2 次提交
-
-
由 Miklos Szeredi 提交于
Prepend "(unreachable)" to path strings if the path is not reachable from the current root. Two places updated are - the return string from getcwd() - and symlinks under /proc/$PID. Other uses of d_path() are left unchanged (we know that some old software crashes if /proc/mounts is changed). Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Miklos Szeredi 提交于
Add three helpers that retrieve a refcounted copy of the root and cwd from the supplied fs_struct. get_fs_root() get_fs_pwd() get_fs_root_and_pwd() Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 10 8月, 2010 5 次提交
-
-
由 David Rientjes 提交于
/proc/pid/oom_adj is now deprecated so that that it may eventually be removed. The target date for removal is August 2012. A warning will be printed to the kernel log if a task attempts to use this interface. Future warning will be suppressed until the kernel is rebooted to prevent spamming the kernel log. Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of "allowable" memory. "Allowable," in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of "allowable" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
If a kernel thread is using use_mm(), badness() returns a positive value. This is not a big issue because caller take care of it correctly. But there is one exception, /proc/<pid>/oom_score calls badness() directly and doesn't care that the task is a regular process. Another example, /proc/1/oom_score return !0 value. But it's unkillable. This incorrectness makes administration a little confusing. This patch fixes it. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Hellwig 提交于
Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 28 5月, 2010 2 次提交
-
-
由 Dan Carpenter 提交于
I removed 3 unused assignments. The first two get reset on the first statement of their functions. For "err" in root.c we don't return an error and we don't use the variable again. Signed-off-by: NDan Carpenter <error27@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: NSerge Hallyn <serue@us.ibm.com> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Now that task->signal can't go away get_nr_threads() doesn't need ->siglock to read signal->count. Also, make it inline, move into sched.h, and convert 2 other proc users of signal->count to use this (now trivial) helper. Henceforth get_nr_threads() is the only valid user of signal->count, we are ready to turn it into "int nr_threads" or, perhaps, kill it. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: NRoland McGrath <roland@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 28 4月, 2010 1 次提交
-
-
由 Jerome Marchand 提交于
Correct the file_operations struct in fdinfo entry of tid_base_stuff[]. Presently /proc/*/task/*/fdinfo contains symlinks to opened files like /proc/*/fd/. Signed-off-by: NJerome Marchand <jmarchan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 4月, 2010 1 次提交
-
-
由 Arnd Bergmann 提交于
We don't use the BKL elsewhere, so use generic_file_llseek so we can avoid default_llseek taking the BKL. Signed-off-by: NArnd Bergmann <arnd@arndb.de> [restore proc_fdinfo_file_operations as non-seekable] Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: John Kacur <jkacur@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Al Viro <viro@ZenIV.linux.org.uk>
-
- 01 4月, 2010 1 次提交
-
-
由 Oleg Nesterov 提交于
proc_oom_score(task) has a reference to task_struct, but that is all. If this task was already released before we take tasklist_lock - we can't use task->group_leader, it points to nowhere - it is not safe to call badness() even if this task is ->group_leader, has_intersects_mems_allowed() assumes it is safe to iterate over ->thread_group list. - even worse, badness() can hit ->signal == NULL Add the pid_alive() check to ensure __unhash_process() was not called. Also, use "task" instead of task->group_leader. badness() should return the same result for any sub-thread. Currently this is not true, but this should be changed anyway. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: stable@kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 3月, 2010 1 次提交
-
-
由 Tejun Heo 提交于
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: NTejun Heo <tj@kernel.org> Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
-
- 04 3月, 2010 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 25 2月, 2010 1 次提交
-
-
由 Paul E. McKenney 提交于
Add lockdep-ified RCU primitives to alloc_fd(), files_fdtable() and fcheck_files(). Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: Alexander Viro <viro@zeniv.linux.org.uk> LKML-Reference: <1266887105-1528-8-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 19 2月, 2010 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 14 1月, 2010 1 次提交
-
-
由 Al Viro 提交于
We end up trying to kfree() nd.last.name on open("/mnt/tmp", O_CREAT) if /mnt/tmp is an autofs direct mount. The reason is that nd.last_type is bogus here; we want LAST_BIND for everything of that kind and we get LAST_NORM left over from finding parent directory. So make sure that it *is* set properly; set to LAST_BIND before doing ->follow_link() - for normal symlinks it will be changed by __vfs_follow_link() and everything else needs it set that way. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 16 12月, 2009 2 次提交
-
-
由 Christoph Hellwig 提交于
Currently all architectures but microblaze unconditionally define USE_ELF_CORE_DUMP. The microblaze omission seems like an error to me, so let's kill this ifdef and make sure we are the same everywhere. Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Cc: <linux-arch@vger.kernel.org> Cc: Michal Simek <michal.simek@petalogix.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 john stultz 提交于
Setting a thread's comm to be something unique is a very useful ability and is helpful for debugging complicated threaded applications. However currently the only way to set a thread name is for the thread to name itself via the PR_SET_NAME prctl. However, there may be situations where it would be advantageous for a thread dispatcher to be naming the threads its managing, rather then having the threads self-describe themselves. This sort of behavior is available on other systems via the pthread_setname_np() interface. This patch exports a task's comm via proc/pid/comm and proc/pid/task/tid/comm interfaces, and allows thread siblings to write to these values. [akpm@linux-foundation.org: cleanups] Signed-off-by: NJohn Stultz <johnstul@us.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Mike Fulton <fultonm@ca.ibm.com> Cc: Sean Foley <Sean_Foley@ca.ibm.com> Cc: Darren Hart <dvhltc@us.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 11月, 2009 1 次提交
-
-
由 Sukadev Bhattiprolu 提交于
Daniel Lezcano reported a leak in 'struct pid' and 'struct pid_namespace' that is discussed in: http://lkml.org/lkml/2009/10/2/159. To summarize the thread, when container-init is terminated, it sets the PF_EXITING flag, zaps other processes in the container and waits to reap them. As a part of reaping, the container-init should flush any /proc dentries associated with the processes. But because the container-init is itself exiting and the following PF_EXITING check, the dentries are not flushed, resulting in leak in /proc inodes and dentries. This fix reverts the commit 7766755a ("Fix /proc dcache deadlock in do_exit") which introduced the check for PF_EXITING. At the time of the commit, shrink_dcache_parent() flushed dentries from other filesystems also and could have caused a deadlock which the commit fixed. But as pointed out by Eric Biederman, after commit 0feae5c4, shrink_dcache_parent() no longer affects other filesystems. So reverting the commit is now safe. As pointed out by Jan Kara, the leak is not as critical since the unclaimed space will be reclaimed under memory pressure or by: echo 3 > /proc/sys/vm/drop_caches But since this check is no longer required, its best to remove it. Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com> Reported-by: NDaniel Lezcano <dlezcano@fr.ibm.com> Acked-by: NEric W. Biederman <ebiederm@xmission.com> Acked-by: NJan Kara <jack@ucw.cz> Cc: Andrea Arcangeli <andrea@cpushare.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 9月, 2009 3 次提交
-
-
由 Vincent Li 提交于
Remove obfuscated zero-length input check and return -EINVAL instead of -EIO error to make the error message clear to user. Add whitespace stripping. No functionality changes. The old code: echo 1 > /proc/pid/make-it-fail (ok) echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Input/output error) The new code: echo 1 > /proc/pid/make-it-fail (ok) echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Invalid argument) This patch is conservative in changes to not breaking existing scripts/applications. Signed-off-by: NVincent Li <macli@brc.ubc.ca> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
The exiting sub-thread flushes /proc/pid only, but this doesn't buy too much: ps and friends mostly use /proc/tid/task/pid. Remove "if (thread_group_leader())" checks from proc_flush_task() path, this means we always remove /proc/tid/task/pid dentry on exit, and this actually matches the comment above proc_flush_task(). The test-case: static void* tfunc(void *arg) { char name[256]; sprintf(name, "/proc/%d/task/%ld/status", getpid(), gettid()); close(open(name, O_RDONLY)); return NULL; } int main(void) { pthread_t t; for (;;) { if (!pthread_create(&t, NULL, &tfunc, NULL)) pthread_join(t, NULL); } } slabtop shows that pid/proc_inode_cache/etc grow quickly and "indefinitely" until the task is killed or shrink_slab() is called, not good. And the main thread needs a lot of time to exit. The same can happen if something like "ps -efL" runs continuously, while some application spawns short-living threads. Reported-by: N"James M. Leddy" <jleddy@redhat.com> Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dominic Duval <dduval@redhat.com> Cc: Frank Hirtz <fhirtz@redhat.com> Cc: "Fuller, Johnray" <Johnray.Fuller@gs.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Paul Batkowski <pbatkowski@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kees Cook 提交于
/proc/$pid/limits should show RLIMIT_CPU as seconds, which is the unit used in kernel/posix-cpu-timers.c: unsigned long psecs = cputime_to_secs(ptime); ... if (psecs >= sig->rlim[RLIMIT_CPU].rlim_max) { ... __group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk); Signed-off-by: NKees Cook <kees.cook@canonical.com> Acked-by: NWANG Cong <xiyou.wangcong@gmail.com> Acked-by: NNeil Horman <nhorman@tuxdriver.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 9月, 2009 3 次提交
-
-
由 KOSAKI Motohiro 提交于
Andrew Morton pointed out oom_adjust_write() has very strange EIO and new line handling. this patch fixes it. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
oom-killer kills a process, not task. Then oom_score should be calculated as per-process too. it makes consistency more and makes speed up select_bad_process(). Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 19 8月, 2009 1 次提交
-
-
由 KOSAKI Motohiro 提交于
The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve("foo-bar-cmd"); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 81236810 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b (mm: copy over oom_adj value at fork time) Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 8月, 2009 5 次提交
-
-
由 Oleg Nesterov 提交于
The problem is minor, but without ->cred_guard_mutex held we can race with exec() and get the new ->mm but check old creds. Now we do not need to re-check task->mm after ptrace_may_access(), it can't be changed to the new mm under us. Strictly speaking, this also fixes another very minor problem. Unless security check fails or the task exits mm_for_maps() should never return NULL, the caller should get either old or new ->mm. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NSerge Hallyn <serue@us.ibm.com> Signed-off-by: NJames Morris <jmorris@namei.org>
-
由 Oleg Nesterov 提交于
mm_for_maps() takes ->mmap_sem after security checks, this looks strange and obfuscates the locking rules. Move this lock to its single caller, m_start(). Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NSerge Hallyn <serue@us.ibm.com> Signed-off-by: NJames Morris <jmorris@namei.org>
-
由 Oleg Nesterov 提交于
It would be nice to kill __ptrace_may_access(). It requires task_lock(), but this lock is only needed to read mm->flags in the middle. Convert mm_for_maps() to use ptrace_may_access(), this also simplifies the code a little bit. Also, we do not need to take ->mmap_sem in advance. In fact I think mm_for_maps() should not play with ->mmap_sem at all, the caller should take this lock. With or without this patch, without ->cred_guard_mutex held we can race with exec() and get the new ->mm but check old creds. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Reviewed-by: NSerge Hallyn <serue@us.ibm.com> Signed-off-by: NJames Morris <jmorris@namei.org>
-
由 Oleg Nesterov 提交于
The problem is minor, but without ->cred_guard_mutex held we can race with exec() and get the new ->mm but check old creds. Now we do not need to re-check task->mm after ptrace_may_access(), it can't be changed to the new mm under us. Strictly speaking, this also fixes another very minor problem. Unless security check fails or the task exits mm_for_maps() should never return NULL, the caller should get either old or new ->mm. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NSerge Hallyn <serue@us.ibm.com> Signed-off-by: NJames Morris <jmorris@namei.org>
-
由 Oleg Nesterov 提交于
mm_for_maps() takes ->mmap_sem after security checks, this looks strange and obfuscates the locking rules. Move this lock to its single caller, m_start(). Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NSerge Hallyn <serue@us.ibm.com> Signed-off-by: NJames Morris <jmorris@namei.org>
-
- 24 6月, 2009 1 次提交
-
-
由 Oleg Nesterov 提交于
It would be nice to kill __ptrace_may_access(). It requires task_lock(), but this lock is only needed to read mm->flags in the middle. Convert mm_for_maps() to use ptrace_may_access(), this also simplifies the code a little bit. Also, we do not need to take ->mmap_sem in advance. In fact I think mm_for_maps() should not play with ->mmap_sem at all, the caller should take this lock. With or without this patch, without ->cred_guard_mutex held we can race with exec() and get the new ->mm but check old creds. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Reviewed-by: NSerge Hallyn <serue@us.ibm.com> Signed-off-by: NJames Morris <jmorris@namei.org>
-
- 17 6月, 2009 1 次提交
-
-
由 David Rientjes 提交于
The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 5月, 2009 1 次提交
-
-
由 KOSAKI Motohiro 提交于
proc_pident_instantiate() has following call flow. proc_pident_lookup() proc_pident_instantiate() proc_pid_make_inode() And, proc_pident_lookup() has following error handling. const struct pid_entry *p, *last; error = ERR_PTR(-ENOENT); if (!task) goto out_no_task; Then, proc_pident_instantiate should return ENOENT too when racing against exit(2) occur. EINAL has two bad reason. - it implies caller is wrong. bad the race isn't caller's mistake. - man 2 open don't explain EINVAL. user often don't handle it. Note: Other proc_pid_make_inode() caller already use ENOENT properly. Acked-by: NEric W. Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 5月, 2009 1 次提交
-
-
由 David Howells 提交于
Guard the setprocattr security hook against ptrace by taking the target task's cred_guard_mutex around it. The problem is that setprocattr() may otherwise note the lack of a debugger, and then perform an action on that basis whilst letting a debugger attach between the two points. Holding cred_guard_mutex across the test and the action prevents ptrace_attach() from doing that. Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NJames Morris <jmorris@namei.org>
-
- 05 5月, 2009 1 次提交
-
-
由 Jake Edge 提交于
By using the same test as is used for /proc/pid/maps and /proc/pid/smaps, only allow processes that can ptrace() a given process to see information that might be used to bypass address space layout randomization (ASLR). These include eip, esp, wchan, and start_stack in /proc/pid/stat as well as the non-symbolic output from /proc/pid/wchan. ASLR can be bypassed by sampling eip as shown by the proof-of-concept code at http://code.google.com/p/fuzzyaslr/ As part of a presentation (http://www.cr0.org/paper/to-jt-linux-alsr-leak.pdf) esp and wchan were also noted as possibly usable information leaks as well. The start_stack address also leaks potentially useful information. Cc: Stable Team <stable@kernel.org> Signed-off-by: NJake Edge <jake@lwn.net> Acked-by: NArjan van de Ven <arjan@linux.intel.com> Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 4月, 2009 1 次提交
-
-
由 KOSAKI Motohiro 提交于
In recently sysfs_poll discussion, Neil Brown pointed out /proc/mounts also should be fixed. SUSv3 says "Regular files shall always poll TRUE for reading and writing". see http://www.opengroup.org/onlinepubs/009695399/functions/poll.html Then, mounts_poll()'s default should be "POLLIN | POLLRDNORM". it mean always readable. In addition, event trigger should use "POLLERR | POLLPRI" instead POLLERR. it makes consistent to mdstat_poll() and sysfs_poll(). and, select(2) can handle POLLPRI easily. Reported-by: NNeil Brown <neilb@suse.de> Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
-
- 01 4月, 2009 1 次提交
-
-
由 Al Viro 提交于
Don't pull it in sched.h; very few files actually need it and those can include directly. sched.h itself only needs forward declaration of struct fs_struct; Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 29 3月, 2009 1 次提交
-
-
由 Hugh Dickins 提交于
check_unsafe_exec() also notes whether the fs_struct is being shared by more threads than will get killed by the exec, and if so sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid. But /proc/<pid>/cwd and /proc/<pid>/root lookups make transient use of get_fs_struct(), which also raises that sharing count. This might occasionally cause a setuid program not to change euid, in the same way as happened with files->count (check_unsafe_exec also looks at sighand->count, but /proc doesn't raise that one). We'd prefer exec not to unshare fs_struct: so fix this in procfs, replacing get_fs_struct() by get_fs_path(), which does path_get while still holding task_lock, instead of raising fs->count. Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: stable@kernel.org ___ fs/proc/base.c | 50 +++++++++++++++-------------------------------- 1 file changed, 16 insertions(+), 34 deletions(-) Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-