- 30 10月, 2013 1 次提交
-
-
由 Oleg Nesterov 提交于
Preparation for the next patches. Move the callsite of uprobe_copy_process() in copy_process() down to the succesfull return. We do not care if copy_process() fails, uprobe_free_utask() won't be called in this case so the wrong ->utask != NULL doesn't matter. OTOH, with this change we know that copy_process() can't fail when uprobe_copy_process() is called, the new task should either return to user-mode or call do_exit(). This way uprobe_copy_process() can: 1. setup p->utask != NULL if necessary 2. setup uprobes_state.xol_area 3. use task_work_add(p) Also, move the definition of uprobe_copy_process() down so that it can see get_utask(). Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
-
- 12 9月, 2013 4 次提交
-
-
由 Oleg Nesterov 提交于
Simple cleanup. Every user of vma_set_policy() does the same work, this looks a bit annoying imho. And the new trivial helper which does mpol_dup() + vma_set_policy() to simplify the callers. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
do_fork() denies CLONE_THREAD | CLONE_PARENT if NEWUSER | NEWPID. Then later copy_process() denies CLONE_SIGHAND if the new process will be in a different pid namespace (task_active_pid_ns() doesn't match current->nsproxy->pid_ns). This looks confusing and inconsistent. CLONE_NEWPID is very similar to the case when ->pid_ns was already unshared, we want the same restrictions so copy_process() should also nack CLONE_PARENT. And it would be better to deny CLONE_NEWUSER && CLONE_SIGHAND as well just for consistency. Kill the "CLONE_NEWUSER | CLONE_NEWPID" check in do_fork() and change copy_process() to do the same check along with ->pid_ns check we already have. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NAndy Lutomirski <luto@amacapital.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Colin Walters <walters@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Commit 8382fcac ("pidns: Outlaw thread creation after unshare(CLONE_NEWPID)") nacks CLONE_NEWPID if the forking process unshared pid_ns. This is correct but unnecessary, copy_pid_ns() does the same check. Remove the CLONE_NEWPID check to cleanup the code and prepare for the next change. Test-case: static int child(void *arg) { return 0; } static char stack[16 * 1024]; int main(void) { pid_t pid; assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0); pid = clone(child, stack + sizeof(stack) / 2, CLONE_NEWPID | SIGCHLD, NULL); assert(pid < 0 && errno == EINVAL); return 0; } clone(CLONE_NEWPID) correctly fails with or without this change. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NAndy Lutomirski <luto@amacapital.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Colin Walters <walters@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Commit 8382fcac ("pidns: Outlaw thread creation after unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared pid_ns, this obviously breaks vfork: int main(void) { assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0); assert(vfork() >= 0); _exit(0); return 0; } fails without this patch. Change this check to use CLONE_SIGHAND instead. This also forbids CLONE_THREAD automatically, and this is what the comment implies. We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it would be safer to not do this. The current check denies CLONE_SIGHAND implicitely and there is no reason to change this. Eric said "CLONE_SIGHAND is fine. CLONE_THREAD would be even better. Having shared signal handling between two different pid namespaces is the case that we are fundamentally guarding against." Signed-off-by: NOleg Nesterov <oleg@redhat.com> Reported-by: NColin Walters <walters@redhat.com> Acked-by: NAndy Lutomirski <luto@amacapital.net> Reviewed-by: N"Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 31 8月, 2013 1 次提交
-
-
由 Eric W. Biederman 提交于
I goofed when I made unshare(CLONE_NEWPID) only work in a single-threaded process. There is no need for that requirement and in fact I analyzied things right for setns. The hard requirement is for tasks that share a VM to all be in the pid namespace and we properly prevent that in do_fork. Just to be certain I took a look through do_wait and forget_original_parent and there are no cases that make it any harder for children to be in the multiple pid namespaces than it is for children to be in the same pid namespace. I also performed a check to see if there were in uses of task->nsproxy_pid_ns I was not familiar with, but it is only used when allocating a new pid for a new task, and in checks to prevent craziness from happening. Acked-by: NSerge Hallyn <serge.hallyn@canonical.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
- 28 8月, 2013 1 次提交
-
-
由 Andy Lutomirski 提交于
nsproxy.pid_ns is *not* the task's pid namespace. The name should clarify that. This makes it more obvious that setns on a pid namespace is weird -- it won't change the pid namespace shown in procfs. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Reviewed-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 8月, 2013 1 次提交
-
-
由 Michal Simek 提交于
Fix inadvertent breakage in the clone syscall ABI for Microblaze that was introduced in commit f3268edb ("microblaze: switch to generic fork/vfork/clone"). The Microblaze syscall ABI for clone takes the parent tid address in the 4th argument; the third argument slot is used for the stack size. The incorrectly-used CLONE_BACKWARDS type assigned parent tid to the 3rd slot. This commit restores the original ABI so that existing userspace libc code will work correctly. All kernel versions from v3.8-rc1 were affected. Signed-off-by: NMichal Simek <michal.simek@xilinx.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 31 7月, 2013 1 次提交
-
-
由 Benjamin LaHaise 提交于
On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote: > On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote: > > When using a large number of threads performing AIO operations the > > IOCTX list may get a significant number of entries which will cause > > significant overhead. For example, when running this fio script: > > > > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1 > > blocksize=1024; numjobs=512; thread; loops=100 > > > > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to > > 30% CPU time spent by lookup_ioctx: > > > > 32.51% [guest.kernel] [g] lookup_ioctx > > 9.19% [guest.kernel] [g] __lock_acquire.isra.28 > > 4.40% [guest.kernel] [g] lock_release > > 4.19% [guest.kernel] [g] sched_clock_local > > 3.86% [guest.kernel] [g] local_clock > > 3.68% [guest.kernel] [g] native_sched_clock > > 3.08% [guest.kernel] [g] sched_clock_cpu > > 2.64% [guest.kernel] [g] lock_release_holdtime.part.11 > > 2.60% [guest.kernel] [g] memcpy > > 2.33% [guest.kernel] [g] lock_acquired > > 2.25% [guest.kernel] [g] lock_acquire > > 1.84% [guest.kernel] [g] do_io_submit > > > > This patchs converts the ioctx list to a radix tree. For a performance > > comparison the above FIO script was run on a 2 sockets 8 core > > machine. This are the results (average and %rsd of 10 runs) for the > > original list based implementation and for the radix tree based > > implementation: > > > > cores 1 2 4 8 16 32 > > list 109376 ms 69119 ms 35682 ms 22671 ms 19724 ms 16408 ms > > %rsd 0.69% 1.15% 1.17% 1.21% 1.71% 1.43% > > radix 73651 ms 41748 ms 23028 ms 16766 ms 15232 ms 13787 ms > > %rsd 1.19% 0.98% 0.69% 1.13% 0.72% 0.75% > > % of radix > > relative 66.12% 65.59% 66.63% 72.31% 77.26% 83.66% > > to list > > > > To consider the impact of the patch on the typical case of having > > only one ctx per process the following FIO script was run: > > > > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1 > > blocksize=1024; numjobs=1; thread; loops=100 > > > > on the same system and the results are the following: > > > > list 58892 ms > > %rsd 0.91% > > radix 59404 ms > > %rsd 0.81% > > % of radix > > relative 100.87% > > to list > > So, I was just doing some benchmarking/profiling to get ready to send > out the aio patches I've got for 3.11 - and it looks like your patch is > causing a ~1.5% throughput regression in my testing :/ ... <snip> I've got an alternate approach for fixing this wart in lookup_ioctx()... Instead of using an rbtree, just use the reserved id in the ring buffer header to index an array pointing the ioctx. It's not finished yet, and it needs to be tidied up, but is most of the way there. -ben -- "Thought is the essence of where you are now." -- kmo> And, a rework of Ben's code, but this was entirely his idea kmo> -Kent bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually free memory. Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
-
- 15 7月, 2013 1 次提交
-
-
由 Paul Gortmaker 提交于
The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes. After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h. This removes all the uses of the __cpuinit macros from C files in the core kernel directories (kernel, init, lib, mm, and include) that don't really have a specific maintainer. [1] https://lkml.org/lkml/2013/5/20/589Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
-
- 11 7月, 2013 1 次提交
-
-
由 Michel Lespinasse 提交于
Since all architectures have been converted to use vm_unmapped_area(), there is no remaining use for the free_area_cache. Signed-off-by: NMichel Lespinasse <walken@google.com> Acked-by: NRik van Riel <riel@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: Helge Deller <deller@gmx.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 7月, 2013 4 次提交
-
-
由 Oleg Nesterov 提交于
copy_process() does a lot of "chaotic" initializations and checks CLONE_THREAD twice before it takes tasklist. In particular it sets "p->group_leader = p" and then changes it again under tasklist if !thread_group_leader(p). This looks a bit confusing, lets create a single "if (CLONE_THREAD)" block which initializes ->exit_signal, ->group_leader, and ->tgid. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
copy_process() adds the new child to thread_group/init_task.tasks list and then does attach_pid(child, PIDTYPE_PID). This means that the lockless next_thread() or next_task() can see this thread with the wrong pid. Say, "ls /proc/pid/task" can list the same inode twice. We could move attach_pid(child, PIDTYPE_PID) up, but in this case find_task_by_vpid() can find the new thread before it was fully initialized. And this is already true for PIDTYPE_PGID/PIDTYPE_SID, With this patch copy_process() initializes child->pids[*].pid first, then calls attach_pid() to insert the task into the pid->tasks list. attach_pid() no longer need the "struct pid*" argument, it is always called after pid_link->pid was already set. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Cleanup and preparation for the next changes. Move the "if (clone_flags & CLONE_THREAD)" code down under "if (likely(p->pid))" and turn it into into the "else" branch. This makes the process/thread initialization more symmetrical and removes one check. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Eric Paris 提交于
When a task is attempting to violate the RLIMIT_NPROC limit we have a check to see if the task is sufficiently priviledged. The check first looks at CAP_SYS_ADMIN, then CAP_SYS_RESOURCE, then if the task is uid=0. A result is that tasks which are allowed by the uid=0 check are first checked against the security subsystem. This results in the security subsystem auditting a denial for sys_admin and sys_resource and then the task passing the uid=0 check. This patch rearranges the code to first check uid=0, since if we pass that we shouldn't hit the security system at all. We then check sys_resource, since it is the smallest capability which will solve the problem. Lastly we check the fallback everything cap_sysadmin. We don't want to give this capability many places since it is so powerful. This will eliminate many of the false positive/needless denial messages we get when a root task tries to violate the nproc limit. (note that kthreads count against root, so on a sufficiently large machine we can actually get past the default limits before any userspace tasks are launched.) Signed-off-by: NEric Paris <eparis@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 5月, 2013 1 次提交
-
-
由 Kent Overstreet 提交于
Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: NKent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 3月, 2013 1 次提交
-
-
由 Kent Overstreet 提交于
Does writethrough and writeback caching, handles unclean shutdown, and has a bunch of other nifty features motivated by real world usage. See the wiki at http://bcache.evilpiepirate.org for more. Signed-off-by: NKent Overstreet <koverstreet@google.com>
-
- 14 3月, 2013 1 次提交
-
-
由 Eric W. Biederman 提交于
Don't allowing sharing the root directory with processes in a different user namespace. There doesn't seem to be any point, and to allow it would require the overhead of putting a user namespace reference in fs_struct (for permission checks) and incrementing that reference count on practically every call to fork. So just perform the inexpensive test of forbidding sharing fs_struct acrosss processes in different user namespaces. We already disallow other forms of threading when unsharing a user namespace so this should be no real burden in practice. This updates setns, clone, and unshare to disallow multiple user namespaces sharing an fs_struct. Cc: stable@vger.kernel.org Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 3月, 2013 1 次提交
-
-
由 Frederic Weisbecker 提交于
The full dynticks cputime accounting is able to account either using the tick or the context tracking subsystem. This way the housekeeping CPU can keep the low overhead tick based solution. This latter mode has a low jiffies resolution granularity and need to be scaled against CFS precise runtime accounting to improve its result. We are doing this for CONFIG_TICK_CPU_ACCOUNTING, now we also need to expand it to full dynticks accounting dynamic off-case as well. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Kevin Hilman <khilman@linaro.org> Cc: Mats Liljegren <mats.liljegren@enea.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
-
- 04 3月, 2013 1 次提交
-
-
由 Al Viro 提交于
... and switch i386 to HAVE_SYSCALL_WRAPPERS, killing open-coded uses of asmlinkage_protect() in a bunch of syscalls. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 28 2月, 2013 1 次提交
-
-
由 Alan Cox 提交于
If new_nsproxy is set we will always call switch_task_namespaces and then set new_nsproxy back to NULL so the reassignment and fall through check are redundant Signed-off-by: NAlan Cox <alan@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 2月, 2013 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 28 1月, 2013 1 次提交
-
-
由 Frederic Weisbecker 提交于
While remotely reading the cputime of a task running in a full dynticks CPU, the values stored in utime/stime fields of struct task_struct may be stale. Its values may be those of the last kernel <-> user transition time snapshot and we need to add the tickless time spent since this snapshot. To fix this, flush the cputime of the dynticks CPUs on kernel <-> user transition and record the time / context where we did this. Then on top of this snapshot and the current time, perform the fixup on the reader side from task_times() accessors. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> [fixed kvm module related build errors] Signed-off-by: NSedat Dilek <sedat.dilek@gmail.com>
-
- 20 1月, 2013 1 次提交
-
-
由 Al Viro 提交于
Cc: stable@vger.kernel.org Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 25 12月, 2012 1 次提交
-
-
由 Eric W. Biederman 提交于
The sequence: unshare(CLONE_NEWPID) clone(CLONE_THREAD|CLONE_SIGHAND|CLONE_VM) Creates a new process in the new pid namespace without setting pid_ns->child_reaper. After forking this results in a NULL pointer dereference. Avoid this and other nonsense scenarios that can show up after creating a new pid namespace with unshare by adding a new check in copy_prodcess. Pointed-out-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
- 20 12月, 2012 1 次提交
-
-
由 Al Viro 提交于
All architectures have CONFIG_GENERIC_KERNEL_THREAD CONFIG_GENERIC_KERNEL_EXECVE __ARCH_WANT_SYS_EXECVE None of them have __ARCH_WANT_KERNEL_EXECVE and there are only two callers of kernel_execve() (which is a trivial wrapper for do_execve() now) left. Kill the conditionals and make both callers use do_execve(). Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 19 12月, 2012 1 次提交
-
-
由 Glauber Costa 提交于
Because those architectures will draw their stacks directly from the page allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG flag, and issue the corresponding free_pages. This code path is taken when the architecture doesn't define CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining architectures fall in this category. This will guarantee that every stack page is accounted to the memcg the process currently lives on, and will have the allocations to fail if they go over limit. For the time being, I am defining a new variant of THREADINFO_GFP, not to mess with the other path. Once the slab is also tracked by memcg, we can get rid of that flag. Tested to successfully protect against :(){ :|:& };: Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NFrederic Weisbecker <fweisbec@redhat.com> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 12月, 2012 1 次提交
-
-
由 Mel Gorman 提交于
Due to the fact that migrations are driven by the CPU a task is running on there is no point tracking NUMA faults until one task runs on a new node. This patch tracks the first node used by an address space. Until it changes, PTE scanning is disabled and no NUMA hinting faults are trapped. This should help workloads that are short-lived, do not care about NUMA placement or have bound themselves to a single node. This takes advantage of the logic in "mm: sched: numa: Implement slow start for working set sampling" to delay when the checks are made. This will take advantage of processes that set their CPU and node bindings early in their lifetime. It will also potentially allow any initial load balancing to take place. Signed-off-by: NMel Gorman <mgorman@suse.de>
-
- 29 11月, 2012 6 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
... and get rid of idiotic struct pt_regs * in asm-generic/syscalls.h prototypes of the same, while we are at it. Eventually we want those in linux/syscalls.h, of course, but that'll have to wait a bit. Note that there are *three* variants of sys_clone() order of arguments. Braindamage galore... Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Frederic Weisbecker 提交于
task_cputime_adjusted() and thread_group_cputime_adjusted() essentially share the same code. They just don't use the same source: * The first function uses the cputime in the task struct and the previous adjusted snapshot that ensures monotonicity. * The second adds the cputime of all tasks in the group and the previous adjusted snapshot of the whole group from the signal structure. Just consolidate the common code that does the adjustment. These functions just need to fetch the values from the appropriate source. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
-
- 20 11月, 2012 1 次提交
-
-
由 Eric W. Biederman 提交于
- Add CLONE_THREAD to the unshare flags if CLONE_NEWUSER is selected As changing user namespaces is only valid if all there is only a single thread. - Restore the code to add CLONE_VM if CLONE_THREAD is selected and the code to addCLONE_SIGHAND if CLONE_VM is selected. Making the constraints in the code clear. Acked-by: NSerge Hallyn <serge.hallyn@canonical.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
- 19 11月, 2012 5 次提交
-
-
由 Eric W. Biederman 提交于
Now that we have been through every permission check in the kernel having uid == 0 and gid == 0 in your local user namespace no longer adds any special privileges. Even having a full set of caps in your local user namespace is safe because capabilies are relative to your local user namespace, and do not confer unexpected privileges. Over the long term this should allow much more of the kernels functionality to be safely used by non-root users. Functionality like unsharing the mount namespace that is only unsafe because it can fool applications whose privileges are raised when they are executed. Since those applications have no privileges in a user namespaces it becomes safe to spoof and confuse those applications all you want. Those capabilities will still need to be enabled carefully because we may still need things like rlimits on the number of unprivileged mounts but that is to avoid DOS attacks not to avoid fooling root owned processes. Acked-by: NSerge Hallyn <serge.hallyn@canonical.com> Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
Unsharing of the pid namespace unlike unsharing of other namespaces does not take affect immediately. Instead it affects the children created with fork and clone. The first of these children becomes the init process of the new pid namespace, the rest become oddball children of pid 0. From the point of view of the new pid namespace the process that created it is pid 0, as it's pid does not map. A couple of different semantics were considered but this one was settled on because it is easy to implement and it is usable from pam modules. The core reasons for the existence of unshare. I took a survey of the callers of pam modules and the following appears to be a representative sample of their logic. { setup stuff include pam child = fork(); if (!child) { setuid() exec /bin/bash } waitpid(child); pam and other cleanup } As you can see there is a fork to create the unprivileged user space process. Which means that the unprivileged user space process will appear as pid 1 in the new pid namespace. Further most login processes do not cope with extraneous children which means shifting the duty of reaping extraneous child process to the creator of those extraneous children makes the system more comprehensible. The practical reason for this set of pid namespace semantics is that it is simple to implement and verify they work correctly. Whereas an implementation that requres changing the struct pid on a process comes with a lot more races and pain. Not the least of which is that glibc caches getpid(). These semantics are implemented by having two notions of the pid namespace of a proces. There is task_active_pid_ns which is the pid namspace the process was created with and the pid namespace that all pids are presented to that process in. The task_active_pid_ns is stored in the struct pid of the task. Then there is the pid namespace that will be used for children that pid namespace is stored in task->nsproxy->pid_ns. Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
Instead of setting child_reaper and SIGNAL_UNKILLABLE one way for the system init process, and another way for pid namespace init processes test pid->nr == 1 and use the same code for both. For the global init this results in SIGNAL_UNKILLABLE being set much earlier in the initialization process. This is a small cleanup and it paves the way for allowing unshare and enter of the pid namespace as that path like our global init also will not set CLONE_NEWPID. Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
Track the number of pids in the proc hash table. When the number of pids goes to 0 schedule work to unmount the kernel mount of proc. Move the mount of proc into alloc_pid when we allocate the pid for init. Remove the surprising calls of pid_ns_release proc in fork and proc_flush_task. Those code paths really shouldn't know about proc namespace implementation details and people have demonstrated several times that finding and understanding those code paths is difficult and non-obvious. Because of the call path detach pid is alwasy called with the rtnl_lock held free_pid is not allowed to sleep, so the work to unmounting proc is moved to a work queue. This has the side benefit of not blocking the entire world waiting for the unnecessary rcu_barrier in deactivate_locked_super. In the process of making the code clear and obvious this fixes a bug reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns succeeded and copy_net_ns failed. Acked-by: N"Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
The expressions tsk->nsproxy->pid_ns and task_active_pid_ns aka ns_of_pid(task_pid(tsk)) should have the same number of cache line misses with the practical difference that ns_of_pid(task_pid(tsk)) is released later in a processes life. Furthermore by using task_active_pid_ns it becomes trivial to write an unshare implementation for the the pid namespace. So I have used task_active_pid_ns everywhere I can. In fork since the pid has not yet been attached to the process I use ns_of_pid, to achieve the same effect. Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
-