1. 30 October 2013, 1 commit
    • uprobes: Change the callsite of uprobe_copy_process() · b68e0749
      Authored by Oleg Nesterov
      Preparation for the next patches.
      
      Move the callsite of uprobe_copy_process() in copy_process() down
      to the successful return. We do not care if copy_process() fails;
      uprobe_free_utask() won't be called in this case, so the wrong
      ->utask != NULL doesn't matter.
      
      OTOH, with this change we know that copy_process() can't fail once
      uprobe_copy_process() is called: the new task will either return
      to user-mode or call do_exit(). This way uprobe_copy_process() can:
      
      	1. setup p->utask != NULL if necessary
      
      	2. setup uprobes_state.xol_area
      
      	3. use task_work_add(p)
      
      Also, move the definition of uprobe_copy_process() down so that it
      can see get_utask().
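
      For illustration, the tail of copy_process() after this change would look
      roughly like this (a simplified sketch of the resulting shape, not the
      committed diff):

      	/* end of copy_process(): no failure paths remain past this point */
      	perf_event_fork(p);

      	uprobe_copy_process(p);	/* safe: the child will run or do_exit() */

      	return p;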
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
  2. 12 September 2013, 4 commits
  3. 31 August 2013, 1 commit
    • pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD · 6e556ce2
      Authored by Eric W. Biederman
      I goofed when I made unshare(CLONE_NEWPID) only work in a
      single-threaded process.  There is no need for that requirement, and in
      fact I analyzed things right for setns.  The hard requirement
      is for tasks that share a VM to all be in the same pid namespace,
      and we properly prevent that in do_fork.
      
      Just to be certain I took a look through do_wait and
      forget_original_parent, and there are no cases that make it any harder
      for children to be in multiple pid namespaces than it is for
      children to be in the same pid namespace.  I also performed a check to
      see if there were any uses of task->nsproxy->pid_ns I was not familiar
      with, but it is only used when allocating a new pid for a new task,
      and in checks to prevent craziness from happening.
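
      As a sketch of the guard mentioned above (the exact committed condition
      may differ), copy_process() refuses to create a VM-sharing child whose
      pid namespace would differ from the parent's:

      	/* sketch: tasks that share a VM must share a pid namespace */
      	if ((clone_flags & CLONE_VM) &&
      	    (task_active_pid_ns(current) != current->nsproxy->pid_ns))
      		return ERR_PTR(-EINVAL);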
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  4. 28 August 2013, 1 commit
  5. 14 August 2013, 1 commit
  6. 31 July 2013, 1 commit
    • aio: convert the ioctx list to table lookup v3 · db446a08
      Authored by Benjamin LaHaise
      On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
      > On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
      > > When using a large number of threads performing AIO operations the
      > > IOCTX list may get a significant number of entries which will cause
      > > significant overhead. For example, when running this fio script:
      > >
      > > rw=randrw; size=256k; directory=/mnt/fio; ioengine=libaio; iodepth=1
      > > blocksize=1024; numjobs=512; thread; loops=100
      > >
      > > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
      > > 30% CPU time spent by lookup_ioctx:
      > >
      > >  32.51%  [guest.kernel]  [g] lookup_ioctx
      > >   9.19%  [guest.kernel]  [g] __lock_acquire.isra.28
      > >   4.40%  [guest.kernel]  [g] lock_release
      > >   4.19%  [guest.kernel]  [g] sched_clock_local
      > >   3.86%  [guest.kernel]  [g] local_clock
      > >   3.68%  [guest.kernel]  [g] native_sched_clock
      > >   3.08%  [guest.kernel]  [g] sched_clock_cpu
      > >   2.64%  [guest.kernel]  [g] lock_release_holdtime.part.11
      > >   2.60%  [guest.kernel]  [g] memcpy
      > >   2.33%  [guest.kernel]  [g] lock_acquired
      > >   2.25%  [guest.kernel]  [g] lock_acquire
      > >   1.84%  [guest.kernel]  [g] do_io_submit
      > >
      > > This patch converts the ioctx list to a radix tree. For a performance
      > > comparison the above FIO script was run on a 2-socket, 8-core
      > > machine. These are the results (average and %rsd of 10 runs) for the
      > > original list-based implementation and for the radix-tree-based
      > > implementation:
      > >
      > > cores           1          2          4          8         16         32
      > > list       109376 ms   69119 ms   35682 ms   22671 ms   19724 ms   16408 ms
      > > %rsd         0.69%      1.15%      1.17%      1.21%      1.71%      1.43%
      > > radix       73651 ms   41748 ms   23028 ms   16766 ms   15232 ms   13787 ms
      > > %rsd         1.19%      0.98%      0.69%      1.13%      0.72%      0.75%
      > > radix/list  66.12%     65.59%     66.63%     72.31%     77.26%     83.66%
      > >
      > > To consider the impact of the patch on the typical case of having
      > > only one ctx per process the following FIO script was run:
      > >
      > > rw=randrw; size=100m; directory=/mnt/fio; ioengine=libaio; iodepth=1
      > > blocksize=1024; numjobs=1; thread; loops=100
      > >
      > > on the same system and the results are the following:
      > >
      > > list        58892 ms
      > > %rsd         0.91%
      > > radix       59404 ms
      > > %rsd         0.81%
      > > radix/list 100.87%
      >
      > So, I was just doing some benchmarking/profiling to get ready to send
      > out the aio patches I've got for 3.11 - and it looks like your patch is
      > causing a ~1.5% throughput regression in my testing :/
      ... <snip>
      
      I've got an alternate approach for fixing this wart in lookup_ioctx()...
      Instead of using an rbtree, just use the reserved id in the ring buffer
      header to index an array pointing to the ioctx.  It's not finished yet, and
      it needs to be tidied up, but it is most of the way there.
      
      		-ben
      --
      "Thought is the essence of where you are now."
      --
      kmo> And, a rework of Ben's code, but this was entirely his idea
      kmo>		-Kent
      
      bcrl> And fix the code to use the right mm_struct in kill_ioctx(), and
      actually free the memory.
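
      In rough C, the idea is that the id stored in the ring header indexes a
      per-mm array of ioctx pointers, so the lookup becomes O(1); a sketch under
      assumed structure names (kioctx_table, mm->ioctx_table):

      	/* sketch: look up an ioctx via the id recorded in its ring header */
      	struct kioctx *lookup_ioctx(unsigned long ctx_id)
      	{
      		struct aio_ring __user *ring = (void __user *)ctx_id;
      		struct kioctx_table *table;
      		struct kioctx *ctx, *ret = NULL;
      		unsigned id;

      		if (get_user(id, &ring->id))
      			return NULL;

      		rcu_read_lock();
      		table = rcu_dereference(current->mm->ioctx_table);
      		if (table && id < table->nr) {
      			ctx = table->table[id];
      			if (ctx && ctx->user_id == ctx_id) {
      				percpu_ref_get(&ctx->users);	/* take a reference */
      				ret = ctx;
      			}
      		}
      		rcu_read_unlock();
      		return ret;
      	}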
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
  7. 15 July 2013, 1 commit
    • kernel: delete __cpuinit usage from all core kernel files · 0db0628d
      Authored by Paul Gortmaker
      The __cpuinit type of throwaway sections might have made sense
      some time ago, when RAM was more constrained, but now the savings
      do not offset the cost and complications.  The fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the uses of the __cpuinit macros from C files in
      the core kernel directories (kernel, init, lib, mm, and include)
      that don't really have a specific maintainer.
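
      The conversion itself is mechanical; an illustrative (hypothetical)
      declaration:

      	/* before: code discarded after boot via the __cpuinit section */
      	static int __cpuinit example_cpu_callback(struct notifier_block *nb,
      						  unsigned long action, void *hcpu);

      	/* after: the annotation is simply dropped */
      	static int example_cpu_callback(struct notifier_block *nb,
      					unsigned long action, void *hcpu);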
      
      [1] https://lkml.org/lkml/2013/5/20/589
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  8. 11 July 2013, 1 commit
  9. 04 July 2013, 4 commits
    • kernel/fork.c:copy_process(): consolidate the lockless CLONE_THREAD checks · 18c830df
      Authored by Oleg Nesterov
      copy_process() does a lot of "chaotic" initializations and checks
      CLONE_THREAD twice before it takes tasklist.  In particular it sets
      "p->group_leader = p" and then changes it again under tasklist if
      !thread_group_leader(p).
      
      This looks a bit confusing, so let's create a single "if (CLONE_THREAD)" block
      which initializes ->exit_signal, ->group_leader, and ->tgid.
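
      The consolidated block then reads roughly as follows (a sketch of the
      resulting shape):

      	if (clone_flags & CLONE_THREAD) {
      		p->exit_signal = -1;
      		p->group_leader = current->group_leader;
      		p->tgid = current->tgid;
      	} else {
      		if (clone_flags & CLONE_PARENT)
      			p->exit_signal = current->group_leader->exit_signal;
      		else
      			p->exit_signal = (clone_flags & CSIGNAL);
      		p->group_leader = p;
      		p->tgid = p->pid;
      	}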
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/fork.c:copy_process(): don't add the uninitialized child to thread/task/pid lists · 81907739
      Authored by Oleg Nesterov
      copy_process() adds the new child to thread_group/init_task.tasks list and
      then does attach_pid(child, PIDTYPE_PID).  This means that the lockless
      next_thread() or next_task() can see this thread with the wrong pid.  Say,
      "ls /proc/pid/task" can list the same inode twice.
      
      We could move attach_pid(child, PIDTYPE_PID) up, but in this case
      find_task_by_vpid() can find the new thread before it was fully
      initialized.
      
      This is already true for PIDTYPE_PGID/PIDTYPE_SID.  With this patch
      copy_process() initializes child->pids[*].pid first, then calls
      attach_pid() to insert the task into the pid->tasks list.
      
      attach_pid() no longer needs the "struct pid *" argument; it is always
      called after pid_link->pid has been set.
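
      Schematically, the resulting order in copy_process() (helper names per my
      reading of the patch):

      	/* sketch: fill in the pid links first ... */
      	init_task_pid(p, PIDTYPE_PID, pid);
      	init_task_pid(p, PIDTYPE_PGID, task_pgrp(current));
      	init_task_pid(p, PIDTYPE_SID, task_session(current));

      	/* ... and only then make the task findable */
      	attach_pid(p, PIDTYPE_PID);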
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/fork.c:copy_process(): unify CLONE_THREAD-or-thread_group_leader code · 80628ca0
      Authored by Oleg Nesterov
      Cleanup and preparation for the next changes.
      
      Move the "if (clone_flags & CLONE_THREAD)" code down under "if
      (likely(p->pid))" and turn it into into the "else" branch.  This makes the
      process/thread initialization more symmetrical and removes one check.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fork: reorder permissions when violating number of processes limits · b57922b6
      Authored by Eric Paris
      When a task is attempting to violate the RLIMIT_NPROC limit we have a
      check to see if the task is sufficiently privileged.  The check first
      looks at CAP_SYS_ADMIN, then CAP_SYS_RESOURCE, then whether the task is uid=0.
      
      A result is that tasks which are allowed by the uid=0 check are first
      checked against the security subsystem.  This results in the security
      subsystem auditing a denial for sys_admin and sys_resource and then the
      task passing the uid=0 check.
      
      This patch rearranges the code to first check uid=0, since if we pass that
      we shouldn't hit the security subsystem at all.  We then check sys_resource,
      since it is the smallest capability which will solve the problem.  Lastly
      we check the catch-all fallback, cap_sys_admin.  We don't want to rely on
      this capability in many places since it is so powerful.
      
      This will eliminate many of the false positive/needless denial messages we
      get when a root task tries to violate the nproc limit.  (note that
      kthreads count against root, so on a sufficiently large machine we can
      actually get past the default limits before any userspace tasks are
      launched.)
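
      After the reordering the check reads roughly like this (a sketch):

      	if (atomic_read(&p->real_cred->user->processes) >=
      			task_rlimit(p, RLIMIT_NPROC)) {
      		if (p->real_cred->user != INIT_USER &&	/* cheap, silent uid=0 test first */
      		    !capable(CAP_SYS_RESOURCE) &&	/* smallest sufficient capability */
      		    !capable(CAP_SYS_ADMIN))		/* powerful catch-all last */
      			goto bad_fork_free;
      	}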
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 08 May 2013, 1 commit
  11. 24 March 2013, 1 commit
  12. 14 March 2013, 1 commit
    • userns: Don't allow CLONE_NEWUSER | CLONE_FS · e66eded8
      Authored by Eric W. Biederman
      Don't allow sharing the root directory with processes in a
      different user namespace.  There doesn't seem to be any point, and to
      allow it would require the overhead of putting a user namespace
      reference in fs_struct (for permission checks) and incrementing that
      reference count on practically every call to fork.
      
      So just perform the inexpensive test of forbidding sharing an fs_struct
      across processes in different user namespaces.  We already disallow
      other forms of threading when unsharing a user namespace, so this
      should be no real burden in practice.
      
      This updates setns, clone, and unshare to disallow multiple user
      namespaces sharing an fs_struct.
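
      The test itself is a one-liner in each of those paths; sketched for the
      fork path:

      	/* sketch: refuse a shared fs_struct across user namespaces */
      	if ((clone_flags & (CLONE_NEWUSER | CLONE_FS)) ==
      	    (CLONE_NEWUSER | CLONE_FS))
      		return ERR_PTR(-EINVAL);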
      
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
  13. 08 March 2013, 1 commit
    • cputime: Dynamically scale cputime for full dynticks accounting · 9fbc42ea
      Authored by Frederic Weisbecker
      The full dynticks cputime accounting is able to account using either
      the tick or the context tracking subsystem. This way
      the housekeeping CPU can keep the low-overhead tick-based
      solution.
      
      The latter, tick-based mode has coarse jiffies-resolution granularity
      and needs to be scaled against the precise CFS runtime accounting to
      improve its results. We already do this for CONFIG_TICK_CPU_ACCOUNTING;
      now we also need to extend it to the full dynticks accounting
      dynamic off-case as well.
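
      The scaling is a simple proportion: split the precise CFS runtime rtime
      between user and system in the ratio the tick samples recorded.
      Schematically (overflow handling omitted):

      	u64 total = utime + stime;

      	if (total) {
      		utime = div64_u64(utime * rtime, total);
      		stime = rtime - utime;
      	} else {
      		stime = rtime;	/* no samples yet: attribute everything to system */
      	}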
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Mats Liljegren <mats.liljegren@enea.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  14. 04 March 2013, 1 commit
  15. 28 February 2013, 1 commit
  16. 23 February 2013, 1 commit
  17. 28 January 2013, 1 commit
    • cputime: Safely read cputime of full dynticks CPUs · 6a61671b
      Authored by Frederic Weisbecker
      While remotely reading the cputime of a task running on a
      full dynticks CPU, the values stored in the utime/stime fields
      of struct task_struct may be stale. They may be those
      of the last kernel <-> user transition snapshot, and
      we need to add the tickless time spent since that snapshot.
      
      To fix this, flush the cputime of the dynticks CPUs on
      kernel <-> user transition and record the time / context
      where we did this. Then on top of this snapshot and the current
      time, perform the fixup on the reader side from task_times()
      accessors.
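
      Conceptually this pairs a writer on every kernel <-> user transition with
      a retry loop on the reader side, seqlock-style; a rough sketch with
      illustrative field names (vtime_seqlock, vtime_snap) and a hypothetical
      delta_since() helper:

      	/* writer: on kernel <-> user transition */
      	write_seqlock(&tsk->vtime_seqlock);
      	/* ... fold the tickless delta into utime/stime ... */
      	tsk->vtime_snap = jiffies;
      	write_sequnlock(&tsk->vtime_seqlock);

      	/* reader: task_times()-style accessor */
      	do {
      		seq = read_seqbegin(&tsk->vtime_seqlock);
      		utime = tsk->utime + delta_since(tsk->vtime_snap);
      	} while (read_seqretry(&tsk->vtime_seqlock, seq));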
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [fixed kvm module related build errors]
      Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
  18. 20 January 2013, 1 commit
  19. 25 December 2012, 1 commit
  20. 20 December 2012, 1 commit
  21. 19 December 2012, 1 commit
    • fork: protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs · 2ad306b1
      Authored by Glauber Costa
      Because those architectures draw their stacks directly from the page
      allocator rather than the slab cache, we can directly pass the __GFP_KMEMCG
      flag and issue the corresponding free_pages.
      
      This code path is taken when the architecture doesn't define
      CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
      THREAD_SIZE >= PAGE_SIZE.  Luckily, most - if not all - of the remaining
      architectures fall in this category.
      
      This will guarantee that every stack page is accounted to the memcg the
      process currently lives in, and will cause the allocations to fail if they
      go over the limit.
      
      For the time being, I am defining a new variant of THREADINFO_GFP, not to
      mess with the other path.  Once the slab is also tracked by memcg, we can
      get rid of that flag.
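
      Concretely, the new variant and the stack allocation look roughly like
      this (per my reading of the patch):

      	#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)

      	static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
      							  int node)
      	{
      		struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
      						     THREAD_SIZE_ORDER);

      		return page ? page_address(page) : NULL;
      	}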
      
      Tested to successfully protect against :(){ :|:& };:
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 11 December 2012, 1 commit
    • mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node · 5bca2303
      Authored by Mel Gorman
      Because migrations are driven by the CPU a task is running
      on, there is no point tracking NUMA faults until one task runs on a new
      node. This patch tracks the first node used by an address space. Until
      it changes, PTE scanning is disabled and no NUMA hinting faults are
      trapped. This should help workloads that are short-lived, do not care
      about NUMA placement, or have bound themselves to a single node.
      
      This takes advantage of the logic in "mm: sched: numa: Implement slow
      start for working set sampling" to delay when the checks are made. This
      will take advantage of processes that set their CPU and node bindings
      early in their lifetime. It will also potentially allow any initial load
      balancing to take place.
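
      The gating logic, roughly (field and constant names as this patch
      introduces them, per my reading):

      	if (mm->first_nid == NUMA_PTE_SCAN_INIT)
      		mm->first_nid = numa_node_id();
      	if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
      		if (numa_node_id() == mm->first_nid)
      			return;		/* still on the first node: skip scanning */
      		mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
      	}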
      Signed-off-by: Mel Gorman <mgorman@suse.de>
  23. 29 November 2012, 6 commits
  24. 20 November 2012, 1 commit
  25. 19 November 2012, 5 commits
    • userns: Allow unprivileged users to create user namespaces. · 5eaf563e
      Authored by Eric W. Biederman
      Now that we have been through every permission check in the kernel,
      having uid == 0 and gid == 0 in your local user namespace no
      longer adds any special privileges.  Even having a full set
      of caps in your local user namespace is safe, because capabilities
      are relative to your local user namespace and do not confer
      unexpected privileges.
      
      Over the long term this should allow much more of the kernel's
      functionality to be safely used by non-root users.  Functionality
      like unsharing the mount namespace is only unsafe because
      it can fool applications whose privileges are raised when they
      are executed.  Since those applications have no privileges in
      a user namespace, it becomes safe to spoof and confuse those
      applications all you want.
      
      Those capabilities will still need to be enabled carefully, because
      we may still need things like rlimits on the number of unprivileged
      mounts, but that is to avoid DoS attacks, not to avoid fooling
      root-owned processes.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • pidns: Support unsharing the pid namespace. · 50804fe3
      Authored by Eric W. Biederman
      Unsharing of the pid namespace, unlike unsharing of other namespaces,
      does not take effect immediately.  Instead it affects the children
      created with fork and clone.  The first of these children becomes the init
      process of the new pid namespace; the rest become oddball children
      of pid 0.  From the point of view of the new pid namespace, the process
      that created it is pid 0, as its pid does not map.
      
      A couple of different semantics were considered, but this one was
      settled on because it is easy to implement and it is usable from
      pam modules, which are the core reason for the existence of unshare.
      
      I took a survey of the callers of pam modules and the following
      appears to be a representative sample of their logic.
      {
      	/* setup stuff, including pam */
      	pid_t child = fork();
      	if (!child) {
      		setuid(uid);		/* uid of the user logging in */
      		execl("/bin/bash", "bash", (char *)NULL);
      	}
      	waitpid(child, NULL, 0);

      	/* pam and other cleanup */
      }
      
      As you can see, there is a fork to create the unprivileged user-space
      process, which means that the unprivileged user-space process will
      appear as pid 1 in the new pid namespace.  Further,
      most login processes do not cope with extraneous children, which
      means shifting the duty of reaping extraneous child processes onto
      the creator of those children makes the system more
      comprehensible.
      
      The practical reason for this set of pid namespace semantics is
      that it is simple to implement and verify they work correctly,
      whereas an implementation that requires changing the struct
      pid on a process comes with a lot more races and pain.  Not
      the least of which is that glibc caches getpid().
      
      These semantics are implemented by having two notions
      of the pid namespace of a process.  There is task_active_pid_ns,
      which is the pid namespace the process was created with
      and the pid namespace in which all pids are presented to
      that process.  The task_active_pid_ns is stored
      in the struct pid of the task.
      
      Then there is the pid namespace that will be used for children;
      that pid namespace is stored in task->nsproxy->pid_ns.
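
      From user space the resulting flow looks like this (a minimal sketch):

      	/* the caller keeps its old pid; its next child becomes pid 1 */
      	if (unshare(CLONE_NEWPID) == 0) {
      		pid_t child = fork();
      		if (child == 0) {
      			/* getpid() returns 1 here, in the new namespace */
      			execl("/bin/sh", "sh", (char *)NULL);
      		}
      		waitpid(child, NULL, 0);	/* reap the namespace's init */
      	}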
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • pidns: Consolidate initialization of special init task state · 1c4042c2
      Authored by Eric W. Biederman
      Instead of setting child_reaper and SIGNAL_UNKILLABLE one way
      for the system init process and another way for pid namespace
      init processes, test pid->nr == 1 and use the same code for both.
      
      For the global init this results in SIGNAL_UNKILLABLE being set
      much earlier in the initialization process.
      
      This is a small cleanup and it paves the way for allowing unshare and
      enter of the pid namespace, as that path, like our global init, also will
      not set CLONE_NEWPID.
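
      The unified test in copy_process() then amounts to (sketch):

      	if (is_child_reaper(pid)) {	/* true iff pid->nr == 1 */
      		ns_of_pid(pid)->child_reaper = p;
      		p->signal->flags |= SIGNAL_UNKILLABLE;
      	}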
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • pidns: Make the pidns proc mount/umount logic obvious. · 0a01f2cc
      Authored by Eric W. Biederman
      Track the number of pids in the proc hash table.  When the number of
      pids goes to 0, schedule work to unmount the kernel mount of proc.
      
      Move the mount of proc into alloc_pid when we allocate the pid for
      init.
      
      Remove the surprising calls of pid_ns_release_proc in fork and
      proc_flush_task.  Those code paths really shouldn't know about proc
      namespace implementation details, and people have demonstrated several
      times that finding and understanding those code paths is difficult and
      non-obvious.
      
      Because of the call path, detach_pid is always called with the
      rtnl_lock held, so free_pid is not allowed to sleep and the work of
      unmounting proc is moved to a work queue.  This has the side benefit
      of not blocking the entire world waiting for the unnecessary
      rcu_barrier in deactivate_locked_super.
      
      In the process of making the code clear and obvious this fixes a bug
      reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a
      mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
      succeeded and copy_net_ns failed.
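
      The shape of the fix, schematically (names per my reading of the patch):

      	/* sketch: free_pid() may not sleep, so defer the unmount */
      	if (--ns->nr_hashed == 0)
      		schedule_work(&ns->proc_work);	/* unmounts the pid ns's proc later */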
      Acked-by: N"Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
    • pidns: Use task_active_pid_ns where appropriate · 17cf22c3
      Authored by Eric W. Biederman
      The expressions tsk->nsproxy->pid_ns and task_active_pid_ns,
      aka ns_of_pid(task_pid(tsk)), should have the same number of
      cache line misses, with the practical difference that
      ns_of_pid(task_pid(tsk)) is released later in a process's life.
      
      Furthermore, by using task_active_pid_ns it becomes trivial
      to write an unshare implementation for the pid namespace.
      
      So I have used task_active_pid_ns everywhere I can.
      
      In fork, since the pid has not yet been attached to the
      process, I use ns_of_pid to achieve the same effect.
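
      For reference, task_active_pid_ns() is essentially:

      	struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
      	{
      		return ns_of_pid(task_pid(tsk));
      	}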
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>