1. 12 Sep 2013, 3 commits
  2. 31 Aug 2013, 1 commit
    • pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD · 6e556ce2
      Eric W. Biederman authored
      I goofed when I made unshare(CLONE_NEWPID) only work in a
      single-threaded process.  There is no need for that requirement and in
      fact I analyzed things right for setns.  The hard requirement
      is for tasks that share a VM to all be in the same pid namespace, and
      we properly enforce that in do_fork.
      
      Just to be certain I took a look through do_wait and
      forget_original_parent and there are no cases that make it any harder
      for children to be in multiple pid namespaces than it is for
      children to be in the same pid namespace.  I also performed a check to
      see if there were any uses of task->nsproxy->pid_ns I was not familiar
      with, but it is only used when allocating a new pid for a new task,
      and in checks to prevent craziness from happening.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
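      The do_fork-time guard referred to above looks roughly like this (a
      condensed sketch of the 3.x-era copy_process() check, not the verbatim
      patch; pid_ns_for_children is the field name of that era):

              /* Tasks that will share signal handlers (and hence a VM)
               * with the forking task must stay in its pid namespace. */
              if (clone_flags & (CLONE_SIGHAND | CLONE_PARENT)) {
                      if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) ||
                          (task_active_pid_ns(current) !=
                           current->nsproxy->pid_ns_for_children))
                              return ERR_PTR(-EINVAL);
              }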
  3. 28 Aug 2013, 1 commit
  4. 14 Aug 2013, 1 commit
  5. 15 Jul 2013, 1 commit
    • kernel: delete __cpuinit usage from all core kernel files · 0db0628d
      Paul Gortmaker authored
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the uses of the __cpuinit macros from C files in
      the core kernel directories (kernel, init, lib, mm, and include)
      that don't really have a specific maintainer.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
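      The change itself is mechanical; a hypothetical before/after (the
      function name is illustrative, not a hunk from the commit):

              /* Before: tagged for the throwaway CPU-hotplug init section. */
              static int __cpuinit my_cpu_notify(struct notifier_block *nb,
                                                 unsigned long action, void *hcpu)
              {
                      return NOTIFY_OK;
              }

              /* After: the __cpuinit annotation is simply deleted. */
              static int my_cpu_notify(struct notifier_block *nb,
                                       unsigned long action, void *hcpu)
              {
                      return NOTIFY_OK;
              }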
  6. 11 Jul 2013, 1 commit
  7. 04 Jul 2013, 4 commits
    • kernel/fork.c:copy_process(): consolidate the lockless CLONE_THREAD checks · 18c830df
      Oleg Nesterov authored
      copy_process() does a lot of "chaotic" initializations and checks
      CLONE_THREAD twice before it takes tasklist.  In particular it sets
      "p->group_leader = p" and then changes it again under tasklist if
      !thread_group_leader(p).
      
      This looks a bit confusing, so let's create a single "if (CLONE_THREAD)"
      block which initializes ->exit_signal, ->group_leader, and ->tgid.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
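      The consolidated block ends up looking like this (lightly condensed
      from the post-patch kernel/fork.c):

              p->pid = pid_nr(pid);
              if (clone_flags & CLONE_THREAD) {
                      p->exit_signal = -1;
                      p->group_leader = current->group_leader;
                      p->tgid = current->tgid;
              } else {
                      if (clone_flags & CLONE_PARENT)
                              p->exit_signal = current->group_leader->exit_signal;
                      else
                              p->exit_signal = (clone_flags & CSIGNAL);
                      p->group_leader = p;
                      p->tgid = p->pid;
              }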
    • kernel/fork.c:copy_process(): don't add the uninitialized child to thread/task/pid lists · 81907739
      Oleg Nesterov authored
      copy_process() adds the new child to thread_group/init_task.tasks list and
      then does attach_pid(child, PIDTYPE_PID).  This means that the lockless
      next_thread() or next_task() can see this thread with the wrong pid.  Say,
      "ls /proc/pid/task" can list the same inode twice.
      
      We could move attach_pid(child, PIDTYPE_PID) up, but in this case
      find_task_by_vpid() can find the new thread before it was fully
      initialized.
      
      And this is already true for PIDTYPE_PGID/PIDTYPE_SID.  With this patch
      copy_process() initializes child->pids[*].pid first, then calls
      attach_pid() to insert the task into the pid->tasks list.

      attach_pid() no longer needs the "struct pid *" argument; it is always
      called after pid_link->pid has already been set.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
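      The resulting order, condensed (a sketch; init_task_pid() is the small
      helper the patch introduces to set child->pids[*].pid before anything
      is published):

              init_task_pid(p, PIDTYPE_PID, pid);    /* set pids[].pid first */
              if (thread_group_leader(p)) {
                      init_task_pid(p, PIDTYPE_PGID, task_pgrp(current));
                      init_task_pid(p, PIDTYPE_SID, task_session(current));
                      list_add_tail_rcu(&p->tasks, &init_task.tasks);
                      attach_pid(p, PIDTYPE_PGID);
                      attach_pid(p, PIDTYPE_SID);
              }
              attach_pid(p, PIDTYPE_PID);    /* publish to pid->tasks last */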
    • kernel/fork.c:copy_process(): unify CLONE_THREAD-or-thread_group_leader code · 80628ca0
      Oleg Nesterov authored
      Cleanup and preparation for the next changes.
      
      Move the "if (clone_flags & CLONE_THREAD)" code down under "if
      (likely(p->pid))" and turn it into the "else" branch.  This makes the
      process/thread initialization more symmetrical and removes one check.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
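      Structurally, the result is (bodies elided; a sketch of the shape, not
      the patch itself):

              if (likely(p->pid)) {
                      if (thread_group_leader(p)) {
                              /* leader: process-level initialization */
                      } else {
                              /* the former "if (clone_flags & CLONE_THREAD)"
                               * code, now the symmetric else branch */
                      }
              }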
    • fork: reorder permissions when violating number of processes limits · b57922b6
      Eric Paris authored
      When a task is attempting to violate the RLIMIT_NPROC limit we have a
      check to see if the task is sufficiently privileged.  The check first
      looks at CAP_SYS_ADMIN, then CAP_SYS_RESOURCE, then whether the task
      is uid=0.

      A result is that tasks which are allowed by the uid=0 check are first
      checked against the security subsystem.  This results in the security
      subsystem auditing a denial for sys_admin and sys_resource and then the
      task passing the uid=0 check.

      This patch rearranges the code to first check uid=0, since if we pass
      that we shouldn't hit the security subsystem at all.  We then check
      sys_resource, since it is the smallest capability which will solve the
      problem.  Lastly we check the catch-all fallback, cap_sysadmin; we
      don't want to grant this capability in many places since it is so
      powerful.
      
      This will eliminate many of the false positive/needless denial messages we
      get when a root task tries to violate the nproc limit.  (note that
      kthreads count against root, so on a sufficiently large machine we can
      actually get past the default limits before any userspace tasks are
      launched.)
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
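      After the reorder the check reads roughly as follows (a condensed
      sketch of the copy_process() hunk):

              if (atomic_read(&p->real_cred->user->processes) >=
                              task_rlimit(p, RLIMIT_NPROC)) {
                      if (p->real_cred->user != INIT_USER && /* uid==0 first */
                          !capable(CAP_SYS_RESOURCE) &&      /* targeted cap */
                          !capable(CAP_SYS_ADMIN))           /* catch-all last */
                              goto bad_fork_free;
              }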
  8. 08 May 2013, 1 commit
  9. 24 Mar 2013, 1 commit
  10. 14 Mar 2013, 1 commit
    • userns: Don't allow CLONE_NEWUSER | CLONE_FS · e66eded8
      Eric W. Biederman authored
      Don't allow sharing the root directory with processes in a
      different user namespace.  There doesn't seem to be any point, and to
      allow it would require the overhead of putting a user namespace
      reference in fs_struct (for permission checks) and incrementing that
      reference count on practically every call to fork.

      So just perform the inexpensive test of forbidding sharing fs_struct
      across processes in different user namespaces.  We already disallow
      other forms of threading when unsharing a user namespace, so this
      should be no real burden in practice.
      
      This updates setns, clone, and unshare to disallow multiple user
      namespaces sharing an fs_struct.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
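      The clone-side test is a one-liner (a sketch; setns and unshare gain
      equivalent checks):

              if ((clone_flags & (CLONE_NEWUSER | CLONE_FS)) ==
                                      (CLONE_NEWUSER | CLONE_FS))
                      return ERR_PTR(-EINVAL);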
  11. 08 Mar 2013, 1 commit
    • cputime: Dynamically scale cputime for full dynticks accounting · 9fbc42ea
      Frederic Weisbecker authored
      The full dynticks cputime accounting is able to account either
      using the tick or the context tracking subsystem.  This way
      the housekeeping CPU can keep the low-overhead tick-based
      solution.

      The latter mode has a coarse, jiffies-resolution granularity and
      needs to be scaled against CFS precise runtime accounting to
      improve its result.  We already do this for CONFIG_TICK_CPU_ACCOUNTING;
      now we also need to expand it to the full dynticks accounting
      dynamic off-case as well.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Mats Liljegren <mats.liljegren@enea.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
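      The scaling itself is the familiar proportional split (a sketch
      modeled on kernel/sched/cputime.c of this era; names approximate):

              total = utime + stime;          /* coarse, tick-sampled */
              if (total)
                      /* give utime its share of the precise CFS runtime */
                      utime = div64_u64(rtime * utime, total);
              else
                      utime = rtime;
              stime = rtime - utime;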
  12. 04 Mar 2013, 1 commit
  13. 28 Feb 2013, 1 commit
  14. 23 Feb 2013, 1 commit
  15. 28 Jan 2013, 1 commit
    • cputime: Safely read cputime of full dynticks CPUs · 6a61671b
      Frederic Weisbecker authored
      While remotely reading the cputime of a task running on a
      full dynticks CPU, the values stored in the utime/stime fields
      of struct task_struct may be stale.  They may hold the values
      of the last kernel <-> user transition snapshot, and
      we need to add the tickless time spent since that snapshot.
      
      To fix this, flush the cputime of the dynticks CPUs on
      kernel <-> user transition and record the time / context
      where we did this. Then on top of this snapshot and the current
      time, perform the fixup on the reader side from task_times()
      accessors.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [fixed kvm module related build errors]
      Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
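      On the reader side the fixup is retried under a seqcount (an
      illustrative sketch; field names follow the commit's vtime_* naming):

              do {
                      seq = read_seqbegin(&t->vtime_seqlock);
                      utime = t->utime;
                      stime = t->stime;
                      /* add the tickless delta since the last snapshot */
                      delta = vtime_delta(t);
                      if (t->vtime_snap_whence == VTIME_USER)
                              utime += delta;
                      else if (t->vtime_snap_whence == VTIME_SYS)
                              stime += delta;
              } while (read_seqretry(&t->vtime_seqlock, seq));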
  16. 20 Jan 2013, 1 commit
  17. 25 Dec 2012, 1 commit
  18. 20 Dec 2012, 1 commit
  19. 19 Dec 2012, 1 commit
    • fork: protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs · 2ad306b1
      Glauber Costa authored
      Because those architectures will draw their stacks directly from the page
      allocator, rather than the slab cache, we can directly pass the
      __GFP_KMEMCG flag and issue the corresponding free_pages.
      
      This code path is taken when the architecture doesn't define
      CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
      THREAD_SIZE >= PAGE_SIZE.  Luckily, most - if not all - of the remaining
      architectures fall in this category.
      
      This will guarantee that every stack page is accounted to the memcg the
      process currently lives in, and will have the allocations fail if they
      go over the limit.

      For the time being, I am defining a new variant of THREADINFO_GFP, so as
      not to mess with the other path.  Once the slab is also tracked by memcg,
      we can get rid of that flag.
      
      Tested to successfully protect against :(){ :|:& };:
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
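      The new variant and its use in the stack allocator (condensed from the
      commit; in later kernels __GFP_KMEMCG was superseded by __GFP_ACCOUNT):

              #define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)

              static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
                                                                int node)
              {
                      struct page *page = alloc_pages_node(node,
                                                           THREADINFO_GFP_ACCOUNTED,
                                                           THREAD_SIZE_ORDER);

                      return page ? page_address(page) : NULL;
              }

              static inline void free_thread_info(struct thread_info *ti)
              {
                      free_memcg_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
              }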
  20. 11 Dec 2012, 1 commit
    • mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node · 5bca2303
      Mel Gorman authored
      Because migrations are driven by the CPU a task is running on, there
      is no point tracking NUMA faults until one task runs on a new node.
      This patch tracks the first node used by an address space.  Until it
      changes, PTE scanning is disabled and no NUMA hinting faults are
      trapped.  This should help workloads that are short-lived, do not care
      about NUMA placement, or have bound themselves to a single node.
      
      This takes advantage of the logic in "mm: sched: numa: Implement slow
      start for working set sampling" to delay when the checks are made. This
      will take advantage of processes that set their CPU and node bindings
      early in their lifetime. It will also potentially allow any initial load
      balancing to take place.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
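      The gating test then looks like this (a sketch condensed from the
      commit's task_numa_work() hunk):

              if (mm->first_nid == NUMA_PTE_SCAN_INIT)
                      mm->first_nid = numa_node_id();
              if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
                      /* still on the address space's first node? */
                      if (numa_node_id() == mm->first_nid)
                              return;         /* skip PTE scanning */
                      mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
              }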
  21. 29 Nov 2012, 6 commits
  22. 20 Nov 2012, 1 commit
  23. 19 Nov 2012, 5 commits
    • userns: Allow unprivileged users to create user namespaces. · 5eaf563e
      Eric W. Biederman authored
      Now that we have been through every permission check in the kernel,
      having uid == 0 and gid == 0 in your local user namespace no
      longer adds any special privileges.  Even having a full set
      of caps in your local user namespace is safe, because capabilities
      are relative to your local user namespace and do not confer
      unexpected privileges.

      Over the long term this should allow much more of the kernel's
      functionality to be safely used by non-root users.  Functionality
      like unsharing the mount namespace is only unsafe because
      it can fool applications whose privileges are raised when they
      are executed.  Since those applications have no privileges in
      a user namespace, it becomes safe to spoof and confuse those
      applications all you want.

      Those capabilities will still need to be enabled carefully, because
      we may still need things like rlimits on the number of unprivileged
      mounts, but that is to avoid DoS attacks, not to avoid fooling
      root-owned processes.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • pidns: Support unsharing the pid namespace. · 50804fe3
      Eric W. Biederman authored
      Unsharing of the pid namespace, unlike unsharing of other namespaces,
      does not take effect immediately.  Instead it affects the children
      created with fork and clone.  The first of these children becomes the init
      process of the new pid namespace, the rest become oddball children
      of pid 0.  From the point of view of the new pid namespace the process
      that created it is pid 0, as its pid does not map.

      A couple of different semantics were considered, but this one was
      settled on because it is easy to implement and it is usable from pam
      modules, which is the core reason for the existence of unshare.
      
      I took a survey of the callers of pam modules and the following
      appears to be a representative sample of their logic:
      {
              /* setup, including pam */
              child = fork();
              if (!child) {
                      setuid(uid);
                      execl("/bin/bash", "bash", (char *)NULL);
              }
              waitpid(child, &status, 0);

              /* pam and other cleanup */
      }
      
      As you can see there is a fork to create the unprivileged user
      space process, which means that the unprivileged user space
      process will appear as pid 1 in the new pid namespace.  Further,
      most login processes do not cope with extraneous children, which
      means shifting the duty of reaping extraneous child processes to
      the creator of those extraneous children makes the system more
      comprehensible.

      The practical reason for this set of pid namespace semantics is
      that it is simple to implement and verify they work correctly,
      whereas an implementation that requires changing the struct
      pid on a process comes with a lot more races and pain.  Not
      the least of which is that glibc caches getpid().
      
      These semantics are implemented by having two notions
      of the pid namespace of a process.  There is task_active_pid_ns,
      which is both the pid namespace the process was created with
      and the pid namespace in which all pids are presented to
      that process.  The task_active_pid_ns is stored
      in the struct pid of the task.

      Then there is the pid namespace that will be used for children;
      that pid namespace is stored in task->nsproxy->pid_ns.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
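      From userspace the semantics look like this (an illustrative standalone
      demo, not part of the commit; requires the appropriate privilege to
      create a pid namespace):

              #define _GNU_SOURCE
              #include <sched.h>
              #include <stdio.h>
              #include <sys/wait.h>
              #include <unistd.h>

              int main(void)
              {
                      if (unshare(CLONE_NEWPID) < 0) {  /* caller keeps its pid */
                              perror("unshare");
                              return 1;
                      }
                      pid_t child = fork();  /* first child is the new init */
                      if (child == 0) {
                              printf("pid in new ns: %d\n", (int)getpid()); /* 1 */
                              return 0;
                      }
                      waitpid(child, NULL, 0);
                      return 0;
              }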
    • pidns: Consolidate initialization of special init task state · 1c4042c2
      Eric W. Biederman authored
      Instead of setting child_reaper and SIGNAL_UNKILLABLE one way
      for the system init process, and another way for pid namespace
      init processes, test pid->nr == 1 and use the same code for both.
      
      For the global init this results in SIGNAL_UNKILLABLE being set
      much earlier in the initialization process.
      
      This is a small cleanup and it paves the way for allowing unshare and
      enter of the pid namespace as that path like our global init also will
      not set CLONE_NEWPID.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
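      The unified check, as a sketch (is_child_reaper() simply tests whether
      the pid reads as 1 in its own namespace):

              if (is_child_reaper(pid)) {
                      ns_of_pid(pid)->child_reaper = p;
                      p->signal->flags |= SIGNAL_UNKILLABLE;
              }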
    • pidns: Make the pidns proc mount/umount logic obvious. · 0a01f2cc
      Eric W. Biederman authored
      Track the number of pids in the proc hash table.  When the number of
      pids goes to 0 schedule work to unmount the kernel mount of proc.
      
      Move the mount of proc into alloc_pid when we allocate the pid for
      init.
      
      Remove the surprising calls of pid_ns_release_proc in fork and
      proc_flush_task.  Those code paths really shouldn't know about proc
      namespace implementation details, and people have demonstrated several
      times that finding and understanding those code paths is difficult and
      non-obvious.
      
      Because of the call path, detach_pid is always called with the
      rtnl_lock held, so free_pid is not allowed to sleep and the work of
      unmounting proc is moved to a work queue.  This has the side benefit
      of not blocking the entire world waiting for the unnecessary
      rcu_barrier in deactivate_locked_super.
      
      In the process of making the code clear and obvious this fixes a bug
      reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a
      mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
      succeeded and copy_net_ns failed.
      Acked-by: N"Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
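      The deferral in outline (a sketch; the commit hangs a work_struct off
      struct pid_namespace):

              static void proc_cleanup_work(struct work_struct *work)
              {
                      struct pid_namespace *ns =
                              container_of(work, struct pid_namespace, proc_work);
                      pid_ns_release_proc(ns);  /* kern_unmount(ns->proc_mnt) */
              }

              /* in free_pid(), where sleeping is forbidden: */
              if (--ns->nr_hashed == 0)
                      schedule_work(&ns->proc_work);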
    • pidns: Use task_active_pid_ns where appropriate · 17cf22c3
      Eric W. Biederman authored
      The expressions tsk->nsproxy->pid_ns and task_active_pid_ns,
      aka ns_of_pid(task_pid(tsk)), should have the same number of
      cache line misses, with the practical difference that
      ns_of_pid(task_pid(tsk)) is released later in a process's life.

      Furthermore, by using task_active_pid_ns it becomes trivial
      to write an unshare implementation for the pid namespace.
      
      So I have used task_active_pid_ns everywhere I can.
      
      In fork, since the pid has not yet been attached to the
      process, I use ns_of_pid to achieve the same effect.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
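      For reference, the accessor is just (as defined in kernel/pid.c of
      this era):

              struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
              {
                      return ns_of_pid(task_pid(tsk));
              }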
  24. 16 Nov 2012, 1 commit
    • uprobes: Use percpu_rw_semaphore to fix register/unregister vs dup_mmap() race · 32cdba1e
      Oleg Nesterov authored
      This was always racy, but 26872090
      "uprobes: Rework register_for_each_vma() to make it O(n)" should be
      blamed anyway: it made everything worse, and I didn't notice.
      
      register/unregister call build_map_info() and then install/remove the
      breakpoint for every mm which mmaps inode/offset.  This can obviously
      race with fork()->dup_mmap() in between, and we can miss the child.
      
      uprobe_register() could be easily fixed, but unregister is much worse:
      the new mm inherits "int3" from the parent and there is no way to
      detect this if the uprobe goes away.
      
      So this patch simply adds percpu_down_read/up_read around dup_mmap(),
      and percpu_down_write/up_write into register_for_each_vma().
      
      This adds 2 new hooks into dup_mmap() but we can kill uprobe_dup_mmap()
      and fold it into uprobe_end_dup_mmap().
      Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
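      The locking added, in outline (a condensed sketch of the commit):

              static struct percpu_rw_semaphore dup_mmap_sem;

              /* hooks called around dup_mmap() in kernel/fork.c */
              void uprobe_start_dup_mmap(void)
              {
                      percpu_down_read(&dup_mmap_sem);
              }

              void uprobe_end_dup_mmap(void)
              {
                      percpu_up_read(&dup_mmap_sem);
              }

              /* register_for_each_vma() excludes concurrent dup_mmap(): */
              percpu_down_write(&dup_mmap_sem);
              /* ... install/remove breakpoints in every mm mapping the
               * inode/offset ... */
              percpu_up_write(&dup_mmap_sem);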
  25. 17 Oct 2012, 1 commit
    • cgroup: cgroup_subsys->fork() should be called after the task is added to css_set · 5edee61e
      Tejun Heo authored
      cgroup core has a bug which violates a basic rule about event
      notifications - when a new entity needs to be added, you add that to
      the notification list first and then make the new entity conform to
      the current state.  If done in the reverse order, an event happening
in between will be lost.
      
      cgroup_subsys->fork() is invoked way before the new task is added to
      the css_set.  Currently, cgroup_freezer is the only user of ->fork()
      and uses it to make new tasks conform to the current state of the
      freezer.  If FROZEN state is requested while fork is in progress
      between cgroup_fork_callbacks() and cgroup_post_fork(), the child
      could escape freezing - the cgroup isn't frozen when ->fork() is
      called and the freezer couldn't see the new task on the css_set.
      
      This patch moves cgroup_subsys->fork() invocation to
      cgroup_post_fork() after the new task is added to the css_set.
      cgroup_fork_callbacks() is removed.
      
      Because now a task may be migrated during cgroup_subsys->fork(),
      freezer_fork() is updated so that it adheres to the usual RCU locking,
      and the rather pointless comment on why locking can be different there
      is removed (if it doesn't make anything simpler, why even bother?).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: stable@vger.kernel.org
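      The fixed ordering, in outline (a simplified sketch of
      cgroup_post_fork(); locking details elided):

              void cgroup_post_fork(struct task_struct *child)
              {
                      int i;

                      /* 1. publish: add the child to its css_set first ... */
                      write_lock(&css_set_lock);
                      list_add(&child->cg_list, &child->cgroups->tasks);
                      write_unlock(&css_set_lock);

                      /* 2. ... then let subsystems make the child conform */
                      for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
                              struct cgroup_subsys *ss = subsys[i];

                              if (ss->fork)
                                      ss->fork(child);
                      }
              }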
  26. 09 Oct 2012, 1 commit
    • mm: interval tree updates · 9826a516
      Michel Lespinasse authored
      Update the generic interval tree code that was introduced in "mm: replace
      vma prio_tree with an interval tree".
      
      Changes:
      
      - fixed 'endpoing' typo noticed by Andrew Morton
      
      - replaced include/linux/interval_tree_tmpl.h, which was used as a
        template (including it automatically defined the interval tree
        functions) with include/linux/interval_tree_generic.h, which only
        defines a preprocessor macro INTERVAL_TREE_DEFINE(), which itself
        defines the interval tree functions when invoked. Now that is a very
        long macro which is unfortunate, but it does make the usage sites
        (lib/interval_tree.c and mm/interval_tree.c) a bit nicer than previously.
      
      - make use of RB_DECLARE_CALLBACKS() in the INTERVAL_TREE_DEFINE() macro,
        instead of duplicating that code in the interval tree template.
      
      - replaced vma_interval_tree_add(), which was actually handling the
        nonlinear and interval tree cases, with vma_interval_tree_insert_after()
        which handles only the interval tree case and has an API that is more
        consistent with the other interval tree handling functions.
        The nonlinear case is now handled explicitly in kernel/fork.c dup_mmap().
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
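      Usage of the new macro, for flavor (this mirrors lib/interval_tree.c
      of this era; invoking it emits the insert/remove/iterate functions
      with the given prefix):

              #include <linux/interval_tree_generic.h>

              #define START(node) ((node)->start)
              #define LAST(node)  ((node)->last)

              INTERVAL_TREE_DEFINE(struct interval_tree_node, /* node type */
                                   rb,              /* rb_node member */
                                   unsigned long,   /* endpoint type */
                                   __subtree_last,  /* augmented field */
                                   START, LAST,     /* endpoint accessors */
                                    , interval_tree) /* linkage, fn prefix */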