1. 20 10月, 2007 40 次提交
    • E
      Fix tsk->exit_state usage · 270f722d
      Eugene Teo 提交于
      tsk->exit_state can only be 0, EXIT_ZOMBIE, or EXIT_DEAD.  A non-zero test
      is the same as tsk->exit_state & (EXIT_ZOMBIE | EXIT_DEAD), so just testing
      tsk->exit_state is sufficient.
      Signed-off-by: NEugene Teo <eugeneteo@kernel.sg>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      270f722d
    • P
      Fix cpusets update_cpumask · 8707d8b8
      Paul Menage 提交于
      Cause writes to cpuset "cpus" file to update cpus_allowed for member tasks:
      
      - collect batches of tasks under tasklist_lock and then call
        set_cpus_allowed() on them outside the lock (since this can sleep).
      
      - add a simple generic priority heap type to allow efficient collection
        of batches of tasks to be processed without duplicating or missing any
        tasks in subsequent batches.
      
      - make "cpus" file update a no-op if the mask hasn't changed
      
      - fix race between update_cpumask() and sched_setaffinity() by making
        sched_setaffinity() post-check that it's not running on any cpus outside
        cpuset_cpus_allowed().
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8707d8b8
    • P
      cpusets: decrustify cpuset mask update code · 020958b6
      Paul Jackson 提交于
      Decrustify the kernel/cpuset.c 'cpus' and 'mems' updating code.
      
      Other than subtle improvements in the consistency of identifying
      white space at the beginning and end of passed in masks, this
      doesn't make any visible difference in behaviour.  But it's
      one or two hundred kernel text bytes smaller, and easier to
      understand.
      
      [akpm@linux-foundation.org: coding-style fix]
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Reviewed-by: NPaul Menage <menage@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      020958b6
    • P
      cpuset sched_load_balance flag · 029190c5
      Paul Jackson 提交于
      Add a new per-cpuset flag called 'sched_load_balance'.
      
      When enabled in a cpuset (the default value) it tells the kernel scheduler
      that the scheduler should provide the normal load balancing on the CPUs in
      that cpuset, sometimes moving tasks from one CPU to a second CPU if the
      second CPU is less loaded and if that task is allowed to run there.
      
      When disabled (write "0" to the file) then it tells the kernel scheduler
      that load balancing is not required for the CPUs in that cpuset.
      
      Now even if this flag is disabled for some cpuset, the kernel may still
      have to load balance some or all the CPUs in that cpuset, if some
      overlapping cpuset has its sched_load_balance flag enabled.
      
      If there are some CPUs that are not in any cpuset whose sched_load_balance
      flag is enabled, the kernel scheduler will not load balance tasks to those
      CPUs.
      
      Moreover the kernel will partition the 'sched domains' (non-overlapping
      sets of CPUs over which load balancing is attempted) into the finest
      granularity partition that it can find, while still keeping any two CPUs
      that are in the same shed_load_balance enabled cpuset in the same element
      of the partition.
      
      This serves two purposes:
       1) It provides a mechanism for real time isolation of some CPUs, and
       2) it can be used to improve performance on systems with many CPUs
          by supporting configurations in which load balancing is not done
          across all CPUs at once, but rather only done in several smaller
          disjoint sets of CPUs.
      
      This mechanism replaces the earlier overloading of the per-cpuset
      flag 'cpu_exclusive', which overloading was removed in an earlier
      patch: cpuset-remove-sched-domain-hooks-from-cpusets
      
      See further the Documentation and comments in the code itself.
      
      [akpm@linux-foundation.org: don't be weird]
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      029190c5
    • P
      Uninline the task_xid_nr_ns() calls · 2f2a3a46
      Pavel Emelyanov 提交于
      Since these are expanded into call to pid_nr_ns() anyway, it's OK to move
      the whole routine out-of-line.  This is a cheap way to save ~100 bytes from
      vmlinux.  Together with the previous two patches, it saves half-a-kilo from
      the vmlinux.
      
      Un-inline other (currently inlined) functions must be done with additional
      performance testing.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2f2a3a46
    • P
      Uninline find_pid etc set of functions · 8990571e
      Pavel Emelyanov 提交于
      The find_pid/_vpid/_pid_ns functions are used to find the struct pid by its
      id, depending on whic id - global or virtual - is used.
      
      The find_vpid() is a macro that pushes the current->nsproxy->pid_ns on the
      stack to call another function - find_pid_ns().  It turned out, that this
      dereference together with the push itself cause the kernel text size to
      grow too much.
      
      Move all these out-of-line.  Together with the previous patch this saves a
      bit less that 400 bytes from .text section.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8990571e
    • P
      Isolate some explicit usage of task->tgid · bac0abd6
      Pavel Emelyanov 提交于
      With pid namespaces this field is now dangerous to use explicitly, so hide
      it behind the helpers.
      
      Also the pid and pgrp fields o task_struct and signal_struct are to be
      deprecated.  Unfortunately this patch cannot be sent right now as this
      leads to tons of warnings, so start isolating them, and deprecate later.
      
      Actually the p->tgid == pid has to be changed to has_group_leader_pid(),
      but Oleg pointed out that in case of posix cpu timers this is the same, and
      thread_group_leader() is more preferable.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Acked-by: NOleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bac0abd6
    • P
      pid namespaces: remove the struct pid unneeded fields · 19b9b9b5
      Pavel Emelyanov 提交于
      Since we've switched from using pid->nr to pid->upids->nr some
      fields on struct pid are no longer needed
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19b9b9b5
    • P
      Uninline find_task_by_xxx set of functions · 228ebcbe
      Pavel Emelyanov 提交于
      The find_task_by_something is a set of macros are used to find task by pid
      depending on what kind of pid is proposed - global or virtual one.  All of
      them are wrappers above the most generic one - find_task_by_pid_type_ns() -
      and just substitute some args for it.
      
      It turned out, that dereferencing the current->nsproxy->pid_ns construction
      and pushing one more argument on the stack inline cause kernel text size to
      grow.
      
      This patch moves all this stuff out-of-line into kernel/pid.c.  Together
      with the next patch it saves a bit less than 400 bytes from the .text
      section.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      228ebcbe
    • P
      pid namespaces: changes to show virtual ids to user · b488893a
      Pavel Emelyanov 提交于
      This is the largest patch in the set. Make all (I hope) the places where
      the pid is shown to or get from user operate on the virtual pids.
      
      The idea is:
       - all in-kernel data structures must store either struct pid itself
         or the pid's global nr, obtained with pid_nr() call;
       - when seeking the task from kernel code with the stored id one
         should use find_task_by_pid() call that works with global pids;
       - when showing pid's numerical value to the user the virtual one
         should be used, but however when one shows task's pid outside this
         task's namespace the global one is to be used;
       - when getting the pid from userspace one need to consider this as
         the virtual one and use appropriate task/pid-searching functions.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: nuther build fix]
      [akpm@linux-foundation.org: yet nuther build fix]
      [akpm@linux-foundation.org: remove unneeded casts]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NAlexey Dobriyan <adobriyan@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b488893a
    • S
      pid namespaces: destroy pid namespace on init's death · 3eb07c8c
      Sukadev Bhattiprolu 提交于
      Terminate all processes in a namespace when the reaper of the namespace is
      exiting.  We do this by walking the pidmap of the namespace and sending
      SIGKILL to all processes.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Acked-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3eb07c8c
    • S
      pid namespaces: allow signalling cgroup-init · 0fbc26a6
      Sukadev Bhattiprolu 提交于
      Only the global-init process must be special - any other cgroup-init
      process must be killable to prevent run-away processes in the system.
      
      TODO: 	Ideally we should allow killing the cgroup-init only from parent
      	cgroup and prevent it being killed from within the cgroup.
      	But that is a more complex change and will be addressed by a follow-on
      	patch. For now allow the cgroup-init to be terminated by any process
      	with sufficient privileges.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Acked-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0fbc26a6
    • S
      pid namespaces: create a slab-cache for 'struct pid_namespace' · c9c5d922
      Sukadev Bhattiprolu 提交于
      This will help fixing memory leaks due to bad reference counting.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9c5d922
    • P
      pid namespaces: initialize the namespace's proc_mnt · 6f4e6433
      Pavel Emelyanov 提交于
      The namespace's proc_mnt must be kern_mount-ed to make this pointer always
      valid, independently of whether the user space mounted the proc or not.  This
      solves raced in proc_flush_task, etc.  with the proc_mnt switching from NULL
      to not-NULL.
      
      The initialization is done after the init's pid is created and hashed to make
      proc_get_sb() finr it and get for root inode.
      
      Sice the namespace holds the vfsmnt, vfsmnt holds the superblock and the
      superblock holds the namespace we must explicitly break this circle to destroy
      all the stuff.  This is done after the init of the namespace dies.  Running a
      few steps forward - when init exits it will kill all its children, so no
      proc_mnt will be needed after its death.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f4e6433
    • P
      pid namespaces: allow cloning of new namespace · 30e49c26
      Pavel Emelyanov 提交于
      When clone() is invoked with CLONE_NEWPID, create a new pid namespace and then
      create a new struct pid for the new process.  Allocate pid_t's for the new
      process in the new pid namespace and all ancestor pid namespaces.  Make the
      newly cloned process the session and process group leader.
      
      Since the active pid namespace is special and expected to be the first entry
      in pid->upid_list, preserve the order of pid namespaces.
      
      The size of 'struct pid' is dependent on the the number of pid namespaces the
      process exists in, so we use multiple pid-caches'.  Only one pid cache is
      created during system startup and this used by processes that exist only in
      init_pid_ns.
      
      When a process clones its pid namespace, we create additional pid caches as
      necessary and use the pid cache to allocate 'struct pids' for that depth.
      
      Note, that with this patch the newly created namespace won't work, since the
      rest of the kernel still uses global pids, but this is to be fixed soon.  Init
      pid namespace still works.
      
      [oleg@tv-sign.ru: merge fix]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      30e49c26
    • P
      pid namespaces: miscellaneous preparations for pid namespaces · b461cc03
      Pavel Emelyanov 提交于
      * remove pid.h from pid_namespaces.h;
      * rework is_(cgroup|global)_init;
      * optimize (get|put)_pid_ns for init_pid_ns;
      * declare task_child_reaper to return actual reaper.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b461cc03
    • P
      pid namespaces: move alloc_pid() lower in copy_process() · 425fb2b4
      Pavel Emelyanov 提交于
      When we create new namespace we will need to allocate the struct pid, that
      will have one extra struct upid in array, comparing to the parent.
      
      Thus we need to know the new namespace (if any) in alloc_pid() to init this
      struct upid properly, so move the alloc_pid() call lower in copy_process().
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      425fb2b4
    • P
      pid namespaces: helpers to find the task by its numerical ids · 198fe21b
      Pavel Emelyanov 提交于
      When searching the task by numerical id on may need to find it using global
      pid (as it is done now in kernel) or by its virtual id, e.g.  when sending a
      signal to a task from one namespace the sender will specify the task's virtual
      id and we should find the task by this value.
      
      [akpm@linux-foundation.org: fix gfs2 linkage]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      198fe21b
    • P
      pid namespaces: helpers to obtain pid numbers · 7af57294
      Pavel Emelyanov 提交于
      When showing pid to user or getting the pid numerical id for in-kernel use the
      value of this id may differ depending on the namespace.
      
      This set of helpers is used to get the global pid nr, the virtual (i.e.  seen
      by task in its namespace) nr and the nr as it is seen from the specified
      namespace.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7af57294
    • P
      pid namespaces: make alloc_pid(), free_pid() and put_pid() work with struct upid · 8ef047aa
      Pavel Emelyanov 提交于
      Each struct upid element of struct pid has to be initialized properly, i.e.
      its nr mst be allocated from appropriate pidmap and ns set to appropriate
      namespace.
      
      When allocating a new pid, we need to know the namespace this pid will live
      in, so the additional argument is added to alloc_pid().
      
      On the other hand, the rest of the kernel still uses the pid->nr and
      pid->pid_chain fields, so these ones are still initialized, but this will be
      removed soon.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ef047aa
    • P
      pid namespaces: add support for pid namespaces hierarchy · faacbfd3
      Pavel Emelyanov 提交于
      Each namespace has a parent and is characterized by its "level".  Level is the
      number of the namespace generation.  E.g.  init namespace has level 0, after
      cloning new one it will have level 1, the next one - 2 and so on and so forth.
       This level is not explicitly limited.
      
      True hierarchy must have some way to find each namespace's children, but it is
      not used in the patches, so this ability is not added (yet).
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      faacbfd3
    • P
      pid namespaces: prepare proc_flust_task() to flush entries from multiple proc trees · 60347f67
      Pavel Emelyanov 提交于
      The first part is trivial - we just make the proc_flush_task() to operate on
      arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to
      it.
      
      The other change is more tricky: I moved the proc_flush_task() call in
      release_task() higher to address the following problem.
      
      When flushing task from many proc trees we need to know the set of ids (not
      just one pid) to find the dentries' names to flush.  Thus we need to pass the
      task's pid to proc_flush_task() as struct pid is the only object that can
      provide all the pid numbers.  But after __exit_signal() task has detached all
      his pids and this information is lost.
      
      This creates a tiny gap for proc_pid_lookup() to bring some dentries back to
      tree and keep them in hash (since pids are still alive before __exit_signal())
      till the next shrink, but since proc_flush_task() does not provide a 100%
      guarantee that the dentries will be flushed, this is OK to do so.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60347f67
    • P
      pid namespaces: move exit_task_namespaces() · 2e4a7072
      Pavel Emelyanov 提交于
      Make task release its namespaces after it has reparented all his children to
      child_reaper, but before it notifies its parent about its death.
      
      The reason to release namespaces after reparenting is that when task exits it
      may send a signal to its parent (SIGCHLD), but if the parent has already
      exited its namespaces there will be no way to decide what pid to dever to him
      - parent can be from different namespace.
      
      The reason to release namespace before notifying the parent it that when task
      sends a SIGCHLD to parent it can call wait() on this taks and release it.  But
      releasing the mnt namespace implies dropping of all the mounts in the mnt
      namespace and NFS expects the task to have valid sighand pointer.
      
      Thanks to Oleg for pointing out some races that can apear and helping with
      patches and fixes.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e4a7072
    • O
      pid namespaces: rework forget_original_parent() · 762a24be
      Oleg Nesterov 提交于
      A pid namespace is a "view" of a particular set of tasks on the system.  They
      work in a similar way to filesystem namespaces.  A file (or a process) can be
      accessed in multiple namespaces, but it may have a different name in each.  In
      a filesystem, this name might be /etc/passwd in one namespace, but
      /chroot/etc/passwd in another.
      
      For processes, a process may have pid 1234 in one namespace, but be pid 1 in
      another.  This allows new pid namespaces to have basically arbitrary pids, and
      not have to worry about what pids exist in other namespaces.  This is
      essential for checkpoint/restart where a restarted process's pid might collide
      with an existing process on the system's pid.
      
      In this particular implementation, pid namespaces have a parent-child
      relationship, just like processes.  A process in a pid namespace may see all
      of the processes in the same namespace, as well as all of the processes in all
      of the namespaces which are children of its namespace.  Processes may not,
      however, see others which are in their parent's namespace, but not in their
      own.  The same goes for sibling namespaces.
      
      The know issue to be solved in the nearest future is signal handling in the
      namespace boundary.  That is, currently the namespace's init is treated like
      an ordinary task that can be killed from within an namespace.  Ideally, the
      signal handling by the namespace's init should have two sides: when signaling
      the init from its namespace, the init should look like a real init task, i.e.
      receive only those signals, that is explicitly wants to; when signaling the
      init from one of the parent namespaces, init should look like an ordinary
      task, i.e.  receive any signal, only taking the general permissions into
      account.
      
      The pid namespace was developed by Pavel Emlyanov and Sukadev Bhattiprolu and
      we eventually came to almost the same implementation, which differed in some
      details.  This set is based on Pavel's patches, but it includes comments and
      patches that from Sukadev.
      
      Many thanks to Oleg, who reviewed the patches, pointed out many BUGs and made
      valuable advises on how to make this set cleaner.
      
      This patch:
      
      We have to call exit_task_namespaces() only after the exiting task has
      reparented all his children and is sure that no other threads will reparent
      theirs for it.  Why this is needed is explained in appropriate patch.  This
      one only reworks the forget_original_parent() so that after calling this a
      task cannot be/become parent of any other task.
      
      We check PF_EXITING instead of ->exit_state while choosing the new parent.
      Note that tasklits_lock acts as a barrier, everyone who takes tasklist after
      us (when forget_original_parent() drops it) must see PF_EXITING.
      
      The other changes are just cleanups.  They just move some code from
      exit_notify to forget_original_parent().  It is a bit silly to declare
      ptrace_dead in exit_notify(), take tasklist, pass ptrace_dead to
      forget_original_parent(), unlock-lock-unlock tasklist, and then use
      ptrace_dead.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      762a24be
    • D
      d4c5e41f
    • M
      kernel/time/clocksource.c: Use list_for_each_entry instead of list_for_each · 2e197586
      Matthias Kaehlcke 提交于
      kernel/time/clocksource.c: Convert list_for_each to
      list_for_each_entry in clocksource_resume(),
      sysfs_override_clocksource() and show_available_clocksources()
      Signed-off-by: NMatthias Kaehlcke <matthias.kaehlcke@gmail.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e197586
    • M
      kernel/exit.c: Use list_for_each_entry(_safe) instead of list_for_each(_safe) · 03ff1797
      Matthias Kaehlcke 提交于
      kernel/exit.c: Convert list_for_each(_safe) to
      list_for_each_entry(_safe) in forget_original_parent(), exit_notify()
      and do_wait()
      Signed-off-by: NMatthias Kaehlcke <matthias.kaehlcke@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      03ff1797
    • J
      workqueue: debug flushing deadlocks with lockdep · 4e6045f1
      Johannes Berg 提交于
      In the following scenario:
      
      code path 1:
        my_function() -> lock(L1); ...; flush_workqueue(); ...
      
      code path 2:
        run_workqueue() -> my_work() -> ...; lock(L1); ...
      
      you can get a deadlock when my_work() is queued or running
      but my_function() has acquired L1 already.
      
      This patch adds a pseudo-lock to each workqueue to make lockdep
      warn about this scenario.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NJohannes Berg <johannes@sipsolutions.net>
      Acked-by: NOleg Nesterov <oleg@tv-sign.ru>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e6045f1
    • P
      Make access to task's nsproxy lighter · cf7b708c
      Pavel Emelyanov 提交于
      When someone wants to deal with some other taks's namespaces it has to lock
      the task and then to get the desired namespace if the one exists.  This is
      slow on read-only paths and may be impossible in some cases.
      
      E.g.  Oleg recently noticed a race between unshare() and the (sent for
      review in cgroups) pid namespaces - when the task notifies the parent it
      has to know the parent's namespace, but taking the task_lock() is
      impossible there - the code is under write locked tasklist lock.
      
      On the other hand switching the namespace on task (daemonize) and releasing
      the namespace (after the last task exit) is rather rare operation and we
      can sacrifice its speed to solve the issues above.
      
      The access to other task namespaces is proposed to be performed
      like this:
      
           rcu_read_lock();
           nsproxy = task_nsproxy(tsk);
           if (nsproxy != NULL) {
                   / *
                     * work with the namespaces here
                     * e.g. get the reference on one of them
                     * /
           } / *
               * NULL task_nsproxy() means that this task is
               * almost dead (zombie)
               * /
           rcu_read_unlock();
      
      This patch has passed the review by Eric and Oleg :) and,
      of course, tested.
      
      [clg@fr.ibm.com: fix unshare()]
      [ebiederm@xmission.com: Update get_net_ns_by_pid]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: NCedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf7b708c
    • S
      pid namespaces: move alloc_pid() to copy_process() · a6f5e063
      Sukadev Bhattiprolu 提交于
      Move alloc_pid() into copy_process().  This will keep all pid and pid
      namespace code together and simplify error handling when we support multiple
      pid namespaces.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Herbert Poetzel <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6f5e063
    • S
      pid namespaces: define is_global_init() and is_container_init() · b460cbc5
      Serge E. Hallyn 提交于
      is_init() is an ambiguous name for the pid==1 check.  Split it into
      is_global_init() and is_container_init().
      
      A cgroup init has it's tsk->pid == 1.
      
      A global init also has it's tsk->pid == 1 and it's active pid namespace
      is the init_pid_ns.  But rather than check the active pid namespace,
      compare the task structure with 'init_pid_ns.child_reaper', which is
      initialized during boot to the /sbin/init process and never changes.
      
      Changelog:
      
      	2.6.22-rc4-mm2-pidns1:
      	- Use 'init_pid_ns.child_reaper' to determine if a given task is the
      	  global init (/sbin/init) process. This would improve performance
      	  and remove dependence on the task_pid().
      
      	2.6.21-mm2-pidns2:
      
      	- [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
      	  ppc,avr32}/traps.c for the _exception() call to is_global_init().
      	  This way, we kill only the cgroup if the cgroup's init has a
      	  bug rather than force a kernel panic.
      
      [akpm@linux-foundation.org: fix comment]
      [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
      [bunk@stusta.de: kernel/pid.c: remove unused exports]
      [sukadev@us.ibm.com: Fix capability.c to work with threaded init]
      Signed-off-by: NSerge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Acked-by: NPavel Emelianov <xemul@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Herbert Poetzel <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b460cbc5
    • S
      pid namespaces: rename child_reaper() function · 88f21d81
      Sukadev Bhattiprolu 提交于
      Rename the child_reaper() function to task_child_reaper() to be similar to
      other task_* functions and to distinguish the function from 'struct
      pid_namspace.child_reaper'.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Herbert Poetzel <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88f21d81
    • S
      pid namespaces: define and use task_active_pid_ns() wrapper · 2894d650
      Sukadev Bhattiprolu 提交于
      With multiple pid namespaces, a process is known by some pid_t in every
      ancestor pid namespace.  Every time the process forks, the child process also
      gets a pid_t in every ancestor pid namespace.
      
      While a process is visible in >=1 pid namespaces, it can see pid_t's in only
      one pid namespace.  We call this pid namespace it's "active pid namespace",
      and it is always the youngest pid namespace in which the process is known.
      
      This patch defines and uses a wrapper to find the active pid namespace of a
      process.  The implementation of the wrapper will be changed in when support
      for multiple pid namespaces are added.
      
      Changelog:
      	2.6.22-rc4-mm2-pidns1:
      	- [Pavel Emelianov, Alexey Dobriyan] Back out the change to use
      	  task_active_pid_ns() in child_reaper() since task->nsproxy
      	  can be NULL during task exit (so child_reaper() continues to
      	  use init_pid_ns).
      
      	  to implement child_reaper() since init_pid_ns.child_reaper to
      	  implement child_reaper() since tsk->nsproxy can be NULL during exit.
      
      	2.6.21-rc6-mm1:
      	- Rename task_pid_ns() to task_active_pid_ns() to reflect that a
      	  process can have multiple pid namespaces.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Acked-by: NPavel Emelianov <xemul@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Herbert Poetzel <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2894d650
    • P
      pid namespaces: dynamic kmem cache allocator for pid namespaces · baf8f0f8
      Pavel Emelianov 提交于
      Add kmem_cache to pid_namespace to allocate pids from.
      
      Since both implementations expand the struct pid to carry more numerical
      values each namespace should have separate cache to store pids of different
      sizes.
      
      Each kmem cache is name "pid_<NR>", where <NR> is the number of numerical ids
      on the pid.  Different namespaces with same level of nesting will have same
      caches.
      
      This patch has two FIXMEs that are to be fixed after we reach the consensus
      about the struct pid itself.
      
      The first one is that the namespace to free the pid from in free_pid() must be
      taken from pid.  Now the init_pid_ns is used.
      
      The second FIXME is about the cache allocation.  When we do know how long the
      object will be then we'll have to calculate this size in create_pid_cachep.
      Right now the sizeof(struct pid) value is used.
      
      [akpm@linux-foundation.org: coding-style repair]
      Signed-off-by: NPavel Emelianov <xemul@openvz.org>
      Acked-by: NCedric Le Goater <clg@fr.ibm.com>
      Acked-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      baf8f0f8
    • P
      pid namespaces: round up the API · a47afb0f
      Pavel Emelianov 提交于
      The set of functions process_session, task_session, process_group and
      task_pgrp is confusing, as the names can be mixed with each other when looking
      at the code for a long time.
      
      The proposals are to
      * equip the functions that return the integer with _nr suffix to
        represent that fact,
      * and to make all functions work with task (not process) by making
        the common prefix of the same name.
      
      For monotony the routines signal_session() and set_signal_session() are
      replaced with task_session_nr() and set_task_session(), especially since they
      are only used with the explicit task->signal dereference.
      Signed-off-by: NPavel Emelianov <xemul@openvz.org>
      Acked-by: NSerge E. Hallyn <serue@us.ibm.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a47afb0f
    • S
      cgroups: implement namespace tracking subsystem · 858d72ea
      Serge E. Hallyn 提交于
      When a task enters a new namespace via a clone() or unshare(), a new cgroup
      is created and the task moves into it.
      
      This version names cgroups which are automatically created using
      cgroup_clone() as "node_<pid>" where pid is the pid of the unsharing or
      cloned process.  (Thanks Pavel for the idea) This is safe because if the
      process unshares again, it will create
      
      	/cgroups/(...)/node_<pid>/node_<pid>
      
      The only possibilities (AFAICT) for a -EEXIST on unshare are
      
      	1. pid wraparound
      	2. a process fails an unshare, then tries again.
      
      Case 1 is unlikely enough that I ignore it (at least for now).  In case 2, the
      node_<pid> will be empty and can be rmdir'ed to make the subsequent unshare()
      succeed.
      
      Changelog:
      	Name cloned cgroups as "node_<pid>".
      
      [clg@fr.ibm.com: fix order of cgroup subsystems in init/Kconfig]
      Signed-off-by: NSerge E. Hallyn <serue@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NCedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      858d72ea
    • B
      Add cgroupstats · 846c7bb0
      Balbir Singh 提交于
      This patch is inspired by the discussion at
      http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
      as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.  The
      patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
      ported)
      
      This patch implements per cgroup statistics infrastructure and re-uses
      code from the taskstats interface.  A new set of cgroup operations are
      registered with commands and attributes.  It should be very easy to
      *extend* per cgroup statistics, by adding members to the cgroupstats
      structure.
      
      The current model for cgroupstats is a pull, a push model (to post
      statistics on interesting events), should be very easy to add.  Currently
      user space requests for statistics by passing the cgroup file
      descriptor.  Statistics about the state of all the tasks in the cgroup
      is returned to user space.
      
      TODO's/NOTE:
      
      This patch provides an infrastructure for implementing cgroup statistics.
      Based on the needs of each controller, we can incrementally add more statistics,
      event based support for notification of statistics, accumulation of taskstats
      into cgroup statistics in the future.
      
      Sample output
      
      # ./cgroupstats -C /cgroup/a
      sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
      
      # ./cgroupstats -C /cgroup/
      sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
      
      If the approach looks good, I'll enhance and post the user space utility for
      the same
      
      Feedback, comments, test results are always welcome!
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Jay Lan <jlan@engr.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      846c7bb0
    • P
      Task Control Groups: simple task cgroup debug info subsystem · 006cb992
      Paul Menage 提交于
      This example subsystem exports debugging information as an aid to diagnosing
      refcount leaks, etc, in the cgroup framework.
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      006cb992
    • P
      Task Control Groups: example CPU accounting subsystem · 62d0df64
      Paul Menage 提交于
      This example demonstrates how to use the generic cgroup subsystem for a
      simple resource tracker that counts, for the processes in a cgroup, the
      total CPU time used and the %CPU used in the last complete 10 second interval.
      
      Portions contributed by Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      62d0df64
    • P
      Task Control Groups: make cpusets a client of cgroups · 8793d854
      Paul Menage 提交于
      Remove the filesystem support logic from the cpusets system and makes cpusets
      a cgroup subsystem
      
      The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
      passed through to the cgroup filesystem with the appropriate options to
      emulate the old cpuset filesystem behaviour.
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8793d854