1. 20 Oct, 2007 40 commits
    • pid namespaces: prepare proc_flush_task() to flush entries from multiple proc trees · 60347f67
      Pavel Emelyanov committed
      The first part is trivial: we just make proc_flush_task() operate on an
      arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to
      it.
      
      The other change is more tricky: I moved the proc_flush_task() call in
      release_task() higher to address the following problem.
      
      When flushing a task from many proc trees we need to know the set of ids (not
      just one pid) to find the dentries' names to flush.  Thus we need to pass the
      task's pid to proc_flush_task(), as struct pid is the only object that can
      provide all the pid numbers.  But after __exit_signal() the task has detached all
      its pids and this information is lost.

      This creates a tiny gap for proc_pid_lookup() to bring some dentries back to
      the tree and keep them in the hash (since pids are still alive before
      __exit_signal()) till the next shrink, but since proc_flush_task() does not
      provide a 100% guarantee that the dentries will be flushed, this is acceptable.
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pid namespaces: introduce MS_KERNMOUNT flag · 8bf9725c
      Pavel Emelyanov committed
      This flag tells the .get_sb callback that this is a kern_mount() call, so that
      it can trust the *data pointer to be a valid in-kernel one.  If this flag is
      passed from a user process, it is cleared, since the *data pointer is not a
      valid kernel object.

      Looking a few steps ahead: this will be needed for proc to create the
      superblock and store a valid pid namespace on it during namespace
      creation.  The reason why the namespace cannot live without a proc mount is
      described in the corresponding patch.
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
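      The flag handling described in the MS_KERNMOUNT message can be modelled in
      userspace C.  This is only a sketch under stated assumptions: the flag
      values are illustrative, and sanitize_user_mount_flags() and
      data_is_kernel_pointer() are hypothetical helper names, not kernel functions.

```c
#include <assert.h>

/* Illustrative flag values; the authoritative definitions are in the kernel headers. */
#define MS_RDONLY    (1UL << 0)
#define MS_KERNMOUNT (1UL << 22)

/*
 * Hypothetical helper: flags that arrive from a user process must never
 * carry the kernel-internal bit, so it is cleared before any .get_sb
 * callback can see it.
 */
static unsigned long sanitize_user_mount_flags(unsigned long flags)
{
        return flags & ~MS_KERNMOUNT;
}

/*
 * Hypothetical helper: a .get_sb callback may treat *data as a valid
 * in-kernel object only when the mount came from kern_mount().
 */
static int data_is_kernel_pointer(unsigned long flags)
{
        return (flags & MS_KERNMOUNT) != 0;
}
```

      A user-supplied MS_RDONLY | MS_KERNMOUNT is reduced to MS_RDONLY, so the
      callback never mistakes a user pointer for a kernel object.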
    • pid namespaces: move exit_task_namespaces() · 2e4a7072
      Pavel Emelyanov committed
      Make a task release its namespaces after it has reparented all its children to
      child_reaper, but before it notifies its parent about its death.

      The reason to release namespaces after reparenting is that when a task exits it
      may send a signal (SIGCHLD) to its parent, but if the parent has already
      exited its namespaces there will be no way to decide which pid to deliver to
      it: the parent can be from a different namespace.

      The reason to release namespaces before notifying the parent is that once the
      task sends SIGCHLD to the parent, the parent can call wait() on this task and
      release it.  But releasing the mnt namespace implies dropping all the mounts
      in the mnt namespace, and NFS expects the task to have a valid sighand pointer.

      Thanks to Oleg for pointing out some races that can appear and helping with
      patches and fixes.
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pid namespaces: rework forget_original_parent() · 762a24be
      Oleg Nesterov committed
      A pid namespace is a "view" of a particular set of tasks on the system.  They
      work in a similar way to filesystem namespaces.  A file (or a process) can be
      accessed in multiple namespaces, but it may have a different name in each.  In
      a filesystem, this name might be /etc/passwd in one namespace, but
      /chroot/etc/passwd in another.
      
      For processes, a process may have pid 1234 in one namespace, but be pid 1 in
      another.  This allows new pid namespaces to have basically arbitrary pids, and
      not have to worry about what pids exist in other namespaces.  This is
      essential for checkpoint/restart, where a restarted process's pid might collide
      with the pid of an existing process on the system.
      
      In this particular implementation, pid namespaces have a parent-child
      relationship, just like processes.  A process in a pid namespace may see all
      of the processes in the same namespace, as well as all of the processes in all
      of the namespaces which are children of its namespace.  Processes may not,
      however, see others which are in their parent's namespace, but not in their
      own.  The same goes for sibling namespaces.
      
      The known issue to be solved in the near future is signal handling at the
      namespace boundary.  That is, currently the namespace's init is treated like
      an ordinary task that can be killed from within a namespace.  Ideally, the
      signal handling by the namespace's init should have two sides: when signaling
      the init from its namespace, the init should look like a real init task, i.e.
      receive only those signals that it explicitly wants to; when signaling the
      init from one of the parent namespaces, init should look like an ordinary
      task, i.e. receive any signal, only taking the general permissions into
      account.
      
      The pid namespace was developed by Pavel Emelyanov and Sukadev Bhattiprolu and
      we eventually came to almost the same implementation, which differed in some
      details.  This set is based on Pavel's patches, but it includes comments and
      patches from Sukadev.

      Many thanks to Oleg, who reviewed the patches, pointed out many BUGs and gave
      valuable advice on how to make this set cleaner.
      
      This patch:
      
      We have to call exit_task_namespaces() only after the exiting task has
      reparented all its children and is sure that no other threads will reparent
      theirs to it.  Why this is needed is explained in the corresponding patch.  This
      one only reworks forget_original_parent() so that after calling it a
      task cannot be or become the parent of any other task.

      We check PF_EXITING instead of ->exit_state while choosing the new parent.
      Note that tasklist_lock acts as a barrier: everyone who takes tasklist_lock after
      us (when forget_original_parent() drops it) must see PF_EXITING.
      
      The other changes are just cleanups.  They just move some code from
      exit_notify to forget_original_parent().  It is a bit silly to declare
      ptrace_dead in exit_notify(), take tasklist, pass ptrace_dead to
      forget_original_parent(), unlock-lock-unlock tasklist, and then use
      ptrace_dead.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
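      The "pid 1234 in one namespace, pid 1 in another" idea from the series
      overview can be illustrated with a small userspace model.  This is a sketch,
      not the kernel's data structure: struct model_pid and model_pid_nr() are
      hypothetical names, and the model assumes that the queried level is an
      ancestor (or the own namespace) of the task, so sibling namespaces are not
      represented.

```c
#include <assert.h>

#define MODEL_MAX_LEVELS 4

/* Hypothetical model: one numeric id per namespace level, root namespace at index 0. */
struct model_pid {
        int level;                     /* nesting depth of the task's own namespace */
        int numbers[MODEL_MAX_LEVELS]; /* numbers[i] is the id seen from level i */
};

/*
 * Return the id under which the task is known at ns_level, or 0 when the
 * task is not visible from there (a level deeper than the task's own).
 */
static int model_pid_nr(const struct model_pid *pid, int ns_level)
{
        if (ns_level < 0 || ns_level > pid->level)
                return 0;
        return pid->numbers[ns_level];
}
```

      A task created in a child namespace would carry two numbers here: one for
      the root namespace and one for its own, while deeper namespaces see nothing.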
    • D
      d4c5e41f
    • mm/oom_kill.c: Use list_for_each_entry instead of list_for_each · 7b1915a9
      Matthias Kaehlcke committed
      mm/oom_kill.c: Convert list_for_each to list_for_each_entry in
      oom_kill_process()
      Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
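      The list_for_each to list_for_each_entry conversion done in this group of
      commits can be shown with a userspace re-creation of the list macros.  The
      macros below are simplified stand-ins written for illustration (they rely
      on the GNU __typeof__ extension), and struct item, sum_old() and sum_new()
      are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)

/* Old style: the cursor is a raw list_head pointer. */
#define list_for_each(pos, head) \
        for (pos = (head)->next; pos != (head); pos = pos->next)

/* New style: the cursor is the containing structure itself. */
#define list_for_each_entry(pos, head, member)                          \
        for (pos = list_entry((head)->next, __typeof__(*pos), member);  \
             &pos->member != (head);                                    \
             pos = list_entry(pos->member.next, __typeof__(*pos), member))

/* Link a new node in just before the head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *new, struct list_head *head)
{
        new->prev = head->prev;
        new->next = head;
        head->prev->next = new;
        head->prev = new;
}

struct item { int value; struct list_head node; };

/* Before the conversion: explicit list_entry() inside the loop body. */
static int sum_old(struct list_head *head)
{
        struct list_head *p;
        int sum = 0;

        list_for_each(p, head) {
                struct item *it = list_entry(p, struct item, node);
                sum += it->value;
        }
        return sum;
}

/* After the conversion: the macro hands out struct item pointers directly. */
static int sum_new(struct list_head *head)
{
        struct item *it;
        int sum = 0;

        list_for_each_entry(it, head, node)
                sum += it->value;
        return sum;
}
```

      Both loops visit the same elements; the converted form only removes the
      temporary list_head cursor and the manual list_entry() call.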
    • kernel/time/clocksource.c: Use list_for_each_entry instead of list_for_each · 2e197586
      Matthias Kaehlcke committed
      kernel/time/clocksource.c: Convert list_for_each to
      list_for_each_entry in clocksource_resume(),
      sysfs_override_clocksource() and show_available_clocksources()
      Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/exit.c: Use list_for_each_entry(_safe) instead of list_for_each(_safe) · 03ff1797
      Matthias Kaehlcke committed
      kernel/exit.c: Convert list_for_each(_safe) to
      list_for_each_entry(_safe) in forget_original_parent(), exit_notify()
      and do_wait()
      Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/super.c: use list_for_each_entry() instead of list_for_each() · d4730127
      Matthias Kaehlcke committed
      fs/super.c: use list_for_each_entry() instead of list_for_each() in
      sget()

      [akpm@linux-foundation.org: clean up some crap while we're there]
      Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/eventpoll.c: use list_for_each_entry() instead of list_for_each() · b70c3940
      Matthias Kaehlcke committed
      fs/eventpoll.c: use list_for_each_entry() instead of list_for_each()
      in ep_poll_safewake()
      Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/file_table.c: use list_for_each_entry() instead of list_for_each() · cfdaf9e5
      Matthias Kaehlcke committed
      fs/file_table.c: use list_for_each_entry() instead of list_for_each()
      in fs_may_remount_ro()
      Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • workqueue: debug flushing deadlocks with lockdep · 4e6045f1
      Johannes Berg committed
      In the following scenario:
      
      code path 1:
        my_function() -> lock(L1); ...; flush_workqueue(); ...
      
      code path 2:
        run_workqueue() -> my_work() -> ...; lock(L1); ...
      
      you can get a deadlock when my_work() is queued or running
      but my_function() has acquired L1 already.
      
      This patch adds a pseudo-lock to each workqueue to make lockdep
      warn about this scenario.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
      Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Make access to task's nsproxy lighter · cf7b708c
      Pavel Emelyanov committed
      When someone wants to deal with some other task's namespaces, it has to lock
      the task and then get the desired namespace if it exists.  This is
      slow on read-only paths and may be impossible in some cases.

      E.g.  Oleg recently noticed a race between unshare() and the (sent for
      review in cgroups) pid namespaces: when the task notifies the parent it
      has to know the parent's namespace, but taking the task_lock() is
      impossible there because the code runs under the write-locked tasklist lock.

      On the other hand, switching the namespace on a task (daemonize) and releasing
      the namespace (after the last task exits) are rather rare operations, and we
      can sacrifice their speed to solve the issues above.
      
      The access to other task namespaces is proposed to be performed
      like this:
      
           rcu_read_lock();
           nsproxy = task_nsproxy(tsk);
           if (nsproxy != NULL) {
                   /*
                    * work with the namespaces here
                    * e.g. get the reference on one of them
                    */
           } /*
              * NULL task_nsproxy() means that this task is
              * almost dead (zombie)
              */
           rcu_read_unlock();
      
      This patch has passed review by Eric and Oleg :) and has,
      of course, been tested.
      
      [clg@fr.ibm.com: fix unshare()]
      [ebiederm@xmission.com: Update get_net_ns_by_pid]
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pid namespaces: move alloc_pid() to copy_process() · a6f5e063
      Sukadev Bhattiprolu committed
      Move alloc_pid() into copy_process().  This will keep all pid and pid
      namespace code together and simplify error handling when we support multiple
      pid namespaces.
      Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Herbert Poetzel <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pid namespaces: define is_global_init() and is_container_init() · b460cbc5
      Serge E. Hallyn committed
      is_init() is an ambiguous name for the pid==1 check.  Split it into
      is_global_init() and is_container_init().
      
      A cgroup init has its tsk->pid == 1.

      A global init also has its tsk->pid == 1, and its active pid namespace
      is the init_pid_ns.  But rather than check the active pid namespace,
      compare the task structure with 'init_pid_ns.child_reaper', which is
      initialized during boot to the /sbin/init process and never changes.
      
      Changelog:
      
      	2.6.22-rc4-mm2-pidns1:
      	- Use 'init_pid_ns.child_reaper' to determine if a given task is the
      	  global init (/sbin/init) process. This would improve performance
      	  and remove dependence on the task_pid().
      
      	2.6.21-mm2-pidns2:
      
      	- [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
      	  ppc,avr32}/traps.c for the _exception() call to is_global_init().
      	  This way, we kill only the cgroup if the cgroup's init has a
      	  bug rather than force a kernel panic.
      
      [akpm@linux-foundation.org: fix comment]
      [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
      [bunk@stusta.de: kernel/pid.c: remove unused exports]
      [sukadev@us.ibm.com: Fix capability.c to work with threaded init]
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Acked-by: Pavel Emelianov <xemul@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Herbert Poetzel <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
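      The distinction between the two predicates can be sketched in userspace C.
      The types and the model_ prefixed names are hypothetical stand-ins for the
      kernel's task_struct and pid_namespace; the only logic taken from the
      message is "pid == 1" versus "pointer equals init_pid_ns.child_reaper".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct model_task { int pid; };  /* pid as seen in the task's own namespace */
struct model_pid_ns { struct model_task *child_reaper; };

/* Models init_pid_ns; child_reaper is set once at "boot" and never changes. */
static struct model_pid_ns model_init_pid_ns;

/* Any task that is pid 1 in its namespace is a container init. */
static int model_is_container_init(const struct model_task *tsk)
{
        return tsk->pid == 1;
}

/* Only the task recorded as the initial namespace's reaper is the global init. */
static int model_is_global_init(const struct model_task *tsk)
{
        return tsk == model_init_pid_ns.child_reaper;
}
```

      Comparing pointers avoids looking at the task's pid or active namespace at
      all, which is the performance argument made in the changelog above.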
    • pid namespaces: use task_pid() to find leader's pid · 3743ca05
      Sukadev Bhattiprolu committed
      Use task_pid() to get the leader's 'struct pid' and avoid the find_pid() call.
      Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Acked-by: Pavel Emelianov <xemul@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Herbert Poetzel <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pid namespaces: rename child_reaper() function · 88f21d81
      Sukadev Bhattiprolu committed
      Rename the child_reaper() function to task_child_reaper() to be similar to
      other task_* functions and to distinguish the function from 'struct
      pid_namespace.child_reaper'.
      Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Herbert Poetzel <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pid namespaces: define and use task_active_pid_ns() wrapper · 2894d650
      Sukadev Bhattiprolu committed
      With multiple pid namespaces, a process is known by some pid_t in every
      ancestor pid namespace.  Every time the process forks, the child process also
      gets a pid_t in every ancestor pid namespace.
      
      While a process is visible in >=1 pid namespaces, it can see pid_t's in only
      one pid namespace.  We call this pid namespace its "active pid namespace",
      and it is always the youngest pid namespace in which the process is known.

      This patch defines and uses a wrapper to find the active pid namespace of a
      process.  The implementation of the wrapper will be changed when support
      for multiple pid namespaces is added.
      
      Changelog:
      	2.6.22-rc4-mm2-pidns1:
      	- [Pavel Emelianov, Alexey Dobriyan] Back out the change to use
      	  task_active_pid_ns() in child_reaper() since task->nsproxy
      	  can be NULL during task exit (so child_reaper() continues to
      	  use init_pid_ns).
      
      	2.6.21-rc6-mm1:
      	- Rename task_pid_ns() to task_active_pid_ns() to reflect that a
      	  process can have multiple pid namespaces.
      Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Acked-by: Pavel Emelianov <xemul@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Herbert Poetzel <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pid namespaces: dynamic kmem cache allocator for pid namespaces · baf8f0f8
      Pavel Emelianov committed
      Add kmem_cache to pid_namespace to allocate pids from.
      
      Since both implementations expand the struct pid to carry more numerical
      values, each namespace should have a separate cache to store pids of different
      sizes.

      Each kmem cache is named "pid_<NR>", where <NR> is the number of numerical ids
      on the pid.  Different namespaces with the same level of nesting will share the
      same cache.
      
      This patch has two FIXMEs that are to be fixed after we reach the consensus
      about the struct pid itself.
      
      The first one is that the namespace to free the pid from in free_pid() must be
      taken from pid.  Now the init_pid_ns is used.
      
      The second FIXME is about the cache allocation.  Once we know how long the
      object will be, we'll have to calculate this size in create_pid_cachep.
      Right now the sizeof(struct pid) value is used.
      
      [akpm@linux-foundation.org: coding-style repair]
      Signed-off-by: Pavel Emelianov <xemul@openvz.org>
      Acked-by: Cedric Le Goater <clg@fr.ibm.com>
      Acked-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
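      The sharing rule for the "pid_<NR>" caches can be sketched as a
      find-or-create lookup keyed by the number of ids.  This userspace model is
      an assumption-laden illustration: struct model_pid_cache, the fixed-size
      table and model_get_pid_cache() are hypothetical, and no real kmem cache
      is created.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MODEL_MAX_CACHES 8

/* Hypothetical stand-in for one kmem cache descriptor. */
struct model_pid_cache {
        int nr_ids;      /* number of numerical ids stored in each pid */
        char name[16];   /* "pid_<NR>" */
};

static struct model_pid_cache model_caches[MODEL_MAX_CACHES];
static int model_nr_caches;

/*
 * Return the cache for pids carrying nr_ids numbers, creating it on first
 * use.  Namespaces with the same nesting level ask for the same nr_ids and
 * therefore get the same cache.  Returns NULL when the table is full.
 */
static struct model_pid_cache *model_get_pid_cache(int nr_ids)
{
        struct model_pid_cache *cache;
        int i;

        for (i = 0; i < model_nr_caches; i++)
                if (model_caches[i].nr_ids == nr_ids)
                        return &model_caches[i];

        if (model_nr_caches == MODEL_MAX_CACHES)
                return NULL;

        cache = &model_caches[model_nr_caches++];
        cache->nr_ids = nr_ids;
        snprintf(cache->name, sizeof(cache->name), "pid_%d", nr_ids);
        return cache;
}
```

      Two namespaces at the same depth both request the same nr_ids and end up
      with the same descriptor, which is the reuse the message describes.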
    • pid namespaces: make get_pid_ns() return the namespace itself · a05f7b15
      Pavel Emelianov committed
      Make get_pid_ns() return the namespace itself to look like the other getters
      and make the code using it look nicer.
      Signed-off-by: Pavel Emelianov <xemul@openvz.org>
      Acked-by: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pid namespaces: round up the API · a47afb0f
      Pavel Emelianov committed
      The set of functions process_session, task_session, process_group and
      task_pgrp is confusing, as the names are easily mixed up when looking
      at the code for a long time.

      The proposals are to
      * give the functions that return an integer a _nr suffix to
        reflect that fact,
      * and make all functions work with a task (not a process) by giving
        them the same common prefix.

      For uniformity the routines signal_session() and set_signal_session() are
      replaced with task_session_nr() and set_task_session(), especially since they
      are only used with the explicit task->signal dereference.
      Signed-off-by: Pavel Emelianov <xemul@openvz.org>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: implement namespace tracking subsystem · 858d72ea
      Serge E. Hallyn committed
      When a task enters a new namespace via a clone() or unshare(), a new cgroup
      is created and the task moves into it.
      
      This version names cgroups which are automatically created using
      cgroup_clone() as "node_<pid>", where pid is the pid of the unsharing or
      cloned process (thanks to Pavel for the idea).  This is safe because if the
      process unshares again, it will create
      
      	/cgroups/(...)/node_<pid>/node_<pid>
      
      The only possibilities (AFAICT) for a -EEXIST on unshare are
      
      	1. pid wraparound
      	2. a process fails an unshare, then tries again.
      
      Case 1 is unlikely enough that I ignore it (at least for now).  In case 2, the
      node_<pid> will be empty and can be rmdir'ed to make the subsequent unshare()
      succeed.
      
      Changelog:
      	Name cloned cgroups as "node_<pid>".
      
      [clg@fr.ibm.com: fix order of cgroup subsystems in init/Kconfig]
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
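      The "node_<pid>" naming and the nesting that results from a second
      unshare() can be sketched with plain string handling.  The helper
      cgroup_node_path() and the "/cgroups" mount point used with it are
      hypothetical; the kernel builds these directories internally rather than
      through a path string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the path of the cgroup that would be created
 * under parent for the given pid.  Returns 0 on success, -1 if buf is too
 * small to hold the result.
 */
static int cgroup_node_path(char *buf, size_t len, const char *parent, int pid)
{
        int n = snprintf(buf, len, "%s/node_%d", parent, pid);

        return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

      Calling the helper twice with the same pid, the second time with the first
      result as parent, yields a nested node_<pid>/node_<pid> path, so the
      second unshare() does not collide with the first.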
    • Add cgroupstats · 846c7bb0
      Balbir Singh committed
      This patch is inspired by the discussion at
      http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
      as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.  The
      patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
      ported)
      
      This patch implements per cgroup statistics infrastructure and re-uses
      code from the taskstats interface.  A new set of cgroup operations is
      registered with commands and attributes.  It should be very easy to
      *extend* per cgroup statistics by adding members to the cgroupstats
      structure.

      The current model for cgroupstats is a pull model; a push model (to post
      statistics on interesting events) should be very easy to add.  Currently
      user space requests statistics by passing the cgroup file
      descriptor.  Statistics about the state of all the tasks in the cgroup
      are returned to user space.
      
      TODO's/NOTE:
      
      This patch provides an infrastructure for implementing cgroup statistics.
      Based on the needs of each controller, we can incrementally add more statistics,
      event based support for notification of statistics, and accumulation of taskstats
      into cgroup statistics in the future.
      
      Sample output
      
      # ./cgroupstats -C /cgroup/a
      sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
      
      # ./cgroupstats -C /cgroup/
      sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
      
      If the approach looks good, I'll enhance and post the user space utility for
      the same.
      
      Feedback, comments, test results are always welcome!
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Jay Lan <jlan@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
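      The pull model and the sample output above can be imitated by a small
      userspace tally over task states.  Everything named model_ here is
      hypothetical and does not reproduce the layout of the real cgroupstats
      structure or the taskstats netlink interface; only the five counters and
      the output format are taken from the message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical task states, one per counter shown in the sample output. */
enum model_task_state {
        MODEL_RUNNING,
        MODEL_SLEEPING,
        MODEL_UNINTERRUPTIBLE,
        MODEL_STOPPED,
        MODEL_BLOCKED
};

/* Hypothetical per-cgroup counters. */
struct model_cgroupstats {
        unsigned int sleeping, blocked, running, stopped, uninterruptible;
};

/* Pull model: walk the tasks of one cgroup on request and count their states. */
static void model_cgroupstats_build(struct model_cgroupstats *stats,
                                    const enum model_task_state *states,
                                    size_t n)
{
        size_t i;

        memset(stats, 0, sizeof(*stats));
        for (i = 0; i < n; i++) {
                switch (states[i]) {
                case MODEL_RUNNING:         stats->running++;         break;
                case MODEL_SLEEPING:        stats->sleeping++;        break;
                case MODEL_UNINTERRUPTIBLE: stats->uninterruptible++; break;
                case MODEL_STOPPED:         stats->stopped++;         break;
                case MODEL_BLOCKED:         stats->blocked++;         break;
                }
        }
}

/* Format the counters like the sample output; returns -1 if buf is too small. */
static int model_cgroupstats_format(char *buf, size_t len,
                                    const struct model_cgroupstats *s)
{
        int n = snprintf(buf, len,
                         "sleeping %u, blocked %u, running %u, stopped %u, "
                         "uninterruptible %u",
                         s->sleeping, s->blocked, s->running, s->stopped,
                         s->uninterruptible);

        return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

      Extending the statistics then amounts to adding a member to the counter
      structure and one more case to the tally, which mirrors the extensibility
      claim made in the message.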
    • task cgroups: enable cgroups by default in some configs · c2e2c7fa
      Paul Jackson committed
      In pre-cgroup cpusets, a few config files enabled cpusets by default.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Task Control Groups: simple task cgroup debug info subsystem · 006cb992
      Paul Menage committed
      This example subsystem exports debugging information as an aid to diagnosing
      refcount leaks, etc, in the cgroup framework.
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Task Control Groups: example CPU accounting subsystem · 62d0df64
      Paul Menage committed
      This example demonstrates how to use the generic cgroup subsystem for a
      simple resource tracker that counts, for the processes in a cgroup, the
      total CPU time used and the %CPU used in the last complete 10 second interval.
      
      Portions contributed by Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Task Control Groups: make cpusets a client of cgroups · 8793d854
      Paul Menage committed
      Remove the filesystem support logic from the cpusets system and make cpusets
      a cgroup subsystem.
      
      The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
      passed through to the cgroup filesystem with the appropriate options to
      emulate the old cpuset filesystem behaviour.
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Task Control Groups: automatic userspace notification of idle cgroups · 81a6a5cd
      Paul Menage committed
      Add the following files to the cgroup filesystem:
      
      notify_on_release - configures/reports whether the cgroup subsystem should
      attempt to run a release script when this cgroup becomes unused
      
      release_agent - configures/reports the release agent to be used for this
      hierarchy (top level in each hierarchy only)
      
      releasable - reports whether this cgroup would have been auto-released if
      notify_on_release was true and a release agent was configured (mainly useful
      for debugging)
      
      To avoid locking issues, invoking the userspace release agent is done via a
      workqueue task; cgroups that need to have their release agents invoked by
      the workqueue task are linked on to a list.
      
      [pj@sgi.com: Need to include kmod.h]
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      Task Control Groups: shared cgroup subsystem group arrays · 817929ec
      Paul Menage committed
      Replace the struct css_set embedded in task_struct with a pointer; all tasks
      that have the same set of memberships across all hierarchies will share a
      css_set object, and will be linked via their css_sets field to the "tasks"
      list_head in the css_set.
      
      Assuming that many tasks share the same cgroup assignments, this reduces
      overall space usage and keeps the size of the task_struct down (three pointers
      added to task_struct compared to a non-cgroups kernel, no matter how many
      subsystems are registered).
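
The sharing idea can be sketched roughly as follows. This is an illustration only, not the kernel's definitions: every `demo_` name is hypothetical, the subsystem count is made up, and the list linkage between tasks and their set is left out. Each task stores a single pointer to a reference-counted set, so the per-task cost does not grow with the number of subsystems.

```c
#include <stddef.h>

#define DEMO_SUBSYS_COUNT 4   /* hypothetical number of subsystems */

/* Hypothetical stand-in for a shared set of per-subsystem state pointers. */
struct demo_css_set {
    int refcount;                       /* number of tasks pointing here */
    void *subsys[DEMO_SUBSYS_COUNT];    /* one state pointer per subsystem */
};

/* Hypothetical stand-in for a task: it stores one pointer, not the array. */
struct demo_task {
    struct demo_css_set *cgroups;
};

/* A forked child shares its parent's set and takes a reference on it. */
static void demo_fork(struct demo_task *child, const struct demo_task *parent)
{
    child->cgroups = parent->cgroups;
    child->cgroups->refcount++;
}

/* An exiting task drops its reference; returns 1 if the set became unused. */
static int demo_exit(struct demo_task *task)
{
    int unused = (--task->cgroups->refcount == 0);
    task->cgroups = NULL;
    return unused;
}
```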
      
      [akpm@linux-foundation.org: fix a printk]
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      817929ec
    • P
      Task Control Groups: add procfs interface · a424316c
      Paul Menage committed
      Add:
      
      /proc/cgroups - general system info
      
      /proc/*/cgroup - per-task cgroup membership info
      
      [a.p.zijlstra@chello.nl: cgroups: bdi init hooks]
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a424316c
    • P
      Task Control Groups: add cgroup_clone() interface · 697f4161
      Paul Menage committed
      Add support for cgroup_clone(), a way to create new cgroups intended to
      be used for systems such as namespace unsharing.  A new subsystem callback,
      post_clone(), is added to allow subsystems to automatically configure cloned
      cgroups.
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      697f4161
    • P
      Task Control Groups: add fork()/exit() hooks · b4f48b63
      Paul Menage committed
      This adds the necessary hooks to the fork() and exit() paths to ensure
      that new children inherit their parent's cgroup assignments, and that
      exiting processes release reference counts on their cgroups.
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4f48b63
    • P
      Add cgroup write_uint() helper method · 355e0c48
      Paul Menage committed
      Add write_uint() helper method for cgroup subsystems
      
      This helper is analogous to the read_uint() helper method for
      reporting u64 values to userspace. It's designed to reduce the amount
      of boilerplate required for creating new cgroup subsystems.
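
As a rough userspace illustration of the boilerplate such a helper removes (this is not the kernel implementation, and `demo_write_uint`, `demo_write_uint_fn` and `demo_store` are hypothetical names), the text parsing can live in one generic function so that each subsystem only supplies a setter:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-file setter that a subsystem would supply. */
typedef int (*demo_write_uint_fn)(void *state, uint64_t val);

/*
 * Hypothetical generic write handler: parse the text once here so each
 * subsystem only has to provide the setter. Returns 0 or a negative errno.
 */
static int demo_write_uint(const char *buf, void *state, demo_write_uint_fn set)
{
    char *end;
    unsigned long long val;

    if (buf[0] < '0' || buf[0] > '9')
        return -EINVAL;               /* reject empty input, signs, spaces */
    errno = 0;
    val = strtoull(buf, &end, 10);
    if (errno == ERANGE)
        return -ERANGE;
    if (*end == '\n')
        end++;                        /* tolerate one trailing newline */
    if (*end != '\0')
        return -EINVAL;
    return set(state, (uint64_t)val);
}

/* Example setter: store the value into a uint64_t. */
static int demo_store(void *state, uint64_t val)
{
    *(uint64_t *)state = val;
    return 0;
}
```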
      Signed-off-by: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      355e0c48
    • P
      Task Control Groups: add tasks file interface · bbcb81d0
      Paul Menage committed
      Add the per-directory "tasks" file for cgroupfs mounts; this allows the
      user to determine which tasks are members of a cgroup by reading a
      cgroup's "tasks", and to move a task into a cgroup by writing its pid to
      its "tasks".
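
From userspace, moving a task could look roughly like the sketch below. The helper name `demo_attach_task` is hypothetical, and the directory passed in is assumed to be a cgroup directory under wherever the cgroup filesystem happens to be mounted.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Write `pid` to the "tasks" file under `cgroup_dir`.
 * Returns 0 on success, -1 on failure.
 */
static int demo_attach_task(const char *cgroup_dir, pid_t pid)
{
    char path[4096];
    FILE *f;
    int n, ok;

    n = snprintf(path, sizeof(path), "%s/tasks", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;                    /* path would not fit */
    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fprintf(f, "%ld\n", (long)pid) > 0;
    if (fclose(f) != 0)               /* write errors may surface on close */
        ok = 0;
    return ok ? 0 : -1;
}
```

Reading membership back is the reverse: open the same file for reading and parse one pid per line.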
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbcb81d0
    • P
      Task Control Groups: basic task cgroup framework · ddbcc7e8
      Paul Menage committed
      Generic Process Control Groups
      --------------------------
      
      There have recently been various proposals floating around for
      resource management/accounting and other task grouping subsystems in
      the kernel, including ResGroups, User BeanCounters, NSProxy
      cgroups, and others.  These all need the basic abstraction of being
      able to group together multiple processes in an aggregate, in order to
      track/limit the resources permitted to those processes, or control
      other behaviour of the processes, and all implement this grouping in
      different ways.
      
      This patchset provides a framework for tracking and grouping processes
      into arbitrary "cgroups" and assigning arbitrary state to those
      groupings, in order to control the behaviour of the cgroup as an
      aggregate.
      
      The intention is that the various resource management and
      virtualization/cgroup efforts can also become task cgroup
      clients, with the result that:
      
      - the userspace APIs are (somewhat) normalised
      
      - it's easier to test e.g. the ResGroups CPU controller in
       conjunction with the BeanCounters memory controller, or use either of
       them as the resource-control portion of a virtual server system.
      
      - the additional kernel footprint of any of the competing resource
       management systems is substantially reduced, since it doesn't need
       to provide process grouping/containment, hence improving their
       chances of getting into the kernel
      
      This patch:
      
      Add the main task cgroups framework - the cgroup filesystem, and the
      basic structures for tracking membership and associating subsystem state
      objects to tasks.
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ddbcc7e8
    • P
      cpuset: zero malloc - revert the old cpuset fix · 55a230aa
      Paul Jackson committed
      The cpuset code to present a list of tasks using a cpuset to user space could
      write to an array that it had kmalloc'd, after a kmalloc request of zero size.
      
      The problem was that the code didn't check for writes past the allocated end
      of the array until -after- the first write.
      
      This is a race condition that is likely rare -- it would only show up if a
      cpuset went from being empty to having a task in it, during the brief time
      between the allocation and the first write.
      
      Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
      zero kmalloc returned a few usable bytes anyway, and no harm was done with the
      bogus write.
      
      With the 2.6.22 kernel changes to issue a warning if code tries to write
      to the location returned from a zero size allocation, this problem is no
      longer benign.  This cpuset code would occasionally trigger that warning.
      
      The fix is trivial -- check before storing into the array, not after, whether
      the array is big enough to hold the store.
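
The check-before-store ordering can be illustrated with a small sketch. It is not the cpuset code itself, and `demo_fill_pids` is a hypothetical name. With the bounds check placed ahead of the store, a zero-capacity array is never written to:

```c
#include <stddef.h>

/*
 * Copy up to `capacity` ids from `src` into `dst`; returns how many were stored.
 * The bounds check comes before each store, so capacity == 0 never writes.
 */
static size_t demo_fill_pids(int *dst, size_t capacity, const int *src, size_t nsrc)
{
    size_t n = 0;
    for (size_t i = 0; i < nsrc; i++) {
        if (n >= capacity)    /* check first ... */
            break;
        dst[n++] = src[i];    /* ... then store */
    }
    return n;
}
```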
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Serge E. Hallyn" <serue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55a230aa
    • R
      kernel-api docbook: fix content problems · 8f731f7d
      Randy Dunlap committed
      Fix kernel-api docbook contents problems.
      
      docproc: linux-2.6.23-git13/include/asm-x86/unaligned_32.h: No such file or directory
      Warning(linux-2.6.23-git13//include/linux/list.h:482): bad line: 			of list entry
      Warning(linux-2.6.23-git13//mm/filemap.c:864): No description found for parameter 'ra'
      Warning(linux-2.6.23-git13//block/ll_rw_blk.c:3760): No description found for parameter 'req'
      Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'private'
      Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'cdev'
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f731f7d
    • J
      reiserfs: ignore on disk s_bmap_nr value · cb680c1b
      Jeff Mahoney committed
      Implement support for file systems larger than 8 TiB.
      
      The reiserfs superblock contains a 16 bit value for counting the number of
      bitmap blocks.  The rest of the disk format supports file systems up to 2^32
      blocks, but the bitmap block limitation artificially limits this to 8 TiB with
      a 4KiB block size.
      
      Rather than trust the superblock's 16-bit bitmap block count, we calculate it
      dynamically based on the number of blocks in the file system.  When an
      incorrect value is observed in the superblock, it is zeroed out, ensuring that
      older kernels will not be able to mount the file system.
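
The dynamic calculation can be sketched as below, assuming each bitmap block tracks one file system block per bit; `demo_bmap_count` is a hypothetical name rather than the function added by the patch. With 4 KiB blocks, one bitmap block covers 32768 blocks, so a file system near 2^32 blocks needs about 131072 bitmap blocks, more than a 16-bit field can hold.

```c
#include <stdint.h>

/* Bitmap blocks needed for `block_count` blocks, assuming one bit per block. */
static uint32_t demo_bmap_count(uint32_t block_count, uint32_t block_size)
{
    uint32_t bits_per_bitmap = block_size * 8;
    /* Round up; written to avoid overflow when block_count is near 2^32. */
    return block_count / bits_per_bitmap + (block_count % bits_per_bitmap != 0);
}
```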
      
      Userspace support has already been implemented and shipped in reiserfsprogs
      3.6.20.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb680c1b
    • J
      reiserfs: remove first_zero_hint · 4d20851d
      Jeff Mahoney committed
      The first_zero_hint metadata caching was never actually used, and it's of
      dubious optimization quality.  This patch removes it.
      
      It doesn't actually shrink the size of the reiserfs_bitmap_info struct, since
      that doesn't work with block sizes larger than 8K.  There was a big fixme in
      there, and with all the work lately in allowing block size > page size, I
      might as well kill the fixme as well.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d20851d
    • J
      reiserfs: fix usage of signed ints for block numbers · 3ee16670
      Jeff Mahoney committed
      Do a quick signedness check for block numbers.  There are a number of places
      where signed integers are used for block numbers, which limits the usable file
      system size to 8 TiB.  The disk format, excepting a problem which will be
      fixed in the following patch, supports file systems up to 16 TiB in size.
      This patch cleans up those sites so that we can enable the full usable size.
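
The 8 TiB and 16 TiB figures follow from the width of the block number. A short worked check, assuming 32-bit block numbers and a 4 KiB block size (the `demo_` helpers are hypothetical):

```c
#include <stdint.h>

/* Largest file system size in bytes when block numbers are signed 32-bit. */
static uint64_t demo_max_bytes_signed(uint32_t block_size)
{
    return ((uint64_t)INT32_MAX + 1) * block_size;   /* 2^31 blocks */
}

/* Largest file system size in bytes when block numbers are unsigned 32-bit. */
static uint64_t demo_max_bytes_unsigned(uint32_t block_size)
{
    return ((uint64_t)UINT32_MAX + 1) * block_size;  /* 2^32 blocks */
}
```

With 4096-byte blocks these give 2^43 bytes (8 TiB) and 2^44 bytes (16 TiB) respectively.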
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ee16670