1. 26 1月, 2008 5 次提交
    • A
      sched: latencytop support · 9745512c
      Arjan van de Ven 提交于
      LatencyTOP kernel infrastructure; it measures latencies in the
      scheduler and tracks it system wide and per process.
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      9745512c
    • P
      sched: rt group scheduling · 6f505b16
      Peter Zijlstra 提交于
      Extend group scheduling to also cover the realtime classes. It uses the time
      limiting introduced by the previous patch to allow multiple realtime groups.
      
      The hard time limit is required to keep behaviour deterministic.
      
      The algorithms used make the realtime scheduler O(tg), linear scaling wrt the
      number of task groups. This is the worst case behaviour I can't seem to get out
      of, the avg. case of the algorithms can be improved, I focused on correctness
      and worst case.
      
      [ akpm@linux-foundation.org: move side-effects out of BUG_ON(). ]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      6f505b16
    • P
      Preempt-RCU: implementation · e260be67
      Paul E. McKenney 提交于
      This patch implements a new version of RCU which allows its read-side
      critical sections to be preempted. It uses a set of counter pairs
      to keep track of the read-side critical sections and flips them
      when all tasks exit read-side critical section. The details
      of this implementation can be found in this paper -
      
      	http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf
      
      and the article-
      
      	http://lwn.net/Articles/253651/
      
      This patch was developed as a part of the -rt kernel development and
      meant to provide better latencies when read-side critical sections of
      RCU don't disable preemption.  As a consequence of keeping track of RCU
      readers, the readers have a slight overhead (optimizations in the paper).
      This implementation co-exists with the "classic" RCU implementations
      and can be switched to at compiler.
      
      Also includes RCU tracing summarized in debugfs.
      
      [ akpm@linux-foundation.org: build fixes on non-preempt architectures ]
      Signed-off-by: NGautham R Shenoy <ego@in.ibm.com>
      Signed-off-by: NDipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: NPaul E. McKenney <paulmck@us.ibm.com>
      Reviewed-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e260be67
    • G
      sched: add RT-balance cpu-weight · 73fe6aae
      Gregory Haskins 提交于
      Some RT tasks (particularly kthreads) are bound to one specific CPU.
      It is fairly common for two or more bound tasks to get queued up at the
      same time.  Consider, for instance, softirq_timer and softirq_sched.  A
      timer goes off in an ISR which schedules softirq_thread to run at RT50.
      Then the timer handler determines that it's time to smp-rebalance the
      system so it schedules softirq_sched to run.  So we are in a situation
      where we have two RT50 tasks queued, and the system will go into
      rt-overload condition to request other CPUs for help.
      
      This causes two problems in the current code:
      
      1) If a high-priority bound task and a low-priority unbounded task queue
         up behind the running task, we will fail to ever relocate the unbounded
         task because we terminate the search on the first unmovable task.
      
      2) We spend precious futile cycles in the fast-path trying to pull
         overloaded tasks over.  It is therefore optimial to strive to avoid the
         overhead all together if we can cheaply detect the condition before
         overload even occurs.
      
      This patch tries to achieve this optimization by utilizing the hamming
      weight of the task->cpus_allowed mask.  A weight of 1 indicates that
      the task cannot be migrated.  We will then utilize this information to
      skip non-migratable tasks and to eliminate uncessary rebalance attempts.
      
      We introduce a per-rq variable to count the number of migratable tasks
      that are currently running.  We only go into overload if we have more
      than one rt task, AND at least one of them is migratable.
      
      In addition, we introduce a per-task variable to cache the cpus_allowed
      weight, since the hamming calculation is probably relatively expensive.
      We only update the cached value when the mask is updated which should be
      relatively infrequent, especially compared to scheduling frequency
      in the fast path.
      Signed-off-by: NGregory Haskins <ghaskins@novell.com>
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      73fe6aae
    • I
      softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks · 82a1fcb9
      Ingo Molnar 提交于
      this patch extends the soft-lockup detector to automatically
      detect hung TASK_UNINTERRUPTIBLE tasks. Such hung tasks are
      printed the following way:
      
       ------------------>
       INFO: task prctl:3042 blocked for more than 120 seconds.
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
       prctl         D fd5e3793     0  3042   2997
              f6050f38 00000046 00000001 fd5e3793 00000009 c06d8264 c06dae80 00000286
              f6050f40 f6050f00 f7d34d90 f7d34fc8 c1e1be80 00000001 f6050000 00000000
              f7e92d00 00000286 f6050f18 c0489d1a f6050f40 00006605 00000000 c0133a5b
       Call Trace:
        [<c04883a5>] schedule_timeout+0x6d/0x8b
        [<c04883d8>] schedule_timeout_uninterruptible+0x15/0x17
        [<c0133a76>] msleep+0x10/0x16
        [<c0138974>] sys_prctl+0x30/0x1e2
        [<c0104c52>] sysenter_past_esp+0x5f/0xa5
        =======================
       2 locks held by prctl/3042:
       #0:  (&sb->s_type->i_mutex_key#5){--..}, at: [<c0197d11>] do_fsync+0x38/0x7a
       #1:  (jbd_handle){--..}, at: [<c01ca3d2>] journal_start+0xc7/0xe9
       <------------------
      
      the current default timeout is 120 seconds. Such messages are printed
      up to 10 times per bootup. If the system has crashed already then the
      messages are not printed.
      
      if lockdep is enabled then all held locks are printed as well.
      
      this feature is a natural extension to the softlockup-detector (kernel
      locked up without scheduling) and to the NMI watchdog (kernel locked up
      with IRQs disabled).
      
      [ Gautham R Shenoy <ego@in.ibm.com>: CPU hotplug fixes. ]
      [ Andrew Morton <akpm@linux-foundation.org>: build warning fix. ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      82a1fcb9
  2. 06 12月, 2007 1 次提交
    • E
      fix clone(CLONE_NEWPID) · 5cd17569
      Eric W. Biederman 提交于
      Currently we are complicating the code in copy_process, the clone ABI, and
      if we fix the bugs sys_setsid itself, with an unnecessary open coded
      version of sys_setsid.
      
      So just simplify everything and don't special case the session and pgrp of
      the initial process in a pid namespace.
      
      Having this special case actually presents to user space the classic linux
      startup conditions with session == pgrp == 0 for /sbin/init.
      
      We already handle sending signals to processes in a child pid namespace.
      
      We need to handle sending signals to processes in a parent pid namespace
      for cases like SIGCHILD and SIGIO.
      
      This makes nothing extra visible inside a pid namespace.  So this extra
      special case appears to have no redeeming merits.
      
      Further removing this special case increases the flexibility of how we can
      use pid namespaces, by not requiring the initial process in a pid namespace
      to be a daemon.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5cd17569
  3. 10 11月, 2007 1 次提交
    • S
      sched: fix copy_namespace() <-> sched_fork() dependency in do_fork · 3c90e6e9
      Srivatsa Vaddagiri 提交于
      Sukadev Bhattiprolu reported a kernel crash with control groups.
      There are couple of problems discovered by Suka's test:
      
      - The test requires the cgroup filesystem to be mounted with
        atleast the cpu and ns options (i.e both namespace and cpu 
        controllers are active in the same hierarchy). 
      
      	# mkdir /dev/cpuctl
      	# mount -t cgroup -ocpu,ns none cpuctl
      	(or simply)
      	# mount -t cgroup none cpuctl -> Will activate all controllers
      					 in same hierarchy.
      
      - The test invokes clone() with CLONE_NEWNS set. This causes a a new child
        to be created, also a new group (do_fork->copy_namespaces->ns_cgroup_clone->
        cgroup_clone) and the child is attached to the new group (cgroup_clone->
        attach_task->sched_move_task). At this point in time, the child's scheduler 
        related fields are uninitialized (including its on_rq field, which it has
        inherited from parent). As a result sched_move_task thinks its on
        runqueue, when it isn't.
      
        As a solution to this problem, I moved sched_fork() call, which
        initializes scheduler related fields on a new task, before
        copy_namespaces(). I am not sure though whether moving up will
        cause other side-effects. Do you see any issue?
      
      - The second problem exposed by this test is that task_new_fair()
        assumes that parent and child will be part of the same group (which 
        needn't be as this test shows). As a result, cfs_rq->curr can be NULL
        for the child.
      
        The solution is to test for curr pointer being NULL in
        task_new_fair().
      
      With the patch below, I could run ns_exec() fine w/o a crash.
      Reported-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: NSrivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3c90e6e9
  4. 30 10月, 2007 2 次提交
  5. 20 10月, 2007 15 次提交
    • A
      Uninline fork.c/exit.c · a39bc516
      Alexey Dobriyan 提交于
      Save ~650 bytes here.
      
      add/remove: 4/0 grow/shrink: 0/7 up/down: 430/-1088 (-658)
      function                                     old     new   delta
      __copy_fs_struct                               -     202    +202
      __put_fs_struct                                -     112    +112
      __exit_fs                                      -      58     +58
      __exit_files                                   -      58     +58
      exit_files                                    58       2     -56
      put_fs_struct                                112       5    -107
      exit_fs                                      161       2    -159
      sys_unshare                                  774     590    -184
      copy_process                                4031    3840    -191
      do_exit                                     1791    1597    -194
      copy_fs_struct                               202       5    -197
      
      No difference in lmbench lat_proc tests on 2-way Opteron 246.
      Smaaaal degradation on UP P4 (within errors).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NAlexey Dobriyan <adobriyan@sw.ru>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a39bc516
    • M
      kernel/fork.c: remove unneeded variable initialization in copy_process() · a24efe62
      Mariusz Kozlowski 提交于
      This initialization of is not needed so just remove it.
      Signed-off-by: NMariusz Kozlowski <m.kozlowski@tuxland.pl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a24efe62
    • P
      Isolate the explicit usage of signal->pgrp · 9a2e7057
      Pavel Emelyanov 提交于
      The pgrp field is not used widely around the kernel so it is now marked as
      deprecated with appropriate comment.
      
      The initialization of INIT_SIGNALS is trimmed because
      a) they are set to 0 automatically;
      b) gcc cannot properly initialize two anonymous (the second one
         is the one with the session) unions. In this particular case
         to make it compile we'd have to add some field initialized
         right before the .pgrp.
      
      This is the same patch as the 1ec320af one
      (from Cedric), but for the pgrp field.
      
      Some progress report:
      
      We have to deprecate the pid, tgid, session and pgrp fields on struct
      task_struct and struct signal_struct.  The session and pgrp are already
      deprecated.  The tgid value is close to being such - the worst known usage
      in in fs/locks.c and audit code.  The pid field deprecation is mainly
      blocked by numerous printk-s around the kernel that print the tsk->pid to
      log.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a2e7057
    • E
      Fix tsk->exit_state usage · 270f722d
      Eugene Teo 提交于
      tsk->exit_state can only be 0, EXIT_ZOMBIE, or EXIT_DEAD.  A non-zero test
      is the same as tsk->exit_state & (EXIT_ZOMBIE | EXIT_DEAD), so just testing
      tsk->exit_state is sufficient.
      Signed-off-by: NEugene Teo <eugeneteo@kernel.sg>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      270f722d
    • P
      pid namespaces: changes to show virtual ids to user · b488893a
      Pavel Emelyanov 提交于
      This is the largest patch in the set. Make all (I hope) the places where
      the pid is shown to or get from user operate on the virtual pids.
      
      The idea is:
       - all in-kernel data structures must store either struct pid itself
         or the pid's global nr, obtained with pid_nr() call;
       - when seeking the task from kernel code with the stored id one
         should use find_task_by_pid() call that works with global pids;
       - when showing pid's numerical value to the user the virtual one
         should be used, but however when one shows task's pid outside this
         task's namespace the global one is to be used;
       - when getting the pid from userspace one need to consider this as
         the virtual one and use appropriate task/pid-searching functions.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: nuther build fix]
      [akpm@linux-foundation.org: yet nuther build fix]
      [akpm@linux-foundation.org: remove unneeded casts]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NAlexey Dobriyan <adobriyan@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b488893a
    • P
      pid namespaces: initialize the namespace's proc_mnt · 6f4e6433
      Pavel Emelyanov 提交于
      The namespace's proc_mnt must be kern_mount-ed to make this pointer always
      valid, independently of whether the user space mounted the proc or not.  This
      solves raced in proc_flush_task, etc.  with the proc_mnt switching from NULL
      to not-NULL.
      
      The initialization is done after the init's pid is created and hashed to make
      proc_get_sb() finr it and get for root inode.
      
      Sice the namespace holds the vfsmnt, vfsmnt holds the superblock and the
      superblock holds the namespace we must explicitly break this circle to destroy
      all the stuff.  This is done after the init of the namespace dies.  Running a
      few steps forward - when init exits it will kill all its children, so no
      proc_mnt will be needed after its death.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f4e6433
    • P
      pid namespaces: allow cloning of new namespace · 30e49c26
      Pavel Emelyanov 提交于
      When clone() is invoked with CLONE_NEWPID, create a new pid namespace and then
      create a new struct pid for the new process.  Allocate pid_t's for the new
      process in the new pid namespace and all ancestor pid namespaces.  Make the
      newly cloned process the session and process group leader.
      
      Since the active pid namespace is special and expected to be the first entry
      in pid->upid_list, preserve the order of pid namespaces.
      
      The size of 'struct pid' is dependent on the the number of pid namespaces the
      process exists in, so we use multiple pid-caches'.  Only one pid cache is
      created during system startup and this used by processes that exist only in
      init_pid_ns.
      
      When a process clones its pid namespace, we create additional pid caches as
      necessary and use the pid cache to allocate 'struct pids' for that depth.
      
      Note, that with this patch the newly created namespace won't work, since the
      rest of the kernel still uses global pids, but this is to be fixed soon.  Init
      pid namespace still works.
      
      [oleg@tv-sign.ru: merge fix]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      30e49c26
    • P
      pid namespaces: move alloc_pid() lower in copy_process() · 425fb2b4
      Pavel Emelyanov 提交于
      When we create new namespace we will need to allocate the struct pid, that
      will have one extra struct upid in array, comparing to the parent.
      
      Thus we need to know the new namespace (if any) in alloc_pid() to init this
      struct upid properly, so move the alloc_pid() call lower in copy_process().
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      425fb2b4
    • P
      pid namespaces: make alloc_pid(), free_pid() and put_pid() work with struct upid · 8ef047aa
      Pavel Emelyanov 提交于
      Each struct upid element of struct pid has to be initialized properly, i.e.
      its nr mst be allocated from appropriate pidmap and ns set to appropriate
      namespace.
      
      When allocating a new pid, we need to know the namespace this pid will live
      in, so the additional argument is added to alloc_pid().
      
      On the other hand, the rest of the kernel still uses the pid->nr and
      pid->pid_chain fields, so these ones are still initialized, but this will be
      removed soon.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ef047aa
    • P
      Make access to task's nsproxy lighter · cf7b708c
      Pavel Emelyanov 提交于
      When someone wants to deal with some other taks's namespaces it has to lock
      the task and then to get the desired namespace if the one exists.  This is
      slow on read-only paths and may be impossible in some cases.
      
      E.g.  Oleg recently noticed a race between unshare() and the (sent for
      review in cgroups) pid namespaces - when the task notifies the parent it
      has to know the parent's namespace, but taking the task_lock() is
      impossible there - the code is under write locked tasklist lock.
      
      On the other hand switching the namespace on task (daemonize) and releasing
      the namespace (after the last task exit) is rather rare operation and we
      can sacrifice its speed to solve the issues above.
      
      The access to other task namespaces is proposed to be performed
      like this:
      
           rcu_read_lock();
           nsproxy = task_nsproxy(tsk);
           if (nsproxy != NULL) {
                   / *
                     * work with the namespaces here
                     * e.g. get the reference on one of them
                     * /
           } / *
               * NULL task_nsproxy() means that this task is
               * almost dead (zombie)
               * /
           rcu_read_unlock();
      
      This patch has passed the review by Eric and Oleg :) and,
      of course, tested.
      
      [clg@fr.ibm.com: fix unshare()]
      [ebiederm@xmission.com: Update get_net_ns_by_pid]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: NCedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf7b708c
    • S
      pid namespaces: move alloc_pid() to copy_process() · a6f5e063
      Sukadev Bhattiprolu 提交于
      Move alloc_pid() into copy_process().  This will keep all pid and pid
      namespace code together and simplify error handling when we support multiple
      pid namespaces.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Herbert Poetzel <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6f5e063
    • P
      pid namespaces: round up the API · a47afb0f
      Pavel Emelianov 提交于
      The set of functions process_session, task_session, process_group and
      task_pgrp is confusing, as the names can be mixed with each other when looking
      at the code for a long time.
      
      The proposals are to
      * equip the functions that return the integer with _nr suffix to
        represent that fact,
      * and to make all functions work with task (not process) by making
        the common prefix of the same name.
      
      For monotony the routines signal_session() and set_signal_session() are
      replaced with task_session_nr() and set_task_session(), especially since they
      are only used with the explicit task->signal dereference.
      Signed-off-by: NPavel Emelianov <xemul@openvz.org>
      Acked-by: NSerge E. Hallyn <serue@us.ibm.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a47afb0f
    • P
      Task Control Groups: make cpusets a client of cgroups · 8793d854
      Paul Menage 提交于
      Remove the filesystem support logic from the cpusets system and makes cpusets
      a cgroup subsystem
      
      The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
      passed through to the cgroup filesystem with the appropriate options to
      emulate the old cpuset filesystem behaviour.
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8793d854
    • P
      Task Control Groups: shared cgroup subsystem group arrays · 817929ec
      Paul Menage 提交于
      Replace the struct css_set embedded in task_struct with a pointer; all tasks
      that have the same set of memberships across all hierarchies will share a
      css_set object, and will be linked via their css_sets field to the "tasks"
      list_head in the css_set.
      
      Assuming that many tasks share the same cgroup assignments, this reduces
      overall space usage and keeps the size of the task_struct down (three pointers
      added to task_struct compared to a non-cgroups kernel, no matter how many
      subsystems are registered).
      
      [akpm@linux-foundation.org: fix a printk]
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      817929ec
    • P
      Task Control Groups: add fork()/exit() hooks · b4f48b63
      Paul Menage 提交于
      This adds the necessary hooks to the fork() and exit() paths to ensure
      that new children inherit their parent's cgroup assignments, and that
      exiting processes release reference counts on their cgroups.
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4f48b63
  6. 19 10月, 2007 4 次提交
  7. 17 10月, 2007 4 次提交
  8. 15 10月, 2007 1 次提交
  9. 11 10月, 2007 1 次提交
    • E
      [NET]: Add network namespace clone & unshare support. · 9dd776b6
      Eric W. Biederman 提交于
      This patch allows you to create a new network namespace
      using sys_clone, or sys_unshare.
      
      As the network namespace is still experimental and under development
      clone and unshare support is only made available when CONFIG_NET_NS is
      selected at compile time.
      
      As this patch introduces network namespace support into code paths
      that exist when the CONFIG_NET is not selected there are a few
      additions made to net_namespace.h to allow a few more functions
      to be used when the networking stack is not compiled in.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9dd776b6
  10. 21 9月, 2007 1 次提交
    • D
      signalfd simplification · b8fceee1
      Davide Libenzi 提交于
      This simplifies signalfd code, by avoiding it to remain attached to the
      sighand during its lifetime.
      
      In this way, the signalfd remain attached to the sighand only during
      poll(2) (and select and epoll) and read(2).  This also allows to remove
      all the custom "tsk == current" checks in kernel/signal.c, since
      dequeue_signal() will only be called by "current".
      
      I think this is also what Ben was suggesting time ago.
      
      The external effect of this, is that a thread can extract only its own
      private signals and the group ones.  I think this is an acceptable
      behaviour, in that those are the signals the thread would be able to
      fetch w/out signalfd.
      Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8fceee1
  11. 20 7月, 2007 4 次提交
  12. 18 7月, 2007 1 次提交
    • R
      Freezer: make kernel threads nonfreezable by default · 83144186
      Rafael J. Wysocki 提交于
      Currently, the freezer treats all tasks as freezable, except for the kernel
      threads that explicitly set the PF_NOFREEZE flag for themselves.  This
      approach is problematic, since it requires every kernel thread to either
      set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't
      care for the freezing of tasks at all.
      
      It seems better to only require the kernel threads that want to or need to
      be frozen to use some freezer-related code and to remove any
      freezer-related code from the other (nonfreezable) kernel threads, which is
      done in this patch.
      
      The patch causes all kernel threads to be nonfreezable by default (ie.  to
      have PF_NOFREEZE set by default) and introduces the set_freezable()
      function that should be called by the freezable kernel threads in order to
      unset PF_NOFREEZE.  It also makes all of the currently freezable kernel
      threads call set_freezable(), so it shouldn't cause any (intentional)
      change of behaviour to appear.  Additionally, it updates documentation to
      describe the freezing of tasks more accurately.
      
      [akpm@linux-foundation.org: build fixes]
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Acked-by: NNigel Cunningham <nigel@nigel.suspend2.net>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      83144186