1. 03 12月, 2007 1 次提交
    • S
      sched: cpu accounting controller (V2) · d842de87
      Srivatsa Vaddagiri 提交于
      Commit cfb52856 removed a useful feature for
      us, which provided a cpu accounting resource controller.  This feature would be
      useful if someone wants to group tasks only for accounting purpose and doesnt
      really want to exercise any control over their cpu consumption.
      
      The patch below reintroduces the feature. It is based on Paul Menage's
      original patch (Commit 62d0df64), with
      these differences:
      
              - Removed load average information. I felt it needs more thought (esp
      	  to deal with SMP and virtualized platforms) and can be added for
      	  2.6.25 after more discussions.
              - Convert group cpu usage to be nanosecond accurate (as rest of the cfs
      	  stats are) and invoke cpuacct_charge() from the respective scheduler
      	  classes
      	- Make accounting scalable on SMP systems by splitting the usage
      	  counter to be per-cpu
      	- Move the code from kernel/cpu_acct.c to kernel/sched.c (since the
      	  code is not big enough to warrant a new file and also this rightly
      	  needs to live inside the scheduler. Also things like accessing
      	  rq->lock while reading cpu usage becomes easier if the code lived in
      	  kernel/sched.c)
      
      The patch also modifies the cpu controller not to provide the same accounting
      information.
      Tested-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      
       Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran
       some simple tests like cpuspin (spin on the cpu), ran several tasks in
       the same group and timed them. Compared their time stamps with
       cpuacct.usage.
      Signed-off-by: NSrivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d842de87
  2. 23 11月, 2007 1 次提交
  3. 15 11月, 2007 2 次提交
    • E
      pidns: Place under CONFIG_EXPERIMENTAL · 57d5f66b
      Eric W. Biederman 提交于
      This is my trivial patch to swat innumerable little bugs with a single
      blow.
      
      After some intensive review (my apologies for not having gotten to this
      sooner) what we have looks like a good base to build on with the current
      pid namespace code but it is not complete, and it is still much to simple
      to find issues where the kernel does the wrong thing outside of the initial
      pid namespace.
      
      Until the dust settles and we are certain we have the ABI and the
      implementation is as correct as humanly possible let's keep process ID
      namespaces behind CONFIG_EXPERIMENTAL.
      
      Allowing us the option of fixing any ABI or other bugs we find as long as
      they are minor.
      
      Allowing users of the kernel to avoid those bugs simply by ensuring their
      kernel does not have support for multiple pid namespaces.
      
      [akpm@linux-foundation.org: coding-style cleanups]
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Adrian Bunk <bunk@kernel.org>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Kir Kolyshkin <kir@swsoft.com>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      57d5f66b
    • A
      revert "Task Control Groups: example CPU accounting subsystem" · cfb52856
      Andrew Morton 提交于
      Revert 62d0df64.
      
      This was originally intended as a simple initial example of how to create a
      control groups subsystem; it wasn't intended for mainline, but I didn't make
      this clear enough to Andrew.
      
      The CFS cgroup subsystem now has better functionality for the per-cgroup usage
      accounting (based directly on CFS stats) than the "usage" status file in this
      patch, and the "load" status file is rather simplistic - although having a
      per-cgroup load average report would be a useful feature, I don't believe this
      patch actually provides it.  If it gets into the final 2.6.24 we'd probably
      have to support this interface for ever.
      
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cfb52856
  4. 25 10月, 2007 1 次提交
  5. 21 10月, 2007 1 次提交
    • A
      [PATCH] audit: watching subtrees · 74c3cbe3
      Al Viro 提交于
      New kind of audit rule predicates: "object is visible in given subtree".
      The part that can be sanely implemented, that is.  Limitations:
      	* if you have hardlink from outside of tree, you'd better watch
      it too (or just watch the object itself, obviously)
      	* if you mount something under a watched tree, tell audit
      that new chunk should be added to watched subtrees
      	* if you umount something in a watched tree and it's still mounted
      elsewhere, you will get matches on events happening there.  New command
      tells audit to recalculate the trees, trimming such sources of false
      positives.
      
      Note that it's _not_ about path - if something mounted in several places
      (multiple mount, bindings, different namespaces, etc.), the match does
      _not_ depend on which one we are using for access.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      74c3cbe3
  6. 20 10月, 2007 6 次提交
    • S
      Hook up group scheduler with control groups · 68318b8e
      Srivatsa Vaddagiri 提交于
      Enable "cgroup" (formerly containers) based fair group scheduling.  This
      will let administrator create arbitrary groups of tasks (using "cgroup"
      pseudo filesystem) and control their cpu bandwidth usage.
      
      [akpm@linux-foundation.org: fix cpp condition]
      Signed-off-by: NSrivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: NDhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68318b8e
    • S
      cgroups: implement namespace tracking subsystem · 858d72ea
      Serge E. Hallyn 提交于
      When a task enters a new namespace via a clone() or unshare(), a new cgroup
      is created and the task moves into it.
      
      This version names cgroups which are automatically created using
      cgroup_clone() as "node_<pid>" where pid is the pid of the unsharing or
      cloned process.  (Thanks Pavel for the idea) This is safe because if the
      process unshares again, it will create
      
      	/cgroups/(...)/node_<pid>/node_<pid>
      
      The only possibilities (AFAICT) for a -EEXIST on unshare are
      
      	1. pid wraparound
      	2. a process fails an unshare, then tries again.
      
      Case 1 is unlikely enough that I ignore it (at least for now).  In case 2, the
      node_<pid> will be empty and can be rmdir'ed to make the subsequent unshare()
      succeed.
      
      Changelog:
      	Name cloned cgroups as "node_<pid>".
      
      [clg@fr.ibm.com: fix order of cgroup subsystems in init/Kconfig]
      Signed-off-by: NSerge E. Hallyn <serue@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NCedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      858d72ea
    • P
      Task Control Groups: simple task cgroup debug info subsystem · 006cb992
      Paul Menage 提交于
      This example subsystem exports debugging information as an aid to diagnosing
      refcount leaks, etc, in the cgroup framework.
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      006cb992
    • P
      Task Control Groups: example CPU accounting subsystem · 62d0df64
      Paul Menage 提交于
      This example demonstrates how to use the generic cgroup subsystem for a
      simple resource tracker that counts, for the processes in a cgroup, the
      total CPU time used and the %CPU used in the last complete 10 second interval.
      
      Portions contributed by Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      62d0df64
    • P
      Task Control Groups: make cpusets a client of cgroups · 8793d854
      Paul Menage 提交于
      Remove the filesystem support logic from the cpusets system and makes cpusets
      a cgroup subsystem
      
      The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
      passed through to the cgroup filesystem with the appropriate options to
      emulate the old cpuset filesystem behaviour.
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8793d854
    • P
      Task Control Groups: basic task cgroup framework · ddbcc7e8
      Paul Menage 提交于
      Generic Process Control Groups
      --------------------------
      
      There have recently been various proposals floating around for
      resource management/accounting and other task grouping subsystems in
      the kernel, including ResGroups, User BeanCounters, NSProxy
      cgroups, and others.  These all need the basic abstraction of being
      able to group together multiple processes in an aggregate, in order to
      track/limit the resources permitted to those processes, or control
      other behaviour of the processes, and all implement this grouping in
      different ways.
      
      This patchset provides a framework for tracking and grouping processes
      into arbitrary "cgroups" and assigning arbitrary state to those
      groupings, in order to control the behaviour of the cgroup as an
      aggregate.
      
      The intention is that the various resource management and
      virtualization/cgroup efforts can also become task cgroup
      clients, with the result that:
      
      - the userspace APIs are (somewhat) normalised
      
      - it's easier to test e.g. the ResGroups CPU controller in
       conjunction with the BeanCounters memory controller, or use either of
      them as the resource-control portion of a virtual server system.
      
      - the additional kernel footprint of any of the competing resource
       management systems is substantially reduced, since it doesn't need
       to provide process grouping/containment, hence improving their
       chances of getting into the kernel
      
      This patch:
      
      Add the main task cgroups framework - the cgroup filesystem, and the
      basic structures for tracking membership and associating subsystem state
      objects to tasks.
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ddbcc7e8
  7. 17 10月, 2007 3 次提交
  8. 15 10月, 2007 6 次提交
  9. 20 9月, 2007 1 次提交
  10. 01 8月, 2007 2 次提交
  11. 26 7月, 2007 1 次提交
  12. 18 7月, 2007 1 次提交
  13. 17 7月, 2007 4 次提交
  14. 10 7月, 2007 1 次提交
  15. 17 5月, 2007 2 次提交
  16. 11 5月, 2007 5 次提交
    • D
      signal/timer/event: eventfd core · e1ad7468
      Davide Libenzi 提交于
      This is a very simple and light file descriptor, that can be used as event
      wait/dispatch by userspace (both wait and dispatch) and by the kernel
      (dispatch only).  It can be used instead of pipe(2) in all cases where those
      would simply be used to signal events.  Their kernel overhead is much lower
      than pipes, and they do not consume two fds.  When used in the kernel, it can
      offer an fd-bridge to enable, for example, functionalities like KAIO or
      syslets/threadlets to signal to an fd the completion of certain operations.
      But more in general, an eventfd can be used by the kernel to signal readiness,
      in a POSIX poll/select way, of interfaces that would otherwise be incompatible
      with it.  The API is:
      
      int eventfd(unsigned int count);
      
      The eventfd API accepts an initial "count" parameter, and returns an eventfd
      fd.  It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).
      
      The POLLIN flag is raised when the internal counter is greater than zero.
      
      The POLLOUT flag is raised when at least a value of "1" can be written to the
      internal counter.
      
      The POLLERR flag is raised when an overflow in the counter value is detected.
      
      The write(2) operation can never overflow the counter, since it blocks (unless
      O_NONBLOCK is set, in which case -EAGAIN is returned).
      
      But the eventfd_signal() function can do it, since it's supposed to not sleep
      during its operation.
      
      The read(2) function reads the __u64 counter value, and reset the internal
      value to zero.  If the value read is equal to (__u64) -1, an overflow happened
      on the internal counter (due to 2^64 eventfd_signal() posts that has never
      been retired - unlickely, but possible).
      
      The write(2) call writes an __u64 count value, and adds it to the current
      counter.  The eventfd fd supports O_NONBLOCK also.
      
      On the kernel side, we have:
      
      struct file *eventfd_fget(int fd);
      int eventfd_signal(struct file *file, unsigned int n);
      
      The eventfd_fget() should be called to get a struct file* from an eventfd fd
      (this is an fget() + check of f_op being an eventfd fops pointer).
      
      The kernel can then call eventfd_signal() every time it wants to post an event
      to userspace.  The eventfd_signal() function can be called from any context.
      An eventfd() simple test and bench is available here:
      
      http://www.xmailserver.org/eventfd-bench.c
      
      This is the eventfd-based version of pipetest-4 (pipe(2) based):
      
      http://www.xmailserver.org/pipetest-4.c
      
      Not that performance matters much in the eventfd case, but eventfd-bench
      shows almost as double as performance than pipetest-4.
      
      [akpm@linux-foundation.org: fix i386 build]
      [akpm@linux-foundation.org: add sys_eventfd to sys_ni.c]
      Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1ad7468
    • D
      signal/timer/event: timerfd core · b215e283
      Davide Libenzi 提交于
      This patch introduces a new system call for timers events delivered though
      file descriptors.  This allows timer event to be used with standard POSIX
      poll(2), select(2) and read(2).  As a consequence of supporting the Linux
      f_op->poll subsystem, they can be used with epoll(2) too.
      
      The system call is defined as:
      
      int timerfd(int ufd, int clockid, int flags, const struct itimerspec *utmr);
      
      The "ufd" parameter allows for re-use (re-programming) of an existing timerfd
      w/out going through the close/open cycle (same as signalfd).  If "ufd" is -1,
      s new file descriptor will be created, otherwise the existing "ufd" will be
      re-programmed.
      
      The "clockid" parameter is either CLOCK_MONOTONIC or CLOCK_REALTIME.  The time
      specified in the "utmr->it_value" parameter is the expiry time for the timer.
      
      If the TFD_TIMER_ABSTIME flag is set in "flags", this is an absolute time,
      otherwise it's a relative time.
      
      If the time specified in the "utmr->it_interval" is not zero (.tv_sec == 0,
      tv_nsec == 0), this is the period at which the following ticks should be
      generated.
      
      The "utmr->it_interval" should be set to zero if only one tick is requested.
      Setting the "utmr->it_value" to zero will disable the timer, or will create a
      timerfd without the timer enabled.
      
      The function returns the new (or same, in case "ufd" is a valid timerfd
      descriptor) file, or -1 in case of error.
      
      As stated before, the timerfd file descriptor supports poll(2), select(2) and
      epoll(2).  When a timer event happened on the timerfd, a POLLIN mask will be
      returned.
      
      The read(2) call can be used, and it will return a u32 variable holding the
      number of "ticks" that happened on the interface since the last call to
      read(2).  The read(2) call supportes the O_NONBLOCK flag too, and EAGAIN will
      be returned if no ticks happened.
      
      A quick test program, shows timerfd working correctly on my amd64 box:
      
      http://www.xmailserver.org/timerfd-test.c
      
      [akpm@linux-foundation.org: add sys_timerfd to sys_ni.c]
      Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b215e283
    • D
      signal/timer/event: signalfd core · fba2afaa
      Davide Libenzi 提交于
      This patch series implements the new signalfd() system call.
      
      I took part of the original Linus code (and you know how badly it can be
      broken :), and I added even more breakage ;) Signals are fetched from the same
      signal queue used by the process, so signalfd will compete with standard
      kernel delivery in dequeue_signal().  If you want to reliably fetch signals on
      the signalfd file, you need to block them with sigprocmask(SIG_BLOCK).  This
      seems to be working fine on my Dual Opteron machine.  I made a quick test
      program for it:
      
      http://www.xmailserver.org/signafd-test.c
      
      The signalfd() system call implements signal delivery into a file descriptor
      receiver.  The signalfd file descriptor if created with the following API:
      
      int signalfd(int ufd, const sigset_t *mask, size_t masksize);
      
      The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
      to close/create cycle (Linus idea).  Use "ufd" == -1 if you want a brand new
      signalfd file.
      
      The "mask" allows to specify the signal mask of signals that we are interested
      in.  The "masksize" parameter is the size of "mask".
      
      The signalfd fd supports the poll(2) and read(2) system calls.  The poll(2)
      will return POLLIN when signals are available to be dequeued.  As a direct
      consequence of supporting the Linux poll subsystem, the signalfd fd can use
      used together with epoll(2) too.
      
      The read(2) system call will return a "struct signalfd_siginfo" structure in
      the userspace supplied buffer.  The return value is the number of bytes copied
      in the supplied buffer, or -1 in case of error.  The read(2) call can also
      return 0, in case the sighand structure to which the signalfd was attached,
      has been orphaned.  The O_NONBLOCK flag is also supported, and read(2) will
      return -EAGAIN in case no signal is available.
      
      If the size of the buffer passed to read(2) is lower than sizeof(struct
      signalfd_siginfo), -EINVAL is returned.  A read from the signalfd can also
      return -ERESTARTSYS in case a signal hits the process.  The format of the
      struct signalfd_siginfo is, and the valid fields depends of the (->code &
      __SI_MASK) value, in the same way a struct siginfo would:
      
      struct signalfd_siginfo {
      	__u32 signo;	/* si_signo */
      	__s32 err;	/* si_errno */
      	__s32 code;	/* si_code */
      	__u32 pid;	/* si_pid */
      	__u32 uid;	/* si_uid */
      	__s32 fd;	/* si_fd */
      	__u32 tid;	/* si_fd */
      	__u32 band;	/* si_band */
      	__u32 overrun;	/* si_overrun */
      	__u32 trapno;	/* si_trapno */
      	__s32 status;	/* si_status */
      	__s32 svint;	/* si_int */
      	__u64 svptr;	/* si_ptr */
      	__u64 utime;	/* si_utime */
      	__u64 stime;	/* si_stime */
      	__u64 addr;	/* si_addr */
      };
      
      [akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
      Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fba2afaa
    • D
      signal/timer/event fds: anonymous inode source · 5dc8bf81
      Davide Libenzi 提交于
      This patch add an anonymous inode source, to be used for files that need
      and inode only in order to create a file*. We do not care of having an
      inode for each file, and we do not even care of having different names in
      the associated dentries (dentry names will be same for classes of file*).
      This allow code reuse, and will be used by epoll, signalfd and timerfd
      (and whatever else there'll be).
      Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5dc8bf81
    • C
      SLUB: SLUB_DEBUG must depend on SLUB · d4751a27
      Christoph Lameter 提交于
      Otherwise people get asked about SLUB_DEBUG even if they have another
      slab allocator enabled.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d4751a27
  17. 10 5月, 2007 2 次提交