1. 30 Nov, 2010 1 commit
    • M
      sched: Add 'autogroup' scheduling feature: automated per session task groups · 5091faa4
      Mike Galbraith committed
      A recurring complaint from CFS users is that parallel kbuild has
      a negative impact on desktop interactivity.  This patch
      implements an idea from Linus, to automatically create task
      groups.  Currently, only per session autogroups are implemented,
      but the patch leaves the way open for enhancement.
      
      Implementation: each task's signal struct contains an inherited
      pointer to a refcounted autogroup struct containing a task group
      pointer, the default for all tasks pointing to the
      init_task_group.  When a task calls setsid(), a new task group
      is created, the process is moved into the new task group, and a
      reference to the previous task group is dropped.  Child
      processes inherit this task group thereafter, and increase its
      refcount.  When the last thread of a process exits, the
      process's reference is dropped, such that when the last process
      referencing an autogroup exits, the autogroup is destroyed.
      
      At runqueue selection time, IFF a task has no cgroup assignment,
      its current autogroup is used.
      
      Autogroup bandwidth is controllable via setting its nice level
      through the proc filesystem:
      
        cat /proc/<pid>/autogroup
      
      Displays the task's group and the group's nice level.
      
        echo <nice level> > /proc/<pid>/autogroup
      
      Sets the task group's shares to the weight of a nice <level> task.
      Setting nice level is rate limited for !admin users due to the
      abuse risk of task group locking.
      
      The feature is enabled from boot by default if
      CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
      the boot option noautogroup, and can also be turned on/off on
      the fly via:
      
        echo [01] > /proc/sys/kernel/sched_autogroup_enabled
      
      ... which will automatically move tasks to/from the root task group.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      5091faa4
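A minimal userspace sketch of reading the per-task autogroup file described above. The commit message only says the file shows the task's group and the group's nice level; the exact line format `/autogroup-<id> nice <level>` is an assumption here, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Parse one line of the ASSUMED form "/autogroup-<id> nice <level>".
 * Returns 0 on success, -1 if the line does not match. */
static int parse_autogroup_line(const char *line, long *id, int *nice_level)
{
    assert(line != NULL && id != NULL && nice_level != NULL);
    return sscanf(line, "/autogroup-%ld nice %d", id, nice_level) == 2 ? 0 : -1;
}

/* Read the calling task's autogroup from /proc/self/autogroup.
 * Returns -1 if the file is missing (e.g. CONFIG_SCHED_AUTOGROUP is off)
 * or its content does not match the assumed format. */
static int read_own_autogroup(long *id, int *nice_level)
{
    char buf[128];
    int got_line;
    FILE *f = fopen("/proc/self/autogroup", "r");

    if (f == NULL)
        return -1;
    got_line = fgets(buf, sizeof buf, f) != NULL;
    fclose(f);
    return got_line ? parse_autogroup_line(buf, id, nice_level) : -1;
}
```

Writing a nice level back works the same way as the `echo` form shown in the commit message: open the file for writing and print the number into it.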
  2. 01 Sep, 2010 1 commit
    • P
      pid: make setpgid() system call use RCU read-side critical section · 950eaaca
      Paul E. McKenney committed
      [   23.584719]
      [   23.584720] ===================================================
      [   23.585059] [ INFO: suspicious rcu_dereference_check() usage. ]
      [   23.585176] ---------------------------------------------------
      [   23.585176] kernel/pid.c:419 invoked rcu_dereference_check() without protection!
      [   23.585176]
      [   23.585176] other info that might help us debug this:
      [   23.585176]
      [   23.585176]
      [   23.585176] rcu_scheduler_active = 1, debug_locks = 1
      [   23.585176] 1 lock held by rc.sysinit/728:
      [   23.585176]  #0:  (tasklist_lock){.+.+..}, at: [<ffffffff8104771f>] sys_setpgid+0x5f/0x193
      [   23.585176]
      [   23.585176] stack backtrace:
      [   23.585176] Pid: 728, comm: rc.sysinit Not tainted 2.6.36-rc2 #2
      [   23.585176] Call Trace:
      [   23.585176]  [<ffffffff8105b436>] lockdep_rcu_dereference+0x99/0xa2
      [   23.585176]  [<ffffffff8104c324>] find_task_by_pid_ns+0x50/0x6a
      [   23.585176]  [<ffffffff8104c35b>] find_task_by_vpid+0x1d/0x1f
      [   23.585176]  [<ffffffff81047727>] sys_setpgid+0x67/0x193
      [   23.585176]  [<ffffffff810029eb>] system_call_fastpath+0x16/0x1b
      [   24.959669] type=1400 audit(1282938522.956:4): avc:  denied  { module_request } for  pid=766 comm="hwclock" kmod="char-major-10-135" scontext=system_u:system_r:hwclock_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclas
      
      It turns out that the setpgid() system call fails to enter an RCU
      read-side critical section before doing a PID-to-task_struct translation.
      This commit therefore does rcu_read_lock() before the translation, and
      also does rcu_read_unlock() after the last use of the returned pointer.
      Reported-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: David Howells <dhowells@redhat.com>
      950eaaca
  3. 16 Jul, 2010 9 commits
    • J
      rlimits: implement prlimit64 syscall · c022a0ac
      Jiri Slaby committed
      This patch adds the code to support the sys_prlimit64 syscall which
      modifies-and-returns the rlim values of a selected process atomically.
      The first parameter, pid, being 0 means current process.
      
      Unlike the current implementation, it is a generic interface,
      architecture independent, so that we needn't handle compat stuff
      anymore. In the future, after glibc starts to use this, we can
      deprecate sys_setrlimit and sys_getrlimit and finally clean up the code.
      
      It also adds a possibility of changing limits of other processes. We
      check the user's permissions to do that and if it succeeds, the new
      limits are propagated online. This is good for large scale
      applications such as SAP or databases where administrators need to
      change limits from time to time (e.g. increase the core size after
      crashes) and where restarting the service is unacceptable.
      
      For safety, all rlim users now either use accessors or don't need
      them due to
      - locking
      - the fact a process was just forked and nobody else knows about it
        yet (and thus nobody can read/write limits)
      hence it is safe to modify limits now.
      
      The limitation is that we currently stay with the ulong internal
      representation. So the rlim64_is_infinity check is used where value is
      compared against ULONG_MAX on 32-bit which is the maximum value there.
      
      And since internally the limits are held in struct rlimit, converters
      which are used before and after do_prlimit call in sys_prlimit64 are
      introduced.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      c022a0ac
    • J
      rlimits: switch more rlimit syscalls to do_prlimit · b9518345
      Jiri Slaby committed
      After we added the more generic do_prlimit, switch sys_getrlimit to that.
      Also switch compat handling, so we can get rid of ugly __user casts
      and avoid setting process' address limit to kernel data and back.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      b9518345
    • J
      rlimits: redo do_setrlimit to more generic do_prlimit · 5b41535a
      Jiri Slaby committed
      It now also allows reading of limits, i.e. all reads and writes will
      later use this function.
      
      It takes two parameters, new and old limits which can be both NULL.
      If new is non-NULL, the value in it is set to rlimits.
      If old is non-NULL, current rlimits are stored there.
      If both are non-NULL, old are stored prior to setting the new ones,
      atomically.
      (Similar to sigaction.)
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      5b41535a
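The new/old semantics above can be modelled in plain userspace C. This is not the kernel's do_prlimit; `struct task_model`, `struct limit` and `model_prlimit` are hypothetical names, a pthread mutex stands in for task_lock, and the capability check that lets privileged callers raise the hard limit is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

struct limit {
    unsigned long cur;   /* soft limit */
    unsigned long max;   /* hard limit */
};

struct task_model {
    pthread_mutex_t lock;   /* stands in for task_lock */
    struct limit rlim;
};

/* If old_lim is non-NULL the current limits are stored there; if new_lim
 * is non-NULL it replaces them. When both are given, the old value is
 * captured before the new one is written, all under one lock. */
static int model_prlimit(struct task_model *tsk, const struct limit *new_lim,
                         struct limit *old_lim)
{
    int ret = 0;

    assert(tsk != NULL);
    if (new_lim != NULL && new_lim->cur > new_lim->max)
        return -EINVAL;

    pthread_mutex_lock(&tsk->lock);
    /* The "hard limit never grows" check happens under the lock, so a
     * concurrently lowered maximum is always the one compared against. */
    if (new_lim != NULL && new_lim->max > tsk->rlim.max)
        ret = -EPERM;
    if (ret == 0) {
        if (old_lim != NULL)
            *old_lim = tsk->rlim;
        if (new_lim != NULL)
            tsk->rlim = *new_lim;
    }
    pthread_mutex_unlock(&tsk->lock);
    return ret;
}
```

Calling it with only `old_lim` behaves like a read, with only `new_lim` like a write, and with both like an exchange, which mirrors the sigaction comparison in the commit message.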
    • J
      rlimits: do security check under task_lock · 86f162f4
      Jiri Slaby committed
      Do security_task_setrlimit under task_lock. Other tasks may change
      limits under our hands while we are checking limits inside the
      function. From now on, they can't.
      
      Note that all the security work is done under a spinlock here now.
      Security hooks count on that, they are called from interrupt context
      (like security_task_kill) and with spinlocks already held (e.g.
      capable->security_capable).
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Acked-by: James Morris <jmorris@namei.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      86f162f4
    • J
      rlimits: allow setrlimit to non-current tasks · 1c1e618d
      Jiri Slaby committed
      Add locking to allow setrlimit to accept a task parameter other than
      current.
      
      Namely, lock tasklist_lock for read and check whether the task
      structure has sighand non-null. Do all the signal processing with
      that lock still held.
      
      There are some points:
      1) security_task_setrlimit is now called with that lock held. This is
         not new, many security_* functions are called with this lock held
         already so it doesn't harm (all this security_* stuff does almost
         the same).
      2) task->sighand->siglock (in update_rlimit_cpu) is nested in
         tasklist_lock. This dependency already exists.
      3) tsk->alloc_lock is nested in tasklist_lock. This is OK too, as the
         dependency already exists.
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      1c1e618d
    • J
      rlimits: split sys_setrlimit · 7855c35d
      Jiri Slaby committed
      Create do_setrlimit from sys_setrlimit and declare do_setrlimit
      in the resource header. This is the first step toward a generic
      do_prlimit that can be called from the read, write and compat
      rlimits code.
      
      The new do_setrlimit also accepts a task pointer to change the limits
      of. Currently, it cannot be other than current, but this will change
      with locking later.
      
      Also pass tsk->group_leader to security_task_setrlimit to check
      whether current is allowed to change rlimits of the process and not
      its arbitrary thread because it makes more sense given that rlimits are
      per-process and not per-thread.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      7855c35d
    • O
      rlimits: make sure ->rlim_max never grows in sys_setrlimit · 2fb9d268
      Oleg Nesterov committed
      Mostly preparation for Jiri's changes, but probably makes sense anyway.
      
      sys_setrlimit() checks new_rlim.rlim_max <= old_rlim->rlim_max, but when
      it takes task_lock() old_rlim->rlim_max can be already lowered. Move this
      check under task_lock().
      
      Currently this is not important, we can only race with our sub-thread,
      this means the application is stupid. But when we change the code to allow
      the update of !current task's limits, it becomes important to make sure
      ->rlim_max can be lowered "reliably" even if we race with the application
      doing sys_setrlimit().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      2fb9d268
    • J
      rlimits: add task_struct to update_rlimit_cpu · 5ab46b34
      Jiri Slaby committed
      Add task_struct as a parameter to update_rlimit_cpu to be able to set
      rlimit_cpu of a task other than current.
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Acked-by: James Morris <jmorris@namei.org>
      5ab46b34
    • J
      rlimits: security, add task_struct to setrlimit · 8fd00b4d
      Jiri Slaby committed
      Add task_struct to task_setrlimit of security_operations to be able to set
      the rlimit of a task other than current.
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Acked-by: Eric Paris <eparis@redhat.com>
      Acked-by: James Morris <jmorris@namei.org>
      8fd00b4d
  4. 28 May, 2010 1 commit
    • N
      kmod: add init function to usermodehelper · a06a4dc3
      Neil Horman committed
      About 6 months ago, I made a set of changes to how the core-dump-to-a-pipe
      feature in the kernel works.  We had reports of several races, including
      some reports of apps bypassing our recursion check so that a process that
      was forked as part of a core_pattern setup could infinitely crash and
      refork until the system crashed.
      
      We fixed those by improving our recursion checks.  The new check basically
      refuses to fork a process if its core limit is zero, which works well.
      
      Unfortunately, I've been getting grief from maintainers of user space
      programs that are inserted as the forked process of core_pattern.  They
      contend that in order for their programs (such as abrt and apport) to
      work, all the running processes in a system must have their core limits
      set to a non-zero value, to which I say 'yes'.  I did this by design, and
      think that's the right way to do things.
      
      But I've been asked to ease this burden on user space enough times that I
      thought I would take a look at it.  The first suggestion was to make the
      recursion check fail on a non-zero 'special' number, like one.  That way
      the core collector process could set its core size ulimit to 1, and enable
      the kernel's recursion detection.  This isn't a bad idea on the surface,
      but I don't like it since it's opt-in, in that if a program like abrt or
      apport has a bug and fails to set such a core limit, we're left with a
      recursively crashing system again.
      
      So I've come up with this.  What I've done is modify the
      call_usermodehelper api such that an extra parameter is added, a function
      pointer which will be called by the user helper task, after it forks, but
      before it exec's the required process.  This will give the caller the
      opportunity to get a call back in the processes context, allowing it to do
      whatever it needs to do to the process in the kernel prior to exec-ing the
      user space code.  In the case of do_coredump, this callback is used to set
      the core ulimit of the helper process to 1.  This eliminates the opt-in
      problem that I had above, as it allows the ulimit for core sizes to be set
      to the value of 1, which is what the recursion check looks for in
      do_coredump.
      
      This patch:
      
      Create new function call_usermodehelper_fns() and allow it to assign both
      an init and cleanup function, as well as arbitrary data.
      
      The init function is called from the context of the forked process and
      allows for customization of the helper process prior to calling exec.  Its
      return code gates the continuation of the process, or causes its exit.
      Also add an arbitrary data pointer to the subprocess_info struct allowing
      for data to be passed from the caller to the new process, and the
      subsequent cleanup process.
      
      Also, use this patch to clean up the cleanup function.  It currently takes
      an argp and envp pointer for freeing, which is ugly.  Let's instead just
      make the subprocess_info structure public, and pass that to the cleanup
      and init routines.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a06a4dc3
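The init-callback idea above has a straightforward userspace analogue: fork, run a caller-supplied function in the child, and exec only if it returns 0. The sketch below is not the kernel's call_usermodehelper_fns(); `spawn_helper`, `helper_init_fn`, `limit_core_to_one` and `refuse_helper` are hypothetical names, and the exit codes 126 and 127 are an arbitrary convention chosen for this example.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef int (*helper_init_fn)(void *data);

/* Fork a child, run init(data) in the child's context, and exec `path`
 * only if init returned 0. Returns the child's wait status, or -1 if
 * fork/waitpid failed. */
static int spawn_helper(const char *path, char *const argv[],
                        helper_init_fn init, void *data)
{
    int status;
    pid_t pid;

    assert(path != NULL && argv != NULL);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (init != NULL && init(data) != 0)
            _exit(126);          /* init's return code gates the exec */
        execv(path, argv);
        _exit(127);              /* exec itself failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}

/* Example init: set the core limit to 1, as the commit message describes
 * for do_coredump. Unlike the kernel, an unprivileged process cannot
 * raise a hard limit that is already below 1, so this may fail. */
static int limit_core_to_one(void *data)
{
    struct rlimit rl = { .rlim_cur = 1, .rlim_max = 1 };

    (void)data;
    return setrlimit(RLIMIT_CORE, &rl);
}

/* Example init that always refuses, so the helper is never exec'ed. */
static int refuse_helper(void *data)
{
    (void)data;
    return -1;
}
```

Because the callback runs after fork but before exec, whatever it changes (limits, credentials, file descriptors) applies to the helper only and never to the caller.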
  5. 25 Apr, 2010 1 commit
  6. 12 Apr, 2010 2 commits
  7. 30 Mar, 2010 1 commit
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo committed
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch a large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surroundings.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  8. 13 Mar, 2010 2 commits
    • C
      Add generic sys_olduname() · 5cacdb4a
      Christoph Hellwig committed
      Add generic implementations of the old and really old uname system calls.
      Note that sh only implements sys_olduname but not sys_oldolduname, but I'm
      not going to bother with another ifdef for that special case.
      
      m32r implemented an old uname but never wired it up, so kill it, too.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Andreas Schwab <schwab@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5cacdb4a
    • C
      improve sys_newuname() for compat architectures · e28cbf22
      Christoph Hellwig committed
      On an architecture that supports 32-bit compat we need to override the
      reported machine in uname with the 32-bit value.  Instead of doing this
      separately in every architecture introduce a COMPAT_UTS_MACHINE define in
      <asm/compat.h> and apply it directly in sys_newuname().
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Andreas Schwab <schwab@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e28cbf22
  9. 07 Mar, 2010 1 commit
  10. 23 Feb, 2010 1 commit
  11. 21 Jan, 2010 1 commit
  12. 16 Dec, 2009 1 commit
  13. 11 Dec, 2009 1 commit
    • T
      sys: Fix missing rcu protection for __task_cred() access · d4581a23
      Thomas Gleixner committed
      commit c69e8d9c (CRED: Use RCU to access another task's creds and to
      release a task's own creds) added non rcu_read_lock() protected access
      to task creds of the target task in set_prio_one().
      
      The comment above the function says:
       * - the caller must hold the RCU read lock
      
      The calling code in sys_setpriority does read_lock(&tasklist_lock) but
      not rcu_read_lock(). This works only when CONFIG_TREE_PREEMPT_RCU=n.
      With CONFIG_TREE_PREEMPT_RCU=y the rcu_callbacks can run in the tick
      interrupt when they see no read side critical section.
      
      There is another instance of __task_cred() in sys_setpriority() itself
      which is equally unprotected.
      
      Wrap the whole code section into a rcu read side critical section to
      fix this quick and dirty.
      
      Will be revisited in course of the read_lock(&tasklist_lock) -> rcu
      crusade.
      
      Oleg noted further:
      
      This also fixes another bug here. find_task_by_vpid() is not safe
      without rcu_read_lock(). I do not mean it is not safe to use the
      result, just find_pid_ns() by itself is not safe.
      
      Usually tasklist gives enough protection, but if copy_process() fails
      it calls free_pid() lockless and does call_rcu(delayed_put_pid).
      This means, without rcu lock find_pid_ns() can't scan the hash table
      safely.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <20091210004703.029784964@linutronix.de>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      d4581a23
  14. 03 Dec, 2009 1 commit
    • H
      sched, cputime: Introduce thread_group_times() · 0cf55e1e
      Hidetoshi Seto committed
      This is a real fix for the problem of utime/stime values decreasing
      described in the thread:
      
         http://lkml.org/lkml/2009/11/3/522
      
      Now cputime is accounted in the following way:
      
       - {u,s}time in task_struct are increased every time the thread
         is interrupted by a tick (timer interrupt).
      
       - When a thread exits, its {u,s}time are added to signal->{u,s}time,
         after being adjusted by task_times().
      
       - When all threads in a thread_group exit, the accumulated {u,s}time
         (and also c{u,s}time) in signal struct are added to c{u,s}time
         in signal struct of the group's parent.
      
      So {u,s}time in task struct are "raw" tick counts, while
      {u,s}time and c{u,s}time in signal struct are "adjusted" values.
      
      And accounted values are used by:
      
       - task_times(), to get cputime of a thread:
         This function returns adjusted values that originate from the raw
         {u,s}time, scaled by the sum_exec_runtime accounted by CFS.
      
       - thread_group_cputime(), to get cputime of a thread group:
         This function returns the sum of all {u,s}time of living threads in
         the group, plus {u,s}time in the signal struct, which is the sum of
         adjusted cputimes of all exited threads that belonged to the group.
      
      The problem is the return value of thread_group_cputime(),
      because it is a mixed sum of "raw" values and "adjusted" values:
      
        group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)
      
      This misbehavior can break {u,s}time monotonicity.
      Assume there is a thread whose raw values are greater than its
      adjusted values (e.g. interrupted by 1000Hz ticks 50 times
      but only ran for 45ms): if it exits, cputime will decrease (e.g.
      by 5ms).
      
      To fix this, we could do:
      
        group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)
      
      But task_times() contains hard divisions, so applying it for
      every thread should be avoided.
      
      This patch fixes the above problem in the following way:
      
       - Modify thread's exit (= __exit_signal()) not to use task_times().
         It means {u,s}time in signal struct accumulates raw values instead
         of adjusted values.  As a result, thread_group_cputime()
         returns a pure sum of "raw" values.
      
       - Introduce a new function thread_group_times(*task, *utime, *stime)
         that converts the "raw" values of thread_group_cputime() to "adjusted"
         values, using the same calculation procedure as task_times().
      
       - Modify group's exit (= wait_task_zombie()) to use the newly introduced
         thread_group_times().  This makes c{u,s}time in signal struct
         hold adjusted values, as before this patch.
      
       - Replace some thread_group_cputime() calls by thread_group_times().
         These replacements are only applied where the "adjusted" cputime is
         conveyed to users, and where task_times() is already used nearby.
         (i.e. sys_times(), getrusage(), and /proc/<PID>/stat.)
      
      This patch has a positive side effect:
      
       - Before this patch, if a group contained many short-lived threads
         (e.g. each runs 0.9ms and is not interrupted by ticks), the group's
         cputime could be invisible since each thread's cputime was accumulated
         after being adjusted: imagine the adjustment function as adj(ticks, runtime),
           {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
         After this patch this will not happen because the adjustment is
         applied after accumulation.
      
      v2:
       - remove if()s, put new variables into signal_struct.
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Spencer Candland <spencer@bluehost.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      LKML-Reference: <4B162517.8040909@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      0cf55e1e
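The arithmetic in the commit message above can be replayed with a small userspace model. This is a simplification and not the kernel code: it assumes HZ=1000 (one tick per millisecond), splits the scheduler-measured runtime by the raw utime/stime ratio, and ignores the kernel's extra monotonicity clamping. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NS_PER_TICK 1000000ULL   /* assumes HZ=1000 */

struct times {
    uint64_t utime;   /* in ticks */
    uint64_t stime;   /* in ticks */
};

struct thread_sample {
    uint64_t utime;        /* raw user ticks */
    uint64_t stime;        /* raw system ticks */
    uint64_t runtime_ns;   /* runtime as measured by the scheduler */
};

/* adj(ticks, runtime): rescale raw tick counts so that their total
 * equals the measured runtime, rounded down to whole ticks. */
static struct times adjust_times(uint64_t raw_utime, uint64_t raw_stime,
                                 uint64_t runtime_ns)
{
    uint64_t rtime = runtime_ns / NS_PER_TICK;
    uint64_t total = raw_utime + raw_stime;
    struct times out;

    out.utime = total ? rtime * raw_utime / total : rtime;
    out.stime = rtime - out.utime;
    return out;
}

/* Old scheme: adjust every exited thread first, then accumulate. */
static uint64_t accumulate_adjusted(const struct thread_sample *t, size_t n)
{
    uint64_t sum = 0;

    assert(t != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct times a = adjust_times(t[i].utime, t[i].stime, t[i].runtime_ns);
        sum += a.utime + a.stime;
    }
    return sum;
}

/* New scheme: accumulate raw values, then adjust the totals once. */
static uint64_t adjust_accumulated(const struct thread_sample *t, size_t n)
{
    uint64_t u = 0, s = 0, r = 0;
    struct times a;

    assert(t != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        u += t[i].utime;
        s += t[i].stime;
        r += t[i].runtime_ns;
    }
    a = adjust_times(u, s, r);
    return a.utime + a.stime;
}
```

In this model, a thread with 50 raw user ticks and 45ms of runtime adjusts to 45 ticks, which is the 5ms drop the message describes when a raw value is swapped for an adjusted one. Ten threads of 0.9ms each sum to 0 ticks under the old scheme and to 9 ticks under the new one.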
  15. 26 Nov, 2009 1 commit
    • H
      sched: Introduce task_times() to replace task_{u,s}time() pair · d180c5bc
      Hidetoshi Seto committed
      Functions task_{u,s}time() are called in pairs in almost all
      cases.  However task_stime() is implemented to call task_utime()
      from its inside, so such paired calls run task_utime() twice.
      
      It means we do heavy divisions (div_u64 + do_div) twice to get
      utime and stime which can be obtained at same time by one set
      of divisions.
      
      This patch introduces a function task_times(*tsk, *utime,
      *stime) to retrieve utime and stime at once in a better, optimized
      way.
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Spencer Candland <spencer@bluehost.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      LKML-Reference: <4B0E16AE.906@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d180c5bc
  16. 29 Oct, 2009 1 commit
    • C
      connector: fix regression introduced by sid connector · 0d0df599
      Christian Borntraeger committed
      Since commit 02b51df1 (proc connector: add
      event for process becoming session leader) we have the following warning:
      
      Badness at kernel/softirq.c:143
      [...]
      Krnl PSW : 0404c00180000000 00000000001481d4 (local_bh_enable+0xb0/0xe0)
      [...]
      Call Trace:
      ([<000000013fe04100>] 0x13fe04100)
       [<000000000048a946>] sk_filter+0x9a/0xd0
       [<000000000049d938>] netlink_broadcast+0x2c0/0x53c
       [<00000000003ba9ae>] cn_netlink_send+0x272/0x2b0
       [<00000000003baef0>] proc_sid_connector+0xc4/0xd4
       [<0000000000142604>] __set_special_pids+0x58/0x90
       [<0000000000159938>] sys_setsid+0xb4/0xd8
       [<00000000001187fe>] sysc_noemu+0x10/0x16
       [<00000041616cb266>] 0x41616cb266
      
      The warning is
      --->    WARN_ON_ONCE(in_irq() || irqs_disabled());
      
      The network code must not be called with disabled interrupts but
      sys_setsid holds the tasklist_lock with spinlock_irq while calling the
      connector.
      
      After a discussion we agreed that we can move proc_sid_connector from
      __set_special_pids to sys_setsid.
      
      We also agreed that it is sufficient to change the check from
      task_session(curr) != pid into err > 0, since if we don't change the
      session, this means we were already the leader and return -EPERM.
      
      One last thing:
      There is also daemonize(), and some people might want to get a
      notification in that case. Since daemonize() is only needed if a user
      space does kernel_thread this does not look important (and there seems
      to be no consensus if this connector should be called in daemonize). If
      we really want this, we can add proc_sid_connector to daemonize() in an
      additional patch (Scott?)
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Scott James Remnant <scott@ubuntu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NEvgeniy Polyakov <zbr@ioremap.net>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
  17. 14 Oct, 2009 1 commit
  18. 04 Oct, 2009 1 commit
    • A
      HWPOISON: Clean up PR_MCE_KILL interface · 1087e9b4
      Andi Kleen committed
      While writing the manpage I noticed some shortcomings in the
      current interface.
      
      - Define symbolic names for all the different values
      - Boundary check the kill mode values
      - For symmetry add a get interface too. This allows library
      code to get/set the current state.
      - For consistency define a PR_MCE_KILL_DEFAULT value
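
      As a rough illustration of the cleaned-up interface, the sketch below
      sets and reads back the policy using the symbolic names from
      <sys/prctl.h>. The helper names (mce_kill_set, mce_kill_get,
      mce_kill_roundtrip, mce_kill_selfcheck) and the out-of-range value 99
      are assumptions made for this example, not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/*
 * Set the calling thread's policy to PR_MCE_KILL_EARLY, PR_MCE_KILL_LATE
 * or PR_MCE_KILL_DEFAULT. Returns 0, or -1 with errno set.
 */
int mce_kill_set(unsigned long mode)
{
	return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, mode, 0UL, 0UL);
}

/* Read the policy back: one of the three modes above, or -1 on error. */
int mce_kill_get(void)
{
	return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);
}

/*
 * Hypothetical helper: returns 0 if set/get round-trips for EARLY and
 * LATE and an out-of-range mode is rejected with EINVAL. The policy
 * that was in effect on entry is restored before returning.
 */
int mce_kill_roundtrip(void)
{
	int saved = mce_kill_get();

	if (saved < 0)
		return -1;	/* prctl option not available here */

	if (mce_kill_set(PR_MCE_KILL_EARLY) != 0 ||
	    mce_kill_get() != PR_MCE_KILL_EARLY)
		return 1;

	if (mce_kill_set(PR_MCE_KILL_LATE) != 0 ||
	    mce_kill_get() != PR_MCE_KILL_LATE)
		return 2;

	/* 99 is not a valid kill mode, so the bounds check should fire. */
	errno = 0;
	if (mce_kill_set(99UL) != -1 || errno != EINVAL)
		return 3;

	return mce_kill_set((unsigned long)saved) != 0 ? 4 : 0;
}

/* Convenience wrapper for a quick manual check. */
void mce_kill_selfcheck(void)
{
	assert(mce_kill_roundtrip() == 0);
}
```

      Library code that wants to change the policy temporarily can follow
      the same save, set, restore pattern.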
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
  19. 23 Sep, 2009 1 commit
    • J
      getrusage: fill ru_maxrss value · 1f10206c
      Jiri Pirko committed
      Fill the ->ru_maxrss value in struct rusage according to the rss hiwater
      mark.  This struct is filled in as a parameter of the getrusage syscall.
      The ->ru_maxrss value is reported in KB, which is the way it is done on
      BSD systems.  The /usr/bin/time (GNU time) application converts
      ->ru_maxrss to KB, which seems to be incorrect behavior.  I notified the
      maintainer of this utility, with a patch that corrects it, and cc'ed him.
      
      To make this happen we extend struct signal_struct by two fields.  The
      first one is ->maxrss, which we use to store the rss hiwater of the task.
      The second one is ->cmaxrss, which we use to store the highest rss hiwater
      of all the task's children.  These values are used in k_getrusage() to
      actually fill ->ru_maxrss.  k_getrusage() uses the current rss hiwater
      value directly if the mm struct exists.
      
      Note:
      exec() clears mm->hiwater_rss, but doesn't clear sig->maxrss.
      This is intentional behavior: *BSD getrusage also inherits across exec().
      
      test programs
      ========================================================
      
      getrusage.c
      ===========
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <unistd.h>
       #include <signal.h>
       #include <sys/mman.h>
      
       #include "common.h"
      
       #define err(str) perror(str), exit(1)
      
      int main(int argc, char** argv)
      {
      	int status;
      
      	printf("allocate 100MB\n");
      	consume(100);
      
      	printf("testcase1: fork inherit? \n");
      	printf("  expect: initial.self ~= child.self\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      	} else {
      		show_rusage("fork child");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase2: fork inherit? (cont.) \n");
      	printf("  expect: initial.children ~= 100MB, but child.children = 0\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      	} else {
      		show_rusage("child");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase3: fork + malloc \n");
      	printf("  expect: child.self ~= initial.self + 50MB\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      	} else {
      		printf("allocate +50MB\n");
      		consume(50);
      		show_rusage("fork child");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase4: grandchild maxrss\n");
      	printf("  expect: post_wait.children ~= 300MB\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      		show_rusage("post_wait");
      	} else {
      		system("./child -n 0 -g 300");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase5: zombie\n");
      	printf("  expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
      	printf("          post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
      	show_rusage("initial");
      	if (__fork()) {
      		sleep(1); /* children become zombie */
      		show_rusage("pre_wait");
      		wait(&status);
      		show_rusage("post_wait");
      	} else {
      		system("./child -n 400");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase6: SIG_IGN\n");
      	printf("  expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
      	show_rusage("initial");
      	signal(SIGCHLD, SIG_IGN);
      	if (__fork()) {
      		sleep(1); /* children become zombie */
      		show_rusage("after_zombie");
      	} else {
      		system("./child -n 500");
      		_exit(0);
      	}
      	printf("\n");
      	signal(SIGCHLD, SIG_DFL);
      
      	printf("testcase7: exec (without fork) \n");
      	printf("  expect: initial ~= exec \n");
      	show_rusage("initial");
      	execl("./child", "child", "-v", NULL);
      
      	return 0;
      }
      
      child.c
      =======
       #include <sys/types.h>
       #include <unistd.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
      
       #include "common.h"
      
      int main(int argc, char** argv)
      {
      	int status;
      	int c;
      	long consume_size = 0;
      	long grandchild_consume_size = 0;
      	int show = 0;
      
      	while ((c = getopt(argc, argv, "n:g:v")) != -1) {
      		switch (c) {
      		case 'n':
      			consume_size = atol(optarg);
      			break;
      		case 'v':
      			show = 1;
      			break;
      		case 'g':
      
      			grandchild_consume_size = atol(optarg);
      			break;
      		default:
      			break;
      		}
      	}
      
      	if (show)
      		show_rusage("exec");
      
      	if (consume_size) {
      		printf("child alloc %ldMB\n", consume_size);
      		consume(consume_size);
      	}
      
      	if (grandchild_consume_size) {
      		if (fork()) {
      			wait(&status);
      		} else {
      			printf("grandchild alloc %ldMB\n", grandchild_consume_size);
      			consume(grandchild_consume_size);
      
      			exit(0);
      		}
      	}
      
      	return 0;
      }
      
      common.c
      ========
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <unistd.h>
       #include <signal.h>
       #include <sys/mman.h>
      
       #include "common.h"
       #define err(str) perror(str), exit(1)
      
      void show_rusage(char *prefix)
      {
          	int err, err2;
          	struct rusage rusage_self;
          	struct rusage rusage_children;
      
          	printf("%s: ", prefix);
          	err = getrusage(RUSAGE_SELF, &rusage_self);
          	if (!err)
          		printf("self %ld ", rusage_self.ru_maxrss);
          	err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
          	if (!err2)
          		printf("children %ld ", rusage_children.ru_maxrss);
      
          	printf("\n");
      }
      
      /* Some buggy OS need this worthless CPU waste. */
      void make_pagefault(void)
      {
      	void *addr;
      	int size = getpagesize();
      	int i;
      
      	for (i=0; i<1000; i++) {
      		addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
      		if (addr == MAP_FAILED)
      			err("make_pagefault");
      		memset(addr, 0, size);
      		munmap(addr, size);
      	}
      }
      
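      void consume(int mega)
      {
      	/* Widen before multiplying so large sizes cannot overflow int. */
      	size_t sz = (size_t)mega * 1024 * 1024;
      	void *ptr;

      	/* Deliberately leaked: the test only needs the pages touched. */
      	ptr = malloc(sz);
      	if (!ptr)
      		err("malloc");
      	memset(ptr, 0, sz);
      	make_pagefault();
      }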
      
      pid_t __fork(void)
      {
      	pid_t pid;
      
      	pid = fork();
      	make_pagefault();
      
      	return pid;
      }
      
      common.h
      ========
      void show_rusage(char *prefix);
      void make_pagefault(void);
      void consume(int mega);
      pid_t __fork(void);
      
      FreeBSD result (expected result)
      ========================================================
      allocate 100MB
      testcase1: fork inherit?
        expect: initial.self ~= child.self
      initial: self 103492 children 0
      fork child: self 103540 children 0
      
      testcase2: fork inherit? (cont.)
        expect: initial.children ~= 100MB, but child.children = 0
      initial: self 103540 children 103540
      child: self 103564 children 0
      
      testcase3: fork + malloc
        expect: child.self ~= initial.self + 50MB
      initial: self 103564 children 103564
      allocate +50MB
      fork child: self 154860 children 0
      
      testcase4: grandchild maxrss
        expect: post_wait.children ~= 300MB
      initial: self 103564 children 154860
      grandchild alloc 300MB
      post_wait: self 103564 children 308720
      
      testcase5: zombie
        expect: pre_wait ~= initial, IOW the zombie process is not accounted.
                post_wait ~= 400MB, IOW wait() collect child's max_rss.
      initial: self 103564 children 308720
      child alloc 400MB
      pre_wait: self 103564 children 308720
      post_wait: self 103564 children 411312
      
      testcase6: SIG_IGN
        expect: initial ~= after_zombie (child's 500MB alloc should be ignored).
      initial: self 103564 children 411312
      child alloc 500MB
      after_zombie: self 103624 children 411312
      
      testcase7: exec (without fork)
        expect: initial ~= exec
      initial: self 103624 children 411312
      exec: self 103624 children 411312
      
      Linux result (actual test result)
      ========================================================
      allocate 100MB
      testcase1: fork inherit?
        expect: initial.self ~= child.self
      initial: self 102848 children 0
      fork child: self 102572 children 0
      
      testcase2: fork inherit? (cont.)
        expect: initial.children ~= 100MB, but child.children = 0
      initial: self 102876 children 102644
      child: self 102572 children 0
      
      testcase3: fork + malloc
        expect: child.self ~= initial.self + 50MB
      initial: self 102876 children 102644
      allocate +50MB
      fork child: self 153804 children 0
      
      testcase4: grandchild maxrss
        expect: post_wait.children ~= 300MB
      initial: self 102876 children 153864
      grandchild alloc 300MB
      post_wait: self 102876 children 307536
      
      testcase5: zombie
        expect: pre_wait ~= initial, IOW the zombie process is not accounted.
                post_wait ~= 400MB, IOW wait() collect child's max_rss.
      initial: self 102876 children 307536
      child alloc 400MB
      pre_wait: self 102876 children 307536
      post_wait: self 102876 children 410076
      
      testcase6: SIG_IGN
        expect: initial ~= after_zombie (child's 500MB alloc should be ignored).
      initial: self 102876 children 410076
      child alloc 500MB
      after_zombie: self 102880 children 410076
      
      testcase7: exec (without fork)
        expect: initial ~= exec
      initial: self 102880 children 410076
      exec: self 102880 children 410076
      Signed-off-by: Jiri Pirko <jpirko@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 21 Sep, 2009 1 commit
    • I
      perf: Do the big rename: Performance Counters -> Performance Events · cdd6c482
      Ingo Molnar committed
      Bye-bye Performance Counters, welcome Performance Events!
      
      In the past few months the perfcounters subsystem has outgrown its
      initial role of counting hardware events, and has become (and is
      becoming) a much broader generic event enumeration, reporting, logging,
      monitoring, analysis facility.
      
      Naming its core object 'perf_counter' and naming the subsystem
      'perfcounters' has become more and more of a misnomer. With pending
      code like hw-breakpoints support the 'counter' name is less and
      less appropriate.
      
      All in all, we've decided to rename the subsystem to 'performance
      events' and to propagate this rename through all fields, variables
      and API names (in an ABI-compatible fashion).
      
      The word 'event' is also a bit shorter than 'counter' - which makes
      it slightly more convenient to write/handle as well.
      
      Thanks go to Stephane Eranian, who first observed this misnomer and
      suggested a rename.
      
      User-space tooling and ABI compatibility is not affected - this patch
      should be function-invariant. (Also, defconfigs were not touched to
      keep the size down.)
      
      This patch has been generated via the following script:
      
        FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
      
        sed -i \
          -e 's/PERF_EVENT_/PERF_RECORD_/g' \
          -e 's/PERF_COUNTER/PERF_EVENT/g' \
          -e 's/perf_counter/perf_event/g' \
          -e 's/nb_counters/nb_events/g' \
          -e 's/swcounter/swevent/g' \
          -e 's/tpcounter_event/tp_event/g' \
          $FILES
      
        for N in $(find . -name perf_counter.[ch]); do
          M=$(echo $N | sed 's/perf_counter/perf_event/g')
          mv $N $M
        done
      
        FILES=$(find . -name perf_event.*)
      
        sed -i \
          -e 's/COUNTER_MASK/REG_MASK/g' \
          -e 's/COUNTER/EVENT/g' \
          -e 's/\<event\>/event_id/g' \
          -e 's/counter/event/g' \
          -e 's/Counter/Event/g' \
          $FILES
      
      ... to keep it as correct as possible. This script can also be
      used by anyone who has pending perfcounters patches - it converts
      a Linux kernel tree over to the new naming. We tried to time this
      change to the point in time where the amount of pending patches
      is the smallest: the end of the merge window.
      
      Namespace clashes were fixed up in a preparatory patch - and some
      stylistic fallout will be fixed up in a subsequent patch.
      
      ( NOTE: 'counters' are still the proper terminology when we deal
        with hardware registers - and these sed scripts are a bit
        over-eager in renaming them. I've undone some of that, but
        in case there's something left where 'counter' would be
        better than 'event' we can undo that on an individual basis
        instead of touching an otherwise nicely automated patch. )
      Suggested-by: Stephane Eranian <eranian@google.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <linux-arch@vger.kernel.org>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  21. 16 Sep, 2009 1 commit
    • A
      HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process · 4db96cf0
      Andi Kleen committed
      This allows processes to override their early/late kill
      behaviour on hardware memory errors.
      
      Typically applications which are memory error aware are
      better off with early kill (see the error as soon
      as possible), all others with late kill (only
      see the error when the error is really impacting execution).
      
      There's a global sysctl, but this way an application
      can set its specific policy.
      
      We're using two bits, one to signify that the process
      stated its intention and that
      
      I also made the prctl future proof by enforcing
      the unused arguments are 0.
      
      The state is inherited by children.
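
      Both properties can be checked from user space. The sketch below is
      an illustration only: it uses the symbolic names that the later
      PR_MCE_KILL cleanup (1087e9b4) introduced rather than the raw values
      of this patch, and the helper names are made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper. Sets the caller's kill mode, forks, and returns
 * the mode the child reads back (or -1 on any failure). Note that the
 * caller's own policy is changed as a side effect.
 */
int mce_kill_child_sees(unsigned long mode)
{
	pid_t pid;
	int status, code;

	if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, mode, 0UL, 0UL) != 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* The child reports what it inherited via its exit code. */
		int got = prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);

		_exit(got < 0 ? 255 : got);
	}

	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;

	code = WEXITSTATUS(status);
	return code == 255 ? -1 : code;
}

/* Returns 1 if a non-zero unused argument is rejected with EINVAL. */
int mce_kill_rejects_nonzero_unused(void)
{
	errno = 0;
	return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 1UL) == -1 &&
	       errno == EINVAL;
}

/* Convenience wrapper for a quick manual check. */
void mce_kill_inherit_selfcheck(void)
{
	assert(mce_kill_child_sees(PR_MCE_KILL_EARLY) == PR_MCE_KILL_EARLY);
	assert(mce_kill_rejects_nonzero_unused());
}
```

      A caller that does not want to keep the new policy should set it back
      to PR_MCE_KILL_DEFAULT afterwards.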
      
      Note this makes us officially run out of process flags
      on 32bit, but the next patch can easily add another field.
      
      Manpage patch will be supplied separately.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
  22. 17 Jun, 2009 1 commit
  23. 14 Apr, 2009 1 commit
  24. 03 Apr, 2009 1 commit
    • O
      pids: kill signal_struct-> __pgrp/__session and friends · 1b0f7ffd
      Oleg Nesterov committed
      We are wasting 2 words in signal_struct without any reason to implement
      task_pgrp_nr() and task_session_nr().
      
      task_session_nr() has no callers since commit 2e2ba22e, so we can
      remove it.
      
      task_pgrp_nr() is still (I believe wrongly) used in fs/autofsX and
      fs/coda.
      
      This patch reimplements task_pgrp_nr() via task_pgrp_nr_ns(), and kills
      __pgrp/__session and the related helpers.
      
      The change in drivers/char/tty_io.c is cosmetic, but hopefully makes sense
      anyway.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Alan Cox <number6@the-village.bc.nu>		[tty parts]
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 01 Apr, 2009 1 commit
  26. 27 Feb, 2009 1 commit
  27. 06 Feb, 2009 1 commit
    • A
      revert "rlimit: permit setting RLIMIT_NOFILE to RLIM_INFINITY" · 60fd760f
      Andrew Morton committed
      Revert commit 0c2d64fb because it causes
      (arguably poorly designed) existing userspace to spend interminable
      periods closing billions of not-open file descriptors.
      
      We could bring this back, with some sort of opt-in tunable in /proc, which
      defaults to "off".
      
      Peter's analysis follows:
      
      : I spent several hours trying to get to the bottom of a serious
      : performance issue that appeared on one of our servers after upgrading to
      : 2.6.28.  In the end it's what could be considered a userspace bug that
      : was triggered by a change in 2.6.28.  Since this might also affect other
      : people I figured I'd at least document what I found here, and maybe we
      : can even do something about it:
      :
      :
      : So, I upgraded some of debian.org's machines to 2.6.28.1 and immediately
      : the team maintaining our ftp archive complained that one of their
      : scripts that previously ran in a few minutes still hadn't even come
      : close to being done after an hour or so.  Downgrading to 2.6.27 fixed
      : that.
      :
      : Turns out that script is forking a lot and something in it or python or
      : wherever closes all the file descriptors it doesn't want to pass on.
      : That is, it starts at zero and goes up to ulimit -n/RLIMIT_NOFILE and
      : closes them all with a few exceptions.
      :
      : Turns out that takes a long time when your limit -n is now 2^20 (1048576).
      :
      : With 2.6.27.* the ulimit -n was the standard 1024, but with 2.6.28 it is
      : now a thousand times that.
      :
      : 2.6.28 included a patch titled "rlimit: permit setting RLIMIT_NOFILE to
      : RLIM_INFINITY" (0c2d64fb)[1] that
      : allows, as the title implies, to set the limit for number of files to
      : infinity.
      :
      : Closer investigation showed that the broken default ulimit did not apply
      : to "system" processes (like stuff started from init).  In the end I
      : could establish that all processes that passed through pam_limit at one
      : point had the bad resource limit.
      :
      : Apparently the pam library in Debian etch (4.0) initializes the limits
      : to some default values when it doesn't have any settings in limit.conf
      : to override them.  Turns out that for nofiles this is RLIM_INFINITY.
      : Commenting out "case RLIMIT_NOFILE" in pam_limit.c:267 of our pam
      : package version 0.79-5 fixes that - tho I'm not sure what side effects
      : that has.
      :
      : Debian lenny (the upcoming 5.0 version) doesn't have this issue as it
      : uses a different pam (version).
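
      The userspace pattern Peter describes, and one way to keep it from
      degenerating when the limit is huge or infinite, can be sketched as
      follows. This is not the code of the affected script: FD_SCAN_CAP,
      fd_scan_limit() and close_from() are hypothetical names, and the cap
      of 4096 is an arbitrary value chosen for the example.

```c
#include <assert.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical upper bound on how many descriptors we are willing to scan. */
#define FD_SCAN_CAP 4096L

/*
 * Returns the exclusive upper bound for a "close everything" loop:
 * the soft RLIMIT_NOFILE, clamped to FD_SCAN_CAP when the limit is
 * unknown, RLIM_INFINITY, or simply very large.
 */
long fd_scan_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0 ||
	    rl.rlim_cur == RLIM_INFINITY ||
	    rl.rlim_cur > (rlim_t)FD_SCAN_CAP)
		return FD_SCAN_CAP;

	return (long)rl.rlim_cur;
}

/*
 * Calls close() on every descriptor from first_fd up to the clamped
 * limit, whether or not it is open. Returns the number of close()
 * calls made, which is the cost that exploded in the report above.
 */
long close_from(int first_fd)
{
	long limit = fd_scan_limit();
	long calls = 0;
	long fd;

	assert(first_fd >= 0);

	for (fd = first_fd; fd < limit; fd++) {
		close((int)fd);
		calls++;
	}

	return calls;
}
```

      The trade-off of clamping is that descriptors numbered above the cap
      stay open, so this is a mitigation for the slow loop rather than a
      complete replacement for it.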
      Reported-by: Peter Palfrader <weasel@debian.org>
      Cc: Adam Tkac <vonsch@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  28. 14 Jan, 2009 3 commits