1. 26 7月, 2008 2 次提交
    • O
      introduce PF_KTHREAD flag · 7b34e428
      Oleg Nesterov 提交于
      Introduce the new PF_KTHREAD flag to mark the kernel threads.  It is set
      by INIT_TASK() and copied to the forked childs (we could set it in
      kthreadd() along with PF_NOFREEZE instead).
      
      daemonize() was changed as well.  In that case testing of PF_KTHREAD is
      racy, but daemonize() is hopeless anyway.
      
      This flag is cleared in do_execve(), before search_binary_handler().
      Probably not the best place, we can do this in exec_mmap() or in
      start_thread(), or clear it along with PF_FORKNOEXEC.  But I think this
      doesn't matter in practice, and if do_execve() fails kthread should die
      soon.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b34e428
    • O
      ptrace: give more respect to SIGKILL · 364d3c13
      Oleg Nesterov 提交于
      ptrace_stop() has some complicated checks to prevent the scheduling in the
      TASK_TRACED state with the pending SIGKILL, but these checks are racy, and
      they depend on arch_ptrace_stop_needed().
      
      This patch assumes that the traced task should die asap if it was killed by
      SIGKILL, in that case schedule()->signal_pending_state() has no reason to
      ignore the TASK_WAKEKILL part of TASK_TRACED, and we can kill this nasty
      special case.
      
      Note: do_exit()->ptrace_notify() is special, the killed task can already
      dequeue SIGKILL at this point. Another indication that fatal_signal_pending()
      is not exactly right.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      364d3c13
  2. 25 7月, 2008 1 次提交
  3. 18 7月, 2008 1 次提交
  4. 17 7月, 2008 2 次提交
    • R
      ptrace children revamp · f470021a
      Roland McGrath 提交于
      ptrace no longer fiddles with the children/sibling links, and the
      old ptrace_children list is gone.  Now ptrace, whether of one's own
      children or another's via PTRACE_ATTACH, just uses the new ptraced
      list instead.
      
      There should be no user-visible difference that matters.  The only
      change is the order in which do_wait() sees multiple stopped
      children and stopped ptrace attachees.  Since wait_task_stopped()
      was changed earlier so it no longer reorders the children list, we
      already know this won't cause any new problems.
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      f470021a
    • R
      Freezer: Introduce PF_FREEZER_NOSIG · ebb12db5
      Rafael J. Wysocki 提交于
      The freezer currently attempts to distinguish kernel threads from
      user space tasks by checking if their mm pointer is unset and it
      does not send fake signals to kernel threads.  However, there are
      kernel threads, mostly related to networking, that behave like
      user space tasks and may want to be sent a fake signal to be frozen.
      
      Introduce the new process flag PF_FREEZER_NOSIG that will be set
      by default for all kernel threads and make the freezer only send
      fake signals to the tasks having PF_FREEZER_NOSIG unset.  Provide
      the set_freezable_with_signal() function to be called by the kernel
      threads that want to be sent a fake signal for freezing.
      
      This patch should not change the freezer's observable behavior.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NPavel Machek <pavel@suse.cz>
      Signed-off-by: NLen Brown <len.brown@intel.com>
      ebb12db5
  5. 11 7月, 2008 1 次提交
    • S
      sched_clock: stop maximum check on NO HZ · af52a90a
      Steven Rostedt 提交于
      Working with ftrace I would get large jumps of 11 millisecs or more with
      the clock tracer. This killed the latencing timings of ftrace and also
      caused the irqoff self tests to fail.
      
      What was happening is with NO_HZ the idle would stop the jiffy counter and
      before the jiffy counter was updated the sched_clock would have a bad
      delta jiffies to compare with the gtod with the maximum.
      
      The jiffies would stop and the last sched_tick would record the last gtod.
      On wakeup, the sched clock update would compare the gtod + delta jiffies
      (which would be zero) and compare it to the TSC. The TSC would have
      correctly (with a stable TSC) moved forward several jiffies. But because the
      jiffies has not been updated yet the clock would be prevented from moving
      forward because it would appear that the TSC jumped too far ahead.
      
      The clock would then virtually stop, until the jiffies are updated. Then
      the next sched clock update would see that the clock was very much behind
      since the delta jiffies is now correct. This would then jump the clock
      forward by several jiffies.
      
      This caused ftrace to report several milliseconds of interrupts off
      latency at every resume from NO_HZ idle.
      
      This patch adds hooks into the nohz code to disable the checking of the
      maximum clock update when nohz is in effect. It resumes the max check
      when nohz has updated the jiffies again.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      af52a90a
  6. 27 6月, 2008 3 次提交
  7. 24 6月, 2008 1 次提交
    • R
      sched: add new API sched_setscheduler_nocheck: add a flag to control access checks · 961ccddd
      Rusty Russell 提交于
      Hidehiro Kawai noticed that sched_setscheduler() can fail in
      stop_machine: it calls sched_setscheduler() from insmod, which can
      have CAP_SYS_MODULE without CAP_SYS_NICE.
      
      Two cases could have failed, so are changed to sched_setscheduler_nocheck:
        kernel/softirq.c:cpu_callback()
      	- CPU hotplug callback
        kernel/stop_machine.c:__stop_machine_run()
      	- Called from various places, including modprobe()
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-mm@kvack.org
      Cc: sugita <yumiko.sugita.yf@hitachi.com>
      Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      961ccddd
  8. 10 6月, 2008 2 次提交
    • D
      sched: prevent bound kthreads from changing cpus_allowed · 9985b0ba
      David Rientjes 提交于
      Kthreads that have called kthread_bind() are bound to specific cpus, so
      other tasks should not be able to change their cpus_allowed from under
      them.  Otherwise, it is possible to move kthreads, such as the migration
      or software watchdog threads, so they are not allowed access to the cpu
      they work on.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      9985b0ba
    • O
      sched: fix TASK_WAKEKILL vs SIGKILL race · 16882c1e
      Oleg Nesterov 提交于
      schedule() has the special "TASK_INTERRUPTIBLE && signal_pending()" case,
      this allows us to do
      
      	current->state = TASK_INTERRUPTIBLE;
      	schedule();
      
      without fear to sleep with pending signal.
      
      However, the code like
      
      	current->state = TASK_KILLABLE;
      	schedule();
      
      is not right, schedule() doesn't take TASK_WAKEKILL into account. This means
      that mutex_lock_killable(), wait_for_completion_killable(), down_killable(),
      schedule_timeout_killable() can miss SIGKILL (and btw the second SIGKILL has
      no effect).
      
      Introduce the new helper, signal_pending_state(), and change schedule() to
      use it. Hopefully it will have more users, that is why the task's state is
      passed separately.
      
      Note this "__TASK_STOPPED | __TASK_TRACED" check in signal_pending_state().
      This is needed to preserve the current behaviour (ptrace_notify). I hope
      this check will be removed soon, but this (afaics good) change needs the
      separate discussion.
      
      The fast path is "(state & (INTERRUPTIBLE | WAKEKILL)) + signal_pending(p)",
      basically the same that schedule() does now. However, this patch of course
      bloats schedule().
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      16882c1e
  9. 06 6月, 2008 3 次提交
    • G
      sched: fix cpupri hotplug support · 1f11eb6a
      Gregory Haskins 提交于
      The RT folks over at RedHat found an issue w.r.t. hotplug support which
      was traced to problems with the cpupri infrastructure in the scheduler:
      
      https://bugzilla.redhat.com/show_bug.cgi?id=449676
      
      This bug affects 23-rt12+, 24-rtX, 25-rtX, and sched-devel.  This patch
      applies to 25.4-rt4, though it should trivially apply to most cpupri enabled
      kernels mentioned above.
      
      It turned out that the issue was that offline cpus could get inadvertently
      registered with cpupri so that they were erroneously selected during
      migration decisions.  The end result would be an OOPS as the offline cpu
      had tasks routed to it.
      
      This patch generalizes the old join/leave domain interface into an
      online/offline interface, and adjusts the root-domain/hotplug code to
      utilize it.
      
      I was able to easily reproduce the issue prior to this patch, and am no
      longer able to reproduce it after this patch.  I can offline cpus
      indefinately and everything seems to be in working order.
      
      Thanks to Arnaldo (acme), Thomas, and Peter for doing the legwork to point
      me in the right direction.  Also thank you to Peter for reviewing the
      early iterations of this patch.
      Signed-off-by: NGregory Haskins <ghaskins@novell.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      1f11eb6a
    • R
      sched: reorder task_struct to reduce padding on 64bit builds · c7aceaba
      Richard Kennedy 提交于
      This patch removes 24 bytes of padding and allows 1 extra object per
      slab on my fedora based config.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      c7aceaba
    • I
      namespacecheck: more sched.c fixes · 554ec22f
      Ingo Molnar 提交于
      [ Stephen Rothwell <sfr@canb.auug.org.au>: build fix ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      554ec22f
  10. 29 5月, 2008 1 次提交
  11. 27 5月, 2008 1 次提交
  12. 25 5月, 2008 2 次提交
  13. 24 5月, 2008 8 次提交
  14. 13 5月, 2008 2 次提交
    • L
      Make 'cond_resched()' nullification depend on PREEMPT_BKL · c714a534
      Linus Torvalds 提交于
      Because it's not correct with a non-preemptable BKL and just causes
      PREEMPT kernels to have longer latencies than non-PREEMPT ones (which is
      obviously not the point of it at all).
      
      Of course, that config option actually got removed as an option earlier,
      so for now this basically disables it entirely, but if BKL preemption is
      ever resurrected it will be a meaningful optimization.  And in the
      meantime, it at least documents the intent of the code, while not doing
      the wrong thing.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c714a534
    • L
      Fix up 'need_resched()' definition · 9404ef02
      Linus Torvalds 提交于
      We should not go through the task pointer to get at the thread info,
      since it's usually cheaper to just access the thread info directly.
      
      So don't make the code look up 'current', when we can just use the
      thread info accessor functions directly.  This generally avoids one
      level of indirection and tends to work better together with code that
      also looks at other thread flags (eg preempt_count).
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9404ef02
  15. 12 5月, 2008 1 次提交
    • L
      Add new 'cond_resched_bkl()' helper function · c3921ab7
      Linus Torvalds 提交于
      It acts exactly like a regular 'cond_resched()', but will not get
      optimized away when CONFIG_PREEMPT is set.
      
      Normal kernel code is already preemptable in the presense of
      CONFIG_PREEMPT, so cond_resched() is optimized away (see commit
      02b67cc3 "sched: do not do
      cond_resched() when CONFIG_PREEMPT").
      
      But when wanting to conditionally reschedule while holding a lock, you
      need to use "cond_sched_lock(lock)", and the new function is the BKL
      equivalent of that.
      
      Also make fs/locks.c use it.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c3921ab7
  16. 06 5月, 2008 3 次提交
  17. 30 4月, 2008 5 次提交
    • P
      Deprecate find_task_by_pid() · 5cd20455
      Pavel Emelyanov 提交于
      There are some places that are known to operate on tasks'
      global pids only:
      
      * the rest_init() call (called on boot)
      * the kgdb's getthread
      * the create_kthread() (since the kthread is run in init ns)
      
      So use the find_task_by_pid_ns(..., &init_pid_ns) there
      and schedule the find_task_by_pid for removal.
      
      [sukadev@us.ibm.com: Fix warning in kernel/pid.c]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5cd20455
    • R
      signals: use HAVE_SET_RESTORE_SIGMASK · f3de272b
      Roland McGrath 提交于
      Change all the #ifdef TIF_RESTORE_SIGMASK conditionals in non-arch code to
      #ifdef HAVE_SET_RESTORE_SIGMASK.  If arch code defines it first, the generic
      set_restore_sigmask() using TIF_RESTORE_SIGMASK is not defined.
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f3de272b
    • O
      signals: fix /sbin/init protection from unwanted signals · fae5fa44
      Oleg Nesterov 提交于
      The global init has a lot of long standing problems with the unhandled fatal
      signals.
      
      	- The "is_global_init(current)" check in get_signal_to_deliver()
      	  protects only the main thread. Sub-thread can dequee the fatal
      	  signal and shutdown the whole thread group except the main thread.
      	  If it dequeues SIGSTOP /sbin/init will be stopped, this is not
      	  right too. Note that we can't use is_global_init(->group_leader),
      	  this breaks exec and this can't solve other problems we have.
      
      	- Even if afterwards ignored, the fatal signals sets SIGNAL_GROUP_EXIT
      	  on delivery. This breaks exec, has other bad implications, and this
      	  is just wrong.
      
      Introduce the new SIGNAL_UNKILLABLE flag to fix these problems.  It also helps
      to solve some other problems addressed by the subsequent patches.
      
      Currently we use this flag for the global init only, but it could also be used
      by kthreads and (perhaps) by the sub-namespace inits.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fae5fa44
    • O
      signals: join send_sigqueue() with send_group_sigqueue() · ac5c2153
      Oleg Nesterov 提交于
      We export send_sigqueue() and send_group_sigqueue() for the only user,
      posix_timer_event().  This is a bit silly, because both are just trivial
      helpers on top of do_send_sigqueue() and because the we pass the unused
      .si_signo parameter.
      
      Kill them both, rename do_send_sigqueue() to send_sigqueue(), and export it.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac5c2153
    • O
      signals: re-assign CLD_CONTINUED notification from the sender to reciever · e4420551
      Oleg Nesterov 提交于
      Based on discussion with Jiri and Roland.
      
      In short: currently handle_stop_signal(SIGCONT, p) sends the notification to
      p->parent, with this patch p itself notifies its parent when it becomes
      running.
      
      handle_stop_signal(SIGCONT) has to drop ->siglock temporary in order to notify
      the parent with do_notify_parent_cldstop().  This leads to multiple problems:
      
      	- as Jiri Kosina pointed out, the stopped task can resume without
      	  actually seeing SIGCONT which may have a handler.
      
      	- we race with another sig_kernel_stop() signal which may come in
      	  that window.
      
      	- we race with sig_fatal() signals which may set SIGNAL_GROUP_EXIT
      	  in that window.
      
      	- we can't avoid taking tasklist_lock() while sending SIGCONT.
      
      With this patch handle_stop_signal() just sets the new SIGNAL_CLD_CONTINUED
      flag in p->signal->flags and returns.  The notification is sent by the first
      task which returns from finish_stop() (there should be at least one) or any
      other signalled thread from get_signal_to_deliver().
      
      This is a user-visible change.  Say, currently kill(SIGCONT, stopped_child)
      can't return without seeing SIGCHLD, with this patch SIGCHLD can be delayed
      unpredictably.  Another difference is that if the child is ptraced by another
      process, CLD_CONTINUED may be delivered to ->real_parent after ptrace_detach()
      while currently it always goes to the tracer which doesn't actually need this
      notification.  Hopefully not a problem.
      
      The patch asks for the futher obvious cleanups, I'll send them separately.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e4420551
  18. 29 4月, 2008 1 次提交
    • B
      cgroups: add an owner to the mm_struct · cf475ad2
      Balbir Singh 提交于
      Remove the mem_cgroup member from mm_struct and instead adds an owner.
      
      This approach was suggested by Paul Menage.  The advantage of this approach
      is that, once the mm->owner is known, using the subsystem id, the cgroup
      can be determined.  It also allows several control groups that are
      virtually grouped by mm_struct, to exist independent of the memory
      controller i.e., without adding mem_cgroup's for each controller, to
      mm_struct.
      
      A new config option CONFIG_MM_OWNER is added and the memory resource
      controller selects this config option.
      
      This patch also adds cgroup callbacks to notify subsystems when mm->owner
      changes.  The mm_cgroup_changed callback is called with the task_lock() of
      the new task held and is called just prior to changing the mm->owner.
      
      I am indebted to Paul Menage for the several reviews of this patchset and
      helping me make it lighter and simpler.
      
      This patch was tested on a powerpc box, it was compiled with both the
      MM_OWNER config turned on and off.
      
      After the thread group leader exits, it's moved to init_css_state by
      cgroup_exit(), thus all future charges from runnings threads would be
      redirected to the init_css_set's subsystem.
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: David Rientjes <rientjes@google.com>,
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Reviewed-by: NPaul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf475ad2