1. 25 7月, 2008 1 次提交
  2. 18 7月, 2008 1 次提交
  3. 17 7月, 2008 2 次提交
    • R
      ptrace children revamp · f470021a
      Roland McGrath 提交于
      ptrace no longer fiddles with the children/sibling links, and the
      old ptrace_children list is gone.  Now ptrace, whether of one's own
      children or another's via PTRACE_ATTACH, just uses the new ptraced
      list instead.
      
      There should be no user-visible difference that matters.  The only
      change is the order in which do_wait() sees multiple stopped
      children and stopped ptrace attachees.  Since wait_task_stopped()
      was changed earlier so it no longer reorders the children list, we
      already know this won't cause any new problems.
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      f470021a
    • R
      Freezer: Introduce PF_FREEZER_NOSIG · ebb12db5
      Rafael J. Wysocki 提交于
      The freezer currently attempts to distinguish kernel threads from
      user space tasks by checking if their mm pointer is unset and it
      does not send fake signals to kernel threads.  However, there are
      kernel threads, mostly related to networking, that behave like
      user space tasks and may want to be sent a fake signal to be frozen.
      
      Introduce the new process flag PF_FREEZER_NOSIG that will be set
      by default for all kernel threads and make the freezer only send
      fake signals to the tasks having PF_FREEZER_NOSIG unset.  Provide
      the set_freezable_with_signal() function to be called by the kernel
      threads that want to be sent a fake signal for freezing.
      
      This patch should not change the freezer's observable behavior.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NPavel Machek <pavel@suse.cz>
      Signed-off-by: NLen Brown <len.brown@intel.com>
      ebb12db5
  4. 11 7月, 2008 1 次提交
    • S
      sched_clock: stop maximum check on NO HZ · af52a90a
      Steven Rostedt 提交于
      Working with ftrace I would get large jumps of 11 millisecs or more with
      the clock tracer. This killed the latencing timings of ftrace and also
      caused the irqoff self tests to fail.
      
      What was happening is with NO_HZ the idle would stop the jiffy counter and
      before the jiffy counter was updated the sched_clock would have a bad
      delta jiffies to compare with the gtod with the maximum.
      
      The jiffies would stop and the last sched_tick would record the last gtod.
      On wakeup, the sched clock update would compare the gtod + delta jiffies
      (which would be zero) and compare it to the TSC. The TSC would have
      correctly (with a stable TSC) moved forward several jiffies. But because the
      jiffies has not been updated yet the clock would be prevented from moving
      forward because it would appear that the TSC jumped too far ahead.
      
      The clock would then virtually stop, until the jiffies are updated. Then
      the next sched clock update would see that the clock was very much behind
      since the delta jiffies is now correct. This would then jump the clock
      forward by several jiffies.
      
      This caused ftrace to report several milliseconds of interrupts off
      latency at every resume from NO_HZ idle.
      
      This patch adds hooks into the nohz code to disable the checking of the
      maximum clock update when nohz is in effect. It resumes the max check
      when nohz has updated the jiffies again.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      af52a90a
  5. 27 6月, 2008 3 次提交
  6. 24 6月, 2008 1 次提交
    • R
      sched: add new API sched_setscheduler_nocheck: add a flag to control access checks · 961ccddd
      Rusty Russell 提交于
      Hidehiro Kawai noticed that sched_setscheduler() can fail in
      stop_machine: it calls sched_setscheduler() from insmod, which can
      have CAP_SYS_MODULE without CAP_SYS_NICE.
      
      Two cases could have failed, so are changed to sched_setscheduler_nocheck:
        kernel/softirq.c:cpu_callback()
      	- CPU hotplug callback
        kernel/stop_machine.c:__stop_machine_run()
      	- Called from various places, including modprobe()
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-mm@kvack.org
      Cc: sugita <yumiko.sugita.yf@hitachi.com>
      Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      961ccddd
  7. 10 6月, 2008 2 次提交
    • D
      sched: prevent bound kthreads from changing cpus_allowed · 9985b0ba
      David Rientjes 提交于
      Kthreads that have called kthread_bind() are bound to specific cpus, so
      other tasks should not be able to change their cpus_allowed from under
      them.  Otherwise, it is possible to move kthreads, such as the migration
      or software watchdog threads, so they are not allowed access to the cpu
      they work on.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      9985b0ba
    • O
      sched: fix TASK_WAKEKILL vs SIGKILL race · 16882c1e
      Oleg Nesterov 提交于
      schedule() has the special "TASK_INTERRUPTIBLE && signal_pending()" case,
      this allows us to do
      
      	current->state = TASK_INTERRUPTIBLE;
      	schedule();
      
      without fear to sleep with pending signal.
      
      However, the code like
      
      	current->state = TASK_KILLABLE;
      	schedule();
      
      is not right, schedule() doesn't take TASK_WAKEKILL into account. This means
      that mutex_lock_killable(), wait_for_completion_killable(), down_killable(),
      schedule_timeout_killable() can miss SIGKILL (and btw the second SIGKILL has
      no effect).
      
      Introduce the new helper, signal_pending_state(), and change schedule() to
      use it. Hopefully it will have more users, that is why the task's state is
      passed separately.
      
      Note this "__TASK_STOPPED | __TASK_TRACED" check in signal_pending_state().
      This is needed to preserve the current behaviour (ptrace_notify). I hope
      this check will be removed soon, but this (afaics good) change needs the
      separate discussion.
      
      The fast path is "(state & (INTERRUPTIBLE | WAKEKILL)) + signal_pending(p)",
      basically the same that schedule() does now. However, this patch of course
      bloats schedule().
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      16882c1e
  8. 06 6月, 2008 3 次提交
    • G
      sched: fix cpupri hotplug support · 1f11eb6a
      Gregory Haskins 提交于
      The RT folks over at RedHat found an issue w.r.t. hotplug support which
      was traced to problems with the cpupri infrastructure in the scheduler:
      
      https://bugzilla.redhat.com/show_bug.cgi?id=449676
      
      This bug affects 23-rt12+, 24-rtX, 25-rtX, and sched-devel.  This patch
      applies to 25.4-rt4, though it should trivially apply to most cpupri enabled
      kernels mentioned above.
      
      It turned out that the issue was that offline cpus could get inadvertently
      registered with cpupri so that they were erroneously selected during
      migration decisions.  The end result would be an OOPS as the offline cpu
      had tasks routed to it.
      
      This patch generalizes the old join/leave domain interface into an
      online/offline interface, and adjusts the root-domain/hotplug code to
      utilize it.
      
      I was able to easily reproduce the issue prior to this patch, and am no
      longer able to reproduce it after this patch.  I can offline cpus
      indefinately and everything seems to be in working order.
      
      Thanks to Arnaldo (acme), Thomas, and Peter for doing the legwork to point
      me in the right direction.  Also thank you to Peter for reviewing the
      early iterations of this patch.
      Signed-off-by: NGregory Haskins <ghaskins@novell.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      1f11eb6a
    • R
      sched: reorder task_struct to reduce padding on 64bit builds · c7aceaba
      Richard Kennedy 提交于
      This patch removes 24 bytes of padding and allows 1 extra object per
      slab on my fedora based config.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      c7aceaba
    • I
      namespacecheck: more sched.c fixes · 554ec22f
      Ingo Molnar 提交于
      [ Stephen Rothwell <sfr@canb.auug.org.au>: build fix ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      554ec22f
  9. 29 5月, 2008 1 次提交
  10. 27 5月, 2008 1 次提交
  11. 25 5月, 2008 2 次提交
  12. 24 5月, 2008 8 次提交
  13. 13 5月, 2008 2 次提交
    • L
      Make 'cond_resched()' nullification depend on PREEMPT_BKL · c714a534
      Linus Torvalds 提交于
      Because it's not correct with a non-preemptable BKL and just causes
      PREEMPT kernels to have longer latencies than non-PREEMPT ones (which is
      obviously not the point of it at all).
      
      Of course, that config option actually got removed as an option earlier,
      so for now this basically disables it entirely, but if BKL preemption is
      ever resurrected it will be a meaningful optimization.  And in the
      meantime, it at least documents the intent of the code, while not doing
      the wrong thing.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c714a534
    • L
      Fix up 'need_resched()' definition · 9404ef02
      Linus Torvalds 提交于
      We should not go through the task pointer to get at the thread info,
      since it's usually cheaper to just access the thread info directly.
      
      So don't make the code look up 'current', when we can just use the
      thread info accessor functions directly.  This generally avoids one
      level of indirection and tends to work better together with code that
      also looks at other thread flags (eg preempt_count).
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9404ef02
  14. 12 5月, 2008 1 次提交
    • L
      Add new 'cond_resched_bkl()' helper function · c3921ab7
      Linus Torvalds 提交于
      It acts exactly like a regular 'cond_resched()', but will not get
      optimized away when CONFIG_PREEMPT is set.
      
      Normal kernel code is already preemptable in the presense of
      CONFIG_PREEMPT, so cond_resched() is optimized away (see commit
      02b67cc3 "sched: do not do
      cond_resched() when CONFIG_PREEMPT").
      
      But when wanting to conditionally reschedule while holding a lock, you
      need to use "cond_sched_lock(lock)", and the new function is the BKL
      equivalent of that.
      
      Also make fs/locks.c use it.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c3921ab7
  15. 06 5月, 2008 3 次提交
  16. 30 4月, 2008 5 次提交
    • P
      Deprecate find_task_by_pid() · 5cd20455
      Pavel Emelyanov 提交于
      There are some places that are known to operate on tasks'
      global pids only:
      
      * the rest_init() call (called on boot)
      * the kgdb's getthread
      * the create_kthread() (since the kthread is run in init ns)
      
      So use the find_task_by_pid_ns(..., &init_pid_ns) there
      and schedule the find_task_by_pid for removal.
      
      [sukadev@us.ibm.com: Fix warning in kernel/pid.c]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5cd20455
    • R
      signals: use HAVE_SET_RESTORE_SIGMASK · f3de272b
      Roland McGrath 提交于
      Change all the #ifdef TIF_RESTORE_SIGMASK conditionals in non-arch code to
      #ifdef HAVE_SET_RESTORE_SIGMASK.  If arch code defines it first, the generic
      set_restore_sigmask() using TIF_RESTORE_SIGMASK is not defined.
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f3de272b
    • O
      signals: fix /sbin/init protection from unwanted signals · fae5fa44
      Oleg Nesterov 提交于
      The global init has a lot of long standing problems with the unhandled fatal
      signals.
      
      	- The "is_global_init(current)" check in get_signal_to_deliver()
      	  protects only the main thread. Sub-thread can dequee the fatal
      	  signal and shutdown the whole thread group except the main thread.
      	  If it dequeues SIGSTOP /sbin/init will be stopped, this is not
      	  right too. Note that we can't use is_global_init(->group_leader),
      	  this breaks exec and this can't solve other problems we have.
      
      	- Even if afterwards ignored, the fatal signals sets SIGNAL_GROUP_EXIT
      	  on delivery. This breaks exec, has other bad implications, and this
      	  is just wrong.
      
      Introduce the new SIGNAL_UNKILLABLE flag to fix these problems.  It also helps
      to solve some other problems addressed by the subsequent patches.
      
      Currently we use this flag for the global init only, but it could also be used
      by kthreads and (perhaps) by the sub-namespace inits.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fae5fa44
    • O
      signals: join send_sigqueue() with send_group_sigqueue() · ac5c2153
      Oleg Nesterov 提交于
      We export send_sigqueue() and send_group_sigqueue() for the only user,
      posix_timer_event().  This is a bit silly, because both are just trivial
      helpers on top of do_send_sigqueue() and because the we pass the unused
      .si_signo parameter.
      
      Kill them both, rename do_send_sigqueue() to send_sigqueue(), and export it.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac5c2153
    • O
      signals: re-assign CLD_CONTINUED notification from the sender to reciever · e4420551
      Oleg Nesterov 提交于
      Based on discussion with Jiri and Roland.
      
      In short: currently handle_stop_signal(SIGCONT, p) sends the notification to
      p->parent, with this patch p itself notifies its parent when it becomes
      running.
      
      handle_stop_signal(SIGCONT) has to drop ->siglock temporary in order to notify
      the parent with do_notify_parent_cldstop().  This leads to multiple problems:
      
      	- as Jiri Kosina pointed out, the stopped task can resume without
      	  actually seeing SIGCONT which may have a handler.
      
      	- we race with another sig_kernel_stop() signal which may come in
      	  that window.
      
      	- we race with sig_fatal() signals which may set SIGNAL_GROUP_EXIT
      	  in that window.
      
      	- we can't avoid taking tasklist_lock() while sending SIGCONT.
      
      With this patch handle_stop_signal() just sets the new SIGNAL_CLD_CONTINUED
      flag in p->signal->flags and returns.  The notification is sent by the first
      task which returns from finish_stop() (there should be at least one) or any
      other signalled thread from get_signal_to_deliver().
      
      This is a user-visible change.  Say, currently kill(SIGCONT, stopped_child)
      can't return without seeing SIGCHLD, with this patch SIGCHLD can be delayed
      unpredictably.  Another difference is that if the child is ptraced by another
      process, CLD_CONTINUED may be delivered to ->real_parent after ptrace_detach()
      while currently it always goes to the tracer which doesn't actually need this
      notification.  Hopefully not a problem.
      
      The patch asks for the futher obvious cleanups, I'll send them separately.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e4420551
  17. 29 4月, 2008 1 次提交
    • B
      cgroups: add an owner to the mm_struct · cf475ad2
      Balbir Singh 提交于
      Remove the mem_cgroup member from mm_struct and instead adds an owner.
      
      This approach was suggested by Paul Menage.  The advantage of this approach
      is that, once the mm->owner is known, using the subsystem id, the cgroup
      can be determined.  It also allows several control groups that are
      virtually grouped by mm_struct, to exist independent of the memory
      controller i.e., without adding mem_cgroup's for each controller, to
      mm_struct.
      
      A new config option CONFIG_MM_OWNER is added and the memory resource
      controller selects this config option.
      
      This patch also adds cgroup callbacks to notify subsystems when mm->owner
      changes.  The mm_cgroup_changed callback is called with the task_lock() of
      the new task held and is called just prior to changing the mm->owner.
      
      I am indebted to Paul Menage for the several reviews of this patchset and
      helping me make it lighter and simpler.
      
      This patch was tested on a powerpc box, it was compiled with both the
      MM_OWNER config turned on and off.
      
      After the thread group leader exits, it's moved to init_css_state by
      cgroup_exit(), thus all future charges from runnings threads would be
      redirected to the init_css_set's subsystem.
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: David Rientjes <rientjes@google.com>,
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Reviewed-by: NPaul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf475ad2
  18. 28 4月, 2008 1 次提交
    • A
      capabilities: implement per-process securebits · 3898b1b4
      Andrew G. Morgan 提交于
      Filesystem capability support makes it possible to do away with (set)uid-0
      based privilege and use capabilities instead.  That is, with filesystem
      support for capabilities but without this present patch, it is (conceptually)
      possible to manage a system with capabilities alone and never need to obtain
      privilege via (set)uid-0.
      
      Of course, conceptually isn't quite the same as currently possible since few
      user applications, certainly not enough to run a viable system, are currently
      prepared to leverage capabilities to exercise privilege.  Further, many
      applications exist that may never get upgraded in this way, and the kernel
      will continue to want to support their setuid-0 base privilege needs.
      
      Where pure-capability applications evolve and replace setuid-0 binaries, it is
      desirable that there be a mechanisms by which they can contain their
      privilege.  In addition to leveraging the per-process bounding and inheritable
      sets, this should include suppressing the privilege of the uid-0 superuser
      from the process' tree of children.
      
      The feature added by this patch can be leveraged to suppress the privilege
      associated with (set)uid-0.  This suppression requires CAP_SETPCAP to
      initiate, and only immediately affects the 'current' process (it is inherited
      through fork()/exec()).  This reimplementation differs significantly from the
      historical support for securebits which was system-wide, unwieldy and which
      has ultimately withered to a dead relic in the source of the modern kernel.
      
      With this patch applied a process, that is capable(CAP_SETPCAP), can now drop
      all legacy privilege (through uid=0) for itself and all subsequently
      fork()'d/exec()'d children with:
      
        prctl(PR_SET_SECUREBITS, 0x2f);
      
      This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES is
      enabled at configure time.
      
      [akpm@linux-foundation.org: fix uninitialised var warning]
      [serue@us.ibm.com: capabilities: use cap_task_prctl when !CONFIG_SECURITY]
      Signed-off-by: NAndrew G. Morgan <morgan@kernel.org>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Reviewed-by: NJames Morris <jmorris@namei.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Paul Moore <paul.moore@hp.com>
      Signed-off-by: NSerge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3898b1b4
  19. 27 4月, 2008 1 次提交
    • C
      s390: KVM preparation: provide hook to enable pgstes in user pagetable · 402b0862
      Carsten Otte 提交于
      The SIE instruction on s390 uses the 2nd half of the page table page to
      virtualize the storage keys of a guest. This patch offers the s390_enable_sie
      function, which reorganizes the page tables of a single-threaded process to
      reserve space in the page table:
      s390_enable_sie makes sure that the process is single threaded and then uses
      dup_mm to create a new mm with reorganized page tables. The old mm is freed
      and the process has now a page status extended field after every page table.
      
      Code that wants to exploit pgstes should SELECT CONFIG_PGSTE.
      
      This patch has a small common code hit, namely making dup_mm non-static.
      
      Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's
      review feedback. Now we do have the prototype for dup_mm in
      include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() does now
      call task_lock() to prevent race against ptrace modification of mm_users.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NCarsten Otte <cotte@de.ibm.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAvi Kivity <avi@qumranet.com>
      402b0862