1. 23 6月, 2011 1 次提交
  2. 17 6月, 2011 1 次提交
    • T
      ptrace: implement PTRACE_LISTEN · 544b2c91
      Tejun Heo 提交于
      The previous patch implemented async notification for ptrace but it
      only worked while trace is running.  This patch introduces
      PTRACE_LISTEN which is suggested by Oleg Nestrov.
      
      It's allowed iff tracee is in STOP trap and puts tracee into
      quasi-running state - tracee never really runs but wait(2) and
      ptrace(2) consider it to be running.  While ptracer is listening,
      tracee is allowed to re-enter STOP to notify an async event.
      Listening state is cleared on the first notification.  Ptracer can
      also clear it by issuing INTERRUPT - tracee will re-trap into STOP
      with listening state cleared.
      
      This allows ptracer to monitor group stop state without running tracee
      - use INTERRUPT to put tracee into STOP trap, issue LISTEN and then
      wait(2) to wait for the next group stop event.  When it happens,
      PTRACE_GETSIGINFO provides information to determine the current state.
      
      Test program follows.
      
        #define PTRACE_SEIZE		0x4206
        #define PTRACE_INTERRUPT	0x4207
        #define PTRACE_LISTEN		0x4208
      
        #define PTRACE_SEIZE_DEVEL	0x80000000
      
        static const struct timespec ts1s = { .tv_sec = 1 };
      
        int main(int argc, char **argv)
        {
      	  pid_t tracee, tracer;
      	  int i;
      
      	  tracee = fork();
      	  if (!tracee)
      		  while (1)
      			  pause();
      
      	  tracer = fork();
      	  if (!tracer) {
      		  siginfo_t si;
      
      		  ptrace(PTRACE_SEIZE, tracee, NULL,
      			 (void *)(unsigned long)PTRACE_SEIZE_DEVEL);
      		  ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL);
      	  repeat:
      		  waitid(P_PID, tracee, NULL, WSTOPPED);
      
      		  ptrace(PTRACE_GETSIGINFO, tracee, NULL, &si);
      		  if (!si.si_code) {
      			  printf("tracer: SIG %d\n", si.si_signo);
      			  ptrace(PTRACE_CONT, tracee, NULL,
      				 (void *)(unsigned long)si.si_signo);
      			  goto repeat;
      		  }
      		  printf("tracer: stopped=%d signo=%d\n",
      			 si.si_signo != SIGTRAP, si.si_signo);
      		  if (si.si_signo != SIGTRAP)
      			  ptrace(PTRACE_LISTEN, tracee, NULL, NULL);
      		  else
      			  ptrace(PTRACE_CONT, tracee, NULL, NULL);
      		  goto repeat;
      	  }
      
      	  for (i = 0; i < 3; i++) {
      		  nanosleep(&ts1s, NULL);
      		  printf("mother: SIGSTOP\n");
      		  kill(tracee, SIGSTOP);
      		  nanosleep(&ts1s, NULL);
      		  printf("mother: SIGCONT\n");
      		  kill(tracee, SIGCONT);
      	  }
      	  nanosleep(&ts1s, NULL);
      
      	  kill(tracer, SIGKILL);
      	  kill(tracee, SIGKILL);
      	  return 0;
        }
      
      This is identical to the program to test TRAP_NOTIFY except that
      tracee is PTRACE_LISTEN'd instead of PTRACE_CONT'd when group stopped.
      This allows ptracer to monitor when group stop ends without running
      tracee.
      
        # ./test-listen
        tracer: stopped=0 signo=5
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
      
      -v2: Moved JOBCTL_LISTENING check in wait_task_stopped() into
           task_stopped_code() as suggested by Oleg.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      544b2c91
  3. 14 5月, 2011 1 次提交
    • T
      job control: reorganize wait_task_stopped() · 19e27463
      Tejun Heo 提交于
      wait_task_stopped() tested task_stopped_code() without acquiring
      siglock and, if stop condition existed, called wait_task_stopped() and
      directly returned the result.  This patch moves the initial
      task_stopped_code() testing into wait_task_stopped() and make
      wait_consider_task() fall through to wait_task_continue() on 0 return.
      
      This is for the following two reasons.
      
      * Because the initial task_stopped_code() test is done without
        acquiring siglock, it may race against SIGCONT generation.  The
        stopped condition might have been replaced by continued state by the
        time wait_task_stopped() acquired siglock.  This may lead to
        unexpected failure of WNOHANG waits.
      
        This reorganization addresses this single race case but there are
        other cases - TASK_RUNNING -> TASK_STOPPED transition and EXIT_*
        transitions.
      
      * Scheduled ptrace updates require changes to the initial test which
        would fit better inside wait_task_stopped().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      19e27463
  4. 25 4月, 2011 1 次提交
    • F
      ptrace: Prepare to fix racy accesses on task breakpoints · bf26c018
      Frederic Weisbecker 提交于
      When a task is traced and is in a stopped state, the tracer
      may execute a ptrace request to examine the tracee state and
      get its task struct. Right after, the tracee can be killed
      and thus its breakpoints released.
      This can happen concurrently when the tracer is in the middle
      of reading or modifying these breakpoints, leading to dereferencing
      a freed pointer.
      
      Hence, to prepare the fix, create a generic breakpoint reference
      holding API. When a reference on the breakpoints of a task is
      held, the breakpoints won't be released until the last reference
      is dropped. After that, no more ptrace request on the task's
      breakpoints can be serviced for the tracer.
      Reported-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Prasad <prasad@linux.vnet.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: v2.6.33.. <stable@kernel.org>
      Link: http://lkml.kernel.org/r/1302284067-7860-2-git-send-email-fweisbec@gmail.com
      bf26c018
  5. 31 3月, 2011 1 次提交
  6. 23 3月, 2011 3 次提交
    • T
      job control: Allow access to job control events through ptracees · 45cb24a1
      Tejun Heo 提交于
      Currently a real parent can't access job control stopped/continued
      events through a ptraced child.  This utterly breaks job control when
      the children are ptraced.
      
      For example, if a program is run from an interactive shell and then
      strace(1) attaches to it, pressing ^Z would send SIGTSTP and strace(1)
      would notice it but the shell has no way to tell whether the child
      entered job control stop and thus can't tell when to take over the
      terminal - leading to awkward lone ^Z on the terminal.
      
      Because the job control and ptrace stopped states are independent,
      there is no reason to prevent real parents from accessing the stopped
      state regardless of ptrace.  The continued state isn't separate but
      ptracers don't have any use for them as ptracees can never resume
      without explicit command from their ptracers, so as long as ptracers
      don't consume it, it should be fine.
      
      Although this is a behavior change, because the previous behavior is
      utterly broken when viewed from real parents and the change is only
      visible to real parents, I don't think it's necessary to make this
      behavior optional.
      
      One situation to be careful about is when a task from the real
      parent's group is ptracing.  The parent group is the recipient of both
      ptrace and job control stop events and one stop can be reported as
      both job control and ptrace stops.  As this can break the current
      ptrace users, suppress job control stopped events for these cases.
      
      If a real parent ptracer wants to know about both job control and
      ptrace stops, it can create a separate process to serve the role of
      real parent.
      
      Note that this only updates wait(2) side of things.  The real parent
      can access the states via wait(2) but still is not properly notified
      (woken up and delivered signal).  Test case polls wait(2) with WNOHANG
      to work around.  Notification will be updated by future patches.
      
      Test case follows.
      
        #include <stdio.h>
        #include <unistd.h>
        #include <time.h>
        #include <errno.h>
        #include <sys/types.h>
        #include <sys/ptrace.h>
        #include <sys/wait.h>
      
        int main(void)
        {
      	  const struct timespec ts100ms = { .tv_nsec = 100000000 };
      	  pid_t tracee, tracer;
      	  siginfo_t si;
      	  int i;
      
      	  tracee = fork();
      	  if (tracee == 0) {
      		  while (1) {
      			  printf("tracee: SIGSTOP\n");
      			  raise(SIGSTOP);
      			  nanosleep(&ts100ms, NULL);
      			  printf("tracee: SIGCONT\n");
      			  raise(SIGCONT);
      			  nanosleep(&ts100ms, NULL);
      		  }
      	  }
      
      	  waitid(P_PID, tracee, &si, WSTOPPED | WNOHANG | WNOWAIT);
      
      	  tracer = fork();
      	  if (tracer == 0) {
      		  nanosleep(&ts100ms, NULL);
      		  ptrace(PTRACE_ATTACH, tracee, NULL, NULL);
      
      		  for (i = 0; i < 11; i++) {
      			  si.si_pid = 0;
      			  waitid(P_PID, tracee, &si, WSTOPPED);
      			  if (si.si_pid && si.si_code == CLD_TRAPPED)
      				  ptrace(PTRACE_CONT, tracee, NULL,
      					 (void *)(long)si.si_status);
      		  }
      		  printf("tracer: EXITING\n");
      		  return 0;
      	  }
      
      	  while (1) {
      		  si.si_pid = 0;
      		  waitid(P_PID, tracee, &si,
      			 WSTOPPED | WCONTINUED | WEXITED | WNOHANG);
      		  if (si.si_pid)
      			  printf("mommy : WAIT status=%02d code=%02d\n",
      				 si.si_status, si.si_code);
      		  nanosleep(&ts100ms, NULL);
      	  }
      	  return 0;
        }
      
      Before the patch, while ptraced, the parent can't see any job control
      events.
      
        tracee: SIGSTOP
        mommy : WAIT status=19 code=05
        tracee: SIGCONT
        tracee: SIGSTOP
        tracee: SIGCONT
        tracee: SIGSTOP
        tracee: SIGCONT
        tracee: SIGSTOP
        tracer: EXITING
        mommy : WAIT status=19 code=05
        ^C
      
      After the patch,
      
        tracee: SIGSTOP
        mommy : WAIT status=19 code=05
        tracee: SIGCONT
        mommy : WAIT status=18 code=06
        tracee: SIGSTOP
        mommy : WAIT status=19 code=05
        tracee: SIGCONT
        mommy : WAIT status=18 code=06
        tracee: SIGSTOP
        mommy : WAIT status=19 code=05
        tracee: SIGCONT
        mommy : WAIT status=18 code=06
        tracee: SIGSTOP
        tracer: EXITING
        mommy : WAIT status=19 code=05
        ^C
      
      -v2: Oleg pointed out that wait(2) should be suppressed for the real
           parent's group instead of only the real parent task itself.
           Updated accordingly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      45cb24a1
    • T
      job control: Fix ptracer wait(2) hang and explain notask_error clearing · 9b84cca2
      Tejun Heo 提交于
      wait(2) and friends allow access to stopped/continued states through
      zombies, which is required as the states are process-wide and should
      be accessible whether the leader task is alive or undead.
      wait_consider_task() implements this by always clearing notask_error
      and going through wait_task_stopped/continued() for unreaped zombies.
      
      However, while ptraced, the stopped state is per-task and as such if
      the ptracee became a zombie, there's no further stopped event to
      listen to and wait(2) and friends should return -ECHILD on the tracee.
      
      Fix it by clearing notask_error only if WCONTINUED | WEXITED is set
      for ptraced zombies.  While at it, document why clearing notask_error
      is safe for each case.
      
      Test case follows.
      
        #include <stdio.h>
        #include <unistd.h>
        #include <pthread.h>
        #include <time.h>
        #include <sys/types.h>
        #include <sys/ptrace.h>
        #include <sys/wait.h>
      
        static void *nooper(void *arg)
        {
      	  pause();
      	  return NULL;
        }
      
        int main(void)
        {
      	  const struct timespec ts1s = { .tv_sec = 1 };
      	  pid_t tracee, tracer;
      	  siginfo_t si;
      
      	  tracee = fork();
      	  if (tracee == 0) {
      		  pthread_t thr;
      
      		  pthread_create(&thr, NULL, nooper, NULL);
      		  nanosleep(&ts1s, NULL);
      		  printf("tracee exiting\n");
      		  pthread_exit(NULL);	/* let subthread run */
      	  }
      
      	  tracer = fork();
      	  if (tracer == 0) {
      		  ptrace(PTRACE_ATTACH, tracee, NULL, NULL);
      		  while (1) {
      			  if (waitid(P_PID, tracee, &si, WSTOPPED) < 0) {
      				  perror("waitid");
      				  break;
      			  }
      			  ptrace(PTRACE_CONT, tracee, NULL,
      				 (void *)(long)si.si_status);
      		  }
      		  return 0;
      	  }
      
      	  waitid(P_PID, tracer, &si, WEXITED);
      	  kill(tracee, SIGKILL);
      	  return 0;
        }
      
      Before the patch, after the tracee becomes a zombie, the tracer's
      waitid(WSTOPPED) never returns and the program doesn't terminate.
      
        tracee exiting
        ^C
      
      After the patch, tracee exiting triggers waitid() to fail.
      
        tracee exiting
        waitid: No child processes
      
      -v2: Oleg pointed out that exited in addition to continued can happen
           for ptraced dead group leader.  Clear notask_error for ptraced
           child on WEXITED too.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      9b84cca2
    • T
      job control: Small reorganization of wait_consider_task() · 823b018e
      Tejun Heo 提交于
      Move EXIT_DEAD test in wait_consider_task() above ptrace check.  As
      ptraced tasks can't be EXIT_DEAD, this change doesn't cause any
      behavior change.  This is to prepare for further changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      823b018e
  7. 10 3月, 2011 1 次提交
    • J
      block: initial patch for on-stack per-task plugging · 73c10101
      Jens Axboe 提交于
      This patch adds support for creating a queuing context outside
      of the queue itself. This enables us to batch up pieces of IO
      before grabbing the block device queue lock and submitting them to
      the IO scheduler.
      
      The context is created on the stack of the process and assigned in
      the task structure, so that we can auto-unplug it if we hit a schedule
      event.
      
      The current queue plugging happens implicitly if IO is submitted to
      an empty device, yet callers have to remember to unplug that IO when
      they are going to wait for it. This is an ugly API and has caused bugs
      in the past. Additionally, it requires hacks in the vm (->sync_page()
      callback) to handle that logic. By switching to an explicit plugging
      scheme we make the API a lot nicer and can get rid of the ->sync_page()
      hack in the vm.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      73c10101
  8. 07 1月, 2011 1 次提交
  9. 17 12月, 2010 1 次提交
  10. 03 12月, 2010 1 次提交
    • N
      do_exit(): make sure that we run with get_fs() == USER_DS · 33dd94ae
      Nelson Elhage 提交于
      If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
      otherwise reset before do_exit().  do_exit may later (via mm_release in
      fork.c) do a put_user to a user-controlled address, potentially allowing
      a user to leverage an oops into a controlled write into kernel memory.
      
      This is only triggerable in the presence of another bug, but this
      potentially turns a lot of DoS bugs into privilege escalations, so it's
      worth fixing.  I have proof-of-concept code which uses this bug along
      with CVE-2010-3849 to write a zero to an arbitrary kernel address, so
      I've tested that this is not theoretical.
      
      A more logical place to put this fix might be when we know an oops has
      occurred, before we call do_exit(), but that would involve changing
      every architecture, in multiple places.
      
      Let's just stick it in do_exit instead.
      
      [akpm@linux-foundation.org: update code comment]
      Signed-off-by: NNelson Elhage <nelhage@ksplice.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33dd94ae
  11. 06 11月, 2010 1 次提交
    • O
      posix-cpu-timers: workaround to suppress the problems with mt exec · e0a70217
      Oleg Nesterov 提交于
      posix-cpu-timers.c correctly assumes that the dying process does
      posix_cpu_timers_exit_group() and removes all !CPUCLOCK_PERTHREAD
      timers from signal->cpu_timers list.
      
      But, it also assumes that timer->it.cpu.task is always the group
      leader, and thus the dead ->task means the dead thread group.
      
      This is obviously not true after de_thread() changes the leader.
      After that almost every posix_cpu_timer_ method has problems.
      
      It is not simple to fix this bug correctly. First of all, I think
      that timer->it.cpu should use struct pid instead of task_struct.
      Also, the locking should be reworked completely. In particular,
      tasklist_lock should not be used at all. This all needs a lot of
      nontrivial and hard-to-test changes.
      
      Change __exit_signal() to do posix_cpu_timers_exit_group() when
      the old leader dies during exec. This is not the fix, just the
      temporary hack to hide the problem for 2.6.37 and stable. IOW,
      this is obviously wrong but this is what we currently have anyway:
      cpu timers do not work after mt exec.
      
      In theory this change adds another race. The exiting leader can
      detach the timers which were attached to the new leader. However,
      the window between de_thread() and release_task() is small, we
      can pretend that sys_timer_create() was called before de_thread().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e0a70217
  12. 28 10月, 2010 1 次提交
  13. 27 10月, 2010 1 次提交
    • Y
      oom: add per-mm oom disable count · 3d5992d2
      Ying Han 提交于
      It's pointless to kill a task if another thread sharing its mm cannot be
      killed to allow future memory freeing.  A subsequent patch will prevent
      kills in such cases, but first it's necessary to have a way to flag a task
      that shares memory with an OOM_DISABLE task that doesn't incur an
      additional tasklist scan, which would make select_bad_process() an O(n^2)
      function.
      
      This patch adds an atomic counter to struct mm_struct that follows how
      many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
      They cannot be killed by the kernel, so their memory cannot be freed in
      oom conditions.
      
      This only requires task_lock() on the task that we're operating on, it
      does not require mm->mmap_sem since task_lock() pins the mm and the
      operation is atomic.
      
      [rientjes@google.com: changelog and sys_unshare() code]
      [rientjes@google.com: protect oom_disable_count with task_lock in fork]
      [rientjes@google.com: use old_mm for oom_disable_count in exec]
      Signed-off-by: NYing Han <yinghan@google.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d5992d2
  14. 10 9月, 2010 1 次提交
  15. 18 8月, 2010 1 次提交
    • D
      Fix unprotected access to task credentials in waitid() · f362b732
      Daniel J Blueman 提交于
      Using a program like the following:
      
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/types.h>
      	#include <sys/wait.h>
      
      	int main() {
      		id_t id;
      		siginfo_t infop;
      		pid_t res;
      
      		id = fork();
      		if (id == 0) { sleep(1); exit(0); }
      		kill(id, SIGSTOP);
      		alarm(1);
      		waitid(P_PID, id, &infop, WCONTINUED);
      		return 0;
      	}
      
      to call waitid() on a stopped process results in access to the child task's
      credentials without the RCU read lock being held - which may be replaced in the
      meantime - eliciting the following warning:
      
      	===================================================
      	[ INFO: suspicious rcu_dereference_check() usage. ]
      	---------------------------------------------------
      	kernel/exit.c:1460 invoked rcu_dereference_check() without protection!
      
      	other info that might help us debug this:
      
      	rcu_scheduler_active = 1, debug_locks = 1
      	2 locks held by waitid02/22252:
      	 #0:  (tasklist_lock){.?.?..}, at: [<ffffffff81061ce5>] do_wait+0xc5/0x310
      	 #1:  (&(&sighand->siglock)->rlock){-.-...}, at: [<ffffffff810611da>]
      	wait_consider_task+0x19a/0xbe0
      
      	stack backtrace:
      	Pid: 22252, comm: waitid02 Not tainted 2.6.35-323cd+ #3
      	Call Trace:
      	 [<ffffffff81095da4>] lockdep_rcu_dereference+0xa4/0xc0
      	 [<ffffffff81061b31>] wait_consider_task+0xaf1/0xbe0
      	 [<ffffffff81061d15>] do_wait+0xf5/0x310
      	 [<ffffffff810620b6>] sys_waitid+0x86/0x1f0
      	 [<ffffffff8105fce0>] ? child_wait_callback+0x0/0x70
      	 [<ffffffff81003282>] system_call_fastpath+0x16/0x1b
      
      This is fixed by holding the RCU read lock in wait_task_continued() to ensure
      that the task's current credentials aren't destroyed between us reading the
      cred pointer and us reading the UID from those credentials.
      
      Furthermore, protect wait_task_stopped() in the same way.
      
      We don't need to keep holding the RCU read lock once we've read the UID from
      the credentials as holding the RCU read lock doesn't stop the target task from
      changing its creds under us - so the credentials may be outdated immediately
      after we've read the pointer, lock or no lock.
      Signed-off-by: NDaniel J Blueman <daniel.blueman@gmail.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f362b732
  16. 11 8月, 2010 1 次提交
  17. 28 5月, 2010 10 次提交
  18. 25 5月, 2010 1 次提交
    • M
      cpuset,mm: fix no node to alloc memory when changing cpuset's mems · c0ff7453
      Miao Xie 提交于
      Before applying this patch, cpuset updates task->mems_allowed and
      mempolicy by setting all new bits in the nodemask first, and clearing all
      old unallowed bits later.  But in the way, the allocator may find that
      there is no node to alloc memory.
      
      The reason is that cpuset rebinds the task's mempolicy, it cleans the
      nodes which the allocater can alloc pages on, for example:
      
      (mpol: mempolicy)
      	task1			task1's mpol	task2
      	alloc page		1
      	  alloc on node0? NO	1
      				1		change mems from 1 to 0
      				1		rebind task1's mpol
      				0-1		  set new bits
      				0	  	  clear disallowed bits
      	  alloc on node1? NO	0
      	  ...
      	can't alloc page
      	  goto oom
      
      This patch fixes this problem by expanding the nodes range first(set newly
      allowed bits) and shrink it lazily(clear newly disallowed bits).  So we
      use a variable to tell the write-side task that read-side task is reading
      nodemask, and the write-side task clears newly disallowed nodes after
      read-side task ends the current memory allocation.
      
      [akpm@linux-foundation.org: fix spello]
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0ff7453
  19. 07 4月, 2010 1 次提交
  20. 03 4月, 2010 1 次提交
  21. 07 3月, 2010 2 次提交
  22. 04 3月, 2010 1 次提交
  23. 25 2月, 2010 1 次提交
    • P
      sched: Use lockdep-based checking on rcu_dereference() · d11c563d
      Paul E. McKenney 提交于
      Update the rcu_dereference() usages to take advantage of the new
      lockdep-based checking.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com>
      [ -v2: fix allmodconfig missing symbol export build failure on x86 ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d11c563d
  24. 18 12月, 2009 1 次提交
    • O
      do_wait() optimization: do not place sub-threads on task_struct->children list · 9cd80bbb
      Oleg Nesterov 提交于
      Thanks to Roland who pointed out de_thread() issues.
      
      Currently we add sub-threads to ->real_parent->children list.  This buys
      nothing but slows down do_wait().
      
      With this patch ->children contains only main threads (group leaders).
      The only complication is that forget_original_parent() should iterate over
      sub-threads by hand, and de_thread() needs another list_replace() when it
      changes ->group_leader.
      
      Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
      tasks, we can remove this check (and we can unify do_wait_thread() and
      ptrace_do_wait()).
      
      This change can confuse the optimistic search in mm_update_next_owner(),
      but this is fixable and minor.
      
      Perhaps badness() and oom_kill_process() should be updated, but they
      should be fixed in any case.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ratan Nalumasu <rnalumasu@gmail.com>
      Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9cd80bbb
  25. 15 12月, 2009 1 次提交
  26. 12 12月, 2009 1 次提交
  27. 04 12月, 2009 1 次提交
  28. 03 12月, 2009 1 次提交
    • H
      sched, cputime: Introduce thread_group_times() · 0cf55e1e
      Hidetoshi Seto 提交于
      This is a real fix for problem of utime/stime values decreasing
      described in the thread:
      
         http://lkml.org/lkml/2009/11/3/522
      
      Now cputime is accounted in the following way:
      
       - {u,s}time in task_struct are increased every time when the thread
         is interrupted by a tick (timer interrupt).
      
       - When a thread exits, its {u,s}time are added to signal->{u,s}time,
         after adjusted by task_times().
      
       - When all threads in a thread_group exits, accumulated {u,s}time
         (and also c{u,s}time) in signal struct are added to c{u,s}time
         in signal struct of the group's parent.
      
      So {u,s}time in task struct are "raw" tick count, while
      {u,s}time and c{u,s}time in signal struct are "adjusted" values.
      
      And accounted values are used by:
      
       - task_times(), to get cputime of a thread:
         This function returns adjusted values that originates from raw
         {u,s}time and scaled by sum_exec_runtime that accounted by CFS.
      
       - thread_group_cputime(), to get cputime of a thread group:
         This function returns sum of all {u,s}time of living threads in
         the group, plus {u,s}time in the signal struct that is sum of
         adjusted cputimes of all exited threads belonged to the group.
      
      The problem is the return value of thread_group_cputime(),
      because it is mixed sum of "raw" value and "adjusted" value:
      
        group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)
      
      This misbehavior can break {u,s}time monotonicity.
      Assume that if there is a thread that have raw values greater
      than adjusted values (e.g. interrupted by 1000Hz ticks 50 times
      but only runs 45ms) and if it exits, cputime will decrease (e.g.
      -5ms).
      
      To fix this, we could do:
      
        group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)
      
      But task_times() contains hard divisions, so applying it for
      every thread should be avoided.
      
      This patch fixes the above problem in the following way:
      
       - Modify thread's exit (= __exit_signal()) not to use task_times().
         It means {u,s}time in signal struct accumulates raw values instead
         of adjusted values.  As the result it makes thread_group_cputime()
         to return pure sum of "raw" values.
      
       - Introduce a new function thread_group_times(*task, *utime, *stime)
         that converts "raw" values of thread_group_cputime() to "adjusted"
         values, in same calculation procedure as task_times().
      
       - Modify group's exit (= wait_task_zombie()) to use this introduced
         thread_group_times().  It make c{u,s}time in signal struct to
         have adjusted values like before this patch.
      
       - Replace some thread_group_cputime() by thread_group_times().
         This replacements are only applied where conveys the "adjusted"
         cputime to users, and where already uses task_times() near by it.
         (i.e. sys_times(), getrusage(), and /proc/<PID>/stat.)
      
      This patch have a positive side effect:
      
       - Before this patch, if a group contains many short-life threads
         (e.g. runs 0.9ms and not interrupted by ticks), the group's
         cputime could be invisible since thread's cputime was accumulated
         after adjusted: imagine adjustment function as adj(ticks, runtime),
           {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
         After this patch it will not happen because the adjustment is
         applied after accumulated.
      
      v2:
       - remove if()s, put new variables into signal_struct.
      Signed-off-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Spencer Candland <spencer@bluehost.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      LKML-Reference: <4B162517.8040909@jp.fujitsu.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0cf55e1e