1. 10 Mar, 2012 1 commit
  2. 27 Jan, 2012 1 commit
    • sched: Fix ancient race in do_exit() · b5740f4b
      Committed by Yasunori Goto
      try_to_wake_up() can change a task's state from TASK_DEAD back to
      TASK_RUNNING when it races with an SMI or runs in a virtual machine
      guest. As a result, the exited task is scheduled again and a panic
      occurs.
      
      Here is the sequence how it occurs:
      
       ----------------------------------+-----------------------------
                                         |
                  CPU A                  |             CPU B
       ----------------------------------+-----------------------------
      
      TASK A calls exit()....
      
      do_exit()
      
        exit_mm()
          down_read(mm->mmap_sem);
      
          rwsem_down_failed_common()
      
            set TASK_UNINTERRUPTIBLE
            set waiter.task <= task A
            list_add to sem->wait_list
                 :
            raw_spin_unlock_irq()
            (I/O interruption occurred)
      
                                            __rwsem_do_wake(mmap_sem)
      
                                              list_del(&waiter->list);
                                              waiter->task = NULL
                                              wake_up_process(task A)
                                                try_to_wake_up()
                                                   (task is still
                                                     TASK_UNINTERRUPTIBLE,
                                                     p->on_rq is still 1.)
      
                                                    ttwu_do_wakeup()
                                                       (*A)
                                                         :
           (I/O interruption handler finished)
      
            if (!waiter.task)
                schedule() is not called
                because waiter.task is NULL.
      
            tsk->state = TASK_RUNNING
      
                :
                                                    check_preempt_curr();
                                                        :
        task->state = TASK_DEAD
                                                    (*B)
                                              <---    set TASK_RUNNING (*C)
      
           schedule()
           (exit task is running again)
           BUG_ON() is called!
       --------------------------------------------------------
      
      The execution time between (*A) and (*B) is usually very short
      because interrupts are disabled, so setting TASK_RUNNING at (*C)
      normally executes before TASK_DEAD is set.
      
      HOWEVER, if an SMI occurs between (*A) and (*B),
      (*C) can execute AFTER TASK_DEAD is set!
      Then the exited task is scheduled again, and BUG_ON() is called....
      
      If the system runs as a guest of a virtual machine, the time
      between (*A) and (*B) may also be long due to hypervisor scheduling,
      and the same phenomenon can occur.
      
      With this patch, do_exit() waits for task->pi_lock, which
      try_to_wake_up() holds, to be released. This guarantees that the task
      becomes TASK_DEAD only after the wakeup has completed.
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20120117174031.3118.E1E9C6FF@jp.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
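The ordering that this commit message describes can be sketched in user space. This is only a toy model under stated assumptions, not the kernel change itself: `struct task_model`, `model_wake_up()` and `model_exit()` are hypothetical names, a pthread mutex stands in for task->pi_lock, and a lock/unlock pair approximates "wait until the lock has been released".

```c
#include <pthread.h>
#include <stdatomic.h>

enum model_state { M_RUNNING, M_UNINTERRUPTIBLE, M_DEAD };

struct task_model {
	pthread_mutex_t pi_lock;	/* stands in for task->pi_lock */
	_Atomic int state;
};

/* Waker: flips a sleeping task to running, always under pi_lock. */
void model_wake_up(struct task_model *t)
{
	pthread_mutex_lock(&t->pi_lock);
	if (atomic_load(&t->state) == M_UNINTERRUPTIBLE)
		atomic_store(&t->state, M_RUNNING);
	pthread_mutex_unlock(&t->pi_lock);
}

/* Exit path: let any in-flight waker finish before becoming dead. */
void model_exit(struct task_model *t)
{
	atomic_store(&t->state, M_RUNNING);
	/* Approximates "wait until pi_lock is released" (assumption). */
	pthread_mutex_lock(&t->pi_lock);
	pthread_mutex_unlock(&t->pi_lock);
	/* A waker that saw the old sleeping state is done by now. */
	atomic_store(&t->state, M_DEAD);
}
```

In this model a waker that starts after `model_exit()` no longer sees the sleeping state, so it cannot overwrite `M_DEAD`; a waker that started earlier must drop the lock before the exit path proceeds.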
  3. 18 Jan, 2012 1 commit
  4. 13 Jan, 2012 1 commit
  5. 05 Jan, 2012 1 commit
    • ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE race · 50b8d257
      Committed by Oleg Nesterov
      Test-case:
      
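      	#include <assert.h>
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>
      
      	int main(void)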
      	{
      		int pid, status;
      
      		pid = fork();
      		if (!pid) {
      			for (;;) {
      				if (!fork())
      					return 0;
      				if (waitpid(-1, &status, 0) < 0) {
      					printf("ERR!! wait: %m\n");
      					return 0;
      				}
      			}
      		}
      
      		assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
      		assert(waitpid(-1, NULL, 0) == pid);
      
      		assert(ptrace(PTRACE_SETOPTIONS, pid, 0,
      					PTRACE_O_TRACEFORK) == 0);
      
      		do {
      			ptrace(PTRACE_CONT, pid, 0, 0);
      			pid = waitpid(-1, NULL, 0);
      		} while (pid > 0);
      
      		return 1;
      	}
      
      It fails because ->real_parent sees its child in EXIT_DEAD state
      while the tracer is going to change the state back to EXIT_ZOMBIE
      in wait_task_zombie().
      
      The offending commit is 823b018e, which moved the EXIT_DEAD check,
      but in fact we should not blame it. The original code was not
      correct either, because it didn't take ptrace_reparented() into
      account and because we can't really trust ->ptrace.
      
      This patch adds the additional check to close this particular
      race but it doesn't solve the whole problem. We simply can't
      rely on ->ptrace in this case, it can be cleared if the tracer
      is multithreaded by the exiting ->parent.
      
      I think we should kill EXIT_DEAD altogether, we should always
      remove the soon-to-be-reaped child from ->children or at least
      we should never do the DEAD->ZOMBIE transition. But this is too
      complex for 3.2.
      Reported-and-tested-by: Denys Vlasenko <vda.linux@googlemail.com>
      Tested-by: Lukasz Michalik <lmi@ift.uni.wroc.pl>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: <stable@kernel.org>		[3.0+]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 18 Dec, 2011 1 commit
  7. 15 Dec, 2011 1 commit
  8. 22 Nov, 2011 1 commit
  9. 01 Nov, 2011 1 commit
    • oom: remove oom_disable_count · c9f01245
      Committed by David Rientjes
      This removes mm->oom_disable_count entirely since it's unnecessary and
      currently buggy.  The counter was intended to be per-process but it's
      currently decremented in the exit path for each thread that exits, causing
      it to underflow.
      
      The count was originally intended to prevent oom killing threads that
      share memory with threads that cannot be killed since it doesn't lead to
      future memory freeing.  The counter could be fixed to represent all
      threads sharing the same mm, but it's better to remove the count since:
      
       - it is possible that the OOM_DISABLE thread sharing memory with the
         victim is waiting on that thread to exit and will actually cause
         future memory freeing, and
      
       - there is no guarantee that a thread is disabled from oom killing just
         because another thread sharing its mm is oom disabled.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
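The underflow described in this commit message comes down to simple arithmetic, which the following toy model illustrates. It assumes a counter that is raised once per process and lowered once per exiting thread; `struct mm_model` and `count_after_exit()` are hypothetical names, and a plain `int` replaces whatever counter type the kernel used.

```c
/* Toy stand-in for the mm-wide counter (hypothetical type). */
struct mm_model {
	int oom_disable_count;
};

/*
 * One increment for the whole process, then one decrement for every
 * thread that exits. Returns the final counter value.
 */
int count_after_exit(int nr_threads)
{
	struct mm_model mm = { 0 };

	mm.oom_disable_count++;			/* counted once per process */
	for (int i = 0; i < nr_threads; i++)
		mm.oom_disable_count--;		/* dropped once per thread */
	return mm.oom_disable_count;
}
```

With a single thread the counter returns to zero; with N threads it ends at 1 - N, which is the underflow the commit removes by deleting the counter.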
  10. 27 Jul, 2011 1 commit
    • ipc: introduce shm_rmid_forced sysctl · b34a6b1d
      Committed by Vasiliy Kulikov
      Add support for the shm_rmid_forced sysctl.  If set to 1, all shared
      memory objects in current ipc namespace will be automatically forced to
      use IPC_RMID.
      
      The POSIX way of handling shmem allows one to create shm objects and
      call shmdt(), leaving shm object associated with no process, thus
      consuming memory not counted via rlimits.
      
      With shm_rmid_forced=1 the shared memory object is counted at least for
      one process, so OOM killer may effectively kill the fat process holding
      the shared memory.
      
      It obviously breaks POSIX - some programs relying on the feature would
      stop working.  So set shm_rmid_forced=1 only if you're sure nobody uses
      "orphaned" memory.  Use shm_rmid_forced=0 by default for compatibility
      reasons.
      
      The feature was previously implemented in -ow as a configure option.
      
      [akpm@linux-foundation.org: fix documentation, per Randy]
      [akpm@linux-foundation.org: fix warning]
      [akpm@linux-foundation.org: readability/conventionality tweaks]
      [akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout]
      Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com>
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Solar Designer <solar@openwall.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
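Before relying on either behavior, a program can check which mode is active. The helper below is a minimal sketch: `read_int_file()` is a hypothetical name, and the assumption is that the sysctl is exposed as a text file holding a single integer (on typical systems under /proc/sys/kernel/shm_rmid_forced).

```c
#include <stdio.h>

/*
 * Read one integer from a sysctl-style text file.
 * Returns the value, or -1 if the file is missing or unparsable.
 */
int read_int_file(const char *path)
{
	FILE *f = fopen(path, "r");
	int value;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```

Calling `read_int_file("/proc/sys/kernel/shm_rmid_forced")` would then yield 0 or 1 on kernels that carry the sysctl, and -1 where it is absent.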
  11. 18 Jul, 2011 1 commit
    • has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/ · 961c4675
      Committed by Oleg Nesterov
      has_stopped_jobs() naively checks task_is_stopped(group_leader). This
      was always wrong even without ptrace, group_leader can be dead. And
      given that ptrace can change the state to TRACED this is wrong even
      in the single-threaded case.
      
      Change the code to check SIGNAL_STOP_STOPPED and simplify the code,
      retval + break/continue doesn't make this trivial code more readable.
      
      We could probably add the usual "|| signal->group_stop_count" check
      but I don't think this makes sense, the task can start the group-stop
      right after the check anyway.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
  12. 12 Jul, 2011 1 commit
    • fixlet: Remove fs_excl from struct task. · 4aede84b
      Committed by Justin TerAvest
      fs_excl is a poor man's priority inheritance for filesystems to hint to
      the block layer that an operation is important. It was never clearly
      specified, not widely adopted, and will not prevent starvation in many
      cases (like across cgroups).
      
      fs_excl was introduced with the time sliced CFQ IO scheduler, to
      indicate when a process held FS exclusive resources and thus needed
      a boost.
      
      It doesn't cover all file systems, and it was never fully complete.
      Let's kill it.
      Signed-off-by: Justin TerAvest <teravest@google.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  13. 09 Jul, 2011 1 commit
  14. 28 Jun, 2011 6 commits
    • ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/ · 479bf98c
      Committed by Oleg Nesterov
      wait_consider_task() checks same_thread_group(parent, real_parent),
      this is the open-coded ptrace_reparented().
      
      __ptrace_detach() remains the only function which has to check this by
      hand, although we could reorganize the code to delay __ptrace_unlink.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
    • kill task_detached() · e550f14d
      Committed by Oleg Nesterov
      Update the last user of task_detached(), wait_task_zombie(), to
      use thread_group_leader() and kill task_detached().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
    • reparent_leader: check EXIT_DEAD instead of task_detached() · 0976a03e
      Committed by Oleg Nesterov
      Change reparent_leader() to check ->exit_state instead of ->exit_signal,
      this matches the similar EXIT_DEAD check in wait_consider_task() and
      allows us to cleanup the do_notify_parent/task_detached logic.
      
      task_detached() was really needed during reparenting before 9cd80bbb
      "do_wait() optimization: do not place sub-threads on ->children list"
      to filter out the sub-threads. After this change task_detached(p) can
      only be true if p is the dead group_leader and its parent ignores
      SIGCHLD, in this case the caller of do_notify_parent() is going to
      reap this task and it should set EXIT_DEAD.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
    • make do_notify_parent() __must_check, update the callers · 86773473
      Committed by Oleg Nesterov
      Change other callers of do_notify_parent() to check the value it
      returns, this makes the subsequent task_detached() unnecessary.
      Mark do_notify_parent() as __must_check.
      
      Use thread_group_leader() instead of !task_detached() to check
      if we need to notify the real parent in wait_task_zombie().
      
      Remove the stale comment in release_task(). "just for sanity" is
      no longer true, we have to set EXIT_DEAD to avoid the races with
      do_wait().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
    • kill tracehook_notify_death() · 45cdf5cc
      Committed by Oleg Nesterov
      Kill tracehook_notify_death(), reimplement the logic in its caller,
      exit_notify().
      
      Also, change the exec_id check to use thread_group_leader() instead
      of task_detached(), which is clearer. This logic only applies to
      the exiting leader; a sub-thread must never change its exit_signal.
      
      Note: when the traced group leader exits, the exit_signal-or-SIGCHLD
      logic looks really strange:
      
      	- we notify the tracer even if !thread_group_empty() but
      	   do_wait(WEXITED) can't work until all threads exit
      
      	- if the tracer is real_parent, it is not clear why we can't
      	  use ->exit_signal even if !thread_group_empty()
      
      -v2: do not try to fix the 2nd oddity to avoid the subtle behavior
           change mixed with reorganization, suggested by Tejun.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
    • make do_notify_parent() return bool · 53c8f9f1
      Committed by Oleg Nesterov
      - change do_notify_parent() to return a boolean, true if the task should
        be reaped because its parent ignores SIGCHLD.
      
      - update the only caller which checks the returned value, exit_notify().
      
      This temporarily uglifies exit_notify() even more; it will be cleaned
      up by the next change.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
  15. 23 Jun, 2011 2 commits
    • ptrace: kill trivial tracehooks · a288eecc
      Committed by Tejun Heo
      At this point, tracehooks aren't useful to mainline kernel and mostly
      just add an extra layer of obfuscation.  Although they have comments,
      without actual in-kernel users, it is difficult to tell what their
      assumptions are and what they're actually trying to achieve.  To
      mainline kernel, they just aren't worth keeping around.
      
      This patch kills the following trivial tracehooks.
      
      * Ones testing whether task is ptraced.  Replace with ->ptrace test.
      
      	tracehook_expect_breakpoints()
      	tracehook_consider_ignored_signal()
      	tracehook_consider_fatal_signal()
      
      * ptrace_event() wrappers.  Call directly.
      
      	tracehook_report_exec()
      	tracehook_report_exit()
      	tracehook_report_vfork_done()
      
      * ptrace_release_task() wrapper.  Call directly.
      
      	tracehook_finish_release_task()
      
      * noop
      
      	tracehook_prepare_release_task()
      	tracehook_report_death()
      
      This doesn't introduce any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • ptrace: kill task_ptrace() · d21142ec
      Committed by Tejun Heo
      task_ptrace(task) simply dereferences task->ptrace and isn't even used
      consistently, which only adds confusion.  Kill it and directly access
      ->ptrace instead.
      
      This doesn't introduce any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
  16. 17 Jun, 2011 1 commit
    • ptrace: implement PTRACE_LISTEN · 544b2c91
      Committed by Tejun Heo
      The previous patch implemented async notification for ptrace but it
      only worked while the tracee is running.  This patch introduces
      PTRACE_LISTEN, which was suggested by Oleg Nesterov.
      
      It's allowed iff tracee is in STOP trap and puts tracee into
      quasi-running state - tracee never really runs but wait(2) and
      ptrace(2) consider it to be running.  While ptracer is listening,
      tracee is allowed to re-enter STOP to notify an async event.
      Listening state is cleared on the first notification.  Ptracer can
      also clear it by issuing INTERRUPT - tracee will re-trap into STOP
      with listening state cleared.
      
      This allows ptracer to monitor group stop state without running tracee
      - use INTERRUPT to put tracee into STOP trap, issue LISTEN and then
      wait(2) to wait for the next group stop event.  When it happens,
      PTRACE_GETSIGINFO provides information to determine the current state.
      
      Test program follows.
      
        #define PTRACE_SEIZE		0x4206
        #define PTRACE_INTERRUPT	0x4207
        #define PTRACE_LISTEN		0x4208
      
        #define PTRACE_SEIZE_DEVEL	0x80000000
      
        static const struct timespec ts1s = { .tv_sec = 1 };
      
        int main(int argc, char **argv)
        {
      	  pid_t tracee, tracer;
      	  int i;
      
      	  tracee = fork();
      	  if (!tracee)
      		  while (1)
      			  pause();
      
      	  tracer = fork();
      	  if (!tracer) {
      		  siginfo_t si;
      
      		  ptrace(PTRACE_SEIZE, tracee, NULL,
      			 (void *)(unsigned long)PTRACE_SEIZE_DEVEL);
      		  ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL);
      	  repeat:
      		  waitid(P_PID, tracee, NULL, WSTOPPED);
      
      		  ptrace(PTRACE_GETSIGINFO, tracee, NULL, &si);
      		  if (!si.si_code) {
      			  printf("tracer: SIG %d\n", si.si_signo);
      			  ptrace(PTRACE_CONT, tracee, NULL,
      				 (void *)(unsigned long)si.si_signo);
      			  goto repeat;
      		  }
      		  printf("tracer: stopped=%d signo=%d\n",
      			 si.si_signo != SIGTRAP, si.si_signo);
      		  if (si.si_signo != SIGTRAP)
      			  ptrace(PTRACE_LISTEN, tracee, NULL, NULL);
      		  else
      			  ptrace(PTRACE_CONT, tracee, NULL, NULL);
      		  goto repeat;
      	  }
      
      	  for (i = 0; i < 3; i++) {
      		  nanosleep(&ts1s, NULL);
      		  printf("mother: SIGSTOP\n");
      		  kill(tracee, SIGSTOP);
      		  nanosleep(&ts1s, NULL);
      		  printf("mother: SIGCONT\n");
      		  kill(tracee, SIGCONT);
      	  }
      	  nanosleep(&ts1s, NULL);
      
      	  kill(tracer, SIGKILL);
      	  kill(tracee, SIGKILL);
      	  return 0;
        }
      
      This is identical to the program to test TRAP_NOTIFY except that
      tracee is PTRACE_LISTEN'd instead of PTRACE_CONT'd when group stopped.
      This allows ptracer to monitor when group stop ends without running
      tracee.
      
        # ./test-listen
        tracer: stopped=0 signo=5
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
      
      -v2: Moved JOBCTL_LISTENING check in wait_task_stopped() into
           task_stopped_code() as suggested by Oleg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
  17. 16 Jun, 2011 1 commit
    • memcg: clear mm->owner when last possible owner leaves · 733eda7a
      Committed by KAMEZAWA Hiroyuki
      The following crash was reported:
      
      > Call Trace:
      > [<ffffffff81139792>] mem_cgroup_from_task+0x15/0x17
      > [<ffffffff8113a75a>] __mem_cgroup_try_charge+0x148/0x4b4
      > [<ffffffff810493f3>] ? need_resched+0x23/0x2d
      > [<ffffffff814cbf43>] ? preempt_schedule+0x46/0x4f
      > [<ffffffff8113afe8>] mem_cgroup_charge_common+0x9a/0xce
      > [<ffffffff8113b6d1>] mem_cgroup_newpage_charge+0x5d/0x5f
      > [<ffffffff81134024>] khugepaged+0x5da/0xfaf
      > [<ffffffff81078ea0>] ? __init_waitqueue_head+0x4b/0x4b
      > [<ffffffff81133a4a>] ? add_mm_counter.constprop.5+0x13/0x13
      > [<ffffffff81078625>] kthread+0xa8/0xb0
      > [<ffffffff814d13e8>] ? sub_preempt_count+0xa1/0xb4
      > [<ffffffff814d5664>] kernel_thread_helper+0x4/0x10
      > [<ffffffff814ce858>] ? retint_restore_args+0x13/0x13
      > [<ffffffff8107857d>] ? __init_kthread_worker+0x5a/0x5a
      
      What happens is that khugepaged tries to charge a huge page against an mm
      whose last possible owner has already exited, and the memory controller
      crashes when the stale mm->owner is used to look up the cgroup to charge.
      
      mm->owner has never been set to NULL with the last owner going away, but
      nobody cared until khugepaged came along.
      
      Even then it wasn't a problem because the final mmput() on an mm was
      forced to acquire and release mmap_sem in write-mode, preventing an
      exiting owner from going away while the mmap_sem was held, and until
      "692e0b35 mm: thp: optimize memcg charge in khugepaged", the memory
      cgroup charge was protected by mmap_sem in read-mode.
      
      Instead of going back to relying on the mmap_sem to enforce lifetime of a
      task, this patch ensures that mm->owner is properly set to NULL when the
      last possible owner is exiting, which the memory controller can handle
      just fine.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Hugh Dickins <hughd@google.com>
      Reported-by: Dave Jones <davej@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 14 May, 2011 1 commit
    • job control: reorganize wait_task_stopped() · 19e27463
      Committed by Tejun Heo
      wait_consider_task() tested task_stopped_code() without acquiring
      siglock and, if stop condition existed, called wait_task_stopped() and
      directly returned the result.  This patch moves the initial
      task_stopped_code() testing into wait_task_stopped() and makes
      wait_consider_task() fall through to wait_task_continued() on 0 return.
      
      This is for the following two reasons.
      
      * Because the initial task_stopped_code() test is done without
        acquiring siglock, it may race against SIGCONT generation.  The
        stopped condition might have been replaced by continued state by the
        time wait_task_stopped() acquired siglock.  This may lead to
        unexpected failure of WNOHANG waits.
      
        This reorganization addresses this single race case but there are
        other cases - TASK_RUNNING -> TASK_STOPPED transition and EXIT_*
        transitions.
      
      * Scheduled ptrace updates require changes to the initial test which
        would fit better inside wait_task_stopped().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
  19. 25 Apr, 2011 1 commit
    • ptrace: Prepare to fix racy accesses on task breakpoints · bf26c018
      Committed by Frederic Weisbecker
      When a task is traced and is in a stopped state, the tracer
      may execute a ptrace request to examine the tracee state and
      get its task struct. Right after, the tracee can be killed
      and thus its breakpoints released.
      This can happen concurrently when the tracer is in the middle
      of reading or modifying these breakpoints, leading to dereferencing
      a freed pointer.
      
      Hence, to prepare the fix, create a generic breakpoint reference
      holding API. When a reference on the breakpoints of a task is
      held, the breakpoints won't be released until the last reference
      is dropped. After that, no more ptrace requests on the task's
      breakpoints can be serviced for the tracer.
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Prasad <prasad@linux.vnet.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: v2.6.33.. <stable@kernel.org>
      Link: http://lkml.kernel.org/r/1302284067-7860-2-git-send-email-fweisbec@gmail.com
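The reference-holding rule in this commit message (breakpoints stay alive until the last reference is dropped, and no new reference can be taken afterwards) can be sketched with C11 atomics. This is an illustration of the pattern only; `struct bp_owner`, `bp_get()` and `bp_put()` are hypothetical names and not the kernel API.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct bp_owner {
	atomic_int refs;	/* starts at 1: the task's own reference */
	bool released;		/* set when the breakpoints are torn down */
};

/* Take a reference unless the count has already dropped to zero. */
bool bp_get(struct bp_owner *o)
{
	int old = atomic_load(&o->refs);

	while (old != 0) {
		/* On failure, old is reloaded and the loop retries. */
		if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the last one out releases the breakpoints. */
void bp_put(struct bp_owner *o)
{
	if (atomic_fetch_sub(&o->refs, 1) == 1)
		o->released = true;
}
```

A tracer would bracket each breakpoint access with `bp_get()` and `bp_put()`, while the exiting task drops its initial reference once; whichever side drops the count to zero performs the release.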
  20. 31 Mar, 2011 1 commit
  21. 23 Mar, 2011 3 commits
    • job control: Allow access to job control events through ptracees · 45cb24a1
      Committed by Tejun Heo
      Currently a real parent can't access job control stopped/continued
      events through a ptraced child.  This utterly breaks job control when
      the children are ptraced.
      
      For example, if a program is run from an interactive shell and then
      strace(1) attaches to it, pressing ^Z would send SIGTSTP and strace(1)
      would notice it but the shell has no way to tell whether the child
      entered job control stop and thus can't tell when to take over the
      terminal - leading to awkward lone ^Z on the terminal.
      
      Because the job control and ptrace stopped states are independent,
      there is no reason to prevent real parents from accessing the stopped
      state regardless of ptrace.  The continued state isn't separate but
      ptracers don't have any use for them as ptracees can never resume
      without explicit command from their ptracers, so as long as ptracers
      don't consume it, it should be fine.
      
      Although this is a behavior change, because the previous behavior is
      utterly broken when viewed from real parents and the change is only
      visible to real parents, I don't think it's necessary to make this
      behavior optional.
      
      One situation to be careful about is when a task from the real
      parent's group is ptracing.  The parent group is the recipient of both
      ptrace and job control stop events and one stop can be reported as
      both job control and ptrace stops.  As this can break the current
      ptrace users, suppress job control stopped events for these cases.
      
      If a real parent ptracer wants to know about both job control and
      ptrace stops, it can create a separate process to serve the role of
      real parent.
      
      Note that this only updates wait(2) side of things.  The real parent
      can access the states via wait(2) but still is not properly notified
      (woken up and delivered signal).  Test case polls wait(2) with WNOHANG
      to work around.  Notification will be updated by future patches.
      
      Test case follows.
      
        #include <stdio.h>
        #include <unistd.h>
        #include <time.h>
        #include <errno.h>
        #include <sys/types.h>
        #include <sys/ptrace.h>
        #include <sys/wait.h>
      
        int main(void)
        {
      	  const struct timespec ts100ms = { .tv_nsec = 100000000 };
      	  pid_t tracee, tracer;
      	  siginfo_t si;
      	  int i;
      
      	  tracee = fork();
      	  if (tracee == 0) {
      		  while (1) {
      			  printf("tracee: SIGSTOP\n");
      			  raise(SIGSTOP);
      			  nanosleep(&ts100ms, NULL);
      			  printf("tracee: SIGCONT\n");
      			  raise(SIGCONT);
      			  nanosleep(&ts100ms, NULL);
      		  }
      	  }
      
      	  waitid(P_PID, tracee, &si, WSTOPPED | WNOHANG | WNOWAIT);
      
      	  tracer = fork();
      	  if (tracer == 0) {
      		  nanosleep(&ts100ms, NULL);
      		  ptrace(PTRACE_ATTACH, tracee, NULL, NULL);
      
      		  for (i = 0; i < 11; i++) {
      			  si.si_pid = 0;
      			  waitid(P_PID, tracee, &si, WSTOPPED);
      			  if (si.si_pid && si.si_code == CLD_TRAPPED)
      				  ptrace(PTRACE_CONT, tracee, NULL,
      					 (void *)(long)si.si_status);
      		  }
      		  printf("tracer: EXITING\n");
      		  return 0;
      	  }
      
      	  while (1) {
      		  si.si_pid = 0;
      		  waitid(P_PID, tracee, &si,
      			 WSTOPPED | WCONTINUED | WEXITED | WNOHANG);
      		  if (si.si_pid)
      			  printf("mommy : WAIT status=%02d code=%02d\n",
      				 si.si_status, si.si_code);
      		  nanosleep(&ts100ms, NULL);
      	  }
      	  return 0;
        }
      
      Before the patch, while ptraced, the parent can't see any job control
      events.
      
        tracee: SIGSTOP
        mommy : WAIT status=19 code=05
        tracee: SIGCONT
        tracee: SIGSTOP
        tracee: SIGCONT
        tracee: SIGSTOP
        tracee: SIGCONT
        tracee: SIGSTOP
        tracer: EXITING
        mommy : WAIT status=19 code=05
        ^C
      
      After the patch,
      
        tracee: SIGSTOP
        mommy : WAIT status=19 code=05
        tracee: SIGCONT
        mommy : WAIT status=18 code=06
        tracee: SIGSTOP
        mommy : WAIT status=19 code=05
        tracee: SIGCONT
        mommy : WAIT status=18 code=06
        tracee: SIGSTOP
        mommy : WAIT status=19 code=05
        tracee: SIGCONT
        mommy : WAIT status=18 code=06
        tracee: SIGSTOP
        tracer: EXITING
        mommy : WAIT status=19 code=05
        ^C
      
      -v2: Oleg pointed out that wait(2) should be suppressed for the real
           parent's group instead of only the real parent task itself.
           Updated accordingly.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
    • job control: Fix ptracer wait(2) hang and explain notask_error clearing · 9b84cca2
      Committed by Tejun Heo
      wait(2) and friends allow access to stopped/continued states through
      zombies, which is required as the states are process-wide and should
      be accessible whether the leader task is alive or undead.
      wait_consider_task() implements this by always clearing notask_error
      and going through wait_task_stopped/continued() for unreaped zombies.
      
      However, while ptraced, the stopped state is per-task and as such if
      the ptracee became a zombie, there's no further stopped event to
      listen to and wait(2) and friends should return -ECHILD on the tracee.
      
      Fix it by clearing notask_error only if WCONTINUED | WEXITED is set
      for ptraced zombies.  While at it, document why clearing notask_error
      is safe for each case.
      
      Test case follows.
      
        #include <stdio.h>
        #include <unistd.h>
        #include <pthread.h>
        #include <time.h>
        #include <sys/types.h>
        #include <sys/ptrace.h>
        #include <sys/wait.h>
      
        static void *nooper(void *arg)
        {
      	  pause();
      	  return NULL;
        }
      
        int main(void)
        {
      	  const struct timespec ts1s = { .tv_sec = 1 };
      	  pid_t tracee, tracer;
      	  siginfo_t si;
      
      	  tracee = fork();
      	  if (tracee == 0) {
      		  pthread_t thr;
      
      		  pthread_create(&thr, NULL, nooper, NULL);
      		  nanosleep(&ts1s, NULL);
      		  printf("tracee exiting\n");
      		  pthread_exit(NULL);	/* let subthread run */
      	  }
      
      	  tracer = fork();
      	  if (tracer == 0) {
      		  ptrace(PTRACE_ATTACH, tracee, NULL, NULL);
      		  while (1) {
      			  if (waitid(P_PID, tracee, &si, WSTOPPED) < 0) {
      				  perror("waitid");
      				  break;
      			  }
      			  ptrace(PTRACE_CONT, tracee, NULL,
      				 (void *)(long)si.si_status);
      		  }
      		  return 0;
      	  }
      
      	  waitid(P_PID, tracer, &si, WEXITED);
      	  kill(tracee, SIGKILL);
      	  return 0;
        }
      
      Before the patch, after the tracee becomes a zombie, the tracer's
      waitid(WSTOPPED) never returns and the program doesn't terminate.
      
        tracee exiting
        ^C
      
      After the patch, tracee exiting triggers waitid() to fail.
      
        tracee exiting
        waitid: No child processes
      
      -v2: Oleg pointed out that exited in addition to continued can happen
           for ptraced dead group leader.  Clear notask_error for ptraced
           child on WEXITED too.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
    • T
      job control: Small reorganization of wait_consider_task() · 823b018e
      Committed by Tejun Heo
      Move EXIT_DEAD test in wait_consider_task() above ptrace check.  As
      ptraced tasks can't be EXIT_DEAD, this change doesn't cause any
      behavior change.  This is to prepare for further changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
  22. 10 Mar, 2011 1 commit
    • J
      block: initial patch for on-stack per-task plugging · 73c10101
      Committed by Jens Axboe
      This patch adds support for creating a queuing context outside
      of the queue itself. This enables us to batch up pieces of IO
      before grabbing the block device queue lock and submitting them to
      the IO scheduler.
      
      The context is created on the stack of the process and assigned in
      the task structure, so that we can auto-unplug it if we hit a schedule
      event.
      
      The current queue plugging happens implicitly if IO is submitted to
      an empty device, yet callers have to remember to unplug that IO when
      they are going to wait for it. This is an ugly API and has caused bugs
      in the past. Additionally, it requires hacks in the vm (->sync_page()
      callback) to handle that logic. By switching to an explicit plugging
      scheme we make the API a lot nicer and can get rid of the ->sync_page()
      hack in the vm.
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
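The batching idea described above can be modelled in user space. This is a simplified sketch, not the kernel implementation: every name here (`struct plug`, `plug_start()`, `submit()`, `task_schedule()`, `plug_finish()`) is hypothetical, and the queue lock is represented only by a counter. The plug lives on the caller's stack, the task holds a pointer to it, and a schedule event flushes whatever is pending.

```c
#include <assert.h>
#include <stddef.h>

#define PLUG_MAX 16

struct request { int sector; };

/* Stand-in for the device queue; counts lock round trips. */
struct queue {
	int lock_acquisitions;
	int submitted;
};

/* Per-task batching context, meant to live on the caller's stack. */
struct plug {
	struct request *pending[PLUG_MAX];
	size_t count;
};

struct task { struct plug *plug; };

static void plug_start(struct task *t, struct plug *p)
{
	p->count = 0;
	t->plug = p;
}

/* Hand the whole batch to the queue under a single lock acquisition. */
static void plug_flush(struct task *t, struct queue *q)
{
	struct plug *p = t->plug;

	if (!p || p->count == 0)
		return;
	q->lock_acquisitions++;
	q->submitted += (int)p->count;
	p->count = 0;
}

static void submit(struct task *t, struct queue *q, struct request *rq)
{
	struct plug *p = t->plug;

	if (!p) {			/* no plug: one lock per request */
		q->lock_acquisitions++;
		q->submitted++;
		return;
	}
	if (p->count == PLUG_MAX)	/* batch full: flush first */
		plug_flush(t, q);
	assert(p->count < PLUG_MAX);
	p->pending[p->count++] = rq;
}

/* Model of a schedule event: auto-unplug so queued IO cannot stall. */
static void task_schedule(struct task *t, struct queue *q)
{
	plug_flush(t, q);
}

static void plug_finish(struct task *t, struct queue *q)
{
	plug_flush(t, q);
	t->plug = NULL;
}
```

With a plug active, three `submit()` calls followed by `task_schedule()` cost one lock acquisition in this model instead of three, and the caller never has to remember to unplug before sleeping.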
  23. 07 Jan, 2011 1 commit
  24. 17 Dec, 2010 1 commit
  25. 03 Dec, 2010 1 commit
    • N
      do_exit(): make sure that we run with get_fs() == USER_DS · 33dd94ae
      Committed by Nelson Elhage
      If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
      otherwise reset before do_exit().  do_exit may later (via mm_release in
      fork.c) do a put_user to a user-controlled address, potentially allowing
      a user to leverage an oops into a controlled write into kernel memory.
      
      This is only triggerable in the presence of another bug, but this
      potentially turns a lot of DoS bugs into privilege escalations, so it's
      worth fixing.  I have proof-of-concept code which uses this bug along
      with CVE-2010-3849 to write a zero to an arbitrary kernel address, so
      I've tested that this is not theoretical.
      
      A more logical place to put this fix might be when we know an oops has
      occurred, before we call do_exit(), but that would involve changing
      every architecture, in multiple places.
      
      Let's just stick it in do_exit instead.
      
      [akpm@linux-foundation.org: update code comment]
      Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  26. 06 Nov, 2010 1 commit
    • O
      posix-cpu-timers: workaround to suppress the problems with mt exec · e0a70217
      Committed by Oleg Nesterov
      posix-cpu-timers.c correctly assumes that the dying process does
      posix_cpu_timers_exit_group() and removes all !CPUCLOCK_PERTHREAD
      timers from signal->cpu_timers list.
      
      But, it also assumes that timer->it.cpu.task is always the group
      leader, and thus the dead ->task means the dead thread group.
      
      This is obviously not true after de_thread() changes the leader.
      After that almost every posix_cpu_timer_ method has problems.
      
      It is not simple to fix this bug correctly. First of all, I think
      that timer->it.cpu should use struct pid instead of task_struct.
      Also, the locking should be reworked completely. In particular,
      tasklist_lock should not be used at all. This all needs a lot of
      nontrivial and hard-to-test changes.
      
      Change __exit_signal() to do posix_cpu_timers_exit_group() when
      the old leader dies during exec. This is not the fix, just the
      temporary hack to hide the problem for 2.6.37 and stable. IOW,
      this is obviously wrong but this is what we currently have anyway:
      cpu timers do not work after mt exec.
      
      In theory this change adds another race. The exiting leader can
      detach the timers which were attached to the new leader. However,
      the window between de_thread() and release_task() is small, we
      can pretend that sys_timer_create() was called before de_thread().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  27. 28 Oct, 2010 1 commit
  28. 27 Oct, 2010 1 commit
    • Y
      oom: add per-mm oom disable count · 3d5992d2
      Committed by Ying Han
      It's pointless to kill a task if another thread sharing its mm cannot be
      killed to allow future memory freeing.  A subsequent patch will prevent
      kills in such cases, but first it's necessary to have a way to flag a task
      that shares memory with an OOM_DISABLE task that doesn't incur an
      additional tasklist scan, which would make select_bad_process() an O(n^2)
      function.
      
      This patch adds an atomic counter to struct mm_struct that follows how
      many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
      They cannot be killed by the kernel, so their memory cannot be freed in
      oom conditions.
      
      This only requires task_lock() on the task that we're operating on, it
      does not require mm->mmap_sem since task_lock() pins the mm and the
      operation is atomic.
      
      [rientjes@google.com: changelog and sys_unshare() code]
      [rientjes@google.com: protect oom_disable_count with task_lock in fork]
      [rientjes@google.com: use old_mm for oom_disable_count in exec]
      Signed-off-by: Ying Han <yinghan@google.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
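The per-mm counter described above can be illustrated with C11 atomics in user space. This is a sketch under assumptions, not the kernel code: the struct layouts and the helpers `task_attach()`, `task_set_adj()`, `task_detach()` and `mm_killable()` are hypothetical, and the task_lock() that the commit relies on is omitted. Only the counter name and OOM_SCORE_ADJ_MIN come from the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)

/* Simplified stand-ins for mm_struct and task_struct. */
struct mm { atomic_int oom_disable_count; };
struct task { struct mm *mm; int oom_score_adj; };

/* Attach a thread to an mm; count it if it is exempt from OOM kills. */
static void task_attach(struct task *t, struct mm *mm, int adj)
{
	assert(t && mm);
	t->mm = mm;
	t->oom_score_adj = adj;
	if (adj == OOM_SCORE_ADJ_MIN)
		atomic_fetch_add(&mm->oom_disable_count, 1);
}

/* Change a thread's adjustment, updating the counter only on a transition. */
static void task_set_adj(struct task *t, int adj)
{
	bool was = t->oom_score_adj == OOM_SCORE_ADJ_MIN;
	bool now = adj == OOM_SCORE_ADJ_MIN;

	if (t->mm && was != now) {
		if (now)
			atomic_fetch_add(&t->mm->oom_disable_count, 1);
		else
			atomic_fetch_sub(&t->mm->oom_disable_count, 1);
	}
	t->oom_score_adj = adj;
}

/* Detach a thread (exit or exec), dropping its contribution if any. */
static void task_detach(struct task *t)
{
	if (t->mm && t->oom_score_adj == OOM_SCORE_ADJ_MIN)
		atomic_fetch_sub(&t->mm->oom_disable_count, 1);
	t->mm = NULL;
}

/* O(1) check: killing is only useful if no attached thread is exempt. */
static bool mm_killable(struct mm *mm)
{
	return atomic_load(&mm->oom_disable_count) == 0;
}
```

The point of the counter is that `mm_killable()` answers the question with one atomic load, where the alternative would be scanning every task for others sharing the same mm.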
  29. 10 Sep, 2010 1 commit
  30. 18 Aug, 2010 1 commit
    • D
      Fix unprotected access to task credentials in waitid() · f362b732
      Committed by Daniel J Blueman
      Using a program like the following:
      
      	#include <stdlib.h>
      	#include <signal.h>
      	#include <unistd.h>
      	#include <sys/types.h>
      	#include <sys/wait.h>
      
      	int main() {
      		id_t id;
      		siginfo_t infop;
      		pid_t res;
      
      		id = fork();
      		if (id == 0) { sleep(1); exit(0); }
      		kill(id, SIGSTOP);
      		alarm(1);
      		waitid(P_PID, id, &infop, WCONTINUED);
      		return 0;
      	}
      
      to call waitid() on a stopped process results in access to the child task's
      credentials without the RCU read lock being held - which may be replaced in the
      meantime - eliciting the following warning:
      
      	===================================================
      	[ INFO: suspicious rcu_dereference_check() usage. ]
      	---------------------------------------------------
      	kernel/exit.c:1460 invoked rcu_dereference_check() without protection!
      
      	other info that might help us debug this:
      
      	rcu_scheduler_active = 1, debug_locks = 1
      	2 locks held by waitid02/22252:
      	 #0:  (tasklist_lock){.?.?..}, at: [<ffffffff81061ce5>] do_wait+0xc5/0x310
      	 #1:  (&(&sighand->siglock)->rlock){-.-...}, at: [<ffffffff810611da>]
      	wait_consider_task+0x19a/0xbe0
      
      	stack backtrace:
      	Pid: 22252, comm: waitid02 Not tainted 2.6.35-323cd+ #3
      	Call Trace:
      	 [<ffffffff81095da4>] lockdep_rcu_dereference+0xa4/0xc0
      	 [<ffffffff81061b31>] wait_consider_task+0xaf1/0xbe0
      	 [<ffffffff81061d15>] do_wait+0xf5/0x310
      	 [<ffffffff810620b6>] sys_waitid+0x86/0x1f0
      	 [<ffffffff8105fce0>] ? child_wait_callback+0x0/0x70
      	 [<ffffffff81003282>] system_call_fastpath+0x16/0x1b
      
      This is fixed by holding the RCU read lock in wait_task_continued() to ensure
      that the task's current credentials aren't destroyed between us reading the
      cred pointer and us reading the UID from those credentials.
      
      Furthermore, protect wait_task_stopped() in the same way.
      
      We don't need to keep holding the RCU read lock once we've read the UID from
      the credentials as holding the RCU read lock doesn't stop the target task from
      changing its creds under us - so the credentials may be outdated immediately
      after we've read the pointer, lock or no lock.
      Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  31. 11 Aug, 2010 1 commit
  32. 28 May, 2010 1 commit
    • O
      proc: turn signal_struct->count into "int nr_threads" · b3ac022c
      Committed by Oleg Nesterov
      No functional changes, just s/atomic_t count/int nr_threads/.
      
      With the recent changes this counter has a single user, get_nr_threads().
      And none of its callers needs a really accurate number of threads, not
      to mention each caller obviously races with fork/exit.  It is only used to
      report this value to the user-space, except first_tid() uses it to avoid
      the unnecessary while_each_thread() loop in the unlikely case.
      
      It is a bit sad we need a word in struct signal_struct for this, perhaps
      we can change get_nr_threads() to approximate the number of threads using
      signal->live and kill ->nr_threads later.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>