1. 08 6月, 2012 1 次提交
    • K
      mm: correctly synchronize rss-counters at exit/exec · 40af1bbd
      Konstantin Khlebnikov 提交于
      mm->rss_stat counters have per-task delta: task->rss_stat.  Before
      changing task->mm pointer the kernel must flush this delta with
      sync_mm_rss().
      
      do_exit() already calls sync_mm_rss() to flush the rss-counters before
      committing the rss statistics into task->signal->maxrss, taskstats,
      audit and other stuff.  Unfortunately the kernel does this before
      calling mm_release(), which can call put_user() for processing
      task->clear_child_tid.  So at this point we can trigger page-faults and
      task->rss_stat becomes non-zero again.  As a result mm->rss_stat becomes
      inconsistent and check_mm() will print something like this:
      
      | BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
      | BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1
      
      This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
      out of do_exit() and calls it earlier.  After mm_release() there should
      be no pagefaults.
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org>		[3.4.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40af1bbd
  2. 01 6月, 2012 2 次提交
  3. 24 5月, 2012 2 次提交
    • O
      genirq: reimplement exit_irq_thread() hook via task_work_add() · 4d1d61a6
      Oleg Nesterov 提交于
      exit_irq_thread() and task->irq_thread are needed to handle the unexpected
      (and unlikely) exit of irq-thread.
      
      We can use task_work instead and make this all private to
      kernel/irq/manage.c, cleanup plus micro-optimization.
      
      1. rename exit_irq_thread() to irq_thread_dtor(), make it
         static, and move it up before irq_thread().
      
      2. change irq_thread() to do task_work_add(irq_thread_dtor)
         at the start and task_work_cancel() before return.
      
         tracehook_notify_resume() can never play with kthreads,
         only do_exit()->exit_task_work() can call the callback
         and this is what we want.
      
      3. remove task_struct->irq_thread and the special hook
         in do_exit().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Smith <dsmith@redhat.com>
      Cc: "Frank Ch. Eigler" <fche@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4d1d61a6
    • O
      task_work_add: generic process-context callbacks · e73f8959
      Oleg Nesterov 提交于
      Provide a simple mechanism that allows running code in the (nonatomic)
      context of the arbitrary task.
      
      The caller does task_work_add(task, task_work) and this task executes
      task_work->func() either from do_notify_resume() or from do_exit().  The
      callback can rely on PF_EXITING to detect the latter case.
      
      "struct task_work" can be embedded in another struct, still it has "void
      *data" to handle the most common/simple case.
      
      This allows us to kill the ->replacement_session_keyring hack, and
      potentially this can have more users.
      
      Performance-wise, this adds 2 "unlikely(!hlist_empty())" checks into
      tracehook_notify_resume() and do_exit().  But at the same time we can
      remove the "replacement_session_keyring != NULL" checks from
      arch/*/signal.c and exit_creds().
      
      Note: task_work_add/task_work_run abuses ->pi_lock.  This is only because
      this lock is already used by lookup_pi_state() to synchronize with
      do_exit() setting PF_EXITING.  Fortunately the scope of this lock in
      task_work.c is really tiny, and the code is unlikely anyway.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Smith <dsmith@redhat.com>
      Cc: "Frank Ch. Eigler" <fche@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e73f8959
  4. 18 5月, 2012 1 次提交
  5. 03 5月, 2012 1 次提交
  6. 24 3月, 2012 2 次提交
    • D
      kernel/exit.c: if init dies, log a signal which killed it, if any · 397a21f2
      Denys Vlasenko 提交于
      I just received another user's pleas for help when their init
      mysteriously died.  I again explained that they need to check whether it
      died because of bad instruction, a segv, or something else.  Which was
      an annoying detour into writing a trivial C program to spawn his init
      and print its exit code:
      
        http://lists.busybox.net/pipermail/busybox/2012-January/077172.html
      
      I hear you saying "just test it under /bin/sh".  Well, the crashing init
      _was_ /bin/sh.
      
      Which prompted me to make kernel do this first step automatically.  We can
      print exit code, which makes it possible to see that death was from e.g.
      SIGILL without writing test programs.
      
      [akpm@linux-foundation.org: add 0x to hex number output]
      Signed-off-by: NDenys Vlasenko <vda.linux@googlemail.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      397a21f2
    • L
      prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision · ebec18a6
      Lennart Poettering 提交于
      Userspace service managers/supervisors need to track their started
      services.  Many services daemonize by double-forking and get implicitly
      re-parented to PID 1.  The service manager will no longer be able to
      receive the SIGCHLD signals for them, and is no longer in charge of
      reaping the children with wait().  All information about the children is
      lost at the moment PID 1 cleans up the re-parented processes.
      
      With this prctl, a service manager process can mark itself as a sort of
      'sub-init', able to stay as the parent for all orphaned processes
      created by the started services.  All SIGCHLD signals will be delivered
      to the service manager.
      
      Receiving SIGCHLD and doing wait() is in cases of a service-manager much
      preferred over any possible asynchronous notification about specific
      PIDs, because the service manager has full access to the child process
      data in /proc and the PID can not be re-used until the wait(), the
      service-manager itself is in charge of, has happened.
      
      As a side effect, the relevant parent PID information does not get lost
      by a double-fork, which results in a more elaborate process tree and
      'ps' output:
      
      before:
        # ps afx
        253 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
        294 ?        Sl     0:00 /usr/libexec/polkit-1/polkitd
        328 ?        S      0:00 /usr/sbin/modem-manager
        608 ?        Sl     0:00 /usr/libexec/colord
        658 ?        Sl     0:00 /usr/libexec/upowerd
        819 ?        Sl     0:00 /usr/libexec/imsettings-daemon
        916 ?        Sl     0:00 /usr/libexec/udisks-daemon
        917 ?        S      0:00  \_ udisks-daemon: not polling any devices
      
      after:
        # ps afx
        294 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
        426 ?        Sl     0:00  \_ /usr/libexec/polkit-1/polkitd
        449 ?        S      0:00  \_ /usr/sbin/modem-manager
        635 ?        Sl     0:00  \_ /usr/libexec/colord
        705 ?        Sl     0:00  \_ /usr/libexec/upowerd
        959 ?        Sl     0:00  \_ /usr/libexec/udisks-daemon
        960 ?        S      0:00  |   \_ udisks-daemon: not polling any devices
        977 ?        Sl     0:00  \_ /usr/libexec/packagekitd
      
      This prctl is orthogonal to PID namespaces.  PID namespaces are isolated
      from each other, while a service management process usually requires the
      services to live in the same namespace, to be able to talk to each
      other.
      
      Users of this will be the systemd per-user instance, which provides
      init-like functionality for the user's login session and D-Bus, which
      activates bus services on-demand.  Both need init-like capabilities to
      be able to properly keep track of the services they start.
      
      Many thanks to Oleg for several rounds of review and insights.
      
      [akpm@linux-foundation.org: fix comment layout and spelling]
      [akpm@linux-foundation.org: add lengthy code comment from Oleg]
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLennart Poettering <lennart@poettering.net>
      Signed-off-by: NKay Sievers <kay.sievers@vrfy.org>
      Acked-by: NValdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebec18a6
  7. 22 3月, 2012 1 次提交
  8. 21 3月, 2012 2 次提交
    • O
      exit_signal: fix the "parent has changed security domain" logic · b6e238dc
      Oleg Nesterov 提交于
      exit_notify() changes ->exit_signal if the parent already did exec.
      This doesn't really work, we are not going to send the signal now
      if there is another live thread or the exiting task is traced. The
      parent can exec before the last dies or the tracer detaches.
      
      Move this check into do_notify_parent() which actually sends the
      signal.
      
      The user-visible change is that we do not change ->exit_signal,
      and thus the exiting task is still "clone children" for
      do_wait()->eligible_child(__WCLONE). Hopefully this is fine, the
      current logic is racy anyway.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6e238dc
    • O
      exit_signal: simplify the "we have changed execution domain" logic · e6368253
      Oleg Nesterov 提交于
      exit_notify() checks "tsk->self_exec_id != tsk->parent_exec_id"
      to handle the "we have changed execution domain" case.
      
      We can change do_thread() to always set ->exit_signal = SIGCHLD
      and remove this check to simplify the code.
      
      We could change setup_new_exec() instead, this looks more logical
      because it increments ->self_exec_id. But note that de_thread()
      already resets ->exit_signal if it changes the leader, let's keep
      both changes close to each other.
      
      Note that we change ->exit_signal lockless, this changes the rules.
      Thereafter ->exit_signal is not stable under tasklist but this is
      fine, the only possible change is OLDSIG -> SIGCHLD. This can race
      with eligible_child() but the race is harmless. We can race with
      reparent_leader() which changes our ->exit_signal in parallel, but
      it does the same change to SIGCHLD.
      
      The noticeable user-visible change is that the execing task is not
      "visible" to do_wait()->eligible_child(__WCLONE) right after exec.
      To me this looks more logical, and this is consistent with mt case.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6368253
  9. 10 3月, 2012 1 次提交
  10. 05 3月, 2012 1 次提交
  11. 20 2月, 2012 1 次提交
    • D
      Replace the fd_sets in struct fdtable with an array of unsigned longs · 1fd36adc
      David Howells 提交于
      Replace the fd_sets in struct fdtable with an array of unsigned longs and then
      use the standard non-atomic bit operations rather than the FD_* macros.
      
      This:
      
       (1) Removes the abuses of struct fd_set:
      
           (a) Since we don't want to allocate a full fd_set the vast majority of the
           	 time, we actually, in effect, just allocate a just-big-enough array of
           	 unsigned longs and cast it to an fd_set type - so why bother with the
           	 fd_set at all?
      
           (b) Some places outside of the core fdtable handling code (such as
           	 SELinux) want to look inside the array of unsigned longs hidden inside
           	 the fd_set struct for more efficient iteration over the entire set.
      
       (2) Eliminates the use of FD_*() macros in the kernel completely.
      
       (3) Permits the __FD_*() macros to be deleted entirely where not exposed to
           userspace.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Link: http://lkml.kernel.org/r/20120216174954.23314.48147.stgit@warthog.procyon.org.ukSigned-off-by: NH. Peter Anvin <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      1fd36adc
  12. 14 2月, 2012 1 次提交
  13. 27 1月, 2012 1 次提交
    • Y
      sched: Fix ancient race in do_exit() · b5740f4b
      Yasunori Goto 提交于
      try_to_wake_up() has a problem which may change status from TASK_DEAD to
      TASK_RUNNING in race condition with SMI or guest environment of virtual
      machine. As a result, exited task is scheduled() again and panic occurs.
      
      Here is the sequence how it occurs:
      
       ----------------------------------+-----------------------------
                                         |
                  CPU A                  |             CPU B
       ----------------------------------+-----------------------------
      
      TASK A calls exit()....
      
      do_exit()
      
        exit_mm()
          down_read(mm->mmap_sem);
      
          rwsem_down_failed_common()
      
            set TASK_UNINTERRUPTIBLE
            set waiter.task <= task A
            list_add to sem->wait_list
                 :
            raw_spin_unlock_irq()
            (I/O interruption occured)
      
                                            __rwsem_do_wake(mmap_sem)
      
                                              list_del(&waiter->list);
                                              waiter->task = NULL
                                              wake_up_process(task A)
                                                try_to_wake_up()
                                                   (task is still
                                                     TASK_UNINTERRUPTIBLE)
                                                    p->on_rq is still 1.)
      
                                                    ttwu_do_wakeup()
                                                       (*A)
                                                         :
           (I/O interruption handler finished)
      
            if (!waiter.task)
                schedule() is not called
                due to waiter.task is NULL.
      
            tsk->state = TASK_RUNNING
      
                :
                                                    check_preempt_curr();
                                                        :
        task->state = TASK_DEAD
                                                    (*B)
                                              <---    set TASK_RUNNING (*C)
      
           schedule()
           (exit task is running again)
           BUG_ON() is called!
       --------------------------------------------------------
      
      The execution time between (*A) and (*B) is usually very short,
      because the interruption is disabled, and setting TASK_RUNNING at (*C)
      must be executed before setting TASK_DEAD.
      
      HOWEVER, if SMI is interrupted between (*A) and (*B),
      (*C) is able to execute AFTER setting TASK_DEAD!
      Then, exited task is scheduled again, and BUG_ON() is called....
      
      If the system works on guest system of virtual machine, the time
      between (*A) and (*B) may be also long due to scheduling of hypervisor,
      and same phenomenon can occur.
      
      By this patch, do_exit() waits for releasing task->pi_lock which is used
      in try_to_wake_up(). It guarantees the task becomes TASK_DEAD after
      waking up.
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20120117174031.3118.E1E9C6FF@jp.fujitsu.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
      b5740f4b
  14. 18 1月, 2012 1 次提交
  15. 13 1月, 2012 1 次提交
  16. 05 1月, 2012 1 次提交
    • O
      ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE race · 50b8d257
      Oleg Nesterov 提交于
      Test-case:
      
      	int main(void)
      	{
      		int pid, status;
      
      		pid = fork();
      		if (!pid) {
      			for (;;) {
      				if (!fork())
      					return 0;
      				if (waitpid(-1, &status, 0) < 0) {
      					printf("ERR!! wait: %m\n");
      					return 0;
      				}
      			}
      		}
      
      		assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
      		assert(waitpid(-1, NULL, 0) == pid);
      
      		assert(ptrace(PTRACE_SETOPTIONS, pid, 0,
      					PTRACE_O_TRACEFORK) == 0);
      
      		do {
      			ptrace(PTRACE_CONT, pid, 0, 0);
      			pid = waitpid(-1, NULL, 0);
      		} while (pid > 0);
      
      		return 1;
      	}
      
      It fails because ->real_parent sees its child in EXIT_DEAD state
      while the tracer is going to change the state back to EXIT_ZOMBIE
      in wait_task_zombie().
      
      The offending commit is 823b018e which moved the EXIT_DEAD check,
      but in fact we should not blame it. The original code was not
      correct as well because it didn't take ptrace_reparented() into
      account and because we can't really trust ->ptrace.
      
      This patch adds the additional check to close this particular
      race but it doesn't solve the whole problem. We simply can't
      rely on ->ptrace in this case, it can be cleared if the tracer
      is multithreaded by the exiting ->parent.
      
      I think we should kill EXIT_DEAD altogether, we should always
      remove the soon-to-be-reaped child from ->children or at least
      we should never do the DEAD->ZOMBIE transition. But this is too
      complex for 3.2.
      Reported-and-tested-by: NDenys Vlasenko <vda.linux@googlemail.com>
      Tested-by: NLukasz Michalik <lmi@ift.uni.wroc.pl>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: <stable@kernel.org>		[3.0+]
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      50b8d257
  17. 18 12月, 2011 1 次提交
  18. 15 12月, 2011 1 次提交
  19. 22 11月, 2011 1 次提交
  20. 01 11月, 2011 1 次提交
    • D
      oom: remove oom_disable_count · c9f01245
      David Rientjes 提交于
      This removes mm->oom_disable_count entirely since it's unnecessary and
      currently buggy.  The counter was intended to be per-process but it's
      currently decremented in the exit path for each thread that exits, causing
      it to underflow.
      
      The count was originally intended to prevent oom killing threads that
      share memory with threads that cannot be killed since it doesn't lead to
      future memory freeing.  The counter could be fixed to represent all
      threads sharing the same mm, but it's better to remove the count since:
      
       - it is possible that the OOM_DISABLE thread sharing memory with the
         victim is waiting on that thread to exit and will actually cause
         future memory freeing, and
      
       - there is no guarantee that a thread is disabled from oom killing just
         because another thread sharing its mm is oom disabled.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9f01245
  21. 27 7月, 2011 1 次提交
    • V
      ipc: introduce shm_rmid_forced sysctl · b34a6b1d
      Vasiliy Kulikov 提交于
      Add support for the shm_rmid_forced sysctl.  If set to 1, all shared
      memory objects in current ipc namespace will be automatically forced to
      use IPC_RMID.
      
      The POSIX way of handling shmem allows one to create shm objects and
      call shmdt(), leaving shm object associated with no process, thus
      consuming memory not counted via rlimits.
      
      With shm_rmid_forced=1 the shared memory object is counted at least for
      one process, so OOM killer may effectively kill the fat process holding
      the shared memory.
      
      It obviously breaks POSIX - some programs relying on the feature would
      stop working.  So set shm_rmid_forced=1 only if you're sure nobody uses
      "orphaned" memory.  Use shm_rmid_forced=0 by default for compatability
      reasons.
      
      The feature was previously impemented in -ow as a configure option.
      
      [akpm@linux-foundation.org: fix documentation, per Randy]
      [akpm@linux-foundation.org: fix warning]
      [akpm@linux-foundation.org: readability/conventionality tweaks]
      [akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout]
      Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com>
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Solar Designer <solar@openwall.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b34a6b1d
  22. 18 7月, 2011 1 次提交
    • O
      has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/ · 961c4675
      Oleg Nesterov 提交于
      has_stopped_jobs() naively checks task_is_stopped(group_leader). This
      was always wrong even without ptrace, group_leader can be dead. And
      given that ptrace can change the state to TRACED this is wrong even
      in the single-threaded case.
      
      Change the code to check SIGNAL_STOP_STOPPED and simplify the code,
      retval + break/continue doesn't make this trivial code more readable.
      
      We could probably add the usual "|| signal->group_stop_count" check
      but I don't think this makes sense, the task can start the group-stop
      right after the check anyway.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      961c4675
  23. 12 7月, 2011 1 次提交
    • J
      fixlet: Remove fs_excl from struct task. · 4aede84b
      Justin TerAvest 提交于
      fs_excl is a poor man's priority inheritance for filesystems to hint to
      the block layer that an operation is important. It was never clearly
      specified, not widely adopted, and will not prevent starvation in many
      cases (like across cgroups).
      
      fs_excl was introduced with the time sliced CFQ IO scheduler, to
      indicate when a process held FS exclusive resources and thus needed
      a boost.
      
      It doesn't cover all file systems, and it was never fully complete.
      Lets kill it.
      Signed-off-by: NJustin TerAvest <teravest@google.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      4aede84b
  24. 09 7月, 2011 1 次提交
  25. 28 6月, 2011 6 次提交
    • O
      ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/ · 479bf98c
      Oleg Nesterov 提交于
      wait_consider_task() checks same_thread_group(parent, real_parent),
      this is the open-coded ptrace_reparented().
      
      __ptrace_detach() remains the only function which has to check this by
      hand, although we could reorganize the code to delay __ptrace_unlink.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      479bf98c
    • O
      kill task_detached() · e550f14d
      Oleg Nesterov 提交于
      Upadate the last user of task_detached(), wait_task_zombie(), to
      use thread_group_leader() and kill task_detached().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      e550f14d
    • O
      reparent_leader: check EXIT_DEAD instead of task_detached() · 0976a03e
      Oleg Nesterov 提交于
      Change reparent_leader() to check ->exit_state instead of ->exit_signal,
      this matches the similar EXIT_DEAD check in wait_consider_task() and
      allows us to cleanup the do_notify_parent/task_detached logic.
      
      task_detached() was really needed during reparenting before 9cd80bbb
      "do_wait() optimization: do not place sub-threads on ->children list"
      to filter out the sub-threads. After this change task_detached(p) can
      only be true if p is the dead group_leader and its parent ignores
      SIGCHLD, in this case the caller of do_notify_parent() is going to
      reap this task and it should set EXIT_DEAD.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      0976a03e
    • O
      make do_notify_parent() __must_check, update the callers · 86773473
      Oleg Nesterov 提交于
      Change other callers of do_notify_parent() to check the value it
      returns, this makes the subsequent task_detached() unnecessary.
      Mark do_notify_parent() as __must_check.
      
      Use thread_group_leader() instead of !task_detached() to check
      if we need to notify the real parent in wait_task_zombie().
      
      Remove the stale comment in release_task(). "just for sanity" is
      no longer true, we have to set EXIT_DEAD to avoid the races with
      do_wait().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      86773473
    • O
      kill tracehook_notify_death() · 45cdf5cc
      Oleg Nesterov 提交于
      Kill tracehook_notify_death(), reimplement the logic in its caller,
      exit_notify().
      
      Also, change the exec_id's check to use thread_group_leader() instead
      of task_detached(), this is more clear. This logic only applies to
      the exiting leader, a sub-thread must never change its exit_signal.
      
      Note: when the traced group leader exits the exit_signal-or-SIGCHLD
      logic looks really strange:
      
      	- we notify the tracer even if !thread_group_empty() but
      	   do_wait(WEXITED) can't work until all threads exit
      
      	- if the tracer is real_parent, it is not clear why can't
      	  we use ->exit_signal event if !thread_group_empty()
      
      -v2: do not try to fix the 2nd oddity to avoid the subtle behavior
           change mixed with reorganization, suggested by Tejun.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      45cdf5cc
    • O
      make do_notify_parent() return bool · 53c8f9f1
      Oleg Nesterov 提交于
      - change do_notify_parent() to return a boolean, true if the task should
        be reaped because its parent ignores SIGCHLD.
      
      - update the only caller which checks the returned value, exit_notify().
      
      This temporary uglifies exit_notify() even more, will be cleanuped by
      the next change.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      53c8f9f1
  26. 23 6月, 2011 2 次提交
    • T
      ptrace: kill trivial tracehooks · a288eecc
      Tejun Heo 提交于
      At this point, tracehooks aren't useful to mainline kernel and mostly
      just add an extra layer of obfuscation.  Although they have comments,
      without actual in-kernel users, it is difficult to tell what are their
      assumptions and they're actually trying to achieve.  To mainline
      kernel, they just aren't worth keeping around.
      
      This patch kills the following trivial tracehooks.
      
      * Ones testing whether task is ptraced.  Replace with ->ptrace test.
      
      	tracehook_expect_breakpoints()
      	tracehook_consider_ignored_signal()
      	tracehook_consider_fatal_signal()
      
      * ptrace_event() wrappers.  Call directly.
      
      	tracehook_report_exec()
      	tracehook_report_exit()
      	tracehook_report_vfork_done()
      
      * ptrace_release_task() wrapper.  Call directly.
      
      	tracehook_finish_release_task()
      
      * noop
      
      	tracehook_prepare_release_task()
      	tracehook_report_death()
      
      This doesn't introduce any behavior change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      a288eecc
    • T
      ptrace: kill task_ptrace() · d21142ec
      Tejun Heo 提交于
      task_ptrace(task) simply dereferences task->ptrace and isn't even used
      consistently only adding confusion.  Kill it and directly access
      ->ptrace instead.
      
      This doesn't introduce any behavior change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      d21142ec
  27. 17 6月, 2011 1 次提交
    • T
      ptrace: implement PTRACE_LISTEN · 544b2c91
      Tejun Heo 提交于
      The previous patch implemented async notification for ptrace but it
      only worked while trace is running.  This patch introduces
      PTRACE_LISTEN which is suggested by Oleg Nestrov.
      
      It's allowed iff tracee is in STOP trap and puts tracee into
      quasi-running state - tracee never really runs but wait(2) and
      ptrace(2) consider it to be running.  While ptracer is listening,
      tracee is allowed to re-enter STOP to notify an async event.
      Listening state is cleared on the first notification.  Ptracer can
      also clear it by issuing INTERRUPT - tracee will re-trap into STOP
      with listening state cleared.
      
      This allows ptracer to monitor group stop state without running tracee
      - use INTERRUPT to put tracee into STOP trap, issue LISTEN and then
      wait(2) to wait for the next group stop event.  When it happens,
      PTRACE_GETSIGINFO provides information to determine the current state.
      
      Test program follows.
      
        #define PTRACE_SEIZE		0x4206
        #define PTRACE_INTERRUPT	0x4207
        #define PTRACE_LISTEN		0x4208
      
        #define PTRACE_SEIZE_DEVEL	0x80000000
      
        static const struct timespec ts1s = { .tv_sec = 1 };
      
        int main(int argc, char **argv)
        {
      	  pid_t tracee, tracer;
      	  int i;
      
      	  tracee = fork();
      	  if (!tracee)
      		  while (1)
      			  pause();
      
      	  tracer = fork();
      	  if (!tracer) {
      		  siginfo_t si;
      
      		  ptrace(PTRACE_SEIZE, tracee, NULL,
      			 (void *)(unsigned long)PTRACE_SEIZE_DEVEL);
      		  ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL);
      	  repeat:
      		  waitid(P_PID, tracee, NULL, WSTOPPED);
      
      		  ptrace(PTRACE_GETSIGINFO, tracee, NULL, &si);
      		  if (!si.si_code) {
      			  printf("tracer: SIG %d\n", si.si_signo);
      			  ptrace(PTRACE_CONT, tracee, NULL,
      				 (void *)(unsigned long)si.si_signo);
      			  goto repeat;
      		  }
      		  printf("tracer: stopped=%d signo=%d\n",
      			 si.si_signo != SIGTRAP, si.si_signo);
      		  if (si.si_signo != SIGTRAP)
      			  ptrace(PTRACE_LISTEN, tracee, NULL, NULL);
      		  else
      			  ptrace(PTRACE_CONT, tracee, NULL, NULL);
      		  goto repeat;
      	  }
      
      	  for (i = 0; i < 3; i++) {
      		  nanosleep(&ts1s, NULL);
      		  printf("mother: SIGSTOP\n");
      		  kill(tracee, SIGSTOP);
      		  nanosleep(&ts1s, NULL);
      		  printf("mother: SIGCONT\n");
      		  kill(tracee, SIGCONT);
      	  }
      	  nanosleep(&ts1s, NULL);
      
      	  kill(tracer, SIGKILL);
      	  kill(tracee, SIGKILL);
      	  return 0;
        }
      
      This is identical to the program to test TRAP_NOTIFY except that
      tracee is PTRACE_LISTEN'd instead of PTRACE_CONT'd when group stopped.
      This allows ptracer to monitor when group stop ends without running
      tracee.
      
        # ./test-listen
        tracer: stopped=0 signo=5
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
      
      -v2: Moved JOBCTL_LISTENING check in wait_task_stopped() into
           task_stopped_code() as suggested by Oleg.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      544b2c91
  28. 16 6月, 2011 1 次提交
    • K
      memcg: clear mm->owner when last possible owner leaves · 733eda7a
      KAMEZAWA Hiroyuki 提交于
      The following crash was reported:
      
      > Call Trace:
      > [<ffffffff81139792>] mem_cgroup_from_task+0x15/0x17
      > [<ffffffff8113a75a>] __mem_cgroup_try_charge+0x148/0x4b4
      > [<ffffffff810493f3>] ? need_resched+0x23/0x2d
      > [<ffffffff814cbf43>] ? preempt_schedule+0x46/0x4f
      > [<ffffffff8113afe8>] mem_cgroup_charge_common+0x9a/0xce
      > [<ffffffff8113b6d1>] mem_cgroup_newpage_charge+0x5d/0x5f
      > [<ffffffff81134024>] khugepaged+0x5da/0xfaf
      > [<ffffffff81078ea0>] ? __init_waitqueue_head+0x4b/0x4b
      > [<ffffffff81133a4a>] ? add_mm_counter.constprop.5+0x13/0x13
      > [<ffffffff81078625>] kthread+0xa8/0xb0
      > [<ffffffff814d13e8>] ? sub_preempt_count+0xa1/0xb4
      > [<ffffffff814d5664>] kernel_thread_helper+0x4/0x10
      > [<ffffffff814ce858>] ? retint_restore_args+0x13/0x13
      > [<ffffffff8107857d>] ? __init_kthread_worker+0x5a/0x5a
      
      What happens is that khugepaged tries to charge a huge page against an mm
      whose last possible owner has already exited, and the memory controller
      crashes when the stale mm->owner is used to look up the cgroup to charge.
      
      mm->owner has never been set to NULL with the last owner going away, but
      nobody cared until khugepaged came along.
      
      Even then it wasn't a problem because the final mmput() on an mm was
      forced to acquire and release mmap_sem in write-mode, preventing an
      exiting owner to go away while the mmap_sem was held, and until "692e0b35
      mm: thp: optimize memcg charge in khugepaged", the memory cgroup charge
      was protected by mmap_sem in read-mode.
      
      Instead of going back to relying on the mmap_sem to enforce lifetime of a
      task, this patch ensures that mm->owner is properly set to NULL when the
      last possible owner is exiting, which the memory controller can handle
      just fine.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NHugh Dickins <hughd@google.com>
      Reported-by: NDave Jones <davej@redhat.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      733eda7a
  29. 14 5月, 2011 1 次提交
    • T
      job control: reorganize wait_task_stopped() · 19e27463
      Tejun Heo 提交于
      wait_task_stopped() tested task_stopped_code() without acquiring
      siglock and, if stop condition existed, called wait_task_stopped() and
      directly returned the result.  This patch moves the initial
      task_stopped_code() testing into wait_task_stopped() and make
      wait_consider_task() fall through to wait_task_continue() on 0 return.
      
      This is for the following two reasons.
      
      * Because the initial task_stopped_code() test is done without
        acquiring siglock, it may race against SIGCONT generation.  The
        stopped condition might have been replaced by continued state by the
        time wait_task_stopped() acquired siglock.  This may lead to
        unexpected failure of WNOHANG waits.
      
        This reorganization addresses this single race case but there are
        other cases - TASK_RUNNING -> TASK_STOPPED transition and EXIT_*
        transitions.
      
      * Scheduled ptrace updates require changes to the initial test which
        would fit better inside wait_task_stopped().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      19e27463
  30. 25 4月, 2011 1 次提交
    • F
      ptrace: Prepare to fix racy accesses on task breakpoints · bf26c018
      Frederic Weisbecker 提交于
      When a task is traced and is in a stopped state, the tracer
      may execute a ptrace request to examine the tracee state and
      get its task struct. Right after, the tracee can be killed
      and thus its breakpoints released.
      This can happen concurrently when the tracer is in the middle
      of reading or modifying these breakpoints, leading to dereferencing
      a freed pointer.
      
      Hence, to prepare the fix, create a generic breakpoint reference
      holding API. When a reference on the breakpoints of a task is
      held, the breakpoints won't be released until the last reference
      is dropped. After that, no more ptrace request on the task's
      breakpoints can be serviced for the tracer.
      Reported-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Prasad <prasad@linux.vnet.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: v2.6.33.. <stable@kernel.org>
      Link: http://lkml.kernel.org/r/1302284067-7860-2-git-send-email-fweisbec@gmail.com
      bf26c018