1. 25 Jun, 2015 1 commit
  2. 13 Apr, 2015 1 commit
  3. 12 Feb, 2015 2 commits
    • M
      oom, PM: make OOM detection in the freezer path raceless · c32b3cbe
      Michal Hocko authored
      Commit 5695be14 ("OOM, PM: OOM killed task shouldn't escape PM
      suspend") has left a race window when OOM killer manages to
      note_oom_kill after freeze_processes checks the counter.  The race
      window is quite small and really unlikely, and a partial solution was deemed
      sufficient at the time of submission.
      
      Tejun wasn't happy about this partial solution though and insisted on a
      full solution.  That requires the full OOM and freezer's task freezing
      exclusion, though.  This is done by this patch which introduces oom_sem
      RW lock and turns oom_killer_disable() into a full OOM barrier.
      
      oom_killer_disabled check is moved from the allocation path to the OOM
      level and we take oom_sem for reading for both the check and the whole
      OOM invocation.
      
      oom_killer_disable() takes oom_sem for writing so it waits for all
      currently running OOM killer invocations.  Then it disables all further
      OOMs by setting oom_killer_disabled and checks for any OOM victims.
      Victims are counted via mark_tsk_oom_victim and unmark_oom_victim.  The
      last victim wakes up all waiters enqueued by oom_killer_disable().
      Therefore this function acts as the full OOM barrier.
      
      The page fault path is covered now as well although it was assumed to be
      safe before.  As per Tejun, "We used to have freezing points deep in file
      system code which may be reachable from page fault." so it would be
      better and more robust to not rely on freezing points here.  Same applies
      to the memcg OOM killer.
      
      out_of_memory tells the caller whether the OOM was allowed to trigger and
      the callers are supposed to handle the situation.  The page allocation
      path simply fails the allocation same as before.  The page fault path will
      retry the fault (more on that later) and Sysrq OOM trigger will simply
      complain to the log.
      
      Normally there wouldn't be any unfrozen user tasks after
      try_to_freeze_tasks so the function will not block. But if there was an
      OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
      finish yet then we have to wait for it. This should complete in a finite
      time, though, because
      
      	- the victim cannot loop in the page fault handler (it would die
      	  on the way out from the exception)
      	- it cannot loop in the page allocator because all the further
      	  allocation would fail and __GFP_NOFAIL allocations are not
      	  acceptable at this stage
      	- it shouldn't be blocked on any locks held by frozen tasks
      	  (try_to_freeze expects lockless context) and kernel threads and
      	  work queues are not frozen yet
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c32b3cbe
    • M
      oom: add helpers for setting and clearing TIF_MEMDIE · 49550b60
      Michal Hocko authored
      This patchset addresses a race which was described in the changelog for
      5695be14 ("OOM, PM: OOM killed task shouldn't escape PM suspend"):
      
      : PM freezer relies on having all tasks frozen by the time devices are
      : getting frozen so that no task will touch them while they are getting
      : frozen.  But OOM killer is allowed to kill an already frozen task in order
      : to handle OOM situation.  In order to protect from late wake ups OOM
      : killer is disabled after all tasks are frozen.  This, however, still keeps
      : a window open when a killed task didn't manage to die by the time
      : freeze_processes finishes.
      
      The original patch hasn't closed the race window completely because that
      would require a more complex solution, as can be seen from this patchset.
      
      The primary motivation was to close the race condition between OOM killer
      and PM freezer _completely_.  As Tejun pointed out, the more unlikely the
      race condition is, the harder it would be to debug weird bugs deep in
      the PM freezer, where the debugging options are reduced considerably.  I can
      only speculate what might happen when a task is still runnable
      unexpectedly.
      
      On the plus side, and as a side effect, the oom enable/disable has a better
      (full barrier) semantic without polluting hot paths.
      
      I have tested the series in KVM with 100M RAM:
      - many small tasks (20M anon mmap) which are triggering OOM continually
      - s2ram which resumes automatically is triggered in a loop
      	echo processors > /sys/power/pm_test
      	while true
      	do
      		echo mem > /sys/power/state
      		sleep 1s
      	done
      - simple module which allocates and frees 20M in 8K chunks. If it sees
        freezing(current) then it tries another round of allocation before calling
        try_to_freeze
      - debugging messages of PM stages and OOM killer enable/disable/fail added
        and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
        it wakes up waiters.
      - rebased on top of the current mmotm which means some necessary updates
        in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
        I think this should be OK because __thaw_task shouldn't interfere with any
        locking down wake_up_process. Oleg?
      
      As expected there are no OOM killed tasks after oom is disabled and
      allocations requested by the kernel thread are failing after all the tasks
      are frozen and OOM disabled.  I wasn't able to catch a race where
      oom_killer_disable would really have to wait, but I kind of expected that
      since the race is really unlikely.
      
      [  242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
      [  243.628071] Unmarking 2992 OOM victim. oom_victims: 1
      [  243.636072] (elapsed 2.837 seconds) done.
      [  243.641985] Trying to disable OOM killer
      [  243.643032] Waiting for concurent OOM victims
      [  243.644342] OOM killer disabled
      [  243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
      [  243.652983] Suspending console(s) (use no_console_suspend to debug)
      [  243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
      [...]
      [  243.992600] PM: suspend of devices complete after 336.667 msecs
      [  243.993264] PM: late suspend of devices complete after 0.660 msecs
      [  243.994713] PM: noirq suspend of devices complete after 1.446 msecs
      [  243.994717] ACPI: Preparing to enter system sleep state S3
      [  243.994795] PM: Saving platform NVS memory
      [  243.994796] Disabling non-boot CPUs ...
      
      The first 2 patches are simple cleanups for OOM.  They should go in
      regardless of the rest IMO.
      
      Patches 3 and 4 are trivial printk -> pr_info conversions and they should
      go in as well.
      
      The main patch is the last one and I would appreciate acks from Tejun and
      Rafael.  I think the OOM part should be OK (except for __thaw_task vs.
      task_lock where a look from Oleg would be appreciated) but I am not so sure I
      haven't screwed anything in the freezer code.  I have found several
      surprises there.
      
      This patch (of 5):
      
      This patch is just preparatory and it doesn't introduce any functional
      change.
      
      Note:
      I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to
      wait for the oom victim and to prevent new kills. This is
      just a side effect of the flag. The primary meaning is to give the oom
      victim access to the memory reserves and that shouldn't be necessary
      here.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49550b60
  4. 09 Jan, 2015 1 commit
  5. 11 Dec, 2014 17 commits
    • O
      exit: exit_notify: re-use "dead" list to autoreap current · 6c66e7db
      Oleg Nesterov authored
      After the previous change we can add just the exiting EXIT_DEAD task to
      the "dead" list and remove another release_task(tsk).
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c66e7db
    • O
      exit: reparent: call forget_original_parent() under tasklist_lock · 482a3767
      Oleg Nesterov authored
      Shift "release dead children" loop from forget_original_parent() to its
      caller, exit_notify().  It is safe to reap them even if our parent reaps
      us right after we drop tasklist_lock, those children no longer have any
      connection to the exiting task.
      
      And this allows us to avoid write_lock_irq(tasklist_lock) right after it
      was released by forget_original_parent(), we can simply call it with
      tasklist_lock held.
      
      While at it, move the comment about forget_original_parent() up to
      this function.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      482a3767
    • O
      exit: reparent: avoid find_new_reaper() if no children · ad9e206a
      Oleg Nesterov authored
      Now that pid_ns logic was isolated we can change forget_original_parent()
      to return right after find_child_reaper() when father->children is empty,
      there is nothing to reparent in this case.
      
      In particular this avoids find_alive_thread(), which helps if the
      whole process exits and has a lot of PF_EXITING threads at the start of
      the thread list, a situation that can easily lead to O(nr_threads ** 2)
      iterations.
      
      Trivial test case (tested under KVM, 2 CPUs):
      
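          #include <assert.h>
          #include <pthread.h>
          #include <signal.h>
          #include <stdlib.h>
          #include <sys/wait.h>
          #include <unistd.h>
      
          static void *tfunc(void *arg)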
          {
              pause();
              return NULL;
          }
      
          static int child(unsigned int nt)
          {
              pthread_t pt;
      
              while (nt--)
                  assert(pthread_create(&pt, NULL, tfunc, NULL) == 0);
      
              pthread_kill(pt, SIGTRAP);
              pause();
              return 0;
          }
      
          int main(int argc, const char *argv[])
          {
              int stat;
              unsigned int nf = atoi(argv[1]);
              unsigned int nt = atoi(argv[2]);
      
              while (nf--) {
                  if (!fork())
                      return child(nt);
      
                  wait(&stat);
                  assert(stat == SIGTRAP);
              }
      
              return 0;
          }
      
      $ time ./test 16 16536 shows:
      
                    real        user         sys
          -    5m37.628s    0m4.437s    8m5.560s
          +    0m50.032s    0m7.130s    1m4.927s
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad9e206a
    • O
      exit: reparent: introduce find_alive_thread() · c9dc05bf
      Oleg Nesterov authored
      Add the new simple helper to factor out the for_each_thread() code in
      find_child_reaper() and find_new_reaper().  It can also simplify the
      potential PF_EXITING -> exit_state change, plus perhaps we can change this
      code to take SIGNAL_GROUP_EXIT into account.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9dc05bf
    • O
      exit: reparent: introduce find_child_reaper() · 1109909c
      Oleg Nesterov authored
      find_new_reaper() does 2 completely different things.  Not only does it find a
      reaper, it also updates pid_ns->child_reaper or kills the whole namespace
      if the caller is ->child_reaper.
      
      Now that has_child_subreaper logic doesn't depend on child_reaper check we
      can move that pid_ns code into a separate helper.  IMHO this makes the
      code cleaner, and this allows the next changes.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1109909c
    • O
      exit: reparent: document the ->has_child_subreaper checks · 175aed3f
      Oleg Nesterov authored
      Swap the "init_task" and same_thread_group() checks.  This way it is
      simpler to document these checks and we can remove the link to the previous
      discussion on lkml.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      175aed3f
    • O
      exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper() · 3750ef97
      Oleg Nesterov authored
      Change find_new_reaper() to use for_each_thread() instead of deprecated
      while_each_thread().  We do not bother to check "thread != father" in the
      1st loop, we can rely on PF_EXITING check.
      
      Note: this means a minor behavioural change: for_each_thread() starts
      from the group leader.  But this should be fine, nobody should make any
      assumption about do_wait(__WNOTHREAD) when it comes to reparented tasks.
      And this can avoid the pointless reparenting to a short-lived thread,
      while zombie leaders are not that common.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3750ef97
    • O
      exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting · 7d24e2df
      Oleg Nesterov authored
      find_new_reaper() assumes that "has_child_subreaper" logic is safe as
      long as we are not the exiting ->child_reaper and this is doubly wrong:
      
      1. In fact it is safe if "pid_ns->child_reaper == father"; there must
         be no children after zap_pid_ns_processes() returns, so it doesn't
         matter what we return in this case and even pid_ns->child_reaper is
         wrong otherwise: we can't reparent to ->child_reaper == current.
      
         This is not a bug, but this is confusing.
      
      2. It is not safe if we are not pid_ns->child_reaper but from the same
         thread group. We drop tasklist_lock before zap_pid_ns_processes(),
         so another thread can lock it and choose the new reaper from the
         upper namespace if has_child_subreaper == T, and this is obviously
         wrong.
      
         This is not that bad, zap_pid_ns_processes() won't return until the
         new reaper reaps all zombies, but this should be fixed anyway.
      
      We could change for_each_thread() loop to use ->exit_state instead of
      PF_EXITING which we had to use until 8aac6270, or we could change
      copy_signal() to check CLONE_NEWPID before setting has_child_subreaper,
      but let's change this code so that it is clear we can't look outside of
      our namespace, otherwise same_thread_group(reaper, child_reaper) check
      will look wrong and confusing anyway.
      
      We can simply start from "father" and fix the problem. We can't wrongly
      return a thread from the same thread group if ->is_child_subreaper == T,
      we know that all threads have PF_EXITING set.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d24e2df
    • O
      exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting · 8a1296ae
      Oleg Nesterov authored
      The ->has_child_subreaper code in find_new_reaper() finds alive "thread"
      but returns another "reaper" thread which can be dead.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a1296ae
    • O
      exit: release_task: fix the comment about group leader accounting · 26e75b5c
      Oleg Nesterov authored
      Contrary to what the comment in __exit_signal() says we do account the
      group leader. Fix this and explain why.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26e75b5c
    • O
      exit: wait: drop tasklist_lock before psig->c* accounting · 986094df
      Oleg Nesterov authored
      wait_task_zombie() no longer needs tasklist_lock to accumulate the
      psig->c* counters, we can drop it right after cmpxchg(exit_state).
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      986094df
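The 986094df changelog relies on cmpxchg(exit_state) as the point after which tasklist_lock can be dropped. One way to read this is that the compare-and-exchange itself decides which waiter owns the zombie, so the accounting that follows no longer needs the lock for exclusion. The sketch below illustrates that idea with C11 atomics in user space; the struct, the enum values and claim_zombie() are hypothetical names chosen for illustration and do not match the kernel's definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative state values; the kernel's actual constants differ. */
enum exit_state_model { STATE_RUNNING, STATE_ZOMBIE, STATE_TRACE, STATE_DEAD };

struct task_model {
	_Atomic int exit_state;
};

/*
 * Try to move a zombie into new_state. Only one caller can observe
 * STATE_ZOMBIE and replace it, so only one caller gets to reap the task.
 */
static bool claim_zombie(struct task_model *p, int new_state)
{
	int expected = STATE_ZOMBIE;

	return atomic_compare_exchange_strong(&p->exit_state, &expected,
					      new_state);
}
```

A second caller racing on the same task sees a state other than STATE_ZOMBIE and gets false back, which is the cue to skip the task.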
    • O
      exit: wait: don't use zombie->real_parent · f953ccd0
      Oleg Nesterov authored
      1. wait_task_zombie() uses p->real_parent to get psig/siglock. This is
         correct but needs tasklist_lock, ->real_parent can exit.
      
         We can use "current" instead. This is our natural child, its parent
         must be our sub-thread.
      
      2. Read psig/sig outside of ->siglock, ->signal is no longer protected
         by this lock.
      
      3. Fix the outdated comments about tasklist_lock. We can not race with
         __exit_signal(), the whole thread group is dead, nobody but us can
         call it.
      
         Also clarify the usage of ->stats_lock and ->siglock.
      
      Note: thread_group_cputime_adjusted() is sub-optimal in this case, we
      probably want to export cputime_adjust() to avoid thread_group_cputime().
      The comment says "all threads" but there are no other threads.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f953ccd0
    • O
      exit: wait: cleanup the ptrace_reparented() checks · f6507f83
      Oleg Nesterov authored
      Now that EXIT_DEAD is the terminal state we can kill "int traced"
      variable and check "state == EXIT_DEAD" instead to cleanup the code.  In
      particular, this way it is clear that the check obviously doesn't need
      tasklist_lock.
      
      Also fix the type of "unsigned long state", "long" was always wrong
      although this doesn't matter because cmpxchg/xchg uses typeof(*ptr).
      
      [akpm@linux-foundation.org: don't make me google the C Operator Precedence table]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6507f83
    • O
      exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent() · 7c8bd232
      Oleg Nesterov authored
      Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD tasks,
      we can simply pass "dead_children" list to exit_ptrace() and remove
      another release_task() loop.  Plus this way we do not need to drop and
      reacquire tasklist_lock.
      
      Also shift the list_empty(ptraced) check, if we want this optimization it
      makes sense to eliminate the function call altogether.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roland McGrath <roland@hack.frob.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c8bd232
    • O
      exit: reparent: cleanup the usage of reparent_leader() · 2831096e
      Oleg Nesterov authored
      1. Now that reparent_leader() doesn't abuse ->sibling we can shift
         list_move_tail() from reparent_leader() to forget_original_parent()
         and turn it into a single list_splice_tail_init(). This also makes
         BUG_ON(!list_empty()) and list_for_each_entry_safe() unnecessary.
      
      2. This also allows us to shift the same_thread_group() check, it looks
         a bit clearer in the caller.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roland McGrath <roland@hack.frob.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2831096e
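The 2831096e changelog replaces a per-child list_move_tail() loop with a single list_splice_tail_init(). The sketch below is a user-space re-implementation of a circular doubly linked list, written only to show why one splice can stand in for the loop; it borrows the kernel helper names for readability, but list_init() and list_length() are hypothetical helpers and none of this is the kernel's own list.h.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static int list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Insert n just before head h, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Move every entry of src to the tail of dst in O(1) and reset src to empty. */
static void list_splice_tail_init(struct list_head *src, struct list_head *dst)
{
	struct list_head *first, *last;

	if (list_empty(src))
		return;

	first = src->next;
	last = src->prev;

	first->prev = dst->prev;
	dst->prev->next = first;
	last->next = dst;
	dst->prev = last;

	list_init(src);
}

/* Count entries by walking the list; used only for checking the result. */
static size_t list_length(const struct list_head *h)
{
	size_t n = 0;

	for (const struct list_head *p = h->next; p != h; p = p->next)
		n++;
	return n;
}
```

Since the source list ends up empty by construction, a check that it is empty after the move and a safe iteration over it while moving both become redundant, which matches the cleanup the changelog describes.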
    • O
      exit: reparent: cleanup the changing of ->parent · 57a05918
      Oleg Nesterov authored
      1. Cosmetic, but "if (t->parent == father)" looks a bit confusing.
         We need to change t->parent if and only if t is not traced.
      
      2. If we actually want this BUG_ON() to ensure that parent/ptrace
         match each other, then we should also take the ptrace_reparented()
         case into account.
      
      3. Change this code to use for_each_thread() instead of deprecated
         while_each_thread().
      
      [dan.carpenter@oracle.com: silence a bogus static checker warning]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roland McGrath <roland@hack.frob.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57a05918
    • O
      exit: reparent: use ->ptrace_entry rather than ->sibling for EXIT_DEAD tasks · dc2fd4b0
      Oleg Nesterov authored
      reparent_leader() reuses ->sibling as a list node to add an EXIT_DEAD task
      into dead_children list we are going to release.  This obviously removes
      the dead task from its real_parent->children list and this is even good;
      the parent can do nothing with the EXIT_DEAD reparented zombie, it only
      makes do_wait() slower.
      
      But, this also means that it can not be reparented once again, so if its
      new parent dies too nobody will update ->parent/real_parent, they can
      point to the freed memory even before release_task() we are going to call,
      this breaks the code which relies on pid_alive() to access
      ->real_parent/parent.
      
      Fortunately this is mostly theoretical, this can only happen if init or
      PR_SET_CHILD_SUBREAPER process ignores SIGCHLD and the new parent
      sub-thread exits right after we drop tasklist_lock.
      
      Change this code to use ->ptrace_entry instead, we know that the child is
      not traced so nobody can ever use this member.  This also allows us to unify
      this logic with exit_ptrace(), see the next changes.
      
      Note: we really need to change release_task() to nullify real_parent/
      parent/group_leader pointers, but we need to change the current users
      first somehow.  And it would be better to reap this zombie immediately but
      release_task_locked() we need is complicated by proc_flush_task().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roland McGrath <roland@hack.frob.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc2fd4b0
  6. 06 Nov, 2014 1 commit
  7. 28 Oct, 2014 1 commit
    • P
      sched, exit: Deal with nested sleeps · 1029a2b5
      Peter Zijlstra authored
      do_wait() is a big wait loop, but we set TASK_RUNNING too late; we end
      up calling potential sleeps before we reset it.
      
      Not strictly a bug since we're guaranteed to exit the loop and not
      call schedule(); put in annotations to quiet might_sleep().
      
       WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:7123 __might_sleep+0x7e/0x90()
       do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8109a788>] do_wait+0x88/0x270
      
       Call Trace:
        [<ffffffff81694991>] dump_stack+0x4e/0x7a
        [<ffffffff8109877c>] warn_slowpath_common+0x8c/0xc0
        [<ffffffff8109886c>] warn_slowpath_fmt+0x4c/0x50
        [<ffffffff810bca6e>] __might_sleep+0x7e/0x90
        [<ffffffff811a1c15>] might_fault+0x55/0xb0
        [<ffffffff8109a3fb>] wait_consider_task+0x90b/0xc10
        [<ffffffff8109a804>] do_wait+0x104/0x270
        [<ffffffff8109b837>] SyS_wait4+0x77/0x100
        [<ffffffff8169d692>] system_call_fastpath+0x16/0x1b
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: tglx@linutronix.de
      Cc: umgwanakikbuti@gmail.com
      Cc: ilya.dryomov@inktank.com
      Cc: Alex Elder <alex.elder@linaro.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Axel Lin <axel.lin@ingics.com>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Guillaume Morin <guillaume@morinfr.org>
      Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/20140924082242.186408915@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1029a2b5
  8. 08 Sep, 2014 3 commits
    • R
      time, signal: Protect resource use statistics with seqlock · e78c3496
      Rik van Riel authored
      Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
      issues on large systems, due to both functions being serialized with a
      lock.
      
      The lock protects against reporting a wrong value, due to a thread in the
      task group exiting, its statistics reporting up to the signal struct, and
      that exited task's statistics being counted twice (or not at all).
      
      Protecting that with a lock results in times() and clock_gettime() being
      completely serialized on large systems.
      
      This can be fixed by using a seqlock around the events that gather and
      propagate statistics. As an additional benefit, the protection code can
      be moved into thread_group_cputime(), slightly simplifying the calling
      functions.
      
      In the case of posix_cpu_clock_get_task() things can be simplified a
      lot, because the calling function already ensures that the task sticks
      around, and the rest is now taken care of in thread_group_cputime().
      
      This way the statistics reporting code can run lockless.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Daeseok Youn <daeseok.youn@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guillaume Morin <guillaume@morinfr.org>
      Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: umgwanakikbuti@gmail.com
      Cc: fweisbec@gmail.com
      Cc: srao@redhat.com
      Cc: lwoodman@redhat.com
      Cc: atheurer@redhat.com
      Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e78c3496
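The sequence-counter idea behind the commit above can be illustrated in userspace. What follows is a minimal sketch using C11 atomics, not the kernel's seqlock API and not the code from the patch: `struct stats_seq`, `stats_add()` and `stats_read()` are hypothetical names, and writers are assumed to be serialized by some outer mechanism. The reader takes no lock and retries whenever an update overlapped its read.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the group-wide totals; not a kernel structure. */
struct stats_seq {
	atomic_uint seq;         /* even: stable, odd: update in progress */
	_Atomic uint64_t utime;
	_Atomic uint64_t stime;
};

struct stats_snapshot {
	uint64_t utime;
	uint64_t stime;
};

/* Writer side. Assumes writers are already serialized against each other. */
void stats_add(struct stats_seq *s, uint64_t utime, uint64_t stime)
{
	atomic_fetch_add(&s->seq, 1);          /* counter becomes odd */
	atomic_fetch_add(&s->utime, utime);
	atomic_fetch_add(&s->stime, stime);
	atomic_fetch_add(&s->seq, 1);          /* counter is even again */
}

/* Reader side: takes no lock, retries if an update overlapped the read. */
struct stats_snapshot stats_read(struct stats_seq *s)
{
	struct stats_snapshot snap;
	unsigned int begin, end;

	do {
		begin = atomic_load(&s->seq);
		snap.utime = atomic_load(&s->utime);
		snap.stime = atomic_load(&s->stime);
		end = atomic_load(&s->seq);
	} while ((begin & 1u) || begin != end);

	return snap;
}
```

The sketch uses the default sequentially consistent ordering throughout for simplicity; readers never block each other, which is the scalability property the commit message is after.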
    • R
      exit: Always reap resource stats in __exit_signal() · 90ed9cbe
      Rik van Riel committed
      Oleg pointed out that wait_task_zombie adds a task's usage statistics
      to the parent's signal struct, but the task's own signal struct should
      also propagate the statistics at exit time.
      
      This allows thread_group_cputime(reaped_zombie) to get the statistics
      after __unhash_process() has made the task invisible to for_each_thread,
      but before the thread has actually been rcu freed, making sure no
      non-monotonic results are returned inside that window.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Guillaume Morin <guillaume@morinfr.org>
      Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: umgwanakikbuti@gmail.com
      Cc: fweisbec@gmail.com
      Cc: srao@redhat.com
      Cc: lwoodman@redhat.com
      Cc: atheurer@redhat.com
      Link: http://lkml.kernel.org/r/1408133138-22048-2-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      90ed9cbe
    • P
      rcu: Make TASKS_RCU handle tasks that are almost done exiting · 3f95aa81
      Paul E. McKenney committed
      Once a task has passed exit_notify() in the do_exit() code path, it
      is no longer on the task lists, and is therefore no longer visible
      to rcu_tasks_kthread().  This means that an almost-exited task might
      be preempted while within a trampoline, and this task won't be waited
      on by rcu_tasks_kthread().  This commit fixes this bug by adding an
      srcu_struct.  An exiting task does srcu_read_lock() just before calling
      exit_notify(), and does the corresponding srcu_read_unlock() after
      doing the final preempt_disable().  This means that rcu_tasks_kthread()
      can do synchronize_srcu() to wait for all mostly-exited tasks to reach
      their final preempt_disable() region, and then use synchronize_sched()
      to wait for those tasks to finish exiting.
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      3f95aa81
  9. 09 Aug, 2014 1 commit
  10. 07 Aug, 2014 1 commit
    • D
      mm, oom: remove unnecessary exit_state check · fb794bcb
      David Rientjes committed
      The oom killer scans each process and determines whether it is eligible
      for oom kill or whether the oom killer should abort because of
      concurrent memory freeing.  It will abort when an eligible process is
      found to have TIF_MEMDIE set, meaning it has already been oom killed and
      we're waiting for it to exit.
      
      Processes with task->mm == NULL should not be considered because they
      are either kthreads or have already detached their memory and killing
      them would not lead to memory freeing.  That memory is only freed after
      exit_mm() has returned, however, and not when task->mm is first set to
      NULL.
      
      Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process
      is no longer considered for oom kill, but only until exit_mm() has
      returned.  This was fragile in the past because it relied on
      exit_notify() to be reached before no longer considering TIF_MEMDIE
      processes.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fb794bcb
  11. 07 Jun, 2014 1 commit
    • O
      signals: mv {dis,}allow_signal() from sched.h/exit.c to signal.[ch] · 0341729b
      Oleg Nesterov committed
      Move the declaration/definition of allow_signal/disallow_signal to
      signal.h/signal.c.  The new place is more logical and allows using the
      static helpers in signal.c (see the next changes).
      
      While at it, make them return void and remove the valid_signal() check.
      Nobody checks the returned value, and in-kernel users must not pass the
      wrong signal number.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0341729b
  12. 05 Jun, 2014 3 commits
  13. 08 Apr, 2014 7 commits
    • O
      wait: WSTOPPED|WCONTINUED doesn't work if a zombie leader is traced by another process · 7c733eb3
      Oleg Nesterov committed
      Even if the main thread is dead the process still can stop/continue.
      However, if the leader is ptraced wait_consider_task(ptrace => false)
      always skips wait_task_stopped/wait_task_continued, so WSTOPPED or
      WCONTINUED can never work for the natural parent in this case.
      
      Move the "A zombie ptracee is only visible to its ptracer" check into the
      "if (!delay_group_leader(p))" block.  ->notask_error is cleared by the
      "fall through" code below.
      
      This depends on the previous change, wait_task_stopped/continued must be
      avoided if !delay_group_leader() and the tracer is ->real_parent.
      Otherwise WSTOPPED|WEXITED could wrongly report "stopped" when the child
      is already dead (single-threaded or not).  If it is traced by another task
      then the "stopped" state is fine until the debugger detaches and reveals a
      zombie state.
      
      Stupid test-case:
      
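      	#include <assert.h>
      	#include <pthread.h>
      	#include <signal.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>

      	void *tfunc(void *arg)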
      	{
      		sleep(1);	// wait for zombie leader
      		raise(SIGSTOP);
      		exit(0x13);
      		return NULL;
      	}
      
      	int run_child(void)
      	{
      		pthread_t thread;
      
      		if (!fork()) {
      			int tracee = getppid();
      
      			assert(ptrace(PTRACE_ATTACH, tracee, 0,0) == 0);
      			do
      				ptrace(PTRACE_CONT, tracee, 0,0);
      			while (wait(NULL) > 0);
      
      			return 0;
      		}
      
      		sleep(1);	// wait for PTRACE_ATTACH
      		assert(pthread_create(&thread, NULL, tfunc, NULL) == 0);
      		pthread_exit(NULL);
      	}
      
      	int main(void)
      	{
      		int child, stat;
      
      		child = fork();
      		if (!child)
      			return run_child();
      
      		assert(child == waitpid(-1, &stat, WSTOPPED));
      		assert(stat == 0x137f);
      
      		kill(child, SIGCONT);
      
      		assert(child == waitpid(-1, &stat, WCONTINUED));
      		assert(stat == 0xffff);
      
      		assert(child == waitpid(-1, &stat, 0));
      		assert(stat == 0x1300);
      
      		return 0;
      	}
      
      Without this patch it hangs in waitpid(WSTOPPED), wait_task_stopped() is
      never called.
      
      Note: this doesn't fix all problems with a zombie delay_group_leader(),
      WCONTINUED | WEXITED check is not exactly right.  debugger can't assume it
      will be notified if another thread reaps the whole thread group.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c733eb3
    • O
      wait: WSTOPPED|WCONTINUED hangs if a zombie child is traced by real_parent · 377d75da
      Oleg Nesterov committed
      "A zombie is only visible to its ptracer" logic in wait_consider_task()
      is very wrong. Trivial test-case:
      
      	#include <unistd.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>
      	#include <assert.h>
      
      	int main(void)
      	{
      		int child = fork();
      
      		if (!child) {
      			assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
      			return 0x23;
      		}
      
      		assert(waitid(P_ALL, child, NULL, WEXITED | WNOWAIT) == 0);
      		assert(waitid(P_ALL, 0, NULL, WSTOPPED) == -1);
      		return 0;
      	}
      
      it hangs in waitpid(WSTOPPED) despite the fact it has a single zombie
      child.  This is because wait_consider_task(ptrace => 0) sees p->ptrace and
      clears ->notask_error assuming that the debugger should detach and notify
      us.
      
      Change wait_consider_task(ptrace => 0) to pretend that ptrace == T if the
      child is traced by us.  This really simplifies the logic and allows us to
      do more fixes, see the next changes.  This also hides the unwanted group
      stop state automatically, we can remove another ptrace_reparented() check.
      
      Unfortunately, this adds the following behavioural changes:
      
      	1. Before this patch wait(WEXITED | __WNOTHREAD) does not reap
      	   a natural child if it is traced by the caller's sub-thread.
      
      	   Hopefully nobody will ever notice this change, and I think
      	   that nobody should rely on this behaviour anyway.
      
      	2. SIGNAL_STOP_CONTINUED is no longer hidden from debugger if
      	   it is real parent.
      
      	   While this change comes as a side effect, I think it is good
      	   by itself. The group continued state can not be consumed by
      	   another process in this case, it doesn't depend on ptrace,
      	   it doesn't make sense to hide it from real parent.
      
      	   Perhaps we should add the thread_group_leader() check before
      	   wait_task_continued()? May be, but this shouldn't depend on
      	   ptrace_reparented().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      377d75da
    • O
      wait: completely ignore the EXIT_DEAD tasks · b3ab0316
      Oleg Nesterov committed
      Now that EXIT_DEAD is the terminal state it doesn't make sense to call
      eligible_child() or security_task_wait() if the task is really dead.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3ab0316
    • O
      wait: use EXIT_TRACE only if thread_group_leader(zombie) · b4360690
      Oleg Nesterov committed
      wait_task_zombie() always uses EXIT_TRACE/ptrace_unlink() if
      ptrace_reparented().  This is suboptimal and a bit confusing: we do not
      need do_notify_parent(p) if !thread_group_leader(p) and in this case we
      also do not need ptrace_unlink(), we can rely on ptrace_release_task().
      
      Change wait_task_zombie() to check thread_group_leader() along with
      ptrace_reparented() and simplify the final p->exit_state transition.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4360690
    • O
      wait: introduce EXIT_TRACE to avoid the racy EXIT_DEAD->EXIT_ZOMBIE transition · abd50b39
      Oleg Nesterov committed
      wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
      drops tasklist_lock.  If this task is not the natural child and it is
      traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
      
      The last transition is racy, this is even documented in 50b8d257
      "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
      race".  wait_consider_task() tries to detect this transition and clear
      ->notask_error but we can't rely on ptrace_reparented(), debugger can
      exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
      
      And there is another problem which was missed before: this transition
      can also race with reparent_leader() which doesn't reset ->exit_signal if
      EXIT_DEAD, assuming that this task must be reaped by someone else.  So
      the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
      /sbin/init doesn't use __WALL it becomes unreapable.  This was fixed by
      the previous commit, but that was a temporary hack.
      
      1. Add the new exit_state, EXIT_TRACE. It means that the task is the
         traced zombie, debugger is going to detach and notify its natural
         parent.
      
         This new state is actually EXIT_ZOMBIE | EXIT_DEAD. This way we
         can avoid the changes in proc/kgdb code, get_task_state() still
         reports "X (dead)" in this case.
      
         Note: with or without this change userspace can see Z -> X -> Z
         transition. Not really bad, but probably makes sense to fix.
      
      2. Change wait_task_zombie() to use EXIT_TRACE instead of EXIT_DEAD
         if we need to notify the ->real_parent.
      
      3. Revert the previous hack in reparent_leader(), now that EXIT_DEAD
         is always the final state we can safely ignore such a task.
      
      4. Change wait_consider_task() to check EXIT_TRACE separately and kill
         the racy and no longer needed ptrace_reparented() case.
      
         If ptrace == T an EXIT_TRACE thread should be simply ignored, the
         owner of this state is going to ptrace_unlink() this task. We can
         pretend that it was already removed from ->ptraced list.
      
         Otherwise we should skip this thread too but clear ->notask_error,
         we must be the natural parent and debugger is going to untrace and
         notify us. IOW, this doesn't differ from "EXIT_ZOMBIE && p->ptrace"
         even if the task was already untraced.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
      Reported-by: Michal Schmidt <mschmidt@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abd50b39
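Point 1 of the commit above describes EXIT_TRACE as the bitwise OR of EXIT_ZOMBIE and EXIT_DEAD, so that state-reporting code keeps showing "X (dead)". The sketch below illustrates that idea only. The `SKETCH_` constants use bit values assumed for illustration, and `sketch_exit_state_name()` is a hypothetical lookup that reports the highest set bit; it is not the kernel's get_task_state().

```c
#include <assert.h>
#include <string.h>

/* Bit values are assumptions made for this sketch, not quoted from the kernel. */
#define SKETCH_EXIT_ZOMBIE	16
#define SKETCH_EXIT_DEAD	32
#define SKETCH_EXIT_TRACE	(SKETCH_EXIT_ZOMBIE | SKETCH_EXIT_DEAD)

/*
 * Hypothetical reporter: names the state for the highest bit that is set.
 * Because the combined state contains the "dead" bit, it is reported as dead
 * without the reporter knowing about the combined state at all.
 */
const char *sketch_exit_state_name(int exit_state)
{
	if (exit_state & SKETCH_EXIT_DEAD)
		return "X (dead)";
	if (exit_state & SKETCH_EXIT_ZOMBIE)
		return "Z (zombie)";
	return "?";
}

/* Code that does care about the new state compares against it exactly. */
int sketch_is_exit_trace(int exit_state)
{
	return exit_state == SKETCH_EXIT_TRACE;
}
```

Encoding the new state as a combination of existing bits means unmodified reporting code needs no change, while code that cares about the distinction can compare against the combined value.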
    • O
      wait: fix reparent_leader() vs EXIT_DEAD->EXIT_ZOMBIE race · dfccbb5e
      Oleg Nesterov committed
      wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
      drops tasklist_lock.  If this task is not the natural child and it is
      traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
      
      The last transition is racy, this is even documented in 50b8d257
      "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
      race".  wait_consider_task() tries to detect this transition and clear
      ->notask_error but we can't rely on ptrace_reparented(), debugger can
      exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
      
      And there is another problem which was missed before: this transition
      can also race with reparent_leader() which doesn't reset ->exit_signal if
      EXIT_DEAD, assuming that this task must be reaped by someone else.  So
      the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
      /sbin/init doesn't use __WALL it becomes unreapable.
      
      Change reparent_leader() to update ->exit_signal even if EXIT_DEAD.
      Note: this is a simple temporary hack for -stable, it doesn't try to
      solve all problems, it will be reverted by the next changes.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
      Reported-by: Michal Schmidt <mschmidt@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dfccbb5e
    • G
      kernel/exit.c: call proc_exit_connector() after exit_state is set · ef982393
      Guillaume Morin committed
      The process events connector delivers a notification when a process
      exits.  This is really convenient for a process that spawns and wants to
      monitor its children through an epoll-able() interface.
      
      Unfortunately, there is a small window between when the event is
      delivered and the child becomes wait()-able.
      
      This creates a race if the parent wants to make sure that it knows
      about the exit, e.g.:
      
      pid_t pid = fork();
      if (pid > 0) {
      	register_interest_for_pid(pid);
      	if (waitpid(pid, NULL, WNOHANG) > 0)
      	{
      	  /* We might have raced with exit() */
      	}
      	return;
      }
      
      /* Child */
      execve(...)
      
      register_interest_for_pid() would be telling the connector socket
      reader to pay attention to events related to pid.
      
      Though this is not a bug, I think it would make the connector a bit more
      usable if this race was closed by simply moving the call to
      proc_exit_connector() from just before exit_notify() to right after.
      
      Oleg said:
      
      : Even with this patch the code above is still "racy" if the child is
      : multi-threaded.  Plus it should obviously filter-out subthreads.  And
      : afaics there is no way to make it reliable, even if you change the code
      : above so that waitpid() is called only after the last thread exits, WNOHANG
      : still can fail.
      Signed-off-by: Guillaume Morin <guillaume@morinfr.org>
      Cc: Matt Helsley <matt.helsley@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef982393