1. 08 Apr, 2014 5 commits
    • wait: introduce EXIT_TRACE to avoid the racy EXIT_DEAD->EXIT_ZOMBIE transition · abd50b39
      Committed by Oleg Nesterov
      wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
      drops tasklist_lock.  If this task is not the natural child and it is
      traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
      
      The last transition is racy, this is even documented in 50b8d257
      "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
      race".  wait_consider_task() tries to detect this transition and clear
      ->notask_error but we can't rely on ptrace_reparented(), debugger can
      exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
      
      And there is another problem which was missed before: this transition
      can also race with reparent_leader() which doesn't reset ->exit_signal if
      EXIT_DEAD, assuming that this task must be reaped by someone else.  So
      the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
      /sbin/init doesn't use __WALL it becomes unreapable.  This was fixed by
      the previous commit, but it was the temporary hack.
      
      1. Add the new exit_state, EXIT_TRACE. It means that the task is the
         traced zombie, debugger is going to detach and notify its natural
         parent.
      
         This new state is actually EXIT_ZOMBIE | EXIT_DEAD. This way we
         can avoid the changes in proc/kgdb code, get_task_state() still
         reports "X (dead)" in this case.
      
         Note: with or without this change userspace can see Z -> X -> Z
         transition. Not really bad, but probably makes sense to fix.
      
      2. Change wait_task_zombie() to use EXIT_TRACE instead of EXIT_DEAD
         if we need to notify the ->real_parent.
      
      3. Revert the previous hack in reparent_leader(), now that EXIT_DEAD
         is always the final state we can safely ignore such a task.
      
      4. Change wait_consider_task() to check EXIT_TRACE separately and kill
         the racy and no longer needed ptrace_reparented() case.
      
         If ptrace == T an EXIT_TRACE thread should be simply ignored, the
         owner of this state is going to ptrace_unlink() this task. We can
         pretend that it was already removed from ->ptraced list.
      
         Otherwise we should skip this thread too but clear ->notask_error,
         we must be the natural parent and debugger is going to untrace and
         notify us. IOW, this doesn't differ from "EXIT_ZOMBIE && p->ptrace"
         even if the task was already untraced.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
      Reported-by: Michal Schmidt <mschmidt@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
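
The bitmask trick in point 1 of the message above can be sketched in user space. This is only an illustration: the numeric values are assumed (the real definitions live in include/linux/sched.h and have changed between kernel versions), and report_exit_state(), can_reap() and is_final() are hypothetical helpers, not kernel functions.

```c
/*
 * Assumed values, for illustration only. What matters is that
 * EXIT_TRACE has both the EXIT_ZOMBIE and the EXIT_DEAD bit set.
 */
#define EXIT_ZOMBIE 16
#define EXIT_DEAD   32
#define EXIT_TRACE  (EXIT_ZOMBIE | EXIT_DEAD)

/*
 * Hypothetical reporter that tests the EXIT_DEAD bit first. Code that
 * only looks at this bit keeps reporting "X (dead)" for EXIT_TRACE
 * without being changed.
 */
const char *report_exit_state(int exit_state)
{
	if (exit_state & EXIT_DEAD)
		return "X (dead)";
	if (exit_state & EXIT_ZOMBIE)
		return "Z (zombie)";
	return "R (running)";
}

/* Hypothetical check: only a plain EXIT_ZOMBIE task is reapable. */
int can_reap(int exit_state)
{
	return exit_state == EXIT_ZOMBIE;
}

/* EXIT_DEAD alone is the final state; EXIT_TRACE is transient. */
int is_final(int exit_state)
{
	return exit_state == EXIT_DEAD;
}
```

Comparing with == rather than & is what lets a waiter tell the transient EXIT_TRACE apart from both EXIT_ZOMBIE and EXIT_DEAD.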
    • wait: fix reparent_leader() vs EXIT_DEAD->EXIT_ZOMBIE race · dfccbb5e
      Committed by Oleg Nesterov
      wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
      drops tasklist_lock.  If this task is not the natural child and it is
      traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
      
      The last transition is racy, this is even documented in 50b8d257
      "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
      race".  wait_consider_task() tries to detect this transition and clear
      ->notask_error but we can't rely on ptrace_reparented(), debugger can
      exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
      
      And there is another problem which was missed before: this transition
      can also race with reparent_leader() which doesn't reset ->exit_signal if
      EXIT_DEAD, assuming that this task must be reaped by someone else.  So
      the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
      /sbin/init doesn't use __WALL it becomes unreapable.
      
      Change reparent_leader() to update ->exit_signal even if EXIT_DEAD.
      Note: this is the simple temporary hack for -stable, it doesn't try to
      solve all problems, it will be reverted by the next changes.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
      Reported-by: Michal Schmidt <mschmidt@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/exit.c: call proc_exit_connector() after exit_state is set · ef982393
      Committed by Guillaume Morin
      The process events connector delivers a notification when a process
      exits.  This is really convenient for a process that spawns and wants to
      monitor its children through an epoll-able() interface.
      
      Unfortunately, there is a small window between when the event is
      delivered and when the child becomes wait()-able.
      
      This creates a race if the parent wants to make sure that it knows
      about the exit, e.g.
      
      pid_t pid = fork();
      if (pid > 0) {
      	register_interest_for_pid(pid);
      	if (waitpid(pid, NULL, WNOHANG) > 0)
      	{
      	  /* We might have raced with exit() */
      	}
      	return;
      }
      
      /* Child */
      execve(...)
      
      register_interest_for_pid() would be telling the connector socket
      reader to pay attention to events related to pid.
      
      Though this is not a bug, I think it would make the connector a bit more
      usable if this race was closed by simply moving the call to
      proc_exit_connector() from just before exit_notify() to right after.
      
      Oleg said:
      
      : Even with this patch the code above is still "racy" if the child is
      : multi-threaded.  Plus it should obviously filter-out subthreads.  And
      : afaics there is no way to make it reliable, even if you change the code
      : above so that waitpid() is called only after the last thread exits WNOHANG
      : still can fail.
      Signed-off-by: Guillaume Morin <guillaume@morinfr.org>
      Cc: Matt Helsley <matt.helsley@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exit: move check_stack_usage() to the end of do_exit() · 4bcb8232
      Committed by Oleg Nesterov
      It is not clear why check_stack_usage() is called so early and thus it
      never checks the stack usage in, say, exit_notify() or
      flush_ptrace_hw_breakpoint() or other functions which are only called by
      do_exit().
      
      Move the callsite down to the last preempt_disable/schedule.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exit: call disassociate_ctty() before exit_task_namespaces() · c39df5fa
      Committed by Oleg Nesterov
      Commit 8aac6270 ("move exit_task_namespaces() outside of
      exit_notify()") breaks pppd and the exiting service crashes the kernel:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
          IP: ppp_register_channel+0x13/0x20 [ppp_generic]
          Call Trace:
            ppp_asynctty_open+0x12b/0x170 [ppp_async]
            tty_ldisc_open.isra.2+0x27/0x60
            tty_ldisc_hangup+0x1e3/0x220
            __tty_hangup+0x2c4/0x440
            disassociate_ctty+0x61/0x270
            do_exit+0x7f2/0xa50
      
      ppp_register_channel() needs ->net_ns, but current->nsproxy == NULL at this point.
      
      Move disassociate_ctty() before exit_task_namespaces(), it doesn't make
      sense to delay it after perf_event_exit_task() or cgroup_exit().
      
      This also allows the use of task_work_add() inside the (nontrivial) code
      paths in disassociate_ctty().
      
      Investigated by Peter Hurley.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Sree Harsha Totakura <sreeharsha@totakura.in>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Sree Harsha Totakura <sreeharsha@totakura.in>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>	[v3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 29 Mar, 2014 1 commit
  3. 22 Jan, 2014 1 commit
    • introduce for_each_thread() to replace the buggy while_each_thread() · 0c740d0a
      Committed by Oleg Nesterov
      while_each_thread() and next_thread() should die, almost every lockless
      usage is wrong.
      
      1. Unless g == current, the lockless while_each_thread() is not safe.
      
         while_each_thread(g, t) can loop forever if g exits, next_thread()
         can't reach the unhashed thread in this case. Note that this can
         happen even if g is the group leader, it can exec.
      
      2. Even if while_each_thread() itself was correct, people often use
         it wrongly.
      
         It was never safe to just take rcu_read_lock() and loop unless
         you verify that pid_alive(g) == T, even the first next_thread()
         can point to the already freed/reused memory.
      
      This patch adds signal_struct->thread_head and task->thread_node to
      create the normal rcu-safe list with the stable head.  The new
      for_each_thread(g, t) helper is always safe under rcu_read_lock() as
      long as this task_struct can't go away.
      
      Note: of course it is ugly to have both task_struct->thread_node and the
      old task_struct->thread_group, we will kill it later, after we change
      the users of while_each_thread() to use for_each_thread().
      
      Perhaps we can kill it even before we convert all users, we can
      reimplement next_thread(t) using the new thread_head/thread_node.  But
      we can't do this right now because this will lead to subtle behavioural
      changes.  For example, do/while_each_thread() always sees at least one
      task, while for_each_thread() can do nothing if the whole thread group
      has died.  Or thread_group_empty(), currently its semantics is not clear
      unless thread_group_leader(p) and we need to audit the callers before we
      can change it.
      
      So this patch adds the new interface which has to coexist with the old
      one for some time, hopefully the next changes will be more or less
      straightforward and the old one will go away soon.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Sergey Dyasly <dserrg@gmail.com>
      Tested-by: Sergey Dyasly <dserrg@gmail.com>
      Reviewed-by: Sameer Nanda <snanda@chromium.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: "Ma, Xindong" <xindong.ma@intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: "Tu, Xiaobing" <xiaobing.tu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
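
The behavioural difference mentioned in the message above (do/while_each_thread() always sees at least one task, for_each_thread() can do nothing) can be shown with a user-space sketch. It assumes a simplified model with no RCU, no locking and no unhashing; struct thread, struct thread_group_sketch, count_ring() and count_from_head() are hypothetical names, not the kernel's.

```c
#include <stddef.h>

struct thread {
	int tid;
	struct thread *ring_next;	/* old style: circular ring, no head */
	struct thread *list_next;	/* new style: NULL-terminated list */
};

/* Stand-in for signal_struct->thread_head: valid even when empty. */
struct thread_group_sketch {
	struct thread *thread_head;
};

/*
 * Shape of do { } while_each_thread(g, t): the body runs before the
 * termination test, so at least one thread is always visited, and the
 * loop only ends when the walk gets back to g.
 */
int count_ring(const struct thread *g)
{
	const struct thread *t = g;
	int n = 0;

	do {
		n++;
		t = t->ring_next;
	} while (t != g);
	return n;
}

/*
 * Shape of for_each_thread(): the walk starts from a stable head and
 * visits nothing when the group has no threads left.
 */
int count_from_head(const struct thread_group_sketch *sig)
{
	const struct thread *t;
	int n = 0;

	for (t = sig->thread_head; t != NULL; t = t->list_next)
		n++;
	return n;
}
```

In this model the ring walk terminates only if g is still on the ring, which is the property the message says the lockless while_each_thread() cannot guarantee.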
  4. 10 Jul, 2013 1 commit
  5. 04 Jul, 2013 1 commit
  6. 15 Jun, 2013 1 commit
    • move exit_task_namespaces() outside of exit_notify() · 8aac6270
      Committed by Oleg Nesterov
      exit_notify() does exit_task_namespaces() after
      forget_original_parent(). This was needed to ensure that ->nsproxy
      can't be cleared prematurely, an exiting child we are going to
      reparent can do do_notify_parent() and use the parent's (ours) pid_ns.
      
      However, after 32084504 "pidns: use task_active_pid_ns in
      do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
      on task_active_pid_ns().
      
      Move exit_task_namespaces() from exit_notify() to do_exit(), after
      exit_fs() and before exit_task_work().
      
      This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
      does fput() which needs task_work_add().
      
      Note: this particular problem can be fixed if we change fput(), and
      that change makes sense anyway. But there is another reason to move
      the callsite. The original reason for exit_task_namespaces() from
      the middle of exit_notify() was subtle and it has already gone away,
      now this looks confusing. And this allows us to simplify exit_notify(),
      we can avoid unlock/lock(tasklist) and we can use ->exit_state instead
      of PF_EXITING in forget_original_parent().
      Reported-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  7. 12 May, 2013 1 commit
  8. 10 Apr, 2013 1 commit
  9. 01 Apr, 2013 1 commit
    • Revert "lockdep: check that no locks held at freeze time" · dbf520a9
      Committed by Paul Walmsley
      This reverts commit 6aa97070.
      
      Commit 6aa97070 ("lockdep: check that no locks held at freeze time")
      causes problems with NFS root filesystems.  The failures were noticed on
      OMAP2 and 3 boards during kernel init:
      
        [ BUG: swapper/0/1 still has locks held! ]
        3.9.0-rc3-00344-ga937536b #1 Not tainted
        -------------------------------------
        1 lock held by swapper/0/1:
         #0:  (&type->s_umount_key#13/1){+.+.+.}, at: [<c011e84c>] sget+0x248/0x574
      
        stack backtrace:
          rpc_wait_bit_killable
          __wait_on_bit
          out_of_line_wait_on_bit
          __rpc_execute
          rpc_run_task
          rpc_call_sync
          nfs_proc_get_root
          nfs_get_root
          nfs_fs_mount_common
          nfs_try_mount
          nfs_fs_mount
          mount_fs
          vfs_kern_mount
          do_mount
          sys_mount
          do_mount_root
          mount_root
          prepare_namespace
          kernel_init_freeable
          kernel_init
      
      Although the rootfs mounts, the system is unstable.  Here's a transcript
      from a PM test:
      
        http://www.pwsan.com/omap/testlogs/test_v3.9-rc3/20130317194234/pm/37xxevm/37xxevm_log.txt
      
      Here's what the test log should look like:
      
        http://www.pwsan.com/omap/testlogs/test_v3.8/20130218214403/pm/37xxevm/37xxevm_log.txt
      
      Mailing list discussion is here:
      
        http://lkml.org/lkml/2013/3/4/221
      
      Deal with this for v3.9 by reverting the problem commit, until folks can
      figure out the right long-term course of action.
      Signed-off-by: Paul Walmsley <paul@pwsan.com>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Shawn Guo <shawn.guo@linaro.org>
      Cc: <maciej.rutecki@gmail.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ben Chan <benchan@chromium.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 04 Mar, 2013 1 commit
  11. 28 Feb, 2013 2 commits
  12. 28 Jan, 2013 1 commit
    • cputime: Use accessors to read task cputime stats · 6fac4829
      Committed by Frederic Weisbecker
      This is in preparation for the full dynticks feature. While
      remotely reading the cputime of a task running in a full
      dynticks CPU, we'll need to do some extra-computation. This
      way we can account the time it spent tickless in userspace
      since its last cputime snapshot.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
  13. 29 Nov, 2012 2 commits
    • kill daemonize() · c4144670
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • cputime: Rename thread_group_times to thread_group_cputime_adjusted · e80d0a1a
      Committed by Frederic Weisbecker
      We have thread_group_cputime() and thread_group_times(). The naming
      doesn't provide enough information about the difference between
      these two APIs.
      
      To lower the confusion, rename thread_group_times() to
      thread_group_cputime_adjusted(). This name better suggests that
      it's a version of thread_group_cputime() that does some stabilization
      on the raw cputime values, i.e. here: scale on top of CFS runtime
      stats and bound lower value for monotonicity.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
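
The "stabilization" described in the message above (scale the raw values on top of the CFS runtime, then bound them from below for monotonicity) can be sketched roughly as follows. This is an assumed simplification, not the kernel's code: cputime_adjust_sketch() and struct cputime_sketch are hypothetical names, plain 64-bit integers stand in for cputime_t, and multiplication overflow is ignored.

```c
#include <stdint.h>

/* Previously reported values; they are never allowed to decrease. */
struct cputime_sketch {
	uint64_t utime;
	uint64_t stime;
};

/*
 * utime/stime: raw tick-based samples.
 * rtime: precise total runtime, standing in for the CFS statistic.
 * prev: last reported values, updated in place.
 */
void cputime_adjust_sketch(uint64_t utime, uint64_t stime, uint64_t rtime,
			   struct cputime_sketch *prev)
{
	uint64_t total = utime + stime;
	/* Split rtime in the same ratio as the raw samples. */
	uint64_t scaled_utime = total ? (rtime * utime) / total : rtime;
	uint64_t scaled_stime = rtime - scaled_utime;

	/* Bound from below so reported values stay monotonic. */
	if (scaled_utime > prev->utime)
		prev->utime = scaled_utime;
	if (scaled_stime > prev->stime)
		prev->stime = scaled_stime;
}
```

With raw samples of 30 and 10 ticks and a runtime of 80, the sketch reports 60 and 20. A later call whose scaled user time would be lower keeps the reported user time at 60.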
  14. 19 Nov, 2012 1 commit
  15. 27 Sep, 2012 2 commits
  16. 25 Sep, 2012 1 commit
    • net: use a per task frag allocator · 5640f768
      Committed by Eric Dumazet
      We currently use a per socket order-0 page cache for tcp_sendmsg()
      operations.
      
      This page is used to build fragments for skbs.
      
      It's done to increase probability of coalescing small write() into
      single segments in skbs still in write queue (not yet sent).
      
      But it wastes a lot of memory for applications handling many mostly
      idle sockets, since each socket holds one page in sk->sk_sndmsg_page
      
      It's also quite inefficient to build TSO 64KB packets, because we need
      about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
      page allocator more than wanted.
      
      This patch adds a per task frag allocator and uses bigger pages,
      if available. An automatic fallback is done in case of memory pressure.
      
      (up to 32768 bytes per frag, that's order-3 pages on x86)
      
      This increases TCP stream performance by 20% on loopback device,
      but also benefits on other network devices, since 8x less frags are
      mapped on transmit and unmapped on tx completion. Alexander Duyck
      mentioned a probable performance win on systems with IOMMU enabled.
      
      It's possible some SG enabled hardware can't cope with bigger fragments,
      but their ndo_start_xmit() should already handle this, splitting a
      fragment in sub fragments, since some arches have PAGE_SIZE=65536
      
      Successfully tested on various ethernet devices.
      (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
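
The figures quoted in the message above (about 16 pages per 64KB TSO packet with 4096-byte pages, 32768-byte frags being order-3, and 8x fewer frags) follow from simple arithmetic. The helpers below are hypothetical and only reproduce that arithmetic; they are not part of the networking code.

```c
#include <stddef.h>

/* Fragments needed for len bytes when one fragment holds frag_size. */
size_t frags_needed(size_t len, size_t frag_size)
{
	return (len + frag_size - 1) / frag_size;
}

/* Smallest order such that (page_size << order) covers bytes. */
unsigned int alloc_order(size_t bytes, size_t page_size)
{
	unsigned int order = 0;

	while ((page_size << order) < bytes)
		order++;
	return order;
}
```

For a 65536-byte packet, frags_needed() gives 16 with 4096-byte fragments and 2 with 32768-byte fragments, which is the 8x reduction. alloc_order(32768, 4096) gives 3.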
  17. 27 Jul, 2012 1 commit
    • posix_types.h: Cleanup stale __NFDBITS and related definitions · 8ded2bbc
      Committed by Josh Boyer
      Recently, glibc made a change to suppress sign-conversion warnings in
      FD_SET (glibc commit ceb9e56b3d1).  This uncovered an issue with the
      kernel's definition of __NFDBITS if applications #include
      <linux/types.h> after including <sys/select.h>.  A build failure would
      be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2
      flags to gcc.
      
      It was suggested that the kernel should either match the glibc
      definition of __NFDBITS or remove that entirely.  The current in-kernel
      uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no
      uses of the related __FDELT and __FDMASK defines.  Given that, we'll
      continue the cleanup that was started with commit 8b3d1cda
      ("posix_types: Remove fd_set macros") and drop the remaining unused
      macros.
      
      Additionally, linux/time.h has similar macros defined that expand to
      nothing so we'll remove those at the same time.
      Reported-by: Jeff Law <law@redhat.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      CC: <stable@vger.kernel.org>
      Signed-off-by: Josh Boyer <jwboyer@redhat.com>
      [ .. and fix up whitespace as per akpm ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 23 Jul, 2012 1 commit
  19. 21 Jun, 2012 3 commits
  20. 08 Jun, 2012 2 commits
    • Revert "mm: correctly synchronize rss-counters at exit/exec" · 48d212a2
      Committed by Linus Torvalds
      This reverts commit 40af1bbd.
      
      It's horribly and utterly broken for at least the following reasons:
      
       - calling sync_mm_rss() from mmput() is fundamentally wrong, because
         there's absolutely no reason to believe that the task that does the
         mmput() always does it on its own VM.  Example: fork, ptrace, /proc -
         you name it.
      
       - calling it *after* having done mmdrop() on it is doubly insane, since
         the mm struct may well be gone now.
      
       - testing mm against NULL before you call it is insane too, since a
         NULL mm there would have caused oopses long before.
      
      .. and those are just the three bugs I found before I decided to give up
      looking for more and revert it asap.  I should have caught it before I
      even took it, but I trusted Andrew too much.
      
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: correctly synchronize rss-counters at exit/exec · 40af1bbd
      Committed by Konstantin Khlebnikov
      mm->rss_stat counters have per-task delta: task->rss_stat.  Before
      changing task->mm pointer the kernel must flush this delta with
      sync_mm_rss().
      
      do_exit() already calls sync_mm_rss() to flush the rss-counters before
      committing the rss statistics into task->signal->maxrss, taskstats,
      audit and other stuff.  Unfortunately the kernel does this before
      calling mm_release(), which can call put_user() for processing
      task->clear_child_tid.  So at this point we can trigger page-faults and
      task->rss_stat becomes non-zero again.  As a result mm->rss_stat becomes
      inconsistent and check_mm() will print something like this:
      
      | BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
      | BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1
      
      This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
      out of do_exit() and calls it earlier.  After mm_release() there should
      be no pagefaults.
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org>		[3.4.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 01 Jun, 2012 2 commits
  22. 24 May, 2012 2 commits
    • genirq: reimplement exit_irq_thread() hook via task_work_add() · 4d1d61a6
      Committed by Oleg Nesterov
      exit_irq_thread() and task->irq_thread are needed to handle the unexpected
      (and unlikely) exit of irq-thread.
      
      We can use task_work instead and make this all private to
      kernel/irq/manage.c, cleanup plus micro-optimization.
      
      1. rename exit_irq_thread() to irq_thread_dtor(), make it
         static, and move it up before irq_thread().
      
      2. change irq_thread() to do task_work_add(irq_thread_dtor)
         at the start and task_work_cancel() before return.
      
         tracehook_notify_resume() can never play with kthreads,
         only do_exit()->exit_task_work() can call the callback
         and this is what we want.
      
      3. remove task_struct->irq_thread and the special hook
         in do_exit().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Smith <dsmith@redhat.com>
      Cc: "Frank Ch. Eigler" <fche@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • task_work_add: generic process-context callbacks · e73f8959
      Committed by Oleg Nesterov
      Provide a simple mechanism that allows running code in the (nonatomic)
      context of the arbitrary task.
      
      The caller does task_work_add(task, task_work) and this task executes
      task_work->func() either from do_notify_resume() or from do_exit().  The
      callback can rely on PF_EXITING to detect the latter case.
      
      "struct task_work" can be embedded in another struct, still it has "void
      *data" to handle the most common/simple case.
      
      This allows us to kill the ->replacement_session_keyring hack, and
      potentially this can have more users.
      
      Performance-wise, this adds 2 "unlikely(!hlist_empty())" checks into
      tracehook_notify_resume() and do_exit().  But at the same time we can
      remove the "replacement_session_keyring != NULL" checks from
      arch/*/signal.c and exit_creds().
      
      Note: task_work_add/task_work_run abuses ->pi_lock.  This is only because
      this lock is already used by lookup_pi_state() to synchronize with
      do_exit() setting PF_EXITING.  Fortunately the scope of this lock in
      task_work.c is really tiny, and the code is unlikely anyway.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Smith <dsmith@redhat.com>
      Cc: "Frank Ch. Eigler" <fche@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
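
The mechanism described in the message above (queue a callback on a task, which the task later runs in its own context) can be modelled in user space. This is a sketch under stated assumptions: all names ending in _sketch and the bump_counter() callback are hypothetical, there is no locking, and the kernel's actual list handling is not reproduced.

```c
#include <stddef.h>

/* Embeddable work item carrying a callback and one data pointer. */
struct task_work_sketch {
	struct task_work_sketch *next;
	void (*func)(struct task_work_sketch *work);
	void *data;
};

/* Minimal stand-in for a task: only the pending work list. */
struct task_sketch {
	struct task_work_sketch *works;
};

/* Queue work on the task; nothing runs yet. */
void task_work_add_sketch(struct task_sketch *task,
			  struct task_work_sketch *work)
{
	work->next = task->works;
	task->works = work;
}

/*
 * Run and drain everything queued. In the description above this
 * happens when the task returns to user mode or exits.
 */
void task_work_run_sketch(struct task_sketch *task)
{
	while (task->works != NULL) {
		struct task_work_sketch *work = task->works;

		task->works = work->next;
		work->next = NULL;
		work->func(work);
	}
}

/* Example callback: increments the int that data points to. */
void bump_counter(struct task_work_sketch *work)
{
	int *counter = work->data;

	(*counter)++;
}
```

This sketch runs callbacks in last-in, first-out order. That ordering is a property of the model only and says nothing about the kernel implementation.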
  23. 18 May, 2012 1 commit
  24. 03 May, 2012 1 commit
  25. 24 Mar, 2012 2 commits
    • kernel/exit.c: if init dies, log a signal which killed it, if any · 397a21f2
      Committed by Denys Vlasenko
      I just received another user's pleas for help when their init
      mysteriously died.  I again explained that they need to check whether it
      died because of bad instruction, a segv, or something else.  Which was
      an annoying detour into writing a trivial C program to spawn his init
      and print its exit code:
      
        http://lists.busybox.net/pipermail/busybox/2012-January/077172.html
      
      I hear you saying "just test it under /bin/sh".  Well, the crashing init
      _was_ /bin/sh.
      
      Which prompted me to make kernel do this first step automatically.  We can
      print exit code, which makes it possible to see that death was from e.g.
      SIGILL without writing test programs.
      
      [akpm@linux-foundation.org: add 0x to hex number output]
      Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
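
A printed exit code can be decoded with the standard wait-status macros, which is roughly what the "trivial C program" mentioned in the message above had to do. The sketch assumes the value uses the Linux wait-status encoding. killing_signal() and describe_status() are hypothetical helpers, and the "exitcode=0x..." text is an assumed format chosen for this example.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>

/* Terminating signal number, or 0 if the status is not a signal death. */
int killing_signal(int status)
{
	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Write a human-readable description of status into buf. */
int describe_status(int status, char *buf, size_t len)
{
	if (WIFSIGNALED(status))
		return snprintf(buf, len,
				"killed by signal %d (exitcode=0x%08x)",
				WTERMSIG(status), (unsigned int)status);
	if (WIFEXITED(status))
		return snprintf(buf, len,
				"exited with status %d (exitcode=0x%08x)",
				WEXITSTATUS(status), (unsigned int)status);
	return snprintf(buf, len, "exitcode=0x%08x", (unsigned int)status);
}
```

On Linux a status word equal to SIGILL decodes as death by that signal, while a status of 0x100 decodes as a normal exit with code 1.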
    • prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision · ebec18a6
      Committed by Lennart Poettering
      Userspace service managers/supervisors need to track their started
      services.  Many services daemonize by double-forking and get implicitly
      re-parented to PID 1.  The service manager will no longer be able to
      receive the SIGCHLD signals for them, and is no longer in charge of
      reaping the children with wait().  All information about the children is
      lost at the moment PID 1 cleans up the re-parented processes.
      
      With this prctl, a service manager process can mark itself as a sort of
      'sub-init', able to stay as the parent for all orphaned processes
      created by the started services.  All SIGCHLD signals will be delivered
      to the service manager.
      
      Receiving SIGCHLD and doing wait() is in cases of a service-manager much
      preferred over any possible asynchronous notification about specific
      PIDs, because the service manager has full access to the child process
      data in /proc and the PID can not be re-used until the wait(), the
      service-manager itself is in charge of, has happened.
      
      As a side effect, the relevant parent PID information does not get lost
      by a double-fork, which results in a more elaborate process tree and
      'ps' output:
      
      before:
        # ps afx
        253 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
        294 ?        Sl     0:00 /usr/libexec/polkit-1/polkitd
        328 ?        S      0:00 /usr/sbin/modem-manager
        608 ?        Sl     0:00 /usr/libexec/colord
        658 ?        Sl     0:00 /usr/libexec/upowerd
        819 ?        Sl     0:00 /usr/libexec/imsettings-daemon
        916 ?        Sl     0:00 /usr/libexec/udisks-daemon
        917 ?        S      0:00  \_ udisks-daemon: not polling any devices
      
      after:
        # ps afx
        294 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
        426 ?        Sl     0:00  \_ /usr/libexec/polkit-1/polkitd
        449 ?        S      0:00  \_ /usr/sbin/modem-manager
        635 ?        Sl     0:00  \_ /usr/libexec/colord
        705 ?        Sl     0:00  \_ /usr/libexec/upowerd
        959 ?        Sl     0:00  \_ /usr/libexec/udisks-daemon
        960 ?        S      0:00  |   \_ udisks-daemon: not polling any devices
        977 ?        Sl     0:00  \_ /usr/libexec/packagekitd
      
      This prctl is orthogonal to PID namespaces.  PID namespaces are isolated
      from each other, while a service management process usually requires the
      services to live in the same namespace, to be able to talk to each
      other.
      
      Users of this will be the systemd per-user instance, which provides
      init-like functionality for the user's login session and D-Bus, which
      activates bus services on-demand.  Both need init-like capabilities to
      be able to properly keep track of the services they start.
      
      Many thanks to Oleg for several rounds of review and insights.
      
      [akpm@linux-foundation.org: fix comment layout and spelling]
      [akpm@linux-foundation.org: add lengthy code comment from Oleg]
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Lennart Poettering <lennart@poettering.net>
      Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
      Acked-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
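
From user space, the double-fork case described in the message above might look like the following sketch. It assumes a Linux kernel that supports PR_SET_CHILD_SUBREAPER. reap_orphaned_grandchild() is a hypothetical helper written for this example, and the exit code 42 is arbitrary.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef PR_SET_CHILD_SUBREAPER
#define PR_SET_CHILD_SUBREAPER 36	/* fallback for old headers */
#endif

/*
 * Returns 1 if the orphaned grandchild was re-parented to us and
 * reaped, 0 if it could not be waited for, -1 on setup errors.
 */
int reap_orphaned_grandchild(void)
{
	int fds[2];
	int status;
	int result = -1;
	pid_t child, grandchild = -1;

	if (prctl(PR_SET_CHILD_SUBREAPER, 1UL, 0UL, 0UL, 0UL) != 0)
		return -1;
	if (pipe(fds) != 0)
		goto out;

	child = fork();
	if (child < 0) {
		close(fds[0]);
		close(fds[1]);
		goto out;
	}
	if (child == 0) {
		/* Intermediate process: fork again, report the pid, exit. */
		pid_t gc;

		close(fds[0]);
		gc = fork();
		if (gc == 0) {
			close(fds[1]);
			_exit(42);	/* the "daemon" */
		}
		if (write(fds[1], &gc, sizeof(gc)) != (ssize_t)sizeof(gc))
			_exit(1);
		_exit(0);
	}

	close(fds[1]);
	if (read(fds[0], &grandchild, sizeof(grandchild)) !=
	    (ssize_t)sizeof(grandchild))
		grandchild = -1;
	close(fds[0]);

	/* Once the intermediate process is reaped, the grandchild is ours. */
	if (waitpid(child, &status, 0) != child || grandchild <= 0)
		goto out;

	if (waitpid(grandchild, &status, 0) == grandchild)
		result = WIFEXITED(status) && WEXITSTATUS(status) == 42;
	else
		result = 0;
out:
	prctl(PR_SET_CHILD_SUBREAPER, 0UL, 0UL, 0UL, 0UL);
	return result;
}
```

Without the prctl() call the grandchild would be re-parented to PID 1 and the second waitpid() would be expected to fail with ECHILD, which is the situation the message describes for service managers.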
  26. 22 Mar, 2012 1 commit
  27. 21 Mar, 2012 1 commit
    • exit_signal: fix the "parent has changed security domain" logic · b6e238dc
      Committed by Oleg Nesterov
      exit_notify() changes ->exit_signal if the parent already did exec.
      This doesn't really work, we are not going to send the signal now
      if there is another live thread or the exiting task is traced. The
      parent can exec before the last dies or the tracer detaches.
      
      Move this check into do_notify_parent() which actually sends the
      signal.
      
      The user-visible change is that we do not change ->exit_signal,
      and thus the exiting task is still "clone children" for
      do_wait()->eligible_child(__WCLONE). Hopefully this is fine, the
      current logic is racy anyway.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>