1. 15 May, 2014 12 commits
  2. 14 May, 2014 2 commits
  3. 13 Apr, 2014 1 commit
  4. 12 Apr, 2014 1 commit
  5. 11 Apr, 2014 1 commit
  6. 10 Apr, 2014 1 commit
  7. 09 Apr, 2014 4 commits
    • L
      futex: avoid race between requeue and wake · 69cd9eba
      Linus Torvalds committed
      Jan Stancek reported:
       "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP)
        occasionally fails, because some threads fail to wake up.
      
        Testcase creates 5 threads, which are all waiting on same condition.
        Main thread then calls pthread_cond_broadcast() without holding mutex,
        which calls:
      
            futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..)
      
        This immediately wakes up single thread A, which unlocks mutex and
        tries to wake up another thread:
      
            futex(uaddr2, FUTEX_WAKE_PRIVATE, 1)
      
        If thread A manages to call futex_wake() before any waiters are
        requeued for uaddr2, no other thread is woken up"
      
      The ordering constraints for the hash bucket waiter counting are that
      the waiter counts have to be incremented _before_ getting the spinlock
      (because the spinlock acts as part of the memory barrier), but the
      "requeue" operation didn't honor those rules, and nobody had even
      thought about that case.
      
      This fairly simple patch just increments the waiter count for the target
      hash bucket (hb2) when requeuing a futex before taking the locks.  It
      then decrements them again after releasing the lock - the code that
      actually moves the futex(es) between hash buckets will do the additional
      required waiter count housekeeping.
      Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
      Acked-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org # 3.14
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69cd9eba
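The ordering rule from the futex commit above can be modelled in userspace. This is only a sketch under stated assumptions: `struct bucket`, `bucket_init()`, `wake_one()` and `requeue()` are hypothetical names, the spinlock is a C11 `atomic_flag`, and the two buckets are assumed to be distinct. It is not the kernel's code; it only shows the target bucket's waiter count being raised before any lock is taken and dropped again after the locks are released.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace model of a futex hash bucket (not the kernel struct). */
struct bucket {
    atomic_int  waiters;  /* tasks queued, or about to be queued, on this bucket */
    atomic_flag lock;     /* stand-in for the bucket spinlock */
    int         queued;   /* tasks currently on the bucket's list (lock held) */
};

static void bucket_init(struct bucket *b, int queued)
{
    atomic_init(&b->waiters, queued);
    atomic_flag_clear(&b->lock);
    b->queued = queued;
}

static void bucket_lock(struct bucket *b)
{
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ;  /* spin until the flag was previously clear */
}

static void bucket_unlock(struct bucket *b)
{
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

/* Waker fast path: skip the lock entirely when no waiter is advertised. */
static int wake_one(struct bucket *b)
{
    int woken = 0;

    if (atomic_load(&b->waiters) == 0)
        return 0;
    bucket_lock(b);
    if (b->queued > 0) {
        b->queued--;
        atomic_fetch_sub(&b->waiters, 1);
        woken = 1;
    }
    bucket_unlock(b);
    return woken;
}

/* Move up to nr waiters from src to dst; returns how many were moved. */
static int requeue(struct bucket *src, struct bucket *dst, int nr)
{
    int moved = 0;

    assert(src != dst);                  /* simplification of this sketch */
    atomic_fetch_add(&dst->waiters, 1);  /* advertise a waiter BEFORE locking */
    bucket_lock(src);
    bucket_lock(dst);
    while (moved < nr && src->queued > 0) {
        src->queued--;
        atomic_fetch_sub(&src->waiters, 1);
        dst->queued++;
        atomic_fetch_add(&dst->waiters, 1);  /* per-waiter housekeeping */
        moved++;
    }
    bucket_unlock(dst);
    bucket_unlock(src);
    atomic_fetch_sub(&dst->waiters, 1);  /* drop the temporary count */
    return moved;
}
```

Because the count on `dst` is non-zero for the whole requeue, a concurrent `wake_one(dst)` cannot take the "no waiters" shortcut; it has to take the lock and therefore observes the requeued tasks.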
    • M
      tracepoint: Fix sparse warnings in tracepoint.c · b725dfea
      Mathieu Desnoyers committed
      Fix the following sparse warnings:
      
        CHECK   kernel/tracepoint.c
      kernel/tracepoint.c:184:18: warning: incorrect type in assignment (different address spaces)
      kernel/tracepoint.c:184:18:    expected struct tracepoint_func *tp_funcs
      kernel/tracepoint.c:184:18:    got struct tracepoint_func [noderef] <asn:4>*funcs
      kernel/tracepoint.c:216:18: warning: incorrect type in assignment (different address spaces)
      kernel/tracepoint.c:216:18:    expected struct tracepoint_func *tp_funcs
      kernel/tracepoint.c:216:18:    got struct tracepoint_func [noderef] <asn:4>*funcs
      kernel/tracepoint.c:392:24: error: return expression in void function
        CC      kernel/tracepoint.o
      kernel/tracepoint.c: In function tracepoint_module_going:
      kernel/tracepoint.c:491:6: warning: symbol 'syscall_regfunc' was not declared. Should it be static?
      kernel/tracepoint.c:508:6: warning: symbol 'syscall_unregfunc' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/1397049883-28692-1-git-send-email-mathieu.desnoyers@efficios.com
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      b725dfea
    • S
      tracepoint: Simplify tracepoint module search · eb7d035c
      Steven Rostedt (Red Hat) committed
      Instead of copying the num_tracepoints and tracepoints_ptrs from
      the module structure to the tp_mod structure, which only uses it to
      find the module associated to tracepoints of modules that are coming
      and going, simply copy the pointer to the module struct to the tracepoint
      tp_module structure.
      
      Also removed un-needed brackets around an if statement.
      
      Link: http://lkml.kernel.org/r/20140408201705.4dad2c4a@gandalf.local.home
      Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      eb7d035c
    • M
      tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints · de7b2973
      Mathieu Desnoyers committed
      Register/unregister tracepoint probes with struct tracepoint pointer
      rather than tracepoint name.
      
      This change, which vastly simplifies tracepoint.c, has been proposed by
      Steven Rostedt. It also removes 8.8kB (mostly text) from the vmlinux
      size.
      
      From this point on, the tracers need to pass a struct tracepoint pointer
      to probe register/unregister. A probe can now only be connected to a
      tracepoint that exists. Moreover, tracers are responsible for
      unregistering the probe before the module containing its associated
      tracepoint is unloaded.
      
          text     data      bss      dec     hex filename
      10443444  4282528 10391552 25117524 17f4354 vmlinux.orig
      10434930  4282848 10391552 25109330 17f2352 vmlinux
      
      Link: http://lkml.kernel.org/r/1396992381-23785-2-git-send-email-mathieu.desnoyers@efficios.com
      
      CC: Ingo Molnar <mingo@kernel.org>
      CC: Frederic Weisbecker <fweisbec@gmail.com>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Frank Ch. Eigler <fche@redhat.com>
      CC: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      [ SDR - fixed return val in void func in tracepoint_module_going() ]
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      de7b2973
  8. 08 Apr, 2014 18 commits
    • J
      lglock: map to spinlock when !CONFIG_SMP · 64b47e8f
      Josh Triplett committed
      When the system has only one CPU, lglock is effectively a spinlock; map
      it directly to spinlock to eliminate the indirection and duplicate code.
      
      In addition to removing overhead, this drops 1.6k of code with a
      defconfig modified to have !CONFIG_SMP, and 1.1k with a minimal config.
      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64b47e8f
    • C
      modules: use raw_cpu_write for initialization of per cpu refcount. · 08f141d3
      Christoph Lameter committed
      The initialization of a structure is not subject to synchronization.
      The use of __this_cpu would trigger a false positive with the additional
      preemption checks for __this_cpu ops.
      
      So simply disable the check through the use of raw_cpu ops.
      
      Trace:
      
        __this_cpu_write operation in preemptible [00000000] code: modprobe/286
        caller is __this_cpu_preempt_check+0x38/0x60
        CPU: 3 PID: 286 Comm: modprobe Tainted: GF            3.12.0-rc4+ #187
        Call Trace:
          dump_stack+0x4e/0x82
          check_preemption_disabled+0xec/0x110
          __this_cpu_preempt_check+0x38/0x60
          load_module+0xcfd/0x2650
          SyS_init_module+0xa6/0xd0
          tracesys+0xe1/0xe6
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08f141d3
    • G
      kernel: use macros from compiler.h instead of __attribute__((...)) · 52f5684c
      Gideon Israel Dsouza committed
      To increase compiler portability there is <linux/compiler.h> which
      provides convenience macros for various gcc constructs.  Eg: __weak for
      __attribute__((weak)).  I've replaced all instances of gcc attributes
      with the right macro in the kernel subsystem.
      Signed-off-by: Gideon Israel Dsouza <gidisrael@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52f5684c
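As an illustration of the macro style the compiler.h commit above refers to, here is a small userspace sketch. The `#ifndef` fallbacks only approximate what `<linux/compiler.h>` provides for gcc; `arch_hook()` and `format_into()` are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Userspace stand-ins; inside the kernel these come from <linux/compiler.h>. */
#ifndef __weak
#define __weak __attribute__((weak))
#endif
#ifndef __printf
#define __printf(a, b) __attribute__((format(printf, a, b)))
#endif

/* Weak default: a strong definition elsewhere would replace it at link time. */
__weak int arch_hook(void)
{
    return -1;
}

/* The format attribute lets the compiler check the variadic arguments. */
__printf(3, 4)
int format_into(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(buf != NULL && fmt != NULL);
    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    return n;
}
```

Spelling the attribute through a macro keeps the call sites unchanged if another compiler needs a different spelling, which is the portability argument the commit message makes.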
    • F
      kernel/panic.c: display reason at end + pr_emerg · d7c0847f
      Fabian Frederick committed
      Currently, booting without initrd specified on 80x25 screen gives a call
      trace followed by atkbd : Spurious ACK.  Original message ("VFS: Unable
      to mount root fs") is not available.  Of course this could happen in
      other situations...
      
      This patch displays the panic reason after the call trace, which could help
      a lot of people even if it's not the very last line on screen.
      
      Also, convert all panic.c printk(KERN_EMERG to pr_emerg(
      
      [akpm@linux-foundation.org: missed a couple of pr_ conversions]
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7c0847f
    • L
      hung_task: check the value of "sysctl_hung_task_timeout_sec" · 80df2847
      Liu Hua committed
      As sysctl_hung_task_timeout_sec is unsigned long, when this value is
      larger than LONG_MAX/HZ, the function schedule_timeout_interruptible in
      watchdog will return immediately without sleeping and print:
      
        schedule_timeout: wrong timeout value ffffffffffffff83
      
      and then the function watchdog will call schedule_timeout_interruptible
      again and again.  The screen will be filled with
      
      	"schedule_timeout: wrong timeout value ffffffffffffff83"
      
      This patch adds a check and correction in sysctl, so that the function
      schedule_timeout_interruptible always gets a valid parameter.
      Signed-off-by: Liu Hua <sdu.liu@huawei.com>
      Tested-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
      Cc: <stable@vger.kernel.org>	[3.4+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80df2847
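The overflow described in the hung_task commit above comes from multiplying a number of seconds by HZ and handing the result to a function that takes a signed long. A minimal sketch of the bound check follows. It assumes the out-of-range value is rejected (the commit message does not say whether the kernel rejects or corrects it), takes the tick rate as a parameter, and uses hypothetical names `max_timeout_secs()`, `set_timeout_secs()` and `timeout_ticks()`.

```c
#include <assert.h>
#include <limits.h>

/* Largest number of seconds whose tick count still fits in a signed long. */
static unsigned long max_timeout_secs(unsigned long hz)
{
    assert(hz > 0);
    return (unsigned long)LONG_MAX / hz;
}

/*
 * Store a new timeout only if it is in range.
 * Returns 0 on success, -1 if secs * hz would not fit in a long.
 */
static int set_timeout_secs(unsigned long *slot, unsigned long secs,
                            unsigned long hz)
{
    if (secs > max_timeout_secs(hz))
        return -1;  /* previous value is left untouched */
    *slot = secs;
    return 0;
}

/* Conversion to ticks; cannot go negative once the value was validated. */
static long timeout_ticks(unsigned long secs, unsigned long hz)
{
    return (long)(secs * hz);
}
```

With the check in place, the value later passed to the sleeping function is never negative, so the "wrong timeout value" message cannot be triggered through this setting.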
    • O
      wait: WSTOPPED|WCONTINUED doesn't work if a zombie leader is traced by another process · 7c733eb3
      Oleg Nesterov committed
      Even if the main thread is dead the process still can stop/continue.
      However, if the leader is ptraced wait_consider_task(ptrace => false)
      always skips wait_task_stopped/wait_task_continued, so WSTOPPED or
      WCONTINUED can never work for the natural parent in this case.
      
      Move the "A zombie ptracee is only visible to its ptracer" check into the
      "if (!delay_group_leader(p))" block.  ->notask_error is cleared by the
      "fall through" code below.
      
      This depends on the previous change, wait_task_stopped/continued must be
      avoided if !delay_group_leader() and the tracer is ->real_parent.
      Otherwise WSTOPPED|WEXITED could wrongly report "stopped" when the child
      is already dead (single-threaded or not).  If it is traced by another task
      then the "stopped" state is fine until the debugger detaches and reveals a
      zombie state.
      
      Stupid test-case:
      
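      	#include <assert.h>
      	#include <pthread.h>
      	#include <signal.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>

      	void *tfunc(void *arg)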
      	{
      		sleep(1);	// wait for zombie leader
      		raise(SIGSTOP);
      		exit(0x13);
      		return NULL;
      	}
      
      	int run_child(void)
      	{
      		pthread_t thread;
      
      		if (!fork()) {
      			int tracee = getppid();
      
      			assert(ptrace(PTRACE_ATTACH, tracee, 0,0) == 0);
      			do
      				ptrace(PTRACE_CONT, tracee, 0,0);
      			while (wait(NULL) > 0);
      
      			return 0;
      		}
      
      		sleep(1);	// wait for PTRACE_ATTACH
      		assert(pthread_create(&thread, NULL, tfunc, NULL) == 0);
      		pthread_exit(NULL);
      	}
      
      	int main(void)
      	{
      		int child, stat;
      
      		child = fork();
      		if (!child)
      			return run_child();
      
      		assert(child == waitpid(-1, &stat, WSTOPPED));
      		assert(stat == 0x137f);
      
      		kill(child, SIGCONT);
      
      		assert(child == waitpid(-1, &stat, WCONTINUED));
      		assert(stat == 0xffff);
      
      		assert(child == waitpid(-1, &stat, 0));
      		assert(stat == 0x1300);
      
      		return 0;
      	}
      
      Without this patch it hangs in waitpid(WSTOPPED), wait_task_stopped() is
      never called.
      
      Note: this doesn't fix all problems with a zombie delay_group_leader(),
      WCONTINUED | WEXITED check is not exactly right.  debugger can't assume it
      will be notified if another thread reaps the whole thread group.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c733eb3
    • O
      wait: WSTOPPED|WCONTINUED hangs if a zombie child is traced by real_parent · 377d75da
      Oleg Nesterov committed
      "A zombie is only visible to its ptracer" logic in wait_consider_task()
      is very wrong. Trivial test-case:
      
      	#include <unistd.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>
      	#include <assert.h>
      
      	int main(void)
      	{
      		int child = fork();
      
      		if (!child) {
      			assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
      			return 0x23;
      		}
      
      		assert(waitid(P_ALL, child, NULL, WEXITED | WNOWAIT) == 0);
      		assert(waitid(P_ALL, 0, NULL, WSTOPPED) == -1);
      		return 0;
      	}
      
      it hangs in waitid(WSTOPPED) despite the fact that it has a single zombie
      child.  This is because wait_consider_task(ptrace => 0) sees p->ptrace and
      clears ->notask_error, assuming that the debugger should detach and notify
      us.
      
      Change wait_consider_task(ptrace => 0) to pretend that ptrace == T if the
      child is traced by us.  This really simplifies the logic and allows us to
      do more fixes, see the next changes.  This also hides the unwanted group
      stop state automatically, we can remove another ptrace_reparented() check.
      
      Unfortunately, this adds the following behavioural changes:
      
      	1. Before this patch wait(WEXITED | __WNOTHREAD) does not reap
      	   a natural child if it is traced by the caller's sub-thread.
      
      	   Hopefully nobody will ever notice this change, and I think
      	   that nobody should rely on this behaviour anyway.
      
      	2. SIGNAL_STOP_CONTINUED is no longer hidden from debugger if
      	   it is real parent.
      
      	   While this change comes as a side effect, I think it is good
      	   by itself. The group continued state can not be consumed by
      	   another process in this case, it doesn't depend on ptrace,
      	   it doesn't make sense to hide it from real parent.
      
      	   Perhaps we should add the thread_group_leader() check before
      	   wait_task_continued()? May be, but this shouldn't depend on
      	   ptrace_reparented().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      377d75da
    • O
      wait: completely ignore the EXIT_DEAD tasks · b3ab0316
      Oleg Nesterov committed
      Now that EXIT_DEAD is the terminal state it doesn't make sense to call
      eligible_child() or security_task_wait() if the task is really dead.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3ab0316
    • O
      wait: use EXIT_TRACE only if thread_group_leader(zombie) · b4360690
      Oleg Nesterov committed
      wait_task_zombie() always uses EXIT_TRACE/ptrace_unlink() if
      ptrace_reparented().  This is suboptimal and a bit confusing: we do not
      need do_notify_parent(p) if !thread_group_leader(p) and in this case we
      also do not need ptrace_unlink(), we can rely on ptrace_release_task().
      
      Change wait_task_zombie() to check thread_group_leader() along with
      ptrace_reparented() and simplify the final p->exit_state transition.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4360690
    • O
      wait: introduce EXIT_TRACE to avoid the racy EXIT_DEAD->EXIT_ZOMBIE transition · abd50b39
      Oleg Nesterov committed
      wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
      drops tasklist_lock.  If this task is not the natural child and it is
      traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
      
      The last transition is racy, this is even documented in 50b8d257
      "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
      race".  wait_consider_task() tries to detect this transition and clear
      ->notask_error but we can't rely on ptrace_reparented(), debugger can
      exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
      
      And there is another problem which was missed before: this transition
      can also race with reparent_leader(), which doesn't reset ->exit_signal if
      EXIT_DEAD, assuming that this task must be reaped by someone else.  So
      the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
      /sbin/init doesn't use __WALL it becomes unreapable.  This was fixed by
      the previous commit, but that was a temporary hack.
      
      1. Add the new exit_state, EXIT_TRACE. It means that the task is the
         traced zombie, debugger is going to detach and notify its natural
         parent.
      
         This new state is actually EXIT_ZOMBIE | EXIT_DEAD. This way we
         can avoid the changes in proc/kgdb code, get_task_state() still
         reports "X (dead)" in this case.
      
         Note: with or without this change userspace can see Z -> X -> Z
         transition. Not really bad, but probably makes sense to fix.
      
      2. Change wait_task_zombie() to use EXIT_TRACE instead of EXIT_DEAD
         if we need to notify the ->real_parent.
      
      3. Revert the previous hack in reparent_leader(), now that EXIT_DEAD
         is always the final state we can safely ignore such a task.
      
      4. Change wait_consider_task() to check EXIT_TRACE separately and kill
         the racy and no longer needed ptrace_reparented() case.
      
         If ptrace == T an EXIT_TRACE thread should be simply ignored, the
         owner of this state is going to ptrace_unlink() this task. We can
         pretend that it was already removed from ->ptraced list.
      
         Otherwise we should skip this thread too but clear ->notask_error,
         we must be the natural parent and debugger is going to untrace and
         notify us. IOW, this doesn't differ from "EXIT_ZOMBIE && p->ptrace"
         even if the task was already untraced.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
      Reported-by: Michal Schmidt <mschmidt@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abd50b39
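Point 1 of the EXIT_TRACE commit above relies on the new state being the bitwise OR of two existing ones. The sketch below shows why code that tests the EXIT_DEAD bit keeps reporting "X" for such a task, while code that wants a plain zombie has to compare for equality. The numeric values and the reporter's precedence (EXIT_DEAD bit wins) are assumptions made for illustration; `exit_state_letter()` and `is_plain_zombie()` are hypothetical helpers.

```c
#include <assert.h>

/* Bit values assumed for this sketch; only the OR relationship matters. */
#define EXIT_ZOMBIE 16
#define EXIT_DEAD   32
#define EXIT_TRACE  (EXIT_ZOMBIE | EXIT_DEAD)

/* Reporter that lets the EXIT_DEAD bit take precedence. */
static char exit_state_letter(int exit_state)
{
    assert((exit_state & ~EXIT_TRACE) == 0);  /* only exit bits expected */

    if (exit_state & EXIT_DEAD)
        return 'X';  /* covers both EXIT_DEAD and EXIT_TRACE */
    if (exit_state & EXIT_ZOMBIE)
        return 'Z';
    return '?';
}

/* A waiter that reaps ordinary zombies must not match EXIT_TRACE. */
static int is_plain_zombie(int exit_state)
{
    return exit_state == EXIT_ZOMBIE;
}
```

Existing bit tests therefore keep working unchanged, which is the reason the commit message gives for not touching the proc/kgdb code.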
    • O
      wait: fix reparent_leader() vs EXIT_DEAD->EXIT_ZOMBIE race · dfccbb5e
      Oleg Nesterov committed
      wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
      drops tasklist_lock.  If this task is not the natural child and it is
      traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
      
      The last transition is racy, this is even documented in 50b8d257
      "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
      race".  wait_consider_task() tries to detect this transition and clear
      ->notask_error but we can't rely on ptrace_reparented(), debugger can
      exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
      
      And there is another problem which was missed before: this transition
      can also race with reparent_leader(), which doesn't reset ->exit_signal if
      EXIT_DEAD, assuming that this task must be reaped by someone else.  So
      the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
      /sbin/init doesn't use __WALL it becomes unreapable.
      
      Change reparent_leader() to update ->exit_signal even if EXIT_DEAD.
      Note: this is the simple temporary hack for -stable, it doesn't try to
      solve all problems, it will be reverted by the next changes.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
      Reported-by: Michal Schmidt <mschmidt@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dfccbb5e
    • G
      kernel/exit.c: call proc_exit_connector() after exit_state is set · ef982393
      Guillaume Morin committed
      The process events connector delivers a notification when a process
      exits.  This is really convenient for a process that spawns and wants to
      monitor its children through an epoll-able() interface.
      
      Unfortunately, there is a small window between when the event is
      delivered and when the child becomes wait()-able.
      
      This creates a race if the parent wants to make sure that it knows
      about the exit, e.g.
      
      pid_t pid = fork();
      if (pid > 0) {
      	register_interest_for_pid(pid);
      	if (waitpid(pid, NULL, WNOHANG) > 0)
      	{
      	  /* We might have raced with exit() */
      	}
      	return;
      }
      
      /* Child */
      execve(...)
      
      register_interest_for_pid() would be telling the connector socket
      reader to pay attention to events related to pid.
      
      Though this is not a bug, I think it would make the connector a bit more
      usable if this race was closed by simply moving the call to
      proc_exit_connector() from just before exit_notify() to right after.
      
      Oleg said:
      
      : Even with this patch the code above is still "racy" if the child is
      : multi-threaded.  Plus it should obviously filter-out subthreads.  And
      : afaics there is no way to make it reliable, even if you change the code
      : above so that waitpid() is called only after the last thread exits WNOHANG
      : still can fail.
      Signed-off-by: Guillaume Morin <guillaume@morinfr.org>
      Cc: Matt Helsley <matt.helsley@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef982393
    • O
      exit: move check_stack_usage() to the end of do_exit() · 4bcb8232
      Oleg Nesterov committed
      It is not clear why check_stack_usage() is called so early and thus it
      never checks the stack usage in, say, exit_notify() or
      flush_ptrace_hw_breakpoint() or other functions which are only called by
      do_exit().
      
      Move the callsite down to the last preempt_disable/schedule.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4bcb8232
    • O
      exit: call disassociate_ctty() before exit_task_namespaces() · c39df5fa
      Oleg Nesterov committed
      Commit 8aac6270 ("move exit_task_namespaces() outside of
      exit_notify()") breaks pppd and the exiting service crashes the kernel:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
          IP: ppp_register_channel+0x13/0x20 [ppp_generic]
          Call Trace:
            ppp_asynctty_open+0x12b/0x170 [ppp_async]
            tty_ldisc_open.isra.2+0x27/0x60
            tty_ldisc_hangup+0x1e3/0x220
            __tty_hangup+0x2c4/0x440
            disassociate_ctty+0x61/0x270
            do_exit+0x7f2/0xa50
      
      ppp_register_channel() needs ->net_ns and current->nsproxy == NULL.
      
      Move disassociate_ctty() before exit_task_namespaces(), it doesn't make
      sense to delay it after perf_event_exit_task() or cgroup_exit().
      
      This also allows the use of task_work_add() inside the (nontrivial) code
      paths in disassociate_ctty().
      
      Investigated by Peter Hurley.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Sree Harsha Totakura <sreeharsha@totakura.in>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Sree Harsha Totakura <sreeharsha@totakura.in>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>	[v3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c39df5fa
    • D
      res_counter: remove interface for locked charging and uncharging · 539a13b4
      David Rientjes committed
      The res_counter_{charge,uncharge}_locked() variants are not used in the
      kernel outside of the resource counter code itself, so remove the
      interface.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      539a13b4
    • D
      mm, mempolicy: remove per-process flag · f0432d15
      David Rientjes committed
      PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
      There's no significant performance degradation to checking
      current->mempolicy rather than current->flags & PF_MEMPOLICY in the
      allocation path, especially since this is considered unlikely().
      
      Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with
      64GB of memory and without a mempolicy:
      
      	threads		before		after
      	16		1249409		1244487
      	32		1281786		1246783
      	48		1239175		1239138
      	64		1244642		1241841
      	80		1244346		1248918
      	96		1266436		1254316
      	112		1307398		1312135
      	128		1327607		1326502
      
      Per-process flags are a scarce resource so we should free them up whenever
      possible and make them available.  We'll be using it shortly for memcg oom
      reserves.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0432d15
    • D
      fork: collapse copy_flags into copy_process · 514ddb44
      David Rientjes committed
      copy_flags() does not use the clone_flags formal and can be collapsed
      into copy_process() for cleaner code.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      514ddb44
    • D
      mm: per-thread vma caching · 615d6e87
      Davidlohr Bueso committed
      This patch is a continuation of efforts trying to optimize find_vma(),
      avoiding potentially expensive rbtree walks to locate a vma upon faults.
      The original approach (https://lkml.org/lkml/2013/11/1/410), where the
      largest vma was also cached, ended up being too specific and random,
      thus further comparison with other approaches was needed.  There are
      two things to consider when dealing with this: the cache hit rate and
      the latency of find_vma().  Improving the hit rate does not necessarily
      translate into finding the vma any faster, as the overhead of any fancy
      caching schemes can be too high to consider.
      
      We currently cache the last used vma for the whole address space, which
      provides a nice optimization, reducing the total cycles in find_vma() by
      up to 250%, for workloads with good locality.  On the other hand, this
      simple scheme is pretty much useless for workloads with poor locality.
      Analyzing ebizzy runs shows that, no matter how many threads are
      running, the mmap_cache hit rate is less than 2%, and in many situations
      below 1%.
      
      The proposed approach is to replace this scheme with a small per-thread
      cache, maximizing hit rates at a very low maintenance cost.
      Invalidations are performed by simply bumping up a 32-bit sequence
      number.  The only expensive operation is in the rare case of a seq
      number overflow, where all caches that share the same address space are
      flushed.  Upon a miss, the proposed replacement policy is based on the
      page number that contains the virtual address in question.  Concretely,
      the following results are seen on an 80 core, 8 socket x86-64 box:
      
      1) System bootup: Most programs are single threaded, so the per-thread
         scheme does improve ~50% hit rate by just adding a few more slots to
         the cache.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 50.61%   | 19.90            |
      | patched        | 73.45%   | 13.58            |
      +----------------+----------+------------------+
      
      2) Kernel build: This one is already pretty good with the current
         approach as we're dealing with good locality.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 75.28%   | 11.03            |
      | patched        | 88.09%   | 9.31             |
      +----------------+----------+------------------+
      
      3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 70.66%   | 17.14            |
      | patched        | 91.15%   | 12.57            |
      +----------------+----------+------------------+
      
      4) Ebizzy: There's a fair amount of variation from run to run, but this
         approach always shows nearly perfect hit rates, while baseline is just
         about non-existent.  The amounts of cycles can fluctuate between
         anywhere from ~60 to ~116 for the baseline scheme, but this approach
         reduces it considerably.  For instance, with 80 threads:
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 1.06%    | 91.54            |
      | patched        | 99.97%   | 14.18            |
      +----------------+----------+------------------+
      
      [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
      [akpm@linux-foundation.org: document vmacache_valid() logic]
      [akpm@linux-foundation.org: attempt to untangle header files]
      [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
      [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: adjust and enhance comments]
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Michel Lespinasse <walken@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      615d6e87
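The per-thread cache described in the commit above combines two ideas: invalidation by bumping a sequence number, and a replacement slot chosen from the page number of the faulting address. A simplified userspace sketch follows. The slot count, the 4 KiB page shift, and all names (`struct region`, `struct addr_space`, `struct thread_cache`, `cache_find()`, `cache_update()`) are assumptions for illustration, and sequence-number overflow handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_ASSUMED 12  /* 4 KiB pages, assumed */
#define CACHE_SLOTS        4   /* small power of two, assumed */

struct region { uintptr_t start, end; };  /* covers [start, end) */

struct addr_space { uint32_t seqnum; };   /* bumped on every mapping change */

struct thread_cache {
    uint32_t seqnum;                      /* seqnum the slots were filled under */
    const struct region *slot[CACHE_SLOTS];
};

/* Replacement policy: the page number of the address picks the slot. */
static size_t slot_index(uintptr_t addr)
{
    return (addr >> PAGE_SHIFT_ASSUMED) & (CACHE_SLOTS - 1);
}

/* Invalidate every thread's cache for this address space in O(1). */
static void cache_invalidate_all(struct addr_space *as)
{
    as->seqnum++;
}

/* Drop stale entries if the address space changed since the last fill. */
static int cache_sync(struct thread_cache *tc, const struct addr_space *as)
{
    size_t i;

    if (tc->seqnum == as->seqnum)
        return 1;
    for (i = 0; i < CACHE_SLOTS; i++)
        tc->slot[i] = NULL;
    tc->seqnum = as->seqnum;
    return 0;
}

/* Return the cached region containing addr, or NULL on a miss. */
static const struct region *cache_find(struct thread_cache *tc,
                                       const struct addr_space *as,
                                       uintptr_t addr)
{
    size_t i;

    if (!cache_sync(tc, as))
        return NULL;
    for (i = 0; i < CACHE_SLOTS; i++) {
        const struct region *r = tc->slot[i];

        if (r && r->start <= addr && addr < r->end)
            return r;
    }
    return NULL;
}

/* After a miss was resolved by the slow path, remember the result. */
static void cache_update(struct thread_cache *tc, const struct addr_space *as,
                         uintptr_t addr, const struct region *r)
{
    assert(r != NULL);
    cache_sync(tc, as);
    tc->slot[slot_index(addr)] = r;
}
```

A caller would try `cache_find()` first, fall back to the full tree walk on a miss, and then call `cache_update()` with the result. Invalidation touches only one integer, so the maintenance cost stays low, as the commit message argues.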