1. 22 May 2014: 1 commit
  2. 20 May 2014: 10 commits
    • L
      workqueue: use generic attach/detach routine for rescuers · 51697d39
      Lai Jiangshan committed
      There are several problems with the code that rescuers use to bind
      themselves to the target pool's cpumask.
      
        1) It is very different from how the normal workers bind to cpumask,
           increasing code complexity and maintenance overhead.
      
        2) The code of cpu-binding for rescuers is complicated.
      
        3) If one or more cpu hotplugs happen while a rescuer is processing
           its scheduled work items, the rescuer may not stay bound to the
           cpumask of the pool. This is an allowed behavior, but is still
           hairy. It will be better if the cpumask of the rescuer is always
           kept synchronized with the pool across cpu hotplugs.
      
      Using generic attach/detach routine will solve the above problems and
      results in much simpler code.
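
      A minimal sketch of how rescuer_thread() might use the shared
      routines after this change (simplified; the surrounding mayday
      handling is elided):

      	/* for each pool that sent a mayday */
      	worker_attach_to_pool(rescuer, pool);
      	/* ... process the pool's outstanding work items ... */
      	worker_detach_from_pool(rescuer, pool);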
      
      tj: Minor description updates.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      51697d39
    • L
      workqueue: separate pool-attaching code out from create_worker() · 4736cbf7
      Lai Jiangshan committed
      Currently, the code to attach a new worker to its pool is embedded in
      create_worker().  Separating this code out will make the code clearer
      and will allow rescuers to share the code path later.
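
      As a rough sketch, the separated helper looks like this (simplified
      from the actual patch):

      	static void worker_attach_to_pool(struct worker *worker,
      					  struct worker_pool *pool)
      	{
      		mutex_lock(&pool->attach_mutex);

      		/* bind the worker to the pool's cpumask */
      		set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);

      		/* stable list used for iteration and attribute changes */
      		list_add_tail(&worker->node, &pool->workers);

      		mutex_unlock(&pool->attach_mutex);
      	}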
      
      tj: Description and comment updates.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      4736cbf7
    • L
      workqueue: rename manager_mutex to attach_mutex · 92f9c5c4
      Lai Jiangshan committed
      manager_mutex is now only used to protect attaching to the pool and
      the pool->workers list. It protects pool->workers and the operations
      based on that list, such as:
      
      	cpu-binding for the workers in the pool->workers
      	the operations to set/clear WORKER_UNBOUND
      
      So let's rename manager_mutex to attach_mutex to better reflect its
      role. This patch is a pure rename.
      
      tj: Minor comment and description updates.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      92f9c5c4
    • L
      workqueue: narrow the protection range of manager_mutex · 4d757c5c
      Lai Jiangshan committed
      In create_worker(), as pool->worker_ida now uses
      ida_simple_get()/ida_simple_remove() and doesn't require external
      synchronization, it doesn't need manager_mutex.
      
      struct worker allocation and kthread allocation are not visible to
      anyone before the worker is attached, so they don't need
      manager_mutex either.
      
      The above operations are before the attaching operation which attaches
      the worker to the pool. Between attaching and starting the worker, the
      worker is already attached to the pool, so the cpu hotplug will handle
      cpu-binding for the worker correctly and we don't need the
      manager_mutex after attaching.
      
      The conclusion is that only the attaching operation needs
      manager_mutex, so we narrow the section it protects in
      create_worker().
      
      Some comments about manager_mutex are removed, because we will rename
      it to attach_mutex and add worker_attach_to_pool() later which will be
      self-explanatory.
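
      A sketch of the narrowed locking in create_worker() (simplified;
      error handling omitted):

      	id = ida_simple_get(&pool->worker_ida, 0, 0, GFP_KERNEL);
      	worker = alloc_worker();	/* not visible to anyone yet */
      	/* ... create worker->task with kthread_create_on_node() ... */

      	mutex_lock(&pool->manager_mutex);	/* only around attaching */
      	/* cpu-binding and adding the worker to the pool */
      	mutex_unlock(&pool->manager_mutex);

      	start_worker(worker);	/* hotplug handles rebinding from here on */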
      
      tj: Minor description updates.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      4d757c5c
    • L
      workqueue: convert worker_idr to worker_ida · 7cda9aae
      Lai Jiangshan committed
      We no longer iterate workers via worker_idr and worker_idr is used
      only for allocating/freeing ID, so we can convert it to worker_ida.
      
      By using ida_simple_get/remove(), worker_ida doesn't require external
      synchronization, so we don't need manager_mutex to protect it and the
      ID-removal code is allowed to be moved out from
      worker_detach_from_pool().
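
      For reference, the ida_simple_*() interface is internally
      synchronized, so the conversion reduces to calls like these:

      	id = ida_simple_get(&pool->worker_ida, 0, 0, GFP_KERNEL);
      	if (id < 0)
      		goto fail;
      	/* ... and once the worker is dying: */
      	ida_simple_remove(&pool->worker_ida, worker->id);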
      
      In a later patch, worker_detach_from_pool() will be used in rescuers
      which don't have IDs, so we move the ID-removal code out from
      worker_detach_from_pool() into worker_thread().
      
      tj: Minor description updates.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      7cda9aae
    • L
      workqueue: separate iteration role from worker_idr · da028469
      Lai Jiangshan committed
      worker_idr has two duties: iterating over attached workers and
      allocating worker IDs. These duties don't have to be tied together;
      we can separate them and use a list for tracking attached workers and
      iterating over them.
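
      A sketch of the result (simplified): attached workers are kept on a
      plain list in the pool, and iteration walks that list instead of an
      IDR:

      	/* attach time */
      	list_add_tail(&worker->node, &pool->workers);

      	/* iteration, e.g. rebinding on cpu hotplug */
      	list_for_each_entry(worker, &pool->workers, node)
      		set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);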
      
      Before this separation, rescuer workers couldn't be added to
      worker_idr because they can't allocate an ID dynamically:
      ID allocation depends on memory allocation, which rescuers can't
      rely on.
      
      After the separation, we can easily add rescuer workers to the list
      for iteration without any memory allocation. This is required when we
      attach the rescuer worker to the pool in a later patch.
      
      tj: Minor description updates.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      da028469
    • L
      workqueue: destroy worker directly in the idle timeout handler · 3347fc9f
      Lai Jiangshan committed
      Since destroy_worker() neither sleeps nor requires manager_mutex, it
      can be called directly from the idle timeout handler. This lets us
      remove POOL_MANAGE_WORKERS and maybe_destroy_worker() and simplify
      manage_workers().

      After POOL_MANAGE_WORKERS is removed, worker_thread() no longer
      needs to test whether it has management work to do after processing
      work items, so we can remove that test branch.
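
      Roughly, the timeout handler can now do the destruction itself
      (simplified; the idle-expiry bookkeeping is elided):

      	static void idle_worker_timeout(unsigned long __pool)
      	{
      		struct worker_pool *pool = (void *)__pool;

      		spin_lock_irq(&pool->lock);
      		while (too_many_workers(pool)) {
      			struct worker *worker;

      			/* the least recently active worker is last */
      			worker = list_entry(pool->idle_list.prev,
      					    struct worker, entry);
      			/* ... re-arm the timer and stop if not yet expired ... */
      			destroy_worker(worker);
      		}
      		spin_unlock_irq(&pool->lock);
      	}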
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      3347fc9f
    • L
      workqueue: async worker destruction · 60f5a4bc
      Lai Jiangshan committed
      Worker destruction consists of these steps:
      	adjust pool's stats
      	remove the worker from idle list
      	detach the worker from the pool
      	kthread_stop() to wait for the worker's task exit
      	free the worker struct
      
      There is no essential work to do after kthread_stop(), which means
      destroy_worker() doesn't need to wait for the worker's task to exit,
      so we can remove kthread_stop() and free the worker struct in the
      worker's exit path instead.
      
      However, put_unbound_pool() still needs to sync all the workers'
      destruction before destroying the pool; otherwise, the workers may
      access the invalid pool while they are exiting.
      
      So we also move the "detach the worker" code to the exit path and
      let put_unbound_pool() sync with it via detach_completion.
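
      A sketch of the synchronization (simplified):

      	/* worker exit path in worker_thread() */
      	worker_detach_from_pool(worker, pool);	/* completes detach_completion
      						   for the last detaching worker */
      	kfree(worker);

      	/* put_unbound_pool() */
      	DECLARE_COMPLETION_ONSTACK(detach_completion);
      	/* ... destroy the idle workers ... */
      	if (!list_empty(&pool->workers))
      		pool->detach_completion = &detach_completion;
      	/* ... */
      	if (pool->detach_completion)
      		wait_for_completion(pool->detach_completion);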
      
      The "detach the worker" code is wrapped in a new function,
      worker_detach_from_pool(). Although it is only called once (from
      worker_thread()) after this patch, we wrap it for these reasons:

        1) The detach code is not short enough to inline into
           worker_thread().
        2) The name worker_detach_from_pool() is self-explanatory, and we
           add some comments above the function.
        3) It will be shared by rescuers in a later patch, letting rescuers
           and normal workers use the same attach/detach framework.
      
      The worker ID is freed on detach, which happens before the worker is
      fully dead, so a dying worker's ID may be reused for a new worker.
      The dying worker's task name is therefore changed to "worker/dying"
      to avoid multiple workers sharing the same name.
      
      Since "detach the worker" is moved out from destroy_worker(),
      destroy_worker() doesn't require manager_mutex, so the
      "lockdep_assert_held(&pool->manager_mutex)" in destroy_worker() is
      removed, and destroy_worker() is not protected by manager_mutex in
      put_unbound_pool().
      
      tj: Minor description updates.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      60f5a4bc
    • L
      workqueue: destroy_worker() should destroy idle workers only · 73eb7fe7
      Lai Jiangshan committed
      We used to have a CPU-online failure path where a worker was created
      for the CPU coming online and, if the online operation failed, was
      shut down without ever being started. But this behavior has changed:
      the first worker for an onlining CPU is now created and started at
      the same time.
      
      This means the code already ensures that destroy_worker() destroys
      only idle workers, and we don't want to allow it to destroy non-idle
      workers in the future; that would be bug-prone and extremely hard to
      verify. We should make destroy_worker() destroy only idle workers
      explicitly.
      
      Since destroy_worker() destroys only idle workers, this patch does not
      change any functionality. We just need to update the comments and the
      sanity check code.
      
      In the sanity check code, we will refuse to destroy the worker
      if !(worker->flags & WORKER_IDLE).
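
      The resulting sanity check might look like this (simplified):

      	/* destroy_worker(): only idle workers may be destroyed */
      	if (WARN_ON(worker->current_work) ||
      	    WARN_ON(!list_empty(&worker->scheduled)) ||
      	    WARN_ON(!(worker->flags & WORKER_IDLE)))
      		return;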
      
      A worker that has entered idle is necessarily already started, so we
      remove the "worker->flags & WORKER_STARTED" check. After this
      removal, WORKER_STARTED is entirely unneeded, so we remove
      WORKER_STARTED too.
      
      In the comments for create_worker(), "Create a new worker which is
      bound..." is changed to "... which is attached..." because this
      behavior is now called attaching.
      
      tj: Minor description / comment updates.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      73eb7fe7
    • L
      workqueue: use manager lock only to protect worker_idr · 9625ab17
      Lai Jiangshan committed
      worker_idr is tightly bound to managers and is only ever accessed in
      manager lock context, so we don't need pool->lock for it.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      9625ab17
  3. 13 May 2014: 1 commit
  4. 19 Apr 2014: 2 commits
    • D
      workqueue: simplify wq_update_unbound_numa() by jumping to use_dfl_pwq if the target cpumask equals wq's · 534a3fbb
      Daeseok Youn committed
      When wq_update_unbound_numa() decides that the newly updated cpumask
      equals the default, it checks whether the current pwq is already the
      default one and skips setting pwq to it. This extra step is
      unnecessary; we can always jump to use_dfl_pwq instead. Simplify the
      code by removing the conditional. This makes no functional
      difference.
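
      Roughly, the updated path reduces to (abridged):

      	/* wq_update_unbound_numa() */
      	if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
      		goto use_dfl_pwq;	/* no "already default?" special case */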
      Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      534a3fbb
    • L
      workqueue: fix a possible race condition between rescuer and pwq-release · 77668c8b
      Lai Jiangshan committed
      There is a race condition between rescuer_thread() and
      pwq_unbound_release_workfn().
      
      Even after a pwq is scheduled for rescue, the associated work items
      may be consumed by any worker.  If all of them are consumed before
      the rescuer gets to them and the pwq's base ref was put due to an
      attribute change, the pwq may be released while still being linked on
      the @wq->maydays list, making the rescuer dereference an already-freed
      pwq later.
      
      Make send_mayday() pin the target pwq until the rescuer is done with
      it.
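
      A sketch of the fix (simplified; pool->lock is held by the callers):

      	/* send_mayday(): pin the pwq before linking it */
      	if (list_empty(&pwq->mayday_node)) {
      		get_pwq(pwq);
      		list_add_tail(&pwq->mayday_node, &wq->maydays);
      		wake_up_process(wq->rescuer->task);
      	}

      	/* rescuer_thread(): drop the ref once the rescue is done */
      	put_pwq(pwq);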
      
      tj: Updated comment and patch description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v3.10+
      77668c8b
  5. 18 Apr 2014: 1 commit
    • L
      workqueue: make rescuer_thread() empty wq->maydays list before exiting · 4d595b86
      Lai Jiangshan committed
      After a @pwq is scheduled for emergency execution, other workers may
      consume the affected work items before the rescuer gets to them.
      This means that a workqueue may have pwqs queued on its @wq->maydays
      list while not having any work item pending or in flight.  If
      destroy_workqueue() executes under this condition, the rescuer may
      exit without emptying @wq->maydays.
      
      This currently doesn't cause any actual harm.  destroy_workqueue()
      can safely destroy all the involved data structures whether
      @wq->maydays is populated or not, as nobody accesses the list once
      the rescuer exits.
      
      However, this is nasty and makes future development difficult.  Let's
      update rescuer_thread() so that it empties @wq->maydays after seeing
      should_stop to guarantee that the list is empty on rescuer exit.
      
      tj: Updated comment and patch description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v3.10+
      4d595b86
  6. 17 Apr 2014: 1 commit
  7. 13 Apr 2014: 1 commit
  8. 12 Apr 2014: 1 commit
  9. 11 Apr 2014: 1 commit
  10. 10 Apr 2014: 1 commit
  11. 09 Apr 2014: 4 commits
    • L
      futex: avoid race between requeue and wake · 69cd9eba
      Linus Torvalds committed
      Jan Stancek reported:
       "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP)
        occasionally fails, because some threads fail to wake up.
      
        Testcase creates 5 threads, which are all waiting on same condition.
        Main thread then calls pthread_cond_broadcast() without holding mutex,
        which calls:
      
            futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..)
      
        This immediately wakes up single thread A, which unlocks mutex and
        tries to wake up another thread:
      
            futex(uaddr2, FUTEX_WAKE_PRIVATE, 1)
      
        If thread A manages to call futex_wake() before any waiters are
        requeued for uaddr2, no other thread is woken up"
      
      The ordering constraints for the hash bucket waiter counting are that
      the waiter counts have to be incremented _before_ getting the spinlock
      (because the spinlock acts as part of the memory barrier), but the
      "requeue" operation didn't honor those rules, and nobody had even
      thought about that case.
      
      This fairly simple patch just increments the waiter count for the
      target hash bucket (hb2) when requeueing a futex, before taking the
      locks.  It then decrements it again after releasing the lock - the
      code that actually moves the futex(es) between hash buckets will do
      the additional required waiter count housekeeping.
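
      A sketch of the ordering in futex_requeue() (simplified):

      	hb_waiters_inc(hb2);	/* count the waiter before taking the lock */
      	double_lock_hb(hb1, hb2);
      	/* ... requeue waiters from hb1 to hb2 ... */
      	double_unlock_hb(hb1, hb2);
      	hb_waiters_dec(hb2);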
      Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
      Acked-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org # 3.14
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69cd9eba
    • M
      tracepoint: Fix sparse warnings in tracepoint.c · b725dfea
      Mathieu Desnoyers committed
      Fix the following sparse warnings:
      
        CHECK   kernel/tracepoint.c
      kernel/tracepoint.c:184:18: warning: incorrect type in assignment (different address spaces)
      kernel/tracepoint.c:184:18:    expected struct tracepoint_func *tp_funcs
      kernel/tracepoint.c:184:18:    got struct tracepoint_func [noderef] <asn:4>*funcs
      kernel/tracepoint.c:216:18: warning: incorrect type in assignment (different address spaces)
      kernel/tracepoint.c:216:18:    expected struct tracepoint_func *tp_funcs
      kernel/tracepoint.c:216:18:    got struct tracepoint_func [noderef] <asn:4>*funcs
      kernel/tracepoint.c:392:24: error: return expression in void function
        CC      kernel/tracepoint.o
      kernel/tracepoint.c: In function tracepoint_module_going:
      kernel/tracepoint.c:491:6: warning: symbol 'syscall_regfunc' was not declared. Should it be static?
      kernel/tracepoint.c:508:6: warning: symbol 'syscall_unregfunc' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/1397049883-28692-1-git-send-email-mathieu.desnoyers@efficios.com
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      b725dfea
    • S
      tracepoint: Simplify tracepoint module search · eb7d035c
      Steven Rostedt (Red Hat) committed
      Instead of copying num_tracepoints and tracepoints_ptrs from the
      module structure into the tp_mod structure, which only uses them to
      find the module associated with the tracepoints of modules that are
      coming and going, simply copy the pointer to the module struct into
      the tracepoint tp_module structure.

      Also remove unneeded brackets around an if statement.
      
      Link: http://lkml.kernel.org/r/20140408201705.4dad2c4a@gandalf.local.home
      Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      eb7d035c
    • M
      tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints · de7b2973
      Mathieu Desnoyers committed
      Register/unregister tracepoint probes with struct tracepoint pointer
      rather than tracepoint name.
      
      This change, which vastly simplifies tracepoint.c, was proposed by
      Steven Rostedt. It also removes 8.8kB (mostly of text) from the
      vmlinux size.
      
      From this point on, the tracers need to pass a struct tracepoint pointer
      to probe register/unregister. A probe can now only be connected to a
      tracepoint that exists. Moreover, tracers are responsible for
      unregistering the probe before the module containing its associated
      tracepoint is unloaded.
      
         text    data     bss     dec     hex filename
      10443444        4282528 10391552        25117524        17f4354 vmlinux.orig
      10434930        4282848 10391552        25109330        17f2352 vmlinux
      
      Link: http://lkml.kernel.org/r/1396992381-23785-2-git-send-email-mathieu.desnoyers@efficios.com
      
      CC: Ingo Molnar <mingo@kernel.org>
      CC: Frederic Weisbecker <fweisbec@gmail.com>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Frank Ch. Eigler <fche@redhat.com>
      CC: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      [ SDR - fixed return val in void func in tracepoint_module_going() ]
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      de7b2973
  12. 08 Apr 2014: 16 commits
    • J
      lglock: map to spinlock when !CONFIG_SMP · 64b47e8f
      Josh Triplett committed
      When the system has only one CPU, lglock is effectively a spinlock; map
      it directly to spinlock to eliminate the indirection and duplicate code.
      
      In addition to removing overhead, this drops 1.6k of code with a
      defconfig modified to have !CONFIG_SMP, and 1.1k with a minimal config.
      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64b47e8f
    • C
      modules: use raw_cpu_write for initialization of per cpu refcount. · 08f141d3
      Christoph Lameter committed
      The initialization of a structure is not subject to synchronization.
      The use of __this_cpu would trigger a false positive with the additional
      preemption checks for __this_cpu ops.
      
      So simply disable the check through the use of raw_cpu ops.
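
      Illustratively, the change has this shape (a sketch; the field shown
      follows the module loader's per-cpu refcount setup):

      	/* before: trips the added __this_cpu preemption check */
      	__this_cpu_write(mod->refptr->incs, 1);

      	/* after: plain initialization, no synchronization implied */
      	raw_cpu_write(mod->refptr->incs, 1);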
      
      Trace:
      
        __this_cpu_write operation in preemptible [00000000] code: modprobe/286
        caller is __this_cpu_preempt_check+0x38/0x60
        CPU: 3 PID: 286 Comm: modprobe Tainted: GF            3.12.0-rc4+ #187
        Call Trace:
          dump_stack+0x4e/0x82
          check_preemption_disabled+0xec/0x110
          __this_cpu_preempt_check+0x38/0x60
          load_module+0xcfd/0x2650
          SyS_init_module+0xa6/0xd0
          tracesys+0xe1/0xe6
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08f141d3
    • G
      kernel: use macros from compiler.h instead of __attribute__((...)) · 52f5684c
      Gideon Israel Dsouza committed
      To increase compiler portability there is <linux/compiler.h>, which
      provides convenience macros for various gcc constructs, e.g. __weak
      for __attribute__((weak)).  I've replaced all instances of gcc
      attributes with the right macro in the kernel subsystem.
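
      For example (foo() here is just a placeholder):

      	/* before */
      	void __attribute__((weak)) foo(void);

      	/* after, using <linux/compiler.h> */
      	void __weak foo(void);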
      Signed-off-by: Gideon Israel Dsouza <gidisrael@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52f5684c
    • F
      kernel/panic.c: display reason at end + pr_emerg · d7c0847f
      Fabian Frederick committed
      Currently, booting without an initrd specified on an 80x25 screen
      gives a call trace followed by "atkbd: Spurious ACK".  The original
      message ("VFS: Unable to mount root fs") is not available.  Of course
      this could happen in other situations...

      This patch displays the panic reason after the call trace, which
      could help a lot of people even if it's not the very last line on
      screen.
      
      Also, convert all panic.c printk(KERN_EMERG to pr_emerg(
      
      [akpm@linux-foundation.org: missed a couple of pr_ conversions]
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7c0847f
    • L
      hung_task: check the value of "sysctl_hung_task_timeout_sec" · 80df2847
      Liu Hua committed
      As sysctl_hung_task_timeout_secs is unsigned long, when this value is
      larger than LONG_MAX/HZ, the function schedule_timeout_interruptible()
      in the watchdog will return immediately without sleeping and print:

        schedule_timeout: wrong timeout value ffffffffffffff83

      and then the watchdog function will call
      schedule_timeout_interruptible() again and again.  The screen will be
      filled with

      	"schedule_timeout: wrong timeout value ffffffffffffff83"

      This patch adds checking and correction in sysctl so that
      schedule_timeout_interruptible() always gets a valid parameter.
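
      A sketch of the fix at the sysctl level (simplified):

      	/* kernel/sysctl.c */
      	static unsigned long hung_task_timeout_max = (LONG_MAX / HZ);
      	/* ... in the hung_task_timeout_secs table entry: */
      	.extra2		= &hung_task_timeout_max,	/* reject too-large writes */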
      Signed-off-by: Liu Hua <sdu.liu@huawei.com>
      Tested-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
      Cc: <stable@vger.kernel.org>	[3.4+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80df2847
    • O
      wait: WSTOPPED|WCONTINUED doesn't work if a zombie leader is traced by another process · 7c733eb3
      Oleg Nesterov committed
      Even if the main thread is dead, the process can still stop/continue.
      However, if the leader is ptraced then wait_consider_task(ptrace =>
      false) always skips wait_task_stopped/wait_task_continued, so
      WSTOPPED or WCONTINUED can never work for the natural parent in this
      case.
      
      Move the "A zombie ptracee is only visible to its ptracer" check into the
      "if (!delay_group_leader(p))" block.  ->notask_error is cleared by the
      "fall through" code below.
      
      This depends on the previous change, wait_task_stopped/continued must be
      avoided if !delay_group_leader() and the tracer is ->real_parent.
      Otherwise WSTOPPED|WEXITED could wrongly report "stopped" when the child
      is already dead (single-threaded or not).  If it is traced by another task
      then the "stopped" state is fine until the debugger detaches and reveals a
      zombie state.
      
      Stupid test-case:
      
      	#include <unistd.h>
      	#include <stdlib.h>
      	#include <signal.h>
      	#include <pthread.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>
      	#include <assert.h>

      	void *tfunc(void *arg)
      	{
      		sleep(1);	// wait for zombie leader
      		raise(SIGSTOP);
      		exit(0x13);
      		return NULL;
      	}
      
      	int run_child(void)
      	{
      		pthread_t thread;
      
      		if (!fork()) {
      			int tracee = getppid();
      
      			assert(ptrace(PTRACE_ATTACH, tracee, 0,0) == 0);
      			do
      				ptrace(PTRACE_CONT, tracee, 0,0);
      			while (wait(NULL) > 0);
      
      			return 0;
      		}
      
      		sleep(1);	// wait for PTRACE_ATTACH
      		assert(pthread_create(&thread, NULL, tfunc, NULL) == 0);
      		pthread_exit(NULL);
      	}
      
      	int main(void)
      	{
      		int child, stat;
      
      		child = fork();
      		if (!child)
      			return run_child();
      
      		assert(child == waitpid(-1, &stat, WSTOPPED));
      		assert(stat == 0x137f);
      
      		kill(child, SIGCONT);
      
      		assert(child == waitpid(-1, &stat, WCONTINUED));
      		assert(stat == 0xffff);
      
      		assert(child == waitpid(-1, &stat, 0));
      		assert(stat == 0x1300);
      
      		return 0;
      	}
      
      Without this patch it hangs in waitpid(WSTOPPED), wait_task_stopped() is
      never called.
      
      Note: this doesn't fix all problems with a zombie delay_group_leader(),
      WCONTINUED | WEXITED check is not exactly right.  debugger can't assume it
      will be notified if another thread reaps the whole thread group.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c733eb3
    • O
      wait: WSTOPPED|WCONTINUED hangs if a zombie child is traced by real_parent · 377d75da
      Oleg Nesterov committed
      "A zombie is only visible to its ptracer" logic in wait_consider_task()
      is very wrong. Trivial test-case:
      
      	#include <unistd.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>
      	#include <assert.h>
      
      	int main(void)
      	{
      		int child = fork();
      
      		if (!child) {
      			assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
      			return 0x23;
      		}
      
      		assert(waitid(P_ALL, child, NULL, WEXITED | WNOWAIT) == 0);
      		assert(waitid(P_ALL, 0, NULL, WSTOPPED) == -1);
      		return 0;
      	}
      
      it hangs in waitpid(WSTOPPED) despite the fact that it has a single
      zombie child.  This is because wait_consider_task(ptrace => 0) sees
      p->ptrace and clears ->notask_error, assuming that the debugger
      should detach and notify us.
      
      Change wait_consider_task(ptrace => 0) to pretend that ptrace == T if the
      child is traced by us.  This really simplifies the logic and allows us to
      do more fixes, see the next changes.  This also hides the unwanted group
      stop state automatically, we can remove another ptrace_reparented() check.
      
      Unfortunately, this adds the following behavioural changes:
      
      	1. Before this patch wait(WEXITED | __WNOTHREAD) does not reap
      	   a natural child if it is traced by the caller's sub-thread.
      
      	   Hopefully nobody will ever notice this change, and I think
      	   that nobody should rely on this behaviour anyway.
      
      	2. SIGNAL_STOP_CONTINUED is no longer hidden from the debugger if
      	   it is the real parent.
      
      	   While this change comes as a side effect, I think it is good
      	   by itself. The group continued state cannot be consumed by
      	   another process in this case, it doesn't depend on ptrace,
      	   it doesn't make sense to hide it from real parent.
      
      	   Perhaps we should add the thread_group_leader() check before
      	   wait_task_continued()? May be, but this shouldn't depend on
      	   ptrace_reparented().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      377d75da
    • O
      wait: completely ignore the EXIT_DEAD tasks · b3ab0316
      Oleg Nesterov committed
      Now that EXIT_DEAD is the terminal state it doesn't make sense to call
      eligible_child() or security_task_wait() if the task is really dead.
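
      The check reduces to something like this at the top of
      wait_consider_task() (simplified):

      	if (unlikely(p->exit_state == EXIT_DEAD))
      		return 0;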
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3ab0316
    • O
      wait: use EXIT_TRACE only if thread_group_leader(zombie) · b4360690
      Oleg Nesterov committed
      wait_task_zombie() always uses EXIT_TRACE/ptrace_unlink() if
      ptrace_reparented().  This is suboptimal and a bit confusing: we do
      not need do_notify_parent(p) if !thread_group_leader(p), and in this
      case we also do not need ptrace_unlink(); we can rely on
      ptrace_release_task().
      
      Change wait_task_zombie() to check thread_group_leader() along with
      ptrace_reparented() and simplify the final p->exit_state transition.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4360690
    • O
      wait: introduce EXIT_TRACE to avoid the racy EXIT_DEAD->EXIT_ZOMBIE transition · abd50b39
      Oleg Nesterov committed
      wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
      drops tasklist_lock.  If this task is not the natural child and it is
      traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
      
      The last transition is racy, this is even documented in 50b8d257
      "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
      race".  wait_consider_task() tries to detect this transition and clear
      ->notask_error but we can't rely on ptrace_reparented(), debugger can
      exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
      
      And there is another problem which was missed before: this transition
      can also race with reparent_leader(), which doesn't reset
      ->exit_signal if EXIT_DEAD, assuming that this task must be reaped by
      someone else.  So the tracee can be re-parented with ->exit_signal !=
      SIGCHLD, and if /sbin/init doesn't use __WALL it becomes unreapable.
      This was fixed by the previous commit, but that was a temporary hack.
      
      1. Add the new exit_state, EXIT_TRACE. It means that the task is the
         traced zombie, debugger is going to detach and notify its natural
         parent.
      
         This new state is actually EXIT_ZOMBIE | EXIT_DEAD. This way we
         can avoid the changes in proc/kgdb code, get_task_state() still
         reports "X (dead)" in this case.
      
         Note: with or without this change userspace can see Z -> X -> Z
         transition. Not really bad, but probably makes sense to fix.
      
      2. Change wait_task_zombie() to use EXIT_TRACE instead of EXIT_DEAD
         if we need to notify the ->real_parent.
      
      3. Revert the previous hack in reparent_leader(), now that EXIT_DEAD
         is always the final state we can safely ignore such a task.
      
      4. Change wait_consider_task() to check EXIT_TRACE separately and kill
         the racy and no longer needed ptrace_reparented() case.
      
         If ptrace == T an EXIT_TRACE thread should be simply ignored, the
         owner of this state is going to ptrace_unlink() this task. We can
         pretend that it was already removed from ->ptraced list.
      
         Otherwise we should skip this thread too but clear ->notask_error,
         we must be the natural parent and debugger is going to untrace and
         notify us. IOW, this doesn't differ from "EXIT_ZOMBIE && p->ptrace"
         even if the task was already untraced.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
      Reported-by: Michal Schmidt <mschmidt@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abd50b39
    • O
      wait: fix reparent_leader() vs EXIT_DEAD->EXIT_ZOMBIE race · dfccbb5e
      Oleg Nesterov committed
      wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
      drops tasklist_lock.  If this task is not the natural child and it is
      traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
      
      The last transition is racy, this is even documented in 50b8d257
      "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
      race".  wait_consider_task() tries to detect this transition and clear
      ->notask_error but we can't rely on ptrace_reparented(), debugger can
      exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
      
      And there is another problem which was missed before: this transition
      can also race with reparent_leader(), which doesn't reset
      ->exit_signal if EXIT_DEAD, assuming that this task must be reaped by
      someone else.  So the tracee can be re-parented with ->exit_signal !=
      SIGCHLD, and if /sbin/init doesn't use __WALL it becomes unreapable.
      
      Change reparent_leader() to update ->exit_signal even if EXIT_DEAD.
      Note: this is a simple temporary hack for -stable; it doesn't try to
      solve all problems, and it will be reverted by the next changes.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
      Reported-by: Michal Schmidt <mschmidt@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dfccbb5e
    • G
      kernel/exit.c: call proc_exit_connector() after exit_state is set · ef982393
      Guillaume Morin committed
      The process events connector delivers a notification when a process
      exits.  This is really convenient for a process that spawns and wants to
      monitor its children through an epoll-able() interface.
      
      Unfortunately, there is a small window between when the event is
      delivered and when the child becomes wait()-able.

      This creates a race if the parent wants to make sure that it knows
      about the exit, e.g.:
      
      pid_t pid = fork();
      if (pid > 0) {
      	register_interest_for_pid(pid);
      	if (waitpid(pid, NULL, WNOHANG) > 0)
      	{
      	  /* We might have raced with exit() */
      	}
      	return;
      }
      
      /* Child */
      execve(...)
      
      register_interest_for_pid() would tell the connector socket reader
      to pay attention to events related to pid.
      
      Though this is not a bug, I think it would make the connector a bit more
      usable if this race was closed by simply moving the call to
      proc_exit_connector() from just before exit_notify() to right after.
      
      Oleg said:
      
      : Even with this patch the code above is still "racy" if the child is
      : multi-threaded.  Plus it should obviously filter-out subthreads.  And
      : afaics there is no way to make it reliable, even if you change the code
      : above so that waitpid() is called only after the last thread exits WNOHANG
      : still can fail.
      Signed-off-by: Guillaume Morin <guillaume@morinfr.org>
      Cc: Matt Helsley <matt.helsley@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef982393
    • O
      exit: move check_stack_usage() to the end of do_exit() · 4bcb8232
      Oleg Nesterov committed
      It is not clear why check_stack_usage() is called so early and thus it
      never checks the stack usage in, say, exit_notify() or
      flush_ptrace_hw_breakpoint() or other functions which are only called by
      do_exit().
      
      Move the callsite down to the last preempt_disable/schedule.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4bcb8232
    • O
      exit: call disassociate_ctty() before exit_task_namespaces() · c39df5fa
      Oleg Nesterov committed
      Commit 8aac6270 ("move exit_task_namespaces() outside of
      exit_notify()") breaks pppd and the exiting service crashes the kernel:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
          IP: ppp_register_channel+0x13/0x20 [ppp_generic]
          Call Trace:
            ppp_asynctty_open+0x12b/0x170 [ppp_async]
            tty_ldisc_open.isra.2+0x27/0x60
            tty_ldisc_hangup+0x1e3/0x220
            __tty_hangup+0x2c4/0x440
            disassociate_ctty+0x61/0x270
            do_exit+0x7f2/0xa50
      
      ppp_register_channel() needs ->net_ns and current->nsproxy == NULL.
      
      Move disassociate_ctty() before exit_task_namespaces(), it doesn't make
      sense to delay it after perf_event_exit_task() or cgroup_exit().
      
      This also allows us to use task_work_add() inside the (nontrivial)
      code paths in disassociate_ctty().
      
      Investigated by Peter Hurley.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Sree Harsha Totakura <sreeharsha@totakura.in>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Sree Harsha Totakura <sreeharsha@totakura.in>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>	[v3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c39df5fa
    • D
      res_counter: remove interface for locked charging and uncharging · 539a13b4
      David Rientjes committed
      The res_counter_{charge,uncharge}_locked() variants are not used in the
      kernel outside of the resource counter code itself, so remove the
      interface.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      539a13b4
    • D
      mm, mempolicy: remove per-process flag · f0432d15
      David Rientjes committed
      PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
      There's no significant performance degradation to checking
      current->mempolicy rather than current->flags & PF_MEMPOLICY in the
      allocation path, especially since this is considered unlikely().
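
      A sketch of the resulting check in the slab allocation path
      (simplified):

      	/* was: if (unlikely(current->flags & PF_MEMPOLICY)) */
      	if (unlikely(current->mempolicy))
      		objp = alternate_node_alloc(cachep, flags);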
      
      Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with
      64GB of memory and without a mempolicy:
      
      	threads		before		after
      	16		1249409		1244487
      	32		1281786		1246783
      	48		1239175		1239138
      	64		1244642		1241841
      	80		1244346		1248918
      	96		1266436		1254316
      	112		1307398		1312135
      	128		1327607		1326502
      
      Per-process flags are a scarce resource so we should free them up whenever
      possible and make them available.  We'll be using it shortly for memcg oom
      reserves.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0432d15