1. 28 Jun, 2011 2 commits
    • kill tracehook_notify_death() · 45cdf5cc
      Oleg Nesterov committed
      Kill tracehook_notify_death(), reimplement the logic in its caller,
      exit_notify().
      
      Also, change the exec_id check to use thread_group_leader() instead
      of task_detached(), which is clearer. This logic only applies to
      the exiting leader; a sub-thread must never change its exit_signal.
      
      Note: when the traced group leader exits the exit_signal-or-SIGCHLD
      logic looks really strange:
      
      	- we notify the tracer even if !thread_group_empty() but
      	   do_wait(WEXITED) can't work until all threads exit
      
      	- if the tracer is real_parent, it is not clear why we can't
      	  use ->exit_signal even if !thread_group_empty()
      
      -v2: do not try to fix the 2nd oddity to avoid the subtle behavior
           change mixed with reorganization, suggested by Tejun.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
    • make do_notify_parent() return bool · 53c8f9f1
      Oleg Nesterov committed
      - change do_notify_parent() to return a boolean, true if the task should
        be reaped because its parent ignores SIGCHLD.
      
      - update the only caller which checks the returned value, exit_notify().
      
      This temporarily uglifies exit_notify() even more; it will be cleaned
      up by the next change.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
  2. 23 Jun, 2011 3 commits
    • ptrace: kill clone/exec tracehooks · 4b9d33e6
      Tejun Heo committed
      At this point, tracehooks aren't useful to mainline kernel and mostly
      just add an extra layer of obfuscation.  Although they have comments,
      without actual in-kernel users, it is difficult to tell what their
      assumptions are and what they're actually trying to achieve.  To
      mainline kernel, they just aren't worth keeping around.
      
      This patch kills the following clone and exec related tracehooks.
      
      	tracehook_prepare_clone()
      	tracehook_finish_clone()
      	tracehook_report_clone()
      	tracehook_report_clone_complete()
      	tracehook_unsafe_exec()
      
      The changes are mostly trivial - logic is moved to the caller and
      comments are merged and adjusted appropriately.
      
      The only exception is in check_unsafe_exec() where LSM_UNSAFE_PTRACE*
      are OR'd to bprm->unsafe instead of setting it, which produces the
      same result as the field is always zero on entry.  It also tests
      p->ptrace instead of (p->ptrace & PT_PTRACED) for consistency, which
      also gives the same result.
      
      This doesn't introduce any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • ptrace: kill trivial tracehooks · a288eecc
      Tejun Heo committed
      At this point, tracehooks aren't useful to mainline kernel and mostly
      just add an extra layer of obfuscation.  Although they have comments,
      without actual in-kernel users, it is difficult to tell what their
      assumptions are and what they're actually trying to achieve.  To
      mainline kernel, they just aren't worth keeping around.
      
      This patch kills the following trivial tracehooks.
      
      * Ones testing whether task is ptraced.  Replace with ->ptrace test.
      
      	tracehook_expect_breakpoints()
      	tracehook_consider_ignored_signal()
      	tracehook_consider_fatal_signal()
      
      * ptrace_event() wrappers.  Call directly.
      
      	tracehook_report_exec()
      	tracehook_report_exit()
      	tracehook_report_vfork_done()
      
      * ptrace_release_task() wrapper.  Call directly.
      
      	tracehook_finish_release_task()
      
      * noop
      
      	tracehook_prepare_release_task()
      	tracehook_report_death()
      
      This doesn't introduce any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • ptrace: kill task_ptrace() · d21142ec
      Tejun Heo committed
      task_ptrace(task) simply dereferences task->ptrace and isn't even used
      consistently, which only adds confusion.  Kill it and directly access
      ->ptrace instead.
      
      This doesn't introduce any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
  3. 17 Jun, 2011 5 commits
    • ptrace: implement PTRACE_LISTEN · 544b2c91
      Tejun Heo committed
      The previous patch implemented async notification for ptrace but it
      only worked while tracee is running.  This patch introduces
      PTRACE_LISTEN which was suggested by Oleg Nesterov.
      
      It's allowed iff tracee is in STOP trap and puts tracee into
      quasi-running state - tracee never really runs but wait(2) and
      ptrace(2) consider it to be running.  While ptracer is listening,
      tracee is allowed to re-enter STOP to notify an async event.
      Listening state is cleared on the first notification.  Ptracer can
      also clear it by issuing INTERRUPT - tracee will re-trap into STOP
      with listening state cleared.
      
      This allows ptracer to monitor group stop state without running tracee
      - use INTERRUPT to put tracee into STOP trap, issue LISTEN and then
      wait(2) to wait for the next group stop event.  When it happens,
      PTRACE_GETSIGINFO provides information to determine the current state.
      
      Test program follows.
      
        #define PTRACE_SEIZE		0x4206
        #define PTRACE_INTERRUPT	0x4207
        #define PTRACE_LISTEN		0x4208
      
        #define PTRACE_SEIZE_DEVEL	0x80000000
      
        static const struct timespec ts1s = { .tv_sec = 1 };
      
        int main(int argc, char **argv)
        {
      	  pid_t tracee, tracer;
      	  int i;
      
      	  tracee = fork();
      	  if (!tracee)
      		  while (1)
      			  pause();
      
      	  tracer = fork();
      	  if (!tracer) {
      		  siginfo_t si;
      
      		  ptrace(PTRACE_SEIZE, tracee, NULL,
      			 (void *)(unsigned long)PTRACE_SEIZE_DEVEL);
      		  ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL);
      	  repeat:
      		  waitid(P_PID, tracee, NULL, WSTOPPED);
      
      		  ptrace(PTRACE_GETSIGINFO, tracee, NULL, &si);
      		  if (!si.si_code) {
      			  printf("tracer: SIG %d\n", si.si_signo);
      			  ptrace(PTRACE_CONT, tracee, NULL,
      				 (void *)(unsigned long)si.si_signo);
      			  goto repeat;
      		  }
      		  printf("tracer: stopped=%d signo=%d\n",
      			 si.si_signo != SIGTRAP, si.si_signo);
      		  if (si.si_signo != SIGTRAP)
      			  ptrace(PTRACE_LISTEN, tracee, NULL, NULL);
      		  else
      			  ptrace(PTRACE_CONT, tracee, NULL, NULL);
      		  goto repeat;
      	  }
      
      	  for (i = 0; i < 3; i++) {
      		  nanosleep(&ts1s, NULL);
      		  printf("mother: SIGSTOP\n");
      		  kill(tracee, SIGSTOP);
      		  nanosleep(&ts1s, NULL);
      		  printf("mother: SIGCONT\n");
      		  kill(tracee, SIGCONT);
      	  }
      	  nanosleep(&ts1s, NULL);
      
      	  kill(tracer, SIGKILL);
      	  kill(tracee, SIGKILL);
      	  return 0;
        }
      
      This is identical to the program to test TRAP_NOTIFY except that
      tracee is PTRACE_LISTEN'd instead of PTRACE_CONT'd when group stopped.
      This allows ptracer to monitor when group stop ends without running
      tracee.
      
        # ./test-listen
        tracer: stopped=0 signo=5
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
      
      -v2: Moved JOBCTL_LISTENING check in wait_task_stopped() into
           task_stopped_code() as suggested by Oleg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
    • ptrace: implement TRAP_NOTIFY and use it for group stop events · fb1d910c
      Tejun Heo committed
      Currently there's no way for ptracer to find out whether group stop
      finished other than polling with INTERRUPT - GETSIGINFO - CONT
      sequence.  This patch implements group stop notification for ptracer
      using STOP traps.
      
      When group stop state of a seized tracee changes, JOBCTL_TRAP_NOTIFY
      is set, which schedules a STOP trap which is sticky - it isn't cleared
      by other traps and at least one STOP trap will happen eventually.
      STOP trap is synchronization point for event notification and the
      tracer can determine the current group stop state by looking at the
      signal number portion of exit code (si_status from waitid(2) or
      si_code from PTRACE_GETSIGINFO).
      
      Notifications are generated both on start and end of group stops but,
      because group stop participation always happens before STOP trap, this
      doesn't cause an extra trap while tracee is participating in group
      stop.  The symmetry will be useful later.
      
      Note that this notification works iff tracee is not trapped.
      Currently there is no way to be notified of group stop state changes
      while tracee is trapped.  This will be addressed by a later patch.
      
      An example program follows.
      
        #define PTRACE_SEIZE		0x4206
        #define PTRACE_INTERRUPT	0x4207
      
        #define PTRACE_SEIZE_DEVEL	0x80000000
      
        static const struct timespec ts1s = { .tv_sec = 1 };
      
        int main(int argc, char **argv)
        {
      	  pid_t tracee, tracer;
      	  int i;
      
      	  tracee = fork();
      	  if (!tracee)
      		  while (1)
      			  pause();
      
      	  tracer = fork();
      	  if (!tracer) {
      		  siginfo_t si;
      
      		  ptrace(PTRACE_SEIZE, tracee, NULL,
      			 (void *)(unsigned long)PTRACE_SEIZE_DEVEL);
      		  ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL);
      	  repeat:
      		  waitid(P_PID, tracee, NULL, WSTOPPED);
      
      		  ptrace(PTRACE_GETSIGINFO, tracee, NULL, &si);
      		  if (!si.si_code) {
      			  printf("tracer: SIG %d\n", si.si_signo);
      			  ptrace(PTRACE_CONT, tracee, NULL,
      				 (void *)(unsigned long)si.si_signo);
      			  goto repeat;
      		  }
      		  printf("tracer: stopped=%d signo=%d\n",
      			 si.si_signo != SIGTRAP, si.si_signo);
      		  ptrace(PTRACE_CONT, tracee, NULL, NULL);
      		  goto repeat;
      	  }
      
      	  for (i = 0; i < 3; i++) {
      		  nanosleep(&ts1s, NULL);
      		  printf("mother: SIGSTOP\n");
      		  kill(tracee, SIGSTOP);
      		  nanosleep(&ts1s, NULL);
      		  printf("mother: SIGCONT\n");
      		  kill(tracee, SIGCONT);
      	  }
      	  nanosleep(&ts1s, NULL);
      
      	  kill(tracer, SIGKILL);
      	  kill(tracee, SIGKILL);
      	  return 0;
        }
      
      In the above program, tracer keeps tracee running and gets
      a notification of each group stop state change.
      
        # ./test-notify
        tracer: stopped=0 signo=5
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
    • ptrace: implement PTRACE_INTERRUPT · fca26f26
      Tejun Heo committed
      Currently, there's no way to trap a running ptracee short of sending a
      signal which has various side effects.  This patch implements
      PTRACE_INTERRUPT which traps ptracee without any signal or job control
      related side effect.
      
      The implementation is almost trivial.  It uses the group stop trap -
      SIGTRAP | PTRACE_EVENT_STOP << 8.  A new trap flag
      JOBCTL_TRAP_INTERRUPT is added, which is set on PTRACE_INTERRUPT and
      cleared when any trap happens.  As INTERRUPT should be useable
      regardless of the current state of tracee, task_is_traced() test in
      ptrace_check_attach() is skipped for INTERRUPT.
      
      PTRACE_INTERRUPT is available iff tracee is attached with
      PTRACE_SEIZE.
      
      Test program follows.
      
        #define PTRACE_SEIZE		0x4206
        #define PTRACE_INTERRUPT	0x4207
      
        #define PTRACE_SEIZE_DEVEL	0x80000000
      
        static const struct timespec ts100ms = { .tv_nsec = 100000000 };
        static const struct timespec ts1s = { .tv_sec = 1 };
        static const struct timespec ts3s = { .tv_sec = 3 };
      
        int main(int argc, char **argv)
        {
      	  pid_t tracee;
      
      	  tracee = fork();
      	  if (tracee == 0) {
      		  nanosleep(&ts100ms, NULL);
      		  while (1) {
      			  printf("tracee: alive pid=%d\n", getpid());
      			  nanosleep(&ts1s, NULL);
      		  }
      	  }
      
      	  if (argc > 1)
      		  kill(tracee, SIGSTOP);
      
      	  nanosleep(&ts100ms, NULL);
      
      	  ptrace(PTRACE_SEIZE, tracee, NULL,
      		 (void *)(unsigned long)PTRACE_SEIZE_DEVEL);
      	  if (argc > 1) {
      		  waitid(P_PID, tracee, NULL, WSTOPPED);
      		  ptrace(PTRACE_CONT, tracee, NULL, NULL);
      	  }
      	  nanosleep(&ts3s, NULL);
      
      	  printf("tracer: INTERRUPT and DETACH\n");
      	  ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL);
      	  waitid(P_PID, tracee, NULL, WSTOPPED);
      	  ptrace(PTRACE_DETACH, tracee, NULL, NULL);
      	  nanosleep(&ts3s, NULL);
      
      	  printf("tracer: exiting\n");
      	  kill(tracee, SIGKILL);
      	  return 0;
        }
      
      When called without argument, tracee is seized from running state,
      interrupted and then detached back to running state.
      
        # ./test-interrupt
        tracee: alive pid=4546
        tracee: alive pid=4546
        tracee: alive pid=4546
        tracer: INTERRUPT and DETACH
        tracee: alive pid=4546
        tracee: alive pid=4546
        tracee: alive pid=4546
        tracer: exiting
      
      When called with argument, tracee is seized from stopped state,
      continued, interrupted and then detached back to stopped state.
      
        # ./test-interrupt  1
        tracee: alive pid=4548
        tracee: alive pid=4548
        tracee: alive pid=4548
        tracer: INTERRUPT and DETACH
        tracer: exiting
      
      Before PTRACE_INTERRUPT, once the tracee was running, there was no way
      to trap tracee and do PTRACE_DETACH without causing side effects.
      
      -v2: Updated to use task_set_jobctl_pending() so that it doesn't end
           up scheduling TRAP_STOP if child is dying which may make the
           child unkillable.  Spotted by Oleg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
    • ptrace: implement PTRACE_SEIZE · 3544d72a
      Tejun Heo committed
      PTRACE_ATTACH implicitly issues SIGSTOP on attach which has side
      effects on tracee signal and job control states.  This patch
      implements a new ptrace request PTRACE_SEIZE which attaches a tracee
      without trapping it or affecting its signal and job control states.
      
      The usage is the same as with PTRACE_ATTACH but it takes PTRACE_SEIZE_*
      flags in @data.  Currently, the only defined flag is
      PTRACE_SEIZE_DEVEL which is a temporary flag to enable PTRACE_SEIZE.
      PTRACE_SEIZE will change ptrace behaviors outside of attach itself.
      The changes will be implemented gradually and the DEVEL flag is to
      prevent programs which expect full SEIZE behavior from using it before
      all the behavior modifications are complete while allowing unit
      testing.  The flag will be removed once SEIZE behaviors are completely
      implemented.
      
      * PTRACE_SEIZE, unlike ATTACH, doesn't force tracee to trap.  After
        attaching, tracee continues to run unless a trap condition occurs.
      
      * PTRACE_SEIZE doesn't affect signal or group stop state.
      
      * If PTRACE_SEIZE'd, group stop uses PTRACE_EVENT_STOP trap which uses
        exit_code of (signr | PTRACE_EVENT_STOP << 8) where signr is one of
        the stopping signals if group stop is in effect or SIGTRAP
        otherwise, and returns usual trap siginfo on PTRACE_GETSIGINFO
        instead of NULL.
      
      Seizing sets PT_SEIZED in ->ptrace of the tracee.  This flag will be
      used to determine whether new SEIZE behaviors should be enabled.
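
As a rough illustration of the exit_code layout described above, the sketch below splits a code of the form (signr | event << 8) into its parts. The struct and the helper decode_stop_code() are hypothetical names introduced only for this example, and the event number is simply read back out of the code rather than compared against PTRACE_EVENT_STOP, since its numeric value is not given here.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical container for the decoded pieces of a stop code. */
struct stop_code {
	int signr;          /* low byte: stop signal, or SIGTRAP */
	int event;          /* next byte: ptrace event number */
	bool group_stopped; /* true if signr is not SIGTRAP */
};

/*
 * Split a code of the form (signr | event << 8).  Only meaningful for
 * the group stop trap of a seized tracee, as described above.
 */
static struct stop_code decode_stop_code(int code)
{
	struct stop_code sc;

	assert(code >= 0);
	sc.signr = code & 0xff;
	sc.event = (code >> 8) & 0xff;
	/* A stop signal in the low byte means group stop is in effect. */
	sc.group_stopped = sc.signr != SIGTRAP;
	return sc;
}
```

A tracer would pass in the exit code it obtained for the stop and branch on group_stopped.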
      
      Test program follows.
      
        #define PTRACE_SEIZE		0x4206
        #define PTRACE_SEIZE_DEVEL	0x80000000
      
        static const struct timespec ts100ms = { .tv_nsec = 100000000 };
        static const struct timespec ts1s = { .tv_sec = 1 };
        static const struct timespec ts3s = { .tv_sec = 3 };
      
        int main(int argc, char **argv)
        {
      	  pid_t tracee;
      
      	  tracee = fork();
      	  if (tracee == 0) {
      		  nanosleep(&ts100ms, NULL);
      		  while (1) {
      			  printf("tracee: alive\n");
      			  nanosleep(&ts1s, NULL);
      		  }
      	  }
      
      	  if (argc > 1)
      		  kill(tracee, SIGSTOP);
      
      	  nanosleep(&ts100ms, NULL);
      
      	  ptrace(PTRACE_SEIZE, tracee, NULL,
      		 (void *)(unsigned long)PTRACE_SEIZE_DEVEL);
      	  if (argc > 1) {
      		  waitid(P_PID, tracee, NULL, WSTOPPED);
      		  ptrace(PTRACE_CONT, tracee, NULL, NULL);
      	  }
      	  nanosleep(&ts3s, NULL);
      	  printf("tracer: exiting\n");
      	  return 0;
        }
      
      When the above program is called w/o argument, tracee is seized while
      running and remains running.  When tracer exits, tracee continues to
      run and print out messages.
      
        # ./test-seize-simple
        tracee: alive
        tracee: alive
        tracee: alive
        tracer: exiting
        tracee: alive
        tracee: alive
      
      When called with an argument, tracee is seized from stopped state and
      continued, and returns to stopped state when tracer exits.
      
        # ./test-seize
        tracee: alive
        tracee: alive
        tracee: alive
        tracer: exiting
        # ps -el|grep test-seize
        1 T     0  4720     1  0  80   0 -   941 signal ttyS0    00:00:00 test-seize
      
      -v2: SEIZE doesn't schedule TRAP_STOP and leaves tracee running as Jan
           suggested.
      
      -v3: PTRACE_EVENT_STOP traps now report group stop state by signr.  If
           group stop is in effect the stop signal number is returned as
           part of exit_code; otherwise, SIGTRAP.  This was suggested by
           Denys and Oleg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Denys Vlasenko <vda.linux@googlemail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
    • job control: introduce JOBCTL_TRAP_STOP and use it for group stop trap · 73ddff2b
      Tejun Heo committed
      do_signal_stop() implemented both normal group stop and trap for group
      stop while ptraced.  This approach has been enough so far, but scheduled
      changes require a trap mechanism which can be used in a more generic
      manner, and using the group stop trap as a generic trap site simplifies
      both the userland visible interface and the implementation.
      
      This patch adds a new jobctl flag - JOBCTL_TRAP_STOP.  When set, it
      triggers a trap site, which behaves like group stop trap, in
      get_signal_to_deliver() after checking for pending signals.  While
      ptraced, do_signal_stop() doesn't stop itself.  It initiates group
      stop if requested and schedules JOBCTL_TRAP_STOP and returns.  The
      caller - get_signal_to_deliver() - is responsible for checking whether
      TRAP_STOP is pending afterwards and handling it.
      
      ptrace_attach() is updated to use JOBCTL_TRAP_STOP instead of
      JOBCTL_STOP_PENDING and __ptrace_unlink() to clear all pending trap
      bits and TRAPPING so that TRAP_STOP and future trap bits don't linger
      after detach.
      
      While at it, add proper function comment to do_signal_stop() and make
      it return bool.
      
      -v2: __ptrace_unlink() updated to clear JOBCTL_TRAP_MASK and TRAPPING
           instead of JOBCTL_PENDING_MASK.  This avoids accidentally
           clearing JOBCTL_STOP_CONSUME.  Spotted by Oleg.
      
      -v3: do_signal_stop() updated to return %false without dropping
           siglock while ptraced and TRAP_STOP check moved inside for(;;)
           loop after group stop participation.  This avoids unnecessary
           relocking and also will help avoiding unnecessary traps by
           consuming group stop before handling pending traps.
      
      -v4: Jobctl trap handling moved into a separate function -
           do_jobctl_trap().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
  4. 05 Jun, 2011 9 commits
    • signal: remove three noop tracehooks · dd1d6772
      Tejun Heo committed
      Remove the following three noop tracehooks in signal.c.
      
      * tracehook_force_sigpending()
      * tracehook_get_signal()
      * tracehook_finish_jctl()
      
      The code area is about to be updated and these hooks don't do anything
      other than obfuscating the logic.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • ptrace: use bit_waitqueue for TRAPPING instead of wait_chldexit · 62c124ff
      Tejun Heo committed
      ptracer->signal->wait_chldexit was used to wait for TRAPPING; however,
      ->wait_chldexit was already complicated with waker-side filtering
      without adding TRAPPING wait on top of it.  Also, it unnecessarily
      made TRAPPING clearing depend on the current ptrace relationship - if
      the ptracee is detached, wakeup is lost.
      
      There is no reason to use signal->wait_chldexit here.  We're just
      waiting for JOBCTL_TRAPPING bit to clear and given the relatively
      infrequent use of ptrace, bit_waitqueue can serve it perfectly.
      
      This patch makes JOBCTL_TRAPPING wait use bit_waitqueue instead of
      signal->wait_chldexit.
      
      -v2: Use JOBCTL_*_BIT macros instead of ilog2() as suggested by Linus.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • job control: introduce task_set_jobctl_pending() · 7dd3db54
      Tejun Heo committed
      task->jobctl currently hosts JOBCTL_STOP_PENDING and will host TRAP
      pending bits too.  Setting pending conditions on a dying task may make
      the task unkillable.  Currently, each setting site is responsible for
      checking for the condition but with to-be-added job control traps this
      becomes too fragile.
      
      This patch adds task_set_jobctl_pending() which should be used when
      setting task->jobctl bits to schedule a stop or trap.  The function
      performs the following to ease setting pending bits.
      
      * Sanity checks.
      
      * If fatal signal is pending or PF_EXITING is set, no bit is set.
      
      * STOP_SIGMASK is automatically cleared if new value is being set.
      
      do_signal_stop() and ptrace_attach() are updated to use
      task_set_jobctl_pending() instead of setting STOP_PENDING explicitly.
      The surrounding structures around setting are changed to fit
      task_set_jobctl_pending() better but there should be no userland
      visible behavior difference.
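
The three rules listed above can be modelled in userspace roughly as follows. This is only a sketch: struct task_model, the MODEL_* bit values and model_set_jobctl_pending() are hypothetical stand-ins, not the kernel's definitions, and the single assert() stands in for whatever sanity checks the real function performs.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit layout only; the real values live in kernel headers. */
#define MODEL_STOP_SIGMASK  0xffffu     /* signal number of the group stop */
#define MODEL_STOP_PENDING  (1u << 17)  /* task should stop for group stop */

/* Hypothetical stand-in for the few task fields this logic looks at. */
struct task_model {
	unsigned int jobctl;
	bool fatal_signal_pending;
	bool exiting;               /* models PF_EXITING */
};

/* Returns true if the bits in @mask were set, false if the task is dying. */
static bool model_set_jobctl_pending(struct task_model *t, unsigned int mask)
{
	/* Sanity check: callers must ask for at least one bit. */
	assert(mask != 0);

	/* Never schedule a stop or trap on a task that is on its way out. */
	if (t->fatal_signal_pending || t->exiting)
		return false;

	/* A new stop signal number replaces the previous one. */
	if (mask & MODEL_STOP_SIGMASK)
		t->jobctl &= ~MODEL_STOP_SIGMASK;

	t->jobctl |= mask;
	return true;
}
```

Centralizing the dying-task check in one helper is what lets each call site stop repeating it.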
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • job control: make task_clear_jobctl_pending() clear TRAPPING automatically · 6dfca329
      Tejun Heo committed
      JOBCTL_TRAPPING indicates that ptracer is waiting for tracee to
      (re)transit into TRACED.  task_clear_jobctl_trapping() must be called
      when either tracee enters TRACED or the transition is cancelled for
      some reason.  The former is achieved by explicitly calling
      task_clear_jobctl_trapping() in ptrace_stop() and the latter by calling
      it at the end of do_signal_stop().
      
      Calling task_clear_jobctl_trapping() at the end of do_signal_stop()
      limits the scope in which TRAPPING can be used and is fragile in that
      seemingly unrelated changes to tracee's control flow can lead to stuck
      TRAPPING.
      
      We already have task_clear_jobctl_pending() calls on those cancelling
      events to clear JOBCTL_STOP_PENDING.  Cancellations can be handled by
      making those call sites use JOBCTL_PENDING_MASK instead and updating
      task_clear_jobctl_pending() such that task_clear_jobctl_trapping() is
      called automatically if no stop/trap is pending.
      
      This patch makes the above changes and removes the fallback
      task_clear_jobctl_trapping() call from do_signal_stop().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • job control: introduce JOBCTL_PENDING_MASK and task_clear_jobctl_pending() · 3759a0d9
      Tejun Heo committed
      This patch introduces JOBCTL_PENDING_MASK and replaces
      task_clear_jobctl_stop_pending() with task_clear_jobctl_pending()
      which takes an extra @mask argument.
      
      JOBCTL_PENDING_MASK is currently equal to JOBCTL_STOP_PENDING but
      future patches will add more bits.  recalc_sigpending_tsk() is updated
      to use JOBCTL_PENDING_MASK instead.
      
      task_clear_jobctl_pending() takes @mask which is a subset of
      JOBCTL_PENDING_MASK and clears the relevant jobctl bits.  If
      JOBCTL_STOP_PENDING is set, other STOP bits are cleared together.  All
      task_clear_jobctl_stop_pending() users are updated to call
      task_clear_jobctl_pending() with JOBCTL_STOP_PENDING which is
      functionally identical to task_clear_jobctl_stop_pending().
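
A minimal userspace sketch of the masked clear described above might look like this. The MODEL_* values, struct task_model and model_clear_jobctl_pending() are hypothetical, and which "other STOP bits" get cleared along with STOP_PENDING is an assumption made for the example (CONSUME and DEQUEUED here).

```c
#include <assert.h>

/* Illustrative bit layout only; not the kernel's actual values. */
#define MODEL_STOP_SIGMASK   0xffffu
#define MODEL_STOP_DEQUEUED  (1u << 16)
#define MODEL_STOP_PENDING   (1u << 17)
#define MODEL_STOP_CONSUME   (1u << 18)

/* Currently a single bit; more pending bits could be OR'd in later. */
#define MODEL_PENDING_MASK   MODEL_STOP_PENDING

/* Hypothetical stand-in for the task's jobctl word. */
struct task_model {
	unsigned int jobctl;
};

static void model_clear_jobctl_pending(struct task_model *t, unsigned int mask)
{
	/* @mask must be a subset of the pending mask. */
	assert((mask & ~MODEL_PENDING_MASK) == 0);

	/* Assumption: clearing STOP_PENDING also drops these STOP bits. */
	if (mask & MODEL_STOP_PENDING)
		mask |= MODEL_STOP_CONSUME | MODEL_STOP_DEQUEUED;

	t->jobctl &= ~mask;
}
```

Note that the stop signal number in the low bits is left alone in this model; only the pending state is dropped.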
      
      This patch doesn't cause any functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • ptrace: relocate set_current_state(TASK_TRACED) in ptrace_stop() · 81be24b8
      Tejun Heo committed
      In ptrace_stop(), after arch hook is done, the task state and jobctl
      bits are updated while holding siglock.  The ordering requirement
      there is that TASK_TRACED is set before JOBCTL_TRAPPING is cleared, so
      that a ptracer waiting on TRAPPING doesn't end up waking up before
      TRACED is actually set and seeing TASK_RUNNING in wait(2).
      
      Move set_current_state(TASK_TRACED) to the top of the block and
      reorganize comments.  This makes the ordering more obvious
      (TASK_TRACED before other updates) and helps future updates to group
      stop participation.
      
      This patch doesn't cause any functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • ptrace: ptrace_check_attach(): rename @kill to @ignore_state and add comments · 755e276b
      Tejun Heo committed
      PTRACE_INTERRUPT is going to be added which should also skip
      task_is_traced() check in ptrace_check_attach().  Rename @kill to
      @ignore_state and make it bool.  Add function comment while at it.
      
      This patch doesn't introduce any behavior difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • job control: rename signal->group_stop and flags to jobctl and update them · a8f072c1
      Tejun Heo committed
      signal->group_stop currently hosts mostly group stop related flags;
      however, it's gonna be used for wider purposes and the GROUP_STOP_
      flag prefix becomes confusing.  Rename signal->group_stop to
      signal->jobctl and rename all GROUP_STOP_* flags to JOBCTL_*.
      
      Bit position macros JOBCTL_*_BIT are defined and JOBCTL_* flags are
      defined in terms of them to allow using bitops later.
      
      While at it, reassign JOBCTL_TRAPPING to bit 22 to better accommodate
      future additions.
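
The "bit position first, flag second" pattern mentioned above can be sketched like this. Only bit 22 for TRAPPING is taken from the text; model_test_bit() and model_clear_bit() are hypothetical userspace stand-ins for bit operations, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit position macro, and the flag defined in terms of it. */
#define JOBCTL_TRAPPING_BIT	22
#define JOBCTL_TRAPPING		(1u << JOBCTL_TRAPPING_BIT)

/* Hypothetical helper: test bit @nr of @word. */
static bool model_test_bit(int nr, const unsigned int *word)
{
	assert(nr >= 0 && nr < 32);
	return (*word >> nr) & 1u;
}

/* Hypothetical helper: clear bit @nr of @word. */
static void model_clear_bit(int nr, unsigned int *word)
{
	assert(nr >= 0 && nr < 32);
	*word &= ~(1u << nr);
}
```

Having the position available as a plain integer is what allows bit-indexed interfaces to address a single flag, while mask-style code keeps using the JOBCTL_* form.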
      
      This doesn't cause any functional change.
      
      -v2: JOBCTL_*_BIT macros added as suggested by Linus.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • ptrace: remove silly wait_trap variable from ptrace_attach() · 0b1007c3
      Tejun Heo committed
      Remove local variable wait_trap which determines whether to wait for
      !TRAPPING or not and simply wait for it if attach was successful.
      
      -v2: Oleg pointed out wait should happen iff attach was successful.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
  5. 31 May, 2011 1 commit
  6. 30 May, 2011 1 commit
    • mm: Fix boot crash in mm_alloc() · 6345d24d
      Linus Torvalds committed
      Thomas Gleixner reports that we now have a boot crash triggered by
      CONFIG_CPUMASK_OFFSTACK=y:
      
          BUG: unable to handle kernel NULL pointer dereference at   (null)
          IP: [<c11ae035>] find_next_bit+0x55/0xb0
          Call Trace:
           [<c11addda>] cpumask_any_but+0x2a/0x70
           [<c102396b>] flush_tlb_mm+0x2b/0x80
           [<c1022705>] pud_populate+0x35/0x50
           [<c10227ba>] pgd_alloc+0x9a/0xf0
           [<c103a3fc>] mm_init+0xec/0x120
           [<c103a7a3>] mm_alloc+0x53/0xd0
      
      which was introduced by commit de03c72c ("mm: convert
      mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of
      mm_init() vs mm_init_cpumask().
      
      Thomas wrote a patch to just fix the ordering of initialization, but I
      hate the new double allocation in the fork path, so I ended up instead
      doing some more radical surgery to clean it all up.
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 29 May, 2011 1 commit
    • idle governor: Avoid lock acquisition to read pm_qos before entering idle · 333c5ae9
      Tim Chen committed
      Thanks to the reviews and comments by Rafael, James, Mark and Andi.
      Here's version 2 of the patch incorporating your comments and also some
      update to my previous patch comments.
      
      I noticed that before entering idle state, the menu idle governor will
      look up the current pm_qos target value according to the list of qos
      requests received.  This look up currently needs the acquisition of a
      lock to access the list of qos requests to find the qos target value,
      slowing down the entrance into idle state due to contention by multiple
      cpus to access this list.  The contention is severe when there are a lot
      of cpus waking and going into idle.  For example, for a simple workload
      that has 32 pairs of processes ping-ponging messages to each other, where
      64 cpu cores are active in the test system, I see the following profile with
      37.82% of cpu cycles spent in contention of pm_qos_lock:
      
      -     37.82%          swapper  [kernel.kallsyms]          [k] _raw_spin_lock_irqsave
         - _raw_spin_lock_irqsave
            - 95.65% pm_qos_request
                 menu_select
                 cpuidle_idle_call
               - cpu_idle
                    99.98% start_secondary
      
      A better approach will be to cache the updated pm_qos target value so
      reading it does not require lock acquisition as in the patch below.
      With this patch the contention for pm_qos_lock is removed and I saw a
      2.2X increase in throughput for my message passing workload.
      
      cc: stable@kernel.org
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: James Bottomley <James.Bottomley@suse.de>
      Acked-by: mark gross <markgross@thegnar.org>
      Signed-off-by: Len Brown <len.brown@intel.com>
      333c5ae9
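The caching idea described in this commit message can be illustrated with a small user-space sketch. This is not the kernel patch itself: `struct qos_object`, `qos_add_request()`, `qos_read_target()` and the spinlock helpers are hypothetical names, and C11 atomics stand in for the kernel's primitives. Writers recompute the target under the lock and publish it; readers on the idle path load the cached value without taking the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define QOS_MAX_REQUESTS  16
#define QOS_DEFAULT_VALUE 2000000000

/* Hypothetical user-space stand-in for a pm_qos-style object. */
struct qos_object {
    atomic_flag lock;                 /* protects requests[] and nr_requests */
    int requests[QOS_MAX_REQUESTS];
    size_t nr_requests;
    atomic_int target_value;          /* cached minimum, readable lock-free */
};

static void sketch_spin_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;                             /* spin until the flag was clear */
}

static void sketch_spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

static void qos_init(struct qos_object *o)
{
    atomic_flag_clear(&o->lock);
    o->nr_requests = 0;
    atomic_init(&o->target_value, QOS_DEFAULT_VALUE);
}

/* Slow path: add a request, recompute the minimum, publish it. */
static void qos_add_request(struct qos_object *o, int value)
{
    sketch_spin_lock(&o->lock);
    assert(o->nr_requests < QOS_MAX_REQUESTS);
    o->requests[o->nr_requests++] = value;

    int min = o->requests[0];
    for (size_t i = 1; i < o->nr_requests; i++)
        if (o->requests[i] < min)
            min = o->requests[i];

    atomic_store(&o->target_value, min);
    sketch_spin_unlock(&o->lock);
}

/* Fast path: no lock, just read the cached target. */
static int qos_read_target(struct qos_object *o)
{
    return atomic_load(&o->target_value);
}
```

The trade-off in this sketch is that the list walk happens once per update rather than once per read, which fits a workload where reads (every idle entry) vastly outnumber updates.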
  8. 28 May, 2011 8 commits
    • P
      rcu: Start RCU kthreads in TASK_INTERRUPTIBLE state · cc3ce517
      Paul E. McKenney committed
      Upon creation, kthreads are in TASK_UNINTERRUPTIBLE state, which can
      result in softlockup warnings.  Because some of RCU's kthreads can
      legitimately be idle indefinitely, start them in TASK_INTERRUPTIBLE
      state in order to avoid those warnings.
      Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cc3ce517
    • P
      rcu: Remove waitqueue usage for cpu, node, and boost kthreads · 08bca60a
      Peter Zijlstra committed
      It is not necessary to use waitqueues for the RCU kthreads because
      we always know exactly which thread is to be awakened.  In addition,
      wake_up() only issues an actual wakeup when there is a thread waiting on
      the queue, which was why there was an extra explicit wake_up_process()
      to get the RCU kthreads started.
      
      Eliminating the waitqueues (and wake_up()) in favor of wake_up_process()
      eliminates the need for the initial wake_up_process() and also shrinks
      the data structure size a bit.  The wakeup logic is placed in a new
      rcu_wait() macro.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      08bca60a
    • P
      rcu: Avoid acquiring rcu_node locks in timer functions · 8826f3b0
      Paul E. McKenney committed
      This commit switches manipulations of the rcu_node ->wakemask field
      to atomic operations, which allows rcu_cpu_kthread_timer() to avoid
      acquiring the rcu_node lock.  This should avoid the following lockdep
      splat reported by Valdis Kletnieks:
      
      [   12.872150] usb 1-4: new high speed USB device number 3 using ehci_hcd
      [   12.986667] usb 1-4: New USB device found, idVendor=413c, idProduct=2513
      [   12.986679] usb 1-4: New USB device strings: Mfr=0, Product=0, SerialNumber=0
      [   12.987691] hub 1-4:1.0: USB hub found
      [   12.987877] hub 1-4:1.0: 3 ports detected
      [   12.996372] input: PS/2 Generic Mouse as /devices/platform/i8042/serio1/input/input10
      [   13.071471] udevadm used greatest stack depth: 3984 bytes left
      [   13.172129]
      [   13.172130] =======================================================
      [   13.172425] [ INFO: possible circular locking dependency detected ]
      [   13.172650] 2.6.39-rc6-mmotm0506 #1
      [   13.172773] -------------------------------------------------------
      [   13.172997] blkid/267 is trying to acquire lock:
      [   13.173009]  (&p->pi_lock){-.-.-.}, at: [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa
      [   13.173009]
      [   13.173009] but task is already holding lock:
      [   13.173009]  (rcu_node_level_0){..-...}, at: [<ffffffff810901cc>] rcu_cpu_kthread_timer+0x27/0x58
      [   13.173009]
      [   13.173009] which lock already depends on the new lock.
      [   13.173009]
      [   13.173009]
      [   13.173009] the existing dependency chain (in reverse order) is:
      [   13.173009]
      [   13.173009] -> #2 (rcu_node_level_0){..-...}:
      [   13.173009]        [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
      [   13.173009]        [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
      [   13.173009]        [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
      [   13.173009]        [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
      [   13.173009]        [<ffffffff815697f1>] _raw_spin_lock+0x36/0x45
      [   13.173009]        [<ffffffff81090794>] rcu_read_unlock_special+0x8c/0x1d5
      [   13.173009]        [<ffffffff8109092c>] __rcu_read_unlock+0x4f/0xd7
      [   13.173009]        [<ffffffff81027bd3>] rcu_read_unlock+0x21/0x23
      [   13.173009]        [<ffffffff8102cc34>] cpuacct_charge+0x6c/0x75
      [   13.173009]        [<ffffffff81030cc6>] update_curr+0x101/0x12e
      [   13.173009]        [<ffffffff810311d0>] check_preempt_wakeup+0xf7/0x23b
      [   13.173009]        [<ffffffff8102acb3>] check_preempt_curr+0x2b/0x68
      [   13.173009]        [<ffffffff81031d40>] ttwu_do_wakeup+0x76/0x128
      [   13.173009]        [<ffffffff81031e49>] ttwu_do_activate.constprop.63+0x57/0x5c
      [   13.173009]        [<ffffffff81031e96>] scheduler_ipi+0x48/0x5d
      [   13.173009]        [<ffffffff810177d5>] smp_reschedule_interrupt+0x16/0x18
      [   13.173009]        [<ffffffff815710f3>] reschedule_interrupt+0x13/0x20
      [   13.173009]        [<ffffffff810b66d1>] rcu_read_unlock+0x21/0x23
      [   13.173009]        [<ffffffff810b739c>] find_get_page+0xa9/0xb9
      [   13.173009]        [<ffffffff810b8b48>] filemap_fault+0x6a/0x34d
      [   13.173009]        [<ffffffff810d1a25>] __do_fault+0x54/0x3e6
      [   13.173009]        [<ffffffff810d447a>] handle_pte_fault+0x12c/0x1ed
      [   13.173009]        [<ffffffff810d48f7>] handle_mm_fault+0x1cd/0x1e0
      [   13.173009]        [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de
      [   13.173009]        [<ffffffff8156a75f>] page_fault+0x1f/0x30
      [   13.173009]
      [   13.173009] -> #1 (&rq->lock){-.-.-.}:
      [   13.173009]        [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
      [   13.173009]        [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
      [   13.173009]        [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
      [   13.173009]        [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
      [   13.173009]        [<ffffffff815697f1>] _raw_spin_lock+0x36/0x45
      [   13.173009]        [<ffffffff81027e19>] __task_rq_lock+0x8b/0xd3
      [   13.173009]        [<ffffffff81032f7f>] wake_up_new_task+0x41/0x108
      [   13.173009]        [<ffffffff810376c3>] do_fork+0x265/0x33f
      [   13.173009]        [<ffffffff81007d02>] kernel_thread+0x6b/0x6d
      [   13.173009]        [<ffffffff8153a9dd>] rest_init+0x21/0xd2
      [   13.173009]        [<ffffffff81b1db4f>] start_kernel+0x3bb/0x3c6
      [   13.173009]        [<ffffffff81b1d29f>] x86_64_start_reservations+0xaf/0xb3
      [   13.173009]        [<ffffffff81b1d393>] x86_64_start_kernel+0xf0/0xf7
      [   13.173009]
      [   13.173009] -> #0 (&p->pi_lock){-.-.-.}:
      [   13.173009]        [<ffffffff81067788>] check_prev_add+0x68/0x20e
      [   13.173009]        [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
      [   13.173009]        [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
      [   13.173009]        [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
      [   13.173009]        [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
      [   13.173009]        [<ffffffff815698ea>] _raw_spin_lock_irqsave+0x44/0x57
      [   13.173009]        [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa
      [   13.173009]        [<ffffffff81032f3c>] wake_up_process+0x10/0x12
      [   13.173009]        [<ffffffff810901e9>] rcu_cpu_kthread_timer+0x44/0x58
      [   13.173009]        [<ffffffff81045286>] call_timer_fn+0xac/0x1e9
      [   13.173009]        [<ffffffff8104556d>] run_timer_softirq+0x1aa/0x1f2
      [   13.173009]        [<ffffffff8103e487>] __do_softirq+0x109/0x26a
      [   13.173009]        [<ffffffff8157144c>] call_softirq+0x1c/0x30
      [   13.173009]        [<ffffffff81003207>] do_softirq+0x44/0xf1
      [   13.173009]        [<ffffffff8103e8b9>] irq_exit+0x58/0xc8
      [   13.173009]        [<ffffffff81017f5a>] smp_apic_timer_interrupt+0x79/0x87
      [   13.173009]        [<ffffffff81570fd3>] apic_timer_interrupt+0x13/0x20
      [   13.173009]        [<ffffffff810bd51a>] get_page_from_freelist+0x2aa/0x310
      [   13.173009]        [<ffffffff810bdf03>] __alloc_pages_nodemask+0x178/0x243
      [   13.173009]        [<ffffffff8101fe2f>] pte_alloc_one+0x1e/0x3a
      [   13.173009]        [<ffffffff810d27fe>] __pte_alloc+0x22/0x14b
      [   13.173009]        [<ffffffff810d48a8>] handle_mm_fault+0x17e/0x1e0
      [   13.173009]        [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de
      [   13.173009]        [<ffffffff8156a75f>] page_fault+0x1f/0x30
      [   13.173009]
      [   13.173009] other info that might help us debug this:
      [   13.173009]
      [   13.173009] Chain exists of:
      [   13.173009]   &p->pi_lock --> &rq->lock --> rcu_node_level_0
      [   13.173009]
      [   13.173009]  Possible unsafe locking scenario:
      [   13.173009]
      [   13.173009]        CPU0                    CPU1
      [   13.173009]        ----                    ----
      [   13.173009]   lock(rcu_node_level_0);
      [   13.173009]                                lock(&rq->lock);
      [   13.173009]                                lock(rcu_node_level_0);
      [   13.173009]   lock(&p->pi_lock);
      [   13.173009]
      [   13.173009]  *** DEADLOCK ***
      [   13.173009]
      [   13.173009] 3 locks held by blkid/267:
      [   13.173009]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8156cdb4>] do_page_fault+0x1f3/0x5de
      [   13.173009]  #1:  (&yield_timer){+.-...}, at: [<ffffffff810451da>] call_timer_fn+0x0/0x1e9
      [   13.173009]  #2:  (rcu_node_level_0){..-...}, at: [<ffffffff810901cc>] rcu_cpu_kthread_timer+0x27/0x58
      [   13.173009]
      [   13.173009] stack backtrace:
      [   13.173009] Pid: 267, comm: blkid Not tainted 2.6.39-rc6-mmotm0506 #1
      [   13.173009] Call Trace:
      [   13.173009]  <IRQ>  [<ffffffff8154a529>] print_circular_bug+0xc8/0xd9
      [   13.173009]  [<ffffffff81067788>] check_prev_add+0x68/0x20e
      [   13.173009]  [<ffffffff8100c861>] ? save_stack_trace+0x28/0x46
      [   13.173009]  [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
      [   13.173009]  [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
      [   13.173009]  [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
      [   13.173009]  [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa
      [   13.173009]  [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
      [   13.173009]  [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa
      [   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
      [   13.173009]  [<ffffffff815698ea>] _raw_spin_lock_irqsave+0x44/0x57
      [   13.173009]  [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa
      [   13.173009]  [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa
      [   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
      [   13.173009]  [<ffffffff81032f3c>] wake_up_process+0x10/0x12
      [   13.173009]  [<ffffffff810901e9>] rcu_cpu_kthread_timer+0x44/0x58
      [   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
      [   13.173009]  [<ffffffff81045286>] call_timer_fn+0xac/0x1e9
      [   13.173009]  [<ffffffff810451da>] ? del_timer+0x75/0x75
      [   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
      [   13.173009]  [<ffffffff8104556d>] run_timer_softirq+0x1aa/0x1f2
      [   13.173009]  [<ffffffff8103e487>] __do_softirq+0x109/0x26a
      [   13.173009]  [<ffffffff8106365f>] ? tick_dev_program_event+0x37/0xf6
      [   13.173009]  [<ffffffff810a0e4a>] ? time_hardirqs_off+0x1b/0x2f
      [   13.173009]  [<ffffffff8157144c>] call_softirq+0x1c/0x30
      [   13.173009]  [<ffffffff81003207>] do_softirq+0x44/0xf1
      [   13.173009]  [<ffffffff8103e8b9>] irq_exit+0x58/0xc8
      [   13.173009]  [<ffffffff81017f5a>] smp_apic_timer_interrupt+0x79/0x87
      [   13.173009]  [<ffffffff81570fd3>] apic_timer_interrupt+0x13/0x20
      [   13.173009]  <EOI>  [<ffffffff810bd384>] ? get_page_from_freelist+0x114/0x310
      [   13.173009]  [<ffffffff810bd51a>] ? get_page_from_freelist+0x2aa/0x310
      [   13.173009]  [<ffffffff812220e7>] ? clear_page_c+0x7/0x10
      [   13.173009]  [<ffffffff810bd1ef>] ? prep_new_page+0x14c/0x1cd
      [   13.173009]  [<ffffffff810bd51a>] get_page_from_freelist+0x2aa/0x310
      [   13.173009]  [<ffffffff810bdf03>] __alloc_pages_nodemask+0x178/0x243
      [   13.173009]  [<ffffffff810d46b9>] ? __pmd_alloc+0x87/0x99
      [   13.173009]  [<ffffffff8101fe2f>] pte_alloc_one+0x1e/0x3a
      [   13.173009]  [<ffffffff810d46b9>] ? __pmd_alloc+0x87/0x99
      [   13.173009]  [<ffffffff810d27fe>] __pte_alloc+0x22/0x14b
      [   13.173009]  [<ffffffff810d48a8>] handle_mm_fault+0x17e/0x1e0
      [   13.173009]  [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de
      [   13.173009]  [<ffffffff810d915f>] ? sys_brk+0x32/0x10c
      [   13.173009]  [<ffffffff810a0e4a>] ? time_hardirqs_off+0x1b/0x2f
      [   13.173009]  [<ffffffff81065c4f>] ? trace_hardirqs_off_caller+0x3f/0x9c
      [   13.173009]  [<ffffffff812235dd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
      [   13.173009]  [<ffffffff8156a75f>] page_fault+0x1f/0x30
      [   14.010075] usb 5-1: new full speed USB device number 2 using uhci_hcd
      Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8826f3b0
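As a rough illustration of why atomic operations on a mask remove the need for the node lock in the timer path, here is a minimal user-space sketch using C11 atomics. The names `struct sketch_node`, `timer_request_wakeup()` and `kthread_collect_wakeups()` are hypothetical and do not come from the kernel source; the kernel's own atomic primitives differ from `<stdatomic.h>`.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Hypothetical stand-in for an rcu_node with a wakeup mask. */
struct sketch_node {
    atomic_ulong wakemask;   /* one bit per CPU that needs its kthread woken */
};

/* Timer side: mark a CPU as needing a wakeup, without taking any lock. */
static void timer_request_wakeup(struct sketch_node *n, unsigned int cpu_bit)
{
    assert(cpu_bit < sizeof(unsigned long) * CHAR_BIT);
    atomic_fetch_or(&n->wakemask, 1UL << cpu_bit);
}

/* Kthread side: atomically fetch the pending bits and clear them. */
static unsigned long kthread_collect_wakeups(struct sketch_node *n)
{
    return atomic_exchange(&n->wakemask, 0UL);
}
```

Because the timer side never holds the node lock in this sketch, it cannot contribute an edge from that lock to `&p->pi_lock` when it later calls a wakeup function, which is the dependency the lockdep report above complains about.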
    • P
      perf: Fix SIGIO handling · f506b3dc
      Peter Zijlstra committed
      Vince noticed that unless we mmap() a buffer, SIGIO gets lost. So
      explicitly push the wakeup (including signals) when requested.
      Reported-by: Vince Weaver <vweaver1@eecs.utk.edu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/n/tip-2euus3f3x3dyvdk52cjxw8zu@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f506b3dc
    • K
      cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed · 1e1b6c51
      KOSAKI Motohiro committed
      The rule is, we have to update tsk->rt.nr_cpus_allowed if we change
      tsk->cpus_allowed. Otherwise the RT scheduler may get confused.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/4DD4B3FA.5060901@jp.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1e1b6c51
    • P
      sched: Fix ->min_vruntime calculation in dequeue_entity() · 1e876231
      Peter Zijlstra committed
      Dima Zavin <dima@android.com> reported:
      
      "After pulling the thread off the run-queue during a cgroup change,
      the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
      then gets normalized to this new value. This can then lead to the thread
      getting an unfair boost in the new group if the vruntime of the next
      task in the old run-queue was way further ahead."
      Reported-by: Dima Zavin <dima@android.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Recalls-having-tested-once-upon-a-time-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1305674470-23727-1-git-send-email-john.stultz@linaro.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1e876231
    • P
      sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW · d6aa8f85
      Peter Zijlstra committed
      Marc reported that e4a52bcb (sched: Remove rq->lock from the first
      half of ttwu()) broke his ARM-SMP machine. Now ARM is one of the few
      __ARCH_WANT_INTERRUPTS_ON_CTXSW users, so that exception in the ttwu()
      code was suspect.
      
      Yong found that the interrupt could hit after context_switch() changes
      current but before it clears p->on_cpu; if that interrupt were to
      attempt a wake-up of p, we would indeed find ourselves spinning in IRQ
      context.
      
      Fix this by reverting to the old behaviour for this situation and
      perform a full remote wake-up.
      
      Cc: Frank Rowand <frank.rowand@am.sony.com>
      Cc: Yong Zhang <yong.zhang0@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Marc Zyngier <Marc.Zyngier@arm.com>
      Tested-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d6aa8f85
    • X
      sched: More sched_domain iterations fixes · cd4ae6ad
      Xiaotian Feng committed
      sched_domain iterations need to be protected by rcu_read_lock() now.
      This patch adds another two places that need the rcu lock, which were
      spotted by the following suspicious rcu_dereference_check() usage warnings:
      
      kernel/sched_rt.c:1244 invoked rcu_dereference_check() without protection!
      kernel/sched_stats.h:41 invoked rcu_dereference_check() without protection!
      Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1303469634-11678-1-git-send-email-dfeng@redhat.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cd4ae6ad
  9. 27 May, 2011 10 commits
    • R
      kernel/profile.c: remove some duplicate code from profile_hits() · 6f7bd76f
      Rakib Mullick committed
      profile_hits() has a common check for prof_on and prof_buffer regardless
      of SMP or !SMP.  So, remove some duplicate code by splitting profile_hits
      into two.
      
      [akpm@linux-foundation.org: make do_profile_hits static]
      Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f7bd76f
    • J
      mm: extract exe_file handling from procfs · 38646013
      Jiri Slaby committed
      Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
      This was because exe_file was needed only for /proc/<pid>/exe.  Since we
      will need the exe_file functionality also for core dumps (so core name can
      contain full binary path), build this functionality always into the
      kernel.
      
      To achieve that, move it out of proc FS to kernel/, where in fact it
      should belong.  By doing that we can make dup_mm_exe_file static.  Also we
      can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38646013
    • D
      cgroup: remove the ns_cgroup · a77aea92
      Daniel Lezcano committed
      The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
      leads to some problems:
      
        * cgroup creation is out-of-control
        * cgroup name can conflict when pids are looping
        * it is not possible to have a single process handling a lot of
          namespaces without falling into an exponential creation time
        * we may want to create a namespace without creating a cgroup
      
        The ns_cgroup was replaced by a compatibility flag 'clone_children',
        where a newly created cgroup will copy the parent cgroup values.
        The userspace has to manually create a cgroup and add a task to
        the 'tasks' file.
      
      This patch removes the ns_cgroup as suggested in the following thread:
      
      https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html
      
      The 'cgroup_clone' function is removed because it is no longer used.
      
      This is a userspace-visible change.  Commit 45531757 ("cgroup: notify
      ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
      printk warning users that the feature is planned for removal.  Since that
      time we have heard from XXX users who were affected by this.
      Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jamal Hadi Salim <hadi@cyberus.ca>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      Acked-by: Matt Helsley <matthltc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a77aea92
    • B
      cgroups: use flex_array in attach_proc · d846687d
      Ben Blum committed
      Convert cgroup_attach_proc to use flex_array.
      
      The cgroup_attach_proc implementation requires a pre-allocated array to
      store task pointers to atomically move a thread-group, but asking for a
      monolithic array with kmalloc() may be unreliable for very large groups.
      Using flex_array provides the same functionality with less risk of
      failure.
      
      This is a post-patch for cgroup-procs-write.patch.
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d846687d
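The motivation for preferring a chunked array over one monolithic kmalloc() can be shown with a user-space sketch. This is not the kernel's flex_array API: `struct chunked_array`, `chunked_alloc()`, `chunked_put()`, `chunked_get()` and `PART_ELEMS` are hypothetical, and malloc-family calls stand in for kernel allocators. The point is only that no single allocation grows with the size of the thread group beyond a small table of part pointers.

```c
#include <assert.h>
#include <stdlib.h>

#define PART_ELEMS 64   /* elements per fixed-size part (assumed value) */

/* Hypothetical chunked pointer array: many small parts, no big block. */
struct chunked_array {
    size_t nr_elems;
    size_t nr_parts;
    void ***parts;      /* parts[i] points to PART_ELEMS element slots */
};

static void chunked_free(struct chunked_array *a)
{
    if (!a)
        return;
    if (a->parts) {
        for (size_t i = 0; i < a->nr_parts; i++)
            free(a->parts[i]);          /* free(NULL) is a no-op */
        free(a->parts);
    }
    free(a);
}

/* Pre-allocate every part up front so later puts cannot fail. */
static struct chunked_array *chunked_alloc(size_t nr_elems)
{
    struct chunked_array *a = calloc(1, sizeof(*a));
    if (!a)
        return NULL;
    a->nr_elems = nr_elems;
    a->nr_parts = (nr_elems + PART_ELEMS - 1) / PART_ELEMS;
    a->parts = calloc(a->nr_parts ? a->nr_parts : 1, sizeof(*a->parts));
    if (!a->parts) {
        chunked_free(a);
        return NULL;
    }
    for (size_t i = 0; i < a->nr_parts; i++) {
        a->parts[i] = calloc(PART_ELEMS, sizeof(void *));
        if (!a->parts[i]) {
            chunked_free(a);
            return NULL;
        }
    }
    return a;
}

static int chunked_put(struct chunked_array *a, size_t idx, void *p)
{
    if (idx >= a->nr_elems)
        return -1;
    a->parts[idx / PART_ELEMS][idx % PART_ELEMS] = p;
    return 0;
}

static void *chunked_get(const struct chunked_array *a, size_t idx)
{
    assert(idx < a->nr_elems);
    return a->parts[idx / PART_ELEMS][idx % PART_ELEMS];
}
```

Pre-allocating all parts before the move starts mirrors the requirement stated above that the array be ready before the thread group is moved atomically.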
    • B
      cgroups: make procs file writable · 74a1166d
      Ben Blum committed
      Make procs file writable to move all threads by tgid at once.
      
      Add functionality that enables users to move all threads in a threadgroup
      at once to a cgroup by writing the tgid to the 'cgroup.procs' file.  This
      current implementation makes use of a per-threadgroup rwsem that's taken
      for reading in the fork() path to prevent newly forking threads within the
      threadgroup from "escaping" while the move is in progress.
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74a1166d
    • B
      cgroups: add per-thread subsystem callbacks · f780bdb7
      Ben Blum committed
      Add cgroup subsystem callbacks for per-thread attachment in atomic contexts
      
      Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
      for cgroups' subsystem interface.  Unlike can_attach and attach, these
      are for per-thread operations, to be called potentially many times when
      attaching an entire threadgroup.
      
      Also, the old "bool threadgroup" interface is removed, as replaced by
      this.  All subsystems are modified for the new interface - of note is
      cpuset, which requires from/to nodemasks for attach to be globally scoped
      (though per-cpuset would work too) to persist from its pre_attach to
      attach_task and attach.
      
      This is a pre-patch for cgroup-procs-writable.patch.
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f780bdb7
    • B
      cgroups: read-write lock CLONE_THREAD forking per threadgroup · 4714d1d3
      Ben Blum committed
      Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
      
      Add an rwsem that lives in a threadgroup's signal_struct that's taken for
      reading in the fork path, under CONFIG_CGROUPS.  If another part of the
      kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
      ifdefs should be changed to a higher-up flag that CGROUPS and the other
      system would both depend on.
      
      This is a pre-patch for cgroup-procs-write.patch.
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4714d1d3
    • R
      PM: Fix PM QOS's user mode interface to work with ASCII input · 0775a60a
      Rafael J. Wysocki committed
      Make pm_qos_power_write() accept values passed to it in the ASCII hex
      format either with or without an ending newline.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Mark Gross <markgross@thegnar.org>
      0775a60a
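A minimal user-space sketch of accepting an ASCII hex value with or without a trailing newline follows. It is not the code of pm_qos_power_write(): the function name `parse_hex_value()`, the 16-byte scratch buffer and the use of `strtoul()` are assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse `count` bytes of ASCII hex (an optional "0x"
 * prefix is accepted by strtoul with base 16), tolerating one trailing
 * newline. Returns 0 on success and -EINVAL on malformed input.
 */
static int parse_hex_value(const char *buf, size_t count, unsigned long *out)
{
    char tmp[16];
    char *end;

    assert(buf != NULL && out != NULL);
    if (count == 0 || count >= sizeof(tmp))
        return -EINVAL;

    memcpy(tmp, buf, count);
    tmp[count] = '\0';

    /* Strip a single trailing newline, as produced by `echo`. */
    if (tmp[count - 1] == '\n')
        tmp[--count] = '\0';
    if (count == 0)
        return -EINVAL;

    errno = 0;
    unsigned long v = strtoul(tmp, &end, 16);
    if (errno != 0 || end == tmp || *end != '\0')
        return -EINVAL;

    *out = v;
    return 0;
}
```

Copying into a bounded, NUL-terminated buffer before parsing keeps the sketch safe for input that is not itself NUL-terminated, which is the usual situation for data arriving through a write() call.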
    • P
      rcu: Decrease memory-barrier usage based on semi-formal proof · 23b5c8fa
      Paul E. McKenney committed
      (Note: this was reverted, and is now being re-applied in pieces, with
      this being the fifth and final piece.  See below for the reason that
      it is now felt to be safe to re-apply this.)
      
      Commit d09b62df fixed grace-period synchronization, but left some smp_mb()
      invocations in rcu_process_callbacks() that are no longer needed, but
      sheer paranoia prevented them from being removed.  This commit removes
      them and provides a proof of correctness in their absence.  It also adds
      a memory barrier to rcu_report_qs_rsp() immediately before the update to
      rsp->completed in order to handle the theoretical possibility that the
      compiler or CPU might move massive quantities of code into a lock-based
      critical section.  This also proves that the sheer paranoia was not
      entirely unjustified, at least from a theoretical point of view.
      
      In addition, the old dyntick-idle synchronization depended on the fact
      that grace periods were many milliseconds in duration, so that it could
      be assumed that no dyntick-idle CPU could reorder a memory reference
      across an entire grace period.  Unfortunately for this design, the
      addition of expedited grace periods breaks this assumption, which has
      the unfortunate side-effect of requiring atomic operations in the
      functions that track dyntick-idle state for RCU.  (There is some hope
      that the algorithms used in user-level RCU might be applied here, but
      some work is required to handle the NMIs that user-space applications
      can happily ignore.  For the short term, better safe than sorry.)
      
      This proof assumes that neither compiler nor CPU will allow a lock
      acquisition and release to be reordered, as doing so can result in
      deadlock.  The proof is as follows:
      
      1.	A given CPU declares a quiescent state under the protection of
      	its leaf rcu_node's lock.
      
      2.	If there is more than one level of rcu_node hierarchy, the
      	last CPU to declare a quiescent state will also acquire the
      	->lock of the next rcu_node up in the hierarchy,  but only
      	after releasing the lower level's lock.  The acquisition of this
      	lock clearly cannot occur prior to the acquisition of the leaf
      	node's lock.
      
      3.	Step 2 repeats until we reach the root rcu_node structure.
      	Please note again that only one lock is held at a time through
      	this process.  The acquisition of the root rcu_node's ->lock
      	must occur after the release of that of the leaf rcu_node.
      
      4.	At this point, we set the ->completed field in the rcu_state
      	structure in rcu_report_qs_rsp().  However, if the rcu_node
      	hierarchy contains only one rcu_node, then in theory the code
      	preceding the quiescent state could leak into the critical
      	section.  We therefore precede the update of ->completed with a
      	memory barrier.  All CPUs will therefore agree that any updates
      	preceding any report of a quiescent state will have happened
      	before the update of ->completed.
      
      5.	Regardless of whether a new grace period is needed, rcu_start_gp()
      	will propagate the new value of ->completed to all of the leaf
      	rcu_node structures, under the protection of each rcu_node's ->lock.
      	If a new grace period is needed immediately, this propagation
      	will occur in the same critical section that ->completed was
      	set in, but courtesy of the memory barrier in #4 above, is still
      	seen to follow any pre-quiescent-state activity.
      
      6.	When a given CPU invokes __rcu_process_gp_end(), it becomes
      	aware of the end of the old grace period and therefore makes
      	any RCU callbacks that were waiting on that grace period eligible
      	for invocation.
      
      	If this CPU is the same one that detected the end of the grace
      	period, and if there is but a single rcu_node in the hierarchy,
      	we will still be in the single critical section.  In this case,
      	the memory barrier in step #4 guarantees that all callbacks will
      	be seen to execute after each CPU's quiescent state.
      
      	On the other hand, if this is a different CPU, it will acquire
      	the leaf rcu_node's ->lock, and will again be serialized after
      	each CPU's quiescent state for the old grace period.
      
      On the strength of this proof, this commit therefore removes the memory
      barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
      The effect is to reduce the number of memory barriers by one and to
      reduce the frequency of execution from about once per scheduling tick
      per CPU to once per grace period.
      
      This was reverted due to hangs found during testing by Yinghai Lu and
      Ingo Molnar.  Frederic Weisbecker supplied Yinghai with tracing that
      located the underlying problem, and Frederic also provided the fix.
      
      The underlying problem was that the HARDIRQ_ENTER() macro from
      lib/locking-selftest.c invoked irq_enter(), which in turn invokes
      rcu_irq_enter(), but HARDIRQ_EXIT() invoked __irq_exit(), which
      does not invoke rcu_irq_exit().  This situation resulted in calls
      to rcu_irq_enter() that were not balanced by the required calls to
      rcu_irq_exit().  Therefore, after these locking selftests completed,
      RCU's dyntick-idle nesting count was a large number (for example,
      72), which caused RCU to conclude that the affected CPU was not in
      dyntick-idle mode when in fact it was.
      
      RCU would therefore incorrectly wait for this dyntick-idle CPU, resulting
      in hangs.
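
The imbalance described above can be sketched as a small user-space
model. This is only an illustration under stated assumptions, not the
kernel code: every model_* name is hypothetical, the counter stands in
for RCU's dyntick-idle nesting count, and "idle" is simplified to mean
"count is zero".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for RCU's per-CPU dyntick-idle nesting count. */
static int model_nesting;

/* Models rcu_irq_enter(): one more level of irq nesting is recorded. */
static void model_rcu_irq_enter(void)
{
	model_nesting++;
	assert(model_nesting > 0);
}

/* Models irq_enter(), which per the text invokes rcu_irq_enter(). */
static void model_irq_enter(void)
{
	model_rcu_irq_enter();
}

/* Models __irq_enter(): does not touch RCU's count. */
static void model_raw_irq_enter(void)
{
}

/* Models __irq_exit(): per the text, does not invoke rcu_irq_exit(). */
static void model_raw_irq_exit(void)
{
}

/* Old selftest pairing: irq_enter() matched with __irq_exit(). */
static int model_run_buggy(int ntests)
{
	model_nesting = 0;
	for (int i = 0; i < ntests; i++) {
		model_irq_enter();
		model_raw_irq_exit();
	}
	return model_nesting;	/* one leftover level per test */
}

/* Fixed pairing: __irq_enter() matched with __irq_exit(). */
static int model_run_fixed(int ntests)
{
	model_nesting = 0;
	for (int i = 0; i < ntests; i++) {
		model_raw_irq_enter();
		model_raw_irq_exit();
	}
	return model_nesting;	/* count is never touched */
}

/* Simplified check: the CPU looks idle only when nothing is nested. */
static bool model_looks_idle(int nesting)
{
	return nesting == 0;
}
```

In this model, each buggy enter/exit pair leaves one extra nesting level
behind, so the count grows with the number of tests and
model_looks_idle() stays false. With the fixed pairing the count is left
alone.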
      
      In contrast, with Frederic's patch, which replaces the irq_enter()
      in HARDIRQ_ENTER() with an __irq_enter(), these tests don't ever call
      either rcu_irq_enter() or rcu_irq_exit(), which works because the CPU
      running the test is already marked as not being in dyntick-idle mode.
      This means that the rcu_irq_enter() and rcu_irq_exit() calls stay
      balanced, and RCU then has no problem working out which CPUs are in
      dyntick-idle mode and which are not.
      
      The reason that the imbalance was not noticed before the barrier patch
      was applied is that the old implementation of rcu_enter_nohz() ignored
      the nesting depth.  This could still result in delays, but much shorter
      ones.  Whenever there was a delay, RCU would IPI the CPU with the
      unbalanced nesting level, which would eventually result in rcu_enter_nohz()
      being called, which in turn would force RCU to see that the CPU was in
      dyntick-idle mode.
      
      The reason that very few people noticed the problem is that the mismatched
      irq_enter() vs. __irq_exit() occurred only when the kernel was built with
      CONFIG_DEBUG_LOCKING_API_SELFTESTS.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      23b5c8fa
    • P
      rcu: Make rcu_enter_nohz() pay attention to nesting · 4305ce78
      Paul E. McKenney committed
      The old version of rcu_enter_nohz() forced RCU into nohz mode even if
      the nesting count was non-zero.  This change causes rcu_enter_nohz()
      to hold off for non-zero nesting counts.
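
The difference between the two behaviors might be sketched as follows.
This is a simplified user-space model, not the actual kernel change:
struct model_dynticks and both model_enter_nohz_*() functions are
hypothetical names, and the real function's bookkeeping is reduced to a
single nesting count plus an idle flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state: nesting count plus an idle flag. */
struct model_dynticks {
	int nesting;	/* outstanding non-idle nesting levels */
	bool idle;	/* true once RCU treats the CPU as dyntick-idle */
};

/* Old behavior per the text: enter nohz mode regardless of nesting. */
static void model_enter_nohz_old(struct model_dynticks *d)
{
	assert(d != NULL);
	d->idle = true;
}

/* New behavior per the text: hold off while the nesting count is non-zero. */
static void model_enter_nohz_new(struct model_dynticks *d)
{
	assert(d != NULL);
	if (d->nesting != 0)
		return;		/* still nested, so do not mark the CPU idle */
	d->idle = true;
}
```

With a leftover nesting count, the old variant in this model still marks
the CPU idle, while the new variant leaves it non-idle until the count
is zero.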
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      4305ce78