1. June 23, 2011 (6 commits)
    • ptrace: s/tracehook_tracer_task()/ptrace_parent()/ · 06d98473
      Committed by Tejun Heo
      tracehook.h is on the way out.  Rename tracehook_tracer_task() to
      ptrace_parent() and move it from tracehook.h to ptrace.h.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: John Johansen <john.johansen@canonical.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      06d98473
    • ptrace: kill clone/exec tracehooks · 4b9d33e6
      Committed by Tejun Heo
      At this point, tracehooks aren't useful to the mainline kernel and mostly
      just add an extra layer of obfuscation.  Although they have comments,
      without actual in-kernel users it is difficult to tell what their
      assumptions are and what they're actually trying to achieve.  To the
      mainline kernel, they just aren't worth keeping around.
      
      This patch kills the following clone and exec related tracehooks.
      
      	tracehook_prepare_clone()
      	tracehook_finish_clone()
      	tracehook_report_clone()
      	tracehook_report_clone_complete()
      	tracehook_unsafe_exec()
      
      The changes are mostly trivial - logic is moved to the caller and
      comments are merged and adjusted appropriately.
      
      The only exception is in check_unsafe_exec() where LSM_UNSAFE_PTRACE*
      are OR'd to bprm->unsafe instead of setting it, which produces the
      same result as the field is always zero on entry.  It also tests
      p->ptrace instead of (p->ptrace & PT_PTRACED) for consistency, which
      also gives the same result.
      
      This doesn't introduce any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      4b9d33e6
    • ptrace: kill trivial tracehooks · a288eecc
      Committed by Tejun Heo
      At this point, tracehooks aren't useful to the mainline kernel and mostly
      just add an extra layer of obfuscation.  Although they have comments,
      without actual in-kernel users it is difficult to tell what their
      assumptions are and what they're actually trying to achieve.  To the
      mainline kernel, they just aren't worth keeping around.
      
      This patch kills the following trivial tracehooks.
      
      * Ones testing whether task is ptraced.  Replace with ->ptrace test.
      
      	tracehook_expect_breakpoints()
      	tracehook_consider_ignored_signal()
      	tracehook_consider_fatal_signal()
      
      * ptrace_event() wrappers.  Call directly.
      
      	tracehook_report_exec()
      	tracehook_report_exit()
      	tracehook_report_vfork_done()
      
      * ptrace_release_task() wrapper.  Call directly.
      
      	tracehook_finish_release_task()
      
      * noop
      
      	tracehook_prepare_release_task()
      	tracehook_report_death()
      
      This doesn't introduce any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      a288eecc
    • ptrace: move SIGTRAP on exec(2) logic to ptrace_event() · f3c04b93
      Committed by Tejun Heo
      Move SIGTRAP on exec(2) logic from tracehook_report_exec() to
      ptrace_event().  This is part of changes to make ptrace_event()
      smarter and handle ptrace event related details in one place.
      
      This doesn't introduce any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      f3c04b93
    • ptrace: introduce ptrace_event_enabled() and simplify ptrace_event() and tracehook_prepare_clone() · 643ad838
      Committed by Tejun Heo
      This patch implements ptrace_event_enabled() which tests whether a
      given PTRACE_EVENT_* is enabled and uses it to simplify ptrace_event()
      and tracehook_prepare_clone().
      
      PT_EVENT_FLAG() macro is added which calculates PT_TRACE_* flag from
      PTRACE_EVENT_*.  This is used to define PT_TRACE_* flags and by
      ptrace_event_enabled() to find the matching flag.
      
      This is used to make ptrace_event() and tracehook_prepare_clone()
      simpler.
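
      As a standalone sketch of that mechanism (the numeric values below mirror
      the usual PTRACE_EVENT_*/PT_OPT_FLAG_SHIFT definitions but are redefined
      here so the snippet compiles on its own, and the task struct is a
      stand-in, not the kernel's):

        #include <stdbool.h>
        #include <stdio.h>

        #define PTRACE_EVENT_FORK	1
        #define PTRACE_EVENT_EXEC	4

        /* PT_TRACE_* is just PTRACE_EVENT_* shifted into the option-flag area */
        #define PT_OPT_FLAG_SHIFT	3
        #define PT_EVENT_FLAG(event)	(1 << (PT_OPT_FLAG_SHIFT + (event)))
        #define PT_TRACE_FORK		PT_EVENT_FLAG(PTRACE_EVENT_FORK)
        #define PT_TRACE_EXEC		PT_EVENT_FLAG(PTRACE_EVENT_EXEC)

        struct task { unsigned long ptrace; };	/* stand-in for task_struct */

        static bool ptrace_event_enabled(struct task *task, int event)
        {
                return task->ptrace & PT_EVENT_FLAG(event);
        }

        int main(void)
        {
                struct task t = { .ptrace = PT_TRACE_EXEC };

                printf("fork enabled: %d\n", ptrace_event_enabled(&t, PTRACE_EVENT_FORK));
                printf("exec enabled: %d\n", ptrace_event_enabled(&t, PTRACE_EVENT_EXEC));
                return 0;
        }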
      
      * ptrace_event() callers were responsible for providing mask to test
        whether the event was enabled.  This patch implements
        ptrace_event_enabled() and makes ptrace_event() drop @mask and
        determine whether the event is enabled from @event.  Note that
        @event is constant and this conversion doesn't add runtime overhead.
      
        All conversions except tracehook_report_clone_complete() are
        trivial.  tracehook_report_clone_complete() used to use 0 for @mask
        (always enabled) but now tests whether the specified event is
        enabled.  This doesn't cause any behavior difference as it's
        guaranteed that the event specified by @trace is enabled.
      
      * tracehook_prepare_clone() now only determines which event is
        applicable and uses ptrace_event_enabled() for the enable test.
      
      This doesn't introduce any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      643ad838
    • ptrace: kill task_ptrace() · d21142ec
      Committed by Tejun Heo
      task_ptrace(task) simply dereferences task->ptrace and isn't even used
      consistently, only adding confusion.  Kill it and directly access
      ->ptrace instead.
      
      This doesn't introduce any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      d21142ec
  2. June 17, 2011 (5 commits)
    • ptrace: implement PTRACE_LISTEN · 544b2c91
      Committed by Tejun Heo
      The previous patch implemented async notification for ptrace but it
      only worked while the tracee is running.  This patch introduces
      PTRACE_LISTEN, which was suggested by Oleg Nesterov.
      
      It's allowed iff tracee is in STOP trap and puts tracee into
      quasi-running state - tracee never really runs but wait(2) and
      ptrace(2) consider it to be running.  While ptracer is listening,
      tracee is allowed to re-enter STOP to notify an async event.
      Listening state is cleared on the first notification.  Ptracer can
      also clear it by issuing INTERRUPT - tracee will re-trap into STOP
      with listening state cleared.
      
      This allows ptracer to monitor group stop state without running tracee
      - use INTERRUPT to put tracee into STOP trap, issue LISTEN and then
      wait(2) to wait for the next group stop event.  When it happens,
      PTRACE_GETSIGINFO provides information to determine the current state.
      
      Test program follows.
      
        #include <stdio.h>
        #include <unistd.h>
        #include <signal.h>
        #include <time.h>
        #include <sys/types.h>
        #include <sys/ptrace.h>
        #include <sys/wait.h>

        #define PTRACE_SEIZE		0x4206
        #define PTRACE_INTERRUPT	0x4207
        #define PTRACE_LISTEN		0x4208
      
        #define PTRACE_SEIZE_DEVEL	0x80000000
      
        static const struct timespec ts1s = { .tv_sec = 1 };
      
        int main(int argc, char **argv)
        {
      	  pid_t tracee, tracer;
      	  int i;
      
      	  tracee = fork();
      	  if (!tracee)
      		  while (1)
      			  pause();
      
      	  tracer = fork();
      	  if (!tracer) {
      		  siginfo_t si;
      
      		  ptrace(PTRACE_SEIZE, tracee, NULL,
      			 (void *)(unsigned long)PTRACE_SEIZE_DEVEL);
      		  ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL);
      	  repeat:
      		  waitid(P_PID, tracee, NULL, WSTOPPED);
      
      		  ptrace(PTRACE_GETSIGINFO, tracee, NULL, &si);
      		  if (!si.si_code) {
      			  printf("tracer: SIG %d\n", si.si_signo);
      			  ptrace(PTRACE_CONT, tracee, NULL,
      				 (void *)(unsigned long)si.si_signo);
      			  goto repeat;
      		  }
      		  printf("tracer: stopped=%d signo=%d\n",
      			 si.si_signo != SIGTRAP, si.si_signo);
      		  if (si.si_signo != SIGTRAP)
      			  ptrace(PTRACE_LISTEN, tracee, NULL, NULL);
      		  else
      			  ptrace(PTRACE_CONT, tracee, NULL, NULL);
      		  goto repeat;
      	  }
      
      	  for (i = 0; i < 3; i++) {
      		  nanosleep(&ts1s, NULL);
      		  printf("mother: SIGSTOP\n");
      		  kill(tracee, SIGSTOP);
      		  nanosleep(&ts1s, NULL);
      		  printf("mother: SIGCONT\n");
      		  kill(tracee, SIGCONT);
      	  }
      	  nanosleep(&ts1s, NULL);
      
      	  kill(tracer, SIGKILL);
      	  kill(tracee, SIGKILL);
      	  return 0;
        }
      
      This is identical to the program to test TRAP_NOTIFY except that
      tracee is PTRACE_LISTEN'd instead of PTRACE_CONT'd when group stopped.
      This allows ptracer to monitor when group stop ends without running
      tracee.
      
        # ./test-listen
        tracer: stopped=0 signo=5
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
      
      -v2: Moved JOBCTL_LISTENING check in wait_task_stopped() into
           task_stopped_code() as suggested by Oleg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      544b2c91
    • ptrace: implement TRAP_NOTIFY and use it for group stop events · fb1d910c
      Committed by Tejun Heo
      Currently there's no way for ptracer to find out whether group stop
      finished other than polling with INTERRUPT - GETSIGINFO - CONT
      sequence.  This patch implements group stop notification for ptracer
      using STOP traps.
      
      When group stop state of a seized tracee changes, JOBCTL_TRAP_NOTIFY
      is set, which schedules a STOP trap which is sticky - it isn't cleared
      by other traps and at least one STOP trap will happen eventually.
      STOP trap is synchronization point for event notification and the
      tracer can determine the current group stop state by looking at the
      signal number portion of exit code (si_status from waitid(2) or
      si_code from PTRACE_GETSIGINFO).
      
      Notifications are generated both on start and end of group stops but,
      because group stop participation always happens before STOP trap, this
      doesn't cause an extra trap while tracee is participating in group
      stop.  The symmetry will be useful later.
      
      Note that this notification works iff tracee is not trapped.
      Currently there is no way to be notified of group stop state changes
      while tracee is trapped.  This will be addressed by a later patch.
      
      An example program follows.
      
        #include <stdio.h>
        #include <unistd.h>
        #include <signal.h>
        #include <time.h>
        #include <sys/types.h>
        #include <sys/ptrace.h>
        #include <sys/wait.h>

        #define PTRACE_SEIZE		0x4206
        #define PTRACE_INTERRUPT	0x4207
      
        #define PTRACE_SEIZE_DEVEL	0x80000000
      
        static const struct timespec ts1s = { .tv_sec = 1 };
      
        int main(int argc, char **argv)
        {
      	  pid_t tracee, tracer;
      	  int i;
      
      	  tracee = fork();
      	  if (!tracee)
      		  while (1)
      			  pause();
      
      	  tracer = fork();
      	  if (!tracer) {
      		  siginfo_t si;
      
      		  ptrace(PTRACE_SEIZE, tracee, NULL,
      			 (void *)(unsigned long)PTRACE_SEIZE_DEVEL);
      		  ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL);
      	  repeat:
      		  waitid(P_PID, tracee, NULL, WSTOPPED);
      
      		  ptrace(PTRACE_GETSIGINFO, tracee, NULL, &si);
      		  if (!si.si_code) {
      			  printf("tracer: SIG %d\n", si.si_signo);
      			  ptrace(PTRACE_CONT, tracee, NULL,
      				 (void *)(unsigned long)si.si_signo);
      			  goto repeat;
      		  }
      		  printf("tracer: stopped=%d signo=%d\n",
      			 si.si_signo != SIGTRAP, si.si_signo);
      		  ptrace(PTRACE_CONT, tracee, NULL, NULL);
      		  goto repeat;
      	  }
      
      	  for (i = 0; i < 3; i++) {
      		  nanosleep(&ts1s, NULL);
      		  printf("mother: SIGSTOP\n");
      		  kill(tracee, SIGSTOP);
      		  nanosleep(&ts1s, NULL);
      		  printf("mother: SIGCONT\n");
      		  kill(tracee, SIGCONT);
      	  }
      	  nanosleep(&ts1s, NULL);
      
      	  kill(tracer, SIGKILL);
      	  kill(tracee, SIGKILL);
      	  return 0;
        }
      
      In the above program, tracer keeps tracee running and gets
      notification of each group stop state change.
      
        # ./test-notify
        tracer: stopped=0 signo=5
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
        mother: SIGSTOP
        tracer: SIG 19
        tracer: stopped=1 signo=19
        mother: SIGCONT
        tracer: stopped=0 signo=5
        tracer: SIG 18
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      fb1d910c
    • ptrace: implement PTRACE_INTERRUPT · fca26f26
      Committed by Tejun Heo
      Currently, there's no way to trap a running ptracee short of sending a
      signal which has various side effects.  This patch implements
      PTRACE_INTERRUPT which traps ptracee without any signal or job control
      related side effect.
      
      The implementation is almost trivial.  It uses the group stop trap -
      SIGTRAP | PTRACE_EVENT_STOP << 8.  A new trap flag
      JOBCTL_TRAP_INTERRUPT is added, which is set on PTRACE_INTERRUPT and
      cleared when any trap happens.  As INTERRUPT should be usable
      regardless of the current state of tracee, task_is_traced() test in
      ptrace_check_attach() is skipped for INTERRUPT.
      
      PTRACE_INTERRUPT is available iff tracee is attached with
      PTRACE_SEIZE.
      
      Test program follows.
      
        #include <stdio.h>
        #include <unistd.h>
        #include <signal.h>
        #include <time.h>
        #include <sys/types.h>
        #include <sys/ptrace.h>
        #include <sys/wait.h>

        #define PTRACE_SEIZE		0x4206
        #define PTRACE_INTERRUPT	0x4207
      
        #define PTRACE_SEIZE_DEVEL	0x80000000
      
        static const struct timespec ts100ms = { .tv_nsec = 100000000 };
        static const struct timespec ts1s = { .tv_sec = 1 };
        static const struct timespec ts3s = { .tv_sec = 3 };
      
        int main(int argc, char **argv)
        {
      	  pid_t tracee;
      
      	  tracee = fork();
      	  if (tracee == 0) {
      		  nanosleep(&ts100ms, NULL);
      		  while (1) {
      			  printf("tracee: alive pid=%d\n", getpid());
      			  nanosleep(&ts1s, NULL);
      		  }
      	  }
      
      	  if (argc > 1)
      		  kill(tracee, SIGSTOP);
      
      	  nanosleep(&ts100ms, NULL);
      
      	  ptrace(PTRACE_SEIZE, tracee, NULL,
      		 (void *)(unsigned long)PTRACE_SEIZE_DEVEL);
      	  if (argc > 1) {
      		  waitid(P_PID, tracee, NULL, WSTOPPED);
      		  ptrace(PTRACE_CONT, tracee, NULL, NULL);
      	  }
      	  nanosleep(&ts3s, NULL);
      
      	  printf("tracer: INTERRUPT and DETACH\n");
      	  ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL);
      	  waitid(P_PID, tracee, NULL, WSTOPPED);
      	  ptrace(PTRACE_DETACH, tracee, NULL, NULL);
      	  nanosleep(&ts3s, NULL);
      
      	  printf("tracer: exiting\n");
      	  kill(tracee, SIGKILL);
      	  return 0;
        }
      
      When called without argument, tracee is seized from running state,
      interrupted and then detached back to running state.
      
        # ./test-interrupt
        tracee: alive pid=4546
        tracee: alive pid=4546
        tracee: alive pid=4546
        tracer: INTERRUPT and DETACH
        tracee: alive pid=4546
        tracee: alive pid=4546
        tracee: alive pid=4546
        tracer: exiting
      
      When called with argument, tracee is seized from stopped state,
      continued, interrupted and then detached back to stopped state.
      
        # ./test-interrupt  1
        tracee: alive pid=4548
        tracee: alive pid=4548
        tracee: alive pid=4548
        tracer: INTERRUPT and DETACH
        tracer: exiting
      
      Before PTRACE_INTERRUPT, once the tracee was running, there was no way
      to trap the tracee and do PTRACE_DETACH without causing side effects.
      
      -v2: Updated to use task_set_jobctl_pending() so that it doesn't end
           up scheduling TRAP_STOP if child is dying which may make the
           child unkillable.  Spotted by Oleg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      fca26f26
    • ptrace: implement PTRACE_SEIZE · 3544d72a
      Committed by Tejun Heo
      PTRACE_ATTACH implicitly issues SIGSTOP on attach which has side
      effects on tracee signal and job control states.  This patch
      implements a new ptrace request PTRACE_SEIZE which attaches a tracee
      without trapping it or affecting its signal and job control states.
      
      The usage is the same as PTRACE_ATTACH but it takes PTRACE_SEIZE_*
      flags in @data.  Currently, the only defined flag is
      PTRACE_SEIZE_DEVEL which is a temporary flag to enable PTRACE_SEIZE.
      PTRACE_SEIZE will change ptrace behaviors outside of attach itself.
      The changes will be implemented gradually and the DEVEL flag is to
      prevent programs which expect full SEIZE behavior from using it before
      all the behavior modifications are complete while allowing unit
      testing.  The flag will be removed once SEIZE behaviors are completely
      implemented.
      
      * PTRACE_SEIZE, unlike ATTACH, doesn't force tracee to trap.  After
        attaching tracee continues to run unless a trap condition occurs.
      
      * PTRACE_SEIZE doesn't affect signal or group stop state.
      
      * If PTRACE_SEIZE'd, group stop uses PTRACE_EVENT_STOP trap which uses
        exit_code of (signr | PTRACE_EVENT_STOP << 8) where signr is one of
        the stopping signals if group stop is in effect or SIGTRAP
        otherwise, and returns usual trap siginfo on PTRACE_GETSIGINFO
        instead of NULL.
      
      Seizing sets PT_SEIZED in ->ptrace of the tracee.  This flag will be
      used to determine whether new SEIZE behaviors should be enabled.
      
      Test program follows.
      
        #include <stdio.h>
        #include <unistd.h>
        #include <signal.h>
        #include <time.h>
        #include <sys/types.h>
        #include <sys/ptrace.h>
        #include <sys/wait.h>

        #define PTRACE_SEIZE		0x4206
        #define PTRACE_SEIZE_DEVEL	0x80000000
      
        static const struct timespec ts100ms = { .tv_nsec = 100000000 };
        static const struct timespec ts1s = { .tv_sec = 1 };
        static const struct timespec ts3s = { .tv_sec = 3 };
      
        int main(int argc, char **argv)
        {
      	  pid_t tracee;
      
      	  tracee = fork();
      	  if (tracee == 0) {
      		  nanosleep(&ts100ms, NULL);
      		  while (1) {
      			  printf("tracee: alive\n");
      			  nanosleep(&ts1s, NULL);
      		  }
      	  }
      
      	  if (argc > 1)
      		  kill(tracee, SIGSTOP);
      
      	  nanosleep(&ts100ms, NULL);
      
      	  ptrace(PTRACE_SEIZE, tracee, NULL,
      		 (void *)(unsigned long)PTRACE_SEIZE_DEVEL);
      	  if (argc > 1) {
      		  waitid(P_PID, tracee, NULL, WSTOPPED);
      		  ptrace(PTRACE_CONT, tracee, NULL, NULL);
      	  }
      	  nanosleep(&ts3s, NULL);
      	  printf("tracer: exiting\n");
      	  return 0;
        }
      
      When the above program is called w/o argument, tracee is seized while
      running and remains running.  When tracer exits, tracee continues to
      run and print out messages.
      
        # ./test-seize-simple
        tracee: alive
        tracee: alive
        tracee: alive
        tracer: exiting
        tracee: alive
        tracee: alive
      
      When called with an argument, tracee is seized from stopped state and
      continued, and returns to stopped state when tracer exits.
      
        # ./test-seize
        tracee: alive
        tracee: alive
        tracee: alive
        tracer: exiting
        # ps -el|grep test-seize
        1 T     0  4720     1  0  80   0 -   941 signal ttyS0    00:00:00 test-seize
      
      -v2: SEIZE doesn't schedule TRAP_STOP and leaves tracee running as Jan
           suggested.
      
      -v3: PTRACE_EVENT_STOP traps now report group stop state by signr.  If
           group stop is in effect the stop signal number is returned as
           part of exit_code; otherwise, SIGTRAP.  This was suggested by
           Denys and Oleg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Denys Vlasenko <vda.linux@googlemail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      3544d72a
    • job control: introduce JOBCTL_TRAP_STOP and use it for group stop trap · 73ddff2b
      Committed by Tejun Heo
      do_signal_stop() implemented both normal group stop and trap for group
      stop while ptraced.  This approach has been enough, but scheduled
      changes require a trap mechanism that can be used in a more generic
      manner, and using the group stop trap as the generic trap site
      simplifies both the userland-visible interface and the implementation.
      
      This patch adds a new jobctl flag - JOBCTL_TRAP_STOP.  When set, it
      triggers a trap site, which behaves like group stop trap, in
      get_signal_to_deliver() after checking for pending signals.  While
      ptraced, do_signal_stop() doesn't stop itself.  It initiates group
      stop if requested and schedules JOBCTL_TRAP_STOP and returns.  The
      caller - get_signal_to_deliver() - is responsible for checking whether
      TRAP_STOP is pending afterwards and handling it.
      
      ptrace_attach() is updated to use JOBCTL_TRAP_STOP instead of
      JOBCTL_STOP_PENDING and __ptrace_unlink() to clear all pending trap
      bits and TRAPPING so that TRAP_STOP and future trap bits don't linger
      after detach.
      
      While at it, add proper function comment to do_signal_stop() and make
      it return bool.
      
      -v2: __ptrace_unlink() updated to clear JOBCTL_TRAP_MASK and TRAPPING
           instead of JOBCTL_PENDING_MASK.  This avoids accidentally
           clearing JOBCTL_STOP_CONSUME.  Spotted by Oleg.
      
      -v3: do_signal_stop() updated to return %false without dropping
           siglock while ptraced and TRAP_STOP check moved inside for(;;)
           loop after group stop participation.  This avoids unnecessary
           relocking and also will help avoiding unnecessary traps by
           consuming group stop before handling pending traps.
      
      -v4: Jobctl trap handling moved into a separate function -
           do_jobctl_trap().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      73ddff2b
  3. June 5, 2011 (5 commits)
    • signal: remove three noop tracehooks · dd1d6772
      Committed by Tejun Heo
      Remove the following three noop tracehooks in signal.c.
      
      * tracehook_force_sigpending()
      * tracehook_get_signal()
      * tracehook_finish_jctl()
      
      The code area is about to be updated and these hooks don't do anything
      other than obfuscating the logic.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      dd1d6772
    • job control: introduce task_set_jobctl_pending() · 7dd3db54
      Committed by Tejun Heo
      task->jobctl currently hosts JOBCTL_STOP_PENDING and will host TRAP
      pending bits too.  Setting pending conditions on a dying task may make
      the task unkillable.  Currently, each setting site is responsible for
      checking for the condition but with to-be-added job control traps this
      becomes too fragile.
      
      This patch adds task_set_jobctl_pending() which should be used when
      setting task->jobctl bits to schedule a stop or trap.  The function
      performs the following to ease setting pending bits.
      
      * Sanity checks.
      
      * If fatal signal is pending or PF_EXITING is set, no bit is set.
      
      * STOP_SIGMASK is automatically cleared if new value is being set.
      
      do_signal_stop() and ptrace_attach() are updated to use
      task_set_jobctl_pending() instead of setting STOP_PENDING explicitly.
      The structures surrounding the setting sites are changed to fit
      task_set_jobctl_pending() better, but there should be no userland
      visible behavior difference.
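
      A standalone sketch of that contract (the struct, the flag values and the
      predicate fields below are stand-ins for illustration; only the behaviour
      follows the description above):

        #include <stdbool.h>

        #define JOBCTL_STOP_SIGMASK	0xffffU		/* illustrative layout */
        #define JOBCTL_STOP_PENDING	(1U << 17)

        struct task {				/* stand-in for task_struct */
                unsigned int jobctl;
                bool fatal_signal_pending;
                bool pf_exiting;
        };

        /* returns true if @mask was installed, false if the task is dying */
        static bool task_set_jobctl_pending(struct task *t, unsigned int mask)
        {
                if (t->fatal_signal_pending || t->pf_exiting)
                        return false;	/* never make a dying task unkillable */

                if (mask & JOBCTL_STOP_SIGMASK)	/* a new stop signal replaces the old one */
                        t->jobctl &= ~JOBCTL_STOP_SIGMASK;

                t->jobctl |= mask;
                return true;
        }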
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      7dd3db54
    • job control: introduce JOBCTL_PENDING_MASK and task_clear_jobctl_pending() · 3759a0d9
      Committed by Tejun Heo
      This patch introduces JOBCTL_PENDING_MASK and replaces
      task_clear_jobctl_stop_pending() with task_clear_jobctl_pending()
      which takes an extra @mask argument.
      
      JOBCTL_PENDING_MASK is currently equal to JOBCTL_STOP_PENDING but
      future patches will add more bits.  recalc_sigpending_tsk() is updated
      to use JOBCTL_PENDING_MASK instead.
      
      task_clear_jobctl_pending() takes @mask, which is a subset of
      JOBCTL_PENDING_MASK, and clears the relevant jobctl bits.  If
      JOBCTL_STOP_PENDING is set, other STOP bits are cleared together.  All
      task_clear_jobctl_stop_pending() users are updated to call
      task_clear_jobctl_pending() with JOBCTL_STOP_PENDING which is
      functionally identical to task_clear_jobctl_stop_pending().
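
      A matching standalone sketch of the clearing rule (again with stand-in
      flag values and a stand-in task struct):

        #define JOBCTL_STOP_DEQUEUED	(1U << 16)	/* illustrative values */
        #define JOBCTL_STOP_PENDING	(1U << 17)
        #define JOBCTL_STOP_CONSUME	(1U << 18)

        struct task { unsigned int jobctl; };		/* stand-in for task_struct */

        static void task_clear_jobctl_pending(struct task *t, unsigned int mask)
        {
                /* clearing STOP_PENDING also drops the other STOP bookkeeping bits */
                if (mask & JOBCTL_STOP_PENDING)
                        mask |= JOBCTL_STOP_CONSUME | JOBCTL_STOP_DEQUEUED;

                t->jobctl &= ~mask;
        }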
      
      This patch doesn't cause any functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      3759a0d9
    • ptrace: ptrace_check_attach(): rename @kill to @ignore_state and add comments · 755e276b
      Committed by Tejun Heo
      PTRACE_INTERRUPT is going to be added which should also skip
      task_is_traced() check in ptrace_check_attach().  Rename @kill to
      @ignore_state and make it bool.  Add function comment while at it.
      
      This patch doesn't introduce any behavior difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      755e276b
    • job control: rename signal->group_stop and flags to jobctl and update them · a8f072c1
      Committed by Tejun Heo
      signal->group_stop currently hosts mostly group stop related flags;
      however, it's gonna be used for wider purposes and the GROUP_STOP_
      flag prefix becomes confusing.  Rename signal->group_stop to
      signal->jobctl and rename all GROUP_STOP_* flags to JOBCTL_*.
      
      Bit position macros JOBCTL_*_BIT are defined and JOBCTL_* flags are
      defined in terms of them to allow using bitops later.
      
      While at it, reassign JOBCTL_TRAPPING to bit 22 to better accommodate
      future additions.
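
      The bit/flag pairing looks roughly like the sketch below (the DEQUEUED bit
      number is illustrative; only TRAPPING's new position comes from this
      message):

        #define JOBCTL_STOP_DEQUEUED_BIT	16	/* illustrative */
        #define JOBCTL_TRAPPING_BIT		22	/* reassigned by this patch */

        #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
        #define JOBCTL_TRAPPING		(1UL << JOBCTL_TRAPPING_BIT)

        /* the _BIT form is what bit operations (set_bit()-style) would take later */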
      
      This doesn't cause any functional change.
      
      -v2: JOBCTL_*_BIT macros added as suggested by Linus.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      a8f072c1
  4. June 4, 2011 (1 commit)
    • Revert "tty: make receive_buf() return the amout of bytes received" · 55db4c64
      Committed by Linus Torvalds
      This reverts commit b1c43f82.
      
      It was broken in so many ways, and results in random odd pty issues.
      
      It re-introduced the buggy schedule_work() in flush_to_ldisc() that can
      cause endless work-loops (see commit a5660b41: "tty: fix endless
      work loop when the buffer fills up").
      
      It also used an "unsigned int" return value for the ->receive_buf()
      function, but then made multiple functions return a negative error code,
      and didn't actually check for the error in the caller.
      
      And it didn't actually work at all.  BenH bisected down odd tty behavior
      to it:
        "It looks like the patch is causing some major malfunctions of the X
         server for me, possibly related to PTYs.  For example, cat'ing a
         large file in a gnome terminal hangs the kernel for -minutes- in a
         loop of what looks like flush_to_ldisc/workqueue code, (some ftrace
         data in the quoted bits further down).
      
         ...
      
         Some more data: It -looks- like what happens is that the
         flush_to_ldisc work queue entry constantly re-queues itself (because
         the PTY is full ?) and the workqueue thread will basically loop
         forever calling it without ever scheduling, thus starving the consumer
         process that could have emptied the PTY."
      
      which is pretty much exactly the problem we fixed in a5660b41.
      
      Milton Miller pointed out the 'unsigned int' issue.
      Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Reported-by: Milton Miller <miltonm@bga.com>
      Cc: Stefan Bigler <stefan.bigler@keymile.com>
      Cc: Toby Gray <toby.gray@realvnc.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55db4c64
  5. June 2, 2011 (2 commits)
  6. June 1, 2011 (2 commits)
    • intel-iommu: Enable super page (2MiB, 1GiB, etc.) support · 6dd9a7c7
      Committed by Youquan Song
      There are no externally-visible changes with this. In the loop in the
      internal __domain_mapping() function, we simply detect if we are mapping:
        - size >= 2MiB, and
        - virtual address aligned to 2MiB, and
        - physical address aligned to 2MiB, and
        - on hardware that supports superpages.
      
      (and likewise for larger superpages).
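
      A standalone sketch of that per-iteration eligibility test for the 2MiB
      case (the helper name and its wiring are made up; only the criteria come
      from the list above):

        #include <stdbool.h>
        #include <stdint.h>

        #define SZ_2M	(2ULL << 20)

        static bool can_map_2m_superpage(uint64_t iova, uint64_t phys,
                                         uint64_t remaining, bool hw_has_superpages)
        {
                return hw_has_superpages &&
                       remaining >= SZ_2M &&
                       (iova & (SZ_2M - 1)) == 0 &&
                       (phys & (SZ_2M - 1)) == 0;
        }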
      
      We automatically use a superpage for such mappings. We never have to
      worry about *breaking* superpages, since we trust that we will always
      *unmap* the same range that was mapped. So all we need to do is ensure
      that dma_pte_clear_range() will also cope with superpages.
      
      Adjust pfn_to_dma_pte() to take a superpage 'level' as an argument, so
      it can return a PTE at the appropriate level rather than always
      extending the page tables all the way down to level 1. Again, this is
      simplified by the fact that we should never encounter existing small
      pages when we're creating a mapping; any old mapping that used the same
      virtual range will have been entirely removed and its obsolete page
      tables freed.
      
      Provide an 'intel_iommu=sp_off' argument on the command line as a
      chicken bit. Not that it should ever be required.
      
      ==
      
      The original commit seen in the iommu-2.6.git was Youquan's
      implementation (and completion) of my own half-baked code which I'd
      typed into an email. Followed by half a dozen subsequent 'fixes'.
      
      I've taken the unusual step of rewriting history and collapsing the
      original commits in order to keep the main history simpler, and make
      life easier for the people who are going to have to backport this to
      older kernels. And also so I can give it a more coherent commit comment
      which (hopefully) gives a better explanation of what's going on.
      
      The original sequence of commits leading to identical code was:
      
      Youquan Song (3):
            intel-iommu: super page support
            intel-iommu: Fix superpage alignment calculation error
            intel-iommu: Fix superpage level calculation error in dma_pfn_level_pte()
      
      David Woodhouse (4):
            intel-iommu: Precalculate superpage support for dmar_domain
            intel-iommu: Fix hardware_largepage_caps()
            intel-iommu: Fix inappropriate use of superpages in __domain_mapping()
            intel-iommu: Fix phys_pfn in __domain_mapping for sglist pages
      Signed-off-by: Youquan Song <youquan.song@intel.com>
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      6dd9a7c7
    • mtd: fix physmap.h warnings · 63da0290
      Committed by Randy Dunlap
      Fix build warnings in physmap.h:
      
      include/linux/mtd/physmap.h:25: warning: 'struct platform_device' declared inside parameter list
      include/linux/mtd/physmap.h:25: warning: its scope is only this definition or declaration, which is probably not what you want
      include/linux/mtd/physmap.h:26: warning: 'struct platform_device' declared inside parameter list
      include/linux/mtd/physmap.h:27: warning: 'struct platform_device' declared inside parameter list
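
      The usual fix for this class of warning is a forward declaration ahead of
      the prototypes, roughly as below (the prototype shown is a made-up
      stand-in, not the actual physmap.h content):

        struct platform_device;		/* forward declaration silences the warning */

        int example_map_init(struct platform_device *pdev);	/* hypothetical prototype */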
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      63da0290
  7. May 31, 2011 (1 commit)
  8. May 30, 2011 (11 commits)
  9. May 29, 2011 (7 commits)
    • dm kcopyd: return client directly and not through a pointer · fa34ce73
      Committed by Mikulas Patocka
      Return client directly from dm_kcopyd_client_create, not through a
      parameter, making it consistent with dm_io_client_create.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      fa34ce73
    • dm kcopyd: reserve fewer pages · 5f43ba29
      Committed by Mikulas Patocka
      Reserve just the minimum of pages needed to process one job.
      
      Because we allocate pages from page allocator, we don't need to reserve
      a large number of pages.  The maximum job size is SUB_JOB_SIZE and we
      calculate the number of reserved pages based on this.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      5f43ba29
    • dm io: use fixed initial mempool size · bda8efec
      Committed by Mikulas Patocka
      Replace the arbitrary calculation of an initial io struct mempool size
      with a constant.
      
      The code calculated the number of reserved structures based on the request
      size and used a "magic" multiplication constant of 4.  This patch changes
      it to reserve a fixed number - itself still chosen quite arbitrarily.
      Further testing might show if there is a better number to choose.
      
      Note that if there is no memory pressure, we can still allocate an
      arbitrary number of "struct io" structures.  One structure is enough to
      process the whole request.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      bda8efec
    • dm table: allow targets to support discards internally · 4c259327
      Committed by Mike Snitzer
      Permit a target to support discards regardless of whether or not all its
      underlying devices do.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      4c259327
    • [S390] mm: fix storage key handling · a43a9d93
      Committed by Heiko Carstens
      page_get_storage_key() and page_set_storage_key() expect a page address
      and not its page frame number. This got inconsistent with 2d42552d
      "[S390] merge page_test_dirty and page_clear_dirty".
      
      Result is that we read/write storage keys from random pages and do not
      have a working dirty bit tracking at all.
      E.g. SetPageUptodate() doesn't clear the dirty bit of requested pages, which
      for example ext4 doesn't like very much and panics after a while.
      
      Unable to handle kernel paging request at virtual user address (null)
      Oops: 0004 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Modules linked in:
      CPU: 1 Not tainted 2.6.39-07551-g139f37f5-dirty #152
      Process flush-94:0 (pid: 1576, task: 000000003eb34538, ksp: 000000003c287b70)
      Krnl PSW : 0704c00180000000 0000000000316b12 (jbd2_journal_file_inode+0x10e/0x138)
                 R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
      Krnl GPRS: 0000000000000000 0000000000000000 0000000000000000 0700000000000000
                 0000000000316a62 000000003eb34cd0 0000000000000025 000000003c287b88
                 0000000000000001 000000003c287a70 000000003f1ec678 000000003f1ec000
                 0000000000000000 000000003e66ec00 0000000000316a62 000000003c287988
      Krnl Code: 0000000000316b04: f0a0000407f4       srp     4(11,%r0),2036,0
                 0000000000316b0a: b9020022           ltgr    %r2,%r2
                 0000000000316b0e: a7740015           brc     7,316b38
                >0000000000316b12: e3d0c0000024       stg     %r13,0(%r12)
                 0000000000316b18: 4120c010           la      %r2,16(%r12)
                 0000000000316b1c: 4130d060           la      %r3,96(%r13)
                 0000000000316b20: e340d0600004       lg      %r4,96(%r13)
                 0000000000316b26: c0e50002b567       brasl   %r14,36d5f4
      Call Trace:
      ([<0000000000316a62>] jbd2_journal_file_inode+0x5e/0x138)
       [<00000000002da13c>] mpage_da_map_and_submit+0x2e8/0x42c
       [<00000000002daac2>] ext4_da_writepages+0x2da/0x504
       [<00000000002597e8>] writeback_single_inode+0xf8/0x268
       [<0000000000259f06>] writeback_sb_inodes+0xd2/0x18c
       [<000000000025a700>] writeback_inodes_wb+0x80/0x168
       [<000000000025aa92>] wb_writeback+0x2aa/0x324
       [<000000000025abde>] wb_do_writeback+0xd2/0x274
       [<000000000025ae3a>] bdi_writeback_thread+0xba/0x1c4
       [<00000000001737be>] kthread+0xa6/0xb0
       [<000000000056c1da>] kernel_thread_starter+0x6/0xc
       [<000000000056c1d4>] kernel_thread_starter+0x0/0xc
      INFO: lockdep is turned off.
      Last Breaking-Event-Address:
       [<0000000000316a8a>] jbd2_journal_file_inode+0x86/0x138
      Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      a43a9d93
    • idle governor: Avoid lock acquisition to read pm_qos before entering idle · 333c5ae9
      Committed by Tim Chen
      Thanks to the reviews and comments by Rafael, James, Mark and Andi.
      Here's version 2 of the patch incorporating your comments and also some
      updates to my previous patch comments.
      
      I noticed that before entering idle state, the menu idle governor will
      look up the current pm_qos target value according to the list of qos
      requests received.  This look up currently needs the acquisition of a
      lock to access the list of qos requests to find the qos target value,
      slowing down the entrance into idle state due to contention by multiple
      cpus to access this list.  The contention is severe when there are a lot
      of cpus waking and going into idle.  For example, for a simple workload
      that has 32 pairs of processes ping-ponging messages to each other, where
      64 cpu cores are active in the test system, I see the following profile with
      37.82% of cpu cycles spent in contention of pm_qos_lock:
      
      -     37.82%          swapper  [kernel.kallsyms]          [k]
      _raw_spin_lock_irqsave
         - _raw_spin_lock_irqsave
            - 95.65% pm_qos_request
                 menu_select
                 cpuidle_idle_call
               - cpu_idle
                    99.98% start_secondary
      
      A better approach will be to cache the updated pm_qos target value so
      reading it does not require lock acquisition as in the patch below.
      With this patch the contention for pm_qos_lock is removed and I saw a
      2.2X increase in throughput for my message passing workload.
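
      The caching idea can be sketched standalone with C11 atomics standing in
      for the kernel's atomic_t (the names are illustrative, not the pm_qos
      code):

        #include <stdatomic.h>

        static atomic_int cached_target = 0;

        /* writer side: call with pm_qos_lock held whenever the request list changes */
        static void update_cached_target(int new_target)
        {
                atomic_store_explicit(&cached_target, new_target, memory_order_relaxed);
        }

        /* idle entry path: read the current target without taking any lock */
        static int read_cached_target(void)
        {
                return atomic_load_explicit(&cached_target, memory_order_relaxed);
        }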
      
      cc: stable@kernel.org
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: James Bottomley <James.Bottomley@suse.de>
      Acked-by: mark gross <markgross@thegnar.org>
      Signed-off-by: Len Brown <len.brown@intel.com>
      333c5ae9
    • Cache xattr security drop check for write v2 · 69b45732
      Committed by Andi Kleen
      Some recent benchmarking on btrfs showed that a major scaling bottleneck
      on large systems on btrfs is currently the xattr lookup on every write.
      
      Why xattr lookup on every write I hear you ask?
      
      write wants to drop suid and security related xattrs that could set
      capabilities for executables.  To do that it currently looks up
      security.capability on EVERY write (even for non executables) to decide
      whether to drop it or not.
      
      In btrfs this causes an additional tree walk, hitting some per file system
      locks and quite bad scalability. In a simple read workload on an 8S
      system I saw over 90% CPU time in spinlocks related to that.
      
      Chris Mason tells me this is also a problem in ext4, where it hits
      the global mbcache lock.
      
      This patch adds a simple per inode flag to avoid this problem.  We only
      do the lookup once per file and then, if there is no xattr, cache
      the decision.  All xattr changes clear the flag.
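
      A sketch of the caching pattern (the flag name, the inode struct and the
      lookup helper are stand-ins for illustration, not the kernel's):

        #include <stdbool.h>

        struct inode {			/* stand-in, not the kernel's struct inode */
                bool nosec;		/* cached: "no security xattr on this file" */
        };

        /* the expensive lookup a write would otherwise do every time; stubbed out */
        static bool has_security_xattr(struct inode *inode)
        {
                return false;
        }

        static bool write_needs_xattr_drop(struct inode *inode)
        {
                if (inode->nosec)
                        return false;		/* cached negative result: skip the lookup */
                if (!has_security_xattr(inode)) {
                        inode->nosec = true;	/* cache it; any xattr change must clear this */
                        return false;
                }
                return true;
        }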
      
      I also used the same flag to avoid the suid check, although
      that one is pretty cheap.
      
      A file system can also set this flag when it creates the inode,
      if it has a cheap way to do so.  This is done for some common file systems
      in followon patches.
      
      With this patch a major part of the lock contention disappears
      for btrfs. Some testing on smaller systems didn't show significant
      performance changes, but at least it helps the larger systems
      and is generally more efficient.
      
      v2: Rename is_sgid. add file system helper.
      Cc: chris.mason@oracle.com
      Cc: josef@redhat.com
      Cc: viro@zeniv.linux.org.uk
      Cc: agruen@linbit.com
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      69b45732