1. 18 Jun, 2010 3 commits
    • O
      sched: Fix the racy usage of thread_group_cputimer() in fastpath_timer_check() · 8d1f431c
      Oleg Nesterov committed
      fastpath_timer_check()->thread_group_cputimer() is racy and
      unneeded.
      
      It is racy because another thread can clear ->running before
      thread_group_cputimer() takes cputimer->lock. In this case
      thread_group_cputimer() will set ->running = true again and call
      thread_group_cputime(). But since we do not hold tasklist or
      siglock, we can race with fork/exit and copy the wrong results
      into cputimer->cputime.
      
      It is unneeded because if ->running == true we can just use
      the numbers in cputimer->cputime we already have.
      
      Change fastpath_timer_check() to copy cputimer->cputime into
      the local variable under cputimer->lock. We do not re-check
      ->running under cputimer->lock, run_posix_cpu_timers() does
      this check later.
      
      Note: we can add more optimizations on top of this change.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100611180446.GA13025@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8d1f431c
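
The "copy under the lock" idea in the commit above can be shown outside the kernel. The following is a minimal userspace sketch, assuming a pthread mutex in place of cputimer->lock; the struct and function names are hypothetical, and the expiry check is a simplified stand-in for the kernel's own comparison.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct cputime_sample {
	unsigned long long utime;
	unsigned long long stime;
	unsigned long long sum_exec_runtime;
};

struct group_cputimer {
	pthread_mutex_t lock;
	bool running;
	struct cputime_sample cputime;
};

/*
 * Copy the accumulated totals into *out while holding the lock, so the
 * caller never sees a half-updated sample. Nothing is recomputed and
 * ->running is neither read nor written here.
 */
static void cputimer_snapshot(struct group_cputimer *t,
			      struct cputime_sample *out)
{
	pthread_mutex_lock(&t->lock);
	*out = t->cputime;
	pthread_mutex_unlock(&t->lock);
}

/*
 * Simplified fast-path check (an assumption, not the kernel's exact
 * rule): a zero expiry field means "no timer armed on that clock".
 */
static bool sample_expired(const struct cputime_sample *s,
			   const struct cputime_sample *expires)
{
	return (expires->utime && s->utime >= expires->utime) ||
	       (expires->stime && s->stime >= expires->stime) ||
	       (expires->sum_exec_runtime &&
		s->sum_exec_runtime >= expires->sum_exec_runtime);
}
```

The caller works only on its local copy after the lock is dropped, which mirrors the commit's point that the numbers already held in the timer are sufficient for the fast path.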
    • O
      sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand() · 0bdd2ed4
      Oleg Nesterov committed
      run_posix_cpu_timers() doesn't work if current has already passed
      exit_notify(). This was needed to prevent the races with do_wait().
      
      Since ea6d290c ->signal is always valid and can't go away. We can
      remove the "tsk->exit_state == 0" in fastpath_timer_check() and
      convert run_posix_cpu_timers() to use lock_task_sighand().
      
      Note: it makes sense to take group_leader's sighand instead, the
      sub-thread still uses CPU after release_task(). But we need more
      changes to do this.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100610231018.GA25942@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      0bdd2ed4
    • O
      sched: thread_group_cputime: Simplify, document the "alive" check · bfac7009
      Oleg Nesterov committed
      thread_group_cputime() looks as if it is rcu-safe, but in fact this
      was wrong until ea6d290c which pins task->signal to task_struct.
      It checks ->sighand != NULL under rcu, but this can't help if ->signal
      can go away. Fortunately the caller either holds ->siglock, or it is
      fastpath_timer_check() which uses current and checks exit_state == 0.
      
      - Since ea6d290c commit tsk->signal is stable, we can read it first
        and avoid the initialization from INIT_CPUTIME.
      
      - Even if tsk->signal is always valid, we still have to check it
        is safe to use next_thread() under rcu_read_lock(). Currently
        the code checks ->sighand != NULL, change it to use pid_alive()
        which is commonly used to ensure the task wasn't unhashed before
        we take rcu_read_lock().
      
        Add the comment to explain this check.
      
      - Change the main loop to use the while_each_thread() helper.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100610230956.GA25921@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      bfac7009
  2. 28 May, 2010 1 commit
    • O
      posix-cpu-timers: avoid "task->signal != NULL" checks · d30fda35
      Oleg Nesterov committed
      Preparation to make task->signal immutable, no functional changes.
      
      posix-cpu-timers.c checks task->signal != NULL to ensure this task is
      alive and didn't pass __exit_signal().  This is correct but we are going
      to change the lifetime rules for ->signal and never reset this pointer.
      
      Change the code to check ->sighand instead, it doesn't matter which
      pointer we check under tasklist, they both are cleared simultaneously.
      
      As Roland pointed out, some of these changes are not strictly needed and
      probably it makes sense to revert them later, when ->signal will be pinned
      to task_struct.  But this patch tries to ensure the subsequent changes in
      fork/exit can't make any visible impact on posix cpu timers.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d30fda35
  3. 10 May, 2010 1 commit
    • S
      posix-cpu-timers: Optimize run_posix_cpu_timers() · 29f87b79
      Stanislaw Gruszka committed
      We can optimize and simplify things taking into account signal->cputimer
      is always running when we have configured any process wide cpu timer.
      
      In check_process_timers(), we don't have to check if new updated value of
      signal->cputime_expires is smaller, since we maintain new first expiration
      time ({prof,virt,sched}_expires) in code flow and all other writes to
      expiration cache are protected by sighand->siglock .
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      29f87b79
  4. 13 Mar, 2010 6 commits
    • S
      cpu-timers: Avoid iterating over all threads in fastpath_timer_check() · c2873937
      Stanislaw Gruszka committed
      Extend the p->sighand->siglock locking scope to make sure that
      fastpath_timer_check() never iterates over all threads. Without
      locking there is a small possibility that signal->cputimer will stop
      running while we write values to signal->cputime_expires.
      
      Calling thread_group_cputime() from fastpath_timer_check() is not only
      bad because it is slow, it is also racy with __exit_signal(), which can
      lead to invalid signal->{s,u}time values.
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      c2873937
    • S
      cpu-timers: Change SIGEV_NONE timer implementation · 1f169f84
      Stanislaw Gruszka committed
      When a user sets up a timer without an associated signal, and the
      process does not use any other cpu timers and does not exit,
      tsk->signal->cputimer is enabled and keeps running forever.
      
      Avoid running the timer for no reason.
      
      I used the program below to check that the patch does not break
      current user-space-visible behavior.
      
       #include <sys/time.h>
       #include <signal.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <time.h>
       #include <unistd.h>
       #include <assert.h>
      
       void consume_cpu(void)
       {
      	int i = 0;
      	int count = 0;
      
      	for(i=0; i<100000000; i++)
      		count++;
       }
      
       int main(void)
       {
      	int i;
      	struct sigaction act;
      	struct sigevent evt = { };
      	timer_t tid;
      	struct itimerspec spec = { };
      
      	evt.sigev_notify = SIGEV_NONE;
      	assert(timer_create(CLOCK_PROCESS_CPUTIME_ID, &evt,  &tid) == 0);
      
      	spec.it_value.tv_sec = 10;
      	assert(timer_settime(tid, 0, &spec,  NULL) == 0);
      
      	for (i = 0; i < 30; i++) {
      		consume_cpu();
      		memset(&spec, 0, sizeof(spec));
      		assert(timer_gettime(tid, &spec) == 0);
      		printf("%lu.%09lu\n",
      			(unsigned long) spec.it_value.tv_sec,
      			(unsigned long) spec.it_value.tv_nsec);
      	}
      
      	assert(timer_delete(tid) == 0);
      	return 0;
       }
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      1f169f84
    • S
      cpu-timers: Return correct previous timer reload value · ae1a78ee
      Stanislaw Gruszka committed
      According to POSIX, we need to correctly set the old timer's it_interval
      value when the user requests it in timer_settime().  Tested using the
      program below.
      
       #include <sys/time.h>
       #include <signal.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <time.h>
       #include <unistd.h>
       #include <assert.h>
      
       int main(void)
       {
      	struct sigaction act;
      	struct sigevent evt = { };
      	timer_t tid;
      	struct itimerspec spec, u_spec, k_spec;
      
      	evt.sigev_notify = SIGEV_SIGNAL;
      	evt.sigev_signo = SIGPROF;
      	assert(timer_create(CLOCK_PROCESS_CPUTIME_ID, &evt,  &tid) == 0);
      
      	spec.it_value.tv_sec = 1;
      	spec.it_value.tv_nsec = 2;
      	spec.it_interval.tv_sec = 3;
      	spec.it_interval.tv_nsec = 4;
      	u_spec = spec;
      	assert(timer_settime(tid, 0, &spec,  NULL) == 0);
      
      	spec.it_value.tv_sec = 5;
      	spec.it_value.tv_nsec = 6;
      	spec.it_interval.tv_sec = 7;
      	spec.it_interval.tv_nsec = 8;
      	assert(timer_settime(tid, 0, &spec,  &k_spec) == 0);
      
       #define PRT(val) printf(#val ":\t%d/%d\n", (int) u_spec.val, (int) k_spec.val)
      	PRT(it_value.tv_sec);
      	PRT(it_value.tv_nsec);
      	PRT(it_interval.tv_sec);
      	PRT(it_interval.tv_nsec);
      
      	return 0;
       }
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      ae1a78ee
    • S
      cpu-timers: Cleanup arm_timer() · 5eb9aa64
      Stanislaw Gruszka committed
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      5eb9aa64
    • S
      cpu-timers: Simplify RLIMIT_CPU handling · f55db609
      Stanislaw Gruszka committed
      Always set the signal->cputime_expires expiration cache when setting
      a new itimer, POSIX.1b timer, or RLIMIT_CPU.  Since we are
      initializing the prof_exp expiration cache during fork(), this allows us
      to remove the "RLIMIT_CPU != inf" check from fastpath_timer_check() and
      do some other cleanups.
      
      Checked against regression using test cases from:
      http://marc.info/?l=linux-kernel&m=123749066504641&w=4
      http://marc.info/?l=linux-kernel&m=123811277916642&w=2
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      f55db609
    • S
      posix-cpu-timers: Reset expire cache when no timer is running · 15365c10
      Stanislaw Gruszka committed
      When a process deletes a cpu timer or a timer expires, we do not clear
      the expiration cache sig->cputime_expires.
      
      As a result, fastpath_timer_check(), which should keep us from looping
      over all threads when no timer is active, does not work, and we run the
      slow path needlessly on every tick.
      
      Zero sig->cputime_expires in stop_process_timers().
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Spencer Candland <spencer@bluehost.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      15365c10
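
Why a stale expiration cache defeats the fast path can be seen in a small model. This is a sketch under assumptions: the struct, the three-field layout and the function names are hypothetical, and a zero field is taken to mean "no timer armed on that clock".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the cached earliest expirations. */
struct times {
	unsigned long long prof;
	unsigned long long virt;
	unsigned long long sched;
};

static bool cache_is_empty(const struct times *expires)
{
	return !expires->prof && !expires->virt && !expires->sched;
}

/*
 * Fast path: with an empty cache there is nothing to check, so the
 * expensive per-thread walk is skipped. Otherwise the slow path is
 * needed as soon as any cached expiry has been reached.
 */
static bool needs_slow_path(const struct times *expires,
			    const struct times *now)
{
	if (cache_is_empty(expires))
		return false;
	return (expires->prof && now->prof >= expires->prof) ||
	       (expires->virt && now->virt >= expires->virt) ||
	       (expires->sched && now->sched >= expires->sched);
}

/* What stopping the timers has to do: forget the cached expirations. */
static void stop_timers(struct times *expires)
{
	expires->prof = expires->virt = expires->sched = 0;
}
```

If the cache still holds an expiry that lies in the past, needs_slow_path() returns true on every call; once stop_timers() has zeroed the cache, the same call returns false.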
  5. 07 Mar, 2010 2 commits
  6. 18 Nov, 2009 1 commit
  7. 29 Aug, 2009 1 commit
    • X
      itimers: Add tracepoints for itimer · 3f0a525e
      Xiao Guangrong committed
      Add tracepoints for all itimer variants: ITIMER_REAL, ITIMER_VIRTUAL
      and ITIMER_PROF.
      
      [ tglx: Fixed comments and made the output more readable, parseable
        	and consistent. Replaced pid_vnr by pid_nr because the hrtimer
        	callback can happen in any namespace ]
      Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Zhaolei <zhaolei@cn.fujitsu.com>
      LKML-Reference: <4A7F8B6E.2010109@cn.fujitsu.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      3f0a525e
  8. 09 Aug, 2009 1 commit
  9. 03 Aug, 2009 4 commits
  10. 30 Apr, 2009 1 commit
  11. 08 Apr, 2009 1 commit
    • O
      posix-timers: fix RLIMIT_CPU && setitimer(CPUCLOCK_PROF) · 8f2e5865
      Oleg Nesterov committed
      update_rlimit_cpu() tries to optimize out set_process_cpu_timer() in case
      when we already have CPUCLOCK_PROF timer which should expire first. But it
      uses cputime_lt() instead of cputime_gt().
      
      Test case:
      
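      	#include <assert.h>
      	#include <sys/resource.h>
      	#include <sys/time.h>
      
      	int main(void)
      	{
      		/* ITIMER_PROF far in the future: it must not mask RLIMIT_CPU. */
      		struct itimerval it = {
      			.it_value = { .tv_sec = 1000 },
      		};
      
      		assert(!setitimer(ITIMER_PROF, &it, NULL));
      
      		/* One second of CPU time, soft and hard limit. */
      		struct rlimit rl = {
      			.rlim_cur = 1,
      			.rlim_max = 1,
      		};
      
      		assert(!setrlimit(RLIMIT_CPU, &rl));
      
      		/* Burn CPU until the kernel kills us. */
      		for (;;)
      			;
      
      		return 0;
      	}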
      
      Without this patch, the task is not killed as RLIMIT_CPU demands.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Peter Lojkin <ia6432@inbox.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: stable@kernel.org
      LKML-Reference: <20090327000610.GA10108@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8f2e5865
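
The direction of the comparison is the whole fix, so it may help to see it in isolation. The helper below is a hypothetical userspace sketch, not the kernel's update_rlimit_cpu(); the name rlimit_needs_arming() and the plain integer time type are assumptions, and a cached expiry of zero is taken to mean that no CPUCLOCK_PROF timer is armed.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether a new RLIMIT_CPU deadline has to be armed.
 *
 * prof_exp: cached earliest CPUCLOCK_PROF expiry, 0 if none (assumed).
 * rlim_exp: deadline implied by the new RLIMIT_CPU value.
 */
static bool rlimit_needs_arming(unsigned long long prof_exp,
				unsigned long long rlim_exp)
{
	/*
	 * Arming can be skipped only when an existing timer already
	 * expires at or before the limit. A "less than" test here would
	 * skip exactly the case that matters: a far-away profiling timer
	 * combined with a near limit.
	 */
	return prof_exp == 0 || prof_exp > rlim_exp;
}
```

With the values from the test case above (a profiling timer at 1000 seconds and a limit of 1 second), rlimit_needs_arming(1000, 1) is true, so the limit is armed and the task gets killed as RLIMIT_CPU demands.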
  12. 01 Apr, 2009 1 commit
    • H
      posixtimers, sched: Fix posix clock monotonicity · c5f8d995
      Hidetoshi Seto committed
      Impact: Regression fix (against clock_gettime() backwarding bug)
      
      This patch re-introduces a couple of functions, task_sched_runtime
      and thread_group_sched_runtime, which were removed around
      2.6.28-rc1.
      
      These functions protect the sampling of the thread/process clock with
      the rq lock.  Holding the rq lock ensures that rq->clock is not updated
      during the sampling.
      
      i.e.
        The clock_gettime() may return
         ((accounted runtime before update) + (delta after update))
        that is less than what it should be.
      
      v2 -> v3:
      	- Rename static helper function __task_delta_exec()
      	  to do_task_delta_exec() since -tip tree already has
      	  a __task_delta_exec() of different version.
      
      v1 -> v2:
      	- Revises comments of function and patch description.
      	- Add note about accuracy of thread group's runtime.
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org	[2.6.28.x][2.6.29.x]
      LKML-Reference: <49D1CC93.4080401@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c5f8d995
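
From user space, the regression described above would show up as a process CPU clock that occasionally steps backwards. A minimal check, assuming only the POSIX clock_gettime() interface, could look like the sketch below; the helper names are hypothetical, and passing this check does not prove the absence of the race, it only fails to observe it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Convert a timespec to nanoseconds for easy comparison. */
static long long ts_to_ns(const struct timespec *ts)
{
	return (long long)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/*
 * Read CLOCK_PROCESS_CPUTIME_ID `samples` times in a row and report
 * whether every reading was greater than or equal to the previous one.
 */
static bool process_clock_is_monotonic(int samples)
{
	struct timespec ts;
	long long prev = 0;

	for (int i = 0; i < samples; i++) {
		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) != 0)
			return false;
		long long now = ts_to_ns(&ts);
		if (now < prev)
			return false;
		prev = now;
	}
	return true;
}
```

Calling process_clock_is_monotonic() in a tight loop from several threads is the kind of load under which a backwards step would be most likely to appear.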
  13. 24 Mar, 2009 1 commit
    • O
      posix timers: fix RLIMIT_CPU && fork() · 37bebc70
      Oleg Nesterov committed
      See http://bugzilla.kernel.org/show_bug.cgi?id=12911
      
      copy_signal() copies signal->rlim, but RLIMIT_CPU is "lost". Because
      posix_cpu_timers_init_group() sets cputime_expires.prof_exp = 0 and thus
      fastpath_timer_check() returns false unless we have other cpu timers.
      
      This is the minimal fix for 2.6.29 (tested) and 2.6.28. The patch is not
      optimal, we need further cleanups here. With this patch update_rlimit_cpu()
      is not really needed, but I don't think it should be removed.
      
      The proper fix (I think) is:
      
      	- set_process_cpu_timer() should just start the cputimer->running
      	  logic (it does), no need to change cputime_expires.xxx_exp
      
      	- posix_cpu_timers_init_group() should set ->running when needed
      
      	- fastpath_timer_check() can check ->running instead of
      	  task_cputime_zero(signal->cputime_expires)
      Reported-by: Peter Lojkin <ia6432@inbox.ru>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: <stable@kernel.org> [for 2.6.29.x]
      LKML-Reference: <20090323193411.GA17514@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      37bebc70
  14. 13 Feb, 2009 1 commit
    • P
      timers: more consistently use clock vs timer · 3997ad31
      Peter Zijlstra committed
      While reviewing the manpages, I noticed I'd missed some clock vs timer sites.
      
      Make sure that all timer functions call cpu_timer_sample_group() and not
      cpu_clock_sample_group(). This ensures that we enable the process wide timer
      in time, and therefore pay the O(n) thread group cost from the syscall.
      
      Not doing it here, will result in the first jiffy tick after setting the timer
      doing this, resulting in a very expensive tick (but only once) and a delay in
      actually starting the timer.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      3997ad31
  15. 11 Feb, 2009 2 commits
  16. 05 Feb, 2009 1 commit
    • P
      timers: split process wide cpu clocks/timers · 4cd4c1b4
      Peter Zijlstra committed
      Change the process wide cpu timers/clocks so that we:
      
       1) don't mess up the kernel with too many threads,
       2) don't have a per-cpu allocation for each process,
       3) have no impact when not used.
      
      In order to accomplish this we're going to split it into two parts:
      
       - clocks; which can take all the time they want since they run
                 from user context -- ie. sys_clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
      
       - timers; which need constant time sampling but since they're
                 explicitly used, the user can pay the overhead.
      
      The clock readout will go back to a full sum of the thread group, while the
      timers will run off a global 'clock' that only runs when needed, so only
      programs that make use of the facility pay the price.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      4cd4c1b4
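
The clock/timer split described in the commit above can be modelled in a few lines. This is a simplified single-threaded sketch, assuming one runtime counter per thread; all type and function names are hypothetical and no locking is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct thread_model {
	unsigned long long runtime;
};

struct process_model {
	struct thread_model *threads;
	size_t nr_threads;
	bool timer_running;		/* process-wide timer clock enabled? */
	unsigned long long timer_clock;	/* only maintained while running */
};

/* Clock readout: a full O(n) sum over the thread group. */
static unsigned long long clock_read(const struct process_model *p)
{
	unsigned long long sum = 0;

	for (size_t i = 0; i < p->nr_threads; i++)
		sum += p->threads[i].runtime;
	return sum;
}

/* Arming a timer pays the O(n) cost once to seed the running clock. */
static void timer_start(struct process_model *p)
{
	if (!p->timer_running) {
		p->timer_clock = clock_read(p);
		p->timer_running = true;
	}
}

/* Per-tick accounting stays O(1); unused timers cost nothing extra. */
static void account_runtime(struct process_model *p, size_t tid,
			    unsigned long long delta)
{
	p->threads[tid].runtime += delta;
	if (p->timer_running)
		p->timer_clock += delta;
}
```

The design choice is that processes which never arm a process-wide timer only ever pay for the per-thread update, while the summing loop runs in the caller's context, either on a clock read or when a timer is first started.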
  17. 08 Jan, 2009 1 commit
    • P
      itimers: remove the per-cpu-ish-ness · 490dea45
      Peter Zijlstra committed
      Either we bounce one cacheline per cpu per tick, yielding n^2 bounces
      or we just bounce a single one.
      
      Also, using per-cpu allocations for the thread-groups complicates the
      per-cpu allocator in that it's currently aimed to be a fixed sized
      allocator and the only possible extension to that would be vmap based,
      which is seriously constrained on 32 bit archs.
      
      So making the per-cpu memory requirement depend on the number of
      processes is an issue.
      
      Lastly, it didn't deal with cpu-hotplug, although admittedly that might
      be fixable.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      490dea45
  18. 24 Nov, 2008 1 commit
  19. 17 Nov, 2008 2 commits
    • O
      thread_group_cputime: kill the bogus ->signal != NULL check · ce394471
      Oleg Nesterov committed
      Impact: simplify the code
      
      thread_group_cputime() is called by current when it must have the valid
      ->signal, or under ->siglock, or under tasklist_lock after the ->signal
      check, or the caller is wait_task_zombie() which reaps the child. In any
      case ->signal can't be NULL.
      
      But the point of this patch is not optimization. If it is possible to call
      thread_group_cputime() when ->signal == NULL we are doing something wrong,
      and we should not mask the problem. thread_group_cputime() fills *times
      and the caller will use it, if we silently use task_struct->*times* we
      report the wrong values.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ce394471
    • O
      sched, signals: fix the racy usage of ->signal in account_group_xxx/run_posix_cpu_timers · ad133ba3
      Oleg Nesterov committed
      Impact: fix potential NULL dereference
      
      Contrary to ad474cac changelog, other
      acct_group_xxx() helpers can be called after exit_notify() by timer tick.
      Thanks to Roland for pointing out this. Somehow I missed this simple fact
      when I read the original patch, and I am afraid I confused Frank during
      the discussion. Sorry.
      
      Fortunately, these helpers work with current, we can check ->exit_state
      to ensure that ->signal can't go away under us.
      
      Also, add the comment and compiler barrier to account_group_exec_runtime(),
      to make sure we load ->signal only once.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ad133ba3
  20. 23 Sep, 2008 1 commit
    • F
      timers: fix itimer/many thread hang, v2 · bb34d92f
      Frank Mayhar committed
      This is the second resubmission of the posix timer rework patch, posted
      a few days ago.
      
      This includes the changes from the previous resubmission, which addressed
      Oleg Nesterov's comments, removing the RCU stuff from the patch and
      un-inlining the thread_group_cputime() function for SMP.
      
      In addition, per Ingo Molnar it simplifies the UP code, consolidating much
      of it with the SMP version and depending on lower-level SMP/UP handling to
      take care of the differences.
      
      It also cleans up some UP compile errors, moves the scheduler stats-related
      macros into kernel/sched_stats.h, cleans up a merge error in
      kernel/fork.c and has a few other minor fixes and cleanups as suggested
      by Oleg and Ingo. Thanks for the review, guys.
      Signed-off-by: Frank Mayhar <fmayhar@google.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      bb34d92f
  21. 14 Sep, 2008 2 commits
    • I
      timers: fix itimer/many thread hang, cleanups · 5ce73a4a
      Ingo Molnar committed
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      5ce73a4a
    • F
      timers: fix itimer/many thread hang · f06febc9
      Frank Mayhar committed
      Overview
      
      This patch reworks the handling of POSIX CPU timers, including the
      ITIMER_PROF, ITIMER_VIRT timers and rlimit handling.  It was put together
      with the help of Roland McGrath, the owner and original writer of this code.
      
      The problem we ran into, and the reason for this rework, has to do with using
      a profiling timer in a process with a large number of threads.  It appears
      that the performance of the old implementation of run_posix_cpu_timers() was
      at least O(n*3) (where "n" is the number of threads in a process) or worse.
      Everything is fine with an increasing number of threads until the time taken
      for that routine to run becomes the same as or greater than the tick time, at
      which point things degrade rather quickly.
      
      This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."
      
      Code Changes
      
      This rework corrects the implementation of run_posix_cpu_timers() to make it
      run in constant time for a particular machine.  (Performance may vary between
      one machine and another depending upon whether the kernel is built as single-
      or multiprocessor and, in the latter case, depending upon the number of
      running processors.)  To do this, at each tick we now update fields in
      signal_struct as well as task_struct.  The run_posix_cpu_timers() function
      uses those fields to make its decisions.
      
      We define a new structure, "task_cputime," to contain user, system and
      scheduler times and use these in appropriate places:
      
      struct task_cputime {
      	cputime_t utime;
      	cputime_t stime;
      	unsigned long long sum_exec_runtime;
      };
      
      This is included in the structure "thread_group_cputime," which is a new
      substructure of signal_struct and which varies for uniprocessor versus
      multiprocessor kernels.  For uniprocessor kernels, it uses "task_cputime" as
      a simple substructure, while for multiprocessor kernels it is a pointer:
      
      struct thread_group_cputime {
      	struct task_cputime totals;
      };
      
      struct thread_group_cputime {
      	struct task_cputime *totals;
      };
      
      We also add a new task_cputime substructure directly to signal_struct, to
      cache the earliest expiration of process-wide timers, and task_cputime also
      replaces the it_*_expires fields of task_struct (used for earliest expiration
      of thread timers).  The "thread_group_cputime" structure contains process-wide
      timers that are updated via account_user_time() and friends.  In the non-SMP
      case the structure is a simple aggregator; unfortunately in the SMP case that
      simplicity was not achievable due to cache-line contention between CPUs (in
      one measured case performance was actually _worse_ on a 16-cpu system than
      the same test on a 4-cpu system, due to this contention).  For SMP, the
      thread_group_cputime counters are maintained as a per-cpu structure allocated
      using alloc_percpu().  The timer functions update only the timer field in
      the structure corresponding to the running CPU, obtained using per_cpu_ptr().
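
The per-cpu scheme described in this paragraph can be sketched in plain C. This is an illustration under assumptions, not the patch's code: a fixed-size array stands in for alloc_percpu(), the CPU index is passed in explicitly instead of being obtained via per_cpu_ptr(), and every name ending in _sketch, as well as both function names, is hypothetical.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4	/* assumed fixed CPU count for the sketch */

struct task_cputime_sketch {
	unsigned long long utime;
	unsigned long long stime;
	unsigned long long sum_exec_runtime;
};

/* One slot per CPU, so concurrent updaters never share a counter. */
struct group_cputime_sketch {
	struct task_cputime_sketch totals[NR_CPUS_SKETCH];
};

/* Update path: touch only the slot of the CPU we are running on. */
static void group_account_user_time(struct group_cputime_sketch *g,
				    int cpu, unsigned long long delta)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	g->totals[cpu].utime += delta;
}

/* Read path: sum every slot into a single snapshot. */
static void group_cputime_read(const struct group_cputime_sketch *g,
			       struct task_cputime_sketch *out)
{
	out->utime = out->stime = out->sum_exec_runtime = 0;
	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		out->utime += g->totals[cpu].utime;
		out->stime += g->totals[cpu].stime;
		out->sum_exec_runtime += g->totals[cpu].sum_exec_runtime;
	}
}
```

The trade-off is cheap, contention-free updates against a read that costs one pass over all CPU slots and memory that grows with the number of CPUs per process.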
      
      We define a set of inline functions in sched.h that we use to maintain the
      thread_group_cputime structure and hide the differences between UP and SMP
      implementations from the rest of the kernel.  The thread_group_cputime_init()
      function initializes the thread_group_cputime structure for the given task.
      The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the
      out-of-line function thread_group_cputime_alloc_smp() to allocate and fill
      in the per-cpu structures and fields.  The thread_group_cputime_free()
      function, also a no-op for UP, in SMP frees the per-cpu structures.  The
      thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls
      thread_group_cputime_alloc() if the per-cpu structures haven't yet been
      allocated.  The thread_group_cputime() function fills the task_cputime
      structure it is passed with the contents of the thread_group_cputime fields;
      in UP it's that simple but in SMP it must also safely check that tsk->signal
      is non-NULL (if it is it just uses the appropriate fields of task_struct) and,
      if so, sums the per-cpu values for each online CPU.  Finally, the three
      functions account_group_user_time(), account_group_system_time() and
      account_group_exec_runtime() are used by timer functions to update the
      respective fields of the thread_group_cputime structure.
      
      Non-SMP operation is trivial and will not be mentioned further.
      
      The per-cpu structure is always allocated when a task creates its first new
      thread, via a call to thread_group_cputime_clone_thread() from copy_signal().
      It is freed at process exit via a call to thread_group_cputime_free() from
      cleanup_signal().
      
      All functions that formerly summed utime/stime/sum_sched_runtime values from
      all threads in the thread group now use thread_group_cputime() to
      snapshot the values in the thread_group_cputime structure or the values in
      the task structure itself if the per-cpu structure hasn't been allocated.
      
      Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit.
      The run_posix_cpu_timers() function has been split into a fast path and a
      slow path; the former safely checks whether there are any expired thread
      timers and, if not, just returns, while the slow path does the heavy lifting.
      With the dedicated thread group fields, timers are no longer "rebalanced" and
      the process_timer_rebalance() function and related code has gone away.  All
      summing loops are gone and all code that used them now uses the
      thread_group_cputime() inline.  When process-wide timers are set, the new
      task_cputime structure in signal_struct is used to cache the earliest
      expiration; this is checked in the fast path.
      
      Performance
      
      The fix appears not to add significant overhead to existing operations.  It
      generally performs the same as the current code except in two cases, one in
      which it performs slightly worse (Case 5 below) and one in which it performs
      very significantly better (Case 2 below).  Overall it's a wash except in those
      two cases.
      
      I've since done somewhat more involved testing on a dual-core Opteron system.
      
      Case 1: With no itimer running, for a test with 100,000 threads, the fixed
      	kernel took 1428.5 seconds, 513 seconds more than the unfixed system,
      	all of which was spent in the system.  There were twice as many
      	voluntary context switches with the fix as without it.
      
      Case 2: With an itimer running at .01 second ticks and 4000 threads (the most
      	an unmodified kernel can handle), the fixed kernel ran the test in
      	eight percent of the time (5.8 seconds as opposed to 70 seconds) and
      	had better tick accuracy (.012 seconds per tick as opposed to .023
      	seconds per tick).
      
      Case 3: A 4000-thread test with an initial timer tick of .01 second and an
      	interval of 10,000 seconds (i.e. a timer that ticks only once) had
      	very nearly the same performance in both cases:  6.3 seconds elapsed
      	for the fixed kernel versus 5.5 seconds for the unfixed kernel.
      
      With fewer threads (eight in these tests), the Case 1 test ran in essentially
      the same time on both the modified and unmodified kernels (5.2 seconds versus
      5.8 seconds).  The Case 2 test ran in about the same time as well, 5.9 seconds
      versus 5.4 seconds but again with much better tick accuracy, .013 seconds per
      tick versus .025 seconds per tick for the unmodified kernel.
      
      Since the fix affected the rlimit code, I also tested soft and hard CPU limits.
      
      Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer
      	running), the modified kernel was very slightly favored in that while
      	it killed the process in 19.997 seconds of CPU time (5.002 seconds of
      	wall time), only .003 seconds of that was system time, the rest was
      	user time.  The unmodified kernel killed the process in 20.001 seconds
      	of CPU (5.014 seconds of wall time) of which .016 seconds was system
      	time.  Really, though, the results were too close to call.  The results
      	were essentially the same with no itimer running.
      
      Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds
      	(where the hard limit would never be reached) and an itimer running,
      	the modified kernel exhibited worse tick accuracy than the unmodified
      	kernel: .050 seconds/tick versus .028 seconds/tick.  Otherwise,
      	performance was almost indistinguishable.  With no itimer running this
      	test exhibited virtually identical behavior and times in both cases.
      
      In times past I did some limited performance testing.  Those results are below.
      
      On a four-cpu Opteron system without this fix, a sixteen-thread test executed
      in 3569.991 seconds, of which user was 3568.435s and system was 1.556s.  On
      the same system with the fix, user and elapsed time were about the same, but
      system time dropped to 0.007 seconds.  Performance with eight, four and one
      thread was comparable.  Interestingly, the timer ticks with the fix seemed
      more accurate:  The sixteen-thread test with the fix received 149543 ticks
      for 0.024 seconds per tick, while the same test without the fix received 58720
      for 0.061 seconds per tick.  Both cases were configured for an interval of
      0.01 seconds.  Again, the other tests were comparable.  Each thread in this
      test computed the primes up to 25,000,000.
      
      I also did a test with a large number of threads, 100,000 threads, which is
      impossible without the fix.  In this case each thread computed the primes only
      up to 10,000 (to make the runtime manageable).  System time dominated, at
      1546.968 seconds out of a total 2176.906 seconds (giving a user time of
      629.938s).  It received 147651 ticks for 0.015 seconds per tick, still quite
      accurate.  There is obviously no comparable test without the fix.
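
The seconds-per-tick figures quoted above appear to be the reported total time divided by the number of ticks received. A minimal sketch of that arithmetic, where seconds_per_tick() is a hypothetical helper and not part of the patch:

```c
#include <assert.h>

/*
 * Average interval between timer ticks: total time divided by the
 * number of ticks delivered.  Hypothetical helper for illustration.
 */
static double seconds_per_tick(double total_seconds, unsigned long ticks)
{
    assert(ticks > 0);
    return total_seconds / (double)ticks;
}
```

With the numbers from the text, seconds_per_tick(3569.991, 149543) is about 0.024, seconds_per_tick(3569.991, 58720) is about 0.061, and seconds_per_tick(2176.906, 147651) is about 0.015, against a configured interval of 0.01 seconds.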
      Signed-off-by: Frank Mayhar <fmayhar@google.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f06febc9
  22. 25 May, 2008 1 commit
  23. 01 May, 2008 1 commit
    • R
      remove div_long_long_rem · f8bd2258
      Roman Zippel committed
      x86 is currently the only arch that provides an optimized version of
      div_long_long_rem, and it has the downside that one has to be very careful that
      the divide doesn't overflow.
      
      The API is a little awkward, as the arguments for the unsigned divide are
      signed.  The signed version also doesn't handle a negative divisor and
      produces worse code on 64bit archs.
      
      There is little incentive to keep this API alive, so this converts the few
      users to the new API.
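
The message does not name the replacement API. Assuming it refers to helpers in the style of div_u64_rem() and div_s64_rem() from linux/math64.h, their semantics can be sketched in portable userspace C as below. The _sketch names are hypothetical, and this shows only the interface (matching signedness for arguments, remainder returned through a pointer), not the kernel's per-architecture implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 64-by-32 divide: all arguments are unsigned. */
static uint64_t div_u64_rem_sketch(uint64_t dividend, uint32_t divisor,
                                   uint32_t *remainder)
{
    assert(divisor != 0);
    *remainder = (uint32_t)(dividend % divisor);
    return dividend / divisor;
}

/*
 * Signed 64-by-32 divide.  C division truncates toward zero, so the
 * remainder takes the sign of the dividend and a negative divisor is
 * handled.  INT64_MIN / -1 overflows and is not handled here.
 */
static int64_t div_s64_rem_sketch(int64_t dividend, int32_t divisor,
                                  int32_t *remainder)
{
    assert(divisor != 0);
    *remainder = (int32_t)(dividend % divisor);
    return dividend / divisor;
}
```

A typical use is splitting a nanosecond count into seconds and leftover nanoseconds: dividing 3500000000 by 1000000000 with div_u64_rem_sketch() yields 3 with a remainder of 500000000.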
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: john stultz <johnstul@us.ibm.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f8bd2258
  24. 17 April, 2008 1 commit
  25. 09 February, 2008 1 commit
  26. 26 January, 2008 1 commit