1. 09 12月, 2013 6 次提交
    • F
      posix-timers: Consolidate posix_cpu_clock_get() · 33ab0fec
      Frederic Weisbecker 提交于
      Consolidate the clock sampling common code used for both local
      and remote targets.
      
      Note that this introduces a tiny user ABI change: if a
      PID is passed to clock_gettime() along the clockid,
      we used to forbid a process wide clock sample when that
      PID doesn't belong to a group leader. Now after this patch
      we allow process wide clock samples if that PID belongs to
      the current task, even if the current task is not the
      group leader.
      
      But local process wide clock samples are allowed if PID == 0
      (current task) even if the current task is not the group leader.
      So in the end this should be no big deal as this actually harmonize
      the behaviour when the remote sample is actually a local one.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      33ab0fec
    • F
      posix-timers: Remove useless clock sample on timers cleanup · af82eb3c
      Frederic Weisbecker 提交于
      a0b2062b
      ("posix_timers: fix racy timer delta caching on task exit") forgot
      to remove the arguments used for timer caching.
      
      Fix this leftover.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      af82eb3c
    • F
      posix-timers: Remove dead task special case · a3222f88
      Frederic Weisbecker 提交于
      Now that we've removed all the optimizations that could
      result in NULL timer's targets, we can remove all the
      associated special case handling.
      
      Also add some warnings on NULL targets to spot any possible
      leftover.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      a3222f88
    • F
      posix-timers: Cleanup reaped target handling · e26d70d2
      Frederic Weisbecker 提交于
      When a timer's target is seen to be buried, for example on calls
      to timer_gettime(), the posix cpu timers code behaves a bit
      like a garbage collector and releases early the reference to the
      task.
      
      Then again, this optimization complicates the code for no much
      value: it's up to the user to release the timer and its associated
      ressources by calling timer_delete() after it buries the target
      tasks.
      
      Remove this to simplify the code.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      e26d70d2
    • F
      posix-timers: Remove dead process posix cpu timers caching · d430b917
      Frederic Weisbecker 提交于
      Now that we removed dead thread posix cpu timers caching,
      lets remove the dead process wide version. This caching
      is similar to the per thread version but it should be even
      more rare:
      
      * If the process id dead, we are not reading its timers
      status from a thread belonging to its group since they
      are all dead. So this caching only concern remote process
      timers reads. Now posix cpu timers using itimers or timer_settime()
      can't do remote process timers anyway so it's not even clear if there
      is actually a user for this caching.
      
      * Unlike per thread timers caching, this only applies to
      zombies targets. Buried targets' process wide timers return
      0 values. But then again, timer_gettime() can't read remote
      process timers, so if the process is dead, there can't be
      any reader left anyway.
      
      Then again this caching seem to complicate the code for
      corner cases that are probably not worth it. So lets get
      rid of it.
      
      Also remove the sample snapshot on dying process timer
      that is now useless, as suggested by Kosaki.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      d430b917
    • F
      posix-timers: Remove dead thread posix cpu timers caching · 724a3713
      Frederic Weisbecker 提交于
      When a task is exiting or has exited, its posix cpu timers
      don't tick anymore and won't elapse further. It's too late
      for them to expire.
      
      So any further call to timer_gettime() on these timers will
      return the same remaining expiry time.
      
      The current code optimize this by caching the remaining delta
      and storing it where we use to save the absolute expiration time.
      This way, the future calls to timer_gettime() won't need to
      compute the difference between the absolute expiration time and
      the current time anymore.
      
      Now this optimization doesn't seem to bring much value. Computing
      the timer remaining delta is not very costly. Fetching the timer
      value OTOH can be costly in two ways:
      
      * CPUCLOCK_SCHED read requires to lock the target's rq. But some
      optimizations are on the way to make task_sched_runtime() not holding
      the rq lock of a non-running target.
      
      * CPUCLOCK_VIRT/CPUCLOCK_PROF read simply consist in fetching
      current->utime/current->stime except when the system uses full
      dynticks cputime accounting. The latter requires a per task lock
      in order to correctly compute user and system time. But once the
      target is dead, this lock shouldn't be contended anyway.
      
      All in one this caching doesn't seem to be justified.
      Given that it complicates the code significantly for
      few wins, let's remove it on single thread timers.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      724a3713
  2. 03 12月, 2013 2 次提交
    • F
      posix-timers: Fix full dynticks CPUs kick on timer rescheduling · c925077c
      Frederic Weisbecker 提交于
      A posix CPU timer can be rearmed while it is firing or after it is
      notified with a signal. This can happen for example with timers that
      were set with a non zero interval in timer_settime().
      
      This rearming can happen in two places:
      
      1) On timer firing time, which happens on the target's tick. If the timer
      can't trigger a signal because it is ignored, it reschedules itself
      to honour the timer interval.
      
      2) On signal handling from the timer's notification target. This one
      can be a different task than the timer's target itself. Once the
      signal is notified, the notification target rearms the timer, again
      to honour the timer interval.
      
      When a timer is rearmed, we need to notify the full dynticks CPUs
      such that they restart their tick in case they are running tasks that
      may have a share in elapsing this timer.
      
      Now the 1st case above handles full dynticks CPUs with a call to
      posix_cpu_timer_kick_nohz() from the posix cpu timer firing code. But
      the second case ignores the fact that some CPUs may run non-idle tasks
      with their tick off. As a result, when a timer is resheduled after its signal
      notification, the full dynticks CPUs may completely ignore it and not
      tick on the timer as expected
      
      This patch fixes this bug by handling both cases in one. All we need
      is to move the kick to the rearming common code in posix_cpu_timer_schedule().
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Olivier Langlois <olivier@olivierlanglois.net>
      c925077c
    • F
      posix-timers: Spare workqueue if there is no full dynticks CPU to kick · d4283c65
      Frederic Weisbecker 提交于
      After a posix cpu timer is set, a workqueue is scheduled in order to
      kick the full dynticks CPUs and let them restart their tick if
      necessary in case the task they are running is concerned by the
      new timer.
      
      This kick is implemented by way of IPIs, which require interrupts
      to be enabled, hence the need for a workqueue to raise them because
      the posix cpu timer set path has interrupts disabled.
      
      Now if there is no full dynticks CPU on the system, the workqueue is
      still scheduled but it simply won't send any IPI and return immediately.
      
      So lets spare that worqueue when it is not needed.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      d4283c65
  3. 03 7月, 2013 5 次提交
    • F
      posix_timers: fix racy timer delta caching on task exit · a0b2062b
      Frederic Weisbecker 提交于
      When a task exits, we perform a caching of the remaining cputime delta
      before expiring of its timers.
      
      This is done from the following places:
      
      * When the task is reaped. We iterate through its list of
        posix cpu timers and store the remaining timer delta to
        the timer struct instead of the absolute value.
        (See posix_cpu_timers_exit() / posix_cpu_timers_exit_group() )
      
      * When we call posix_cpu_timer_get() or posix_cpu_timer_schedule().
        If the timer's task is considered dying when watched from these
        places, the same conversion from absolute to relative expiry time
        is performed. Then the given task's reference is released.
        (See clear_dead_task() ).
      
      The relevance of this caching is questionable but this is another
      and deeper debate.
      
      The big issue here is that these two sources of caching don't mix
      up very well together.
      
      More specifically, the caching can easily be done twice, resulting
      in a wrong delta as it gets spuriously substracted a second time by
      the elapsed clock. This can happen in the following scenario:
      
      1) The task exits and gets reaped: we call posix_cpu_timers_exit()
         and the absolute timer expiry values are converted to a relative
         delta.
      
      2) timer_gettime() -> posix_cpu_timer_get() is called and relies on
         clear_dead_task() because  tsk->exit_state == EXIT_DEAD.
         The delta gets substracted again by the elapsed clock and we return
         a wrong result.
      
      To fix this, just remove the caching done on task reaping time.  It
      doesn't bring much value on its own.  The caching done from
      posix_cpu_timer_get/schedule is enough.
      
      And it would also be hard to get it really right: we could make it put and
      clear the target task in the timer struct so that readers know if they are
      dealing with a relative cached of absolute value.  But it would be racy.
      The only safe way to do it would be to lock the itimer->it_lock so that we
      know nobody reads the cputime expiry value while we modify it and its
      target task reference.  Doing so would involve some funny workarounds to
      avoid circular lock against the sighand lock.  There is just no reason to
      maintain this.
      
      The user visible effect of this patch can be observed by running the
      following code: it creates a subthread that launches a posix cputimer
      which expires after 10 seconds. But then the subthread only busy loops for 2
      seconds and exits. The parent reaps the subthread and read the timer value.
      Its expected value should the be the initial timer's expiration value
      minus the cputime elapsed in the subthread. Roughly 10 - 2 = 8 seconds:
      
      	#include <sys/time.h>
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <time.h>
      	#include <pthread.h>
      
      	static timer_t id;
      	static struct itimerspec val = { .it_value.tv_sec = 10, }, new;
      
      	static void *thread(void *unused)
      	{
      		int err;
      		struct timeval start, end, diff;
      
      		timer_create(CLOCK_THREAD_CPUTIME_ID, NULL, &id);
      		if (err < 0) {
      			perror("Can't create timer\n");
      			return NULL;
      		}
      
      		/* Arm 10 sec timer */
      		err = timer_settime(id, 0, &val, NULL);
      		if (err < 0) {
      			perror("Can't set timer\n");
      			return NULL;
      		}
      
      		/* Exit after 2 seconds of execution */
      		gettimeofday(&start, NULL);
      	        do {
      			gettimeofday(&end, NULL);
      			timersub(&end, &start, &diff);
      		} while (diff.tv_sec < 2);
      
      		return NULL;
      	}
      
      	int main(int argc, char **argv)
      	{
      		pthread_t pthread;
      		int err;
      
      		err = pthread_create(&pthread, NULL, thread, NULL);
      		if (err) {
      			perror("Can't create thread\n");
      			return -1;
      		}
      		pthread_join(pthread, NULL);
      		/* Just wait a little bit to make sure the child got reaped */
      		sleep(1);
      		err = timer_gettime(id, &new);
      		if (err)
      			perror("Can't get timer value\n");
      		printf("%d %ld\n", new.it_value.tv_sec, new.it_value.tv_nsec);
      
      		return 0;
      	}
      
      Before the patch:
      
             $ ./posix_cpu_timers
             6 2278074
      
      After the patch:
      
            $ ./posix_cpu_timers
            8 1158766
      
      Before the patch, the elapsed time got two more seconds spuriously accounted.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Olivier Langlois <olivier@trillion01.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      a0b2062b
    • F
      posix-timers: correctly get dying task time sample in posix_cpu_timer_schedule() · 76cdcdd9
      Frederic Weisbecker 提交于
      In order to re-arm a timer after it fired, we take a sample of the current
      process or thread cputime.
      
      If the task is dying though, we don't arm anything but we cache the
      remaining timer expiration delta for further reads.
      
      Something similar is performed in posix_cpu_timer_get() but here we forget
      to take the process wide cputime sample before caching it.
      
      As a result we are storing random stack content, leading every further
      reads of that timer to return junk values.
      
      Fix this by taking the appropriate sample in the case of process wide
      timers.
      
      This probably doesn't matter much in practice because, at this stage, the
      thread is the last one in the group and we reached exit_notify().  This
      implies that we called exit_itimers() and there should be no more timers
      to handle for that task.
      
      So this is likely dead code anyway but let's fix the current logic
      and the warning that came along:
      
          kernel/posix-cpu-timers.c: In function 'posix_cpu_timer_schedule':
          kernel/posix-cpu-timers.c:1127: warning: 'now' may be used uninitialized in this function
      
      Then we can start to think further about cleaning up that code.
      Reported-by: NAndrew Morton <akpm@linux-foundation.org>
      Reported-by: NChen Gang <gang.chen@asianux.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Chen Gang <gang.chen@asianux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Olivier Langlois <olivier@trillion01.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      76cdcdd9
    • F
      posix_cpu_timers: consolidate expired timers check · 2473f3e7
      Frederic Weisbecker 提交于
      Consolidate the common code amongst per thread and per process timers list
      on tick time.
      
      List traversal, expiry check and subsequent updates can be shared in a
      common helper.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Olivier Langlois <olivier@trillion01.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      2473f3e7
    • F
      posix_cpu_timers: consolidate timer list cleanups · 1a7fa510
      Frederic Weisbecker 提交于
      Cleaning up the posix cpu timers on task exit shares some common code
      among timer list types, most notably the list traversal and expiry time
      update.
      
      Unify this in a common helper.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Olivier Langlois <olivier@trillion01.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      1a7fa510
    • F
      posix_cpu_timer: consolidate expiry time type · 55ccb616
      Frederic Weisbecker 提交于
      The posix cpu timer expiry time is stored in a union of two types: a 64
      bits field if we rely on scheduler precise accounting, or a cputime_t if
      we rely on jiffies.
      
      This results in quite some duplicate code and special cases to handle the
      two types.
      
      Just unify this into a single 64 bits field.  cputime_t can always fit
      into it.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Olivier Langlois <olivier@trillion01.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      55ccb616
  4. 23 4月, 2013 1 次提交
    • F
      posix_timers: Fix pre-condition to stop the tick on full dynticks · 6ac29178
      Frederic Weisbecker 提交于
      The test that checks if a CPU can stop its tick from posix CPU
      timers angle was mistakenly inverted.
      
      What we want is to prevent the tick from being stopped as long
      as the current CPU's task runs a posix CPU timer.
      
      Fix this.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      6ac29178
  5. 19 4月, 2013 2 次提交
    • F
      posix_timers: New API to prevent from stopping the tick when timers are running · 555347f6
      Frederic Weisbecker 提交于
      Bring a new helper that the full dynticks infrastructure can
      call in order to know if it can safely stop the tick from
      the posix cpu timers subsystem point of view.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      555347f6
    • F
      posix_timers: Kick full dynticks CPUs when a posix cpu timer is armed · a8572160
      Frederic Weisbecker 提交于
      Kick the full dynticks CPUs when a posix cpu timer is enqueued by
      way of a standard call to posix_cpu_timer_set() or set_process_cpu_timer().
      This also include rescheduled firing timers.
      
      This way they can re-evaluate the state of (and possibly restart) their
      tick against the new expiry modification.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      a8572160
  6. 15 2月, 2013 1 次提交
  7. 28 1月, 2013 1 次提交
    • F
      cputime: Use accessors to read task cputime stats · 6fac4829
      Frederic Weisbecker 提交于
      This is in preparation for the full dynticks feature. While
      remotely reading the cputime of a task running in a full
      dynticks CPU, we'll need to do some extra-computation. This
      way we can account the time it spent tickless in userspace
      since its last cputime snapshot.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      6fac4829
  8. 17 12月, 2012 1 次提交
  9. 29 11月, 2012 1 次提交
  10. 15 12月, 2011 1 次提交
  11. 18 10月, 2011 1 次提交
    • P
      cputimer: Cure lock inversion · bcd5cff7
      Peter Zijlstra 提交于
      There's a lock inversion between the cputimer->lock and rq->lock;
      notably the two callchains involved are:
      
       update_rlimit_cpu()
         sighand->siglock
         set_process_cpu_timer()
           cpu_timer_sample_group()
             thread_group_cputimer()
               cputimer->lock
               thread_group_cputime()
                 task_sched_runtime()
                   ->pi_lock
                   rq->lock
      
       scheduler_tick()
         rq->lock
         task_tick_fair()
           update_curr()
             account_group_exec()
               cputimer->lock
      
      Where the first one is enabling a CLOCK_PROCESS_CPUTIME_ID timer, and
      the second one is keeping up-to-date.
      
      This problem was introduced by e8abccb7 ("posix-cpu-timers: Cure
      SMP accounting oddities").
      
      Cure the problem by removing the cputimer->lock and rq->lock nesting,
      this leaves concurrent enablers doing duplicate work, but the time
      wasted should be on the same order otherwise wasted spinning on the
      lock and the greater-than assignment filter should ensure we preserve
      monotonicity.
      Reported-by: NDave Jones <davej@redhat.com>
      Reported-by: NSimon Kirby <sim@hostway.ca>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Link: http://lkml.kernel.org/r/1318928713.21167.4.camel@twinsSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      bcd5cff7
  12. 30 9月, 2011 1 次提交
    • P
      posix-cpu-timers: Cure SMP wobbles · d670ec13
      Peter Zijlstra 提交于
      David reported:
      
        Attached below is a watered-down version of rt/tst-cpuclock2.c from
        GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
        similar.
      
        Run it several times, and you will see cases where the main thread
        will measure a process clock difference before and after the nanosleep
        which is smaller than the cpu-burner thread's individual thread clock
        difference.  This doesn't make any sense since the cpu-burner thread
        is part of the top-level process's thread group.
      
        I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
        64-bit binaries).
      
        For example:
      
        [davem@boricha build-x86_64-linux]$ ./test
        process: before(0.001221967) after(0.498624371) diff(497402404)
        thread:  before(0.000081692) after(0.498316431) diff(498234739)
        self:    before(0.001223521) after(0.001240219) diff(16698)
        [davem@boricha build-x86_64-linux]$ 
      
        The diff of 'process' should always be >= the diff of 'thread'.
      
        I make sure to wrap the 'thread' clock measurements the most tightly
        around the nanosleep() call, and that the 'process' clock measurements
        are the outer-most ones.
      
        ---
        #include <unistd.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <fcntl.h>
        #include <string.h>
        #include <errno.h>
        #include <pthread.h>
      
        static pthread_barrier_t barrier;
      
        static void *chew_cpu(void *arg)
        {
      	  pthread_barrier_wait(&barrier);
      	  while (1)
      		  __asm__ __volatile__("" : : : "memory");
      	  return NULL;
        }
      
        int main(void)
        {
      	  clockid_t process_clock, my_thread_clock, th_clock;
      	  struct timespec process_before, process_after;
      	  struct timespec me_before, me_after;
      	  struct timespec th_before, th_after;
      	  struct timespec sleeptime;
      	  unsigned long diff;
      	  pthread_t th;
      	  int err;
      
      	  err = clock_getcpuclockid(0, &process_clock);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_init(&barrier, NULL, 2);
      	  err = pthread_create(&th, NULL, chew_cpu, NULL);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(th, &th_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_wait(&barrier);
      
      	  err = clock_gettime(process_clock, &process_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(th_clock, &th_before);
      	  if (err)
      		  return 1;
      
      	  sleeptime.tv_sec = 0;
      	  sleeptime.tv_nsec = 500000000;
      	  nanosleep(&sleeptime, NULL);
      
      	  err = clock_gettime(th_clock, &th_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(process_clock, &process_after);
      	  if (err)
      		  return 1;
      
      	  diff = process_after.tv_nsec - process_before.tv_nsec;
      	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 process_before.tv_sec, process_before.tv_nsec,
      		 process_after.tv_sec, process_after.tv_nsec, diff);
      	  diff = th_after.tv_nsec - th_before.tv_nsec;
      	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 th_before.tv_sec, th_before.tv_nsec,
      		 th_after.tv_sec, th_after.tv_nsec, diff);
      	  diff = me_after.tv_nsec - me_before.tv_nsec;
      	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 me_before.tv_sec, me_before.tv_nsec,
      		 me_after.tv_sec, me_after.tv_nsec, diff);
      
      	  return 0;
        }
      
      This is due to us using p->se.sum_exec_runtime in
      thread_group_cputime() where we iterate the thread group and sum all
      data. This does not take time since the last schedule operation (tick
      or otherwise) into account. We can cure this by using
      task_sched_runtime() at the cost of having to take locks.
      
      This also means we can (and must) do away with
      thread_group_sched_runtime() since the modified thread_group_cputime()
      is now more accurate and would deadlock when called from
      thread_group_sched_runtime().
      
      Aside of that it makes the function safe on 32 bit systems. The old
      code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
      64bit value and could be changed on another cpu at the same time.
      Reported-by: NDavid Miller <davem@davemloft.net>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twinsTested-by: NDavid Miller <davem@davemloft.net>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      d670ec13
  13. 13 9月, 2011 1 次提交
  14. 08 9月, 2011 1 次提交
    • P
      posix-cpu-timers: Cure SMP accounting oddities · e8abccb7
      Peter Zijlstra 提交于
      David reported:
      
        Attached below is a watered-down version of rt/tst-cpuclock2.c from
        GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
        similar.
      
        Run it several times, and you will see cases where the main thread
        will measure a process clock difference before and after the nanosleep
        which is smaller than the cpu-burner thread's individual thread clock
        difference.  This doesn't make any sense since the cpu-burner thread
        is part of the top-level process's thread group.
      
        I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
        64-bit binaries).
      
        For example:
      
        [davem@boricha build-x86_64-linux]$ ./test
        process: before(0.001221967) after(0.498624371) diff(497402404)
        thread:  before(0.000081692) after(0.498316431) diff(498234739)
        self:    before(0.001223521) after(0.001240219) diff(16698)
        [davem@boricha build-x86_64-linux]$
      
        The diff of 'process' should always be >= the diff of 'thread'.
      
        I make sure to wrap the 'thread' clock measurements the most tightly
        around the nanosleep() call, and that the 'process' clock measurements
        are the outer-most ones.
      
        ---
        #include <unistd.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <fcntl.h>
        #include <string.h>
        #include <errno.h>
        #include <pthread.h>
      
        static pthread_barrier_t barrier;
      
        static void *chew_cpu(void *arg)
        {
      	  pthread_barrier_wait(&barrier);
      	  while (1)
      		  __asm__ __volatile__("" : : : "memory");
      	  return NULL;
        }
      
        int main(void)
        {
      	  clockid_t process_clock, my_thread_clock, th_clock;
      	  struct timespec process_before, process_after;
      	  struct timespec me_before, me_after;
      	  struct timespec th_before, th_after;
      	  struct timespec sleeptime;
      	  unsigned long diff;
      	  pthread_t th;
      	  int err;
      
      	  err = clock_getcpuclockid(0, &process_clock);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_init(&barrier, NULL, 2);
      	  err = pthread_create(&th, NULL, chew_cpu, NULL);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(th, &th_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_wait(&barrier);
      
      	  err = clock_gettime(process_clock, &process_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(th_clock, &th_before);
      	  if (err)
      		  return 1;
      
      	  sleeptime.tv_sec = 0;
      	  sleeptime.tv_nsec = 500000000;
      	  nanosleep(&sleeptime, NULL);
      
      	  err = clock_gettime(th_clock, &th_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(process_clock, &process_after);
      	  if (err)
      		  return 1;
      
      	  diff = process_after.tv_nsec - process_before.tv_nsec;
      	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 process_before.tv_sec, process_before.tv_nsec,
      		 process_after.tv_sec, process_after.tv_nsec, diff);
      	  diff = th_after.tv_nsec - th_before.tv_nsec;
      	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 th_before.tv_sec, th_before.tv_nsec,
      		 th_after.tv_sec, th_after.tv_nsec, diff);
      	  diff = me_after.tv_nsec - me_before.tv_nsec;
      	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 me_before.tv_sec, me_before.tv_nsec,
      		 me_after.tv_sec, me_after.tv_nsec, diff);
      
      	  return 0;
        }
      
      This is due to us using p->se.sum_exec_runtime in
      thread_group_cputime() where we iterate the thread group and sum all
      data. This does not take time since the last schedule operation (tick
      or otherwise) into account. We can cure this by using
      task_sched_runtime() at the cost of having to take locks.
      
      This also means we can (and must) do away with
      thread_group_sched_runtime() since the modified thread_group_cputime()
      is now more accurate and would deadlock when called from
      thread_group_sched_runtime().
      Reported-by: NDavid Miller <davem@davemloft.net>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
      Cc: stable@kernel.org
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      e8abccb7
  15. 23 5月, 2011 1 次提交
  16. 31 3月, 2011 1 次提交
  17. 02 2月, 2011 7 次提交
  18. 10 11月, 2010 1 次提交
    • S
      posix-cpu-timers: Rcu_read_lock/unlock protect find_task_by_vpid call · c0deae8c
      Sergey Senozhatsky 提交于
      Commit 4221a991 "Add RCU check for
      find_task_by_vpid()" introduced rcu_lockdep_assert to find_task_by_pid_ns.
      Add rcu_read_lock/rcu_read_unlock to call find_task_by_vpid.
      
      Tetsuo Handa wrote:
      | Quoting from one of posts in that thead
      | http://kerneltrap.org/mailarchive/linux-kernel/2010/2/8/4536388
      |
      || Usually tasklist gives enough protection, but if copy_process() fails
      || it calls free_pid() lockless and does call_rcu(delayed_put_pid().
      || This means, without rcu lock find_pid_ns() can't scan the hash table
      || safely.
      
      Thomas Gleixner wrote:
      | We can remove the tasklist_lock while at it. rcu_read_lock is enough.
      
      Patch also replaces thread_group_leader with has_group_leader_pid
      in accordance to comment by Oleg Nesterov:
      
      | ... thread_group_leader() check is not relaible without 
      | tasklist. If we race with de_thread() find_task_by_vpid() can find
      | the new leader before it updates its ->group_leader.
      |
      | perhaps it makes sense to change posix_cpu_timer_create() to use 
      | has_group_leader_pid() instead, just to make this code not look racy
      | and avoid adding new problems.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      LKML-Reference: <20101103165256.GD30053@swordfish.minsk.epam.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      c0deae8c
  19. 16 7月, 2010 1 次提交
  20. 18 6月, 2010 3 次提交
    • O
      sched: Fix the racy usage of thread_group_cputimer() in fastpath_timer_check() · 8d1f431c
      Oleg Nesterov 提交于
      fastpath_timer_check()->thread_group_cputimer() is racy and
      unneeded.
      
      It is racy because another thread can clear ->running before
      thread_group_cputimer() takes cputimer->lock. In this case
      thread_group_cputimer() will set ->running = true again and call
      thread_group_cputime(). But since we do not hold tasklist or
      siglock, we can race with fork/exit and copy the wrong results
      into cputimer->cputime.
      
      It is unneeded because if ->running == true we can just use
      the numbers in cputimer->cputime we already have.
      
      Change fastpath_timer_check() to copy cputimer->cputime into
      the local variable under cputimer->lock. We do not re-check
      ->running under cputimer->lock, run_posix_cpu_timers() does
      this check later.
      
      Note: we can add more optimizations on top of this change.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100611180446.GA13025@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      8d1f431c
    • O
      sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand() · 0bdd2ed4
      Oleg Nesterov 提交于
      run_posix_cpu_timers() doesn't work if current has already passed
      exit_notify(). This was needed to prevent the races with do_wait().
      
      Since ea6d290c ->signal is always valid and can't go away. We can
      remove the "tsk->exit_state == 0" in fastpath_timer_check() and
      convert run_posix_cpu_timers() to use lock_task_sighand().
      
      Note: it makes sense to take group_leader's sighand instead, the
      sub-thread still uses CPU after release_task(). But we need more
      changes to do this.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100610231018.GA25942@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0bdd2ed4
    • O
      sched: thread_group_cputime: Simplify, document the "alive" check · bfac7009
      Oleg Nesterov 提交于
      thread_group_cputime() looks as if it is rcu-safe, but in fact this
      was wrong until ea6d290c which pins task->signal to task_struct.
      It checks ->sighand != NULL under rcu, but this can't help if ->signal
      can go away. Fortunately the caller either holds ->siglock, or it is
      fastpath_timer_check() which uses current and checks exit_state == 0.
      
      - Since ea6d290c commit tsk->signal is stable, we can read it first
        and avoid the initialization from INIT_CPUTIME.
      
      - Even if tsk->signal is always valid, we still have to check it
        is safe to use next_thread() under rcu_read_lock(). Currently
        the code checks ->sighand != NULL, change it to use pid_alive()
        which is commonly used to ensure the task wasn't unhashed before
        we take rcu_read_lock().
      
        Add the comment to explain this check.
      
      - Change the main loop to use the while_each_thread() helper.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100610230956.GA25921@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      bfac7009
  21. 28 5月, 2010 1 次提交
    • O
      posix-cpu-timers: avoid "task->signal != NULL" checks · d30fda35
      Oleg Nesterov 提交于
      Preparation to make task->signal immutable, no functional changes.
      
      posix-cpu-timers.c checks task->signal != NULL to ensure this task is
      alive and didn't pass __exit_signal().  This is correct but we are going
      to change the lifetime rules for ->signal and never reset this pointer.
      
      Change the code to check ->sighand instead, it doesn't matter which
      pointer we check under tasklist, they both are cleared simultaneously.
      
      As Roland pointed out, some of these changes are not strictly needed and
      probably it makes sense to revert them later, when ->signal will be pinned
      to task_struct.  But this patch tries to ensure the subsequent changes in
      fork/exit can't make any visible impact on posix cpu timers.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Acked-by: NRoland McGrath <roland@redhat.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d30fda35