1. 04 Dec 2015, 1 commit
  2. 23 Nov 2015, 1 commit
  3. 06 Oct 2015, 3 commits
    • sched/core: Remove a parameter in the migrate_task_rq() function · 5a4fd036
      Authored by xiaofeng.yan
      The parameter "int next_cpu" in the following function is unused:
      
        migrate_task_rq(struct task_struct *p, int next_cpu)
      
      Remove it.
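      
      A minimal sketch of the interface change (declaration only; the typedef
      names are illustrative, the real hook lives in struct sched_class):
      
      	struct task_struct;
      
      	/* before: the next_cpu argument was never used by any implementation */
      	typedef void (*migrate_task_rq_old_t)(struct task_struct *p, int next_cpu);
      
      	/* after: the parameter is dropped from the hook and its implementations */
      	typedef void (*migrate_task_rq_new_t)(struct task_struct *p);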
      Signed-off-by: xiaofeng.yan <yanxiaofeng@inspur.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1442991360-31945-1-git-send-email-yanxiaofeng@inspur.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Fix task and run queue sched_info::run_delay inconsistencies · 1de64443
      Authored by Peter Zijlstra
      Mike Meyer reported the following bug:
      
      > During evaluation of some performance data, it was discovered thread
      > and run queue run_delay accounting data was inconsistent with the other
      > accounting data that was collected.  Further investigation found under
      > certain circumstances execution time was leaking into the task and
      > run queue accounting of run_delay.
      >
      > Consider the following sequence:
      >
      >     a. thread is running.
      >     b. thread moves between cgroups, changes scheduling class or priority.
      >     c. thread sleeps OR
      >     d. thread involuntarily gives up cpu.
      >
      > a. implies:
      >
      >     thread->sched_info.last_queued = 0
      >
      > a. and b. result in the following:
      >
      >     1. dequeue_task(rq, thread)
      >
      >            sched_info_dequeued(rq, thread)
      >                delta = 0
      >
      >                sched_info_reset_dequeued(thread)
      >                    thread->sched_info.last_queued = 0
      >
      >                thread->sched_info.run_delay += delta
      >
      >     2. enqueue_task(rq, thread)
      >
      >            sched_info_queued(rq, thread)
      >
      >                /* thread is still on cpu at this point. */
      >                thread->sched_info.last_queued = task_rq(thread)->clock;
      >
      > c. results in:
      >
      >     dequeue_task(rq, thread)
      >
      >         sched_info_dequeued(rq, thread)
      >
      >             /* delta is execution time not run_delay. */
      >             delta = task_rq(thread)->clock - thread->sched_info.last_queued
      >
      >         sched_info_reset_dequeued(thread)
      >             thread->sched_info.last_queued = 0
      >
      >         thread->sched_info.run_delay += delta
      >
      >     Since thread was running between enqueue_task(rq, thread) and
      >     dequeue_task(rq, thread), the delta above is really execution
      >     time and not run_delay.
      >
      > d. results in:
      >
      >     __sched_info_switch(thread, next_thread)
      >
      >         sched_info_depart(rq, thread)
      >
      >             sched_info_queued(rq, thread)
      >
      >                 /* last_queued not updated due to being non-zero */
      >                 return
      >
      >     Since thread was running between enqueue_task(rq, thread) and
      >     __sched_info_switch(thread, next_thread), the execution time
      >     between enqueue_task(rq, thread) and
      >     __sched_info_switch(thread, next_thread) now will become
      >     associated with run_delay due to when last_queued was last updated.
      >
      
      This alternative patch solves the problem by not calling
      sched_info_{de,}queued() in {de,en}queue_task(). Therefore the
      sched_info state is preserved and things work as expected.
      
      By inlining the {de,en}queue_task() functions the new condition
      becomes (mostly) a compile-time constant and we'll not emit any new
      branch instructions.
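      
      A sketch of the resulting helpers (simplified; the save/restore flag names
      are illustrative rather than quoted from the patch):
      
      	/* kernel/sched/core.c, sketch */
      	static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
      	{
      		update_rq_clock(rq);
      		if (!(flags & ENQUEUE_RESTORE))
      			sched_info_queued(rq, p);	/* a real wait period starts */
      		p->sched_class->enqueue_task(rq, p, flags);
      	}
      
      	static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
      	{
      		update_rq_clock(rq);
      		if (!(flags & DEQUEUE_SAVE))
      			sched_info_dequeued(rq, p);	/* a real wait period ends */
      		p->sched_class->dequeue_task(rq, p, flags);
      	}
      
      Callers that only dequeue/enqueue to change a task's attributes (cgroup move,
      class or priority change) pass the save/restore flag, so sched_info's
      last_queued state is left untouched across the attribute change.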
      
      It even shrinks the code (due to inlining {en,de}queue_task()):
      
      $ size defconfig-build/kernel/sched/core.o defconfig-build/kernel/sched/core.o.orig
         text    data     bss     dec     hex filename
        64019   23378    2344   89741   15e8d defconfig-build/kernel/sched/core.o
        64149   23378    2344   89871   15f0f defconfig-build/kernel/sched/core.o.orig
      Reported-by: Mike Meyer <Mike.Meyer@Teradata.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20150930154413.GO3604@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Fix TASK_DEAD race in finish_task_switch() · 95913d97
      Authored by Peter Zijlstra
      So the problem this patch is trying to address is as follows:
      
              CPU0                            CPU1
      
              context_switch(A, B)
                                              ttwu(A)
                                                LOCK A->pi_lock
                                                A->on_cpu == 0
              finish_task_switch(A)
                prev_state = A->state  <-.
                WMB                      |
                A->on_cpu = 0;           |
                UNLOCK rq0->lock         |
                                         |    context_switch(C, A)
                                         `--  A->state = TASK_DEAD
                prev_state == TASK_DEAD
                  put_task_struct(A)
                                              context_switch(A, C)
                                              finish_task_switch(A)
                                                A->state == TASK_DEAD
                                                  put_task_struct(A)
      
      The argument being that the WMB will allow the load of A->state on CPU0
      to cross over and observe CPU1's store of A->state, which will then
      result in a double-drop and use-after-free.
      
      Now the comment states (and this was true once upon a long time ago)
      that we need to observe A->state while holding rq->lock because that
      will order us against the wakeup; however the wakeup will not in fact
      acquire (that) rq->lock; it takes A->pi_lock these days.
      
      We can obviously fix this by upgrading the WMB to an MB, but that is
      expensive, so we'd rather avoid that.
      
      The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
      which avoids the MB on some archs, but not important ones like ARM.
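      
      A sketch of the ordering change (the surrounding function is simplified;
      the exact context in kernel/sched/ may differ):
      
      	/* sketch: the tail end of the context-switch completion path */
      	static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
      	{
      		/*
      		 * Was: smp_wmb(); prev->on_cpu = 0;
      		 *
      		 * smp_store_release() orders the earlier prev_state load (and
      		 * every other prior access to prev) before the store that lets
      		 * the ttwu() path spinning on ->on_cpu proceed, so the remote
      		 * TASK_DEAD store cannot be observed by that earlier load.
      		 */
      		smp_store_release(&prev->on_cpu, 0);
      		raw_spin_unlock_irq(&rq->lock);
      	}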
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@vger.kernel.org> # v3.1+
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Cc: manfred@colorfullife.com
      Cc: will.deacon@arm.com
      Fixes: e4a52bcb ("sched: Remove rq->lock from the first half of ttwu()")
      Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 23 Sep 2015, 1 commit
  5. 18 Sep 2015, 1 commit
  6. 13 Sep 2015, 6 commits
  7. 12 Aug 2015, 1 commit
  8. 04 Aug 2015, 1 commit
  9. 03 Aug 2015, 4 commits
    • sched/fair: Provide runnable_load_avg back to cfs_rq · 13962234
      Authored by Yuyang Du
      The cfs_rq's load_avg is composed of runnable_load_avg and blocked_load_avg.
      Before this series, sometimes runnable_load_avg was used and sometimes
      load_avg was used. Completely replacing all uses of runnable_load_avg
      with load_avg would be too big a leap, since the blocked_load_avg component
      is feared to overstate the load. Therefore, bring runnable_load_avg back.
      
      The new cfs_rq runnable_load_avg is updated with all of the runnable
      sched_entities at the same time, which solves the problem of one
      sched_entity being up to date while the others are stale.
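      
      A sketch of the idea (illustrative helper names; the real code is in
      kernel/sched/fair.c): runnable_load_avg only sums the currently queued
      entities' contributions, while cfs_rq->avg.load_avg also carries the
      blocked part, and both are decayed by the same per-cfs_rq update pass:
      
      	static inline void
      	enqueue_runnable_load_avg_sketch(struct cfs_rq *cfs_rq, struct sched_entity *se)
      	{
      		cfs_rq->runnable_load_avg += se->avg.load_avg;
      		cfs_rq->runnable_load_sum += se->avg.load_sum;
      	}
      
      	static inline void
      	dequeue_runnable_load_avg_sketch(struct cfs_rq *cfs_rq, struct sched_entity *se)
      	{
      		cfs_rq->runnable_load_avg =
      			max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
      		cfs_rq->runnable_load_sum =
      			max_t(s64, cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
      	}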
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-7-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Init cfs_rq's sched_entity load average · 540247fb
      Authored by Yuyang Du
      The runnable load and utilization averages of a cfs_rq's sched_entity
      were not initialized. As is done for a task, give a new cfs_rq's
      sched_entity starting values that make its load heavy during its
      infant time.
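      
      A sketch of the initialization (close in spirit to the mainline helper,
      not quoted from it): the new entity starts with its load and utilization
      averages saturated, as if it had been runnable for a full averaging window:
      
      	void init_entity_runnable_average_sketch(struct sched_entity *se)
      	{
      		struct sched_avg *sa = &se->avg;
      
      		sa->last_update_time = 0;
      		sa->period_contrib = 1023;	/* just shy of one 1024us period */
      		sa->load_avg = scale_load_down(se->load.weight);
      		sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
      		sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
      		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
      	}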
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-5-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Rewrite runnable load and utilization average tracking · 9d89c257
      Authored by Yuyang Du
      The idea of runnable load average (let runnable time contribute to weight)
      was proposed by Paul Turner and Ben Segall, and it is still followed by
      this rewrite. This rewrite aims to solve the following issues:
      
      1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
         updated at the granularity of one entity at a time, which results in the
         cfs_rq's load average being stale or partially updated: at any time, only
         one entity is up to date, while all other entities are effectively lagging
         behind. This is undesirable.
      
         To illustrate, if we have n runnable entities in the cfs_rq, as time
         elapses, they certainly become outdated:
      
           t0: cfs_rq { e1_old, e2_old, ..., en_old }
      
         and when we update:
      
           t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }
      
           t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }
      
           ...
      
         We solve this by combining all runnable entities' load averages together
         in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based
         on the fact that if we regard the update as a function, then:
      
         w * update(e) = update(w * e) and
      
         update(e1) + update(e2) = update(e1 + e2), then
      
         w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)
      
         therefore, by this rewrite, we have an entirely updated cfs_rq at the
         time we update it:
      
           t1: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           t2: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           ...
      
      2. cfs_rq's load average differs between the top-level rq->cfs_rq and the
         other task_groups' per-CPU cfs_rqs in whether or not blocked_load_average
         contributes to the load.
      
         The basic idea behind runnable load average (the same for utilization)
         is that the blocked state is taken into account as opposed to only
         accounting for the currently runnable state. Therefore, the average
         should include both the runnable/running and blocked load averages.
         This rewrite does that.
      
         In addition, we also combine runnable/running and blocked averages
         of all entities into the cfs_rq's average, and update it together at
         once. This is based on the fact that:
      
           update(runnable) + update(blocked) = update(runnable + blocked)
      
         This significantly reduces the code as we don't need to separately
         maintain/update runnable/running load and blocked load.
      
      3. How task_group entities' share is calculated is complex and imprecise.
      
         We reduce the complexity in this rewrite to allow a very simple rule:
         the task_group's load_avg is aggregated from its per-CPU cfs_rqs'
         load_avgs. Then a group entity's weight is simply proportional to its
         own cfs_rq's load_avg / task_group's load_avg. To illustrate,
      
         if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,
      
         task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then
      
         cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share
      
      To sum up, this rewrite in principle is equivalent to the current one, but
      fixes the issues described above. Turns out, it significantly reduces the
      code complexity and hence increases clarity and efficiency. In addition,
      the new averages are more smooth/continuous (no spurious spikes and valleys)
      and updated more consistently and quickly to reflect the load dynamics.
      
      As a result, we have less load tracking overhead, better performance,
      and especially better power efficiency due to more balanced load.
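      
      As a worked example of the new group-share rule from point 3 (illustrative
      helper, not the exact mainline function): with tg_shares = 1024 and a
      cfs_rq carrying one quarter of the task_group's total load_avg, the group
      entity's weight on that CPU comes out as 256.
      
      	static unsigned long group_entity_weight_sketch(unsigned long cfs_rq_load_avg,
      							unsigned long tg_load_avg,
      							unsigned long tg_shares)
      	{
      		/* an (almost) idle group simply keeps its configured shares */
      		if (!tg_load_avg)
      			return tg_shares;
      
      		/* weight is proportional to this cfs_rq's share of the group load */
      		return tg_shares * cfs_rq_load_avg / tg_load_avg;
      	}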
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Remove rq's runnable avg · cd126afe
      Authored by Yuyang Du
      The current rq->avg has not been used at all since it was merged into the
      kernel, and the code is in the scheduler's hot path, so remove it.
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-2-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 04 Jul 2015, 2 commits
  11. 19 Jun 2015, 3 commits
  12. 18 May 2015, 1 commit
    • sched,perf: Fix periodic timers · 4cfafd30
      Authored by Peter Zijlstra
      In the below two commits (see Fixes) we have periodic timers that can
      stop themselves when they're no longer required, but need to be
      (re)-started when their idle condition changes.
      
      A further complication is that we want the timer handler to always do
      the forward, such that it will always correctly deal with the overruns,
      and we do not want to race such that the handler has already decided
      to stop while the (external) restart sees the timer still active, leaving
      us with a 'lost' timer.
      
      The problem with the current code is that the re-start can come before
      the callback does the forward, at which point the forward from the
      callback will WARN about forwarding an enqueued timer.
      
      Now, conceptually it's easy to detect if you're before or after the fwd
      by comparing the expiration time against the current time. Of course,
      that's expensive (and racy) because we don't have the current time.
      
      Alternatively one could cache this state inside the timer, but then
      everybody pays the overhead of maintaining this extra state, and that
      is undesired.
      
      The only other option that I could see is the external timer_active
      variable, which I tried to kill before. I would love a nicer interface
      for this seemingly simple 'problem' but alas.
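      
      A generic sketch of the pattern described above (names are illustrative and
      not the scheduler's or perf's actual fields): restart and handler serialize
      on the same lock, the handler always forwards before deciding whether to
      stop, and an external 'active' flag tells the restart path whether the
      handler is still keeping the timer alive.
      
      	struct periodic_timer_sketch {
      		raw_spinlock_t	lock;
      		struct hrtimer	timer;
      		ktime_t		period;
      		int		active;
      	};
      
      	static void periodic_start_sketch(struct periodic_timer_sketch *pt)
      	{
      		lockdep_assert_held(&pt->lock);
      
      		if (pt->active)
      			return;			/* handler keeps forwarding it */
      
      		pt->active = 1;
      		hrtimer_forward_now(&pt->timer, pt->period);
      		hrtimer_start_expires(&pt->timer, HRTIMER_MODE_ABS_PINNED);
      	}
      
      	static enum hrtimer_restart periodic_fn_sketch(struct hrtimer *timer)
      	{
      		struct periodic_timer_sketch *pt =
      			container_of(timer, struct periodic_timer_sketch, timer);
      		int idle;
      
      		raw_spin_lock(&pt->lock);
      		/* always forward first, so overruns are handled exactly once */
      		hrtimer_forward_now(timer, pt->period);
      		idle = periodic_work_done_sketch(pt);	/* hypothetical helper */
      		if (idle)
      			pt->active = 0;
      		raw_spin_unlock(&pt->lock);
      
      		return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
      	}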
      
      Fixes: 272325c4 ("perf: Fix mux_interval hrtimer wreckage")
      Fixes: 77a4d1a1 ("sched: Cleanup bandwidth timers")
      Cc: pjt@google.com
      Cc: tglx@linutronix.de
      Cc: klamm@yandex-team.ru
      Cc: mingo@kernel.org
      Cc: bsegall@google.com
      Cc: hpa@zytor.com
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
  13. 08 May 2015, 2 commits
  14. 22 Apr 2015, 1 commit
    • sched: Cleanup bandwidth timers · 77a4d1a1
      Authored by Peter Zijlstra
      Roman reported a 3 cpu lockup scenario involving __start_cfs_bandwidth().
      
      The more I look at that code the more I'm convinced it's crack; that
      entire __start_cfs_bandwidth() thing is brain melting. We don't need to
      cancel a timer before starting it, *hrtimer_start*() will happily remove
      the timer for you if it's still enqueued.
      
      Removing that removes a big part of the problem: no more ugly cancel
      loop to get stuck in.
      
      So now, if I understand things right, the entire reason you have this
      cfs_b->lock guarded ->timer_active nonsense is to make sure we don't
      accidentally lose the timer.
      
      It appears to me that it should be possible to guarantee the same by
      unconditionally (re)starting the timer when !queued. Because regardless
      of what hrtimer::function returns, if we beat it to (re)enqueueing the
      timer, it doesn't matter.
      
      Now, because hrtimers don't come with any serialization guarantees we
      must ensure both handler and (re)start loop serialize their access to
      the hrtimer to avoid both trying to forward the timer at the same
      time.
      
      Update the rt bandwidth timer to match.
      
      This effectively reverts: 09dc4ab0 ("sched/fair: Fix
      tg_set_cfs_bandwidth() deadlock on rq->lock").
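      
      A sketch of the simplified start path (close in spirit to the patch, not
      quoted from it): no cancel loop and no timer_active bookkeeping; if the
      timer is already queued the handler owns it, otherwise forward and
      (re)start it under cfs_b->lock:
      
      	void start_cfs_bandwidth_sketch(struct cfs_bandwidth *cfs_b)
      	{
      		lockdep_assert_held(&cfs_b->lock);
      
      		/* if it is already queued, whoever queued it will forward it */
      		if (hrtimer_is_queued(&cfs_b->period_timer))
      			return;
      
      		/* cfs_b->lock serializes this forward against the handler's */
      		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
      		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
      	}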
      Reported-by: Roman Gushchin <klamm@yandex-team.ru>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Paul Turner <pjt@google.com>
      Link: http://lkml.kernel.org/r/20150415095011.804589208@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  15. 02 Apr 2015, 1 commit
  16. 27 Mar 2015, 5 commits
  17. 23 Mar 2015, 1 commit
    • sched/rt: Use IPI to trigger RT task push migration instead of pulling · b6366f04
      Authored by Steven Rostedt
      When debugging the latencies on a 40 core box, where we hit 300 to
      500 microsecond latencies, I found there was a huge contention on the
      runqueue locks.
      
      Investigating it further, running ftrace, I found that it was due to
      the pulling of RT tasks.
      
      The test that was run was the following:
      
       cyclictest --numa -p95 -m -d0 -i100
      
      This created a thread on each CPU, that would set its wakeup in iterations
      of 100 microseconds. The -d0 means that all the threads had the same
      interval (100us). Each thread sleeps for 100us and wakes up and measures
      its latencies.
      
      cyclictest is maintained at:
       git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git
      
      What happened was another RT task would be scheduled on one of the CPUs
      that was running our test, when the other CPU tests went to sleep and
      scheduled idle. This caused the "pull" operation to execute on all
      these CPUs. Each one of these saw the RT task that was overloaded on
      the CPU of the test that was still running, and each one tried
      to grab that task in a thundering herd way.
      
      To grab the task, each thread would do a double rq lock grab, grabbing
      its own lock as well as the rq of the overloaded CPU. As the sched
      domains on this box were rather flat for its size, I saw up to 12 CPUs
      block on this lock at once. This caused a ripple effect with the
      rq locks, especially since the taking was done via a double rq lock, which
      means that several of the CPUs had their own rq locks held while trying
      to take this rq lock. As these locks were blocked, any wakeups or load
      balancing on these CPUs would also block on these locks, and the wait
      time escalated.
      
      I've tried various methods to lessen the load, but things like an
      atomic counter to only let one CPU grab the task won't work, because
      the task may have a limited affinity, and we may pick the wrong
      CPU to take that lock and do the pull, only to find out that the
      CPU we picked isn't in the task's affinity.
      
      Instead of doing the PULL, I now have the CPUs that want the pull to
      send over an IPI to the overloaded CPU, and let that CPU pick what
      CPU to push the task to. No more need to grab the rq lock, and the
      push/pull algorithm still works fine.
      
      With this patch, the latency dropped to just 150us over a 20 hour run.
      Without the patch, the huge latencies would trigger in seconds.
      
      I've created a new sched feature called RT_PUSH_IPI, which is enabled
      by default.
      
      When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
      and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
      is enabled, the IPI is sent to the overloaded CPU to do a push.
      
      To enable or disable this at run time:
      
       # mount -t debugfs nodev /sys/kernel/debug
       # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
      or
       # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
      
      Update: This original patch would send an IPI to all CPUs in the RT overload
      list. But that could theoretically cause the reverse issue. That is, there
      could be lots of overloaded RT queues and one CPU lowers its priority. It would
      then send an IPI to all the overloaded RT queues and they could then all try
      to grab the rq lock of the CPU lowering its priority, and then we have the
      same problem.
      
      The latest design sends out only one IPI to the first overloaded CPU. It tries to
      push any tasks that it can, and then looks for the next overloaded CPU that can
      push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable
      tasks that have priorities greater than the source CPU are covered. In case the
      source CPU lowers its priority again, a flag is set to tell the IPI traversal to
      restart with the first RT overloaded CPU after the source CPU.
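      
      A sketch of the IPI chain (illustrative names; the mainline implementation
      in kernel/sched/rt.c builds this on irq_work): the CPU that lowers its RT
      priority sends a single IPI to the first overloaded CPU, each recipient
      pushes what it can while holding only its own rq lock, then forwards the
      IPI to the next overloaded CPU until all candidates have been covered.
      
      	struct rt_push_ipi_sketch {
      		struct irq_work	work;
      		int		src_cpu;  /* CPU that lowered its priority */
      	};
      
      	static DEFINE_PER_CPU(struct rt_push_ipi_sketch, rt_push_ipi);
      
      	static void rt_push_ipi_func_sketch(struct irq_work *work)
      	{
      		struct rt_push_ipi_sketch *ipi =
      			container_of(work, struct rt_push_ipi_sketch, work);
      		int next;
      
      		/*
      		 * Push pushable RT tasks away using only this CPU's rq lock;
      		 * the source CPU, having lowered its priority, is now a
      		 * candidate target (push_rt_tasks_on_sketch() is hypothetical).
      		 */
      		push_rt_tasks_on_sketch(smp_processor_id());
      
      		/*
      		 * Chain the IPI to the next overloaded CPU, if one remains
      		 * (next_rt_overloaded_cpu_sketch() is a hypothetical walker).
      		 */
      		next = next_rt_overloaded_cpu_sketch(smp_processor_id(), ipi->src_cpu);
      		if (next >= 0)
      			irq_work_queue_on(&ipi->work, next);
      	}
      
      	static void tell_cpu_to_push_sketch(int src_cpu)
      	{
      		int first = next_rt_overloaded_cpu_sketch(src_cpu, src_cpu);
      
      		if (first < 0)
      			return;
      
      		per_cpu(rt_push_ipi, first).src_cpu = src_cpu;
      		irq_work_queue_on(&per_cpu(rt_push_ipi, first).work, first);
      	}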
      Parts-suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joern Engel <joern@purestorage.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  18. 18 Feb 2015, 1 commit
  19. 14 Jan 2015, 2 commits
  20. 16 Nov 2014, 1 commit
    • sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency · 6e998916
      Authored by Stanislaw Gruszka
      Commit d670ec13 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
      test case at the cost of breaking another one. After that commit, calling
      clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
      in the Y time being smaller than the X time.
      
      A reproducer/tester can be found further below; it can be compiled and run by:
      
      	gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
      	while ./tst-cpuclock2 ; do : ; done
      
      This reproducer, when running on a buggy kernel, will complain
      about "clock_gettime difference too small".
      
      The issue happens because on start, in thread_group_cputimer(), we initialize
      the cputimer's sum_exec_runtime with thread runtime that is not yet accounted,
      and then add the threads' runtime to the running cputimer again on the scheduler
      tick, making its sum_exec_runtime bigger than the actual threads' runtime.
      
      KOSAKI Motohiro posted a fix for this problem, but that patch was never
      applied: https://lkml.org/lkml/2013/5/26/191 .
      
      This patch takes a different approach to cure the problem. It calls
      update_curr() when the cputimer starts, which assures we will have updated
      stats for the running threads, and on the next scheduler tick we will account
      only the runtime that elapsed since the cputimer start. That also assures we
      have a consistent state between the cpu times of the individual threads and
      the cpu time of the process consisting of those threads.
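      
      A sketch of the sampling change (simplified; the helper name is illustrative,
      the real change is in the thread-group cputime accounting):
      
      	static u64 thread_group_sum_runtime_sketch(struct task_struct *tsk)
      	{
      		struct task_struct *t;
      		u64 sum = tsk->signal->sum_sched_runtime;
      
      		rcu_read_lock();
      		for_each_thread(tsk, t) {
      			/*
      			 * Instead of reading t->se.sum_exec_runtime directly
      			 * (stale for a thread that is on a CPU right now),
      			 * task_sched_runtime() locks the thread's rq and runs
      			 * update_curr() for a running thread first, so the
      			 * value sampled when the cputimer starts is current.
      			 */
      			sum += task_sched_runtime(t);
      		}
      		rcu_read_unlock();
      
      		return sum;
      	}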
      
      Full reproducer (tst-cpuclock2.c):
      
      	#define _GNU_SOURCE
      	#include <unistd.h>
      	#include <sys/syscall.h>
      	#include <stdio.h>
      	#include <time.h>
      	#include <pthread.h>
      	#include <stdint.h>
      	#include <inttypes.h>
      
      	/* Parameters for the Linux kernel ABI for CPU clocks.  */
      	#define CPUCLOCK_SCHED          2
      	#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
      		((~(clockid_t) (pid) << 3) | (clockid_t) (clock))
      
      	static pthread_barrier_t barrier;
      
      	/* Help advance the clock.  */
      	static void *chew_cpu(void *arg)
      	{
      		pthread_barrier_wait(&barrier);
      		while (1) ;
      
      		return NULL;
      	}
      
      	/* Don't use the glibc wrapper.  */
      	static int do_nanosleep(int flags, const struct timespec *req)
      	{
      		clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);
      
      		return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
      	}
      
      	static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
      	{
      		int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
      		int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;
      
      		return after_i - before_i;
      	}
      
      	int main(void)
      	{
      		int result = 0;
      		pthread_t th;
      
      		pthread_barrier_init(&barrier, NULL, 2);
      
      		if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
      			perror("pthread_create");
      			return 1;
      		}
      
      		pthread_barrier_wait(&barrier);
      
      		/* The test.  */
      		struct timespec before, after, sleeptimeabs;
      		int64_t sleepdiff, diffabs;
      		const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };
      
      		/* The relative nanosleep.  Not sure why this is needed, but its presence
      		   seems to make it easier to reproduce the problem.  */
      		if (do_nanosleep(0, &sleeptime) != 0) {
      			perror("clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the current time.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
      			perror("clock_gettime[2]");
      			return 1;
      		}
      
      		/* Compute the absolute sleep time based on the current time.  */
      		uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
      		sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
      		sleeptimeabs.tv_nsec = nsec % 1000000000;
      
      		/* Sleep for the computed time.  */
      		if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
      			perror("absolute clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the time after the sleep.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
      			perror("clock_gettime[3]");
      			return 1;
      		}
      
      		/* The time after sleep should always be equal to or after the absolute sleep
      		   time passed to clock_nanosleep.  */
      		sleepdiff = tsdiff(&sleeptimeabs, &after);
      		if (sleepdiff < 0) {
      			printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
      			result = 1;
      
      			printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
      			printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
      			printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
      		}
      
      		/* The difference between the timestamps taken before and after the
      		   clock_nanosleep call should be equal to or more than the duration of the
      		   sleep.  */
      		diffabs = tsdiff(&before, &after);
      		if (diffabs < sleeptime.tv_nsec) {
      			printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
      			result = 1;
      		}
      
      		pthread_cancel(th);
      
      		return result;
      	}
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  21. 04 Nov 2014, 1 commit