1. 09 Dec 2009, 3 commits
    • sched: Move update_curr() in check_preempt_wakeup() to avoid redundant call · a65ac745
      Authored by Jupyung Lee
      If an RT task is woken up while a non-RT task is running,
      check_preempt_wakeup() is called to check whether the new task can
      preempt the old task. The function returns quickly without going deeper
      because it is apparent that an RT task can always preempt a non-RT task.
      
      In this situation, check_preempt_wakeup() always calls update_curr() to
      update the vruntime of the currently running task. However, that call is
      unnecessary and redundant at that moment, because (1) a non-RT task can
      always be preempted by an RT task regardless of its vruntime, and (2)
      update_curr() will be called shortly anyway when the context switch
      between the two occurs.
      
      By moving update_curr() later within check_preempt_wakeup(), we avoid
      the redundant call, slightly reducing the time taken to wake up RT
      tasks (see the sketch after this entry).
      Signed-off-by: Jupyung Lee <jupyung@gmail.com>
      [ Place update_curr() right before the wake_preempt_entity() call, which
        is the only thing that relies on the updated vruntime ]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1258451500-6714-1-git-send-email-jupyung@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a65ac745
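      A minimal sketch of the resulting flow in kernel/sched_fair.c (illustrative
      only, not the literal diff; helper names follow the sched_fair code of that
      era, and most early-exit checks are elided):

        static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
        {
                struct task_struct *curr = rq->curr;
                struct sched_entity *se = &curr->se, *pse = &p->se;

                if (unlikely(rt_prio(p->prio))) {
                        /* An RT task always preempts a fair task; no vruntime needed. */
                        resched_task(curr);
                        return;
                }

                /* ... further early-exit checks elided ... */

                /* Moved here: only the comparison below consumes the fresh vruntime. */
                update_curr(cfs_rq_of(se));
                if (wakeup_preempt_entity(se, pse) == 1)
                        resched_task(curr);
        }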
    • sched: Sanitize fork() handling · cd29fe6f
      Authored by Peter Zijlstra
      Currently we try to do task placement in wake_up_new_task() after we do
      the load-balance pass in sched_fork(). This yields complicated semantics
      in that we have to deal with tasks on different RQs and with the
      set_task_cpu() calls in copy_process() and sched_fork().
      
      Rename ->task_new() to ->task_fork() and call it from sched_fork()
      before the balancing; this gives the policy a clear point at which to
      place the task (see the sketch after this entry).
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cd29fe6f
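      A hedged sketch of the new hook shape (not the actual diff; the scheduler
      state setup in sched_fork() is elided):

        struct sched_class {
                /* ... */
                void (*task_fork)(struct task_struct *p);      /* was ->task_new() */
                /* ... */
        };

        void sched_fork(struct task_struct *p, int clone_flags)
        {
                /* ... scheduler state initialization elided ... */

                /* Let the policy place the new task before any balancing,
                 * while parent and child are still local to one CPU. */
                if (p->sched_class->task_fork)
                        p->sched_class->task_fork(p);

                /* the load-balance pass / set_task_cpu() happens after this point */
        }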
    • sched: Protect sched_rr_get_param() access to task->sched_class · dba091b9
      Authored by Thomas Gleixner
      sched_rr_get_param() calls
      task->sched_class->get_rr_interval(task) without protection
      against a concurrent sched_setscheduler() call, which can modify
      task->sched_class.
      
      Serialize the access with task_rq_lock(task) and hand the rq
      pointer into get_rr_interval(), as it is needed at least in the
      sched_fair implementation (see the sketch after this entry).
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <alpine.LFD.2.00.0912090930120.3089@localhost.localdomain>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      dba091b9
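      A minimal sketch of the locking pattern inside the sched_rr_get_interval
      syscall (assuming the task_rq_lock()/task_rq_unlock() API of that era;
      task lookup and error handling elided):

        struct rq *rq;
        unsigned long flags;
        unsigned int time_slice;
        struct timespec t;

        rq = task_rq_lock(p, &flags);   /* pins p->sched_class against sched_setscheduler() */
        time_slice = p->sched_class->get_rr_interval(rq, p);
        task_rq_unlock(rq, &flags);

        jiffies_to_timespec(time_slice, &t);    /* report the interval as before */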
  2. 24 Nov 2009, 1 commit
    • sched: Optimize branch hint in pick_next_task_fair() · 36ace27e
      Authored by Tim Blechmann
      Branch hint profiling on my Nehalem machine showed 90%
      incorrect branch hints:
      
        15728471 158903754  90 pick_next_task_fair
        sched_fair.c    1555
      Signed-off-by: Tim Blechmann <tim@klingt.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <4B0BBBB1.2050100@klingt.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      36ace27e
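      A hedged sketch of the kind of change such a profile suggests (a fragment,
      illustrative only; the exact hunk may differ):

        /* in pick_next_task_fair(): the profile shows the "empty runqueue"
         * case is common on this workload, so the hint is dropped */
        if (!cfs_rq->nr_running)        /* was: if (unlikely(!cfs_rq->nr_running)) */
                return NULL;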
  3. 13 Nov 2009, 2 commits
  4. 05 Nov 2009, 2 commits
    • sched: Fix affinity logic in select_task_rq_fair() · fd210738
      Authored by Mike Galbraith
      Ingo Molnar reported:
      
      [   26.804000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
      [   26.808000] caller is vmstat_update+0x26/0x70
      [   26.812000] Pid: 10, comm: events/1 Not tainted 2.6.32-rc5 #6887
      [   26.816000] Call Trace:
      [   26.820000]  [<c1924a24>] ? printk+0x28/0x3c
      [   26.824000]  [<c13258a0>] debug_smp_processor_id+0xf0/0x110
      [   26.824000] mount used greatest stack depth: 1464 bytes left
      [   26.828000]  [<c111d086>] vmstat_update+0x26/0x70
      [   26.832000]  [<c1086418>] worker_thread+0x188/0x310
      [   26.836000]  [<c10863b7>] ? worker_thread+0x127/0x310
      [   26.840000]  [<c108d310>] ? autoremove_wake_function+0x0/0x60
      [   26.844000]  [<c1086290>] ? worker_thread+0x0/0x310
      [   26.848000]  [<c108cf0c>] kthread+0x7c/0x90
      [   26.852000]  [<c108ce90>] ? kthread+0x0/0x90
      [   26.856000]  [<c100c0a7>] kernel_thread_helper+0x7/0x10
      [   26.860000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
      [   26.864000] caller is vmstat_update+0x3c/0x70
      
      This happens because commit:
      
        a1f84a3a: sched: Check for an idle shared cache in select_task_rq_fair()
      
      broke ->cpus_allowed handling: the idle-cache scan could pick a CPU the
      task is not allowed to run on (see the sketch after this entry).
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: arjan@infradead.org
      Cc: <stable@kernel.org>
      LKML-Reference: <1257415066.12867.1.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      fd210738
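      A hedged sketch of the shape of the fix (a fragment, with surrounding
      declarations elided; the real hunk in select_task_rq_fair() differs in
      detail): any CPU considered by the idle-cache scan must still pass the
      task's affinity mask, otherwise a per-CPU thread such as events/1 can be
      woken on the wrong CPU and trip the smp_processor_id() check above.

        /* inside the idle shared-cache scan of select_task_rq_fair() */
        for_each_cpu(i, sched_domain_span(sd)) {
                if (!cpumask_test_cpu(i, &p->cpus_allowed))
                        continue;               /* respect the task's affinity */
                if (idle_cpu(i)) {
                        target = i;             /* an idle CPU the task may actually use */
                        break;
                }
        }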
    • sched: Check for an idle shared cache in select_task_rq_fair() · a1f84a3a
      Authored by Mike Galbraith
      When waking affine, check for an idle shared cache, and if one is
      found, wake to that CPU/sibling instead of the waker's CPU (see the
      sketch after this entry).
      
      This improves pgsql+oltp ramp-up by roughly 8%, and possibly more for
      other loads, depending on overlap. The trade-off is a roughly 1% peak
      downturn if tasks are truly synchronous.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@kernel.org>
      LKML-Reference: <1256654138.17752.7.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a1f84a3a
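      A hedged sketch of the heuristic, written as a hypothetical helper for
      illustration (the actual change lives inline in select_task_rq_fair() of
      that era): prefer an idle CPU sharing the waker's cache, falling back to
      the waker's CPU when nothing is idle. The affinity restriction is the
      piece the fd210738 fix above restores.

        static int wake_to_idle_shared_cache(struct task_struct *p, int waker_cpu)
        {
                int cpu;

                /* scan CPUs sharing a cache with the waker, limited to p's affinity */
                for_each_cpu_and(cpu, topology_core_cpumask(waker_cpu), &p->cpus_allowed) {
                        if (idle_cpu(cpu))
                                return cpu;     /* idle sibling: cheap wakeup, warm cache */
                }

                return waker_cpu;               /* nothing idle: behave as before */
        }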
  5. 24 Oct 2009, 1 commit
    • sched: Strengthen buddies and mitigate buddy induced latencies · f685ceac
      Authored by Mike Galbraith
      This patch restores the effectiveness of LAST_BUDDY in preventing
      pgsql+oltp from collapsing due to wakeup preemption. It also
      switches LAST_BUDDY to exclusively do what it does best, namely
      mitigate the effects of aggressive wakeup preemption, which
      improves vmark throughput markedly, and restores mysql+oltp
      scalability.
      
      Since buddies are about scalability, enable them beginning at the
      point where we begin expanding sched_latency, namely
      sched_nr_latency. Previously, buddies were cleared aggressively,
      which seriously reduced their effectiveness. Not clearing
      aggressively, however, produces a small drop in mysql+oltp
      throughput immediately after the peak, indicating that LAST_BUDDY is
      actually doing some harm there. That is right at the point where X on
      the desktop, competing with another load, wants low-latency service.
      Ergo, do not enable buddies until we need to scale.
      
      To mitigate latency induced by buddies, or by a task that just missed
      wakeup preemption, check latency at tick time (see the sketch after
      this entry).
      
      The last hunk prevents buddies from stymieing BALANCE_NEWIDLE via
      CACHE_HOT_BUDDY.
      
      Supporting performance tests:
      
       tip   = v2.6.32-rc5-1497-ga525b32
       tipx  = NO_GENTLE_FAIR_SLEEPERS NEXT_BUDDY granularity knobs = 31 knobs + 31 buddies
       tip+x = NO_GENTLE_FAIR_SLEEPERS granularity knobs = 31 knobs
      
      (Three run averages except where noted.)
      
       vmark:
       ------
       tip           108466 messages per second
       tip+          125307 messages per second
       tip+x         125335 messages per second
       tipx          117781 messages per second
       2.6.31.3      122729 messages per second
      
       mysql+oltp:
       -----------
       clients          1        2        4        8       16       32       64        128    256
       ..........................................................................................
       tip        9949.89 18690.20 34801.24 34460.04 32682.88 30765.97 28305.27 25059.64 19548.08
       tip+      10013.90 18526.84 34900.38 34420.14 33069.83 32083.40 30578.30 28010.71 25605.47
       tipx       9698.71 18002.70 34477.56 33420.01 32634.30 31657.27 29932.67 26827.52 21487.18
       2.6.31.3   8243.11 18784.20 34404.83 33148.38 31900.32 31161.90 29663.81 25995.94 18058.86
      
       pgsql+oltp:
       -----------
       clients          1        2        4        8       16       32       64      128      256
       ..........................................................................................
       tip       13686.37 26609.25 51934.28 51347.81 49479.51 45312.65 36691.91 26851.57 24145.35
       tip+ (1x) 13907.85 27135.87 52951.98 52514.04 51742.52 50705.43 49947.97 48374.19 46227.94
       tip+x     13906.78 27065.81 52951.19 52542.59 52176.11 51815.94 50838.90 49439.46 46891.00
       tipx      13742.46 26769.81 52351.99 51891.73 51320.79 50938.98 50248.65 48908.70 46553.84
       2.6.31.3  13815.35 26906.46 52683.34 52061.31 51937.10 51376.80 50474.28 49394.47 47003.25
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f685ceac
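      A hedged sketch of the two mechanisms described above (illustrative, not
      the literal hunks; helper names follow sched_fair.c of that era):

        static void set_last_buddy(struct sched_entity *se)
        {
                /* buddies only engage once the queue is long enough that
                 * sched_latency starts stretching, i.e. once we need to scale */
                if (cfs_rq_of(se)->nr_running < sched_nr_latency)
                        return;
                cfs_rq_of(se)->last = se;
        }

        static void check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
        {
                u64 ideal_runtime = sched_slice(cfs_rq, curr);
                u64 delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;

                if (delta_exec > ideal_runtime) {
                        resched_task(rq_of(cfs_rq)->curr);      /* slice is used up */
                        return;
                }

                /* latency check at tick time: a buddy, or a task that just
                 * missed wakeup preemption, must not keep the leftmost
                 * entity waiting for a whole extra slice */
                if (cfs_rq->nr_running > 1) {
                        struct sched_entity *left = __pick_next_entity(cfs_rq);

                        if ((s64)(curr->vruntime - left->vruntime) > (s64)ideal_runtime)
                                resched_task(rq_of(cfs_rq)->curr);
                }
        }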
  6. 14 Oct 2009, 1 commit
    • sched: Do less agressive buddy clearing · 92f6a5e3
      Authored by Peter Zijlstra
      Yanmin reported a hackbench regression due to:
      
       > commit de69a80b
       > Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
       > Date:   Thu Sep 17 09:01:20 2009 +0200
       >
       >     sched: Stop buddies from hogging the system
      
      I really liked de69a80b, and it affecting hackbench shows I wasn't
      crazy ;-)
      
      So hackbench is a multi-cast, with one sender spraying multiple
      receivers, who in their turn don't spray back.
      
      This is exactly the scenario that patch 'cures'. Previously we
      would not clear the last buddy after running the next task, allowing
      the sender to get back to work sooner than it otherwise should have,
      increasing latencies for other tasks.
      
      Now, since those receivers don't poke back, they don't enforce the
      buddy relation, which means there's nothing to re-elect the sender.
      
      Cure this by clearing the buddy state less aggressively: only clear
      buddies when they were not chosen (see the sketch after this entry).
      This should still avoid a buddy sticking around long after it has
      served its time.
      Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <1255084986.8802.46.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      92f6a5e3
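      A hedged sketch of the relaxed clearing rule in pick_next_entity()
      (illustrative, not the actual hunk; the real function also handles group
      scheduling): a buddy is cleared only when it loses the election, so a
      winning buddy can be re-elected on the next pick.

        static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
        {
                struct sched_entity *se = __pick_next_entity(cfs_rq);

                if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1)
                        se = cfs_rq->next;      /* next buddy wins: leave it set */
                else
                        cfs_rq->next = NULL;    /* not chosen: clear it */

                if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1)
                        se = cfs_rq->last;      /* last buddy wins: leave it set */
                else
                        cfs_rq->last = NULL;    /* not chosen: clear it */

                return se;
        }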
  7. 24 Sep 2009, 1 commit
  8. 21 Sep 2009, 1 commit
  9. 19 Sep 2009, 1 commit
  10. 18 Sep 2009, 1 commit
  11. 17 Sep 2009, 3 commits
    • sched: Fix SD_POWERSAVING_BALANCE|SD_PREFER_LOCAL vs SD_WAKE_AFFINE · 29cd8bae
      Authored by Peter Zijlstra
      The SD_POWERSAVING_BALANCE|SD_PREFER_LOCAL code can break out of
      the domain iteration early, making us miss the SD_WAKE_AFFINE bits.
      
      Fix this by continuing the iteration until there is no need for a
      larger domain (see the sketch after this entry).
      
      This also cleans up the cgroup stuff a bit by not having two
      update_shares() invocations.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      29cd8bae
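      A hedged, heavily simplified sketch of the control-flow change in the
      select_task_rq_fair() domain walk; wants_powersave_or_local() is a
      hypothetical stand-in for the SD_POWERSAVING_BALANCE/SD_PREFER_LOCAL
      tests, and the remaining locals are assumed from the surrounding code:

        for_each_domain(cpu, tmp) {
                if (wants_powersave_or_local(tmp, p)) {
                        sd = tmp;
                        continue;       /* was: break -- which skipped the check below */
                }

                /* keep looking for an affine domain covering the previous CPU */
                if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
                    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
                        affine_sd = tmp;

                if (!want_sd && !want_affine)
                        break;          /* nothing left to find: now we may stop */
        }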
    • sched: Stop buddies from hogging the system · de69a80b
      Authored by Peter Zijlstra
      Clear buddies more aggressively.
      
      The (theoretical; I haven't actually observed any of this) problem is
      that when we do not select either buddy in pick_next_entity(),
      because they are too far ahead of the left-most task, we do not
      clear the buddies.
      
      This means that as soon as we service the left-most task, these
      same buddies will be tried again on the next schedule. Now if the
      left-most task was a pure hog, it wouldn't have done any wakeups
      and it wouldn't have set buddies of its own. That lets the old
      buddies dominate, which leads to bad latencies (see the sketch
      after this entry).
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      de69a80b
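      A hedged sketch of the aggressive variant this commit introduced, later
      relaxed again by 92f6a5e3 above (close in spirit to, but not necessarily
      identical with, the actual hunk): a buddy is cleared as soon as it is
      considered, whether or not it wins.

        static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
        {
                struct sched_entity *se = __pick_next_entity(cfs_rq);
                struct sched_entity *buddy;

                if (cfs_rq->next) {
                        buddy = cfs_rq->next;
                        cfs_rq->next = NULL;            /* cleared whether or not it wins */
                        if (wakeup_preempt_entity(buddy, se) < 1)
                                return buddy;
                }

                if (cfs_rq->last) {
                        buddy = cfs_rq->last;
                        cfs_rq->last = NULL;            /* cleared whether or not it wins */
                        if (wakeup_preempt_entity(buddy, se) < 1)
                                return buddy;
                }

                return se;
        }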
    • sched: Add new wakeup preemption mode: WAKEUP_RUNNING · ad4b78bb
      Authored by Peter Zijlstra
      Create a new wakeup preemption mode that preempts towards tasks that
      run for shorter periods on average. It sets the next buddy to make sure
      we actually run the task we preempted for (a sketch follows this entry).
      
      Test results:
      
       root@twins:~# while :; do :; done &
       [1] 6537
       root@twins:~# while :; do :; done &
       [2] 6538
       root@twins:~# while :; do :; done &
       [3] 6539
       root@twins:~# while :; do :; done &
       [4] 6540
      
       root@twins:/home/peter# ./latt -c4 sleep 4
       Entries: 48 (clients=4)
      
       Averages:
       ------------------------------
              Max          4750 usec
              Avg           497 usec
              Stdev         737 usec
      
       root@twins:/home/peter# echo WAKEUP_RUNNING > /debug/sched_features
      
       root@twins:/home/peter# ./latt -c4 sleep 4
       Entries: 48 (clients=4)
      
       Averages:
       ------------------------------
              Max            14 usec
              Avg             5 usec
              Stdev           3 usec
      
      Disabled by default - needs more testing.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      LKML-Reference: <new-submission>
      ad4b78bb
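      A hedged sketch of the feature's core test in check_preempt_wakeup()
      (a fragment; the average-runtime bookkeeping the patch adds is not shown,
      and the avg_running field name is an assumption based on the description
      above):

        if (sched_feat(WAKEUP_RUNNING)) {
                /* prefer the task that tends to run for shorter bursts */
                if (pse->avg_running < se->avg_running) {
                        set_next_buddy(pse);    /* make sure the wakee really runs next */
                        resched_task(curr);
                        return;
                }
        }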
  12. 16 Sep 2009, 6 commits
  13. 15 Sep 2009, 12 commits
  14. 14 Sep 2009, 1 commit
    • perf_counter, sched: Add sched_stat_runtime tracepoint · f977bb49
      Authored by Ingo Molnar
      This allows more precise tracking of how the scheduler accounts
      (and acts upon) a task having spent N nanoseconds of CPU time
      (see the sketch after this entry).
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f977bb49
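      A hedged sketch of where such a tracepoint naturally fires (illustrative;
      the accounting details and local declarations are elided): update_curr()
      is the place that knows exactly how many nanoseconds were just charged to
      the running task.

        static void update_curr(struct cfs_rq *cfs_rq)
        {
                /* ... delta_exec computed and added to curr's sum_exec_runtime ... */

                if (entity_is_task(curr)) {
                        struct task_struct *curtask = task_of(curr);

                        trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
                        /* ... cpuacct / thread-group time accounting ... */
                }
        }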
  15. 11 Sep 2009, 1 commit
    • sched: Fix sched::sched_stat_wait tracepoint field · e1f84508
      Authored by Ingo Molnar
      This weird perf trace output:
      
        cc1-9943  [001]  2802.059479616: sched_stat_wait: task: as:9944 wait: 2801938766276 [ns]
      
      is caused by zeroing one component field of the delta a bit too
      early. Move the zeroing to later (a sketch of the bug shape follows
      this entry).
      
      ( Note, this does not affect the NEW_FAIR_SLEEPERS interactivity bug;
        it's just a reporting bug in essence. )
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikos Chantziaras <realnc@arcor.de>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <4AA93D34.8040500@arcor.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e1f84508
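      A hedged sketch of the bug shape (simplified; the field and clock names
      are assumptions, and the real code lives in the schedstat wait-end path
      of sched_fair.c). The reported wait is roughly the raw timestamp because
      wait_start had already been zeroed when the delta was formed; the fix is
      to report first and clear afterwards:

        u64 delta = rq_of(cfs_rq)->clock - se->wait_start;

        trace_sched_stat_wait(task_of(se), delta);      /* report the delta first ... */
        se->wait_start = 0;                             /* ... only then clear the field */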
  16. 09 Sep 2009, 2 commits
  17. 08 Sep 2009, 1 commit
    • sched: Ensure that a child can't gain time over it's parent after fork() · b5d9d734
      Authored by Mike Galbraith
      A fork/exec load is usually "pass the baton", so the child
      should never be placed behind the parent.  With START_DEBIT we
      make room for the new task, but with child_runs_first, that
      room comes out of the _parent's_ hide. There's nothing to say
      that the parent wasn't ahead of min_vruntime at fork() time,
      which means that the "baton carrier", who is essentially the
      parent in drag, can gain time and increase scheduling latencies
      for waiters.
      
      With NEW_FAIR_SLEEPERS + START_DEBIT + child_runs_first
      enabled, we essentially pass the sleeper fairness off to the
      child, which is fine, but if we don't base placement on the
      parent's updated vruntime, we can end up compounding latency
      woes if the child itself then does fork/exec.  The debit
      incurred at fork doesn't hurt the parent who is then going to
      sleep and maybe exit, but the child who acquires the error
      harms all comers.
      
      This improves latencies of make -j<n> kernel build workloads (a sketch
      of the placement rule follows this entry).
      Reported-by: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b5d9d734
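      A hedged sketch of the resulting placement rule, shown in the ->task_fork()
      shape it took after the later fork() rework listed above (cd29fe6f);
      simplified, with locking and min_vruntime normalization elided:

        static void task_fork_fair(struct task_struct *p)
        {
                struct cfs_rq *cfs_rq = task_cfs_rq(current);
                struct sched_entity *curr = cfs_rq->curr, *se = &p->se;

                update_curr(cfs_rq);                    /* parent's vruntime is now current */
                if (curr)
                        se->vruntime = curr->vruntime;  /* child starts where the parent really is */
                place_entity(cfs_rq, se, 1);            /* apply START_DEBIT etc. to the child */

                /* child_runs_first may swap vruntimes, but never mints new time:
                 * the pair's combined budget stays the same */
                if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
                        swap(curr->vruntime, se->vruntime);
                        resched_task(rq_of(cfs_rq)->curr);
                }
        }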