1. 09 Sep, 2009 · 1 commit
    • sched: Turn off child_runs_first · 2bba22c5
      Committed by Mike Galbraith
      Set child_runs_first default to off.
      
      It hurts 'optimal' make -j<NR_CPUS> workloads as make jobs
      get preempted by child tasks, reducing parallelism.
      
      Note, this patch might make existing races in user
      applications more prominent than before - so breakages
      might be bisected to this commit.
      
      Child-runs-first is broken on SMP to begin with, and we
      already had it off briefly in v2.6.23 so most of the
      offenders ought to be fixed. Would be nice not to revert
      this commit but fix those apps finally ...
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1252486344.28645.18.camel@marge.simson.net>
      [ made the sysctl independent of CONFIG_SCHED_DEBUG, in case
        people want to work around broken apps. ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      2bba22c5
  2. 08 Sep, 2009 · 3 commits
    • sched: Ensure that a child can't gain time over it's parent after fork() · b5d9d734
      Committed by Mike Galbraith
      A fork/exec load is usually "pass the baton", so the child
      should never be placed behind the parent.  With START_DEBIT we
      make room for the new task, but with child_runs_first, that
      room comes out of the _parent's_ hide. There's nothing to say
      that the parent wasn't ahead of min_vruntime at fork() time,
      which means that the "baton carrier", who is essentially the
      parent in drag, can gain time and increase scheduling latencies
      for waiters.
      
      With NEW_FAIR_SLEEPERS + START_DEBIT + child_runs_first
      enabled, we essentially pass the sleeper fairness off to the
      child, which is fine, but if we don't base placement on the
      parent's updated vruntime, we can end up compounding latency
      woes if the child itself then does fork/exec.  The debit
      incurred at fork doesn't hurt the parent who is then going to
      sleep and maybe exit, but the child who acquires the error
      harms all comers.
      
      This improves latencies of make -j<n> kernel build workloads.
      Reported-by: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b5d9d734
    • sched: Deal with low-load in wake_affine() · 71a29aa7
      Committed by Peter Zijlstra
      wake_affine() would always fail under low-load situations where
      both prev and this were idle, because adding a single task will
      always be a significant imbalance, even if there's nothing
      around that could balance it.
      
      Deal with this by allowing imbalance when there's nothing you
      can do about it.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      71a29aa7
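The low-load failure described in the wake_affine() message above can be sketched with a toy comparison. This is only an illustration: the function names, the percentage-threshold formula, and the default `imbalance_pct` value are assumptions made for the example, not the kernel's code.

```python
def affine_ok_strict(this_load, prev_load, task_weight, imbalance_pct=125):
    """Hypothetical check: allow the wakeup on this cpu only if this cpu's
    load plus the task stays within imbalance_pct percent of prev's load."""
    return 100 * (this_load + task_weight) <= imbalance_pct * prev_load


def affine_ok_low_load(this_load, prev_load, task_weight, imbalance_pct=125):
    """Same check, but accept the imbalance when this cpu is idle, since
    nothing around could balance it anyway (assumed form of the fix)."""
    if this_load == 0:
        return True
    return affine_ok_strict(this_load, prev_load, task_weight, imbalance_pct)


# Both cpus idle: a single task is always "too much" for the strict check.
strict_both_idle = affine_ok_strict(0, 0, 1024)
relaxed_both_idle = affine_ok_low_load(0, 0, 1024)

# With real load on both sides the two checks agree.
strict_loaded = affine_ok_strict(1024, 4096, 1024)
relaxed_loaded = affine_ok_low_load(1024, 4096, 1024)
```

With both loads at zero the right-hand side of the strict comparison is zero, so any positive task weight fails it; the relaxed variant lets the wakeup through.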
    • sched: Remove short cut from select_task_rq_fair() · cdd2ab3d
      Committed by Peter Zijlstra
      select_task_rq_fair() incorrectly skips the wake_affine()
      logic; remove this short cut.
      
      When prev_cpu == this_cpu, the code jumps straight to the
      wake_idle() logic, which doesn't give the wake_affine() logic
      the chance to pin the task to this cpu.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cdd2ab3d
  3. 02 Sep, 2009 · 2 commits
  4. 02 Aug, 2009 · 3 commits
  5. 18 Jul, 2009 · 1 commit
  6. 11 Jul, 2009 · 1 commit
  7. 18 Jun, 2009 · 1 commit
  8. 09 Apr, 2009 · 1 commit
  9. 11 Feb, 2009 · 1 commit
  10. 01 Feb, 2009 · 3 commits
  11. 16 Jan, 2009 · 1 commit
  12. 15 Jan, 2009 · 3 commits
    • sched: fix update_min_vruntime · e17036da
      Committed by Peter Zijlstra
      Impact: fix SCHED_IDLE latency problems
      
      OK, so we have 1 running task A (which is obviously curr and the tree is
      equally obviously empty).
      
      'A' nicely chugs along, doing its thing, carrying min_vruntime along as it
      goes.
      
      Then some whacko speed freak SCHED_IDLE task gets inserted due to SMP
      balancing, and it is very likely far to the right. In that case:
      
      update_curr
        update_min_vruntime
          cfs_rq->rb_leftmost := true (the crazy task sitting in a tree)
            vruntime = se->vruntime
      
      and voila, min_vruntime is waaay right of where it ought to be.
      
      OK, so why did I write it like that to begin with...
      
      Aah, yes.
      
      Say we've just dequeued current
      
      schedule
        deactivate_task(prev)
          dequeue_entity
            update_min_vruntime
      
      Then we'll set
      
        vruntime = cfs_rq->min_vruntime;
      
      we find !cfs_rq->curr, but do find someone in the tree. Then we _must_
      do vruntime = se->vruntime, because
      
       vruntime = min_vruntime(vruntime := cfs_rq->min_vruntime, se->vruntime)
      
      will not advance vruntime, and cause lags the other way around (which we
      fixed with that initial patch: 1af5f730
      (sched: more accurate min_vruntime accounting)).
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Mike Galbraith <efault@gmx.de>
      Cc: <stable@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e17036da
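The two cases walked through in the update_min_vruntime message above can be replayed on a toy model. This is a sketch under stated assumptions: `ToyCfsRq`, `update_old` and `update_fixed` are hypothetical names, `update_old` is a simplified reading of the first call trace (the leftmost entity wins outright), `update_fixed` is one reading of the behaviour the message argues for, and vruntime wraparound is ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCfsRq:
    """Toy runqueue holding only the three values the message discusses."""
    min_vruntime: int
    curr: Optional[int] = None      # vruntime of the running entity, None if dequeued
    leftmost: Optional[int] = None  # vruntime of the leftmost queued entity, None if tree empty


def update_old(rq: ToyCfsRq) -> int:
    """Simplified problem case: a queued entity always supplies vruntime."""
    vruntime = rq.min_vruntime
    if rq.leftmost is not None:
        vruntime = rq.leftmost
    rq.min_vruntime = max(rq.min_vruntime, vruntime)  # never moves backwards
    return rq.min_vruntime


def update_fixed(rq: ToyCfsRq) -> int:
    """Assumed fix: follow curr while it runs; use the leftmost entity
    directly only when there is no curr (the just-dequeued case)."""
    vruntime = rq.min_vruntime
    if rq.curr is not None:
        vruntime = rq.curr
    if rq.leftmost is not None:
        if rq.curr is None:
            vruntime = rq.leftmost
        else:
            vruntime = min(vruntime, rq.leftmost)
    rq.min_vruntime = max(rq.min_vruntime, vruntime)
    return rq.min_vruntime


# Case 1: task A runs at 1000 and a far-right SCHED_IDLE task lands at 5000.
old_result = update_old(ToyCfsRq(min_vruntime=1000, curr=1000, leftmost=5000))
fixed_result = update_fixed(ToyCfsRq(min_vruntime=1000, curr=1000, leftmost=5000))

# Case 2: current was just dequeued; the tree's leftmost entity sits at 1200.
dequeued_result = update_fixed(ToyCfsRq(min_vruntime=1000, curr=None, leftmost=1200))
```

In case 1 the old variant jumps min_vruntime to 5000 while the fixed variant keeps it at 1000. In case 2 the fixed variant advances to 1200, whereas taking min(min_vruntime, se->vruntime) would have stayed at 1000 and never advanced.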
    • sched: SCHED_OTHER vs SCHED_IDLE isolation · 6bc912b7
      Committed by Peter Zijlstra
      Stronger SCHED_IDLE isolation:
      
       - no SCHED_IDLE buddies
       - never let SCHED_IDLE preempt on wakeup
       - always preempt SCHED_IDLE on wakeup
       - limit SLEEPER fairness for SCHED_IDLE.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6bc912b7
    • sched: prefer wakers · e52fb7c0
      Committed by Peter Zijlstra
      Prefer tasks that wake other tasks to preempt quickly. This improves
      performance because more work is available sooner.
      
      The workload that prompted this patch was a kernel build over NFS4 (for some
      curious and not understood reason we had to revert commit:
      18de9735 to make any progress at all)
      
      Without this patch a make -j8 bzImage (of x86-64 defconfig) would take
      3m30-ish, with this patch we're down to 2m50-ish.
      
      psql-sysbench/mysql-sysbench show a slight improvement in peak performance as
      well, tbench and vmark seemed to not care.
      
      It is possible to improve upon the build time (to 2m20-ish) but that seriously
      destroys other benchmarks (just shows that there's more room for tinkering).
      
      Much thanks to Mike who put in a lot of effort to benchmark things and proved
      a worthy opponent with a competing patch.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e52fb7c0
  13. 09 Jan, 2009 · 1 commit
  14. 03 Jan, 2009 · 1 commit
  15. 19 Dec, 2008 · 1 commit
    • sched: bias task wakeups to preferred semi-idle packages · 7eb52dfa
      Committed by Vaidyanathan Srinivasan
      Impact: tweak task wakeup to save power more aggressively
      
      Preferred wakeup cpu (from a semi idle package) has been
      nominated in find_busiest_group() in the previous patch.  Use
      this information in sched_mc_preferred_wakeup_cpu in function
      wake_idle() to bias task wakeups if the following conditions
      are satisfied:
      
              - The present cpu that is trying to wake up the process is
                idle and waking the target process on this cpu will
                potentially wake up a completely idle package
              - The previous cpu on which the target process ran is
                also idle and hence selecting the previous cpu may
                wake up a semi idle cpu package
              - The task being woken up is allowed to run on the
                nominated cpu (cpu affinity and restrictions)
      
      Basically if both the current cpu and the previous cpu on
      which the task ran are idle, select the nominated cpu from the semi
      idle cpu package for running the new task that is waking up.
      
      Cache hotness is considered since the actual biasing happens
      in wake_idle() only if the application is cache cold.
      
      This technique will effectively move short running bursty jobs in
      a mostly idle system.
      
      Wakeup biasing for power savings gets automatically disabled if
      system utilisation increases due to the fact that the probability
      of finding both this_cpu and prev_cpu idle decreases.
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7eb52dfa
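The conditions listed in the wakeup-biasing message above can be read as a single predicate. The following is a minimal sketch, not the kernel's wake_idle(): the function name, its parameters, and the fallback to the previous cpu are all assumptions made for illustration.

```python
from typing import Optional, Set


def pick_wakeup_cpu(this_cpu: int, prev_cpu: int, idle_cpus: Set[int],
                    allowed_cpus: Set[int], preferred_cpu: Optional[int],
                    cache_hot: bool) -> int:
    """Return the nominated cpu only when every condition from the message
    holds; otherwise fall back to prev_cpu (assumed fallback)."""
    if cache_hot:
        # Biasing only applies to cache-cold tasks.
        return prev_cpu
    both_idle = this_cpu in idle_cpus and prev_cpu in idle_cpus
    if both_idle and preferred_cpu is not None and preferred_cpu in allowed_cpus:
        return preferred_cpu
    return prev_cpu


# Mostly idle system: waker cpu 0 and previous cpu 4 are idle, cpu 2 is nominated.
biased = pick_wakeup_cpu(0, 4, idle_cpus={0, 4, 5}, allowed_cpus={0, 2, 4},
                         preferred_cpu=2, cache_hot=False)

# Busier system: the waking cpu is no longer idle, so no biasing happens.
unbiased = pick_wakeup_cpu(0, 4, idle_cpus={4}, allowed_cpus={0, 2, 4},
                           preferred_cpu=2, cache_hot=False)
```

The second call shows the self-disabling property the message mentions: as utilisation rises, the chance that both this_cpu and prev_cpu are idle drops, and the predicate stops selecting the nominated cpu.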
  16. 16 Dec, 2008 · 2 commits
  17. 25 Nov, 2008 · 2 commits
    • sched: convert remaining old-style cpumask operators · 96f874e2
      Committed by Rusty Russell
      Impact: Trivial API conversion
      
        NR_CPUS -> nr_cpu_ids
        cpumask_t -> struct cpumask
        sizeof(cpumask_t) -> cpumask_size()
        cpumask_a = cpumask_b -> cpumask_copy(&cpumask_a, &cpumask_b)
      
        cpu_set() -> cpumask_set_cpu()
        first_cpu() -> cpumask_first()
        cpumask_of_cpu() -> cpumask_of()
        cpus_* -> cpumask_*
      
      There are some FIXMEs where we need all archs to complete infrastructure
      (patches have been sent):
      
        cpu_coregroup_map -> cpu_coregroup_mask
        node_to_cpumask* -> cpumask_of_node
      
      There is also one FIXME where we pass an array of cpumasks to
      partition_sched_domains(): this implies knowing the definition of
      'struct cpumask' and the size of a cpumask.  This will be fixed in a
      future patch.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      96f874e2
    • sched: wrap sched_group and sched_domain cpumask accesses. · 758b2cdc
      Committed by Rusty Russell
      Impact: trivial wrap of member accesses
      
      This eases the transition in the next patch.
      
      We also get rid of a temporary cpumask in find_idlest_cpu() thanks to
      for_each_cpu_and, and sched_balance_self() due to getting weight before
      setting sd to NULL.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      758b2cdc
  18. 11 Nov, 2008 · 1 commit
  19. 05 Nov, 2008 · 4 commits
  20. 24 Oct, 2008 · 4 commits
  21. 22 Oct, 2008 · 1 commit
  22. 20 Oct, 2008 · 2 commits
    • sched: revert back to per-rq vruntime · f9c0b095
      Committed by Peter Zijlstra
      Vatsa rightly points out that having the runqueue weight in the vruntime
      calculations can cause unfairness in the face of task joins/leaves.
      
      Suppose: dv = dt * rw / w
      
      Then take 10 tasks t_n, each of similar weight. If the first runs for 1
      unit, its vruntime will increase by 10. Now, if the next 8 tasks leave after
      having run their 1, then the last task will get a vruntime increase of 2
      after having run 1.
      
      Which will leave us with 2 tasks of equal weight and equal runtime, of which
      one will not be scheduled for 8/2=4 units of time.
      
      Ergo, we cannot do that and must use: dv = dt / w.
      
      This means we cannot have a global vruntime based on effective priority, but
      must instead go back to the vruntime per rq model we started out with.
      
      This patch was lightly tested by starting while loops on each nice level
      and observing their execution time, and with a simple group scenario of 1:2:3
      pinned to a single cpu.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f9c0b095
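The arithmetic in the per-rq vruntime message above can be checked with a short worked example. This sketch assumes unit weights (w = 1 per task, so rw equals the number of runnable tasks); the function and variable names are made up for the illustration.

```python
from fractions import Fraction


def dv_with_rq_weight(dt, rw, w):
    """Rejected model from the message: dv = dt * rw / w."""
    return Fraction(dt) * rw / w


def dv_per_rq(dt, w):
    """Model the patch returns to: dv = dt / w."""
    return Fraction(dt) / w


w = 1  # assumed weight of every task

# First task runs 1 unit while all 10 tasks are on the runqueue (rw = 10).
first_dv = dv_with_rq_weight(1, rw=10 * w, w=w)

# Eight tasks leave; the last task runs 1 unit with only 2 tasks left (rw = 2).
last_dv = dv_with_rq_weight(1, rw=2 * w, w=w)

# Equal runtime, equal weight, yet the vruntimes differ.
gap = first_dv - last_dv

# The last task closes the gap at rw / w = 2 vruntime per unit of runtime,
# so the first task is not scheduled for this many units.
starved_for = gap / dv_with_rq_weight(1, rw=2 * w, w=w)

# With dv = dt / w both tasks advance identically and no gap appears.
per_rq_gap = dv_per_rq(1, w) - dv_per_rq(1, w)
```

Here `first_dv` is 10, `last_dv` is 2, the gap is 8, and the first task waits 8/2 = 4 units, matching the figures in the message; under dv = dt / w the gap is 0.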
    • sched: fair scheduler should not resched rt tasks · a4c2f00f
      Committed by Peter Zijlstra
      With use of ftrace Steven noticed that some RT tasks got rescheduled due
      to sched_fair interaction.
      
      What happens is that we reprogram the hrtick from enqueue/dequeue_fair_task()
      because that can change nr_running, and thus the current task's ideal runtime.
      However, it's possible the current task isn't a fair_sched_class task, and thus
      doesn't have a hrtick set to change.
      
      Fix this by wrapping those hrtick_start_fair() calls in a hrtick_update()
      function, which will check for the right conditions.
      Reported-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a4c2f00f