1. 11 4月, 2011 1 次提交
    • K
      sched: Fix sched-domain avg_load calculation · b0432d8f
      Ken Chen 提交于
      In function find_busiest_group(), the sched-domain avg_load isn't
      calculated at all if there is a group imbalance within the domain. This
      will cause erroneous imbalance calculation.
      
      The reason is that calculate_imbalance() sees sds->avg_load = 0 and it
      will dump entire sds->max_load into imbalance variable, which is used
      later on to migrate entire load from busiest CPU to the puller CPU.
      
      This has two really bad effect:
      
      1. stampede of task migration, and they won't be able to break out
         of the bad state because of positive feedback loop: large load
         delta -> heavier load migration -> larger imbalance and the cycle
         goes on.
      
      2. severe imbalance in CPU queue depth.  This causes really long
         scheduling latency blip which affects badly on application that
         has tight latency requirement.
      
      The fix is to have kernel calculate domain avg_load in both cases. This
      will ensure that imbalance calculation is always sensible and the target
      is usually half way between busiest and puller CPU.
      Signed-off-by: NKen Chen <kenchen@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
      b0432d8f
  2. 05 4月, 2011 1 次提交
  3. 31 3月, 2011 2 次提交
  4. 04 3月, 2011 2 次提交
  5. 23 2月, 2011 3 次提交
    • P
      sched: Fix the group_imb logic · 866ab43e
      Peter Zijlstra 提交于
      On a 2*6*2 machine something like:
      
       taskset -c 3-11 bash -c 'for ((i=0;i<9;i++)) do while :; do :; done & done'
      
      _should_ result in 9 busy CPUs, each running 1 task.
      
      However it didn't quite work reliably, most of the time one cpu of the
      second socket (6-11) would be idle and one cpu of the first socket
      (0-5) would have two tasks on it.
      
      The group_imb logic is supposed to deal with this and detect when a
      particular group is imbalanced (like in our case, 0-2 are idle but 3-5
      will have 4 tasks on it).
      
      The detection phase needed a bit of a tweak as it was too weak and
      required more than 2 avg weight tasks difference between idle and busy
      cpus in the group which won't trigger for our test-case. So cure that
      to be one or more avg task weight difference between cpus.
      
      Once the detection phase worked, it was then defeated by the f_b_g()
      tests trying to avoid ping-pongs. In particular, this_load >= max_load
      triggered because the pulling cpu (the (first) idle cpu in on the
      second socket, say 6) would find this_load to be 5 and max_load to be
      4 (there'd be 5 tasks running on our socket and only 4 on the other
      socket).
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      866ab43e
    • P
      sched: Clean up some f_b_g() comments · cc57aa8f
      Peter Zijlstra 提交于
      The existing comment tends to grow state (as it already has), split it
      up and place it near the actual tests.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      cc57aa8f
    • P
      sched: Clean up remnants of sd_idle · c186fafe
      Peter Zijlstra 提交于
      With the wholesale removal of the sd_idle SMT logic we can clean up
      some more.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      c186fafe
  6. 16 2月, 2011 1 次提交
  7. 03 2月, 2011 4 次提交
    • M
      sched: Add yield_to(task, preempt) functionality · d95f4122
      Mike Galbraith 提交于
      Currently only implemented for fair class tasks.
      
      Add a yield_to_task method() to the fair scheduling class. allowing the
      caller of yield_to() to accelerate another thread in it's thread group,
      task group.
      
      Implemented via a scheduler hint, using cfs_rq->next to encourage the
      target being selected.  We can rely on pick_next_entity to keep things
      fair, so noone can accelerate a thread that has already used its fair
      share of CPU time.
      
      This also means callers should only call yield_to when they really
      mean it.  Calling it too often can result in the scheduler just
      ignoring the hint.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NMike Galbraith <efault@gmx.de>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095051.4ddb7738@annuminas.surriel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d95f4122
    • R
      sched: Use a buddy to implement yield_task_fair() · ac53db59
      Rik van Riel 提交于
      Use the buddy mechanism to implement yield_task_fair.  This
      allows us to skip onto the next highest priority se at every
      level in the CFS tree, unless doing so would introduce gross
      unfairness in CPU time distribution.
      
      We order the buddy selection in pick_next_entity to check
      yield first, then last, then next.  We need next to be able
      to override yield, because it is possible for the "next" and
      "yield" task to be different processen in the same sub-tree
      of the CFS tree.  When they are, we need to go into that
      sub-tree regardless of the "yield" hint, and pick the correct
      entity once we get to the right level.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095103.3a79e92a@annuminas.surriel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ac53db59
    • R
      sched: Limit the scope of clear_buddies · 2c13c919
      Rik van Riel 提交于
      The clear_buddies function does not seem to play well with the concept
      of hierarchical runqueues.  In the following tree, task groups are
      represented by 'G', tasks by 'T', next by 'n' and last by 'l'.
      
           (nl)
          /    \
         G(nl)  G
         / \     \
       T(l) T(n)  T
      
      This situation can arise when a task is woken up T(n), and the previously
      running task T(l) is marked last.
      
      When clear_buddies is called from either T(l) or T(n), the next and last
      buddies of the group G(nl) will be cleared.  This is not the desired
      result, since we would like to be able to find the other type of buddy
      in many cases.
      
      This especially a worry when implementing yield_task_fair through the
      buddy system.
      
      The fix is simple: only clear the buddy type that the task itself
      is indicated to be.  As an added bonus, we stop walking up the tree
      when the buddy has already been cleared or pointed elsewhere.
      Signed-off-by: NRik van Riel <riel@redhat.coM>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201094837.6b0962a9@annuminas.surriel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2c13c919
    • R
      sched: Check the right ->nr_running in yield_task_fair() · 725e7580
      Rik van Riel 提交于
      With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq.
      Yielding to a task from another cfs_rq may be worthwhile, since
      a process calling yield typically cannot use the CPU right now.
      
      Therefor, we want to check the per-cpu nr_running, not the
      cgroup local one.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201094715.798c4f86@annuminas.surriel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      725e7580
  8. 26 1月, 2011 6 次提交
  9. 24 1月, 2011 1 次提交
  10. 18 1月, 2011 2 次提交
  11. 19 12月, 2010 2 次提交
    • P
      sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight · 19e5eebb
      Paul Turner 提交于
      Mike Galbraith reported poor interactivity[*] when the new shares distribution
      code was combined with autogroups.
      
      The root cause turns out to be a mis-ordering of accounting accrued execution
      time and shares updates.  Since update_curr() is issued hierarchically,
      updating the parent entity weights to reflect child enqueue/dequeue results in
      the parent's unaccounted execution time then being accrued (vs vruntime) at the
      new weight as opposed to the weight present at accumulation.
      
      While this doesn't have much effect on processes with timeslices that cross a
      tick, it is particularly problematic for an interactive process (e.g. Xorg)
      which incurs many (tiny) timeslices.  In this scenario almost all updates are
      at dequeue which can result in significant fairness perturbation (especially if
      it is the only thread, resulting in potential {tg->shares, MIN_SHARES}
      transitions).
      
      Correct this by ensuring unaccounted time is accumulated prior to manipulating
      an entity's weight.
      
      [*] http://xkcd.com/619/ is perversely Nostradamian here.
      Signed-off-by: NPaul Turner <pjt@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20101216031038.159704378@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      19e5eebb
    • P
      sched: Move periodic share updates to entity_tick() · 43365bd7
      Paul Turner 提交于
      Long running entities that do not block (dequeue) require periodic updates to
      maintain accurate share values.  (Note: group entities with several threads are
      quite likely to be non-blocking in many circumstances).
      
      By virtue of being long-running however, we will see entity ticks (otherwise
      the required update occurs in dequeue/put and we are done).  Thus we can move
      the detection (and associated work) for these updates into the periodic path.
      
      This restores the 'atomicity' of update_curr() with respect to accounting.
      Signed-off-by: NPaul Turner <pjt@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101216031038.067028969@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      43365bd7
  12. 23 11月, 2010 1 次提交
  13. 18 11月, 2010 12 次提交
  14. 11 11月, 2010 2 次提交
    • P
      sched: Fix cross-sched-class wakeup preemption · 1e5a7405
      Peter Zijlstra 提交于
      Instead of dealing with sched classes inside each check_preempt_curr()
      implementation, pull out this logic into the generic wakeup preemption
      path.
      
      This fixes a hang in KVM (and others) where we are waiting for the
      stop machine thread to run ...
      Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de>
      Tested-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Tested-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1288891946.2039.31.camel@laptop>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      1e5a7405
    • S
      sched: Use group weight, idle cpu metrics to fix imbalances during idle · aae6d3dd
      Suresh Siddha 提交于
      Currently we consider a sched domain to be well balanced when the imbalance
      is less than the domain's imablance_pct. As the number of cores and threads
      are increasing, current values of imbalance_pct (for example 25% for a
      NUMA domain) are not enough to detect imbalances like:
      
      a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads),
      24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another
      socket. Leading to an idle HT cpu.
      
      b) On a hypothetial 2 socket NHM-EX system (each socket having 8 cores and
      16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
      socket and 7 on another socket. Leaving one core in a socket idle
      whereas in another socket we have a core having both its HT siblings busy.
      
      While this issue can be fixed by decreasing the domain's imbalance_pct
      (by making it a function of number of logical cpus in the domain), it
      can potentially cause more task migrations across sched groups in an
      overloaded case.
      
      Fix this by using imbalance_pct only during newly_idle and busy
      load balancing. And during idle load balancing, check if there
      is an imbalance in number of idle cpu's across the busiest and this
      sched_group or if the busiest group has more tasks than its weight that
      the idle cpu in this_group can pull.
      Reported-by: NNikhil Rao <ncrao@google.com>
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      aae6d3dd