1. 14 4月, 2011 3 次提交
  2. 11 4月, 2011 5 次提交
    • P
      sched: Avoid using sd->level · a6c75f2f
      Peter Zijlstra 提交于
      Don't use sd->level for identifying properties of the domain.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20110407122942.350174079@chello.nlSigned-off-by: NIngo Molnar <mingo@elte.hu>
      a6c75f2f
    • P
      sched: Dynamically allocate sched_domain/sched_group data-structures · dce840a0
      Peter Zijlstra 提交于
      Instead of relying on static allocations for the sched_domain and
      sched_group trees, dynamically allocate and RCU free them.
      
      Allocating this dynamically also allows for some build_sched_groups()
      simplification since we can now (like with other simplifications) rely
      on the sched_domain tree instead of hard-coded knowledge.
      
      One tricky to note is that detach_destroy_domains() needs to hold
      rcu_read_lock() over the entire tear-down, per-cpu is not sufficient
      since that can lead to partial sched_group existance (could possibly
      be solved by doing the tear-down backwards but this is much more
      robust).
      
      A concequence of the above is that we can no longer print the
      sched_domain debug stuff from cpu_attach_domain() since that might now
      run with preemption disabled (due to classic RCU etc.) and
      sched_domain_debug() does some GFP_KERNEL allocations.
      
      Another thing to note is that we now fully rely on normal RCU and not
      RCU-sched, this is because with the new and exiting RCU flavours we
      grew over the years BH doesn't necessarily hold off RCU-sched grace
      periods (-rt is known to break this). This would in fact already cause
      us grief since we do sched_domain/sched_group iterations from softirq
      context.
      
      This patch is somewhat larger than I would like it to be, but I didn't
      find any means of shrinking/splitting this.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20110407122942.245307941@chello.nlSigned-off-by: NIngo Molnar <mingo@elte.hu>
      dce840a0
    • S
      sched: Eliminate dead code from wakeup_gran() · f4ad9bd2
      Shaohua Li 提交于
      calc_delta_fair() checks NICE_0_LOAD already, delete duplicate check.
      
      Signed-off-by: Shaohua Li<shaohua.li@intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Link: http://lkml.kernel.org/r/1302238389.3981.92.camel@sli10-conroeSigned-off-by: NIngo Molnar <mingo@elte.hu>
      f4ad9bd2
    • K
      sched: Fix erroneous all_pinned logic · b30aef17
      Ken Chen 提交于
      The scheduler load balancer has specific code to deal with cases of
      unbalanced system due to lots of unmovable tasks (for example because of
      hard CPU affinity). In those situation, it excludes the busiest CPU that
      has pinned tasks for load balance consideration such that it can perform
      second 2nd load balance pass on the rest of the system.
      
      This all works as designed if there is only one cgroup in the system.
      
      However, when we have multiple cgroups, this logic has false positives and
      triggers multiple load balance passes despite there are actually no pinned
      tasks at all.
      
      The reason it has false positives is that the all pinned logic is deep in
      the lowest function of can_migrate_task() and is too low level:
      
      load_balance_fair() iterates each task group and calls balance_tasks() to
      migrate target load. Along the way, balance_tasks() will also set a
      all_pinned variable. Given that task-groups are iterated, this all_pinned
      variable is essentially the status of last group in the scanning process.
      Task group can have number of reasons that no load being migrated, none
      due to cpu affinity. However, this status bit is being propagated back up
      to the higher level load_balance(), which incorrectly think that no tasks
      were moved.  It kick off the all pinned logic and start multiple passes
      attempt to move load onto puller CPU.
      
      To fix this, move the all_pinned aggregation up at the iterator level.
      This ensures that the status is aggregated over all task-groups, not just
      last one in the list.
      Signed-off-by: NKen Chen <kenchen@google.com>
      Cc: stable@kernel.org
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
      b30aef17
    • K
      sched: Fix sched-domain avg_load calculation · b0432d8f
      Ken Chen 提交于
      In function find_busiest_group(), the sched-domain avg_load isn't
      calculated at all if there is a group imbalance within the domain. This
      will cause erroneous imbalance calculation.
      
      The reason is that calculate_imbalance() sees sds->avg_load = 0 and it
      will dump entire sds->max_load into imbalance variable, which is used
      later on to migrate entire load from busiest CPU to the puller CPU.
      
      This has two really bad effect:
      
      1. stampede of task migration, and they won't be able to break out
         of the bad state because of positive feedback loop: large load
         delta -> heavier load migration -> larger imbalance and the cycle
         goes on.
      
      2. severe imbalance in CPU queue depth.  This causes really long
         scheduling latency blip which affects badly on application that
         has tight latency requirement.
      
      The fix is to have kernel calculate domain avg_load in both cases. This
      will ensure that imbalance calculation is always sensible and the target
      is usually half way between busiest and puller CPU.
      Signed-off-by: NKen Chen <kenchen@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
      b0432d8f
  3. 05 4月, 2011 1 次提交
  4. 31 3月, 2011 2 次提交
  5. 04 3月, 2011 2 次提交
  6. 23 2月, 2011 3 次提交
    • P
      sched: Fix the group_imb logic · 866ab43e
      Peter Zijlstra 提交于
      On a 2*6*2 machine something like:
      
       taskset -c 3-11 bash -c 'for ((i=0;i<9;i++)) do while :; do :; done & done'
      
      _should_ result in 9 busy CPUs, each running 1 task.
      
      However it didn't quite work reliably, most of the time one cpu of the
      second socket (6-11) would be idle and one cpu of the first socket
      (0-5) would have two tasks on it.
      
      The group_imb logic is supposed to deal with this and detect when a
      particular group is imbalanced (like in our case, 0-2 are idle but 3-5
      will have 4 tasks on it).
      
      The detection phase needed a bit of a tweak as it was too weak and
      required more than 2 avg weight tasks difference between idle and busy
      cpus in the group which won't trigger for our test-case. So cure that
      to be one or more avg task weight difference between cpus.
      
      Once the detection phase worked, it was then defeated by the f_b_g()
      tests trying to avoid ping-pongs. In particular, this_load >= max_load
      triggered because the pulling cpu (the (first) idle cpu in on the
      second socket, say 6) would find this_load to be 5 and max_load to be
      4 (there'd be 5 tasks running on our socket and only 4 on the other
      socket).
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      866ab43e
    • P
      sched: Clean up some f_b_g() comments · cc57aa8f
      Peter Zijlstra 提交于
      The existing comment tends to grow state (as it already has), split it
      up and place it near the actual tests.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      cc57aa8f
    • P
      sched: Clean up remnants of sd_idle · c186fafe
      Peter Zijlstra 提交于
      With the wholesale removal of the sd_idle SMT logic we can clean up
      some more.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      c186fafe
  7. 16 2月, 2011 1 次提交
  8. 03 2月, 2011 4 次提交
    • M
      sched: Add yield_to(task, preempt) functionality · d95f4122
      Mike Galbraith 提交于
      Currently only implemented for fair class tasks.
      
      Add a yield_to_task method() to the fair scheduling class. allowing the
      caller of yield_to() to accelerate another thread in it's thread group,
      task group.
      
      Implemented via a scheduler hint, using cfs_rq->next to encourage the
      target being selected.  We can rely on pick_next_entity to keep things
      fair, so noone can accelerate a thread that has already used its fair
      share of CPU time.
      
      This also means callers should only call yield_to when they really
      mean it.  Calling it too often can result in the scheduler just
      ignoring the hint.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NMike Galbraith <efault@gmx.de>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095051.4ddb7738@annuminas.surriel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d95f4122
    • R
      sched: Use a buddy to implement yield_task_fair() · ac53db59
      Rik van Riel 提交于
      Use the buddy mechanism to implement yield_task_fair.  This
      allows us to skip onto the next highest priority se at every
      level in the CFS tree, unless doing so would introduce gross
      unfairness in CPU time distribution.
      
      We order the buddy selection in pick_next_entity to check
      yield first, then last, then next.  We need next to be able
      to override yield, because it is possible for the "next" and
      "yield" task to be different processen in the same sub-tree
      of the CFS tree.  When they are, we need to go into that
      sub-tree regardless of the "yield" hint, and pick the correct
      entity once we get to the right level.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095103.3a79e92a@annuminas.surriel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ac53db59
    • R
      sched: Limit the scope of clear_buddies · 2c13c919
      Rik van Riel 提交于
      The clear_buddies function does not seem to play well with the concept
      of hierarchical runqueues.  In the following tree, task groups are
      represented by 'G', tasks by 'T', next by 'n' and last by 'l'.
      
           (nl)
          /    \
         G(nl)  G
         / \     \
       T(l) T(n)  T
      
      This situation can arise when a task is woken up T(n), and the previously
      running task T(l) is marked last.
      
      When clear_buddies is called from either T(l) or T(n), the next and last
      buddies of the group G(nl) will be cleared.  This is not the desired
      result, since we would like to be able to find the other type of buddy
      in many cases.
      
      This especially a worry when implementing yield_task_fair through the
      buddy system.
      
      The fix is simple: only clear the buddy type that the task itself
      is indicated to be.  As an added bonus, we stop walking up the tree
      when the buddy has already been cleared or pointed elsewhere.
      Signed-off-by: NRik van Riel <riel@redhat.coM>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201094837.6b0962a9@annuminas.surriel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2c13c919
    • R
      sched: Check the right ->nr_running in yield_task_fair() · 725e7580
      Rik van Riel 提交于
      With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq.
      Yielding to a task from another cfs_rq may be worthwhile, since
      a process calling yield typically cannot use the CPU right now.
      
      Therefor, we want to check the per-cpu nr_running, not the
      cgroup local one.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201094715.798c4f86@annuminas.surriel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      725e7580
  9. 26 1月, 2011 6 次提交
  10. 24 1月, 2011 1 次提交
  11. 18 1月, 2011 2 次提交
  12. 19 12月, 2010 2 次提交
    • P
      sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight · 19e5eebb
      Paul Turner 提交于
      Mike Galbraith reported poor interactivity[*] when the new shares distribution
      code was combined with autogroups.
      
      The root cause turns out to be a mis-ordering of accounting accrued execution
      time and shares updates.  Since update_curr() is issued hierarchically,
      updating the parent entity weights to reflect child enqueue/dequeue results in
      the parent's unaccounted execution time then being accrued (vs vruntime) at the
      new weight as opposed to the weight present at accumulation.
      
      While this doesn't have much effect on processes with timeslices that cross a
      tick, it is particularly problematic for an interactive process (e.g. Xorg)
      which incurs many (tiny) timeslices.  In this scenario almost all updates are
      at dequeue which can result in significant fairness perturbation (especially if
      it is the only thread, resulting in potential {tg->shares, MIN_SHARES}
      transitions).
      
      Correct this by ensuring unaccounted time is accumulated prior to manipulating
      an entity's weight.
      
      [*] http://xkcd.com/619/ is perversely Nostradamian here.
      Signed-off-by: NPaul Turner <pjt@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20101216031038.159704378@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      19e5eebb
    • P
      sched: Move periodic share updates to entity_tick() · 43365bd7
      Paul Turner 提交于
      Long running entities that do not block (dequeue) require periodic updates to
      maintain accurate share values.  (Note: group entities with several threads are
      quite likely to be non-blocking in many circumstances).
      
      By virtue of being long-running however, we will see entity ticks (otherwise
      the required update occurs in dequeue/put and we are done).  Thus we can move
      the detection (and associated work) for these updates into the periodic path.
      
      This restores the 'atomicity' of update_curr() with respect to accounting.
      Signed-off-by: NPaul Turner <pjt@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101216031038.067028969@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      43365bd7
  13. 23 11月, 2010 1 次提交
  14. 18 11月, 2010 7 次提交