1. 23 2月, 2011 3 次提交
    • P
      sched: Fix the group_imb logic · 866ab43e
      Peter Zijlstra 提交于
      On a 2*6*2 machine something like:
      
       taskset -c 3-11 bash -c 'for ((i=0;i<9;i++)) do while :; do :; done & done'
      
      _should_ result in 9 busy CPUs, each running 1 task.
      
      However it didn't quite work reliably, most of the time one cpu of the
      second socket (6-11) would be idle and one cpu of the first socket
      (0-5) would have two tasks on it.
      
      The group_imb logic is supposed to deal with this and detect when a
      particular group is imbalanced (like in our case, 0-2 are idle but 3-5
      will have 4 tasks on it).
      
      The detection phase needed a bit of a tweak as it was too weak and
      required more than 2 avg weight tasks difference between idle and busy
      cpus in the group which won't trigger for our test-case. So cure that
      to be one or more avg task weight difference between cpus.
      
      Once the detection phase worked, it was then defeated by the f_b_g()
      tests trying to avoid ping-pongs. In particular, this_load >= max_load
      triggered because the pulling cpu (the (first) idle cpu in on the
      second socket, say 6) would find this_load to be 5 and max_load to be
      4 (there'd be 5 tasks running on our socket and only 4 on the other
      socket).
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      866ab43e
    • P
      sched: Clean up some f_b_g() comments · cc57aa8f
      Peter Zijlstra 提交于
      The existing comment tends to grow state (as it already has), split it
      up and place it near the actual tests.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      cc57aa8f
    • P
      sched: Clean up remnants of sd_idle · c186fafe
      Peter Zijlstra 提交于
      With the wholesale removal of the sd_idle SMT logic we can clean up
      some more.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      c186fafe
  2. 16 2月, 2011 1 次提交
  3. 03 2月, 2011 4 次提交
    • M
      sched: Add yield_to(task, preempt) functionality · d95f4122
      Mike Galbraith 提交于
      Currently only implemented for fair class tasks.
      
      Add a yield_to_task method() to the fair scheduling class. allowing the
      caller of yield_to() to accelerate another thread in it's thread group,
      task group.
      
      Implemented via a scheduler hint, using cfs_rq->next to encourage the
      target being selected.  We can rely on pick_next_entity to keep things
      fair, so noone can accelerate a thread that has already used its fair
      share of CPU time.
      
      This also means callers should only call yield_to when they really
      mean it.  Calling it too often can result in the scheduler just
      ignoring the hint.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NMike Galbraith <efault@gmx.de>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095051.4ddb7738@annuminas.surriel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d95f4122
    • R
      sched: Use a buddy to implement yield_task_fair() · ac53db59
      Rik van Riel 提交于
      Use the buddy mechanism to implement yield_task_fair.  This
      allows us to skip onto the next highest priority se at every
      level in the CFS tree, unless doing so would introduce gross
      unfairness in CPU time distribution.
      
      We order the buddy selection in pick_next_entity to check
      yield first, then last, then next.  We need next to be able
      to override yield, because it is possible for the "next" and
      "yield" task to be different processen in the same sub-tree
      of the CFS tree.  When they are, we need to go into that
      sub-tree regardless of the "yield" hint, and pick the correct
      entity once we get to the right level.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095103.3a79e92a@annuminas.surriel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ac53db59
    • R
      sched: Limit the scope of clear_buddies · 2c13c919
      Rik van Riel 提交于
      The clear_buddies function does not seem to play well with the concept
      of hierarchical runqueues.  In the following tree, task groups are
      represented by 'G', tasks by 'T', next by 'n' and last by 'l'.
      
           (nl)
          /    \
         G(nl)  G
         / \     \
       T(l) T(n)  T
      
      This situation can arise when a task is woken up T(n), and the previously
      running task T(l) is marked last.
      
      When clear_buddies is called from either T(l) or T(n), the next and last
      buddies of the group G(nl) will be cleared.  This is not the desired
      result, since we would like to be able to find the other type of buddy
      in many cases.
      
      This especially a worry when implementing yield_task_fair through the
      buddy system.
      
      The fix is simple: only clear the buddy type that the task itself
      is indicated to be.  As an added bonus, we stop walking up the tree
      when the buddy has already been cleared or pointed elsewhere.
      Signed-off-by: NRik van Riel <riel@redhat.coM>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201094837.6b0962a9@annuminas.surriel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2c13c919
    • R
      sched: Check the right ->nr_running in yield_task_fair() · 725e7580
      Rik van Riel 提交于
      With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq.
      Yielding to a task from another cfs_rq may be worthwhile, since
      a process calling yield typically cannot use the CPU right now.
      
      Therefor, we want to check the per-cpu nr_running, not the
      cgroup local one.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201094715.798c4f86@annuminas.surriel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      725e7580
  4. 26 1月, 2011 6 次提交
  5. 24 1月, 2011 1 次提交
  6. 18 1月, 2011 2 次提交
  7. 19 12月, 2010 2 次提交
    • P
      sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight · 19e5eebb
      Paul Turner 提交于
      Mike Galbraith reported poor interactivity[*] when the new shares distribution
      code was combined with autogroups.
      
      The root cause turns out to be a mis-ordering of accounting accrued execution
      time and shares updates.  Since update_curr() is issued hierarchically,
      updating the parent entity weights to reflect child enqueue/dequeue results in
      the parent's unaccounted execution time then being accrued (vs vruntime) at the
      new weight as opposed to the weight present at accumulation.
      
      While this doesn't have much effect on processes with timeslices that cross a
      tick, it is particularly problematic for an interactive process (e.g. Xorg)
      which incurs many (tiny) timeslices.  In this scenario almost all updates are
      at dequeue which can result in significant fairness perturbation (especially if
      it is the only thread, resulting in potential {tg->shares, MIN_SHARES}
      transitions).
      
      Correct this by ensuring unaccounted time is accumulated prior to manipulating
      an entity's weight.
      
      [*] http://xkcd.com/619/ is perversely Nostradamian here.
      Signed-off-by: NPaul Turner <pjt@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20101216031038.159704378@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      19e5eebb
    • P
      sched: Move periodic share updates to entity_tick() · 43365bd7
      Paul Turner 提交于
      Long running entities that do not block (dequeue) require periodic updates to
      maintain accurate share values.  (Note: group entities with several threads are
      quite likely to be non-blocking in many circumstances).
      
      By virtue of being long-running however, we will see entity ticks (otherwise
      the required update occurs in dequeue/put and we are done).  Thus we can move
      the detection (and associated work) for these updates into the periodic path.
      
      This restores the 'atomicity' of update_curr() with respect to accounting.
      Signed-off-by: NPaul Turner <pjt@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101216031038.067028969@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      43365bd7
  8. 23 11月, 2010 1 次提交
  9. 18 11月, 2010 12 次提交
  10. 11 11月, 2010 2 次提交
    • P
      sched: Fix cross-sched-class wakeup preemption · 1e5a7405
      Peter Zijlstra 提交于
      Instead of dealing with sched classes inside each check_preempt_curr()
      implementation, pull out this logic into the generic wakeup preemption
      path.
      
      This fixes a hang in KVM (and others) where we are waiting for the
      stop machine thread to run ...
      Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de>
      Tested-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Tested-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1288891946.2039.31.camel@laptop>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      1e5a7405
    • S
      sched: Use group weight, idle cpu metrics to fix imbalances during idle · aae6d3dd
      Suresh Siddha 提交于
      Currently we consider a sched domain to be well balanced when the imbalance
      is less than the domain's imablance_pct. As the number of cores and threads
      are increasing, current values of imbalance_pct (for example 25% for a
      NUMA domain) are not enough to detect imbalances like:
      
      a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads),
      24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another
      socket. Leading to an idle HT cpu.
      
      b) On a hypothetial 2 socket NHM-EX system (each socket having 8 cores and
      16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
      socket and 7 on another socket. Leaving one core in a socket idle
      whereas in another socket we have a core having both its HT siblings busy.
      
      While this issue can be fixed by decreasing the domain's imbalance_pct
      (by making it a function of number of logical cpus in the domain), it
      can potentially cause more task migrations across sched groups in an
      overloaded case.
      
      Fix this by using imbalance_pct only during newly_idle and busy
      load balancing. And during idle load balancing, check if there
      is an imbalance in number of idle cpu's across the busiest and this
      sched_group or if the busiest group has more tasks than its weight that
      the idle cpu in this_group can pull.
      Reported-by: NNikhil Rao <ncrao@google.com>
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      aae6d3dd
  11. 22 10月, 2010 1 次提交
  12. 19 10月, 2010 5 次提交
    • V
      sched: Remove irq time from available CPU power · aa483808
      Venkatesh Pallipadi 提交于
      The idea was suggested by Peter Zijlstra here:
      
        http://marc.info/?l=linux-kernel&m=127476934517534&w=2
      
      irq time is technically not available to the tasks running on the CPU.
      This patch removes irq time from CPU power piggybacking on
      sched_rt_avg_update().
      
      Tested this by keeping CPU X busy with a network intensive task having 75%
      oa a single CPU irq processing (hard+soft) on a 4-way system. And start seven
      cycle soakers on the system. Without this change, there will be two tasks on
      each CPU. With this change, there is a single task on irq busy CPU X and
      remaining 7 tasks are spread around among other 3 CPUs.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      aa483808
    • V
      sched: Do not account irq time to current task · 305e6835
      Venkatesh Pallipadi 提交于
      Scheduler accounts both softirq and interrupt processing times to the
      currently running task. This means, if the interrupt processing was
      for some other task in the system, then the current task ends up being
      penalized as it gets shorter runtime than otherwise.
      
      Change sched task accounting to acoount only actual task time from
      currently running task. Now update_curr(), modifies the delta_exec to
      depend on rq->clock_task.
      
      Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can
      extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats
      for later.
      
      This change will impact scheduling behavior in interrupt heavy conditions.
      
      Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
      task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
      spending 75%+ of its time in irq processing. CPU 3 spending around 35%
      time running nc task.
      
      Now, if I run another CPU intensive task on CPU 2, without this change
      /proc/<pid>/schedstat shows 100% of time accounted to this task. With this
      change, it rightly shows less than 25% accounted to this task as remaining
      time is actually spent on irq processing.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      305e6835
    • N
      sched: Drop group_capacity to 1 only if local group has extra capacity · 75dd321d
      Nikhil Rao 提交于
      When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
      only if the local group has extra capacity. The extra check prevents the case
      where you always pull from the heaviest group when it is already under-utilized
      (possible with a large weight task outweighs the tasks on the system).
      
      For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
      scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task,
      and each task is running on one core. In this case, we observe the following
      events when balancing at the NUMA domain:
      
      - find_busiest_group() will always pick the sched group containing the niced
        task to be the busiest group.
      - find_busiest_queue() will then always pick one of the cpus running the
        nice0 task (never picks the cpu with the nice -15 task since
        weighted_cpuload > imbalance).
      - The load balancer fails to migrate the task since it is the running task
        and increments sd->nr_balance_failed.
      - It repeats the above steps a few more times until sd->nr_balance_failed > 5,
        at which point it kicks off the active load balancer, wakes up the migration
        thread and kicks the nice 0 task off the cpu.
      
      The load balancer doesn't stop until we kick out all nice 0 tasks from
      the sched group, leaving you with 3 idle cpus and one cpu running the
      nice -15 task.
      
      When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
      domain (in this case MC) has SD_PREFER_SIBLING set.  Subsequent load checks are
      not relevant because the niced task has a very large weight.
      
      In this patch, we add an extra condition to the "if(prefer_sibling)" check in
      update_sd_lb_stats(). We drop the capacity of a group only if the local group
      has extra capacity, ie. nr_running < group_capacity. This patch preserves the
      original intent of the prefer_siblings check (to spread tasks across the system
      in low utilization scenarios) and fixes the case above.
      
      It helps in the following ways:
      - In low utilization cases (where nr_tasks << nr_cpus), we still drop
        group_capacity down to 1 if we prefer siblings.
      - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
        likely be > sgs.group_capacity.
      - When balancing large weight tasks, if the local group does not have extra
        capacity, we do not pick the group with the niced task as the busiest group.
        This prevents failed balances, active migration and the under-utilization
        described above.
      Signed-off-by: NNikhil Rao <ncrao@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      75dd321d
    • N
      sched: Force balancing on newidle balance if local group has capacity · fab47622
      Nikhil Rao 提交于
      This patch forces a load balance on a newly idle cpu when the local group has
      extra capacity and the busiest group does not have any. It improves system
      utilization when balancing tasks with a large weight differential.
      
      Under certain situations, such as a niced down task (i.e. nice = -15) in the
      presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and
      kicks away other tasks because of its large weight. This leads to sub-optimal
      utilization of the machine. Even though the sched group has capacity, it does
      not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.
      
      With this patch, if the local group has extra capacity, we shortcut the checks
      in f_b_g() and try to pull a task over. A sched group has extra capacity if the
      group capacity is greater than the number of running tasks in that group.
      
      Thanks to Mike Galbraith for discussions leading to this patch and for the
      insight to reuse SD_NEWIDLE_BALANCE.
      Signed-off-by: NNikhil Rao <ncrao@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      fab47622
    • N
      sched: Set group_imb only a task can be pulled from the busiest cpu · 2582f0eb
      Nikhil Rao 提交于
      When cycling through sched groups to determine the busiest group, set
      group_imb only if the busiest cpu has more than 1 runnable task. This patch
      fixes the case where two cpus in a group have one runnable task each, but there
      is a large weight differential between these two tasks. The load balancer is
      unable to migrate any task from this group, and hence do not consider this
      group to be imbalanced.
      Signed-off-by: NNikhil Rao <ncrao@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286996978-7007-3-git-send-email-ncrao@google.com>
      [ small code readability edits ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2582f0eb