  1. 14 Aug, 2011 9 commits
  2. 22 Jul, 2011 6 commits
  3. 21 Jul, 2011 1 commit
  4. 01 Jul, 2011 1 commit
  5. 28 May, 2011 1 commit
  6. 20 May, 2011 1 commit
  7. 04 May, 2011 1 commit
  8. 19 Apr, 2011 2 commits
    • sched: Next buddy hint on sleep and preempt path · 2f36825b
      Venkatesh Pallipadi committed
      When a task in a taskgroup sleeps, pick_next_task starts all the way back at
      the root and picks the task/taskgroup with the min vruntime across all
      runnable tasks.
      
      But when there are many frequently sleeping tasks across different taskgroups,
      it makes better sense to stay with the same taskgroup for its slice period (or
      until all tasks in the taskgroup sleep) instead of switching across taskgroups
      on each sleep after a short runtime.

      This helps specifically where a taskgroup corresponds to a process with
      multiple threads. The change reduces the number of CR3 switches in this case.
      
      Example:
      
      Two taskgroups with 2 threads each which are running for 2ms and
      sleeping for 1ms. Looking at sched:sched_switch shows:
      
      BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
            cpu-soaker-5004  [003]  3683.391089
            cpu-soaker-5016  [003]  3683.393106
            cpu-soaker-5005  [003]  3683.395119
            cpu-soaker-5017  [003]  3683.397130
            cpu-soaker-5004  [003]  3683.399143
            cpu-soaker-5016  [003]  3683.401155
            cpu-soaker-5005  [003]  3683.403168
            cpu-soaker-5017  [003]  3683.405170
      
      AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
            cpu-soaker-21890 [003]   865.895494
            cpu-soaker-21935 [003]   865.897506
            cpu-soaker-21934 [003]   865.899520
            cpu-soaker-21935 [003]   865.901532
            cpu-soaker-21934 [003]   865.903543
            cpu-soaker-21935 [003]   865.905546
            cpu-soaker-21891 [003]   865.907548
            cpu-soaker-21890 [003]   865.909560
            cpu-soaker-21891 [003]   865.911571
            cpu-soaker-21890 [003]   865.913582
            cpu-soaker-21891 [003]   865.915594
            cpu-soaker-21934 [003]   865.917606
      
      A similar problem arises when there are multiple taskgroups and, say, a task A
      preempts the currently running task B of taskgroup_1. On schedule, pick_next_task
      can pick an unrelated task from taskgroup_2. Here it would be better to give some
      preference to task B in pick_next_task.

      A simple (maybe extreme) benchmark I tried was tbench with 2 tbench
      client processes with 2 threads each running on a single CPU. Average throughput
      across five 50-second runs was:
      
       BEFORE: 105.84 MB/sec
       AFTER:  112.42 MB/sec
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1302802253-25760-1-git-send-email-venki@google.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
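The buddy preference described in this commit can be illustrated with a minimal sketch. It is not the kernel's code: `struct entity`, `pick_with_buddy()`, and the `gran` parameter are hypothetical names, and the rule shown (prefer the hinted entity unless it is too far ahead of the smallest-vruntime entity) is a simplification of how a next-buddy hint might be weighed against fairness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a scheduling entity. */
struct entity {
    uint64_t vruntime;   /* virtual runtime, in ns */
};

/*
 * Choose between the entity with the smallest vruntime ("leftmost") and an
 * optional next-buddy hint. The buddy wins unless it is ahead of the
 * leftmost entity by more than `gran` ns of virtual runtime.
 */
const struct entity *pick_with_buddy(const struct entity *leftmost,
                                     const struct entity *next_buddy,
                                     uint64_t gran)
{
    assert(leftmost != NULL);

    if (next_buddy == NULL)
        return leftmost;

    /* A signed difference keeps the comparison valid across wraparound. */
    int64_t lead = (int64_t)(next_buddy->vruntime - leftmost->vruntime);

    return lead > (int64_t)gran ? leftmost : next_buddy;
}
```

With a granularity of 1000 ns, a buddy that is 500 ns ahead of the leftmost entity is still picked, while one that is 4000 ns ahead is not, so the hint cannot starve other taskgroups indefinitely.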
    • sched: Make set_*_buddy() work on non-task entities · 69c80f3e
      Venkatesh Pallipadi committed
      Make set_*_buddy() work on non-task sched_entity, to facilitate the
      use of next_buddy to cache a group entity in cases where one of the
      tasks within that entity sleeps or gets preempted.
      
      set_skip_buddy() incorrectly required the policy of the yielding task
      to be something other than SCHED_IDLE. Yielding should happen even
      when the yielding task is SCHED_IDLE. This change removes the policy
      check on the yielding task.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1302744070-30079-2-git-send-email-venki@google.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
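One way to picture a buddy hint that works for group entities as well as tasks is to record it at every level of the hierarchy. The sketch below assumes a simplified model: `struct entity`, `struct runqueue`, and this `set_next_buddy()` are illustrative stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

struct runqueue;

/* Hypothetical, simplified entity: either a task or a group. */
struct entity {
    struct entity *parent;   /* enclosing group entity, NULL at the top level */
    struct runqueue *rq;     /* runqueue this entity is queued on */
};

struct runqueue {
    struct entity *next;     /* next-buddy hint for this level */
};

/* Record the hint at every level, from the given entity up to the root. */
void set_next_buddy(struct entity *se)
{
    for (; se != NULL; se = se->parent) {
        assert(se->rq != NULL);
        se->rq->next = se;
    }
}
```

Because the walk starts from whatever entity is passed in, the same function can cache a group entity directly when one of its tasks sleeps or is preempted.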
  9. 14 Apr, 2011 3 commits
  10. 11 Apr, 2011 5 commits
    • sched: Avoid using sd->level · a6c75f2f
      Peter Zijlstra committed
      Don't use sd->level for identifying properties of the domain.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20110407122942.350174079@chello.nl
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Dynamically allocate sched_domain/sched_group data-structures · dce840a0
      Peter Zijlstra committed
      Instead of relying on static allocations for the sched_domain and
      sched_group trees, dynamically allocate and RCU free them.
      
      Allocating this dynamically also allows for some build_sched_groups()
      simplification since we can now (like with other simplifications) rely
      on the sched_domain tree instead of hard-coded knowledge.
      
      One tricky thing to note is that detach_destroy_domains() needs to hold
      rcu_read_lock() over the entire tear-down; per-cpu is not sufficient
      since that can lead to partial sched_group existence (this could possibly
      be solved by doing the tear-down backwards, but this is much more
      robust).

      A consequence of the above is that we can no longer print the
      sched_domain debug stuff from cpu_attach_domain() since that might now
      run with preemption disabled (due to classic RCU etc.) and
      sched_domain_debug() does some GFP_KERNEL allocations.

      Another thing to note is that we now fully rely on normal RCU and not
      RCU-sched. This is because, with the new and exciting RCU flavours we
      grew over the years, BH doesn't necessarily hold off RCU-sched grace
      periods (-rt is known to break this). This would in fact already cause
      us grief since we do sched_domain/sched_group iterations from softirq
      context.
      
      This patch is somewhat larger than I would like it to be, but I didn't
      find any means of shrinking/splitting this.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20110407122942.245307941@chello.nl
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Eliminate dead code from wakeup_gran() · f4ad9bd2
      Shaohua Li committed
      calc_delta_fair() checks NICE_0_LOAD already, delete duplicate check.
      
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Link: http://lkml.kernel.org/r/1302238389.3981.92.camel@sli10-conroe
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Fix erroneous all_pinned logic · b30aef17
      Ken Chen committed
      The scheduler load balancer has specific code to deal with cases of an
      unbalanced system due to lots of unmovable tasks (for example because of
      hard CPU affinity). In those situations, it excludes the busiest CPU that
      has pinned tasks from load balance consideration so that it can perform
      a second load balance pass on the rest of the system.

      This all works as designed if there is only one cgroup in the system.

      However, when we have multiple cgroups, this logic has false positives and
      triggers multiple load balance passes even though there are actually no pinned
      tasks at all.
      
      The reason it has false positives is that the all-pinned logic sits deep in
      the lowest-level function, can_migrate_task(), and is too low level:

      load_balance_fair() iterates over each task group and calls balance_tasks() to
      migrate the target load. Along the way, balance_tasks() will also set an
      all_pinned variable. Given that task groups are iterated, this all_pinned
      variable is essentially the status of the last group in the scanning process.
      A task group can have a number of reasons for no load being migrated, none of
      them due to CPU affinity. However, this status bit is propagated back up
      to the higher-level load_balance(), which incorrectly thinks that no tasks
      were moved. It kicks off the all-pinned logic and starts multiple passes
      attempting to move load onto the puller CPU.

      To fix this, move the all_pinned aggregation up to the iterator level.
      This ensures that the status is aggregated over all task groups, not just
      the last one in the list.
      Signed-off-by: Ken Chen <kenchen@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
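The difference between taking the last group's status and aggregating over all groups can be shown with a small model. This is a sketch under stated assumptions: `struct group_scan` and both functions are hypothetical, and a group counts as "all pinned" here whenever none of its examined tasks was free to move, which is also true of a group with nothing to examine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of scanning one task group during a balance pass. */
struct group_scan {
    int nr_tasks;    /* tasks examined in this group */
    int nr_pinned;   /* of those, tasks kept off the target CPU by affinity */
};

/* Behaviour described as buggy: the flag reflects only the last group. */
int all_pinned_last_group(const struct group_scan *g, size_t n)
{
    int all_pinned = 0;

    assert(g != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        all_pinned = (g[i].nr_pinned == g[i].nr_tasks);
    return all_pinned;
}

/* Aggregated status: cleared as soon as any group has a movable task. */
int all_pinned_aggregated(const struct group_scan *g, size_t n)
{
    int all_pinned = 1;

    assert(g != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (g[i].nr_pinned < g[i].nr_tasks)
            all_pinned = 0;
    return all_pinned;
}
```

If the first group has three movable tasks and the last group has nothing to examine, the last-group version reports "all pinned" while the aggregated version does not, which is the false positive the commit message describes.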
    • sched: Fix sched-domain avg_load calculation · b0432d8f
      Ken Chen committed
      In function find_busiest_group(), the sched-domain avg_load isn't
      calculated at all if there is a group imbalance within the domain. This
      will cause erroneous imbalance calculation.
      
      The reason is that calculate_imbalance() sees sds->avg_load = 0 and it
      will dump entire sds->max_load into imbalance variable, which is used
      later on to migrate entire load from busiest CPU to the puller CPU.
      
      This has two really bad effects:

      1. A stampede of task migrations, which cannot break out
         of the bad state because of a positive feedback loop: large load
         delta -> heavier load migration -> larger imbalance, and the cycle
         goes on.

      2. Severe imbalance in CPU queue depth. This causes really long
         scheduling latency blips, which badly affect applications that
         have tight latency requirements.
      
      The fix is to have kernel calculate domain avg_load in both cases. This
      will ensure that imbalance calculation is always sensible and the target
      is usually half way between busiest and puller CPU.
      Signed-off-by: Ken Chen <kenchen@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
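The effect of a zero avg_load on the imbalance can be reproduced with a simplified calculation. The sketch below assumes all CPUs have equal power and ignores capacity limits; `LOAD_SCALE`, `domain_avg_load()`, and `imbalance()` are illustrative names, not the kernel's functions.

```c
#include <assert.h>

#define LOAD_SCALE 1024UL   /* stand-in for the kernel's load scale constant */

/* Domain-wide average load, scaled by the total CPU power of the domain. */
unsigned long domain_avg_load(unsigned long total_load,
                              unsigned long total_power)
{
    assert(total_power > 0);
    return LOAD_SCALE * total_load / total_power;
}

/*
 * Simplified imbalance: move no more than the busiest CPU has above the
 * average, and no more than the puller has below it.
 */
unsigned long imbalance(unsigned long max_load, unsigned long this_load,
                        unsigned long avg_load)
{
    assert(avg_load <= max_load);

    unsigned long pull = max_load - avg_load;
    /* Unsigned: wraps to a huge value when avg_load < this_load. */
    unsigned long room = avg_load - this_load;

    return pull < room ? pull : room;
}
```

For a busiest load of 3072 and a puller load of 1024 on two CPUs of power 1024 each, the average is 2048 and the imbalance is 1024, half way between the two. With avg_load left at 0, the same call returns 3072, the entire max_load.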
  11. 05 Apr, 2011 1 commit
  12. 31 Mar, 2011 2 commits
  13. 04 Mar, 2011 2 commits
  14. 23 Feb, 2011 3 commits
    • sched: Fix the group_imb logic · 866ab43e
      Peter Zijlstra committed
      On a 2*6*2 machine something like:
      
       taskset -c 3-11 bash -c 'for ((i=0;i<9;i++)) do while :; do :; done & done'
      
      _should_ result in 9 busy CPUs, each running 1 task.
      
      However it didn't quite work reliably, most of the time one cpu of the
      second socket (6-11) would be idle and one cpu of the first socket
      (0-5) would have two tasks on it.
      
      The group_imb logic is supposed to deal with this and detect when a
      particular group is imbalanced (like in our case, 0-2 are idle but 3-5
      will have 4 tasks on them).
      
      The detection phase needed a bit of a tweak as it was too weak and
      required more than 2 avg weight tasks difference between idle and busy
      cpus in the group which won't trigger for our test-case. So cure that
      to be one or more avg task weight difference between cpus.
      
      Once the detection phase worked, it was then defeated by the f_b_g()
      tests trying to avoid ping-pongs. In particular, this_load >= max_load
      triggered because the pulling cpu (the (first) idle cpu on the
      second socket, say 6) would find this_load to be 5 and max_load to be
      4 (there'd be 5 tasks running on our socket and only 4 on the other
      socket).
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
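The change in detection threshold can be checked against the test case from this commit message. The sketch below is an illustration only: `struct group_stats`, `summarize()`, and the two predicates are hypothetical, and the guard that treats a group with no runnable tasks as balanced is an assumption added to keep the example well defined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of one scheduling group. */
struct group_stats {
    unsigned long max_cpu_load;
    unsigned long min_cpu_load;
    unsigned long avg_task_load;   /* 0 when the group has no runnable tasks */
};

/* Collect the per-group figures from per-CPU load and task counts. */
struct group_stats summarize(const unsigned long *load,
                             const unsigned int *nr_running, size_t nr_cpus)
{
    struct group_stats s = { 0, 0, 0 };
    unsigned long sum_load = 0;
    unsigned int sum_nr = 0;

    assert(nr_cpus > 0);
    s.max_cpu_load = s.min_cpu_load = load[0];
    for (size_t i = 0; i < nr_cpus; i++) {
        if (load[i] > s.max_cpu_load)
            s.max_cpu_load = load[i];
        if (load[i] < s.min_cpu_load)
            s.min_cpu_load = load[i];
        sum_load += load[i];
        sum_nr += nr_running[i];
    }
    if (sum_nr > 0)
        s.avg_task_load = sum_load / sum_nr;
    return s;
}

/* Older threshold: more than two average task weights of spread. */
int group_imb_old(const struct group_stats *s)
{
    return s->avg_task_load > 0 &&
           (s->max_cpu_load - s->min_cpu_load) > 2 * s->avg_task_load;
}

/* Newer threshold: one or more average task weights of spread. */
int group_imb_new(const struct group_stats *s)
{
    return s->avg_task_load > 0 &&
           (s->max_cpu_load - s->min_cpu_load) >= s->avg_task_load;
}
```

For CPUs 0-5 with 0-2 idle and four equal-weight tasks spread over 3-5, the spread is exactly two average task weights, so the older test does not fire and the newer one does.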
    • sched: Clean up some f_b_g() comments · cc57aa8f
      Peter Zijlstra committed
      The existing comment tends to grow state (as it already has), split it
      up and place it near the actual tests.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Clean up remnants of sd_idle · c186fafe
      Peter Zijlstra committed
      With the wholesale removal of the sd_idle SMT logic we can clean up
      some more.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  15. 16 Feb, 2011 1 commit
  16. 03 Feb, 2011 1 commit