1. 26 1月, 2011 2 次提交
  2. 24 1月, 2011 1 次提交
  3. 18 1月, 2011 2 次提交
  4. 19 12月, 2010 2 次提交
    • P
      sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight · 19e5eebb
      Paul Turner 提交于
      Mike Galbraith reported poor interactivity[*] when the new shares distribution
      code was combined with autogroups.
      
      The root cause turns out to be a mis-ordering of accounting accrued execution
      time and shares updates.  Since update_curr() is issued hierarchically,
      updating the parent entity weights to reflect child enqueue/dequeue results in
      the parent's unaccounted execution time then being accrued (vs vruntime) at the
      new weight as opposed to the weight present at accumulation.
      
      While this doesn't have much effect on processes with timeslices that cross a
      tick, it is particularly problematic for an interactive process (e.g. Xorg)
      which incurs many (tiny) timeslices.  In this scenario almost all updates are
      at dequeue which can result in significant fairness perturbation (especially if
      it is the only thread, resulting in potential {tg->shares, MIN_SHARES}
      transitions).
      
      Correct this by ensuring unaccounted time is accumulated prior to manipulating
      an entity's weight.
      
      [*] http://xkcd.com/619/ is perversely Nostradamian here.
      Signed-off-by: NPaul Turner <pjt@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20101216031038.159704378@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      19e5eebb
    • P
      sched: Move periodic share updates to entity_tick() · 43365bd7
      Paul Turner 提交于
      Long running entities that do not block (dequeue) require periodic updates to
      maintain accurate share values.  (Note: group entities with several threads are
      quite likely to be non-blocking in many circumstances).
      
      By virtue of being long-running however, we will see entity ticks (otherwise
      the required update occurs in dequeue/put and we are done).  Thus we can move
      the detection (and associated work) for these updates into the periodic path.
      
      This restores the 'atomicity' of update_curr() with respect to accounting.
      Signed-off-by: NPaul Turner <pjt@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101216031038.067028969@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      43365bd7
  5. 23 11月, 2010 1 次提交
  6. 18 11月, 2010 12 次提交
  7. 11 11月, 2010 2 次提交
    • P
      sched: Fix cross-sched-class wakeup preemption · 1e5a7405
      Peter Zijlstra 提交于
      Instead of dealing with sched classes inside each check_preempt_curr()
      implementation, pull out this logic into the generic wakeup preemption
      path.
      
      This fixes a hang in KVM (and others) where we are waiting for the
      stop machine thread to run ...
      Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de>
      Tested-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Tested-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1288891946.2039.31.camel@laptop>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      1e5a7405
    • S
      sched: Use group weight, idle cpu metrics to fix imbalances during idle · aae6d3dd
      Suresh Siddha 提交于
      Currently we consider a sched domain to be well balanced when the imbalance
      is less than the domain's imablance_pct. As the number of cores and threads
      are increasing, current values of imbalance_pct (for example 25% for a
      NUMA domain) are not enough to detect imbalances like:
      
      a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads),
      24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another
      socket. Leading to an idle HT cpu.
      
      b) On a hypothetial 2 socket NHM-EX system (each socket having 8 cores and
      16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
      socket and 7 on another socket. Leaving one core in a socket idle
      whereas in another socket we have a core having both its HT siblings busy.
      
      While this issue can be fixed by decreasing the domain's imbalance_pct
      (by making it a function of number of logical cpus in the domain), it
      can potentially cause more task migrations across sched groups in an
      overloaded case.
      
      Fix this by using imbalance_pct only during newly_idle and busy
      load balancing. And during idle load balancing, check if there
      is an imbalance in number of idle cpu's across the busiest and this
      sched_group or if the busiest group has more tasks than its weight that
      the idle cpu in this_group can pull.
      Reported-by: NNikhil Rao <ncrao@google.com>
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      aae6d3dd
  8. 22 10月, 2010 1 次提交
  9. 19 10月, 2010 5 次提交
    • V
      sched: Remove irq time from available CPU power · aa483808
      Venkatesh Pallipadi 提交于
      The idea was suggested by Peter Zijlstra here:
      
        http://marc.info/?l=linux-kernel&m=127476934517534&w=2
      
      irq time is technically not available to the tasks running on the CPU.
      This patch removes irq time from CPU power piggybacking on
      sched_rt_avg_update().
      
      Tested this by keeping CPU X busy with a network intensive task having 75%
      oa a single CPU irq processing (hard+soft) on a 4-way system. And start seven
      cycle soakers on the system. Without this change, there will be two tasks on
      each CPU. With this change, there is a single task on irq busy CPU X and
      remaining 7 tasks are spread around among other 3 CPUs.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      aa483808
    • V
      sched: Do not account irq time to current task · 305e6835
      Venkatesh Pallipadi 提交于
      Scheduler accounts both softirq and interrupt processing times to the
      currently running task. This means, if the interrupt processing was
      for some other task in the system, then the current task ends up being
      penalized as it gets shorter runtime than otherwise.
      
      Change sched task accounting to acoount only actual task time from
      currently running task. Now update_curr(), modifies the delta_exec to
      depend on rq->clock_task.
      
      Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can
      extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats
      for later.
      
      This change will impact scheduling behavior in interrupt heavy conditions.
      
      Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
      task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
      spending 75%+ of its time in irq processing. CPU 3 spending around 35%
      time running nc task.
      
      Now, if I run another CPU intensive task on CPU 2, without this change
      /proc/<pid>/schedstat shows 100% of time accounted to this task. With this
      change, it rightly shows less than 25% accounted to this task as remaining
      time is actually spent on irq processing.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      305e6835
    • N
      sched: Drop group_capacity to 1 only if local group has extra capacity · 75dd321d
      Nikhil Rao 提交于
      When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
      only if the local group has extra capacity. The extra check prevents the case
      where you always pull from the heaviest group when it is already under-utilized
      (possible with a large weight task outweighs the tasks on the system).
      
      For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
      scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task,
      and each task is running on one core. In this case, we observe the following
      events when balancing at the NUMA domain:
      
      - find_busiest_group() will always pick the sched group containing the niced
        task to be the busiest group.
      - find_busiest_queue() will then always pick one of the cpus running the
        nice0 task (never picks the cpu with the nice -15 task since
        weighted_cpuload > imbalance).
      - The load balancer fails to migrate the task since it is the running task
        and increments sd->nr_balance_failed.
      - It repeats the above steps a few more times until sd->nr_balance_failed > 5,
        at which point it kicks off the active load balancer, wakes up the migration
        thread and kicks the nice 0 task off the cpu.
      
      The load balancer doesn't stop until we kick out all nice 0 tasks from
      the sched group, leaving you with 3 idle cpus and one cpu running the
      nice -15 task.
      
      When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
      domain (in this case MC) has SD_PREFER_SIBLING set.  Subsequent load checks are
      not relevant because the niced task has a very large weight.
      
      In this patch, we add an extra condition to the "if(prefer_sibling)" check in
      update_sd_lb_stats(). We drop the capacity of a group only if the local group
      has extra capacity, ie. nr_running < group_capacity. This patch preserves the
      original intent of the prefer_siblings check (to spread tasks across the system
      in low utilization scenarios) and fixes the case above.
      
      It helps in the following ways:
      - In low utilization cases (where nr_tasks << nr_cpus), we still drop
        group_capacity down to 1 if we prefer siblings.
      - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
        likely be > sgs.group_capacity.
      - When balancing large weight tasks, if the local group does not have extra
        capacity, we do not pick the group with the niced task as the busiest group.
        This prevents failed balances, active migration and the under-utilization
        described above.
      Signed-off-by: NNikhil Rao <ncrao@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      75dd321d
    • N
      sched: Force balancing on newidle balance if local group has capacity · fab47622
      Nikhil Rao 提交于
      This patch forces a load balance on a newly idle cpu when the local group has
      extra capacity and the busiest group does not have any. It improves system
      utilization when balancing tasks with a large weight differential.
      
      Under certain situations, such as a niced down task (i.e. nice = -15) in the
      presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and
      kicks away other tasks because of its large weight. This leads to sub-optimal
      utilization of the machine. Even though the sched group has capacity, it does
      not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.
      
      With this patch, if the local group has extra capacity, we shortcut the checks
      in f_b_g() and try to pull a task over. A sched group has extra capacity if the
      group capacity is greater than the number of running tasks in that group.
      
      Thanks to Mike Galbraith for discussions leading to this patch and for the
      insight to reuse SD_NEWIDLE_BALANCE.
      Signed-off-by: NNikhil Rao <ncrao@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      fab47622
    • N
      sched: Set group_imb only a task can be pulled from the busiest cpu · 2582f0eb
      Nikhil Rao 提交于
      When cycling through sched groups to determine the busiest group, set
      group_imb only if the busiest cpu has more than 1 runnable task. This patch
      fixes the case where two cpus in a group have one runnable task each, but there
      is a large weight differential between these two tasks. The load balancer is
      unable to migrate any task from this group, and hence do not consider this
      group to be imbalanced.
      Signed-off-by: NNikhil Rao <ncrao@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286996978-7007-3-git-send-email-ncrao@google.com>
      [ small code readability edits ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2582f0eb
  10. 14 10月, 2010 1 次提交
  11. 08 10月, 2010 1 次提交
    • P
      sched: suppress RCU lockdep splat in task_fork_fair · b0a0f667
      Paul E. McKenney 提交于
      > ===================================================
      > [ INFO: suspicious rcu_dereference_check() usage. ]
      > ---------------------------------------------------
      > /home/greearb/git/linux.wireless-testing/kernel/sched.c:618 invoked rcu_dereference_check() without protection!
      >
      > other info that might help us debug this:
      >
      > rcu_scheduler_active = 1, debug_locks = 1
      > 1 lock held by ifup/23517:
      >   #0:  (&rq->lock){-.-.-.}, at: [<c042f782>] task_fork_fair+0x3b/0x108
      >
      > stack backtrace:
      > Pid: 23517, comm: ifup Not tainted 2.6.36-rc6-wl+ #5
      > Call Trace:
      >   [<c075e219>] ? printk+0xf/0x16
      >   [<c0455842>] lockdep_rcu_dereference+0x74/0x7d
      >   [<c0426854>] task_group+0x6d/0x79
      >   [<c042686e>] set_task_rq+0xe/0x57
      >   [<c042f79e>] task_fork_fair+0x57/0x108
      >   [<c042e965>] sched_fork+0x82/0xf9
      >   [<c04334b3>] copy_process+0x569/0xe8e
      >   [<c0433ef0>] do_fork+0x118/0x262
      >   [<c076302f>] ? do_page_fault+0x16a/0x2cf
      >   [<c044b80c>] ? up_read+0x16/0x2a
      >   [<c04085ae>] sys_clone+0x1b/0x20
      >   [<c04030a5>] ptregs_clone+0x15/0x30
      >   [<c0402f1c>] ? sysenter_do_call+0x12/0x38
      
      Here a newly created task is having its runqueue assigned.  The new task
      is not yet on the tasklist, so cannot go away.  This is therefore a false
      positive, suppress with an RCU read-side critical section.
      
      Reported-by: Ben Greear <greearb@candelatech.com
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Ben Greear <greearb@candelatech.com
      b0a0f667
  12. 21 9月, 2010 2 次提交
    • V
      sched: Increment cache_nice_tries only on periodic lb · 58b26c4c
      Venkatesh Pallipadi 提交于
      scheduler uses cache_nice_tries as an indicator to do cache_hot and
      active load balance, when normal load balance fails. Currently,
      this value is changed on any failed load balance attempt. That ends
      up being not so nice to workloads that enter/exit idle often, as
      they do more frequent new_idle balance and that pretty soon results
      in cache hot tasks being pulled in.
      
      Making the cache_nice_tries ignore failed new_idle balance seems to
      make better sense. With that only the failed load balance in
      periodic load balance gets accounted and the rate of accumulation
      of cache_nice_tries will not depend on idle entry/exit (short
      running sleep-wakeup kind of tasks). This reduces movement of
      cache_hot tasks.
      
      schedstat diff (after-before) excerpt from a workload that has
      frequent and short wakeup-idle pattern (:2 in cpu col below refers
      to NEWIDLE idx) This snapshot was across ~400 seconds.
      
      Without this change:
      domainstats:  domain0
       cpu     cnt      bln      fld      imb     gain    hgain  nobusyq  nobusyg
       0:2  306487   219575    73167  110069413    44583    19070     1172   218403
       1:2  292139   194853    81421  120893383    50745    21902     1259   193594
       2:2  283166   174607    91359  129699642    54931    23688     1287   173320
       3:2  273998   161788    93991  132757146    57122    24351     1366   160422
       4:2  289851   215692    62190  83398383    36377    13680      851   214841
       5:2  316312   222146    77605  117582154    49948    20281      988   221158
       6:2  297172   195596    83623  122133390    52801    21301      929   194667
       7:2  283391   178078    86378  126622761    55122    22239      928   177150
       8:2  297655   210359    72995  110246694    45798    19777     1125   209234
       9:2  297357   202011    79363  119753474    50953    22088     1089   200922
      10:2  278797   178703    83180  122514385    52969    22726     1128   177575
      11:2  272661   167669    86978  127342327    55857    24342     1195   166474
      12:2  293039   204031    73211  110282059    47285    19651      948   203083
      13:2  289502   196762    76803  114712942    49339    20547     1016   195746
      14:2  264446   169609    78292  115715605    50459    21017      982   168627
      15:2  260968   163660    80142  116811793    51483    21281     1064   162596
      
      With this change:
      domainstats:  domain0
       cpu     cnt      bln      fld      imb     gain    hgain  nobusyq  nobusyg
       0:2  272347   187380    77455  105420270    24975        1      953   186427
       1:2  267276   172360    86234  116242264    28087        6     1028   171332
       2:2  259769   156777    93281  123243134    30555        1     1043   155734
       3:2  250870   143129    97627  127370868    32026        6     1188   141941
       4:2  248422   177116    64096  78261112    22202        2      757   176359
       5:2  275595   180683    84950  116075022    29400        6      778   179905
       6:2  262418   162609    88944  119256898    31056        4      817   161792
       7:2  252204   147946    92646  122388300    32879        4      824   147122
       8:2  262335   172239    81631  110477214    26599        4      864   171375
       9:2  261563   164775    88016  117203621    28331        3      849   163926
      10:2  243389   140949    93379  121353071    29585        2      909   140040
      11:2  242795   134651    98310  124768957    30895        2     1016   133635
      12:2  255234   166622    79843  104696912    26483        4      746   165876
      13:2  244944   151595    83855  109808099    27787        3      801   150794
      14:2  241301   140982    89935  116954383    30403        6      845   140137
      15:2  232271   128564    92821  119185207    31207        4     1416   127148
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284167957-3675-1-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      58b26c4c
    • S
      sched: Fix nohz balance kick · f6c3f168
      Suresh Siddha 提交于
      There's a situation where the nohz balancer will try to wake itself:
      
      cpu-x is idle which is also ilb_cpu
      got a scheduler tick during idle
      and the nohz_kick_needed() in trigger_load_balance() checks for
      rq_x->nr_running which might not be zero (because of someone waking a
      task on this rq etc) and this leads to the situation of the cpu-x
      sending a kick to itself.
      
      And this can cause a lockup.
      
      Avoid this by not marking ourself eligible for kicking.
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284400941.2684.19.camel@sbsiddha-MOBL3.sc.intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f6c3f168
  13. 14 9月, 2010 1 次提交
  14. 10 9月, 2010 1 次提交
  15. 05 9月, 2010 1 次提交
  16. 20 8月, 2010 1 次提交
  17. 17 7月, 2010 2 次提交
  18. 29 6月, 2010 1 次提交
  19. 23 6月, 2010 1 次提交
    • D
      rcu: apply RCU protection to wake_affine() · f3b577de
      Daniel J Blueman 提交于
      The task_group() function returns a pointer that must be protected
      by either RCU, the ->alloc_lock, or the cgroup lock (see the
      rcu_dereference_check() in task_subsys_state(), which is invoked by
      task_group()).  The wake_affine() function currently does none of these,
      which means that a concurrent update would be within its rights to free
      the structure returned by task_group().  Because wake_affine() uses this
      structure only to compute load-balancing heuristics, there is no reason
      to acquire either of the two locks.
      
      Therefore, this commit introduces an RCU read-side critical section that
      starts before the first call to task_group() and ends after the last use
      of the "tg" pointer returned from task_group().  Thanks to Li Zefan for
      pointing out the need to extend the RCU read-side critical section from
      that proposed by the original patch.
      Signed-off-by: NDaniel J Blueman <daniel.blueman@gmail.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      f3b577de