1. 11 11月, 2010 1 次提交
    • S
      sched: Use group weight, idle cpu metrics to fix imbalances during idle · aae6d3dd
      Suresh Siddha 提交于
      Currently we consider a sched domain to be well balanced when the imbalance
      is less than the domain's imablance_pct. As the number of cores and threads
      are increasing, current values of imbalance_pct (for example 25% for a
      NUMA domain) are not enough to detect imbalances like:
      
      a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads),
      24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another
      socket. Leading to an idle HT cpu.
      
      b) On a hypothetial 2 socket NHM-EX system (each socket having 8 cores and
      16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
      socket and 7 on another socket. Leaving one core in a socket idle
      whereas in another socket we have a core having both its HT siblings busy.
      
      While this issue can be fixed by decreasing the domain's imbalance_pct
      (by making it a function of number of logical cpus in the domain), it
      can potentially cause more task migrations across sched groups in an
      overloaded case.
      
      Fix this by using imbalance_pct only during newly_idle and busy
      load balancing. And during idle load balancing, check if there
      is an imbalance in number of idle cpu's across the busiest and this
      sched_group or if the busiest group has more tasks than its weight that
      the idle cpu in this_group can pull.
      Reported-by: NNikhil Rao <ncrao@google.com>
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      aae6d3dd
  2. 22 10月, 2010 1 次提交
  3. 21 10月, 2010 1 次提交
  4. 19 10月, 2010 10 次提交
    • I
      sched: Export account_system_vtime() · b7dadc38
      Ingo Molnar 提交于
      KVM uses it for example:
      
       ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined!
      
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b7dadc38
    • V
      sched: Call tick_check_idle before __irq_enter · d267f87f
      Venkatesh Pallipadi 提交于
      When CPU is idle and on first interrupt, irq_enter calls tick_check_idle()
      to notify interruption from idle. But, there is a problem if this call
      is done after __irq_enter, as all routines in __irq_enter may find
      stale time due to yet to be done tick_check_idle.
      
      Specifically, trace calls in __irq_enter when they use global clock and also
      account_system_vtime change in this patch as it wants to use sched_clock_cpu()
      to do proper irq timing.
      
      But, tick_check_idle was moved after __irq_enter intentionally to
      prevent problem of unneeded ksoftirqd wakeups by the commit ee5f80a9:
      
          irq: call __irq_enter() before calling the tick_idle_check
          Impact: avoid spurious ksoftirqd wakeups
      
      Moving tick_check_idle() before __irq_enter and wrapping it with
      local_bh_enable/disable would solve both the problems.
      Fixed-by: NYong Zhang <yong.zhang0@gmail.com>
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d267f87f
    • V
      sched: Remove irq time from available CPU power · aa483808
      Venkatesh Pallipadi 提交于
      The idea was suggested by Peter Zijlstra here:
      
        http://marc.info/?l=linux-kernel&m=127476934517534&w=2
      
      irq time is technically not available to the tasks running on the CPU.
      This patch removes irq time from CPU power piggybacking on
      sched_rt_avg_update().
      
      Tested this by keeping CPU X busy with a network intensive task having 75%
      oa a single CPU irq processing (hard+soft) on a 4-way system. And start seven
      cycle soakers on the system. Without this change, there will be two tasks on
      each CPU. With this change, there is a single task on irq busy CPU X and
      remaining 7 tasks are spread around among other 3 CPUs.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      aa483808
    • V
      sched: Do not account irq time to current task · 305e6835
      Venkatesh Pallipadi 提交于
      Scheduler accounts both softirq and interrupt processing times to the
      currently running task. This means, if the interrupt processing was
      for some other task in the system, then the current task ends up being
      penalized as it gets shorter runtime than otherwise.
      
      Change sched task accounting to acoount only actual task time from
      currently running task. Now update_curr(), modifies the delta_exec to
      depend on rq->clock_task.
      
      Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can
      extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats
      for later.
      
      This change will impact scheduling behavior in interrupt heavy conditions.
      
      Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
      task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
      spending 75%+ of its time in irq processing. CPU 3 spending around 35%
      time running nc task.
      
      Now, if I run another CPU intensive task on CPU 2, without this change
      /proc/<pid>/schedstat shows 100% of time accounted to this task. With this
      change, it rightly shows less than 25% accounted to this task as remaining
      time is actually spent on irq processing.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      305e6835
    • V
      sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time · b52bfee4
      Venkatesh Pallipadi 提交于
      s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does
      the fine granularity accounting of user, system, hardirq, softirq times.
      Adding that option on archs like x86 will be challenging however, given the
      state of TSC reliability on various platforms and also the overhead it will
      add in syscall entry exit.
      
      Instead, add a lighter variant that only does finer accounting of
      hardirq and softirq times, providing precise irq times (instead of timer tick
      based samples). This accounting is added with a new config option
      CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
      interested in paying the perf penalty.
      
      This accounting is based on sched_clock, with the code being generic.
      So, other archs may find it useful as well.
      
      This patch just adds the core logic and does not enable this logic yet.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b52bfee4
    • V
      sched: Fix softirq time accounting · 75e1056f
      Venkatesh Pallipadi 提交于
      Peter Zijlstra found a bug in the way softirq time is accounted in
      VIRT_CPU_ACCOUNTING on this thread:
      
         http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
      
      The problem is, softirq processing uses local_bh_disable internally. There
      is no way, later in the flow, to differentiate between whether softirq is
      being processed or is it just that bh has been disabled. So, a hardirq when bh
      is disabled results in time being wrongly accounted as softirq.
      
      Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
      as well. As account_system_time() in normal tick based accouting also uses
      softirq_count, which will be set even when not in softirq with bh disabled.
      
      Peter also suggested solution of using 2*SOFTIRQ_OFFSET as irq count
      for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq
      processing. The patch below does that and adds API in_serving_softirq() which
      returns whether we are currently processing softirq or not.
      
      Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
      to in_serving_softirq.
      
      Looks like many usages of in_softirq really want in_serving_softirq. Those
      changes can be made individually on a case by case basis.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      75e1056f
    • N
      sched: Do not consider SCHED_IDLE tasks to be cache hot · ef8002f6
      Nikhil Rao 提交于
      This patch adds a check in task_hot to return if the task has SCHED_IDLE
      policy. SCHED_IDLE tasks have very low weight, and when run with regular
      workloads, are typically scheduled many milliseconds apart. There is no
      need to consider these tasks hot for load balancing.
      Signed-off-by: NNikhil Rao <ncrao@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ef8002f6
    • L
      sched: Drop all load weight manipulation for RT tasks · 17bdcf94
      Linus Walleij 提交于
      Load weights are for the CFS, they do not belong in the RT task. This makes all
      RT scheduling classes leave the CFS weights alone.
      
      This fixes a real bug as well: I noticed the following phonomena: a process
      elevated to SCHED_RR forks with SCHED_RESET_ON_FORK set, and the child is
      indeed SCHED_OTHER, and the niceval is indeed reset to 0. However the weight
      inserted by set_load_weight() remains at 0, giving the task insignificat
      priority.
      
      With this fix, the weight is reset to what the task had before being elevated
      to SCHED_RR/SCHED_FIFO.
      
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Walleij <linus.walleij@stericsson.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286807811-10568-1-git-send-email-linus.walleij@stericsson.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      17bdcf94
    • P
      sched: Create special class for stop/migrate work · 34f971f6
      Peter Zijlstra 提交于
      In order to separate the stop/migrate work thread from the SCHED_FIFO
      implementation, create a special class for it that is of higher priority than
      SCHED_FIFO itself.
      
      This currently solves a problem where cpu-hotplug consumes so much cpu-time
      that the SCHED_FIFO class gets throttled, but has the bandwidth replenishment
      timer pending on the now dead cpu.
      
      It is also required for when we add the planned deadline scheduling class above
      SCHED_FIFO, as the stop/migrate thread still needs to transcent those tasks.
      Tested-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1285165776.2275.1022.camel@laptop>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      34f971f6
    • P
      sched: Unindent labels · 49246274
      Peter Zijlstra 提交于
      Labels should be on column 0.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      49246274
  5. 08 10月, 2010 1 次提交
    • P
      sched: fix RCU lockdep splat from task_group() · 6506cf6c
      Peter Zijlstra 提交于
      This addresses the following RCU lockdep splat:
      
      [0.051203] CPU0: AMD QEMU Virtual CPU version 0.12.4 stepping 03
      [0.052999] lockdep: fixing up alternatives.
      [0.054105]
      [0.054106] ===================================================
      [0.054999] [ INFO: suspicious rcu_dereference_check() usage. ]
      [0.054999] ---------------------------------------------------
      [0.054999] kernel/sched.c:616 invoked rcu_dereference_check() without protection!
      [0.054999]
      [0.054999] other info that might help us debug this:
      [0.054999]
      [0.054999]
      [0.054999] rcu_scheduler_active = 1, debug_locks = 1
      [0.054999] 3 locks held by swapper/1:
      [0.054999]  #0:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff814be933>] cpu_up+0x42/0x6a
      [0.054999]  #1:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff810400d8>] cpu_hotplug_begin+0x2a/0x51
      [0.054999]  #2:  (&rq->lock){-.-...}, at: [<ffffffff814be2f7>] init_idle+0x2f/0x113
      [0.054999]
      [0.054999] stack backtrace:
      [0.054999] Pid: 1, comm: swapper Not tainted 2.6.35 #1
      [0.054999] Call Trace:
      [0.054999]  [<ffffffff81068054>] lockdep_rcu_dereference+0x9b/0xa3
      [0.054999]  [<ffffffff810325c3>] task_group+0x7b/0x8a
      [0.054999]  [<ffffffff810325e5>] set_task_rq+0x13/0x40
      [0.054999]  [<ffffffff814be39a>] init_idle+0xd2/0x113
      [0.054999]  [<ffffffff814be78a>] fork_idle+0xb8/0xc7
      [0.054999]  [<ffffffff81068717>] ? mark_held_locks+0x4d/0x6b
      [0.054999]  [<ffffffff814bcebd>] do_fork_idle+0x17/0x2b
      [0.054999]  [<ffffffff814bc89b>] native_cpu_up+0x1c1/0x724
      [0.054999]  [<ffffffff814bcea6>] ? do_fork_idle+0x0/0x2b
      [0.054999]  [<ffffffff814be876>] _cpu_up+0xac/0x127
      [0.054999]  [<ffffffff814be946>] cpu_up+0x55/0x6a
      [0.054999]  [<ffffffff81ab562a>] kernel_init+0xe1/0x1ff
      [0.054999]  [<ffffffff81003854>] kernel_thread_helper+0x4/0x10
      [0.054999]  [<ffffffff814c353c>] ? restore_args+0x0/0x30
      [0.054999]  [<ffffffff81ab5549>] ? kernel_init+0x0/0x1ff
      [0.054999]  [<ffffffff81003850>] ? kernel_thread_helper+0x0/0x10
      [0.056074] Booting Node   0, Processors  #1lockdep: fixing up alternatives.
      [0.130045]  #2lockdep: fixing up alternatives.
      [0.203089]  #3 Ok.
      [0.275286] Brought up 4 CPUs
      [0.276005] Total of 4 processors activated (16017.17 BogoMIPS).
      
      The cgroup_subsys_state structures referenced by idle tasks are never
      freed, because the idle tasks should be part of the root cgroup,
      which is not removable.
      
      The problem is that while we do in-fact hold rq->lock, the newly spawned
      idle thread's cpu is not yet set to the correct cpu so the lockdep check
      in task_group():
      
        lockdep_is_held(&task_rq(p)->lock)
      
      will fail.
      
      But this is a chicken and egg problem.  Setting the CPU's runqueue requires
      that the CPU's runqueue already be set.  ;-)
      
      So insert an RCU read-side critical section to avoid the complaint.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      6506cf6c
  6. 21 9月, 2010 1 次提交
  7. 17 9月, 2010 1 次提交
    • P
      perf: Undo the per cpu-context timer stuff · e9d2b064
      Peter Zijlstra 提交于
      Revert the timer per cpu-context timers because of unfortunate
      nohz interaction. Fixing that would have been somewhat ugly, so
      go back to driving things from the regular tick. Provide a
      jiffies interval feature for people who want slower rotations.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <20100917093009.519845633@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e9d2b064
  8. 16 9月, 2010 1 次提交
    • H
      sched: Remove branch hints within context_switch() · 31915ab4
      Heiko Carstens 提交于
      With 710390d9 "sched: Optimize branch hint in context_switch()"
      the branch hint logic within context_switch() got inversed.
      
      In fact the hints "if (likely(!mm))" and "if (likely(!prev->mm))"
      mean that it is likely that the previous and next task are kernel
      threads.
      
      That assumption is certainly counter intuitive, but Tim has shown
      that at least with his workload this is true. Nevertheless the
      truth is: it depends on the current workload. So just remove the
      annotations which also improves readability.
      Reported-by: NTim Blechmann <tim@klingt.org>
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20100916124225.GA2209@osiris.boeblingen.de.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      31915ab4
  9. 15 9月, 2010 1 次提交
  10. 14 9月, 2010 1 次提交
  11. 10 9月, 2010 4 次提交
  12. 08 9月, 2010 1 次提交
  13. 23 8月, 2010 1 次提交
    • T
      mutex: Improve the scalability of optimistic spinning · 9d0f4dcc
      Tim Chen 提交于
      There is a scalability issue for current implementation of optimistic
      mutex spin in the kernel.  It is found on a 8 node 64 core Nehalem-EX
      system (HT mode).
      
      The intention of the optimistic mutex spin is to busy wait and spin on a
      mutex if the owner of the mutex is running, in the hope that the mutex
      will be released soon and be acquired, without the thread trying to
      acquire mutex going to sleep. However, when we have a large number of
      threads, contending for the mutex, we could have the mutex grabbed by
      other thread, and then another ……, and we will keep spinning, wasting cpu
      cycles and adding to the contention.  One possible fix is to quit
      spinning and put the current thread on wait-list if mutex lock switch to
      a new owner while we spin, indicating heavy contention (see the patch
      included).
      
      I did some testing on a 8 socket Nehalem-EX system with a total of 64
      cores. Using Ingo's test-mutex program that creates/delete files with 256
      threads (http://lkml.org/lkml/2006/1/8/50) , I see the following speed up
      after putting in the mutex spin fix:
      
       ./mutex-test V 256 10
                       Ops/sec
       2.6.34          62864
       With fix        197200
      
      Repeating the test with Aim7 fserver workload, again there is a speed up
      with the fix:
      
                       Jobs/min
       2.6.34          91657
       With fix        149325
      
      To look at the impact on the distribution of mutex acquisition time, I
      collected the mutex acquisition time on Aim7 fserver workload with some
      instrumentation.  The average acquisition time is reduced by 48% and
      number of contentions reduced by 32%.
      
                       #contentions    Time to acquire mutex (cycles)
       2.6.34          72973           44765791
       With fix        49210           23067129
      
      The histogram of mutex acquisition time is listed below.  The acquisition
      time is in 2^bin cycles.  We see that without the fix, the acquisition
      time is mostly around 2^26 cycles.  With the fix, we the distribution get
      spread out a lot more towards the lower cycles, starting from 2^13.
      However, there is an increase of the tail distribution with the fix at
      2^28 and 2^29 cycles.  It seems a small price to pay for the reduced
      average acquisition time and also getting the cpu to do useful work.
      
       Mutex acquisition time distribution (acq time = 2^bin cycles):
               2.6.34                  With Fix
       bin     #occurrence     %       #occurrence     %
       11      2               0.00%   120             0.24%
       12      10              0.01%   790             1.61%
       13      14              0.02%   2058            4.18%
       14      86              0.12%   3378            6.86%
       15      393             0.54%   4831            9.82%
       16      710             0.97%   4893            9.94%
       17      815             1.12%   4667            9.48%
       18      790             1.08%   5147            10.46%
       19      580             0.80%   6250            12.70%
       20      429             0.59%   6870            13.96%
       21      311             0.43%   1809            3.68%
       22      255             0.35%   2305            4.68%
       23      317             0.44%   916             1.86%
       24      610             0.84%   233             0.47%
       25      3128            4.29%   95              0.19%
       26      63902           87.69%  122             0.25%
       27      619             0.85%   286             0.58%
       28      0               0.00%   3536            7.19%
       29      0               0.00%   903             1.83%
       30      0               0.00%   0               0.00%
      
      I've done similar experiments with 2.6.35 kernel on smaller boxes as
      well.  One is on a dual-socket Westmere box (12 cores total, with HT).
      Another experiment is on an old dual-socket Core 2 box (4 cores total, no
      HT)
      
      On the 12-core Westmere box, I see a 250% increase for Ingo's mutex-test
      program with my mutex patch but no significant difference in aim7's
      fserver workload.
      
      On the 4-core Core 2 box, I see the difference with the patch for both
      mutex-test and aim7 fserver are negligible.
      
      So far, it seems like the patch has not caused regression on smaller
      systems.
      Signed-off-by: NTim Chen <tim.c.chen@linux.intel.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: <stable@kernel.org> # .35.x
      LKML-Reference: <1282168827.9542.72.camel@schen9-DESK>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      9d0f4dcc
  14. 17 7月, 2010 2 次提交
  15. 01 7月, 2010 1 次提交
    • P
      sched: Cure nr_iowait_cpu() users · 8c215bd3
      Peter Zijlstra 提交于
      Commit 0224cf4c (sched: Intoduce get_cpu_iowait_time_us())
      broke things by not making sure preemption was indeed disabled
      by the callers of nr_iowait_cpu() which took the iowait value of
      the current cpu.
      
      This resulted in a heap of preempt warnings. Cure this by making
      nr_iowait_cpu() take a cpu number and fix up the callers to pass
      in the right number.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Maxim Levitsky <maximlevitsky@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: linux-pm@lists.linux-foundation.org
      LKML-Reference: <1277968037.1868.120.camel@laptop>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      8c215bd3
  16. 25 6月, 2010 1 次提交
  17. 24 6月, 2010 1 次提交
  18. 22 6月, 2010 1 次提交
  19. 18 6月, 2010 2 次提交
  20. 09 6月, 2010 7 次提交
    • V
      sched: Change nohz idle load balancing logic to push model · 83cd4fe2
      Venkatesh Pallipadi 提交于
      In the new push model, all idle CPUs indeed go into nohz mode. There is
      still the concept of idle load balancer (performing the load balancing
      on behalf of all the idle cpu's in the system). Busy CPU kicks the nohz
      balancer when any of the nohz CPUs need idle load balancing.
      The kickee CPU does the idle load balancing on behalf of all idle CPUs
      instead of the normal idle balance.
      
      This addresses the below two problems with the current nohz ilb logic:
      * the idle load balancer continued to have periodic ticks during idle and
        wokeup frequently, even though it did not have any rebalancing to do on
        behalf of any of the idle CPUs.
      * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this
        periodic wakeup can result in a periodic additional interrupt on a CPU
        doing the timer broadcast.
      
      Also currently we are migrating the unpinned timers from an idle to the cpu
      doing idle load balancing (when all the cpus in the system are idle,
      there is no idle load balancing cpu and timers get added to the same idle cpu
      where the request was made. So the existing optimization works only on semi idle
      system).
      
      And In semi idle system, we no longer have periodic ticks on the idle load
      balancer CPU. Using that cpu will add more delays to the timers than intended
      (as that cpu's timer base may not be uptodate wrt jiffies etc). This was
      causing mysterious slowdowns during boot etc.
      
      For now, in the semi idle case, use the nearest busy cpu for migrating timers
      from an idle cpu.  This is good for power-savings anyway.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <1274486981.2840.46.camel@sbs-t61.sc.intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      83cd4fe2
    • V
      sched: Avoid side-effect of tickless idle on update_cpu_load · fdf3e95d
      Venkatesh Pallipadi 提交于
      tickless idle has a negative side effect on update_cpu_load(), which
      in turn can affect load balancing behavior.
      
      update_cpu_load() is supposed to be called every tick, to keep track
      of various load indicies. With tickless idle, there are no scheduler
      ticks called on the idle CPUs. Idle CPUs may still do load balancing
      (with idle_load_balance CPU) using the stale cpu_load. It will also
      cause problems when all CPUs go idle for a while and become active
      again. In this case loads would not degrade as expected.
      
      This is how rq->nr_load_updates change looks like under different
      conditions:
      
      <cpu_num> <nr_load_updates change>
      All CPUS idle for 10 seconds (HZ=1000)
      0 1621
      10 496
      11 139
      12 875
      13 1672
      14 12
      15 21
      1 1472
      2 2426
      3 1161
      4 2108
      5 1525
      6 701
      7 249
      8 766
      9 1967
      
      One CPU busy rest idle for 10 seconds
      0 10003
      10 601
      11 95
      12 966
      13 1597
      14 114
      15 98
      1 3457
      2 93
      3 6679
      4 1425
      5 1479
      6 595
      7 193
      8 633
      9 1687
      
      All CPUs busy for 10 seconds
      0 10026
      10 10026
      11 10026
      12 10026
      13 10025
      14 10025
      15 10025
      1 10026
      2 10026
      3 10026
      4 10026
      5 10026
      6 10026
      7 10026
      8 10026
      9 10026
      
      That is update_cpu_load works properly only when all CPUs are busy.
      If all are idle, all the CPUs get way lower updates.  And when few
      CPUs are busy and rest are idle, only busy and ilb CPU does proper
      updates and rest of the idle CPUs will do lower updates.
      
      The patch keeps track of when a last update was done and fixes up
      the load avg based on current time.
      
      On one of my test system SPECjbb with warehouse 1..numcpus, patch
      improves throughput numbers by ~1% (average of 6 runs).  On another
      test system (with different domain hierarchy) there is no noticable
      change in perf.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <AANLkTilLtDWQsAUrIxJ6s04WTgmw9GuOODc5AOrYsaR5@mail.gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      fdf3e95d
    • O
      sched: Simplify the reacquire_kernel_lock() logic · 246d86b5
      Oleg Nesterov 提交于
      - Contrary to what 6d558c3a says, there is no need to reload
        prev = rq->curr after the context switch. You always schedule
        back to where you came from, prev must be equal to current
        even if cpu/rq was changed.
      
      - This also means reacquire_kernel_lock() can use prev instead
        of current.
      
      - No need to reassign switch_count if reacquire_kernel_lock()
        reports need_resched(), we can just move the initial assignment
        down, under the "need_resched_nonpreemptible:" label.
      
      - Try to update the comment after context_switch().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100519125711.GA30199@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      246d86b5
    • P
      sched_clock: Add local_clock() API and improve documentation · c676329a
      Peter Zijlstra 提交于
      For people who otherwise get to write: cpu_clock(smp_processor_id()),
      there is now: local_clock().
      
      Also, as per suggestion from Andrew, provide some documentation on
      the various clock interfaces, and minimize the unsigned long long vs
      u64 mess.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      LKML-Reference: <1275052414.1645.52.camel@laptop>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      c676329a
    • T
      sched: add hooks for workqueue · 21aa9af0
      Tejun Heo 提交于
      Concurrency managed workqueue needs to know when workers are going to
      sleep and waking up.  Using these two hooks, cmwq keeps track of the
      current concurrency level and throttles execution of new works if it's
      too high and wakes up another worker from the sleep hook if it becomes
      too low.
      
      This patch introduces PF_WQ_WORKER to identify workqueue workers and
      adds the following two hooks.
      
      * wq_worker_waking_up(): called when a worker is woken up.
      
      * wq_worker_sleeping(): called when a worker is going to sleep and may
        return a pointer to a local task which should be woken up.  The
        returned task is woken up using try_to_wake_up_local() which is
        simplified ttwu which is called under rq lock and can only wake up
        local tasks.
      
      Both hooks are currently defined as noop in kernel/workqueue_sched.h.
      Later cmwq implementation will replace them with proper
      implementation.
      
      These hooks are hard coded as they'll always be enabled.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      21aa9af0
    • T
      sched: refactor try_to_wake_up() · 9ed3811a
      Tejun Heo 提交于
      Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up().
      The factoring out doesn't affect try_to_wake_up() much
      code-generation-wise.  Depending on configuration options, it ends up
      generating the same object code as before or slightly different one
      due to different register assignment.
      
      This is to help future implementation of try_to_wake_up_local().
      
      Mike Galbraith suggested rename to ttwu_post_activation() from
      ttwu_woken_up() and comment update in try_to_wake_up().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      9ed3811a
    • T
      sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining · 3a101d05
      Tejun Heo 提交于
      Currently, when a cpu goes down, cpu_active is cleared before
      CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
      default priority cpu notifier.  When a cpu is coming up, it's set
      before CPU_ONLINE but cpuset configuration again is updated from the
      same cpu notifier.
      
      For cpu notifiers, this presents an inconsistent state.  Threads which
      a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
      migrated to other cpus because the cpu is no more inactive.
      
      Fix it by updating cpu_active in the highest priority cpu notifier and
      cpuset configuration in the second highest when a cpu is coming up.
      Down path is updated similarly.  This guarantees that all other cpu
      notifiers see consistent cpu_active and cpuset configuration.
      
      cpuset_track_online_cpus() notifier is converted to
      cpuset_update_active_cpus() which just updates the configuration and
      now called from cpuset_cpu_[in]active() notifiers registered from
      sched_init_smp().  If cpuset is disabled, cpuset_update_active_cpus()
      degenerates into partition_sched_domains() making separate notifier
      for !CONFIG_CPUSETS unnecessary.
      
      This problem is triggered by cmwq.  During CPU_DOWN_PREPARE, hotplug
      callback creates a kthread and kthread_bind()s it to the target cpu,
      and the thread is expected to run on that cpu.
      
      * Ingo's test discovered __cpuinit/exit markups were incorrect.
        Fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Paul Menage <menage@google.com>
      3a101d05