1. 25 Mar 2009, 1 commit
  2. 17 Mar 2009, 2 commits
  3. 11 Mar 2009, 1 commit
    • M
      sched: add avg_overlap decay · df1c99d4
      Authored by Mike Galbraith
      Impact: more precise avg_overlap metric - better load-balancing
      
      avg_overlap is used to measure the runtime overlap of the waker and
      wakee.
      
      However, when a process changes behaviour, e.g. a pipe becomes
      un-congested and we don't need to go to sleep after a wakeup
      for a while, the avg_overlap value grows stale.
      
      While a task is running we now use its average runtime between
      preemptions as the avg_overlap measure, since the amount of runtime
      correlates with cache footprint.
      
      The longer we run, the less likely we are to want to be migrated
      to another CPU (a toy model of such a decaying average follows this
      entry).
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1236709131.25234.576.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      df1c99d4
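      A hedged, user-space sketch of the idea in the entry above: a running
      average that is pulled toward each new runtime sample, so stale history
      decays. The helper name update_avg, the 1/8 decay factor and the sample
      values are assumptions for illustration only, not the kernel's code.

      #include <stdint.h>
      #include <stdio.h>

      /* Toy model of a decaying running average in the spirit of avg_overlap:
       * each new sample pulls the average 1/8th of the way toward itself. */
      static void update_avg(uint64_t *avg, uint64_t sample)
      {
          int64_t diff = (int64_t)(sample - *avg);

          *avg += diff / 8;
      }

      int main(void)
      {
          uint64_t avg_overlap = 50000;   /* ns, learned while the pipe was congested */

          /* The pipe un-congests: the task now runs much longer between
           * preemptions, and the fresh samples drag the stale average along. */
          for (int i = 0; i < 20; i++)
              update_avg(&avg_overlap, 2000000);

          printf("avg_overlap after 20 long runs: %llu ns\n",
                 (unsigned long long)avg_overlap);
          return 0;
      }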
  4. 10 Mar 2009, 1 commit
  5. 06 Mar 2009, 1 commit
  6. 05 Mar 2009, 1 commit
    • F
      sched: don't rebalance if attached on NULL domain · 8a0be9ef
      Authored by Frederic Weisbecker
      Impact: fix function graph trace hang / drop pointless softirq on UP
      
      While debugging a function graph trace hang on an old PII, I saw
      that it consumed most of its time in the timer interrupt, and
      the domain rebalancing softirq was the biggest contributor.
      
      The timer interrupt calls trigger_load_balance(), which decides
      whether it is worth scheduling a rebalancing softirq.
      
      With a UP-built kernel, no problem arises because there are no
      sched domains to begin with.
      
      With an SMP-built kernel running on an SMP box, there is still no
      problem: the softirq will be raised each time we reach the
      next_balance time.
      
      But with an SMP-built kernel running on a UP box (most distros
      provide SMP kernels by default, whatever box you have), the CPU is
      attached to the NULL sched domain, and an unexpected behaviour
      occurs:
      
      trigger_load_balance() raises the rebalancing softirq; later, in
      softirq context, run_rebalance_domains() -> rebalance_domains(), where
      the for_each_domain(cpu, sd) loop is never entered because of the
      NULL domain we are attached to.  This means rq->next_balance is never
      updated, so on every subsequent timer tick we enter
      trigger_load_balance(), which will always re-raise the rebalancing
      softirq:
      
      if (time_after_eq(jiffies, rq->next_balance))
      	raise_softirq(SCHED_SOFTIRQ);
      
      So for each tick, we process this pointless softirq.
      
      This patch fixes it by checking whether we are attached to the NULL
      domain before raising the softirq.  Another possible fix would be to
      set rq->next_balance to the maximal possible jiffies value when we
      are attached to the NULL domain (a toy model of the check follows
      this entry).
      
      v2: build fix on UP
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <49af242d.1c07d00a.32d5.ffffc019@mx.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8a0be9ef
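      A minimal user-space model of the fix described above: bail out of the
      tick-time check when the CPU has no scheduling domain, so the softirq
      is never raised on a NULL-domain (effectively UP) setup. The struct
      layout, the has_sched_domain flag and the counter are illustrative
      assumptions, not the kernel's data structures.

      #include <stdbool.h>
      #include <stdio.h>

      struct rq {
          unsigned long next_balance;
          bool has_sched_domain;   /* false models "attached to the NULL domain" */
      };

      static int softirqs_raised;

      static void raise_rebalance_softirq(void)
      {
          softirqs_raised++;
      }

      static void trigger_load_balance(struct rq *rq, unsigned long jiffies)
      {
          /* The fix: nothing to rebalance across if there is no domain. */
          if (!rq->has_sched_domain)
              return;

          if (jiffies >= rq->next_balance)
              raise_rebalance_softirq();
      }

      int main(void)
      {
          struct rq rq = { .next_balance = 0, .has_sched_domain = false };

          for (unsigned long tick = 0; tick < 1000; tick++)
              trigger_load_balance(&rq, tick);

          printf("softirqs raised on NULL domain: %d\n", softirqs_raised);
          return 0;
      }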
  7. 02 Mar 2009, 1 commit
  8. 27 Feb 2009, 1 commit
  9. 26 Feb 2009, 2 commits
  10. 16 Feb 2009, 2 commits
  11. 12 Feb 2009, 1 commit
  12. 11 Feb 2009, 1 commit
  13. 06 Feb 2009, 1 commit
    • J
      wait: prevent exclusive waiter starvation · 777c6c5f
      Authored by Johannes Weiner
      With exclusive waiters, every process woken up through the wait queue must
      ensure that the next waiter down the line is woken when it has finished.
      
      Interruptible waiters don't do that when aborting due to a signal.  And if
      an aborting waiter is concurrently woken up through the waitqueue, no one
      will ever wake up the next waiter.
      
      This has been observed with __wait_on_bit_lock() used by
      lock_page_killable(): the first contender on the queue was aborting when
      the actual lock holder woke it up concurrently.  The aborted contender
      didn't acquire the lock and therefore never did an unlock followed by
      waking up the next waiter.
      
      Add abort_exclusive_wait() which removes the process' wait descriptor from
      the waitqueue, iff still queued, or wakes up the next waiter otherwise.
      It does so under the waitqueue lock.  Racing with a wake up means the
      aborting process is either already woken (removed from the queue) and will
      wake up the next waiter, or it will remove itself from the queue and the
      concurrent wake up will apply to the next waiter after it.
      
      Use abort_exclusive_wait() in __wait_event_interruptible_exclusive() and
      __wait_on_bit_lock() when they were interrupted by something other than
      a wake-up through the queue (a toy model of the race and the fix follows
      this entry).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Reported-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Mentored-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chuck Lever <cel@citi.umich.edu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>		["after some testing"]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      777c6c5f
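      A hedged, single-process C model of the race and the fix: an aborting
      exclusive waiter either removes itself from the queue or, if a
      concurrent wake-up already removed it, passes the wake-up on to the
      next waiter. The names dequeue_if_queued and wake_one_locked are made
      up for the illustration; only abort_exclusive_wait mirrors a name from
      the entry above, and the real kernel code is considerably richer.

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>

      struct waiter {
          struct waiter *next;
          bool woken;
      };

      struct waitqueue {
          pthread_mutex_t lock;
          struct waiter *head;          /* exclusive waiters, FIFO */
      };

      /* Remove w from q if it is still queued; return true if it was queued. */
      static bool dequeue_if_queued(struct waitqueue *q, struct waiter *w)
      {
          struct waiter **pp;

          for (pp = &q->head; *pp; pp = &(*pp)->next) {
              if (*pp == w) {
                  *pp = w->next;
                  return true;
              }
          }
          return false;
      }

      /* Wake exactly one exclusive waiter, queue lock held. */
      static void wake_one_locked(struct waitqueue *q)
      {
          if (q->head) {
              q->head->woken = true;    /* stand-in for waking the task */
              q->head = q->head->next;
          }
      }

      /* Model of abort_exclusive_wait(): called when the waiter bails out
       * because of a signal rather than a wake-up through the queue. */
      static void abort_exclusive_wait(struct waitqueue *q, struct waiter *w)
      {
          pthread_mutex_lock(&q->lock);
          if (!dequeue_if_queued(q, w)) {
              /* A concurrent wake-up already consumed our slot: pass it on,
               * otherwise the next waiter would sleep forever. */
              wake_one_locked(q);
          }
          pthread_mutex_unlock(&q->lock);
      }

      int main(void)
      {
          struct waitqueue q = { PTHREAD_MUTEX_INITIALIZER, NULL };
          struct waiter a = { NULL, false }, b = { NULL, false };

          a.next = &b;
          q.head = &a;

          pthread_mutex_lock(&q.lock);
          wake_one_locked(&q);              /* the lock holder wakes 'a' ... */
          pthread_mutex_unlock(&q.lock);

          abort_exclusive_wait(&q, &a);     /* ... while 'a' aborts on a signal */

          printf("b woken: %d\n", b.woken); /* 1: wake-up handed to next waiter */
          return 0;
      }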
  14. 05 Feb 2009, 1 commit
  15. 01 Feb 2009, 2 commits
  16. 15 Jan 2009, 3 commits
  17. 14 Jan 2009, 4 commits
  18. 12 Jan 2009, 1 commit
  19. 11 Jan 2009, 2 commits
  20. 07 Jan 2009, 1 commit
    • P
      sched: fix possible recursive rq->lock · da8d5089
      Authored by Peter Zijlstra
      Vaidyanathan Srinivasan reported:
      
       > =============================================
       > [ INFO: possible recursive locking detected ]
       > 2.6.28-autotest-tip-sv #1
       > ---------------------------------------------
       > klogd/5062 is trying to acquire lock:
       >  (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
       >
       > but task is already holding lock:
       >  (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
      
      With sched_mc set to 2 (it is off by default).
      
      Strictly speaking we'll not deadlock, because ttwu will not be able to
      place the migration task on our rq, but since the code can deal with
      both rqs getting unlocked, this seems the easiest way out.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      da8d5089
  21. 06 Jan 2009, 2 commits
  22. 05 Jan 2009, 2 commits
  23. 04 Jan 2009, 1 commit
    • M
      sched: put back some stack hog changes that were undone in kernel/sched.c · 6ca09dfc
      Authored by Mike Travis
      Impact: prevents panic from stack overflow on numa-capable machines.
      
      Some of the "removal of stack hogs" changes in kernel/sched.c by using
      node_to_cpumask_ptr were undone by the early cpumask API updates, and
      causes a panic due to stack overflow.  This patch undoes those changes
      by using cpumask_of_node() which returns a 'const struct cpumask *'.
      
      In addition, cpu_coregoup_map is replaced with cpu_coregroup_mask further
      reducing stack usage.  (Both of these updates removed 9 FIXME's!)
      
      Also:
         Pick up some remaining changes from the old 'cpumask_t' functions to
         the new 'struct cpumask *' functions.
      
         Optimize memory traffic by allocating each percpu local_cpu_mask on the
         same node as the referring cpu.
      Signed-off-by: Mike Travis <travis@sgi.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6ca09dfc
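      A user-space illustration of why the 'const struct cpumask *' interface
      saves stack: returning a pointer into a per-node table instead of a mask
      copied by value. NR_CPUS, MAX_NODES and the by-value helper name are
      assumptions made up for the example, not the kernel's definitions.

      #include <stdio.h>

      #define NR_CPUS   4096
      #define MAX_NODES 64

      struct cpumask {
          unsigned long bits[NR_CPUS / (8 * sizeof(unsigned long))];
      };

      static struct cpumask node_masks[MAX_NODES];   /* filled at boot in reality */

      /* Pointer-returning variant: no cpumask lands on the caller's stack. */
      static const struct cpumask *cpumask_of_node(int node)
      {
          return &node_masks[node];
      }

      /* By-value variant: the whole mask is copied onto the stack. */
      static struct cpumask node_to_cpumask_by_value(int node)
      {
          return node_masks[node];
      }

      int main(void)
      {
          const struct cpumask *p = cpumask_of_node(0);
          struct cpumask copy = node_to_cpumask_by_value(0);

          printf("pointer costs %zu bytes of stack, copy costs %zu bytes\n",
                 sizeof(p), sizeof(copy));
          return 0;
      }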
  24. 31 Dec 2008, 2 commits
    • M
      [PATCH] idle cputime accounting · 79741dd3
      Authored by Martin Schwidefsky
      The cpu time spent by the idle process actually doing something is
      currently accounted as idle time. This is plain wrong; the architectures
      that support VIRT_CPU_ACCOUNTING=y can do better and distinguish between
      the time spent doing nothing and the time spent by idle doing work. The
      first is accounted with account_idle_time and the second with
      account_system_time.
      
      The architectures that use the account_xxx_time interface directly, and
      not the account_xxx_ticks interface, now need to do the check for the
      idle process in their arch code. In particular, to improve the system
      vs. true idle time accounting, the arch code needs to measure the true
      idle time instead of just testing for the idle process.
      
      To improve the tick-based accounting as well, we would need an
      architecture primitive that can tell us whether the pt_regs of the
      interrupted context points to the magic instruction that halts the cpu.
      
      In addition, idle time is no longer added to the stime of the idle process.
      This field now contains the system time of the idle process as it should
      be. On systems without VIRT_CPU_ACCOUNTING this will always be zero as
      every tick that occurs while idle is running will be accounted as idle
      time.
      
      This patch contains the necessary common code changes to be able to
      distinguish idle system time from true idle time. Architectures with
      support for VIRT_CPU_ACCOUNTING need some changes to exploit this
      (a toy model of the tick-time split follows this entry).
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      79741dd3
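      A toy model of the accounting split described above: ticks landing on a
      truly idle CPU go to idle time, while ticks landing on the idle task
      when it is doing real work go to system time. The function names echo
      the entry; the one-argument signatures, the flags and the counters are
      simplifications assumed for the example, not the kernel interfaces.

      #include <stdbool.h>
      #include <stdio.h>

      struct cpustat {
          unsigned long long idle;
          unsigned long long system;
      };

      static struct cpustat stat;

      static void account_idle_time(unsigned long long ticks)
      {
          stat.idle += ticks;
      }

      static void account_system_time(unsigned long long ticks)
      {
          stat.system += ticks;
      }

      /* Called once per timer tick by the model's "arch code". */
      static void account_process_tick(bool current_is_idle, bool idle_doing_work)
      {
          if (current_is_idle && !idle_doing_work)
              account_idle_time(1);
          else
              account_system_time(1);
      }

      int main(void)
      {
          account_process_tick(true, false);   /* truly idle            */
          account_process_tick(true, true);    /* idle task doing work  */
          account_process_tick(false, false);  /* a normal task running */

          printf("idle=%llu system=%llu\n", stat.idle, stat.system);
          return 0;
      }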
    • M
      [PATCH] fix scaled & unscaled cputime accounting · 457533a7
      Authored by Martin Schwidefsky
      The utimescaled / stimescaled fields in the task structure and the
      global cpustat should be set on all architectures. On s390 the calls
      to account_user_time_scaled and account_system_time_scaled have never
      been added. In addition, system time that is accounted as guest time
      to a process's user time is currently added to the scaled system time
      instead of the scaled user time.
      
      To fix the bugs and to prevent future forgetfulness, this patch merges
      account_system_time_scaled into account_system_time and
      account_user_time_scaled into account_user_time (see the sketch after
      this entry).
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Michael Neuling <mikey@neuling.org>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      457533a7
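      A toy model of the merge: each accounting function takes both the plain
      and the scaled cputime and updates both fields, so an architecture can
      no longer update one and forget the other. The struct and signatures are
      simplified stand-ins assumed for the example, not the kernel's.

      #include <stdio.h>

      struct task_times {
          unsigned long long utime, utimescaled;
          unsigned long long stime, stimescaled;
      };

      static void account_user_time(struct task_times *t,
                                    unsigned long long cputime,
                                    unsigned long long cputime_scaled)
      {
          t->utime       += cputime;
          t->utimescaled += cputime_scaled;   /* previously a separate, optional call */
      }

      static void account_system_time(struct task_times *t,
                                      unsigned long long cputime,
                                      unsigned long long cputime_scaled)
      {
          t->stime       += cputime;
          t->stimescaled += cputime_scaled;
      }

      int main(void)
      {
          struct task_times t = { 0, 0, 0, 0 };

          account_user_time(&t, 100, 120);
          account_system_time(&t, 40, 48);

          printf("utime=%llu/%llu stime=%llu/%llu\n",
                 t.utime, t.utimescaled, t.stime, t.stimescaled);
          return 0;
      }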
  25. 29 Dec 2008, 3 commits
    • G
      sched: create "pushable_tasks" list to limit pushing to one attempt · 917b627d
      Authored by Gregory Haskins
      The RT scheduler employs a "push/pull" design to actively balance tasks
      within the system (on a per disjoint cpuset basis).  When a task is
      awoken, it is immediately determined if there are any lower priority
      cpus which should be preempted.  This is opposed to the way normal
      SCHED_OTHER tasks behave, which will wait for a periodic rebalancing
      operation to occur before spreading out load.
      
      When a particular RQ has more than one active RT task, it is said to
      be in an "overloaded" state.  Once this occurs, the system enters
      active balancing mode, where it will try to push the task away, or
      persuade a different cpu to pull it over.  The system stays in this
      state until it falls back to at most one queued RT task per RQ.
      
      However, the current implementation suffers from a limitation in the
      push logic.  Once overloaded, all tasks (other than current) on the
      RQ are analyzed on every push operation, even if they were previously
      unpushable (due to affinity, etc.).  What's more, the operation stops
      at the first task that is unpushable and will not look at items
      lower in the queue.  This causes two problems:
      
      1) We can have the same tasks analyzed over and over again during each
         push, which extends out the fast path in the scheduler for no
         gain.  Consider a RQ that has dozens of tasks that are bound to a
         core.  Each one of those tasks will be encountered and skipped
         for each push operation while they are queued.
      
      2) There may be lower-priority tasks under the unpushable task that
         could have been successfully pushed, but will never be considered
         until either the unpushable task is cleared, or a pull operation
         succeeds.  The net result is a potential latency source for mid
         priority tasks.
      
      This patch aims to rectify these two conditions by introducing a new
      priority-sorted list: "pushable_tasks".  A task is added to the list
      each time it is activated or preempted.  It is removed from the
      list any time it is deactivated, made current, or fails to push.
      
      This works because a task only needs to be attempted to push once.
      After an initial failure to push, the other cpus will eventually try to
      pull the task when the conditions are proper.  This also solves the
      problem that we don't completely analyze all tasks due to encountering
      an unpushable task.  Now every task will have a push attempted (when
      appropriate).
      
      This reduces latency both by shortening the critical section of the
      rq->lock for certain workloads, and by making sure the algorithm
      considers all eligible tasks in the system (a toy model of the list
      follows this entry).
      
      [ rostedt: added a couple more BUG_ONs ]
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Acked-by: Steven Rostedt <srostedt@redhat.com>
      917b627d
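      A hedged user-space sketch of the "pushable_tasks" bookkeeping: a
      priority-ordered list that tasks join when they become runnable but are
      not running, and leave when they run, stop being runnable, or fail a
      push, so each push attempt only ever looks at candidates still worth
      pushing. The kernel uses a plist inside the runqueue; the linked list,
      names and priority values here are illustrative assumptions.

      #include <stdio.h>

      struct task {
          int prio;                 /* lower value = higher priority */
          const char *name;
          struct task *next;
      };

      static struct task *pushable_head;

      /* Insert in priority order so a push can always take the head. */
      static void enqueue_pushable_task(struct task *p)
      {
          struct task **pp = &pushable_head;

          while (*pp && (*pp)->prio <= p->prio)
              pp = &(*pp)->next;
          p->next = *pp;
          *pp = p;
      }

      /* Remove a task that ran, was deactivated, or failed a push. */
      static void dequeue_pushable_task(struct task *p)
      {
          struct task **pp;

          for (pp = &pushable_head; *pp; pp = &(*pp)->next) {
              if (*pp == p) {
                  *pp = p->next;
                  return;
              }
          }
      }

      int main(void)
      {
          struct task a = { 10, "a", NULL }, b = { 5, "b", NULL }, c = { 20, "c", NULL };

          enqueue_pushable_task(&a);
          enqueue_pushable_task(&b);
          enqueue_pushable_task(&c);

          dequeue_pushable_task(&b);   /* "b" was made current or failed a push */

          for (struct task *t = pushable_head; t; t = t->next)
              printf("pushable: %s (prio %d)\n", t->name, t->prio);
          return 0;
      }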
    • G
      sched: add sched_class->needs_post_schedule() member · 967fc046
      Authored by Gregory Haskins
      We currently run class->post_schedule() outside of the rq->lock, which
      means that we need to test for the need to post_schedule outside of
      the lock to avoid a forced reacquisition.  This is currently not a
      problem as we only look at rq->rt.overloaded.  However, we want to
      enhance this going forward to look at more state in order to reduce
      the need for post_schedule to a bare minimum set.  Therefore, we
      introduce a new member function called needs_post_schedule(), which
      tests for the post_schedule condition without actually performing the
      work.  It is therefore safe to call this function before the rq->lock
      is released, because we are guaranteed not to drop the lock at an
      intermediate point (such as what post_schedule() may do).
      
      We will use this later in the series (a small model of the idea
      follows this entry).
      
      [ rostedt: removed paranoid BUG_ON ]
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      967fc046
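      A small model of the shape of the change: sample a cheap predicate
      while the lock is still held, remember the answer, and only then run
      the heavier post-schedule work. The rt_overloaded flag and the
      finish_schedule wrapper are assumptions for the illustration, not the
      scheduler's actual code paths.

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>

      struct rq {
          pthread_mutex_t lock;
          bool rt_overloaded;     /* the state post_schedule currently depends on */
      };

      /* Cheap test, rq->lock held, no work done here. */
      static bool needs_post_schedule(struct rq *rq)
      {
          return rq->rt_overloaded;
      }

      /* The real work; in the kernel this may drop rq->lock internally. */
      static void post_schedule(struct rq *rq)
      {
          (void)rq;
          printf("pushing RT tasks away\n");
      }

      static void finish_schedule(struct rq *rq)
      {
          pthread_mutex_lock(&rq->lock);
          bool need = needs_post_schedule(rq);    /* decided under the lock */
          pthread_mutex_unlock(&rq->lock);

          if (need)
              post_schedule(rq);
      }

      int main(void)
      {
          struct rq rq = { PTHREAD_MUTEX_INITIALIZER, true };

          finish_schedule(&rq);
          return 0;
      }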
    • G
      sched: make double-lock-balance fair · 8f45e2b5
      Authored by Gregory Haskins
      double_lock_balance() currently favors logically lower cpus since they
      often do not have to release their own lock to acquire a second lock.
      The result is that logically higher cpus can get starved when there is
      a lot of pressure on the RQs.  This can result in higher latencies on
      higher cpu-ids.
      
      This patch makes the algorithm fairer by forcing all paths to release
      both locks before acquiring them again.  Since callsites to
      double_lock_balance already consider it a potential preemption/reschedule
      point, they have the proper logic to recheck for atomicity violations
      (see the sketch after this entry).
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      8f45e2b5
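      A user-space sketch of the fairness idea: the caller always drops its
      own lock and then takes both locks in one fixed global order, so
      neither runqueue can keep starving the other. pthread mutexes stand in
      for rq->lock, and the ordering-by-id rule is an illustrative stand-in
      for however the kernel orders the two locks; this is not the kernel's
      double_lock_balance() implementation.

      #include <pthread.h>
      #include <stdio.h>

      struct rq {
          int id;
          pthread_mutex_t lock;
      };

      /* Take both runqueue locks in a fixed order to avoid deadlock. */
      static void double_rq_lock(struct rq *a, struct rq *b)
      {
          if (a->id < b->id) {
              pthread_mutex_lock(&a->lock);
              pthread_mutex_lock(&b->lock);
          } else {
              pthread_mutex_lock(&b->lock);
              pthread_mutex_lock(&a->lock);
          }
      }

      /* Caller holds this_rq->lock; returns with both locks held.  Because
       * the lock is always dropped here, callers must recheck their state
       * afterwards, as the entry above notes. */
      static void double_lock_balance(struct rq *this_rq, struct rq *busiest)
      {
          pthread_mutex_unlock(&this_rq->lock);
          double_rq_lock(this_rq, busiest);
      }

      int main(void)
      {
          struct rq rq0 = { 0, PTHREAD_MUTEX_INITIALIZER };
          struct rq rq1 = { 1, PTHREAD_MUTEX_INITIALIZER };

          pthread_mutex_lock(&rq1.lock);      /* we are CPU 1, holding our lock */
          double_lock_balance(&rq1, &rq0);    /* drop, then take both in order  */

          printf("both runqueue locks held\n");
          pthread_mutex_unlock(&rq0.lock);
          pthread_mutex_unlock(&rq1.lock);
          return 0;
      }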