1. 23 Jun, 2010 (1 commit)
    • rcu: apply RCU protection to wake_affine() · f3b577de
      Committed by Daniel J Blueman
      The task_group() function returns a pointer that must be protected
      by either RCU, the ->alloc_lock, or the cgroup lock (see the
      rcu_dereference_check() in task_subsys_state(), which is invoked by
      task_group()).  The wake_affine() function currently does none of these,
      which means that a concurrent update would be within its rights to free
      the structure returned by task_group().  Because wake_affine() uses this
      structure only to compute load-balancing heuristics, there is no reason
      to acquire either of the two locks.
      
      Therefore, this commit introduces an RCU read-side critical section that
      starts before the first call to task_group() and ends after the last use
      of the "tg" pointer returned from task_group().  Thanks to Li Zefan for
      pointing out the need to extend the RCU read-side critical section from
      that proposed by the original patch.
      Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
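      The fix above is the standard RCU read-side pattern. A minimal sketch of
      the idea (assuming the 2.6.35-era wake_affine() signature; the load
      heuristics are elided, and the body is illustrative rather than the
      actual diff):

          #include <linux/rcupdate.h>
          #include <linux/sched.h>

          /* Sketch only: the real function computes load heuristics via tg. */
          static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
          {
                  struct task_group *tg;
                  int balanced = 1;

                  rcu_read_lock();        /* opens before the first task_group() */
                  tg = task_group(current);
                  /* ... current task's load contribution via tg ... */
                  tg = task_group(p);
                  /* ... waking task's load contribution via tg ... */
                  rcu_read_unlock();      /* closes after the last use of tg */

                  return balanced;
          }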
  2. 01 Jun, 2010 (1 commit)
  3. 07 May, 2010 (1 commit)
    • sched: replace migration_thread with cpu_stop · 969c7921
      Committed by Tejun Heo
      Currently migration_thread is serving three purposes - migration
      pusher, context to execute active_load_balance() and forced context
      switcher for expedited RCU synchronize_sched.  All three roles are
      hardcoded into migration_thread() and determining which job is
      scheduled is slightly messy.
      
      This patch kills migration_thread and replaces all three uses with
      cpu_stop.  The three different roles of migration_thread() are
      split into three separate cpu_stop callbacks -
      migration_cpu_stop(), active_load_balance_cpu_stop() and
      synchronize_sched_expedited_cpu_stop() - and each use case now simply
      asks cpu_stop to execute the callback as necessary.
      
      synchronize_sched_expedited() was implemented with private
      preallocated resources and custom multi-cpu queueing and waiting
      logic, both of which are provided by cpu_stop.
      synchronize_sched_expedited_count is made atomic and all other shared
      resources along with the mutex are dropped.
      
      synchronize_sched_expedited() also implemented a check to detect cases
      where not all the callbacks got executed on their assigned cpus, falling
      back to synchronize_sched().  If called with cpu hotplug blocked,
      cpu_stop already guarantees that and the condition cannot happen;
      otherwise, stop_machine() would break.  However, this patch preserves
      the paranoid check using a cpumask to record on which cpus the stopper
      ran so that it can serve as a bisection point if something actually
      goes wrong there.
      
      Because the internal execution state is no longer visible,
      rcu_expedited_torture_stats() is removed.
      
      This patch also renames the cpu_stop threads from "stopper/%d" to
      "migration/%d".  The names of these threads ultimately don't matter
      and there's no reason to make unnecessary userland visible changes.
      
      With this patch applied, stop_machine() and sched now share the same
      resources.  stop_machine() is faster without wasting any resources and
      sched migration users are much cleaner.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Josh Triplett <josh@freedesktop.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
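      For context, the cpu_stop interface that the three roles move onto looks
      roughly like the following. my_push_task_cb() is a hypothetical
      placeholder standing in for migration_cpu_stop(); only stop_one_cpu() and
      task_cpu() are real kernel APIs here:

          #include <linux/stop_machine.h>
          #include <linux/sched.h>

          /* A cpu_stop callback runs on the target CPU from its stopper task. */
          static int my_push_task_cb(void *arg)
          {
                  struct task_struct *p = arg;

                  /* ... push p off this CPU, as migration_cpu_stop() does ... */
                  return 0;
          }

          static void queue_migration(struct task_struct *p)
          {
                  /* Queue the callback on p's CPU and wait for it to finish. */
                  stop_one_cpu(task_cpu(p), my_push_task_cb, p);
          }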
  4. 23 Apr, 2010 (2 commits)
    • sched: Fix select_idle_sibling() logic in select_task_rq_fair() · 99bd5e2f
      Committed by Suresh Siddha
      Issues in the current select_idle_sibling() logic in select_task_rq_fair()
      in the context of a task wake-up:
      
      a) Once we select the idle sibling, we use that domain (spanning the cpu
         the task is currently being woken up on and the idle sibling that we
         found) in our wake_affine() decisions. This domain is completely
         different from the domain (which we are supposed to use) that spans the
         cpu the task is being woken up on and the cpu where the task previously
         ran.
      
      b) We do the select_idle_sibling() check only for the cpu that the task is
         currently being woken up on. If select_task_rq_fair() selects the
         previously-run cpu for waking the task, doing a select_idle_sibling()
         check for that cpu would also help, and we don't do this currently.
      
      c) In scenarios where the cpu that the task is woken up on is busy but its
         HT siblings are idle, we select the task to be woken up on an idle HT
         sibling instead of a core where it previously ran and which is
         currently completely idle. i.e., we are not taking decisions based on
         wake_affine() but directly selecting an idle sibling, which can cause
         an imbalance at the SMT/MC level that will later be corrected by the
         periodic load balancer.
      
      Fix this by first going through the load imbalance calculations using
      wake_affine(), and only once we have decided between the woken-up cpu and
      the previously-ran cpu do we choose a possible idle sibling to wake the
      task up on.
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1270079265.7835.8.camel@sbs-t61.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
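      The reworked ordering, in sketch form: wake_affine() first decides between
      the waking cpu and the previous cpu, and only then is an idle sibling of
      the winner considered. Helper names follow the kernel's, but the parameter
      lists are simplified and the bodies elided:

          /* Simplified sketch of the fixed wake-up path in select_task_rq_fair(). */
          static int pick_wakeup_cpu(struct task_struct *p, int this_cpu, int prev_cpu,
                                     struct sched_domain *affine_sd, int sync)
          {
                  int target = prev_cpu;

                  /* 1) affine decision: waking cpu vs. the cpu the task last ran on */
                  if (affine_sd && wake_affine(affine_sd, p, sync))
                          target = this_cpu;

                  /* 2) only then look for an idle SMT/MC sibling of the chosen cpu */
                  return select_idle_sibling(p, target);
          }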
    • sched: Pre-compute cpumask_weight(sched_domain_span(sd)) · 669c55e9
      Committed by Peter Zijlstra
      Dave reported that his large SPARC machines spend lots of time in
      hweight64(); try to optimize away some of those needless cpumask_weight()
      invocations (especially with large offstack cpumasks these are very
      expensive indeed).
      Reported-by: David Miller <davem@davemloft.net>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
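      The optimization amounts to hoisting the weight computation out of the hot
      paths: compute it once when the domain hierarchy is built and cache it in
      the sched_domain (the patch stores it in sd->span_weight). A rough sketch:

          #include <linux/sched.h>

          /* Done once per domain when the hierarchy is (re)built. */
          static void cache_span_weight(struct sched_domain *sd)
          {
                  sd->span_weight = cpumask_weight(sched_domain_span(sd));
          }

          /*
           * Hot paths then read sd->span_weight instead of calling hweight64()
           * on a potentially large offstack cpumask.
           */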
  5. 03 Apr, 2010 (2 commits)
    • sched: Add enqueue/dequeue flags · 371fd7e7
      Committed by Peter Zijlstra
      In order to reduce the dependency on TASK_WAKING rework the enqueue
      interface to support a proper flags field.
      
      Replace the int wakeup, bool head arguments with an int flags argument
      and create the following flags:
      
        ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
        ENQUEUE_WAKING - the enqueue has relative vruntime due to
                         having sched_class::task_waking() called,
        ENQUEUE_HEAD - the waking task should be placed on the head
                       of the priority queue (where appropriate).
      
      For symmetry also convert sched_class::dequeue() to a flags scheme.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
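      A sketch of the resulting interface; the flag values are illustrative and
      the struct below is a hypothetical fragment rather than the full
      sched_class definition:

          /* Flags replacing the old (int wakeup, bool head) arguments. */
          #define ENQUEUE_WAKEUP  0x01    /* wakeup of a sleeping task                 */
          #define ENQUEUE_WAKING  0x02    /* vruntime is relative (task_waking called) */
          #define ENQUEUE_HEAD    0x04    /* place the task at the head of the queue   */
          #define DEQUEUE_SLEEP   0x01    /* symmetric flag for the dequeue side       */

          /* Hypothetical fragment showing the reworked method signatures. */
          struct sched_class_fragment {
                  void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
                  void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
          };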
    • sched: Fix TASK_WAKING vs fork deadlock · 0017d735
      Committed by Peter Zijlstra
      Oleg noticed a few races with the TASK_WAKING usage on fork.
      
       - since TASK_WAKING is basically a spinlock, it should be IRQ safe
       - since we set TASK_WAKING (*) without holding rq->lock, there could
         still be an rq->lock holder, and so we are not actually providing
         full serialization.
      
      (*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.
      
      Cure the second issue by not setting TASK_WAKING in sched_fork(), but
      only temporarily in wake_up_new_task() while calling select_task_rq().
      
      Cure the first by holding rq->lock around the select_task_rq() call;
      this will disable IRQs. However, it requires that we push the rq->lock
      release down into select_task_rq_fair()'s cgroup code.
      
      Because select_task_rq_fair() still needs to drop the rq->lock we
      cannot fully get rid of TASK_WAKING.
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
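      The fixed ordering in wake_up_new_task(), sketched. The real code also
      handles the CONFIG_SMP split and the rq->lock drop/retake inside
      select_task_rq_fair(); exact signatures may differ slightly from this
      illustration:

          /* Sketch: TASK_WAKING is set only transiently, under rq->lock (IRQs off). */
          static void wake_up_new_task_sketch(struct task_struct *p)
          {
                  unsigned long flags;
                  struct rq *rq;
                  int cpu;

                  rq = task_rq_lock(p, &flags);   /* takes rq->lock, disables IRQs */
                  p->state = TASK_WAKING;         /* only around cpu selection     */
                  cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
                  set_task_cpu(p, cpu);
                  p->state = TASK_RUNNING;
                  activate_task(rq, p, 0);
                  task_rq_unlock(rq, &flags);
          }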
  6. 12 Mar, 2010 (10 commits)
  7. 11 Mar, 2010 (1 commit)
  8. 01 Mar, 2010 (1 commit)
  9. 26 Feb, 2010 (1 commit)
    • sched: Fix SCHED_MC regression caused by change in sched cpu_power · dd5feea1
      Committed by Suresh Siddha
      On platforms like a dual-socket quad-core system, the scheduler load
      balancer does not detect load imbalances in certain scenarios. This leads
      to situations where one socket is completely busy (with all 4 cores
      running 4 tasks) while another socket is left completely idle. This causes
      performance issues, as those 4 tasks share the memory controller and
      last-level cache bandwidth, and we also don't take as much advantage of
      turbo mode as we would like.
      
      Some of the comparisons in the scheduler load balancing code are
      comparing the "weighted cpu load that is scaled wrt sched_group's
      cpu_power" with the "weighted average load per task that is not scaled
      wrt sched_group's cpu_power". While this has probably been broken for a
      longer time (for multi-socket NUMA nodes etc.), the problem got aggravated
      via this recent change:
      
       |
       |  commit f93e65c1
       |  Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
       |  Date:   Tue Sep 1 10:34:32 2009 +0200
       |
       |	sched: Restore __cpu_power to a straight sum of power
       |
      
      Also with this change, the sched group cpu power alone no longer reflects
      the group capacity that is needed to implement MC, MT performance
      (default) and power-savings (user-selectable) policies.
      
      We need to use the computed group capacity (sgs.group_capacity, that is
      computed using the SD_PREFER_SIBLING logic in update_sd_lb_stats()) to
      find out if the group with the max load is above its capacity and how
      much load to move etc.
      Reported-by: Ma Ling <ling.ma@intel.com>
      Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      [ -v2: build fix ]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org> # [2.6.32.x, 2.6.33.x]
      LKML-Reference: <1266970432.11588.22.camel@sbs-t61.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
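      The core of the fix, in sketch form: when the group load is expressed in
      units scaled by SCHED_LOAD_SCALE/cpu_power, the per-task average must be
      put into the same units before the two are compared. This illustrates the
      idea, not the exact expression used in update_sd_lb_stats():

          #include <linux/sched.h>

          /* Illustrative only: compare like with like by scaling both sides. */
          static unsigned long scaled_avg_load_per_task(unsigned long sum_weighted_load,
                                                        unsigned long nr_running,
                                                        unsigned long group_cpu_power)
          {
                  unsigned long avg = nr_running ? sum_weighted_load / nr_running : 0;

                  /* bring the per-task average into the same scaled units as the
                   * group load, which already reflects SCHED_LOAD_SCALE/cpu_power */
                  return avg * SCHED_LOAD_SCALE / group_cpu_power;
          }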
  10. 23 Jan, 2010 (1 commit)
  11. 21 Jan, 2010 (11 commits)
  12. 17 Jan, 2010 (1 commit)
  13. 17 Dec, 2009 (2 commits)
    • sched: Remove the cfs_rq dependency from set_task_cpu() · 88ec22d3
      Committed by Peter Zijlstra
      In order to remove the cfs_rq dependency from set_task_cpu() we
      need to ensure the task is cfs_rq invariant for all callsites.
      
      The simple approach is to subtract cfs_rq->min_vruntime from
      se->vruntime on dequeue, and add cfs_rq->min_vruntime on
      enqueue.
      
      However, this has the downside of breaking FAIR_SLEEPERS since
      we lose the old vruntime as we only maintain the relative
      position.
      
      To solve this, we observe that we only migrate runnable tasks;
      we do this using deactivate_task(.sleep=0) and
      activate_task(.wakeup=0), so we can restrict the min_vruntime
      invariance to that state.
      
      The only other case is wakeup balancing, since we want to
      maintain the old vruntime we cannot make it relative on dequeue,
      but since we don't migrate inactive tasks, we can do so right
      before we activate it again.
      
      This is where we need the new pre-wakeup hook, we need to call
      this while still holding the old rq->lock. We could fold it into
      ->select_task_rq(), but since that has multiple callsites and
      would obfuscate the locking requirements, that seems like a
      fudge.
      
      This leaves the fork() case, simply make sure that ->task_fork()
      leaves the ->vruntime in a relative state.
      
      This covers all cases where set_task_cpu() gets called, and
      ensures it sees a relative vruntime.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170518.191697025@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
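      The min_vruntime trick in sketch form, showing the invariant rather than
      the full patch: a runnable task's vruntime is made relative when it leaves
      a cfs_rq and made absolute again when it joins the next one.

          /* Keep se->vruntime cfs_rq-invariant while a runnable task migrates. */
          static void vruntime_make_relative(struct cfs_rq *cfs_rq, struct sched_entity *se)
          {
                  se->vruntime -= cfs_rq->min_vruntime;   /* on dequeue / migration out */
          }

          static void vruntime_make_absolute(struct cfs_rq *cfs_rq, struct sched_entity *se)
          {
                  se->vruntime += cfs_rq->min_vruntime;   /* on enqueue into the new rq */
          }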
    • sched: Select_task_rq_fair() must honour SD_LOAD_BALANCE · e4f42888
      Committed by Peter Zijlstra
      We should skip !SD_LOAD_BALANCE domains.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170517.653578430@chello.nl>
      Cc: stable@kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
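      The fix is essentially a one-line guard in the domain walk; a
      fragment-level sketch (not compilable on its own):

          for_each_domain(cpu, tmp) {
                  if (!(tmp->flags & SD_LOAD_BALANCE))
                          continue;               /* honour SD_LOAD_BALANCE */

                  /* ... existing affine / balance-domain selection logic ... */
          }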
  14. 15 Dec, 2009 (1 commit)
  15. 09 Dec, 2009 (4 commits)