1. 14 Sep 2010, 1 commit
  2. 10 Sep 2010, 1 commit
  3. 05 Sep 2010, 1 commit
  4. 20 Aug 2010, 1 commit
  5. 17 Jul 2010, 2 commits
  6. 29 Jun 2010, 1 commit
  7. 23 Jun 2010, 1 commit
    • rcu: apply RCU protection to wake_affine() · f3b577de
      Committed by Daniel J Blueman
      The task_group() function returns a pointer that must be protected
      by either RCU, the ->alloc_lock, or the cgroup lock (see the
      rcu_dereference_check() in task_subsys_state(), which is invoked by
      task_group()).  The wake_affine() function currently does none of these,
      which means that a concurrent update would be within its rights to free
      the structure returned by task_group().  Because wake_affine() uses this
      structure only to compute load-balancing heuristics, there is no reason
      to acquire either of the two locks.
      
      Therefore, this commit introduces an RCU read-side critical section that
      starts before the first call to task_group() and ends after the last use
      of the "tg" pointer returned from task_group().  Thanks to Li Zefan for
      pointing out the need to extend the RCU read-side critical section from
      that proposed by the original patch.
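
      As a rough illustration (not the committed diff), the shape of the fix
      is to bracket everything from the first task_group() call to the last
      use of the returned pointer with rcu_read_lock()/rcu_read_unlock().
      The helper below is a simplified stand-in for wake_affine() with the
      load math elided; it assumes the usual kernel-internal context.

          /* Simplified sketch only; the real wake_affine() lives in
           * kernel/sched_fair.c and computes per-group load weights. */
          static int wake_affine_sketch(struct task_struct *p, int sync)
          {
                  struct task_group *tg;
                  int affine = 0;

                  rcu_read_lock();        /* protects the task_group() result */
                  tg = task_group(current);
                  /* ... use tg for the waker's load weights ... */
                  tg = task_group(p);
                  /* ... use tg for the wakee's load weights; last use of tg ... */
                  rcu_read_unlock();      /* only after the last use of tg */

                  return affine;
          }
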
      Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  8. 18 Jun 2010, 2 commits
  9. 09 Jun 2010, 4 commits
    • sched: Add asymmetric group packing option for sibling domain · 532cb4c4
      Committed by Michael Neuling
      Check to see if the group is packed in a sched domain.

      This is primarily intended to be used at the sibling level.  Some cores
      like POWER7 prefer to use lower-numbered SMT threads.  In the case of
      POWER7, it can move to lower SMT modes only when the higher threads are
      idle.  When in lower SMT modes, the threads perform better since they
      share fewer core resources.  Hence when we have idle threads, we want
      them to be the higher-numbered ones.

      This adds a hook into f_b_g() called check_asym_packing() to check the
      packing.  This packing function is run on idle threads.  It checks to
      see if the busiest CPU in this domain (core in the P7 case) has a
      higher CPU number than the one the packing function is being run on.
      If it does, the imbalance is calculated and the busier, higher-numbered
      thread is returned to f_b_g() as the busiest group.  Here we are
      assuming a lower CPU number is equivalent to a lower SMT thread number.
      
      It also creates a new SD_ASYM_PACKING flag to enable this feature at
      any scheduler domain level.
      
      It also creates an arch hook to enable this feature at the sibling
      level.  The default function doesn't enable this feature.
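
      A small user-space model of the decision (not the kernel code; it
      assumes, as the message does, that a lower CPU number means a lower
      SMT thread):

          #include <stdbool.h>

          /* Return true when f_b_g() should report an imbalance: the busiest
           * group starts at a higher-numbered CPU than the one running the
           * packing check, so work should be packed onto the lower threads. */
          static bool check_asym_packing_sketch(int busiest_first_cpu, int this_cpu)
          {
                  if (busiest_first_cpu <= this_cpu)
                          return false;   /* already packed low enough */
                  return true;
          }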
      
      Based heavily on patch from Peter Zijlstra.
      Fixes from Srivatsa Vaddagiri.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <20100608045702.2936CCC897@localhost.localdomain>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Fix capacity calculations for SMT4 · 9d5efe05
      Committed by Srivatsa Vaddagiri
      Handle cpu capacity being reported as 0 on cores with a larger number
      of hardware threads.  For example, on a POWER7 core with 4 hardware
      threads, the core power is 1177 and thus the power of each hardware
      thread is 1177/4 = 294.  This low power can lead to the capacity for
      each hardware thread being calculated as 0, which leads to tasks
      bouncing madly within the core!

      Fix this by reporting the capacity of a hardware thread as 1, provided
      its power has not been scaled down significantly because of frequency
      scaling or real-time tasks' use of the cpu.
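
      The arithmetic from the message, as a standalone demo (SCHED_LOAD_SCALE
      assumed to be 1024, as in kernels of this vintage):

          #include <stdio.h>

          int main(void)
          {
                  unsigned long scale = 1024;                  /* SCHED_LOAD_SCALE */
                  unsigned long core_power = 1177;             /* POWER7 core      */
                  unsigned long thread_power = core_power / 4; /* 294              */
                  unsigned long capacity = thread_power / scale;   /* 0!           */

                  /* the fix: an unscaled SMT thread gets capacity 1, not 0 */
                  if (!capacity)
                          capacity = 1;
                  printf("thread power %lu -> capacity %lu\n", thread_power, capacity);
                  return 0;
          }
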
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      LKML-Reference: <20100608045702.21D03CC895@localhost.localdomain>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Change nohz idle load balancing logic to push model · 83cd4fe2
      Committed by Venkatesh Pallipadi
      In the new push model, all idle CPUs indeed go into nohz mode.  There is
      still the concept of an idle load balancer (performing the load balancing
      on behalf of all the idle CPUs in the system).  A busy CPU kicks the nohz
      balancer when any of the nohz CPUs need idle load balancing.  The kicked
      CPU then does the idle load balancing on behalf of all idle CPUs instead
      of the normal idle balance.

      This addresses the following two problems with the current nohz ilb logic:
      * The idle load balancer continued to have periodic ticks during idle and
        woke up frequently, even though it did not have any rebalancing to do on
        behalf of any of the idle CPUs.
      * On x86 and CPUs whose APIC timer stops on idle, this periodic wakeup can
        result in an additional periodic interrupt on the CPU doing the timer
        broadcast.

      Also, we currently migrate unpinned timers from an idle cpu to the cpu
      doing idle load balancing (when all the cpus in the system are idle,
      there is no idle load balancing cpu and timers get added to the same idle
      cpu where the request was made, so the existing optimization works only
      on a semi-idle system).

      And in a semi-idle system, we no longer have periodic ticks on the idle
      load balancer CPU.  Using that cpu will add more delay to the timers than
      intended (as that cpu's timer base may not be up to date wrt jiffies
      etc.).  This was causing mysterious slowdowns during boot etc.

      For now, in the semi-idle case, use the nearest busy cpu for migrating
      timers from an idle cpu.  This is good for power savings anyway.
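
      A toy user-space model of the push/kick idea (purely illustrative; in
      the kernel the "kick" is an IPI/resched of the chosen cpu, and the
      function name below is a made-up stand-in):

          #include <stdio.h>

          #define NR_CPUS 8
          static unsigned int nohz_idle_mask;  /* bit i set: cpu i is tickless idle */

          static int pick_ilb_cpu(void)
          {
                  for (int cpu = 0; cpu < NR_CPUS; cpu++)
                          if (nohz_idle_mask & (1u << cpu))
                                  return cpu;  /* first nohz-idle cpu acts as ilb */
                  return -1;
          }

          /* called from a busy cpu's scheduler tick */
          static void nohz_balancer_kick_sketch(int this_cpu, int rebalance_due)
          {
                  int ilb = pick_ilb_cpu();

                  if (!rebalance_due || ilb < 0)
                          return;
                  printf("cpu %d kicks cpu %d to balance for all idle cpus\n",
                         this_cpu, ilb);
          }

          int main(void)
          {
                  nohz_idle_mask = 0xF0;       /* cpus 4-7 idle in nohz mode */
                  nohz_balancer_kick_sketch(0, 1);
                  return 0;
          }
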
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <1274486981.2840.46.camel@sbs-t61.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Avoid side-effect of tickless idle on update_cpu_load · fdf3e95d
      Committed by Venkatesh Pallipadi
      Tickless idle has a negative side effect on update_cpu_load(), which
      in turn can affect load-balancing behavior.

      update_cpu_load() is supposed to be called every tick to keep track
      of various load indices.  With tickless idle, there are no scheduler
      ticks on the idle CPUs.  Idle CPUs may still do load balancing (with
      the idle_load_balance CPU) using the stale cpu_load.  It will also
      cause problems when all CPUs go idle for a while and become active
      again.  In this case the loads would not degrade as expected.

      This is how the change in rq->nr_load_updates looks under different
      conditions:

      <cpu_num> <nr_load_updates change>
      All CPUs idle for 10 seconds (HZ=1000)
      0 1621
      10 496
      11 139
      12 875
      13 1672
      14 12
      15 21
      1 1472
      2 2426
      3 1161
      4 2108
      5 1525
      6 701
      7 249
      8 766
      9 1967
      
      One CPU busy rest idle for 10 seconds
      0 10003
      10 601
      11 95
      12 966
      13 1597
      14 114
      15 98
      1 3457
      2 93
      3 6679
      4 1425
      5 1479
      6 595
      7 193
      8 633
      9 1687
      
      All CPUs busy for 10 seconds
      0 10026
      10 10026
      11 10026
      12 10026
      13 10025
      14 10025
      15 10025
      1 10026
      2 10026
      3 10026
      4 10026
      5 10026
      6 10026
      7 10026
      8 10026
      9 10026
      
      That is, update_cpu_load() works properly only when all CPUs are busy.
      If all are idle, all the CPUs get far fewer updates.  And when a few
      CPUs are busy and the rest are idle, only the busy and ilb CPUs do
      proper updates while the rest of the idle CPUs get fewer updates.

      The patch keeps track of when the last update was done and fixes up
      the load averages based on the current time.

      On one of my test systems, running SPECjbb with warehouses 1..numcpus,
      the patch improves throughput by ~1% (average of 6 runs).  On another
      test system (with a different domain hierarchy) there is no noticeable
      change in performance.
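
      The catch-up idea, modeled in user space (the real patch uses a
      precomputed table of degrade factors; here the per-tick decay is simply
      iterated for the missed ticks, with the incoming load taken as zero):

          #include <stdio.h>

          /* one tick of cpu_load[idx] decay with zero new load is roughly
           * load = load * (2^idx - 1) / 2^idx; apply it 'missed' times */
          static unsigned long decay_missed_ticks(unsigned long load, int idx,
                                                  unsigned int missed)
          {
                  while (missed--)
                          load -= load >> idx;
                  return load;
          }

          int main(void)
          {
                  unsigned long load = 2048;
                  /* cpu was tickless for 100 ticks: fix up cpu_load[2] in one go */
                  printf("cpu_load[2]: %lu -> %lu\n",
                         load, decay_missed_ticks(load, 2, 100));
                  return 0;
          }
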
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <AANLkTilLtDWQsAUrIxJ6s04WTgmw9GuOODc5AOrYsaR5@mail.gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  10. 01 Jun 2010, 1 commit
  11. 07 May 2010, 1 commit
    • sched: replace migration_thread with cpu_stop · 969c7921
      Committed by Tejun Heo
      Currently migration_thread is serving three purposes - migration
      pusher, context to execute active_load_balance() and forced context
      switcher for expedited RCU synchronize_sched.  All three roles are
      hardcoded into migration_thread() and determining which job is
      scheduled is slightly messy.
      
      This patch kills migration_thread and replaces all three uses with
      cpu_stop.  The three different roles of migration_thread() are split
      into three separate cpu_stop callbacks -
      migration_cpu_stop(), active_load_balance_cpu_stop() and
      synchronize_sched_expedited_cpu_stop() - and each use case now simply
      asks cpu_stop to execute the callback as necessary.
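
      For illustration, the usage pattern looks roughly like the sketch below.
      stop_one_cpu() and the callback signature are the real cpu_stop
      interfaces; the argument struct and the callback body are made up here.

          #include <linux/kernel.h>
          #include <linux/stop_machine.h>

          struct alb_arg {
                  int busiest_cpu;
                  int target_cpu;
          };

          /* runs on arg->busiest_cpu via the cpu_stop machinery, replacing
           * the old "wake migration_thread and let it guess why" dance */
          static int active_load_balance_sketch(void *data)
          {
                  struct alb_arg *arg = data;

                  /* ... find one task on arg->busiest_cpu and push it over
                   * to arg->target_cpu; details elided ... */
                  pr_debug("alb sketch: %d -> %d\n",
                           arg->busiest_cpu, arg->target_cpu);
                  return 0;
          }

          /* caller side (schematic):
           *     struct alb_arg arg = { .busiest_cpu = busiest, .target_cpu = this };
           *     stop_one_cpu(busiest, active_load_balance_sketch, &arg);
           * the real code uses the _nowait variant where it cannot sleep. */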
      
      synchronize_sched_expedited() was implemented with private
      preallocated resources and custom multi-cpu queueing and waiting
      logic, both of which are provided by cpu_stop.
      synchronize_sched_expedited_count is made atomic and all other shared
      resources along with the mutex are dropped.
      
      synchronize_sched_expedited() also implemented a check to detect cases
      where not all the callbacks got executed on their assigned cpus, and to
      fall back to synchronize_sched().  If called with cpu hotplug blocked,
      cpu_stop already guarantees that and the condition cannot happen;
      otherwise, stop_machine() would break.  However, this patch preserves
      the paranoid check using a cpumask to record on which cpus the stopper
      ran, so that it can serve as a bisection point if something actually
      goes wrong there.
      
      Because the internal execution state is no longer visible,
      rcu_expedited_torture_stats() is removed.
      
      This patch also renames the cpu_stop threads from "stopper/%d" to
      "migration/%d".  The names of these threads ultimately don't matter,
      and there's no reason to make unnecessary userland-visible changes.
      
      With this patch applied, stop_machine() and sched now share the same
      resources.  stop_machine() is faster without wasting any resources and
      sched migration users are much cleaner.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Josh Triplett <josh@freedesktop.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
  12. 23 Apr 2010, 2 commits
    • sched: Fix select_idle_sibling() logic in select_task_rq_fair() · 99bd5e2f
      Committed by Suresh Siddha
      Issues in the current select_idle_sibling() logic in select_task_rq_fair()
      in the context of a task wake-up:
      
      a) Once we select the idle sibling, we use that domain (spanning the cpu
         that the task is currently being woken up on and the idle sibling
         that we found) in our wake_affine() decisions.  This domain is
         completely different from the domain (which we are supposed to use)
         that spans the cpu the task is currently being woken up on and the
         cpu where the task previously ran.

      b) We do the select_idle_sibling() check only for the cpu that the task
         is currently being woken up on.  If select_task_rq_fair() selects the
         previously-run cpu for waking the task, doing a select_idle_sibling()
         check for that cpu would also help, and we don't do this currently.

      c) In the scenarios where the cpu that the task is being woken up on is
         busy but its HT siblings are idle, we select the idle HT sibling for
         the wakeup instead of a core that the task previously ran on and that
         is currently completely idle.  That is, we are not making decisions
         based on wake_affine() but directly selecting an idle sibling, which
         can cause an imbalance at the SMT/MC level that will later be
         corrected by the periodic load balancer.

      Fix this by first going through the load-imbalance calculations using
      wake_affine(), and only once we have decided between the woken-up cpu
      and the previously-run cpu, choosing a possible idle sibling of that
      cpu to wake the task up on.
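
      The resulting ordering, as a simplified sketch (the wake_affine()
      outcome is passed in as a flag, and the idle-sibling helper name is
      hypothetical, standing in for the select_idle_sibling() pass):

          static int pick_wakeup_cpu_sketch(struct task_struct *p,
                                            int this_cpu, int prev_cpu,
                                            int affine)
          {
                  /* 1) decide woken-up cpu vs previously-run cpu first */
                  int target = affine ? this_cpu : prev_cpu;

                  /* 2) only then look for an idle sibling of the chosen cpu */
                  return find_idle_sibling_around(p, target);  /* hypothetical */
          }
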
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1270079265.7835.8.camel@sbs-t61.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Pre-compute cpumask_weight(sched_domain_span(sd)) · 669c55e9
      Committed by Peter Zijlstra
      Dave reported that his large SPARC machines spend lots of time in
      hweight64(), so try to optimize away some of those needless
      cpumask_weight() invocations (especially with the large offstack
      cpumasks, these are very expensive indeed).
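
      A sketch of the idea (caching the weight in an sd->span_weight field is
      my reading of this commit; treat the field name as an assumption):

          /* at domain build time, pay for the popcount once */
          static void cache_span_weight(struct sched_domain *sd)
          {
                  sd->span_weight = cpumask_weight(sched_domain_span(sd));
          }

          /* hot paths then read a plain integer instead of recomputing:
           *     before: weight = cpumask_weight(sched_domain_span(sd));
           *     after:  weight = sd->span_weight;
           */
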
      Reported-by: David Miller <davem@davemloft.net>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  13. 03 Apr 2010, 2 commits
    • sched: Add enqueue/dequeue flags · 371fd7e7
      Committed by Peter Zijlstra
      In order to reduce the dependency on TASK_WAKING, rework the enqueue
      interface to support a proper flags field.
      
      Replace the int wakeup, bool head arguments with an int flags argument
      and create the following flags:
      
        ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
        ENQUEUE_WAKING - the enqueue has relative vruntime due to
                         having sched_class::task_waking() called,
        ENQUEUE_HEAD - the waking task should be placed at the head
                       of the priority queue (where appropriate).
      
      For symmetry also convert sched_class::dequeue() to a flags scheme.
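
      Schematically, the conversion looks like this (the exact flag values
      are an assumption; see kernel/sched.c for the real definitions):

          #define ENQUEUE_WAKEUP  0x01  /* wakeup of a sleeping task               */
          #define ENQUEUE_WAKING  0x02  /* vruntime is relative, task_waking() ran */
          #define ENQUEUE_HEAD    0x04  /* place at the head of the queue          */

          /* old: p->sched_class->enqueue_task(rq, p, wakeup, head);
           * new: p->sched_class->enqueue_task(rq, p,
           *              (wakeup ? ENQUEUE_WAKEUP : 0) |
           *              (head   ? ENQUEUE_HEAD   : 0));
           */
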
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Fix TASK_WAKING vs fork deadlock · 0017d735
      Committed by Peter Zijlstra
      Oleg noticed a few races with the TASK_WAKING usage on fork.
      
       - since TASK_WAKING is basically a spinlock, it should be IRQ safe
       - since we set TASK_WAKING (*) without holding rq->lock, it could be
         that there still is an rq->lock holder, thereby not actually
         providing full serialization.
      
      (*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.
      
      Cure the second issue by not setting TASK_WAKING in sched_fork(), but
      only temporarily in wake_up_new_task() while calling select_task_rq().
      
      Cure the first by holding rq->lock around the select_task_rq() call;
      this will disable IRQs.  This, however, requires that we push the
      rq->lock release down into select_task_rq_fair()'s cgroup code.
      
      Because select_task_rq_fair() still needs to drop the rq->lock we
      cannot fully get rid of TASK_WAKING.
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  14. 12 Mar 2010, 10 commits
  15. 11 Mar 2010, 1 commit
  16. 01 Mar 2010, 1 commit
  17. 26 Feb 2010, 1 commit
    • sched: Fix SCHED_MC regression caused by change in sched cpu_power · dd5feea1
      Committed by Suresh Siddha
      On platforms like a dual-socket quad-core system, the scheduler load
      balancer does not detect the load imbalance in certain scenarios.  This
      leads to situations where one socket is completely busy (with all
      4 cores running 4 tasks) while the other socket is left completely
      idle.  This causes performance issues, as those 4 tasks share the
      memory controller, last-level cache bandwidth, etc.  We also won't be
      taking advantage of turbo mode as much as we would like.

      Some of the comparisons in the scheduler load-balancing code compare
      the "weighted cpu load that is scaled wrt sched_group's cpu_power"
      with the "weighted average load per task that is not scaled wrt
      sched_group's cpu_power".  While this has probably been broken for a
      longer time (for multi-socket numa nodes etc.), the problem got
      aggravated via this recent change:
      
       |
       |  commit f93e65c1
       |  Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
       |  Date:   Tue Sep 1 10:34:32 2009 +0200
       |
       |	sched: Restore __cpu_power to a straight sum of power
       |
      
      Also with this change, the sched group cpu power alone no longer reflects
      the group capacity that is needed to implement MC, MT performance
      (default) and power-savings (user-selectable) policies.
      
      We need to use the computed group capacity (sgs.group_capacity, that is
      computed using the SD_PREFER_SIBLING logic in update_sd_lb_stats()) to
      find out if the group with the max load is above its capacity and how
      much load to move etc.
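
      A toy illustration of the units mismatch described above (numbers
      invented; SCHED_LOAD_SCALE assumed to be 1024):

          #include <stdio.h>

          int main(void)
          {
                  unsigned long scale = 1024;        /* SCHED_LOAD_SCALE        */
                  unsigned long group_power = 3600;  /* 4 cores, power scaled down */
                  unsigned long sum_load = 4096;     /* four nice-0 tasks       */
                  unsigned long nr_running = 4;

                  /* load scaled wrt the group's cpu_power ... */
                  unsigned long avg_load = sum_load * scale / group_power; /* ~1165 */
                  /* ... versus a per-task load that is NOT scaled */
                  unsigned long load_per_task = sum_load / nr_running;     /* 1024  */
                  /* the fix compares against the computed group capacity instead */
                  unsigned long group_capacity = (group_power + scale / 2) / scale;

                  printf("avg_load=%lu load_per_task=%lu capacity=%lu\n",
                         avg_load, load_per_task, group_capacity);
                  return 0;
          }
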
      Reported-by: Ma Ling <ling.ma@intel.com>
      Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      [ -v2: build fix ]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org> # [2.6.32.x, 2.6.33.x]
      LKML-Reference: <1266970432.11588.22.camel@sbs-t61.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  18. 23 Jan 2010, 1 commit
  19. 21 Jan 2010, 6 commits