1. 17 December 2013, 3 commits
• sched/numa: Drop sysctl_numa_balancing_settle_count sysctl · 1bd53a7e
  Wanpeng Li authored
Commit 887c290e (sched/numa: Decide whether to favour task or group weights
based on swap candidate relationships) dropped the check against
sysctl_numa_balancing_settle_count; this patch removes the now-unused sysctl.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Link: http://lkml.kernel.org/r/1386833006-6600-1-git-send-email-liwanp@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
• sched/rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities · 757dfcaa
  Kirill Tkhai authored
      This patch touches the RT group scheduling case.
      
Functions inc_rt_prio_smp() and dec_rt_prio_smp() change the (global) rq's
priority, while the rt_rq passed to them may not be the top-level rt_rq.
This is wrong, because changing the priority at a child level does not
guarantee that it is the highest priority over the whole rq. This leak
makes RT balancing unusable.
      
A short example: the task with the highest priority among all of the rq's
RT tasks (no other task has the same priority) wakes up on a throttled
rt_rq.  The rq's cpupri is set to that task's priority, but the real
rq->rt.highest_prio.curr is lower.
      
      The patch below fixes the problem.
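
Illustratively, the fix boils down to a guard of this shape in
inc_rt_prio_smp()/dec_rt_prio_smp(); treat it as a sketch based on the
3.13-era kernel/sched/rt.c rather than the literal diff:

	static void inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
	{
		struct rq *rq = rq_of_rt_rq(rt_rq);

	#ifdef CONFIG_RT_GROUP_SCHED
		/*
		 * A child rt_rq's prio says nothing about the rq as a whole,
		 * so only the top-level rt_rq may update the rq's cpupri.
		 */
		if (&rq->rt != rt_rq)
			return;
	#endif
		if (rq->online && prio < prev_prio)
			cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
	}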
Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
• sched: Assign correct scheduling domain to 'sd_llc' · 5d4cf996
  Mel Gorman authored
Commit 42eb088e (sched: Avoid NULL dereference on sd_busy) corrected a NULL
dereference on sd_busy, but the fix also altered which scheduling domain is
used for the 'sd_llc' percpu variable.
      
One impact of this is that a task selecting a runqueue may consider
idle CPUs that are not cache siblings as candidates for running.
Tasks then end up running on CPUs where they are not cache hot.
      
This was found through bisection where ebizzy threads were not seeing equal
performance and it looked like a scheduling fairness issue. This patch
mitigates but does not completely fix the problem on all machines tested,
implying there may be an additional bug or a common root cause. Here is
the average range of performance seen by individual ebizzy threads. It
was tested on top of candidate patches related to x86 TLB range flushing.
      
      	4-core machine
      			    3.13.0-rc3            3.13.0-rc3
      			       vanilla            fixsd-v3r3
      	Mean   1        0.00 (  0.00%)        0.00 (  0.00%)
      	Mean   2        0.34 (  0.00%)        0.10 ( 70.59%)
      	Mean   3        1.29 (  0.00%)        0.93 ( 27.91%)
      	Mean   4        7.08 (  0.00%)        0.77 ( 89.12%)
      	Mean   5      193.54 (  0.00%)        2.14 ( 98.89%)
      	Mean   6      151.12 (  0.00%)        2.06 ( 98.64%)
      	Mean   7      115.38 (  0.00%)        2.04 ( 98.23%)
      	Mean   8      108.65 (  0.00%)        1.92 ( 98.23%)
      
	8-core machine
			    3.13.0-rc3            3.13.0-rc3
			       vanilla            fixsd-v3r3
      	Mean   1         0.00 (  0.00%)        0.00 (  0.00%)
      	Mean   2         0.40 (  0.00%)        0.21 ( 47.50%)
      	Mean   3        23.73 (  0.00%)        0.89 ( 96.25%)
      	Mean   4        12.79 (  0.00%)        1.04 ( 91.87%)
      	Mean   5        13.08 (  0.00%)        2.42 ( 81.50%)
      	Mean   6        23.21 (  0.00%)       69.46 (-199.27%)
      	Mean   7        15.85 (  0.00%)      101.72 (-541.77%)
      	Mean   8       109.37 (  0.00%)       19.13 ( 82.51%)
      	Mean   12      124.84 (  0.00%)       28.62 ( 77.07%)
      	Mean   16      113.50 (  0.00%)       24.16 ( 78.71%)
      
The thread-to-thread variance is eliminated on one machine and reduced on the other.
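
For reference, a sketch of the kind of change in update_top_cache_domain():
keep 'sd_llc' pointing at the SD_SHARE_PKG_RESOURCES domain itself and derive
sd_busy from its parent via a separate temporary. This is based on the v3.13
code; the exact shape is an assumption, not the literal diff:

	struct sched_domain *busy_sd = NULL;

	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	if (sd) {
		id = cpumask_first(sched_domain_span(sd));
		size = cpumask_weight(sched_domain_span(sd));
		busy_sd = sd->parent; /* sd_busy */
	}
	rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);

	/* sd_llc must stay on the last cache-sharing level, not its parent. */
	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);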
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: H Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2. 11 December 2013, 2 commits
• sched/fair: Rework sched_fair time accounting · 9dbdb155
  Peter Zijlstra authored
      Christian suffers from a bad BIOS that wrecks his i5's TSC sync. This
      results in him occasionally seeing time going backwards - which
      crashes the scheduler ...
      
Most of our time accounting can actually handle that, except for the most
common one: the tick time update of sched_fair.
      
There is a further problem with that code: previously we assumed that,
because we get a tick every TICK_NSEC, our time delta could never exceed
32 bits, which kept the math simpler.
      
However, ever since Frederic managed to get NO_HZ_FULL merged, this is
no longer the case, since now a task can run for a long time indeed
without getting a tick. It only takes about ~4.2 seconds to overflow
our u32 in nanoseconds.
      
      This means we not only need to better deal with time going backwards;
      but also means we need to be able to deal with large deltas.
      
      This patch reworks the entire code and uses mul_u64_u32_shr() as
      proposed by Andy a long while ago.
      
We express our virtual time scale factor as a u32 multiplier plus a right
shift, and the 32-bit mul_u64_u32_shr() implementation reduces to a
single 32x32->64 multiply if the time delta is still short (the common
case).
      
      For 64bit a 64x64->128 multiply can be used if ARCH_SUPPORTS_INT128.
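
A minimal standalone sketch of the multiplier-plus-shift scheme (the
64-bit/__int128 variant mentioned above); the helper below mirrors the shape
of the kernel's mul_u64_u32_shr() but is an illustration, assuming a GCC or
Clang toolchain that provides unsigned __int128:

	#include <stdint.h>
	#include <stdio.h>

	/* Scale a 64-bit nanosecond delta by mult/2^shift without overflow. */
	static uint64_t mul_u64_u32_shr(uint64_t a, uint32_t mul, unsigned int shift)
	{
		return (uint64_t)(((unsigned __int128)a * mul) >> shift);
	}

	int main(void)
	{
		/* A NO_HZ_FULL-sized delta: 10 seconds in ns, well past u32. */
		uint64_t delta_ns = 10ULL * 1000 * 1000 * 1000;
		/* Factor 3/4 expressed as a u32 multiplier with shift = 32. */
		uint32_t mult = (uint32_t)((3ULL << 32) / 4);

		/* Prints 7500000000: the delta scaled by 3/4. */
		printf("%llu\n", (unsigned long long)mul_u64_u32_shr(delta_ns, mult, 32));
		return 0;
	}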
Reported-and-Tested-by: Christian Engelmayer <cengelma@gmx.at>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: fweisbec@gmail.com
Cc: Paul Turner <pjt@google.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131118172706.GI3866@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
• sched: Initialize power_orig for overlapping groups · 8e8339a3
  Peter Zijlstra authored
Yinghai reported that he saw a divide-by-zero (/0) in sg_capacity() on his EX parts.
Make sure to always initialize power_orig now that we actually use it.
      
      Ideally build_sched_domains() -> init_sched_groups_power() would also
      initialize this; but for some yet unexplained reason some setups seem
      to miss updates there.
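
Illustratively, the fix is a one-line initialization in
build_overlap_sched_groups(); the exact placement below is an assumption
based on the v3.13 code, shown only as a sketch:

	sg->sgp->power = SCHED_POWER_SCALE * cpumask_weight(sg_span);
	/* power_orig was left 0 for overlapping groups, hence the /0 above. */
	sg->sgp->power_orig = sg->sgp->power;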
Reported-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
3. 05 December 2013, 1 commit
4. 27 November 2013, 6 commits
5. 20 November 2013, 3 commits
6. 13 November 2013, 4 commits
7. 06 November 2013, 5 commits
8. 29 October 2013, 5 commits
9. 28 October 2013, 1 commit
10. 26 October 2013, 1 commit
11. 16 October 2013, 3 commits
• sched: Remove get_online_cpus() usage · 6acce3ef
  Peter Zijlstra authored
Remove get_online_cpus() usage from the scheduler; there are four sites
that use it:
      
 - sched_init_smp(); where it's completely superfluous since we're in
   'early' boot and there simply cannot be any hotplugging.
      
 - sched_getaffinity(); we already take a raw spinlock to protect the
   task cpus_allowed mask; this disables preemption and therefore also
   stabilizes cpu_online_mask, as that's modified using stop_machine.
   However, switch to the active mask for symmetry with
   sched_setaffinity()/set_cpus_allowed_ptr(). We guarantee active-mask
   stability by inserting sync_rcu/sched() into _cpu_down.
      
 - sched_setaffinity(); we don't appear to need get_online_cpus()
   either; there are two sites where hotplug appears relevant:
          * cpuset_cpus_allowed(); for the !cpuset case we use possible_mask,
            for the cpuset case we hold task_lock, which is a spinlock and
            thus for mainline disables preemption (might cause pain on RT).
          * set_cpus_allowed_ptr(); Holds all scheduler locks and thus has
            preemption properly disabled; also it already deals with hotplug
            races explicitly where it releases them.
      
 - migrate_swap(); we can make stop_two_cpus() do the heavy lifting for
   us with a little trickery. By adding a sync_sched/rcu() after the
   CPU_DOWN_PREPARE notifier we can provide preempt/rcu guarantees for
   cpu_active_mask. Use these to validate that both our cpus are active
   when queueing the stop work before we queue the stop_machine works
   for take_cpu_down() (see the sketch after this list).
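
A sketch of that migrate_swap() validation, assuming the 3.13-era
stop_two_cpus() interface; the exact wording of the check is illustrative:

	/*
	 * With cpu_active_mask stabilized by preemption/RCU (thanks to the
	 * sync_sched/rcu() after CPU_DOWN_PREPARE), a lockless check is
	 * enough here; it is re-checked later under proper locks.
	 */
	if (!cpu_active(arg.src_cpu) || !cpu_active(arg.dst_cpu))
		goto out;

	ret = stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);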
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20131011123820.GV3081@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
• sched: Fix race in migrate_swap_stop() · 74602315
  Peter Zijlstra authored
      There is a subtle race in migrate_swap, when task P, on CPU A, decides to swap
      places with task T, on CPU B.
      
      Task P:
        - call migrate_swap
      Task T:
        - go to sleep, removing itself from the runqueue
      Task P:
        - double lock the runqueues on CPU A & B
      Task T:
        - get woken up, place itself on the runqueue of CPU C
      Task P:
        - see that task T is on a runqueue, and pretend to remove it
          from the runqueue on CPU B
      
      Now CPUs B & C both have corrupted scheduler data structures.
      
This patch fixes it by holding the pi_lock for both of the tasks
involved in the migrate swap. This prevents task T from waking up,
and placing itself onto another runqueue, until after migrate_swap
has released all locks.

This means that, when migrate_swap checks, task T will be either
on the runqueue where it was originally seen, or not on any
runqueue at all. migrate_swap() deals correctly with both of those cases.
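
Illustratively, the locking in migrate_swap_stop() then nests like this
(a sketch; double_raw_lock() as a helper that takes the two raw spinlocks
in a stable order is an assumption about the exact helper used):

	/*
	 * Take both tasks' pi_lock before the runqueue locks so neither
	 * task can complete a wakeup and move to a third runqueue while
	 * we inspect and swap them.
	 */
	double_raw_lock(&arg->src_task->pi_lock,
			&arg->dst_task->pi_lock);
	double_rq_lock(src_rq, dst_rq);

	/* ... verify the tasks are still on src_rq/dst_rq, then swap ... */

	double_rq_unlock(src_rq, dst_rq);
	raw_spin_unlock(&arg->dst_task->pi_lock);
	raw_spin_unlock(&arg->src_task->pi_lock);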
Tested-by: Joe Mario <jmario@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: hannes@cmpxchg.org
      Cc: aarcange@redhat.com
      Cc: srikar@linux.vnet.ibm.com
      Cc: tglx@linutronix.de
      Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/20131010181722.GO13848@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
• sched/rt: Add missing rmb() · 7c3f2ab7
  Peter Zijlstra authored
While discussing the proposed SCHED_DEADLINE patches, which in part mimic
the existing FIFO code, it was noticed that the wmb in rt_set_overloaded()
didn't have a matching barrier.

The only site using rt_overloaded() to test rto_count is pull_rt_task(),
and we should issue a matching rmb there before assuming there's an
rto_mask bit set.
      
      Without that smp_rmb() in there we could actually miss seeing the
      rto_mask bit.
      
      Also, change to using smp_[wr]mb(), even though this is SMP only code;
      memory barriers without smp_ always make me think they're against
      hardware of some sort.
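
For illustration, the pairing then looks roughly like this (a sketch based
on the surrounding 3.12-era code in kernel/sched/rt.c, not the full diff):

	/* rt_set_overload(): publish the rto_mask bit before rto_count. */
	cpumask_set_cpu(rq->cpu, rq->rd->rto_mask);
	smp_wmb();
	atomic_inc(&rq->rd->rto_count);

	/* pull_rt_task(): if we saw the count, make sure we see the bit. */
	if (likely(!rt_overloaded(this_rq)))
		return 0;
	smp_rmb();
	/* ... scan this_rq->rd->rto_mask for CPUs to pull from ... */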
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: vincent.guittot@linaro.org
      Cc: luca.abeni@unitn.it
      Cc: bruce.ashfield@windriver.com
      Cc: dhaval.giani@gmail.com
      Cc: rostedt@goodmis.org
      Cc: hgu1972@gmail.com
      Cc: oleg@redhat.com
      Cc: fweisbec@gmail.com
      Cc: darren@dvhart.com
      Cc: johan.eker@ericsson.com
      Cc: p.faure@akatech.ch
      Cc: paulmck@linux.vnet.ibm.com
      Cc: raistlin@linux.it
      Cc: claudio@evidence.eu.com
      Cc: insop.song@gmail.com
      Cc: michael@amarulasolutions.com
      Cc: liming.wang@windriver.com
      Cc: fchecconi@gmail.com
      Cc: jkacur@redhat.com
      Cc: tommaso.cucinotta@sssup.it
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: harald.gustafsson@ericsson.com
      Cc: nicola.manica@disi.unitn.it
      Cc: tglx@linutronix.de
Link: http://lkml.kernel.org/r/20131015103507.GF10651@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
12. 14 October 2013, 1 commit
13. 13 October 2013, 1 commit
14. 09 October 2013, 4 commits