1. 25 9月, 2017 1 次提交
  2. 12 9月, 2017 2 次提交
  3. 11 9月, 2017 1 次提交
  4. 09 9月, 2017 1 次提交
  5. 07 9月, 2017 1 次提交
  6. 10 8月, 2017 8 次提交
  7. 01 8月, 2017 1 次提交
    • V
      sched: cpufreq: Allow remote cpufreq callbacks · 674e7541
      Viresh Kumar 提交于
      With Android UI and benchmarks the latency of cpufreq response to
      certain scheduling events can become very critical. Currently, callbacks
      into cpufreq governors are only made from the scheduler if the target
      CPU of the event is the same as the current CPU. This means there are
      certain situations where a target CPU may not run the cpufreq governor
      for some time.
      
      One testcase to show this behavior is where a task starts running on
      CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
      system is configured such that the new tasks should receive maximum
      demand initially, this should result in CPU0 increasing frequency
      immediately. But because of the above mentioned limitation though, this
      does not occur.
      
      This patch updates the scheduler core to call the cpufreq callbacks for
      remote CPUs as well.
      
      The schedutil, ondemand and conservative governors are updated to
      process cpufreq utilization update hooks called for remote CPUs where
      the remote CPU is managed by the cpufreq policy of the local CPU.
      
      The intel_pstate driver is updated to always reject remote callbacks.
      
      This is tested with couple of usecases (Android: hackbench, recentfling,
      galleryfling, vellamo, Ubuntu: hackbench) on ARM hikey board (64 bit
      octa-core, single policy). Only galleryfling showed minor improvements,
      while others didn't had much deviation.
      
      The reason being that this patch only targets a corner case, where
      following are required to be true to improve performance and that
      doesn't happen too often with these tests:
      
      - Task is migrated to another CPU.
      - The task has high demand, and should take the target CPU to higher
        OPPs.
      - And the target CPU doesn't call into the cpufreq governor until the
        next tick.
      
      Based on initial work from Steve Muckle.
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Acked-by: NSaravana Kannan <skannan@codeaurora.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      674e7541
  8. 05 7月, 2017 1 次提交
    • J
      sched/fair: Fix load_balance() affinity redo path · 65a4433a
      Jeffrey Hugo 提交于
      If load_balance() fails to migrate any tasks because all tasks were
      affined, load_balance() removes the source CPU from consideration and
      attempts to redo and balance among the new subset of CPUs.
      
      There is a bug in this code path where the algorithm considers all active
      CPUs in the system (minus the source that was just masked out).  This is
      not valid for two reasons: some active CPUs may not be in the current
      scheduling domain and one of the active CPUs is dst_cpu. These CPUs should
      not be considered, as we cannot pull load from them.
      
      Instead of failing out of load_balance(), we may end up redoing the search
      with no valid CPUs and incorrectly concluding the domain is balanced.
      Additionally, if the group_imbalance flag was just set, it may also be
      incorrectly unset, thus the flag will not be seen by other CPUs in future
      load_balance() runs as that algorithm intends.
      
      Fix the check by removing CPUs not in the current domain and the dst_cpu
      from considertation, thus limiting the evaluation to valid remaining CPUs
      from which load might be migrated.
      Co-authored-by: NAustin Christ <austinwc@codeaurora.org>
      Co-authored-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Tested-by: NTyler Baicar <tbaicar@codeaurora.org>
      Signed-off-by: NJeffrey Hugo <jhugo@codeaurora.org>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Austin Christ <austinwc@codeaurora.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Timur Tabi <timur@codeaurora.org>
      Link: http://lkml.kernel.org/r/1496863138-11322-2-git-send-email-jhugo@codeaurora.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      65a4433a
  9. 29 6月, 2017 1 次提交
    • T
      sched/numa: Hide numa_wake_affine() from UP build · ff801b71
      Thomas Gleixner 提交于
      Stephen reported the following build warning in UP:
      
      kernel/sched/fair.c:2657:9: warning: 'struct sched_domain' declared inside
      parameter list
               ^
      /home/sfr/next/next/kernel/sched/fair.c:2657:9: warning: its scope is only this
      definition or declaration, which is probably not what you want
      
      Hide the numa_wake_affine() inline stub on UP builds to get rid of it.
      
      Fixes: 3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      ff801b71
  10. 24 6月, 2017 4 次提交
  11. 22 6月, 2017 1 次提交
  12. 20 6月, 2017 1 次提交
  13. 11 6月, 2017 1 次提交
  14. 08 6月, 2017 1 次提交
    • P
      sched/core: Implement new approach to scale select_idle_cpu() · 1ad3aaf3
      Peter Zijlstra 提交于
      Hackbench recently suffered a bunch of pain, first by commit:
      
        4c77b18c ("sched/fair: Make select_idle_cpu() more aggressive")
      
      and then by commit:
      
        c743f0a5 ("sched/fair, cpumask: Export for_each_cpu_wrap()")
      
      which fixed a bug in the initial for_each_cpu_wrap() implementation
      that made select_idle_cpu() even more expensive. The bug was that it
      would skip over CPUs when bits were consequtive in the bitmask.
      
      This however gave me an idea to fix select_idle_cpu(); where the old
      scheme was a cliff-edge throttle on idle scanning, this introduces a
      more gradual approach. Instead of stopping to scan entirely, we limit
      how many CPUs we scan.
      
      Initial benchmarks show that it mostly recovers hackbench while not
      hurting anything else, except Mason's schbench, but not as bad as the
      old thing.
      
      It also appears to recover the tbench high-end, which also suffered like
      hackbench.
      Tested-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Cc: kitsunyan <kitsunyan@inbox.ru>
      Cc: linux-kernel@vger.kernel.org
      Cc: lvenanci@redhat.com
      Cc: riel@redhat.com
      Cc: xiaolong.ye@intel.com
      Link: http://lkml.kernel.org/r/20170517105350.hk5m4h4jb6dfr65a@hirez.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1ad3aaf3
  15. 23 5月, 2017 1 次提交
    • V
      sched/numa: Use down_read_trylock() for the mmap_sem · 8655d549
      Vlastimil Babka 提交于
      A customer has reported a soft-lockup when running an intensive
      memory stress test, where the trace on multiple CPU's looks like this:
      
       RIP: 0010:[<ffffffff810c53fe>]
        [<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190
      ...
       Call Trace:
        [<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa
        [<ffffffff811bc331>] change_protection_range+0x3b1/0x930
        [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
        [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
        [<ffffffff81098322>] task_work_run+0x72/0x90
      
      Further investigation showed that the lock contention here is pmd_lock().
      
      The task_numa_work() function makes sure that only one thread is let to perform
      the work in a single scan period (via cmpxchg), but if there's a thread with
      mmap_sem locked for writing for several periods, multiple threads in
      task_numa_work() can build up a convoy waiting for mmap_sem for read and then
      all get unblocked at once.
      
      This patch changes the down_read() to the trylock version, which prevents the
      build up. For a workload experiencing mmap_sem contention, it's probably better
      to postpone the NUMA balancing work anyway. This seems to have fixed the soft
      lockups involving pmd_lock(), which is in line with the convoy theory.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170515131316.21909-1-vbabka@suse.czSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8655d549
  16. 15 5月, 2017 7 次提交
  17. 14 4月, 2017 4 次提交
    • P
      sched/fair: Move the PELT constants into a generated header · 283e2ed3
      Peter Zijlstra 提交于
      Now that we have a tool to generate the PELT constants in C form,
      use its output as a separate header.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      283e2ed3
    • P
      sched/fair: Increase PELT accuracy for small tasks · bb0bd044
      Peter Zijlstra 提交于
      We truncate (and loose) the lower 10 bits of runtime in
      ___update_load_avg(), this means there's a consistent bias to
      under-account tasks. This is esp. significant for small tasks.
      
      Cure this by only forwarding last_update_time to the point we've
      actually accounted for, leaving the remainder for the next time.
      Reported-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      bb0bd044
    • P
      sched/fair: Fix comments · 3841cdc3
      Peter Zijlstra 提交于
      Historically our periods (or p) argument in PELT denoted the number of
      full periods (what is now d2). However recent patches have changed
      this to the total decay (previously p+1), leading to a confusing
      discrepancy between comments and code.
      
      Try and clarify things by making periods (in code) and p (in comments)
      be the same thing (again).
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      3841cdc3
    • P
      sched/fair: Fix corner case in __accumulate_sum() · 05296e75
      Peter Zijlstra 提交于
      Paul noticed that in the (periods >= LOAD_AVG_MAX_N) case in
      __accumulate_sum(), the returned contribution value (LOAD_AVG_MAX) is
      incorrect.
      
      This is because at this point, the decay_load() on the old state --
      the first step in accumulate_sum() -- will not have resulted in 0, and
      will therefore result in a sum larger than the maximum value of our
      series. Obviously broken.
      
      Note that:
      
      	decay_load(LOAD_AVG_MAX, LOAD_AVG_MAX_N) =
      
                      1   (345 / 32)
      	47742 * - ^            = ~27
                      2
      
      Not to mention that any further contribution from the d3 segment (our
      new period) would also push it over the maximum.
      
      Solve this by noting that we can write our c2 term:
      
      		    p
      	c2 = 1024 \Sum y^n
      		   n=1
      
      In terms of our maximum value:
      
      		    inf		      inf	  p
      	max = 1024 \Sum y^n = 1024 ( \Sum y^n + \Sum y^n + y^0 )
      		    n=0		      n=p+1	 n=1
      
      Further note that:
      
                 inf              inf            inf
              ( \Sum y^n ) y^p = \Sum y^(n+p) = \Sum y^n
                 n=0              n=0            n=p
      
      Combined that gives us:
      
      		    p
      	c2 = 1024 \Sum y^n
      		   n=1
      
      		     inf        inf
      	   = 1024 ( \Sum y^n - \Sum y^n - y^0 )
      		     n=0        n=p+1
      
      	   = max - (max y^(p+1)) - 1024
      
      Further simplify things by dealing with p=0 early on.
      Reported-by: NPaul Turner <pjt@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Cc: linux-kernel@vger.kernel.org
      Fixes: a481db34 ("sched/fair: Optimize ___update_sched_avg()")
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      05296e75
  18. 30 3月, 2017 2 次提交
    • Y
      sched/fair: Optimize ___update_sched_avg() · a481db34
      Yuyang Du 提交于
      The main PELT function ___update_load_avg(), which implements the
      accumulation and progression of the geometric average series, is
      implemented along the following lines for the scenario where the time
      delta spans all 3 possible sections (see figure below):
      
        1. add the remainder of the last incomplete period
        2. decay old sum
        3. accumulate new sum in full periods since last_update_time
        4. accumulate the current incomplete period
        5. update averages
      
      Or:
      
                  d1          d2           d3
                  ^           ^            ^
                  |           |            |
                |<->|<----------------->|<--->|
        ... |---x---|------| ... |------|-----x (now)
      
        load_sum' = (load_sum + weight * scale * d1) * y^(p+1) +	(1,2)
      
                                              p
      	      weight * scale * 1024 * \Sum y^n +		(3)
                                             n=1
      
      	      weight * scale * d3 * y^0				(4)
      
        load_avg' = load_sum' / LOAD_AVG_MAX				(5)
      
      Where:
      
       d1 - is the delta part completing the remainder of the last
            incomplete period,
       d2 - is the delta part spannind complete periods, and
       d3 - is the delta part starting the current incomplete period.
      
      We can simplify the code in two steps; the first step is to separate
      the first term into new and old parts like:
      
        (load_sum + weight * scale * d1) * y^(p+1) = load_sum * y^(p+1) +
      					       weight * scale * d1 * y^(p+1)
      
      Once we've done that, its easy to see that all new terms carry the
      common factors:
      
        weight * scale
      
      If we factor those out, we arrive at the form:
      
        load_sum' = load_sum * y^(p+1) +
      
      	      weight * scale * (d1 * y^(p+1) +
      
      					 p
      			        1024 * \Sum y^n +
      					n=1
      
      				d3 * y^0)
      
      Which results in a simpler, smaller and faster implementation.
      Signed-off-by: NYuyang Du <yuyang.du@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: matt@codeblueprint.co.uk
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1486935863-25251-3-git-send-email-yuyang.du@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a481db34
    • P
      sched/fair: Explicitly generate __update_load_avg() instances · 0ccb977f
      Peter Zijlstra 提交于
      The __update_load_avg() function is an __always_inline because its
      used with constant propagation to generate different variants of the
      code without having to duplicate it (which would be prone to bugs).
      
      Explicitly instantiate the 3 variants.
      
      Note that most of this is called from rather hot paths, so reducing
      branches is good.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      0ccb977f
  19. 27 3月, 2017 1 次提交
    • S
      sched/fair: Prefer sibiling only if local group is under-utilized · 05b40e05
      Srikar Dronamraju 提交于
      If the child domain prefers tasks to go siblings, the local group could
      end up pulling tasks to itself even if the local group is almost equally
      loaded as the source group.
      
      Lets assume a 4 core,smt==2 machine running 5 thread ebizzy workload.
      Everytime, local group has capacity and source group has atleast 2 threads,
      local group tries to pull the task. This causes the threads to constantly
      move between different cores. This is even more profound if the cores have
      more threads, like in Power 8, smt 8 mode.
      
      Fix this by only allowing local group to pull a task, if the source group
      has more number of tasks than the local group.
      
      Here are the relevant perf stat numbers of a 22 core,smt 8 Power 8 machine.
      
      Without patch:
       Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
      
                   1,440      context-switches          #    0.001 K/sec                    ( +-  1.26% )
                     366      cpu-migrations            #    0.000 K/sec                    ( +-  5.58% )
                   3,933      page-faults               #    0.002 K/sec                    ( +- 11.08% )
      
       Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
      
                   6,287      context-switches          #    0.001 K/sec                    ( +-  3.65% )
                   3,776      cpu-migrations            #    0.001 K/sec                    ( +-  4.84% )
                   5,702      page-faults               #    0.001 K/sec                    ( +-  9.36% )
      
       Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
      
                   8,776      context-switches          #    0.001 K/sec                    ( +-  0.73% )
                   2,790      cpu-migrations            #    0.000 K/sec                    ( +-  0.98% )
                  10,540      page-faults               #    0.001 K/sec                    ( +-  3.12% )
      
      With patch:
      
       Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
      
                   1,133      context-switches          #    0.001 K/sec                    ( +-  4.72% )
                     123      cpu-migrations            #    0.000 K/sec                    ( +-  3.42% )
                   3,858      page-faults               #    0.002 K/sec                    ( +-  8.52% )
      
       Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
      
                   2,169      context-switches          #    0.000 K/sec                    ( +-  6.19% )
                     189      cpu-migrations            #    0.000 K/sec                    ( +- 12.75% )
                   5,917      page-faults               #    0.001 K/sec                    ( +-  8.09% )
      
       Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
      
                   5,333      context-switches          #    0.001 K/sec                    ( +-  5.91% )
                     506      cpu-migrations            #    0.000 K/sec                    ( +-  3.35% )
                  10,792      page-faults               #    0.001 K/sec                    ( +-  7.75% )
      
      Which show that in these workloads CPU migrations get reduced significantly.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/1490205470-10249-1-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      05b40e05