1. 14 Apr, 2017 3 commits
    • sched/fair: Increase PELT accuracy for small tasks · bb0bd044
      Peter Zijlstra committed
      We truncate (and lose) the lower 10 bits of runtime in
      ___update_load_avg(), this means there's a consistent bias to
      under-account tasks. This is especially significant for small tasks.
      
      Cure this by only forwarding last_update_time to the point we've
      actually accounted for, leaving the remainder for the next time.
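
The idea above can be sketched in user space as follows. This is only an illustration, not the kernel code: `struct pelt_clock`, `pelt_account()` and `pelt_account_truncating()` are hypothetical names, time is assumed to be in nanoseconds, and accounting is assumed to happen in 1024 ns units (the 10 truncated bits).

```c
#include <stdint.h>

/* Hypothetical, pared-down stand-in for the PELT time bookkeeping. */
struct pelt_clock {
	uint64_t last_update_time;	/* ns, end of the time accounted so far */
};

/*
 * Account whole 1024 ns units and advance last_update_time only by the
 * time actually accounted, so the sub-unit remainder is kept for later.
 */
static uint64_t pelt_account(struct pelt_clock *pc, uint64_t now)
{
	uint64_t units = (now - pc->last_update_time) >> 10;

	if (!units)
		return 0;
	pc->last_update_time += units << 10;
	return units;
}

/*
 * For contrast: jump last_update_time straight to 'now', which drops the
 * lower 10 bits of every delta.
 */
static uint64_t pelt_account_truncating(struct pelt_clock *pc, uint64_t now)
{
	uint64_t units = (now - pc->last_update_time) >> 10;

	if (!units)
		return 0;
	pc->last_update_time = now;
	return units;
}
```

With 100 updates spaced 1500 ns apart, the first helper accounts 146 units in total, while the truncating one accounts only 100, which is the under-accounting bias the commit describes.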
      Reported-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix comments · 3841cdc3
      Peter Zijlstra committed
      Historically our periods (or p) argument in PELT denoted the number of
      full periods (what is now d2). However recent patches have changed
      this to the total decay (previously p+1), leading to a confusing
      discrepancy between comments and code.
      
      Try and clarify things by making periods (in code) and p (in comments)
      be the same thing (again).
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix corner case in __accumulate_sum() · 05296e75
      Peter Zijlstra committed
      Paul noticed that in the (periods >= LOAD_AVG_MAX_N) case in
      __accumulate_sum(), the returned contribution value (LOAD_AVG_MAX) is
      incorrect.
      
      This is because at this point, the decay_load() on the old state --
      the first step in accumulate_sum() -- will not have resulted in 0, and
      will therefore result in a sum larger than the maximum value of our
      series. Obviously broken.
      
      Note that:
      
      	decay_load(LOAD_AVG_MAX, LOAD_AVG_MAX_N) = 47742 * (1/2)^(345/32) = ~27
      
      Not to mention that any further contribution from the d3 segment (our
      new period) would also push it over the maximum.
      
      Solve this by noting that we can write our c2 term:
      
      		    p
      	c2 = 1024 \Sum y^n
      		   n=1
      
      In terms of our maximum value:
      
      		    inf		      inf	  p
      	max = 1024 \Sum y^n = 1024 ( \Sum y^n + \Sum y^n + y^0 )
      		    n=0		      n=p+1	 n=1
      
      Further note that:
      
                 inf              inf            inf
              ( \Sum y^n ) y^p = \Sum y^(n+p) = \Sum y^n
                 n=0              n=0            n=p
      
      Combined that gives us:
      
      		    p
      	c2 = 1024 \Sum y^n
      		   n=1
      
      		     inf        inf
      	   = 1024 ( \Sum y^n - \Sum y^n - y^0 )
      		     n=0        n=p+1
      
      	   = max - (max y^(p+1)) - 1024
      
      Further simplify things by dealing with p=0 early on.
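
As a sanity check of the derivation above, the direct sum for c2 and the closed form `max - max * y^(p+1) - 1024` can be compared numerically. This is a floating-point sketch under stated assumptions, not the kernel's fixed-point code: `PELT_Y`, `ipow()`, `c2_direct()` and `c2_closed()` are hypothetical names, `y` is taken as 2^(-1/32), and `max` is the real-valued limit of the series, which differs slightly from the kernel's integer LOAD_AVG_MAX.

```c
/* Assumed decay factor: y = 2^(-1/32), so that y^32 is about 0.5. */
static const double PELT_Y = 0.97857206208770013;

/* Integer power by repeated multiplication; avoids needing libm. */
static double ipow(double base, unsigned int n)
{
	double r = 1.0;

	while (n--)
		r *= base;
	return r;
}

/* c2 by direct summation: 1024 * Sum_{n=1..p} y^n */
static double c2_direct(unsigned int p)
{
	double sum = 0.0, term = 1.0;

	for (unsigned int n = 1; n <= p; n++) {
		term *= PELT_Y;
		sum += term;
	}
	return 1024.0 * sum;
}

/* c2 in closed form: max - max * y^(p+1) - 1024, max = 1024 / (1 - y) */
static double c2_closed(unsigned int p)
{
	double max = 1024.0 / (1.0 - PELT_Y);

	return max - max * ipow(PELT_Y, p + 1) - 1024.0;
}
```

Both functions agree up to rounding error for any p, including p = 0, where both yield 0.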
      Reported-by: Paul Turner <pjt@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Cc: linux-kernel@vger.kernel.org
      Fixes: a481db34 ("sched/fair: Optimize ___update_sched_avg()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 30 Mar, 2017 2 commits
    • sched/fair: Optimize ___update_sched_avg() · a481db34
      Yuyang Du committed
      The main PELT function ___update_load_avg(), which implements the
      accumulation and progression of the geometric average series, is
      implemented along the following lines for the scenario where the time
      delta spans all 3 possible sections (see figure below):
      
        1. add the remainder of the last incomplete period
        2. decay old sum
        3. accumulate new sum in full periods since last_update_time
        4. accumulate the current incomplete period
        5. update averages
      
      Or:
      
                  d1          d2           d3
                  ^           ^            ^
                  |           |            |
                |<->|<----------------->|<--->|
        ... |---x---|------| ... |------|-----x (now)
      
        load_sum' = (load_sum + weight * scale * d1) * y^(p+1) +	(1,2)
      
                                              p
      	      weight * scale * 1024 * \Sum y^n +		(3)
                                             n=1
      
      	      weight * scale * d3 * y^0				(4)
      
        load_avg' = load_sum' / LOAD_AVG_MAX				(5)
      
      Where:
      
       d1 - is the delta part completing the remainder of the last
            incomplete period,
       d2 - is the delta part spanning complete periods, and
       d3 - is the delta part starting the current incomplete period.
      
      We can simplify the code in two steps; the first step is to separate
      the first term into new and old parts like:
      
        (load_sum + weight * scale * d1) * y^(p+1) = load_sum * y^(p+1) +
      					       weight * scale * d1 * y^(p+1)
      
      Once we've done that, it's easy to see that all new terms carry the
      common factors:
      
        weight * scale
      
      If we factor those out, we arrive at the form:
      
        load_sum' = load_sum * y^(p+1) +
      
      	      weight * scale * (d1 * y^(p+1) +
      
      					 p
      			        1024 * \Sum y^n +
      					n=1
      
      				d3 * y^0)
      
      Which results in a simpler, smaller and faster implementation.
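
The equivalence of the two forms above can be checked with a small floating-point model. This is a sketch only: the kernel works in fixed point, and `PELT_Y`, `ypow()`, `geo_sum()`, `load_sum_unfactored()` and `load_sum_factored()` are hypothetical names introduced for illustration, with `y` assumed to be 2^(-1/32).

```c
/* Assumed decay factor: y = 2^(-1/32). */
static const double PELT_Y = 0.97857206208770013;

/* y^n by repeated multiplication. */
static double ypow(unsigned int n)
{
	double r = 1.0;

	while (n--)
		r *= PELT_Y;
	return r;
}

/* Sum_{n=1..p} y^n */
static double geo_sum(unsigned int p)
{
	double sum = 0.0;

	for (unsigned int n = 1; n <= p; n++)
		sum += ypow(n);
	return sum;
}

/* Original form: terms (1,2), (3) and (4) computed separately. */
static double load_sum_unfactored(double load_sum, double weight, double scale,
				  double d1, unsigned int p, double d3)
{
	return (load_sum + weight * scale * d1) * ypow(p + 1)
	     + weight * scale * 1024.0 * geo_sum(p)
	     + weight * scale * d3;
}

/* Factored form: decay the old sum, then scale one combined contribution. */
static double load_sum_factored(double load_sum, double weight, double scale,
				double d1, unsigned int p, double d3)
{
	double contrib = d1 * ypow(p + 1) + 1024.0 * geo_sum(p) + d3;

	return load_sum * ypow(p + 1) + weight * scale * contrib;
}
```

In the factored form, `weight * scale` is multiplied in once instead of three times, which is where the smaller and faster implementation comes from.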
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: matt@codeblueprint.co.uk
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1486935863-25251-3-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Explicitly generate __update_load_avg() instances · 0ccb977f
      Peter Zijlstra committed
      The __update_load_avg() function is an __always_inline because it's
      used with constant propagation to generate different variants of the
      code without having to duplicate it (which would be prone to bugs).
      
      Explicitly instantiate the 3 variants.
      
      Note that most of this is called from rather hot paths, so reducing
      branches is good.
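
The pattern described above can be sketched as follows. All names here (`struct demo_avg`, `demo_update()` and the three wrappers) are hypothetical and the arithmetic is a placeholder; the point is only that each wrapper passes compile-time constants to one always-inlined body, so the compiler can drop the branches that cannot be taken. The `always_inline` attribute is the GCC/Clang spelling.

```c
/* Hypothetical accumulator, standing in for the real averages. */
struct demo_avg {
	unsigned long load_sum;
	unsigned long util_sum;
	unsigned long cfs_sum;
};

/*
 * One shared body. When weight, running or is_cfs_rq are constants at the
 * call site, constant propagation removes the corresponding branches.
 */
static inline __attribute__((always_inline)) void
demo_update(struct demo_avg *sa, unsigned long delta, unsigned long weight,
	    int running, int is_cfs_rq)
{
	if (weight)
		sa->load_sum += weight * delta;
	if (running)
		sa->util_sum += delta;
	if (is_cfs_rq)
		sa->cfs_sum += weight * delta;
}

/* Variant 1: blocked entity, contributes nothing new. */
static void demo_update_blocked(struct demo_avg *sa, unsigned long delta)
{
	demo_update(sa, delta, 0, 0, 0);
}

/* Variant 2: scheduling entity. */
static void demo_update_se(struct demo_avg *sa, unsigned long delta,
			   unsigned long weight, int running)
{
	demo_update(sa, delta, weight, running, 0);
}

/* Variant 3: cfs_rq-like aggregate. */
static void demo_update_cfs_rq(struct demo_avg *sa, unsigned long delta,
			       unsigned long weight, int running)
{
	demo_update(sa, delta, weight, running, 1);
}
```

Callers then use the explicit variants rather than the generic body, which keeps a single source of truth without paying for runtime branches on hot paths.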
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 27 Mar, 2017 1 commit
    • sched/fair: Prefer sibiling only if local group is under-utilized · 05b40e05
      Srikar Dronamraju committed
      If the child domain prefers tasks to go to siblings, the local group could
      end up pulling tasks to itself even if the local group is almost equally
      loaded as the source group.

      Let's assume a 4-core, SMT 2 machine running a 5-thread ebizzy workload.
      Every time the local group has capacity and the source group has at least
      2 threads, the local group tries to pull a task. This causes the threads to
      constantly move between different cores. This is even more pronounced if
      the cores have more threads, as on Power 8 in SMT 8 mode.

      Fix this by only allowing the local group to pull a task if the source group
      has more tasks than the local group.

      Here are the relevant perf stat numbers of a 22-core, SMT 8 Power 8 machine.
      
      Without patch:
       Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
      
                   1,440      context-switches          #    0.001 K/sec                    ( +-  1.26% )
                     366      cpu-migrations            #    0.000 K/sec                    ( +-  5.58% )
                   3,933      page-faults               #    0.002 K/sec                    ( +- 11.08% )
      
       Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
      
                   6,287      context-switches          #    0.001 K/sec                    ( +-  3.65% )
                   3,776      cpu-migrations            #    0.001 K/sec                    ( +-  4.84% )
                   5,702      page-faults               #    0.001 K/sec                    ( +-  9.36% )
      
       Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
      
                   8,776      context-switches          #    0.001 K/sec                    ( +-  0.73% )
                   2,790      cpu-migrations            #    0.000 K/sec                    ( +-  0.98% )
                  10,540      page-faults               #    0.001 K/sec                    ( +-  3.12% )
      
      With patch:
      
       Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
      
                   1,133      context-switches          #    0.001 K/sec                    ( +-  4.72% )
                     123      cpu-migrations            #    0.000 K/sec                    ( +-  3.42% )
                   3,858      page-faults               #    0.002 K/sec                    ( +-  8.52% )
      
       Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
      
                   2,169      context-switches          #    0.000 K/sec                    ( +-  6.19% )
                     189      cpu-migrations            #    0.000 K/sec                    ( +- 12.75% )
                   5,917      page-faults               #    0.001 K/sec                    ( +-  8.09% )
      
       Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
      
                   5,333      context-switches          #    0.001 K/sec                    ( +-  5.91% )
                     506      cpu-migrations            #    0.000 K/sec                    ( +-  3.35% )
                  10,792      page-faults               #    0.001 K/sec                    ( +-  7.75% )
      
      These numbers show that in these workloads CPU migrations are reduced significantly.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/1490205470-10249-1-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 23 Mar, 2017 1 commit
    • sched/fair: Fix FTQ noise bench regression · bc427898
      Vincent Guittot committed
      A regression of the FTQ noise has been reported by Ying Huang,
      on the following hardware:
      
        8 threads Intel(R) Core(TM)i7-4770 CPU @ 3.40GHz with 8G memory
      
      ... which was caused by this commit:
      
        commit 4e516076 ("sched/fair: Propagate asynchrous detach")
      
      The only part of the patch that can increase the noise is the update
      of blocked load of group entity in update_blocked_averages().
      
      We can optimize this call and skip the update of a group entity if its load
      and utilization are already zero and there is no pending propagation of load
      in the task group.
      
      This optimization partly restores the noise score. A more aggressive
      optimization has been tried but showed a worse score.
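
The skip condition described in this commit message can be sketched as a small predicate. This is an illustration under assumptions, not the kernel implementation: `struct demo_group_se` and `demo_skip_blocked_update()` are hypothetical names, and the pending propagation is modelled as a single flag.

```c
#include <stdbool.h>

/* Hypothetical, pared-down view of a group entity's state. */
struct demo_group_se {
	unsigned long load_avg;		/* current load of the group entity */
	unsigned long util_avg;		/* current utilization */
	bool propagate_pending;		/* load propagation still pending? */
};

/*
 * The blocked-load update can be skipped only when there is nothing left
 * to decay and nothing waiting to be propagated.
 */
static bool demo_skip_blocked_update(const struct demo_group_se *se)
{
	if (se->load_avg || se->util_avg)
		return false;
	if (se->propagate_pending)
		return false;
	return true;
}
```

A caller walking the group entities would simply continue to the next entity when the predicate returns true.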
      
      Reported-by: ying.huang@linux.intel.com
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: ying.huang@intel.com
      Fixes: 4e516076 ("sched/fair: Propagate asynchrous detach")
      Link: http://lkml.kernel.org/r/1489758442-2877-1-git-send-email-vincent.guittot@linaro.org
      [ Fixed typos, improved layout. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 16 Mar, 2017 2 commits
  6. 02 Mar, 2017 4 commits
  7. 14 Jan, 2017 6 commits
    • sched/fair: Explain why MIN_SHARES isn't scaled in calc_cfs_shares() · b8fd8423
      Dietmar Eggemann committed
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Turner <pjt@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/e9a4d858-bcf3-36b9-e3a9-449953e34569@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Fix group_entity's share update · 89ee048f
      Vincent Guittot committed
      The update of the share of a cfs_rq is done when its load_avg is updated
      but before the group_entity's load_avg has been updated for the past time
      slot. This generates wrong load_avg accounting which can be significant
      when small tasks are involved in the scheduling.
      
      Let's take the example of a task a that is dequeued from its task group A:
         root
        (cfs_rq)
          \
          (se)
           A
          (cfs_rq)
            \
            (se)
             a
      
      Task "a" was the only task in task group A which becomes idle when a is
      dequeued.
      
      We have the sequence:
      
      - dequeue_entity a->se
          - update_load_avg(a->se)
          - dequeue_entity_load_avg(A->cfs_rq, a->se)
          - update_cfs_shares(A->cfs_rq)
      	A->cfs_rq->load.weight == 0
              A->se->load.weight is updated with the new share (0 in this case)
      - dequeue_entity A->se
          - update_load_avg(A->se) but its weight is now 0 so the last time
            slot (up to a tick) will be accounted with a weight of 0 instead of
            its real weight during the time slot. The last time slot will be
            accounted as an idle one whereas it was a running one.
      
      If the running time of task a is short enough that no tick happens when it
      runs, all running time of group entity A->se will be accounted as idle
      time.
      
      Instead, we should update the share of a cfs_rq (in fact the weight of its
      group entity) only after having updated the load_avg of the group_entity.
      
      update_cfs_shares() now takes the sched_entity as a parameter instead of the
      cfs_rq, and the weight of the group_entity is updated only once its load_avg
      has been synced with current time.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: pjt@google.com
      Link: http://lkml.kernel.org/r/1482335426-7664-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Add missing update_rq_clock() call for task_hot() · 3bed5e21
      Peter Zijlstra committed
      Add the update_rq_clock() call at the top of the callstack instead of
      at the bottom where we find it missing; this aids later efforts to
      minimize the number of update_rq_clock() calls.
      
        WARNING: CPU: 30 PID: 194 at ../kernel/sched/sched.h:797 assert_clock_updated()
        rq->clock_update_flags < RQCF_ACT_SKIP
      
        Call Trace:
          dump_stack()
          __warn()
          warn_slowpath_fmt()
          assert_clock_updated.isra.63.part.64()
          can_migrate_task()
          load_balance()
          pick_next_task_fair()
          __schedule()
          schedule()
          worker_thread()
          kthread()
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Add missing update_rq_clock() in post_init_entity_util_avg() · 4126bad6
      Peter Zijlstra committed
      Address this rq-clock update bug:
      
        WARNING: CPU: 0 PID: 0 at ../kernel/sched/sched.h:797 post_init_entity_util_avg()
        rq->clock_update_flags < RQCF_ACT_SKIP
      
        Call Trace:
          __warn()
          post_init_entity_util_avg()
          wake_up_new_task()
          _do_fork()
          kernel_thread()
          rest_init()
          start_kernel()
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Push rq lock pin/unpin into idle_balance() · 46f69fa3
      Matt Fleming committed
      Future patches will emit warnings if rq_clock() is called before
      update_rq_clock() inside a rq_pin_lock()/rq_unpin_lock() pair.
      
      Since there is only one caller of idle_balance() we can push the
      unpin/repin there.
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@unitn.it>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Link: http://lkml.kernel.org/r/20160921133813.31976-7-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Add wrappers for lockdep_(un)pin_lock() · d8ac8971
      Matt Fleming committed
      In preparation for adding diagnostic checks to catch missing calls to
      update_rq_clock(), provide wrappers for (re)pinning and unpinning
      rq->lock.
      
      Because the pending diagnostic checks allow state to be maintained in
      rq_flags across pin contexts, swap the 'struct pin_cookie' arguments
      for 'struct rq_flags *'.
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@unitn.it>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Link: http://lkml.kernel.org/r/20160921133813.31976-5-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 11 Dec, 2016 2 commits
    • sched/core: Use load_avg for selecting idlest group · 6b94780e
      Vincent Guittot committed
      find_idlest_group() only compares the runnable_load_avg when looking
      for the least loaded group. But in fork-intensive use cases like
      hackbench, where tasks block quickly after the fork, this can lead to
      selecting the same CPU instead of other CPUs, which have similar
      runnable load but a lower load_avg.

      When the runnable_load_avg of 2 CPUs are close, we now take into
      account the amount of blocked load as a 2nd selection factor. There are
      now 3 zones for the runnable_load of the rq:
      
       - [0 .. (runnable_load - imbalance)]:
      	Select the new rq which has significantly less runnable_load
      
       - [(runnable_load - imbalance) .. (runnable_load + imbalance)]:
      	The runnable loads are close so we use load_avg to choose
      	between the 2 rq
      
       - [(runnable_load + imbalance) .. ULONG_MAX]:
      	Keep the current rq which has significantly less runnable_load
      
      The scale factor that is currently used for comparing runnable_load
      doesn't work well with small values. As an example, the use of a
      scaling factor fails as soon as this_runnable_load == 0 because we
      always select the local rq even if min_runnable_load is only 1, which
      doesn't really make sense because they are practically the same. So instead
      of a scaling factor, we use an absolute margin for runnable_load to
      detect CPUs with similar runnable_load and we keep using a scaling
      factor for blocked load.
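
The three zones can be sketched as a small decision function. This is a simplified illustration, not the kernel's find_idlest_group(): `enum demo_pick` and `demo_choose_rq()` are hypothetical names, `imbalance` is the absolute margin, and the load_avg tie-break uses a plain comparison where the commit keeps a scaling factor.

```c
/* Hypothetical outcome of comparing the current rq with a candidate rq. */
enum demo_pick {
	DEMO_PICK_NEW,		/* switch to the candidate rq */
	DEMO_PICK_CURRENT,	/* keep the current rq */
};

/*
 * cur_* describe the current best rq, new_* the candidate.
 * Zone 1: candidate runnable load is below (cur - imbalance): take it.
 * Zone 3: candidate runnable load is above (cur + imbalance): keep current.
 * Zone 2: runnable loads are close, so load_avg decides.
 */
static enum demo_pick demo_choose_rq(unsigned long cur_runnable,
				     unsigned long cur_load_avg,
				     unsigned long new_runnable,
				     unsigned long new_load_avg,
				     unsigned long imbalance)
{
	/* Written with additions only, to avoid unsigned underflow. */
	if (new_runnable + imbalance < cur_runnable)
		return DEMO_PICK_NEW;
	if (new_runnable > cur_runnable + imbalance)
		return DEMO_PICK_CURRENT;
	return new_load_avg < cur_load_avg ? DEMO_PICK_NEW : DEMO_PICK_CURRENT;
}
```

With an absolute margin, a runnable load of 0 versus 1 falls into the middle zone, so the blocked load breaks the tie instead of the local rq always winning.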
      
      For use cases like hackbench, this enables the scheduler to select
      different CPUs during the fork sequence and to spread tasks across the
      system.

      Tests have been done on a Hikey board (ARM based octo cores) for
      several kernels. The results below give min, max, avg and stdev values
      of 18 runs with each configuration.
      
      The patches depend on the "no missing update_rq_clock()" work.
      
      hackbench -P -g 1
      
               ea86cb4b      7dc603c9      v4.8        v4.8+patches
        min    0.049         0.050         0.051       0.048
        avg    0.057         0.057(0%)     0.057(0%)   0.055(+5%)
        max    0.066         0.068         0.070       0.063
        stdev  +/-9%         +/-9%         +/-8%       +/-9%
      
      More performance numbers here:
      
        https://lkml.kernel.org/r/20161203214707.GI20785@codeblueprint.co.uk
      Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: kernellwp@gmail.com
      Cc: umgwanakikbuti@gmail.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1481216215-24651-3-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Fix find_idlest_group() for fork · f519a3f1
      Vincent Guittot committed
      During fork, the utilization of a task is initialized once the rq has been
      selected because the current utilization level of the rq is used to
      set the utilization of the forked task. As the task's utilization is
      still 0 at this step of the fork sequence, it doesn't make sense to
      look for some spare capacity that can fit the task's utilization.
      Furthermore, I can see perf regressions for the test:
      
         hackbench -P -g 1
      
      because the least loaded policy is always bypassed and tasks are not
      spread during fork.
      
      With this patch and the fix below, we are back to the same performance as
      for v4.8. The fix below is only a temporary one used for the test
      until a smarter solution is found because we can't simply remove the
      test, which is useful for other benchmarks:
      
      | @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
      |
      |	avg_cost = this_sd->avg_scan_cost;
      |
      | -	/*
      | -	 * Due to large variance we need a large fuzz factor; hackbench in
      | -	 * particularly is sensitive here.
      | -	 */
      | -	if ((avg_idle / 512) < avg_cost)
      | -		return -1;
      | -
      |	time = local_clock();
      |
      |	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
      Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: kernellwp@gmail.com
      Cc: umgwanakikbuti@gmail.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1481216215-24651-2-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. 24 Nov, 2016 1 commit
  10. 23 Nov, 2016 1 commit
  11. 16 Nov, 2016 11 commits
  12. 27 Oct, 2016 1 commit
  13. 20 Oct, 2016 1 commit
  14. 19 Oct, 2016 1 commit
    • sched/fair: Fix incorrect task group ->load_avg · b5a9b340
      Vincent Guittot committed
      A scheduler performance regression has been reported by Joseph Salisbury,
      which he bisected back to:
      
        3d30544f ("sched/fair: Apply more PELT fixes")
      
      The regression triggers when several levels of task groups are involved
      (read: SystemD) and cpu_possible_mask != cpu_present_mask.
      
      The root cause is that group entity's load (tg_child->se[i]->avg.load_avg)
      is initialized to scale_load_down(se->load.weight). During the creation of
      a child task group, its group entities on possible CPUs are attached to
      parent's cfs_rq (tg_parent) and their loads are added to the parent's load
      (tg_parent->load_avg) with update_tg_load_avg().
      
      But only the load on online CPUs will then be updated to reflect real load,
      whereas load on other CPUs will stay at the initial value.
      
      The result is a tg_parent->load_avg that is higher than the real load, the
      weight of group entities (tg_parent->se[i]->load.weight) on online CPUs is
      smaller than it should be, and the task group gets less running time than
      it could expect.
      
      ( This situation can be detected with /proc/sched_debug. The ".tg_load_avg"
        of the task group will be much higher than sum of ".tg_load_avg_contrib"
        of online cfs_rqs of the task group. )
      
      The load of group entities doesn't have to be initialized to anything other
      than 0 because their load will increase when an entity is attached.
      Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@vger.kernel.org> # 4.8.x
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: joonwoop@codeaurora.org
      Fixes: 3d30544f ("sched/fair: Apply more PELT fixes")
      Link: http://lkml.kernel.org/r/1476881123-10159-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  15. 11 Oct, 2016 2 commits
    • sched/fair: Fix sched domains NULL dereference in select_idle_sibling() · 9cfb38a7
      Wanpeng Li committed
      Commit:
      
        10e2f1ac ("sched/core: Rewrite and improve select_idle_siblings()")
      
      ... improved select_idle_sibling(), but also triggered a regression (crash)
      during CPU-hotplug:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
        IP: [<ffffffffb10cd332>] select_idle_sibling+0x1c2/0x4f0
        Call Trace:
         <IRQ>
          select_task_rq_fair+0x749/0x930
          ? select_task_rq_fair+0xb4/0x930
          ? __lock_is_held+0x54/0x70
          try_to_wake_up+0x19a/0x5b0
          default_wake_function+0x12/0x20
          autoremove_wake_function+0x12/0x40
          __wake_up_common+0x55/0x90
          __wake_up+0x39/0x50
          wake_up_klogd_work_func+0x40/0x60
          irq_work_run_list+0x57/0x80
          irq_work_run+0x2c/0x30
          smp_irq_work_interrupt+0x2e/0x40
          irq_work_interrupt+0x96/0xa0
         <EOI>
          ? _raw_spin_unlock_irqrestore+0x45/0x80
          try_to_wake_up+0x4a/0x5b0
          wake_up_state+0x10/0x20
          __kthread_unpark+0x67/0x70
          kthread_unpark+0x22/0x30
          cpuhp_online_idle+0x3e/0x70
          cpu_startup_entry+0x6a/0x450
          start_secondary+0x154/0x180
      
      This can be reproduced by running the ftrace test case of kselftest; the
      test case will hot-unplug the CPU and the CPU will attach to the NULL
      sched-domain during scheduler teardown.

      Step 2 of the rewritten select_idle_siblings():
      
        | Step 2) tracks the average cost of the scan and compares this to the
        | average idle time guestimate for the CPU doing the wakeup.
      
      If the CPU doing the wakeup is the CPU being hot-unplugged, then the NULL
      sched domain will be dereferenced to acquire the average cost of the scan.

      This patch fixes it by failing the search for an idle CPU in the LLC
      if this sched domain is NULL.
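
The shape of the fix can be sketched as an early NULL check before the scan cost is read. This is a user-space illustration under assumptions: `struct demo_sched_domain` and `demo_select_idle_cpu()` are hypothetical names, the `avg_idle / 512` comparison mirrors the fuzz-factor test quoted elsewhere in this log, and the actual CPU scan is replaced by returning `target`.

```c
#include <stddef.h>

/* Hypothetical stand-in holding only the field the check needs. */
struct demo_sched_domain {
	unsigned long avg_scan_cost;	/* average cost of scanning the LLC */
};

/*
 * Returns -1 when no idle CPU search is attempted, otherwise the chosen
 * CPU (here simply 'target', as a placeholder for the real scan).
 */
static int demo_select_idle_cpu(const struct demo_sched_domain *this_sd,
				unsigned long avg_idle, int target)
{
	/* A CPU being hot-unplugged may have no sched domain attached. */
	if (!this_sd)
		return -1;

	/* Only safe to dereference after the NULL check above. */
	if ((avg_idle / 512) < this_sd->avg_scan_cost)
		return -1;

	return target;
}
```

Failing the search is a safe fallback here because the caller already has to handle the case where no idle CPU is found.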
      Tested-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1475971443-3187-1-git-send-email-wanpeng.li@hotmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • latent_entropy: Mark functions with __latent_entropy · 0766f788
      Emese Revfy committed
      The __latent_entropy gcc attribute can be used only on functions and
      variables.  If it is on a function then the plugin will instrument it for
      gathering control-flow entropy. If the attribute is on a variable then
      the plugin will initialize it with random contents.  The variable must
      be an integer, an integer array type or a structure with integer fields.
      
      These specific functions have been selected because they are init
      functions (to help gather boot-time entropy), are called at unpredictable
      times, or they have variable loops, each of which provide some level of
      latent entropy.
      Signed-off-by: Emese Revfy <re.emese@gmail.com>
      [kees: expanded commit message]
      Signed-off-by: Kees Cook <keescook@chromium.org>
  16. 30 Sep, 2016 1 commit
    • sched/fair: Fix min_vruntime tracking · b60205c7
      Peter Zijlstra committed
      While going through enqueue/dequeue to review the movement of
      set_curr_task() I noticed that the (2nd) update_min_vruntime() call in
      dequeue_entity() is suspect.
      
      It turns out it's actually wrong, because it will consider
      cfs_rq->curr, which could be the entry we just normalized. This mixes
      different vruntime forms and leads to failure.

      The purpose of the second update_min_vruntime() is to move
      min_vruntime forward if the entity we just removed is the one that was
      holding it back; _except_ for the DEQUEUE_SAVE case, because then we
      know it's a temporary removal and it will come back.

      However, since we do put_prev_task() _after_ dequeue(), cfs_rq->curr
      will still be set (and per the above, can be transformed into a
      different unit), so update_min_vruntime() should also consider
      curr->on_rq. This also fixes another corner case where the enqueue
      (which also does update_curr()->update_min_vruntime()) happens on the
      rq->lock break in schedule(), between dequeue and put_prev_task.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Fixes: 1e876231 ("sched: Fix ->min_vruntime calculation in dequeue_entity()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>