1. 02 Apr, 2015 2 commits
  2. 27 Mar, 2015 15 commits
    • sched/deadline: Fix rt runtime corruption when dl fails its global constraints · a1963b81
      Committed by Wanpeng Li
      One version of sched_rt_global_constraints() (the !rt-cgroup one)
      changes state, so if the later sched_dl_global_constraints() call
      fails, the state is left inconsistent.
      
      Fix this by changing the order of the calls.
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      [ Improved the changelog. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Juri Lelli <juri.lelli@arm.com>
      Link: http://lkml.kernel.org/r/1426590931-4639-2-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
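
The ordering fix described in this commit can be pictured with a small user-space sketch. Everything in it is hypothetical (the `limits` struct, `check_dl()`, `apply_rt()` and `set_limits()` are illustration names, not kernel code); it only shows the general idea of running the check that can fail before the step that changes state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical global state standing in for the RT runtime settings. */
struct limits {
    long rt_runtime;   /* committed RT runtime */
    long dl_bandwidth; /* bandwidth already reserved by deadline tasks */
};

/* Pure check: may fail, never modifies state. */
static bool check_dl(const struct limits *l, long new_runtime)
{
    return new_runtime >= l->dl_bandwidth;
}

/* Commit step: modifies state, cannot fail. */
static void apply_rt(struct limits *l, long new_runtime)
{
    l->rt_runtime = new_runtime;
}

/* Run the fallible check first, so a failure leaves *l untouched. */
static int set_limits(struct limits *l, long new_runtime)
{
    assert(l != NULL);
    if (!check_dl(l, new_runtime))
        return -1;
    apply_rt(l, new_runtime);
    return 0;
}
```

With the calls in the opposite order, a failed check would return an error after `rt_runtime` had already been overwritten, which is the kind of inconsistency the commit message describes.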
    • sched/deadline: Avoid a superfluous check · bd4bde14
      Committed by Wanpeng Li
      Since commit 40767b0d ("sched/deadline: Fix deadline parameter
      modification handling") we clear the throttled state when switching
      from a dl task, therefore we should never find it set when switching to
      a dl task.
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      [ Improved the changelog. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Juri Lelli <juri.lelli@arm.com>
      Link: http://lkml.kernel.org/r/1426590931-4639-1-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Improve load balancing in the presence of idle CPUs · d4573c3e
      Committed by Preeti U Murthy
      When a CPU is kicked to do nohz idle balancing, it wakes up to do load
      balancing on itself, followed by load balancing on behalf of idle CPUs.
      But it may end up with load after the load balancing attempt on itself.
      This aborts nohz idle balancing. As a result several idle CPUs are left
      without tasks till such a time that an ILB CPU finds it unfavorable to
      pull tasks upon itself. This delays spreading of load across idle CPUs
      and worse, clutters only a few CPUs with tasks.
      
      The effect of the above problem was observed on an SMT8 POWER server
      with 2 levels of numa domains. Busy loops equal to number of cores were
      spawned. Since load balancing on fork/exec is discouraged across numa
      domains, all busy loops would start on one of the numa domains. However
      it was expected that eventually one busy loop would run per core across
      all domains due to nohz idle load balancing. But it was observed that it
      took as long as 10 seconds to spread the load across numa domains.
      
      Further investigation showed that this was a consequence of the
      following:
      
       1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
          load balancing [Given the experiment, up to 6 CPUs per core could be
          potentially idle in this domain.]
      
       2. However the ILB CPU would call load_balance() on itself before
          initiating nohz idle load balancing.
      
       3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
          tasks from its sibling cores to even out load.
      
       4. Now that the ILB CPU was no longer idle, it would abort nohz idle
          load balancing
      
      As a result the opportunities to spread load across numa domains were
      lost until such a time that the cores within the first numa domain had
      equal number of tasks among themselves.  This is a pretty bad scenario,
      since the cores within the first numa domain would have as many as 4
      tasks each, while cores in the neighbouring numa domains would all
      remain idle.
      
      Fix this, by checking if a CPU was woken up to do nohz idle load
      balancing, before it does load balancing upon itself. This way we allow
      idle CPUs across the system to do load balancing which results in
      quicker spread of load, instead of performing load balancing within the
      local sched domain hierarchy of the ILB CPU alone under circumstances
      such as above.
      Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Jason Low <jason.low2@hp.com>
      Cc: benh@kernel.crashing.org
      Cc: daniel.lezcano@linaro.org
      Cc: efault@gmx.de
      Cc: iamjoonsoo.kim@lge.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: riel@redhat.com
      Cc: srikar@linux.vnet.ibm.com
      Cc: svaidy@linux.vnet.ibm.com
      Cc: tim.c.chen@linux.intel.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/20150326130014.21532.17158.stgit@preeti.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Optimize freq invariant accounting · dfbca41f
      Committed by Peter Zijlstra
      Currently the freq invariant accounting (in
      __update_entity_runnable_avg() and sched_rt_avg_update()) gets the
      scale factor from a weak function call. This means that even for archs
      that use the default implementation the compiler cannot see into this
      function and optimize the extra scaling math away.
      
      This is sad, esp. since it's a 64-bit multiplication which can be quite
      costly on some platforms.
      
      So replace the weak function with #ifdef and __always_inline goo. This
      is not quite as nice from an arch support PoV but should at least
      result in compile time errors if done wrong.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: Paul Turner <pjt@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/20150323131905.GF23123@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
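
A rough user-space sketch of the "#ifdef plus always-inline" pattern this commit describes follows. It assumes GCC or Clang for the `always_inline` attribute, simplifies `arch_scale_freq_capacity()` to take no arguments, and adds a hypothetical helper `scale_delta()`; none of this is the kernel's actual code.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1UL << SCHED_CAPACITY_SHIFT)

/*
 * An architecture would supply its own version by defining the macro
 * arch_scale_freq_capacity before this point. Without one, the default
 * below is a compile-time constant that the compiler can fold away.
 */
#ifndef arch_scale_freq_capacity
static inline __attribute__((always_inline))
unsigned long arch_scale_freq_capacity(void)
{
    return SCHED_CAPACITY_SCALE;
}
#endif

/* Hypothetical helper: scale a time delta by the frequency factor. */
static inline uint64_t scale_delta(uint64_t delta)
{
    /* Guard the 64-bit multiplication against overflow. */
    assert(delta <= (UINT64_MAX >> SCHED_CAPACITY_SHIFT));
    return (delta * arch_scale_freq_capacity()) >> SCHED_CAPACITY_SHIFT;
}
```

Because the default is visible at the call site, multiplying by 1024 and shifting right by 10 can collapse to the identity, which is the optimization a weak function call would hide from the compiler.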
    • sched: Move CFS tasks to CPUs with higher capacity · 1aaf90a4
      Committed by Vincent Guittot
      When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
      capacity for CFS tasks can be significantly reduced. Once we detect such a
      situation by comparing cpu_capacity_orig and cpu_capacity, we trigger an idle
      load balance to check if it's worth moving its tasks to an idle CPU.
      
      It's worth trying to move the task before the CPU is fully utilized to
      minimize the preemption by irq or RT tasks.
      
      Once the idle load_balance has selected the busiest CPU, it will look for an
      active load balance for only two cases:
      
        - There is only 1 task on the busiest CPU.
      
        - We haven't been able to move a task of the busiest rq.
      
      A CPU with a reduced capacity is included in the 1st case, and it's worth
      actively migrating its task if the idle CPU has more available capacity for
      CFS tasks. This test has been added in need_active_balance.
      
      As a sidenote, this will not generate more spurious ilb because we already
      trigger an ilb if there is more than 1 busy cpu. If this cpu is the only one that
      has a task, we will trigger the ilb once for migrating the task.
      
      The nohz_kick_needed function has been cleaned up a bit while adding the new
      test.
      
      env.src_cpu and env.src_rq must be set unconditionally because they are used
      in need_active_balance which is called even if busiest->nr_running equals 1.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-12-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
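
The capacity comparison described above could look roughly like the following sketch. The `cpu` struct, the function names and the percentage margin (`margin_pct`) are assumptions made for illustration; the commit message only states that cpu_capacity_orig is compared with cpu_capacity and that the idle CPU must have more available capacity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU view used only for this illustration. */
struct cpu {
    unsigned long capacity_orig; /* capacity before RT/IRQ pressure */
    unsigned long capacity;      /* capacity left for CFS tasks */
    unsigned int nr_running;     /* runnable CFS tasks */
};

/* True when CFS capacity is noticeably below the original capacity. */
static bool capacity_reduced(const struct cpu *c, unsigned int margin_pct)
{
    assert(margin_pct >= 100);
    return c->capacity * margin_pct < c->capacity_orig * 100;
}

/*
 * True when the single task on src is worth migrating to dst: src has
 * lost capacity and dst offers more of it by at least the margin.
 */
static bool worth_active_migration(const struct cpu *src,
                                   const struct cpu *dst,
                                   unsigned int margin_pct)
{
    return src->nr_running == 1 &&
           capacity_reduced(src, margin_pct) &&
           src->capacity * margin_pct < dst->capacity * 100;
}
```

For example, with a 125% margin a source CPU left with 600 of its original 1024 capacity would qualify against a fully idle CPU of capacity 1024, while one left with 1000 would not.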
    • sched: Add SD_PREFER_SIBLING for SMT level · caff37ef
      Committed by Vincent Guittot
      Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
      the scheduler will place at least one task per core.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-11-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Remove unused struct sched_group_capacity::capacity_orig · dc7ff76e
      Committed by Vincent Guittot
      The 'struct sched_group_capacity::capacity_orig' field is no longer used
      in the scheduler so we can remove it.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425378903-5349-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Replace capacity_factor by usage · ea67821b
      Committed by Vincent Guittot
      The scheduler tries to compute how many tasks a group of CPUs can handle by
      assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
      SCHED_CAPACITY_SCALE.
      
      'struct sg_lb_stats:group_capacity_factor' divides the capacity of the group
      by SCHED_LOAD_SCALE to estimate how many tasks can run in the group. Then, it
      compares this value with the sum of nr_running to decide if the group is
      overloaded or not.
      
      But the 'group_capacity_factor' concept hardly works for SMT systems: it
      sometimes works for big cores but fails to do the right thing for little cores.
      
      Below are two examples to illustrate the problem that this patch solves:
      
      1 - If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
         (640 as an example), a group of 3 CPUs will have a max capacity_factor of 2
         (div_round_closest(3*640/1024) = 2) which means that it will be seen as
         overloaded even if we have only one task per CPU.
      
      2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
         (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
         (at max and thanks to the fix [0] for SMT systems that prevents the appearance
         of ghost CPUs) but if one CPU is fully used by rt tasks (and its capacity is
         reduced to nearly nothing), the capacity factor of the group will still be 4
         (div_round_closest(3*1512/1024) = 5 which is capped to 4 with [0]).
      
      So, this patch tries to solve this issue by removing capacity_factor and
      replacing it with the 2 following metrics:
      
        - The available CPU's capacity for CFS tasks which is already used by
          load_balance().
      
        - The usage of the CPU by the CFS tasks. For the latter, utilization_avg_contrib
          has been re-introduced to compute the usage of a CPU by CFS tasks.
      
      'group_capacity_factor' and 'group_has_free_capacity' have been removed and replaced
      by 'group_no_capacity'. We compare the number of tasks with the number of CPUs and
      we evaluate the level of utilization of the CPUs to determine if a group is
      overloaded or if a group has capacity to handle more tasks.
      
      For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
      so it will be selected in priority (among the overloaded groups). Since [1],
      SD_PREFER_SIBLING is no longer affected by the computation of 'load_above_capacity'
      because local is not overloaded.
      
      [1] 9a5d9ba6 ("sched/fair: Allow calculate_imbalance() to move idle cpus")
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1425052454-25797-9-git-send-email-vincent.guittot@linaro.org
      [ Tidied up the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
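
The first example in this commit message (three CPUs of capacity 640) can be checked with a few lines of C. This is a sketch under assumptions: `div_round_closest()` is written here as plain round-to-nearest division, and `capacity_factor()` and `looks_overloaded()` are hypothetical helper names, not the removed kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* Round-to-nearest integer division for unsigned operands. */
static unsigned long div_round_closest(unsigned long n, unsigned long d)
{
    assert(d != 0);
    return (n + d / 2) / d;
}

/* Estimated number of tasks a group of identical CPUs can handle. */
static unsigned long capacity_factor(unsigned int nr_cpus,
                                     unsigned long cpu_capacity)
{
    return div_round_closest(nr_cpus * cpu_capacity, SCHED_CAPACITY_SCALE);
}

/* A group is seen as overloaded when it runs more tasks than its factor. */
static bool looks_overloaded(unsigned int nr_running, unsigned long factor)
{
    return nr_running > factor;
}
```

Here 3 * 640 = 1920, which rounds to a factor of 2, so a group running just one task on each of its three CPUs is already classified as overloaded. That is the misjudgement the usage-based metrics are meant to avoid.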
    • sched: Calculate CPU's usage statistic and put it into struct sg_lb_stats::group_usage · 8bb5b00c
      Committed by Vincent Guittot
      Monitor the usage level of each group of each sched_domain level. The usage is
      the portion of cpu_capacity_orig that is currently used on a CPU or group of
      CPUs. We use the utilization_load_avg to evaluate the usage level of each
      group.
      
      The utilization_load_avg only takes into account the running time of the CFS
      tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
      utilized. Nevertheless, we must cap utilization_load_avg which can be
      temporarily greater than SCHED_LOAD_SCALE after the migration of a task to this
      CPU and until the metrics are stabilized.
      
      The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
      running load on the CPU whereas the available capacity for the CFS task is in
      the range [0..cpu_capacity_orig]. In order to test if a CPU is fully utilized
      by CFS tasks, we have to scale the utilization in the cpu_capacity_orig range
      of the CPU to get the usage of the latter. The usage can then be compared with
      the available capacity (i.e. cpu_capacity) to deduce the usage level of a CPU.
      
      The frequency scaling invariance of the usage is not taken into account in this
      patch, it will be solved in another patch which will deal with frequency
      scaling invariance on the utilization_load_avg.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425455327-13508-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
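
The scaling and capping described in this commit can be sketched as below. The function name `cpu_usage()` is hypothetical, and SCHED_LOAD_SHIFT is assumed to be 10 (so SCHED_LOAD_SCALE is 1024); the text itself does not give these values.

```c
#include <assert.h>

#define SCHED_LOAD_SHIFT 10                       /* assumed value */
#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT)

/*
 * Map a utilization in [0..SCHED_LOAD_SCALE] onto the range
 * [0..capacity_orig], capping transient values above the scale.
 */
static unsigned long cpu_usage(unsigned long utilization_load_avg,
                               unsigned long capacity_orig)
{
    assert(capacity_orig > 0);
    if (utilization_load_avg >= SCHED_LOAD_SCALE)
        return capacity_orig;
    return (utilization_load_avg * capacity_orig) >> SCHED_LOAD_SHIFT;
}
```

A CPU with an original capacity of 640 that is busy half of the time (utilization 512) gets a usage of 320, and a transient utilization of 2000 right after a migration is capped at 640.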
    • sched: Add struct rq::cpu_capacity_orig · ca6d75e6
      Committed by Vincent Guittot
      This new field 'cpu_capacity_orig' reflects the original capacity of a CPU
      before being altered by rt tasks and/or IRQ.
      
      The cpu_capacity_orig will be used:
      
        - to detect when the capacity of a CPU has been noticeably reduced so we can
          trigger load balance to look for a CPU with better capacity. As an example, we
          can detect when a CPU handles a significant amount of irq
          (with CONFIG_IRQ_TIME_ACCOUNTING) but this CPU is seen as an idle CPU by
          the scheduler whereas CPUs, which are really idle, are available.
      
        - to evaluate the available capacity for CFS tasks.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-7-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Make scale_rt invariant with frequency · b5b4860d
      Committed by Vincent Guittot
      The average running time of RT tasks is used to estimate the remaining compute
      capacity for CFS tasks. This remaining capacity is the original capacity scaled
      down by a factor (aka scale_rt_capacity). This estimation of available capacity
      must also be invariant with frequency scaling.
      
      A frequency scaling factor is applied on the running time of the RT tasks for
      computing scale_rt_capacity.
      
      In sched_rt_avg_update(), we now scale the RT execution time like below:
      
        rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT
      
      Then, scale_rt_capacity can be summarized by:
      
        scale_rt_capacity = SCHED_CAPACITY_SCALE * available / total
      
      with available = total - rq->rt_avg
      
      This has been optimized in current code by:
      
        scale_rt_capacity = available / (total >> SCHED_CAPACITY_SHIFT)
      
      But we can also develop the equation like below:
      
        scale_rt_capacity = SCHED_CAPACITY_SCALE - ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / total)
      
      and we can optimize the equation by removing the SCHED_CAPACITY_SHIFT shift in
      the computation of rq->rt_avg and scale_rt_capacity():
      
        rq->rt_avg += rt_delta * arch_scale_freq_capacity()
      
      and
      
        scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total)
      
      arch_scale_freq_capacity() will be called in the hot path of the scheduler,
      so it must be a short and efficient function.
      
      As an example, arch_scale_freq_capacity() should return a cached value that
      is updated periodically outside of the hot path.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-6-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
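
The two forms of the scale_rt_capacity computation in this commit message can be compared numerically. The sketch below is a user-space illustration with hypothetical function names (`scale_rt_shifted()`, `scale_rt_optimized()`); the frequency factor is passed in as a plain argument instead of coming from arch_scale_freq_capacity().

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1ULL << SCHED_CAPACITY_SHIFT)

/* First form: shift when accumulating rt_avg, then scale available/total. */
static uint64_t scale_rt_shifted(uint64_t rt_delta, uint64_t freq,
                                 uint64_t total)
{
    assert(total > 0 && rt_delta <= total && freq <= SCHED_CAPACITY_SCALE);
    uint64_t rt_avg = (rt_delta * freq) >> SCHED_CAPACITY_SHIFT;
    uint64_t available = total - rt_avg;
    return SCHED_CAPACITY_SCALE * available / total;
}

/* Second form: keep rt_avg unshifted, subtract its ratio from the scale. */
static uint64_t scale_rt_optimized(uint64_t rt_delta, uint64_t freq,
                                   uint64_t total)
{
    assert(total > 0 && rt_delta <= total && freq <= SCHED_CAPACITY_SCALE);
    uint64_t rt_avg = rt_delta * freq;
    return SCHED_CAPACITY_SCALE - rt_avg / total;
}
```

With RT tasks running for a quarter of the period at full frequency, both forms give 768 out of 1024. At half frequency the same running time counts for half as much and both give 896. Because of integer truncation the two forms are not guaranteed to agree exactly for every input.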
    • sched: Make sched entity usage tracking scale-invariant · 0c1dc6b2
      Committed by Morten Rasmussen
      Apply frequency scale-invariance correction factor to usage tracking.
      
      Each segment of the running_avg_sum geometric series is now scaled by the
      current frequency so the utilization_avg_contrib of each entity will be
      invariant with frequency scaling.
      
      As a result, utilization_load_avg which is the sum of utilization_avg_contrib,
      becomes invariant too. So the usage level that is returned by get_cpu_usage(),
      stays relative to the max frequency as the cpu_capacity which it is compared against.
      
      Then, we want to keep the load tracking values in a 32-bit type, which implies
      that the max value of {runnable|running}_avg_sum must be lower than
      2^32/88761=48388 (88761 is the max weight of a task). As LOAD_AVG_MAX = 47742,
      arch_scale_freq_capacity() must return a value less than
      (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE = 1024).
      So we define the range to [0..SCHED_CAPACITY_SCALE] in order to avoid overflow.
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: Paul Turner <pjt@google.com>
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425455186-13451-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
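
The overflow bound quoted in this commit message can be reproduced with integer arithmetic. The constants 88761 and 47742 come from the text; the function names and the `scaled_sum_fits_u32()` check are hypothetical and only illustrate why the scale factor is limited to 1024.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define MAX_TASK_WEIGHT 88761ULL /* max weight of a task, from the text */
#define LOAD_AVG_MAX    47742ULL /* from the text */

/* Largest avg_sum whose product with the max weight stays below 2^32. */
static uint64_t max_avg_sum(void)
{
    return (1ULL << 32) / MAX_TASK_WEIGHT;
}

/* Upper bound on the frequency scale factor implied by that limit. */
static uint64_t max_freq_scale(void)
{
    return (max_avg_sum() << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
}

/*
 * Hypothetical check: does a saturated sum, scaled by `scale`, still fit
 * in 32 bits once multiplied by the maximum task weight?
 */
static bool scaled_sum_fits_u32(uint64_t scale)
{
    assert(scale <= UINT32_MAX);
    uint64_t sum = (LOAD_AVG_MAX * scale) >> SCHED_CAPACITY_SHIFT;
    return sum * MAX_TASK_WEIGHT <= UINT32_MAX;
}
```

A scale factor of 1024 keeps the product within 32 bits, while a factor such as 1100 would push it past the limit.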
    • sched: Remove frequency scaling from cpu_capacity · a8faa8f5
      Committed by Vincent Guittot
      Now that arch_scale_cpu_capacity has been introduced to scale the original
      capacity, the arch_scale_freq_capacity is no longer used (it was
      previously used by ARM arch).
      
      Remove arch_scale_freq_capacity from the computation of cpu_capacity.
      The frequency invariance will be handled in the load tracking and not in
      the CPU capacity. arch_scale_freq_capacity will be revisited for scaling
      load with the current frequency of the CPUs in a later patch.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-4-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Track group sched_entity usage contributions · 21f44866
      Committed by Morten Rasmussen
      Add usage contribution tracking for group entities. Unlike
      se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
      entities is the sum of se->avg.utilization_avg_contrib for all entities on the
      group runqueue.
      
      It is _not_ influenced in any way by the task group h_load. Hence it is
      representing the actual cpu usage of the group, not its intended load
      contribution which may differ significantly from the utilization on
      lightly utilized systems.
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: Paul Turner <pjt@google.com>
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-3-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Add sched_avg::utilization_avg_contrib · 36ee28e4
      Committed by Vincent Guittot
      Add new statistics which reflect the average time a task is running on the CPU
      and the sum of these running times of the tasks on a runqueue. The latter is
      named utilization_load_avg.
      
      This patch is based on the usage metric that was proposed in the 1st
      versions of the per-entity load tracking patchset by Paul Turner
      <pjt@google.com> but that has been removed afterwards. This version differs from
      the original one in the sense that it's not linked to task_group.
      
      The rq's utilization_load_avg will be used to check if a rq is overloaded or
      not instead of trying to compute how many tasks a group of CPUs can handle.
      
      Rename runnable_avg_period into avg_period as it is now used with both
      runnable_avg_sum and running_avg_sum.
      
      Add some descriptions of the variables to explain their differences.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: Paul Turner <pjt@google.com>
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-2-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 23 Mar, 2015 2 commits
    • sched/rt: Use IPI to trigger RT task push migration instead of pulling · b6366f04
      Committed by Steven Rostedt
      When debugging the latencies on a 40 core box, where we hit 300 to
      500 microsecond latencies, I found there was a huge contention on the
      runqueue locks.
      
      Investigating it further, running ftrace, I found that it was due to
      the pulling of RT tasks.
      
      The test that was run was the following:
      
       cyclictest --numa -p95 -m -d0 -i100
      
      This created a thread on each CPU, that would set its wakeup in iterations
      of 100 microseconds. The -d0 means that all the threads had the same
      interval (100us). Each thread sleeps for 100us and wakes up and measures
      its latencies.
      
      cyclictest is maintained at:
       git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git
      
      What happened was another RT task would be scheduled on one of the CPUs
      that was running our test, when the other CPU tests went to sleep and
      scheduled idle. This caused the "pull" operation to execute on all
      these CPUs. Each one of these saw the RT task that was overloaded on
      the CPU of the test that was still running, and each one tried
      to grab that task in a thundering herd way.
      
      To grab the task, each thread would do a double rq lock grab, grabbing
      its own lock as well as the rq of the overloaded CPU. As the sched
      domains on this box were rather flat for its size, I saw up to 12 CPUs
      block on this lock at once. This caused a ripple effect with the
      rq locks especially since the taking was done via a double rq lock, which
      means that several of the CPUs had their own rq locks held while trying
      to take this rq lock. As these locks were blocked, any wakeups or load
      balancing on these CPUs would also block on these locks, and the wait
      time escalated.
      
      I've tried various methods to lessen the load, but things like an
      atomic counter to only let one CPU grab the task won't work, because
      the task may have a limited affinity, and we may pick the wrong
      CPU to take that lock and do the pull, to only find out that the
      CPU we picked isn't in the task's affinity.
      
      Instead of doing the PULL, I now have the CPUs that want the pull to
      send over an IPI to the overloaded CPU, and let that CPU pick what
      CPU to push the task to. No more need to grab the rq lock, and the
      push/pull algorithm still works fine.
      
      With this patch, the latency dropped to just 150us over a 20 hour run.
      Without the patch, the huge latencies would trigger in seconds.
      
      I've created a new sched feature called RT_PUSH_IPI, which is enabled
      by default.
      
      When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
      and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
      is enabled, the IPI is sent to the overloaded CPU to do a push.
      
      To enable or disable this at run time:
      
       # mount -t debugfs nodev /sys/kernel/debug
       # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
      or
       # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
      
      Update: This original patch would send an IPI to all CPUs in the RT overload
      list. But that could theoretically cause the reverse issue. That is, there
      could be lots of overloaded RT queues and one CPU lowers its priority. It would
      then send an IPI to all the overloaded RT queues and they could then all try
      to grab the rq lock of the CPU lowering its priority, and then we have the
      same problem.
      
      The latest design sends out only one IPI to the first overloaded CPU. It tries to
      push any tasks that it can, and then looks for the next overloaded CPU that can
      push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable
      tasks that have priorities greater than the source CPU are covered. In case the
      source CPU lowers its priority again, a flag is set to tell the IPI traversal to
      restart with the first RT overloaded CPU after the source CPU.
      Parts-suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joern Engel <joern@purestorage.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Fix RLIMIT_RTTIME when PI-boosting to RT · 746db944
      Committed by Brian Silverman
      When non-realtime tasks get priority-inheritance boosted to a realtime
      scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
      counter used for checking this (the same one used for SCHED_RR
      timeslices) was not getting reset. This meant that tasks running with a
      non-realtime scheduling class which are repeatedly boosted to a realtime
      one, but never block while they are running realtime, eventually hit the
      timeout without ever running for a time over the limit. This patch
      resets the realtime timeslice counter when un-PI-boosting from an RT to
      a non-RT scheduling class.
      
      I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
      mutex which induces priority boosting and spins while boosted that gets
      killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
      applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
      does happen eventually with PREEMPT_VOLUNTARY kernels.
      Signed-off-by: Brian Silverman <brian@peloton-tech.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: austin@peloton-tech.com
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      746db944
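The failure mode described in commit 746db944 can be illustrated with a toy userspace model. This is only a sketch under stated assumptions: `toy_task`, `toy_tick`, `toy_boost`, `toy_deboost`, and `toy_run` are hypothetical names invented for illustration, and the model is not the kernel's code. It shows how a counter that accumulates across boosts can exceed a limit even though no single boosted period does, and how resetting it on de-boost avoids that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; every name here is hypothetical, not a kernel symbol. */
struct toy_task {
    bool rt;               /* currently in a realtime class (e.g. PI-boosted) */
    unsigned long timeout; /* ticks accumulated while realtime */
};

/* One tick: count it if the task is realtime; report whether the limit is exceeded. */
static bool toy_tick(struct toy_task *t, unsigned long limit)
{
    assert(t != NULL);
    if (!t->rt)
        return false;
    return ++t->timeout > limit;
}

static void toy_boost(struct toy_task *t)
{
    t->rt = true;
}

/* Leave the realtime class; optionally reset the counter (the behaviour the patch describes). */
static void toy_deboost(struct toy_task *t, bool reset_counter)
{
    t->rt = false;
    if (reset_counter)
        t->timeout = 0;
}

/* Boost `rounds` times for `ticks_per_boost` ticks each; return true if the limit tripped. */
static bool toy_run(unsigned rounds, unsigned ticks_per_boost,
                    unsigned long limit, bool reset_counter)
{
    struct toy_task t = { false, 0 };

    for (unsigned r = 0; r < rounds; r++) {
        toy_boost(&t);
        for (unsigned i = 0; i < ticks_per_boost; i++) {
            if (toy_tick(&t, limit))
                return true;
        }
        toy_deboost(&t, reset_counter);
    }
    return false;
}
```

With a limit of 10 ticks and boosts of 3 ticks each, `toy_run(10, 3, 10, false)` trips the limit in the fourth boost, while `toy_run(10, 3, 10, true)` never does.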
  4. 10 Mar, 2015 1 commit
  5. 06 Mar, 2015 1 commit
  6. 03 Mar, 2015 1 commit
  7. 01 Mar, 2015 1 commit
  8. 19 Feb, 2015 1 commit
  9. 18 Feb, 2015 11 commits
  10. 14 Feb, 2015 2 commits
    • T
      sched: use %*pb[l] to print bitmaps including cpumasks and nodemasks · 333470ee
      Tejun Heo authored
      printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
      and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
      respectively which can be used to generate the two printf arguments
      necessary to format the specified cpu/nodemask.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      333470ee
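As a rough illustration of the list form mentioned in commit 333470ee, the sketch below is a userspace emulation that renders the set bits of a 64-bit mask as comma-separated ranges. It is not the kernel's implementation: `format_bit_list` is a hypothetical helper written for this example, and it handles only a single 64-bit word. In kernel code the formatting would instead be requested through printk with `%*pbl` and the two arguments produced by `cpumask_pr_args()`, as the commit message describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical userspace helper (not a kernel API): write the set bits of
 * `mask` into `buf` as a range list such as "0-3,8".
 * Returns the string length, or -1 if `buf` is too small.
 */
static int format_bit_list(uint64_t mask, char *buf, size_t len)
{
    size_t pos = 0;
    int bit = 0;

    assert(buf != NULL);
    if (len == 0)
        return -1;
    buf[0] = '\0';

    while (bit < 64) {
        int start, end, n;

        if (!(mask & (UINT64_C(1) << bit))) {
            bit++;
            continue;
        }
        /* Found the start of a run of set bits; scan to its end. */
        start = bit;
        while (bit < 64 && (mask & (UINT64_C(1) << bit)))
            bit++;
        end = bit - 1;

        if (start == end)
            n = snprintf(buf + pos, len - pos, "%s%d", pos ? "," : "", start);
        else
            n = snprintf(buf + pos, len - pos, "%s%d-%d", pos ? "," : "", start, end);
        if (n < 0 || (size_t)n >= len - pos)
            return -1;
        pos += (size_t)n;
    }
    return (int)pos;
}
```

For example, a mask of `0x10F` (bits 0 to 3 and bit 8) is rendered as `0-3,8`.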
    • R
      PM / sleep: Re-implement suspend-to-idle handling · 38106313
      Rafael J. Wysocki authored
      In preparation for adding support for quiescing timers in the final
      stage of suspend-to-idle transitions, rework the freeze_enter()
      function making the system wait on a wakeup event, the freeze_wake()
      function terminating the suspend-to-idle loop and the mechanism by
      which deep idle states are entered during suspend-to-idle.
      
      First of all, introduce a simple state machine for suspend-to-idle
      and make the code in question use it.
      
      Second, prevent freeze_enter() from losing wakeup events due to race
      conditions and ensure that the number of online CPUs won't change
      while it is being executed.  In addition to that, make it force
      all of the CPUs to re-enter the idle loop in case they are in idle
      states already (so they can enter deeper idle states if possible).
      
      Next, drop cpuidle_use_deepest_state() and replace use_deepest_state
      checks in cpuidle_select() and cpuidle_reflect() with a single
      suspend-to-idle state check in cpuidle_idle_call().
      
      Finally, introduce cpuidle_enter_freeze() that will simply find the
      deepest idle state available to the given CPU and enter it using
      cpuidle_enter().
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      38106313
  11. 13 Feb, 2015 1 commit
    • C
      kernel/sched/clock.c: add another clock for use with the soft lockup watchdog · 545a2bf7
      Cyril Bur authored
      When the hypervisor pauses a virtualised kernel the kernel will observe a
      jump in timebase. This can cause spurious messages from the softlockup
      detector.
      
      Whilst these messages are harmless, they are accompanied with a stack
      trace which causes undue concern and more problematically the stack trace
      in the guest has nothing to do with the observed problem and can only be
      misleading.
      
      Furthermore, on POWER8 this is completely avoidable with the introduction
      of the Virtual Time Base (VTB) register.
      
      This patch (of 2):
      
      This permits the use of arch specific clocks for which virtualised kernels
      can use their notion of 'running' time, not the elapsed wall time which
      will include host execution time.
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrew Jones <drjones@redhat.com>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: chai wen <chaiw.fnst@cn.fujitsu.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Ben Zhang <benzh@chromium.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      545a2bf7
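One common way to let an architecture substitute its own clock, as commit 545a2bf7 describes, is a weak default definition that a stronger definition elsewhere can replace at link time. The userspace sketch below shows only that general pattern and assumes a GCC- or Clang-compatible compiler; `toy_local_clock_ns` and `toy_running_clock_ns` are hypothetical names, and nothing here is taken from the patch itself.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a generic clock: monotonic time in nanoseconds. */
static uint64_t toy_local_clock_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc; /* silence unused-variable warnings when NDEBUG is defined */
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Weak default: used unless another object file provides a non-weak
 * toy_running_clock_ns(), for instance one that excludes time during
 * which a guest was paused by its host.
 */
__attribute__((weak)) uint64_t toy_running_clock_ns(void)
{
    return toy_local_clock_ns();
}
```

A caller such as a watchdog would call only `toy_running_clock_ns()`, so swapping in a platform-specific clock needs no change at the call site.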
  12. 04 Feb, 2015 2 commits