1. 18 May, 2015 1 commit
    • P
      sched,perf: Fix periodic timers · 4cfafd30
      Committed by Peter Zijlstra
      In the below two commits (see Fixes) we have periodic timers that can
      stop themselves when they're no longer required, but need to be
      (re)-started when their idle condition changes.
      
      A further complication is that we want the timer handler to always do
      the forward such that it will always correctly deal with the overruns,
      and we do not want to race such that the handler has already decided
      to stop, but the (external) restart sees the timer still active and we
      end up with a 'lost' timer.
      
      The problem with the current code is that the re-start can come before
      the callback does the forward, at which point the forward from the
      callback will WARN about forwarding an enqueued timer.
      
      Now, conceptually it's easy to detect if you're before or after the fwd
      by comparing the expiration time against the current time. Of course,
      that's expensive (and racy) because we don't have the current time.
      
      Alternatively one could cache this state inside the timer, but then
      everybody pays the overhead of maintaining this extra state, and that
      is undesired.
      
      The only other option that I could see is the external timer_active
      variable, which I tried to kill before. I would love a nicer interface
      for this seemingly simple 'problem' but alas.
      
      Fixes: 272325c4 ("perf: Fix mux_interval hrtimer wreckage")
      Fixes: 77a4d1a1 ("sched: Cleanup bandwidth timers")
      Cc: pjt@google.com
      Cc: tglx@linutronix.de
      Cc: klamm@yandex-team.ru
      Cc: mingo@kernel.org
      Cc: bsegall@google.com
      Cc: hpa@zytor.com
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
      4cfafd30
  2. 23 Apr, 2015 1 commit
  3. 22 Apr, 2015 3 commits
  4. 08 Apr, 2015 1 commit
  5. 03 Apr, 2015 2 commits
  6. 02 Apr, 2015 4 commits
  7. 27 Mar, 2015 15 commits
    • W
      sched/deadline: Fix rt runtime corruption when dl fails its global constraints · a1963b81
      Committed by Wanpeng Li
      One version of sched_rt_global_constraints() (the !rt-cgroup one)
      changes state, therefore if we fail the later sched_dl_global_constraints()
      call the system is left in an inconsistent state.
      
      Fix this by changing the order of the calls.
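
The ordering problem can be illustrated with a small user-space sketch. This is not the kernel code: `struct sched_globals`, its fields, and the `update_*` helpers are hypothetical stand-ins, chosen only to show why the check that cannot modify state should run before the step that does.

```c
/* Hypothetical global state; not the kernel's actual data structures. */
struct sched_globals {
	long rt_runtime;   /* state that the RT step modifies */
	long dl_allocated; /* bandwidth already promised to deadline tasks */
};

/* Pure check: never touches state, fails if the new runtime is too small. */
static int dl_global_constraints(const struct sched_globals *g, long new_runtime)
{
	return new_runtime < g->dl_allocated ? -1 : 0;
}

/* Check that also commits the new value (like the !rt-cgroup variant). */
static int rt_global_constraints(struct sched_globals *g, long new_runtime)
{
	if (new_runtime < 0)
		return -1;
	g->rt_runtime = new_runtime;
	return 0;
}

/* Old order: state is already changed when the deadline check fails. */
static int update_buggy(struct sched_globals *g, long new_runtime)
{
	int ret = rt_global_constraints(g, new_runtime);

	if (ret)
		return ret;
	return dl_global_constraints(g, new_runtime);
}

/* New order: run the side-effect-free check first, commit only on success. */
static int update_fixed(struct sched_globals *g, long new_runtime)
{
	int ret = dl_global_constraints(g, new_runtime);

	if (ret)
		return ret;
	return rt_global_constraints(g, new_runtime);
}
```

With `rt_runtime = 950` and `dl_allocated = 500`, a request for 100 fails in both variants, but only `update_fixed()` leaves `rt_runtime` at 950.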
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      [ Improved the changelog. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Juri Lelli <juri.lelli@arm.com>
      Link: http://lkml.kernel.org/r/1426590931-4639-2-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a1963b81
    • W
      sched/deadline: Avoid a superfluous check · bd4bde14
      Committed by Wanpeng Li
      Since commit 40767b0d ("sched/deadline: Fix deadline parameter
      modification handling") we clear the throttled state when switching
      from a dl task, therefore we should never find it set when switching to
      a dl task.
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      [ Improved the changelog. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Juri Lelli <juri.lelli@arm.com>
      Link: http://lkml.kernel.org/r/1426590931-4639-1-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bd4bde14
    • P
      sched: Improve load balancing in the presence of idle CPUs · d4573c3e
      Committed by Preeti U Murthy
      When a CPU is kicked to do nohz idle balancing, it wakes up to do load
      balancing on itself, followed by load balancing on behalf of idle CPUs.
      But it may end up with load after the load balancing attempt on itself.
      This aborts nohz idle balancing. As a result several idle CPUs are left
      without tasks till such a time that an ILB CPU finds it unfavorable to
      pull tasks upon itself. This delays spreading of load across idle CPUs
      and worse, clutters only a few CPUs with tasks.
      
      The effect of the above problem was observed on an SMT8 POWER server
      with 2 levels of numa domains. Busy loops equal to number of cores were
      spawned. Since load balancing on fork/exec is discouraged across numa
      domains, all busy loops would start on one of the numa domains. However
      it was expected that eventually one busy loop would run per core across
      all domains due to nohz idle load balancing. But it was observed that it
      took as long as 10 seconds to spread the load across numa domains.
      
      Further investigation showed that this was a consequence of the
      following:
      
       1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
          load balancing [Given the experiment, up to 6 CPUs per core could be
          potentially idle in this domain.]
      
       2. However the ILB CPU would call load_balance() on itself before
          initiating nohz idle load balancing.
      
       3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
          tasks from its sibling cores to even out load.
      
       4. Now that the ILB CPU was no longer idle, it would abort nohz idle
          load balancing
      
      As a result the opportunities to spread load across numa domains were
      lost until such a time that the cores within the first numa domain had
      equal number of tasks among themselves.  This is a pretty bad scenario,
      since the cores within the first numa domain would have as many as 4
      tasks each, while cores in the neighbouring numa domains would all
      remain idle.
      
      Fix this by checking if a CPU was woken up to do nohz idle load
      balancing, before it does load balancing upon itself. This way we allow
      idle CPUs across the system to do load balancing which results in
      quicker spread of load, instead of performing load balancing within the
      local sched domain hierarchy of the ILB CPU alone under circumstances
      such as above.
      Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Jason Low <jason.low2@hp.com>
      Cc: benh@kernel.crashing.org
      Cc: daniel.lezcano@linaro.org
      Cc: efault@gmx.de
      Cc: iamjoonsoo.kim@lge.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: riel@redhat.com
      Cc: srikar@linux.vnet.ibm.com
      Cc: svaidy@linux.vnet.ibm.com
      Cc: tim.c.chen@linux.intel.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/20150326130014.21532.17158.stgit@preeti.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d4573c3e
    • P
      sched: Optimize freq invariant accounting · dfbca41f
      Committed by Peter Zijlstra
      Currently the freq invariant accounting (in
      __update_entity_runnable_avg() and sched_rt_avg_update()) gets the
      scale factor from a weak function call. This means that even for archs
      that use the default implementation the compiler cannot see into this
      function and optimize the extra scaling math away.
      
      This is sad, esp. since it's a 64-bit multiplication which can be quite
      costly on some platforms.
      
      So replace the weak function with #ifdef and __always_inline goo. This
      is not quite as nice from an arch support PoV but should at least
      result in compile time errors if done wrong.
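
A minimal sketch of the `#ifdef` plus always-inline pattern follows, assuming a simplified one-argument signature (the kernel function takes different parameters) and a hypothetical caller `scale_delta()`. An architecture would opt in by defining the macro `arch_scale_freq_capacity` in a header that is included before this fallback.

```c
#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1UL << SCHED_CAPACITY_SHIFT)

/*
 * Fallback used only when the architecture did not provide its own
 * definition. Because it is an inline function and not a weak symbol,
 * the compiler can see that it returns a constant.
 */
#ifndef arch_scale_freq_capacity
static inline __attribute__((always_inline))
unsigned long arch_scale_freq_capacity(int cpu)
{
	(void)cpu;
	return SCHED_CAPACITY_SCALE;
}
#endif

/* Hypothetical caller: scale a time delta by the frequency factor. */
static unsigned long scale_delta(unsigned long delta, int cpu)
{
	/* With the default, this folds to (delta * 1024) >> 10, i.e. delta. */
	return (delta * arch_scale_freq_capacity(cpu)) >> SCHED_CAPACITY_SHIFT;
}
```

With a weak function the multiplication and shift would have to be performed at run time, since the callee could be overridden at link time.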
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: Paul Turner <pjt@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/20150323131905.GF23123@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      dfbca41f
    • V
      sched: Move CFS tasks to CPUs with higher capacity · 1aaf90a4
      Committed by Vincent Guittot
      When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
      capacity for CFS tasks can be significantly reduced. Once we detect such a
      situation by comparing cpu_capacity_orig and cpu_capacity, we trigger an idle
      load balance to check if it's worth moving its tasks to an idle CPU.
      
      It's worth trying to move the task before the CPU is fully utilized to
      minimize the preemption by irq or RT tasks.
      
      Once the idle load_balance has selected the busiest CPU, it will look for an
      active load balance for only two cases:
      
        - There is only 1 task on the busiest CPU.
      
        - We haven't been able to move a task of the busiest rq.
      
      A CPU with a reduced capacity is included in the 1st case, and it's worth
      actively migrating its task if the idle CPU has more available capacity for
      CFS tasks. This test has been added in need_active_balance().
      
      As a side note, this will not generate more spurious ILBs because we already
      trigger an ILB if there is more than 1 busy CPU. If this CPU is the only one that
      has a task, we will trigger the ILB once for migrating the task.
      
      The nohz_kick_needed() function has been cleaned up a bit while adding the new
      test.
      
      env.src_cpu and env.src_rq must be set unconditionally because they are used
      in need_active_balance(), which is called even if busiest->nr_running equals 1.
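
The capacity comparison described above can be sketched as follows. This is an illustration under assumptions, not the patch itself: the helper names, the flat parameter lists, and the use of a percentage threshold named `imbalance_pct` (125 in the usage below) are choices made for this sketch.

```c
#include <stdbool.h>

/* True when RT/IRQ pressure has removed a noticeable share of the CPU. */
static bool capacity_reduced(unsigned long capacity,
			     unsigned long capacity_orig,
			     unsigned int imbalance_pct)
{
	return capacity * imbalance_pct < capacity_orig * 100;
}

/*
 * Decide whether a lone CFS task should be actively migrated from a
 * pressured source CPU to an idle destination CPU.
 */
static bool worth_active_migration(unsigned int src_nr_running,
				   unsigned long src_capacity,
				   unsigned long src_capacity_orig,
				   unsigned long dst_capacity,
				   unsigned int imbalance_pct)
{
	if (src_nr_running != 1)
		return false;

	return capacity_reduced(src_capacity, src_capacity_orig, imbalance_pct) &&
	       src_capacity * imbalance_pct < dst_capacity * 100;
}
```

For example, a source CPU left with 600 of its original 1024 capacity qualifies against an idle CPU with capacity 1024 (600 * 125 < 1024 * 100), while a source CPU that still has 900 does not.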
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-12-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1aaf90a4
    • V
      sched: Add SD_PREFER_SIBLING for SMT level · caff37ef
      Committed by Vincent Guittot
      Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
      the scheduler will place at least one task per core.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-11-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      caff37ef
    • V
      sched: Remove unused struct sched_group_capacity::capacity_orig · dc7ff76e
      Committed by Vincent Guittot
      The 'struct sched_group_capacity::capacity_orig' field is no longer used
      in the scheduler so we can remove it.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425378903-5349-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      dc7ff76e
    • V
      sched: Replace capacity_factor by usage · ea67821b
      Committed by Vincent Guittot
      The scheduler tries to compute how many tasks a group of CPUs can handle by
      assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
      SCHED_CAPACITY_SCALE.
      
      'struct sg_lb_stats:group_capacity_factor' divides the capacity of the group
      by SCHED_LOAD_SCALE to estimate how many tasks can run in the group. Then, it
      compares this value with the sum of nr_running to decide if the group is
      overloaded or not.
      
      But the 'group_capacity_factor' concept hardly works for SMT systems: it
      sometimes works for big cores but fails to do the right thing for little cores.
      
      Below are two examples to illustrate the problem that this patch solves:
      
      1 - If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
         (640 as an example), a group of 3 CPUs will have a max capacity_factor of 2
         (div_round_closest(3*640/1024) = 2) which means that it will be seen as
         overloaded even if we have only one task per CPU.
      
      2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
         (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
         (at max and thanks to the fix [0] for SMT systems that prevents the appearance
         of ghost CPUs) but if one CPU is fully used by rt tasks (and its capacity is
         reduced to nearly nothing), the capacity factor of the group will still be 4
         (div_round_closest(3*1512/1024) = 5 which is capped to 4 with [0]).
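
The arithmetic of the first example can be reproduced with a short sketch. The helpers below are simplified stand-ins written for illustration: `div_round_closest()` mirrors round-to-nearest integer division, and the cap at the number of CPUs is an assumed simplification of the fix referred to as [0].

```c
#define SCHED_CAPACITY_SCALE 1024UL

/* Integer division rounded to the nearest value. */
static unsigned long div_round_closest(unsigned long x, unsigned long d)
{
	return (x + d / 2) / d;
}

/* Old-style estimate of how many tasks a group of CPUs can handle. */
static unsigned long capacity_factor(unsigned long nr_cpus,
				     unsigned long capacity_per_cpu)
{
	unsigned long factor =
		div_round_closest(nr_cpus * capacity_per_cpu,
				  SCHED_CAPACITY_SCALE);

	/* Assumed simplification of [0]: never report more slots than CPUs. */
	return factor > nr_cpus ? nr_cpus : factor;
}

/* A group counts as overloaded when it runs more tasks than its factor. */
static int group_overloaded(unsigned long nr_running, unsigned long factor)
{
	return nr_running > factor;
}
```

For three CPUs of capacity 640 the factor is 2, so three tasks (one per CPU) already mark the group as overloaded, which is the misclassification that example 1 describes.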
      
      So, this patch tries to solve this issue by removing capacity_factor and
      replacing it with the 2 following metrics:
      
        - The available CPU's capacity for CFS tasks which is already used by
          load_balance().
      
        - The usage of the CPU by the CFS tasks. For the latter, utilization_avg_contrib
          has been re-introduced to compute the usage of a CPU by CFS tasks.
      
      'group_capacity_factor' and 'group_has_free_capacity' have been removed and replaced
      by 'group_no_capacity'. We compare the number of tasks with the number of CPUs and
      we evaluate the level of utilization of the CPUs to define if a group is
      overloaded or if a group has capacity to handle more tasks.
      
      For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
      so it will be selected in priority (among the overloaded groups). Since [1],
      SD_PREFER_SIBLING is no longer affected by the computation of 'load_above_capacity'
      because local is not overloaded.
      
      [1] 9a5d9ba6 ("sched/fair: Allow calculate_imbalance() to move idle cpus")
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1425052454-25797-9-git-send-email-vincent.guittot@linaro.org
      [ Tidied up the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ea67821b
    • V
      sched: Calculate CPU's usage statistic and put it into struct sg_lb_stats::group_usage · 8bb5b00c
      Committed by Vincent Guittot
      Monitor the usage level of each group of each sched_domain level. The usage is
      the portion of cpu_capacity_orig that is currently used on a CPU or group of
      CPUs. We use the utilization_load_avg to evaluate the usage level of each
      group.
      
      The utilization_load_avg only takes into account the running time of the CFS
      tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
      utilized. Nevertheless, we must cap utilization_load_avg which can be
      temporarily greater than SCHED_LOAD_SCALE after the migration of a task to this
      CPU and until the metrics are stabilized.
      
      The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
      running load on the CPU whereas the available capacity for the CFS task is in
      the range [0..cpu_capacity_orig]. In order to test if a CPU is fully utilized
      by CFS tasks, we have to scale the utilization in the cpu_capacity_orig range
      of the CPU to get the usage of the latter. The usage can then be compared with
      the available capacity (i.e. cpu_capacity) to deduce the usage level of a CPU.
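
The capping and rescaling described in the two paragraphs above can be sketched as a single helper. The function name `cpu_usage()` and its flat parameters are assumptions made for this illustration; only the cap at SCHED_LOAD_SCALE and the scaling into the cpu_capacity_orig range come from the text.

```c
#define SCHED_LOAD_SHIFT 10
#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT)

/*
 * Map utilization_load_avg (nominally 0..SCHED_LOAD_SCALE) into the
 * 0..capacity_orig range so it can be compared with cpu_capacity.
 */
static unsigned long cpu_usage(unsigned long utilization_load_avg,
			       unsigned long capacity_orig)
{
	/* Cap transient overshoot that can follow a task migration. */
	if (utilization_load_avg >= SCHED_LOAD_SCALE)
		return capacity_orig;

	return (utilization_load_avg * capacity_orig) >> SCHED_LOAD_SHIFT;
}
```

A CPU with an original capacity of 640 and a utilization of 512 (half of SCHED_LOAD_SCALE) therefore reports a usage of 320, and any utilization at or above 1024 reports the full 640.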
      
      The frequency scaling invariance of the usage is not taken into account in this
      patch, it will be solved in another patch which will deal with frequency
      scaling invariance on the utilization_load_avg.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425455327-13508-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8bb5b00c
    • V
      sched: Add struct rq::cpu_capacity_orig · ca6d75e6
      Committed by Vincent Guittot
      This new field 'cpu_capacity_orig' reflects the original capacity of a CPU
      before being altered by rt tasks and/or IRQ.
      
      The cpu_capacity_orig will be used:
      
        - to detect when the capacity of a CPU has been noticeably reduced so we can
          trigger load balance to look for a CPU with better capacity. As an example, we
          can detect when a CPU handles a significant amount of irq
          (with CONFIG_IRQ_TIME_ACCOUNTING) but this CPU is seen as an idle CPU by
          the scheduler whereas CPUs, which are really idle, are available.
      
        - to evaluate the available capacity for CFS tasks.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-7-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ca6d75e6
    • V
      sched: Make scale_rt invariant with frequency · b5b4860d
      Committed by Vincent Guittot
      The average running time of RT tasks is used to estimate the remaining compute
      capacity for CFS tasks. This remaining capacity is the original capacity scaled
      down by a factor (aka scale_rt_capacity). This estimation of available capacity
      must also be invariant with frequency scaling.
      
      A frequency scaling factor is applied on the running time of the RT tasks for
      computing scale_rt_capacity.
      
      In sched_rt_avg_update(), we now scale the RT execution time like below:
      
        rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT
      
      Then, scale_rt_capacity can be summarized by:
      
        scale_rt_capacity = SCHED_CAPACITY_SCALE * available / total
      
      with available = total - rq->rt_avg
      
      This has been optimized in current code by:
      
        scale_rt_capacity = available / (total >> SCHED_CAPACITY_SHIFT)
      
      But we can also expand the equation as below:
      
        scale_rt_capacity = SCHED_CAPACITY_SCALE - ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / total)
      
      and we can optimize the equation by removing SCHED_CAPACITY_SHIFT shift in
      the computation of rq->rt_avg and scale_rt_capacity().
      
      so rq->rt_avg += rt_delta * arch_scale_freq_capacity()
      and
      scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total)
      
      arch_scale_freq_capacity() will be called in the hot path of the scheduler,
      so it must be a short and efficient function.
      
      As an example, arch_scale_freq_capacity() should return a cached value that
      is updated periodically outside of the hot path.
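
The final form of the two equations can be checked with a small user-space sketch. The function names and the 64-bit parameter types are assumptions for this illustration, `total` is assumed to be non-zero, and the clamp to 1 when RT time consumes the whole period is a choice of this sketch and not something stated in the text above.

```c
#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1ULL << SCHED_CAPACITY_SHIFT)

/* rq->rt_avg += rt_delta * arch_scale_freq_capacity(), with no down-shift. */
static unsigned long long rt_avg_update(unsigned long long rt_avg,
					unsigned long long rt_delta,
					unsigned long long freq_scale)
{
	return rt_avg + rt_delta * freq_scale;
}

/* scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total) */
static unsigned long long scale_rt_capacity(unsigned long long rt_avg,
					    unsigned long long total)
{
	unsigned long long used = rt_avg / total; /* total must be non-zero */

	if (used >= SCHED_CAPACITY_SCALE)
		return 1; /* assumed clamp: never report zero capacity */

	return SCHED_CAPACITY_SCALE - used;
}
```

If RT tasks run for 250 out of 1000 time units at full frequency (scale 1024), rt_avg becomes 256000 and the remaining capacity is 1024 - 256 = 768. The same 250 units at half frequency (scale 512) count for less work, leaving 1024 - 128 = 896.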
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-6-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b5b4860d
    • M
      sched: Make sched entity usage tracking scale-invariant · 0c1dc6b2
      Committed by Morten Rasmussen
      Apply frequency scale-invariance correction factor to usage tracking.
      
      Each segment of the running_avg_sum geometric series is now scaled by the
      current frequency so the utilization_avg_contrib of each entity will be
      invariant with frequency scaling.
      
      As a result, utilization_load_avg, which is the sum of utilization_avg_contrib,
      becomes invariant too. So the usage level that is returned by get_cpu_usage()
      stays relative to the max frequency, as does the cpu_capacity it is compared against.
      
      Then, we want to keep the load tracking values in a 32-bit type, which implies
      that the max value of {runnable|running}_avg_sum must be lower than
      2^32/88761=48388 (88761 is the max weight of a task). As LOAD_AVG_MAX = 47742,
      arch_scale_freq_capacity() must return a value less than
      (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE = 1024).
      So we define the range to [0..SCHED_CAPACITY_SCALE] in order to avoid overflow.
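
The overflow bound quoted above can be verified with integer arithmetic. The constants are taken from the text; the two helper names are hypothetical and exist only for this check.

```c
#include <stdint.h>

#define MAX_TASK_WEIGHT      88761ULL /* max weight of a task */
#define LOAD_AVG_MAX         47742ULL
#define SCHED_CAPACITY_SHIFT 10

/* Largest {runnable|running}_avg_sum whose weighted value fits in 32 bits. */
static uint64_t max_avg_sum(void)
{
	return (1ULL << 32) / MAX_TASK_WEIGHT;
}

/* Upper bound for the frequency scale factor, in SCHED_CAPACITY units. */
static uint64_t max_freq_scale(void)
{
	return (max_avg_sum() << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
}
```

Both helpers reproduce the figures in the text: 48388 for the sum and 1037 for the scale factor, so limiting the factor to 1024 leaves a small safety margin.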
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425455186-13451-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0c1dc6b2
    • V
      sched: Remove frequency scaling from cpu_capacity · a8faa8f5
      Committed by Vincent Guittot
      Now that arch_scale_cpu_capacity has been introduced to scale the original
      capacity, the arch_scale_freq_capacity is no longer used (it was
      previously used by ARM arch).
      
      Remove arch_scale_freq_capacity from the computation of cpu_capacity.
      The frequency invariance will be handled in the load tracking and not in
      the CPU capacity. arch_scale_freq_capacity will be revisited for scaling
      load with the current frequency of the CPUs in a later patch.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-4-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a8faa8f5
    • M
      sched: Track group sched_entity usage contributions · 21f44866
      Committed by Morten Rasmussen
      Add usage contribution tracking for group entities. Unlike
      se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
      entities is the sum of se->avg.utilization_avg_contrib for all entities on the
      group runqueue.
      
      It is _not_ influenced in any way by the task group h_load. Hence it
      represents the actual cpu usage of the group, not its intended load
      contribution which may differ significantly from the utilization on
      lightly utilized systems.
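
As a rough illustration of the summation described above, the sketch below adds up the contributions of the entities queued on a group runqueue. `struct entity` and `group_utilization()` are hypothetical simplifications; the kernel keeps this value incrementally and does not walk an array.

```c
#include <stddef.h>

/* Hypothetical, flattened view of a scheduling entity. */
struct entity {
	unsigned long utilization_avg_contrib;
};

/*
 * Utilization of a group entity: the plain sum of its children's
 * contributions, with no weighting by group shares or h_load.
 */
static unsigned long group_utilization(const struct entity *children,
				       size_t nr_children)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < nr_children; i++)
		sum += children[i].utilization_avg_contrib;

	return sum;
}
```

Three children contributing 100, 200 and 50 give the group a utilization of 350, whatever load weight the group has been assigned.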
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-3-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      21f44866
    • V
      sched: Add sched_avg::utilization_avg_contrib · 36ee28e4
      Committed by Vincent Guittot
      Add new statistics which reflect the average time a task is running on the CPU
      and the sum of these running times for the tasks on a runqueue. The latter is
      named utilization_load_avg.
      
      This patch is based on the usage metric that was proposed in the 1st
      versions of the per-entity load tracking patchset by Paul Turner
      <pjt@google.com> but that was removed afterwards. This version differs from
      the original one in the sense that it's not linked to task_group.
      
      The rq's utilization_load_avg will be used to check if a rq is overloaded or
      not instead of trying to compute how many tasks a group of CPUs can handle.
      
      Rename runnable_avg_period into avg_period as it is now used with both
      runnable_avg_sum and running_avg_sum.
      
      Add some descriptions of the variables to explain their differences.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-2-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      36ee28e4
  8. 26 Mar, 2015 1 commit
    • M
      mm: numa: slow PTE scan rate if migration failures occur · 074c2381
      Committed by Mel Gorman
      Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
      
        Across the board the 4.0-rc1 numbers are much slower, and the degradation
        is far worse when using the large memory footprint configs. Perf points
        straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:
      
         -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
            - default_send_IPI_mask_sequence_phys
               - 99.99% physflat_send_IPI_mask
                  - 99.37% native_send_call_func_ipi
                       smp_call_function_many
                     - native_flush_tlb_others
                        - 99.85% flush_tlb_page
                             ptep_clear_flush
                             try_to_unmap_one
                             rmap_walk
                             try_to_unmap
                             migrate_pages
                             migrate_misplaced_page
                           - handle_mm_fault
                              - 99.73% __do_page_fault
                                   trace_do_page_fault
                                   do_async_page_fault
                                 + async_page_fault
                    0.63% native_send_call_func_single_ipi
                       generic_exec_single
                       smp_call_function_single
      
      This is showing excessive migration activity even though excessive
      migrations are meant to get throttled.  Normally, the scan rate is tuned
      on a per-task basis depending on the locality of faults.  However, if
      migrations fail for any reason then the PTE scanner may scan faster if
      the faults continue to be remote.  This means there is higher system CPU
      overhead and fault trapping at exactly the time we know that migrations
      cannot happen.  This patch tracks when migration failures occur and
      slows the PTE scanner.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Tested-by: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      074c2381
  9. 24 Mar, 2015 1 commit
  10. 23 Mar, 2015 2 commits
    • S
      sched/rt: Use IPI to trigger RT task push migration instead of pulling · b6366f04
      Committed by Steven Rostedt
      When debugging the latencies on a 40 core box, where we hit 300 to
      500 microsecond latencies, I found there was a huge contention on the
      runqueue locks.
      
      Investigating it further, running ftrace, I found that it was due to
      the pulling of RT tasks.
      
      The test that was run was the following:
      
       cyclictest --numa -p95 -m -d0 -i100
      
      This created a thread on each CPU, that would set its wakeup in iterations
      of 100 microseconds. The -d0 means that all the threads had the same
      interval (100us). Each thread sleeps for 100us and wakes up and measures
      its latencies.
      
      cyclictest is maintained at:
       git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git
      
      What happened was another RT task would be scheduled on one of the CPUs
      that was running our test, when the other CPU tests went to sleep and
      scheduled idle. This caused the "pull" operation to execute on all
      these CPUs. Each one of these saw the RT task that was overloaded on
      the CPU of the test that was still running, and each one tried
      to grab that task in a thundering herd way.
      
      To grab the task, each thread would do a double rq lock grab, grabbing
      its own lock as well as the rq of the overloaded CPU. As the sched
      domains on this box were rather flat for its size, I saw up to 12 CPUs
      block on this lock at once. This caused a ripple effect with the
      rq locks especially since the taking was done via a double rq lock, which
      means that several of the CPUs had their own rq locks held while trying
      to take this rq lock. As these locks were blocked, any wakeups or load
      balanceing on these CPUs would also block on these locks, and the wait
      time escalated.
      
      I've tried various methods to lessen the load, but things like an
      atomic counter to only let one CPU grab the task won't work, because
      the task may have a limited affinity, and we may pick the wrong
      CPU to take that lock and do the pull, only to find out that the
      CPU we picked isn't in the task's affinity.
      
      Instead of doing the PULL, I now have the CPUs that want the pull to
      send over an IPI to the overloaded CPU, and let that CPU pick what
      CPU to push the task to. No more need to grab the rq lock, and the
      push/pull algorithm still works fine.
      
      With this patch, the latency dropped to just 150us over a 20 hour run.
      Without the patch, the huge latencies would trigger in seconds.
      
      I've created a new sched feature called RT_PUSH_IPI, which is enabled
      by default.
      
      When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
      and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
      is enabled, the IPI is sent to the overloaded CPU to do a push.
      
      To enable or disable this at run time:
      
       # mount -t debugfs nodev /sys/kernel/debug
       # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
      or
       # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
      
      Update: The original patch would send an IPI to all CPUs in the RT overload
      list. But that could theoretically cause the reverse issue. That is, there
      could be lots of overloaded RT queues and one CPU lowers its priority. It would
      then send an IPI to all the overloaded RT queues and they could then all try
      to grab the rq lock of the CPU lowering its priority, and then we have the
      same problem.
      
      The latest design sends out only one IPI to the first overloaded CPU. It tries to
      push any tasks that it can, and then looks for the next overloaded CPU that can
      push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable
      tasks that have priorities greater than the source CPU are covered. In case the
      source CPU lowers its priority again, a flag is set to tell the IPI traversal to
      restart with the first RT overloaded CPU after the source CPU.
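
The single-IPI traversal described above can be sketched in userspace C. This is a minimal illustration, not the kernel code: `struct cpu_state`, `next_ipi_target()`, `NR_CPUS` as a fixed 8, and the "higher number means more urgent" priority convention are all assumptions made for the example.

```c
#include <stdbool.h>

#define NR_CPUS 8  /* hypothetical CPU count for this sketch */

/* Hypothetical per-CPU state; not the kernel's struct rq. */
struct cpu_state {
    bool overloaded;     /* CPU has RT tasks it could push away */
    int  pushable_prio;  /* priority of its best pushable task (assumed: higher = more urgent) */
};

/*
 * Pick the next CPU that should receive the IPI.
 *
 * src      - the CPU that lowered its priority and wants a task
 * src_prio - the priority src is now running at
 * prev     - the CPU that handled the previous IPI (pass src to start or restart)
 *
 * The scan walks circularly from the CPU after prev and stops when it
 * wraps back around to src. Returns -1 once every candidate is covered.
 */
int next_ipi_target(const struct cpu_state cpus[NR_CPUS],
                    int src, int src_prio, int prev)
{
    for (int cpu = (prev + 1) % NR_CPUS; cpu != src; cpu = (cpu + 1) % NR_CPUS) {
        /* Only CPUs holding a task more urgent than src's current work qualify. */
        if (cpus[cpu].overloaded && cpus[cpu].pushable_prio > src_prio)
            return cpu;
    }
    return -1;
}
```

Each IPI handler would try its push and then call `next_ipi_target()` with itself as `prev` to forward the chain. The restart flag mentioned above corresponds, in this sketch, to calling it again with `prev` set to `src`.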
      Parts-suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joern Engel <joern@purestorage.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b6366f04
    • B
      sched: Fix RLIMIT_RTTIME when PI-boosting to RT · 746db944
      Brian Silverman committed
      When non-realtime tasks get priority-inheritance boosted to a realtime
      scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
      counter used for checking this (the same one used for SCHED_RR
      timeslices) was not getting reset. This meant that tasks running with a
      non-realtime scheduling class which are repeatedly boosted to a realtime
      one, but never block while they are running realtime, eventually hit the
      timeout without ever running for a time over the limit. This patch
      resets the realtime timeslice counter when un-PI-boosting from an RT to
      a non-RT scheduling class.
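
The accounting problem can be modelled with a small userspace sketch. Everything here is hypothetical (`struct task_sketch`, `charge_tick()`, `set_rt()`, `limit_trips()`, and the tick-based limit); it only illustrates why a counter that survives un-boosting eventually trips the limit, and how resetting it on the RT to non-RT transition avoids that.

```c
#include <stdbool.h>

/* Hypothetical stand-in for per-task RT accounting; not the kernel's struct. */
struct task_sketch {
    bool is_rt;              /* currently in a realtime scheduling class */
    unsigned long rt_ticks;  /* ticks charged against the RT time limit */
};

/* Charge one tick while realtime; returns true once the limit is exceeded. */
bool charge_tick(struct task_sketch *t, unsigned long limit_ticks)
{
    if (!t->is_rt)
        return false;
    t->rt_ticks++;
    return t->rt_ticks > limit_ticks;
}

/* Change class; with_fix models resetting the counter when leaving RT. */
void set_rt(struct task_sketch *t, bool rt, bool with_fix)
{
    if (with_fix && t->is_rt && !rt)
        t->rt_ticks = 0;
    t->is_rt = rt;
}

/*
 * Boost the task `rounds` times for `ticks` ticks each, never blocking
 * while boosted. Returns true if the limit ever trips.
 */
bool limit_trips(unsigned long limit_ticks, int rounds, int ticks, bool with_fix)
{
    struct task_sketch t = { false, 0 };
    bool tripped = false;

    for (int r = 0; r < rounds; r++) {
        set_rt(&t, true, with_fix);
        for (int i = 0; i < ticks; i++)
            tripped |= charge_tick(&t, limit_ticks);
        set_rt(&t, false, with_fix);
    }
    return tripped;
}
```

With a limit of 5 ticks and three boosts of 3 ticks each, the model without the reset trips during the second boost even though no single boost ran over the limit. With the reset, only a boost that itself exceeds the limit trips it.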
      
      I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
      mutex that induces priority boosting and spins while boosted. It gets
      killed by a SIGXCPU on non-fixed kernels but not with this patch
      applied. The failure happens much faster with a CONFIG_PREEMPT_RT kernel,
      and does happen eventually with PREEMPT_VOLUNTARY kernels.
      Signed-off-by: Brian Silverman <brian@peloton-tech.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: austin@peloton-tech.com
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      746db944
  11. 20 Mar, 2015 1 commit
    • R
      sched, isolcpu: make cpu_isolated_map visible outside scheduler · 3fa0818b
      Rik van Riel committed
      Needed by the next patch. Also makes cpu_isolated_map present
      when compiled without SMP and/or with CONFIG_NR_CPUS=1, like
      the other cpu masks.
      
      At some point we may want to clean things up so cpumasks do
      not exist in UP kernels. Maybe something for the CONFIG_TINY
      crowd.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: cgroups@vger.kernel.org
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      3fa0818b
  12. 13 Mar, 2015 2 commits
    • P
      rcu: Handle outgoing CPUs on exit from idle loop · 88428cc5
      Paul E. McKenney committed
      This commit informs RCU of an outgoing CPU just before that CPU invokes
      arch_cpu_idle_dead() during its last pass through the idle loop (via a
      new CPU_DYING_IDLE notifier value).  This change means that RCU need not
      deal with outgoing CPUs passing through the scheduler after informing
      RCU that they are no longer online.  Note that removing the CPU from
      the rcu_node ->qsmaskinit bit masks is done at CPU_DYING_IDLE time,
      and orphaning callbacks is still done at CPU_DEAD time, the reason being
      that at CPU_DEAD time we have another CPU that can adopt them.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      88428cc5
    • P
      cpu: Make CPU-offline idle-loop transition point more precise · 528a25b0
      Paul E. McKenney committed
      This commit uses a per-CPU variable to make the CPU-offline code path
      through the idle loop more precise, so that the outgoing CPU is
      guaranteed to make it into the idle loop before it is powered off.
      This commit is in preparation for putting the RCU offline-handling
      code on this code path, which will eliminate the magic one-jiffy
      wait that RCU uses as the maximum time for an outgoing CPU to get
      all the way through the scheduler.
      
      The magic one-jiffy wait for incoming CPUs remains a separate issue.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      528a25b0
  13. 10 Mar, 2015 1 commit
  14. 09 Mar, 2015 1 commit
    • F
      context_tracking: Rename context symbols to prepare for transition state · c467ea76
      Frederic Weisbecker committed
      Current context tracking symbols are designed to express living state.
      As such they are prefixed with "IN_": IN_USER, IN_KERNEL.
      
      Now we are going to use these symbols to also express state transitions
      such as context_tracking_enter(IN_USER) or context_tracking_exit(IN_USER).
      But while the "IN_" prefix works well to express entering a context, it's
      confusing to depict a context exit: context_tracking_exit(IN_USER)
      could mean two things:
      	1) We are exiting the current context to enter user context.
      	2) We are exiting the user context
      We want 2), but the reviewer may be confused and understand 1).
      
      So let's disambiguate these symbols and rename them to CONTEXT_USER and
      CONTEXT_KERNEL.
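
A minimal sketch of how the renamed symbols read at enter and exit call sites. The `sketch_` functions and the single global state are assumptions for illustration only; they are not the kernel's context tracking implementation.

```c
/* Hypothetical names modelled on the renamed symbols; not the kernel's API. */
enum sketch_ctx_state {
    CONTEXT_KERNEL = 0,
    CONTEXT_USER
};

/* Single global state for the sketch; real tracking would be per CPU. */
static enum sketch_ctx_state sketch_current = CONTEXT_KERNEL;

/* "Enter the given context": the argument names the context being entered. */
void sketch_context_enter(enum sketch_ctx_state state)
{
    sketch_current = state;
}

/* "Exit the given context": the argument names the context being left. */
void sketch_context_exit(enum sketch_ctx_state state)
{
    if (sketch_current == state)
        sketch_current = CONTEXT_KERNEL;
}

/* Read back the current context. */
enum sketch_ctx_state sketch_context_get(void)
{
    return sketch_current;
}
```

With a neutral `CONTEXT_` prefix, `sketch_context_exit(CONTEXT_USER)` can only be read as "leave user context", which is interpretation 2) above.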
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      c467ea76
  15. 06 Mar, 2015 1 commit
  16. 03 Mar, 2015 1 commit
  17. 01 Mar, 2015 1 commit
  18. 19 Feb, 2015 1 commit