1. 07 Jun, 2015 2 commits
  2. 19 May, 2015 3 commits
    • sched/numa: Reduce conflict between fbq_classify_rq() and migration · c1ceac62
      Committed by Rik van Riel
      It is possible for fbq_classify_rq() to indicate that a CPU has tasks that
      should be moved to another NUMA node, but for migrate_improves_locality
      and migrate_degrades_locality to not identify those tasks.
      
      This patch always gives preference to preferred node evaluations, and
      only checks the number of faults when evaluating moves between two
      non-preferred nodes on a larger NUMA system.
      
      On a two node system, the number of faults is never evaluated. Either
      a task is about to be pulled off its preferred node, or migrated onto
      it.
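
The preference order described above can be sketched as follows. This is only an illustration under stated assumptions: `struct numa_task`, `move_improves_locality()` and the per-node `faults` array are hypothetical names invented for this sketch, not the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of a task's NUMA state (not the kernel's). */
struct numa_task {
	int preferred_nid;           /* node the task prefers to run on */
	const unsigned long *faults; /* per-node fault counts, indexed by node id */
};

/* Returns true if moving the task from src_nid to dst_nid should be favoured. */
static bool move_improves_locality(const struct numa_task *t, int src_nid,
				   int dst_nid, int nr_nodes)
{
	assert(src_nid >= 0 && src_nid < nr_nodes);
	assert(dst_nid >= 0 && dst_nid < nr_nodes);

	if (src_nid == dst_nid)
		return false;

	/* Preferred-node evaluations always win. */
	if (dst_nid == t->preferred_nid)
		return true;
	if (src_nid == t->preferred_nid)
		return false;

	/* Only between two non-preferred nodes are fault counts compared. */
	return t->faults[dst_nid] > t->faults[src_nid];
}
```

With only two nodes, one of which is the preferred node, one of the two early returns always fires, so the fault comparison is never reached.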
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: mgorman@suse.de
      Link: http://lkml.kernel.org/r/20150514225936.35b91717@annuminas.surriel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/preempt: Optimize preemption operations on __schedule() callers · b30f0e3f
      Committed by Frederic Weisbecker
      __schedule() disables preemption and some of its callers
      (the preempt_schedule*() family) also set PREEMPT_ACTIVE.
      
      So we have two preempt_count() modifications that could be performed
      at once.
      
      Let's remove the preemption disablement from __schedule() and move
      this responsibility to its callers, so that the preempt_count()
      operations can be combined in a single place.
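
A toy model of the saving, assuming a plain counter in place of the real preempt_count(). The bit values, the `nr_updates` counter and every function name below are hypothetical and exist only to count how many counter modifications each variant performs.

```c
#include <assert.h>

/* Hypothetical bit layout, for illustration only. */
#define PREEMPT_DISABLE_OFFSET 1u
#define PREEMPT_ACTIVE_BIT     (1u << 21)

static unsigned int preempt_count_model; /* stands in for preempt_count() */
static unsigned int nr_updates;          /* number of counter modifications */

static void count_add(unsigned int val)
{
	preempt_count_model += val;
	nr_updates++;
}

static void count_sub(unsigned int val)
{
	assert(preempt_count_model >= val);
	preempt_count_model -= val;
	nr_updates++;
}

/* Before: the caller sets the active bit, the scheduler core disables preemption. */
static void preempt_path_before(void)
{
	count_add(PREEMPT_ACTIVE_BIT);
	count_add(PREEMPT_DISABLE_OFFSET); /* formerly done inside the scheduler core */
	/* ... scheduling work would happen here ... */
	count_sub(PREEMPT_DISABLE_OFFSET);
	count_sub(PREEMPT_ACTIVE_BIT);
}

/* After: the caller performs both modifications at once. */
static void preempt_path_after(void)
{
	count_add(PREEMPT_ACTIVE_BIT + PREEMPT_DISABLE_OFFSET);
	/* ... scheduling work would happen here ... */
	count_sub(PREEMPT_ACTIVE_BIT + PREEMPT_DISABLE_OFFSET);
}
```

In this model the first path touches the counter four times and the second only twice, while both leave it unchanged on exit.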
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431441711-29753-5-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: always use blk_schedule_flush_plug in io_schedule_out · 10d784ea
      Committed by Shaohua Li
      A block plug callback could sleep, so we introduced a parameter
      'from_schedule' that the corresponding drivers can use to distinguish a
      schedule plug flush from a plug finish. Unfortunately io_schedule_out
      still uses blk_flush_plug(). This causes the output below (Note: I added a
      might_sleep() in raid1_unplug to make it trigger faster, but the problem
      exists whether or not might_sleep() is added). In raid1/10, this can cause
      a deadlock.
      
      This patch makes io_schedule_out always use blk_schedule_flush_plug.
      This should only impact drivers (as far as I know, raid 1/10) which are
      sensitive to the 'from_schedule' parameter.
      
      [  370.817949] ------------[ cut here ]------------
      [  370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90()
      [  370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff81092fcf>] prepare_to_wait+0x2f/0x90
      [  370.817971] Modules linked in: raid1
      [  370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G        W       4.0.0+ #361
      [  370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
      [  370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1)
      [  370.817985]  ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001
      [  370.817988]  ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8
      [  370.817990]  ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28
      [  370.817993] Call Trace:
      [  370.817999]  [<ffffffff819dd7af>] dump_stack+0x4f/0x7b
      [  370.818002]  [<ffffffff81051afc>] warn_slowpath_common+0x8c/0xd0
      [  370.818004]  [<ffffffff81051b86>] warn_slowpath_fmt+0x46/0x50
      [  370.818006]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
      [  370.818008]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
      [  370.818010]  [<ffffffff810776ef>] __might_sleep+0x7f/0x90
      [  370.818014]  [<ffffffffa0000c03>] raid1_unplug+0xd3/0x170 [raid1]
      [  370.818024]  [<ffffffff81421d9a>] blk_flush_plug_list+0x8a/0x1e0
      [  370.818028]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
      [  370.818031]  [<ffffffff819e21b0>] io_schedule_timeout+0x130/0x140
      [  370.818033]  [<ffffffff819e3586>] bit_wait_io+0x36/0x50
      [  370.818034]  [<ffffffff819e31b5>] __wait_on_bit+0x65/0x90
      [  370.818041]  [<ffffffff8125b67c>] ? ext4_read_block_bitmap_nowait+0xbc/0x630
      [  370.818043]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
      [  370.818045]  [<ffffffff819e3302>] out_of_line_wait_on_bit+0x72/0x80
      [  370.818047]  [<ffffffff810935e0>] ? autoremove_wake_function+0x40/0x40
      [  370.818050]  [<ffffffff811de744>] __wait_on_buffer+0x44/0x50
      [  370.818053]  [<ffffffff8125ae80>] ext4_wait_block_bitmap+0xe0/0xf0
      [  370.818058]  [<ffffffff812975d6>] ext4_mb_init_cache+0x206/0x790
      [  370.818062]  [<ffffffff8114bc6c>] ? lru_cache_add+0x1c/0x50
      [  370.818064]  [<ffffffff81297c7e>] ext4_mb_init_group+0x11e/0x200
      [  370.818066]  [<ffffffff81298231>] ext4_mb_load_buddy+0x341/0x360
      [  370.818068]  [<ffffffff8129a1a3>] ext4_mb_find_by_goal+0x93/0x2f0
      [  370.818070]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
      [  370.818072]  [<ffffffff8129ab67>] ext4_mb_regular_allocator+0x67/0x460
      [  370.818074]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
      [  370.818076]  [<ffffffff8129ca4b>] ext4_mb_new_blocks+0x4cb/0x620
      [  370.818079]  [<ffffffff81290956>] ext4_ext_map_blocks+0x4c6/0x14d0
      [  370.818081]  [<ffffffff812a4d4e>] ? ext4_es_lookup_extent+0x4e/0x290
      [  370.818085]  [<ffffffff8126399d>] ext4_map_blocks+0x14d/0x4f0
      [  370.818088]  [<ffffffff81266fbd>] ext4_writepages+0x76d/0xe50
      [  370.818094]  [<ffffffff81149691>] do_writepages+0x21/0x50
      [  370.818097]  [<ffffffff811d5c00>] __writeback_single_inode+0x60/0x490
      [  370.818099]  [<ffffffff811d630a>] writeback_sb_inodes+0x2da/0x590
      [  370.818103]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
      [  370.818105]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
      [  370.818107]  [<ffffffff811d665f>] __writeback_inodes_wb+0x9f/0xd0
      [  370.818109]  [<ffffffff811d69db>] wb_writeback+0x34b/0x3c0
      [  370.818111]  [<ffffffff811d70df>] bdi_writeback_workfn+0x23f/0x550
      [  370.818116]  [<ffffffff8106bbd8>] process_one_work+0x1c8/0x570
      [  370.818117]  [<ffffffff8106bb5b>] ? process_one_work+0x14b/0x570
      [  370.818119]  [<ffffffff8106c09b>] worker_thread+0x11b/0x470
      [  370.818121]  [<ffffffff8106bf80>] ? process_one_work+0x570/0x570
      [  370.818124]  [<ffffffff81071868>] kthread+0xf8/0x110
      [  370.818126]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
      [  370.818129]  [<ffffffff819e9322>] ret_from_fork+0x42/0x70
      [  370.818131]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
      [  370.818132] ---[ end trace 7b4deb71e68b6605 ]---
      
      V2: don't change ->in_iowait
      
      Cc: NeilBrown <neilb@suse.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  3. 17 May, 2015 1 commit
    • sched: Fix function declaration return type mismatch · 58ac93e4
      Committed by Nicholas Mc Guire
      Static code checking was unhappy with:
      
        ./kernel/sched/fair.c:162 WARNING: return of wrong type
                      int != unsigned int
      
      get_update_sysctl_factor() is declared to return int but is
      currently returning an unsigned int. The first few preprocessed
      lines are:
      
       static int get_update_sysctl_factor(void)
       {
       unsigned int cpus = ({ int __min1 = (cpumask_weight(cpu_online_mask));
       int __min2 = (8); __min1 < __min2 ? __min1: __min2; });
       unsigned int factor;
      
      The type used by min_t() should be 'unsigned int' and the return type
      of get_update_sysctl_factor() should also be 'unsigned int' as its
      call-site update_sysctl() is expecting 'unsigned int' and the values
      utilizing:
      
        'factor'
        'sysctl_sched_min_granularity'
        'sched_nr_latency'
        'sysctl_sched_wakeup_granularity'
      
      ... are also all 'unsigned int', plus cpumask_weight() is also
      returning 'unsigned int'.
      
      So the natural type to use around here is 'unsigned int'.
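
A minimal userspace sketch of the type-consistent version, assuming a plain helper in place of the kernel's min_t() macro. `min_uint()` and `capped_cpu_count()` are hypothetical names, and `online_cpus` stands in for the value of cpumask_weight(cpu_online_mask).

```c
#include <assert.h>

/* Typed minimum: both operands are unsigned int, so no signedness mix-up. */
static inline unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Caps the CPU count at 8 while keeping 'unsigned int' end to end. */
static unsigned int capped_cpu_count(unsigned int online_cpus)
{
	assert(online_cpus > 0);
	return min_uint(online_cpus, 8);
}
```

Because the argument, the temporaries and the return value all share one type, a static checker has no int/unsigned int mismatch left to report.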
      
      ( Patch was compile tested with x86_64_defconfig +
        CONFIG_SCHED_DEBUG=y and the changed sections in
        kernel/sched/fair.i were reviewed. )
      Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
      [ Improved the changelog a bit. ]
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431716742-11077-1-git-send-email-hofrat@osadl.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 08 May, 2015 10 commits
  5. 29 Apr, 2015 1 commit
  6. 27 Apr, 2015 1 commit
  7. 16 Apr, 2015 1 commit
  8. 08 Apr, 2015 1 commit
  9. 03 Apr, 2015 2 commits
  10. 02 Apr, 2015 4 commits
  11. 27 Mar, 2015 14 commits
    • sched/deadline: Fix rt runtime corruption when dl fails its global constraints · a1963b81
      Committed by Wanpeng Li
      One version of sched_rt_global_constraints() (the !rt-cgroup one)
      changes state, therefore if the later sched_dl_global_constraints()
      call fails, the system is left in an inconsistent state.
      
      Fix this by changing the order of the calls.
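
The general pattern, running the side-effect-free check before the step that changes state, can be sketched like this. The structure and all function names are hypothetical stand-ins, and the check itself is a made-up condition; none of it is the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical global state touched by the state-changing step. */
struct global_state {
	long rt_runtime;
};

/* Check with no side effects (stand-in for the dl constraint check). */
static bool dl_constraints_ok(long new_rt_runtime, long dl_reserved)
{
	return new_rt_runtime >= dl_reserved;
}

/* Step that changes state (stand-in for the rt constraint update). */
static void rt_constraints_apply(struct global_state *s, long new_rt_runtime)
{
	s->rt_runtime = new_rt_runtime;
}

/* Validate first, mutate last: a failed check leaves the state untouched. */
static bool update_global(struct global_state *s, long new_rt_runtime,
			  long dl_reserved)
{
	assert(s != 0);
	if (!dl_constraints_ok(new_rt_runtime, dl_reserved))
		return false; /* nothing has been modified yet */
	rt_constraints_apply(s, new_rt_runtime);
	return true;
}
```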
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      [ Improved the changelog. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Juri Lelli <juri.lelli@arm.com>
      Link: http://lkml.kernel.org/r/1426590931-4639-2-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline: Avoid a superfluous check · bd4bde14
      Committed by Wanpeng Li
      Since commit 40767b0d ("sched/deadline: Fix deadline parameter
      modification handling") we clear the throttled state when switching
      from a dl task, therefore we should never find it set when switching to
      a dl task.
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      [ Improved the changelog. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Juri Lelli <juri.lelli@arm.com>
      Link: http://lkml.kernel.org/r/1426590931-4639-1-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Improve load balancing in the presence of idle CPUs · d4573c3e
      Committed by Preeti U Murthy
      When a CPU is kicked to do nohz idle balancing, it wakes up to do load
      balancing on itself, followed by load balancing on behalf of idle CPUs.
      But it may end up with load after the load balancing attempt on itself.
      This aborts nohz idle balancing. As a result several idle CPUs are left
      without tasks till such a time that an ILB CPU finds it unfavorable to
      pull tasks upon itself. This delays spreading of load across idle CPUs
      and worse, clutters only a few CPUs with tasks.
      
      The effect of the above problem was observed on an SMT8 POWER server
      with 2 levels of NUMA domains. Busy loops equal to the number of cores were
      spawned. Since load balancing on fork/exec is discouraged across NUMA
      domains, all busy loops would start on one of the NUMA domains. However
      it was expected that eventually one busy loop would run per core across
      all domains due to nohz idle load balancing. But it was observed that it
      took as long as 10 seconds to spread the load across NUMA domains.
      
      Further investigation showed that this was a consequence of the
      following:
      
       1. An ILB CPU was chosen from the first NUMA domain to trigger nohz idle
          load balancing. [Given the experiment, up to 6 CPUs per core could be
          potentially idle in this domain.]
      
       2. However the ILB CPU would call load_balance() on itself before
          initiating nohz idle load balancing.
      
       3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
          tasks from its sibling cores to even out load.
      
       4. Now that the ILB CPU was no longer idle, it would abort nohz idle
          load balancing.
      
      As a result the opportunities to spread load across NUMA domains were
      lost until such a time that the cores within the first NUMA domain had
      an equal number of tasks among themselves.  This is a pretty bad scenario,
      since the cores within the first NUMA domain would have as many as 4
      tasks each, while cores in the neighbouring NUMA domains would all
      remain idle.
      
      Fix this, by checking if a CPU was woken up to do nohz idle load
      balancing, before it does load balancing upon itself. This way we allow
      idle CPUs across the system to do load balancing which results in
      quicker spread of load, instead of performing load balancing within the
      local sched domain hierarchy of the ILB CPU alone under circumstances
      such as above.
      Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Jason Low <jason.low2@hp.com>
      Cc: benh@kernel.crashing.org
      Cc: daniel.lezcano@linaro.org
      Cc: efault@gmx.de
      Cc: iamjoonsoo.kim@lge.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: riel@redhat.com
      Cc: srikar@linux.vnet.ibm.com
      Cc: svaidy@linux.vnet.ibm.com
      Cc: tim.c.chen@linux.intel.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/20150326130014.21532.17158.stgit@preeti.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Optimize freq invariant accounting · dfbca41f
      Committed by Peter Zijlstra
      Currently the freq invariant accounting (in
      __update_entity_runnable_avg() and sched_rt_avg_update()) gets the
      scale factor from a weak function call. This means that even for archs
      that use the default implementation the compiler cannot see into this
      function and optimize the extra scaling math away.
      
      This is sad, especially since it's a 64-bit multiplication which can be
      quite costly on some platforms.
      
      So replace the weak function with #ifdef and __always_inline goo. This
      is not quite as nice from an arch support PoV but should at least
      result in compile time errors if done wrong.
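
A minimal sketch of the #ifdef plus always-inline pattern, assuming GCC or Clang for the attribute. The one-argument signature and the `scale_delta()` caller are simplifications made up for this illustration and are not claimed to match the kernel's declarations.

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1UL << SCHED_CAPACITY_SHIFT)

/*
 * An architecture with its own implementation would define a macro named
 * arch_scale_freq_capacity in a header included before this point. When no
 * such macro exists, the visible inline default below is used.
 */
#ifndef arch_scale_freq_capacity
static inline __attribute__((always_inline))
unsigned long arch_scale_freq_capacity(int cpu)
{
	(void)cpu;
	return SCHED_CAPACITY_SCALE; /* default: no frequency scaling */
}
#endif

/* Hypothetical caller: scales a time delta by the frequency factor. */
static unsigned long scale_delta(unsigned long delta, int cpu)
{
	assert(cpu >= 0);
	return (delta * arch_scale_freq_capacity(cpu)) >> SCHED_CAPACITY_SHIFT;
}
```

Since the default body is visible at the call site, the compiler can see that the multiply and shift cancel out, which a weak out-of-line function would hide from it.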
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: Paul Turner <pjt@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/20150323131905.GF23123@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Move CFS tasks to CPUs with higher capacity · 1aaf90a4
      Committed by Vincent Guittot
      When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
      capacity for CFS tasks can be significantly reduced. Once we detect such a
      situation by comparing cpu_capacity_orig and cpu_capacity, we trigger an idle
      load balance to check if it's worth moving its tasks to an idle CPU.
      
      It's worth trying to move the task before the CPU is fully utilized to
      minimize the preemption by irq or RT tasks.
      
      Once the idle load_balance has selected the busiest CPU, it will look for an
      active load balance for only two cases:
      
        - There is only 1 task on the busiest CPU.
      
        - We haven't been able to move a task of the busiest rq.
      
      A CPU with a reduced capacity is included in the 1st case, and it's worth
      actively migrating its task if the idle CPU has more available capacity for
      CFS tasks. This test has been added in need_active_balance().
      
      As a sidenote, this will not generate more spurious ILBs because we already
      trigger an ILB if there is more than 1 busy CPU. If this CPU is the only one that
      has a task, we will trigger the ILB once for migrating the task.
      
      The nohz_kick_needed() function has been cleaned up a bit while adding the new
      test.
      
      env.src_cpu and env.src_rq must be set unconditionally because they are used
      in need_active_balance(), which is called even if busiest->nr_running equals 1.
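
The comparison of cpu_capacity_orig and cpu_capacity might look roughly like the sketch below. The percentage threshold `margin_pct`, the structure and both function names are assumptions made for illustration; the message above does not say how "noticeably reduced" is decided.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU capacity pair. */
struct cpu_caps {
	unsigned long capacity_orig; /* capacity before RT/IRQ pressure */
	unsigned long capacity;      /* capacity left for CFS tasks */
};

/* True when the CFS capacity has dropped below orig * 100 / margin_pct. */
static bool capacity_noticeably_reduced(const struct cpu_caps *c,
					unsigned int margin_pct)
{
	assert(c->capacity <= c->capacity_orig);
	return c->capacity * margin_pct < c->capacity_orig * 100;
}

/* Active migration is worthwhile only if the idle CPU offers more capacity. */
static bool worth_active_migration(const struct cpu_caps *src,
				   const struct cpu_caps *idle_dst,
				   unsigned int margin_pct)
{
	return capacity_noticeably_reduced(src, margin_pct) &&
	       idle_dst->capacity > src->capacity;
}
```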
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-12-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Add SD_PREFER_SIBLING for SMT level · caff37ef
      Committed by Vincent Guittot
      Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
      the scheduler will place at least one task per core.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-11-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Remove unused struct sched_group_capacity::capacity_orig · dc7ff76e
      Committed by Vincent Guittot
      The 'struct sched_group_capacity::capacity_orig' field is no longer used
      in the scheduler so we can remove it.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425378903-5349-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Replace capacity_factor by usage · ea67821b
      Committed by Vincent Guittot
      The scheduler tries to compute how many tasks a group of CPUs can handle by
      assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
      SCHED_CAPACITY_SCALE.
      
      'struct sg_lb_stats:group_capacity_factor' divides the capacity of the group
      by SCHED_LOAD_SCALE to estimate how many tasks can run in the group. Then, it
      compares this value with the sum of nr_running to decide if the group is
      overloaded or not.
      
      But the 'group_capacity_factor' concept hardly works for SMT systems: it
      sometimes works for big cores but fails to do the right thing for little cores.
      
      Below are two examples to illustrate the problem that this patch solves:
      
      1 - If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
         (640 as an example), a group of 3 CPUs will have a max capacity_factor of 2
         (div_round_closest(3*640/1024) = 2) which means that it will be seen as
         overloaded even if we have only one task per CPU.
      
      2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
         (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
         (at max and thanks to the fix [0] for SMT systems that prevents the appearance
         of ghost CPUs) but if one CPU is fully used by rt tasks (and its capacity is
         reduced to nearly nothing), the capacity factor of the group will still be 4
         (div_round_closest(3*1512/1024) = 5 which is capped to 4 with [0]).
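
The first example can be reproduced with a few lines of C. This is a sketch of the old capacity_factor arithmetic only; the helper names are hypothetical, and the rounding helper assumes the usual round-half-up behaviour for unsigned operands.

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* Division rounded to the closest integer, for unsigned operands. */
static unsigned long div_round_closest(unsigned long n, unsigned long d)
{
	assert(d > 0);
	return (n + d / 2) / d;
}

/* Estimated number of tasks a group of identical CPUs can handle. */
static unsigned long capacity_factor(unsigned long nr_cpus,
				     unsigned long cpu_capacity)
{
	return div_round_closest(nr_cpus * cpu_capacity, SCHED_CAPACITY_SCALE);
}

/* Old-style overload test: more running tasks than the capacity factor. */
static bool looks_overloaded(unsigned long nr_running, unsigned long nr_cpus,
			     unsigned long cpu_capacity)
{
	return nr_running > capacity_factor(nr_cpus, cpu_capacity);
}
```

With three CPUs of capacity 640, capacity_factor() yields 2, so three tasks (one per CPU) already count as overload, which is the misclassification example 1 describes.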
      
      So, this patch tries to solve this issue by removing capacity_factor and
      replacing it with the 2 following metrics:
      
        - The available CPU's capacity for CFS tasks which is already used by
          load_balance().
      
        - The usage of the CPU by the CFS tasks. For the latter, utilization_avg_contrib
          has been re-introduced to compute the usage of a CPU by CFS tasks.
      
      'group_capacity_factor' and 'group_has_free_capacity' have been removed and replaced
      by 'group_no_capacity'. We compare the number of tasks with the number of CPUs and
      we evaluate the level of utilization of the CPUs to define if a group is
      overloaded or if a group has capacity to handle more tasks.
      
      For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
      so it will be selected in priority (among the overloaded groups). Since [1],
      SD_PREFER_SIBLING is no longer involved in the computation of 'load_above_capacity'
      because local is not overloaded.
      
      [1] 9a5d9ba6 ("sched/fair: Allow calculate_imbalance() to move idle cpus")
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1425052454-25797-9-git-send-email-vincent.guittot@linaro.org
      [ Tidied up the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Calculate CPU's usage statistic and put it into struct sg_lb_stats::group_usage · 8bb5b00c
      Committed by Vincent Guittot
      Monitor the usage level of each group of each sched_domain level. The usage is
      the portion of cpu_capacity_orig that is currently used on a CPU or group of
      CPUs. We use the utilization_load_avg to evaluate the usage level of each
      group.
      
      The utilization_load_avg only takes into account the running time of the CFS
      tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
      utilized. Nevertheless, we must cap utilization_load_avg which can be
      temporarily greater than SCHED_LOAD_SCALE after the migration of a task onto this
      CPU and until the metrics are stabilized.
      
      The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
      running load on the CPU whereas the available capacity for the CFS task is in
      the range [0..cpu_capacity_orig]. In order to test if a CPU is fully utilized
      by CFS tasks, we have to scale the utilization into the cpu_capacity_orig range
      of the CPU to get the usage of the latter. The usage can then be compared with
      the available capacity (i.e. cpu_capacity) to deduce the usage level of a CPU.
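
The capping and rescaling described above can be sketched as follows, assuming SCHED_LOAD_SCALE is 1024 (shift of 10). The function name `cpu_usage()` and its plain-value parameters are hypothetical; the real code reads these values from per-CPU runqueue data.

```c
#include <assert.h>

#define SCHED_LOAD_SHIFT 10
#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT)

/*
 * Map a utilization in [0..SCHED_LOAD_SCALE] into [0..cpu_capacity_orig].
 * Values above SCHED_LOAD_SCALE (possible right after a migration) are capped.
 */
static unsigned long cpu_usage(unsigned long utilization_load_avg,
			       unsigned long cpu_capacity_orig)
{
	assert(cpu_capacity_orig > 0);
	if (utilization_load_avg >= SCHED_LOAD_SCALE)
		return cpu_capacity_orig;
	return (utilization_load_avg * cpu_capacity_orig) >> SCHED_LOAD_SHIFT;
}
```

For instance, a half-utilized CPU (512) with an original capacity of 640 gives a usage of 320, which can then be compared with that CPU's cpu_capacity.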
      
      The frequency scaling invariance of the usage is not taken into account in this
      patch, it will be solved in another patch which will deal with frequency
      scaling invariance on the utilization_load_avg.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425455327-13508-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Add struct rq::cpu_capacity_orig · ca6d75e6
      Committed by Vincent Guittot
      This new field 'cpu_capacity_orig' reflects the original capacity of a CPU
      before being altered by rt tasks and/or IRQs.
      
      The cpu_capacity_orig will be used:
      
        - to detect when the capacity of a CPU has been noticeably reduced so we can
          trigger a load balance to look for a CPU with better capacity. As an example, we
          can detect when a CPU handles a significant amount of irq
          (with CONFIG_IRQ_TIME_ACCOUNTING) but this CPU is seen as an idle CPU by
          the scheduler whereas CPUs, which are really idle, are available.
      
        - to evaluate the available capacity for CFS tasks
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-7-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Make scale_rt invariant with frequency · b5b4860d
      Committed by Vincent Guittot
      The average running time of RT tasks is used to estimate the remaining compute
      capacity for CFS tasks. This remaining capacity is the original capacity scaled
      down by a factor (aka scale_rt_capacity). This estimation of available capacity
      must also be invariant with frequency scaling.
      
      A frequency scaling factor is applied on the running time of the RT tasks for
      computing scale_rt_capacity.
      
      In sched_rt_avg_update(), we now scale the RT execution time like below:
      
        rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT
      
      Then, scale_rt_capacity can be summarized by:
      
        scale_rt_capacity = SCHED_CAPACITY_SCALE * available / total
      
      with available = total - rq->rt_avg
      
      This has been optimized in current code by:
      
        scale_rt_capacity = available / (total >> SCHED_CAPACITY_SHIFT)
      
      But we can also expand the equation like below:
      
        scale_rt_capacity = SCHED_CAPACITY_SCALE - ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / total)
      
      and we can optimize the equation by removing the SCHED_CAPACITY_SHIFT shift in
      the computation of rq->rt_avg and scale_rt_capacity().
      
      so rq->rt_avg += rt_delta * arch_scale_freq_capacity()
      and
      scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total)
      
      arch_scale_freq_capacity() will be called in the hot path of the scheduler,
      which means the function must be short and efficient.
      
      As an example, arch_scale_freq_capacity() should return a cached value that
      is updated periodically outside of the hot path.
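
The final pair of equations can be tried out in userspace with the sketch below. The function names, the 64-bit types, and the clamp to 0 when the RT share reaches the full scale are assumptions of this illustration, not statements about the kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1ULL << SCHED_CAPACITY_SHIFT)

/* rt_avg += rt_delta * freq_scale, with no shift applied at this stage. */
static uint64_t rt_avg_add(uint64_t rt_avg, uint64_t rt_delta,
			   uint64_t freq_scale)
{
	assert(freq_scale <= SCHED_CAPACITY_SCALE);
	return rt_avg + rt_delta * freq_scale;
}

/* scale_rt_capacity = SCHED_CAPACITY_SCALE - (rt_avg / total) */
static uint64_t scale_rt_capacity(uint64_t rt_avg, uint64_t total)
{
	assert(total > 0);
	uint64_t used = rt_avg / total;

	if (used >= SCHED_CAPACITY_SCALE)
		return 0; /* clamp: an assumption of this sketch */
	return SCHED_CAPACITY_SCALE - used;
}
```

If RT tasks run for half of the period at full frequency (scale 1024), the remaining capacity comes out as 512; the same run time at half frequency (scale 512) leaves 768.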
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-6-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Make sched entity usage tracking scale-invariant · 0c1dc6b2
      Committed by Morten Rasmussen
      Apply frequency scale-invariance correction factor to usage tracking.
      
      Each segment of the running_avg_sum geometric series is now scaled by the
      current frequency so the utilization_avg_contrib of each entity will be
      invariant with frequency scaling.
      
      As a result, utilization_load_avg, which is the sum of utilization_avg_contrib,
      becomes invariant too. So the usage level that is returned by get_cpu_usage()
      stays relative to the max frequency, like the cpu_capacity which it is compared against.
      
      Then, we want to keep the load tracking values in a 32-bit type, which implies
      that the max value of {runnable|running}_avg_sum must be lower than
      2^32/88761=48388 (88761 is the max weight of a task). As LOAD_AVG_MAX = 47742,
      arch_scale_freq_capacity() must return a value less than
      (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE = 1024).
      So we define the range to [0..SCHED_CAPACITY_SCALE] in order to avoid overflow.
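
The two bounds quoted above can be checked with integer arithmetic. The macro and function names below are hypothetical; only the constants 88761, 47742 and the shift of 10 come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define MAX_TASK_WEIGHT 88761ULL /* max weight of a task, from the text */
#define LOAD_AVG_MAX    47742ULL /* from the text */

/* Largest avg_sum whose product with the max weight still fits in 32 bits. */
static uint64_t max_avg_sum(void)
{
	return (1ULL << 32) / MAX_TASK_WEIGHT;
}

/* Upper bound for the frequency scale factor, in SCHED_CAPACITY units. */
static uint64_t max_freq_scale(void)
{
	return (max_avg_sum() << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
}

/* Confirms the figures given in the commit message. */
static void check_commit_figures(void)
{
	assert(max_avg_sum() == 48388);
	assert(max_freq_scale() == 1037);
}
```

Since 1037 is only slightly above 1024, restricting the scale factor to [0..1024] keeps the weighted sums inside a 32-bit value.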
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425455186-13451-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Remove frequency scaling from cpu_capacity · a8faa8f5
      Committed by Vincent Guittot
      Now that arch_scale_cpu_capacity has been introduced to scale the original
      capacity, the arch_scale_freq_capacity is no longer used (it was
      previously used by ARM arch).
      
      Remove arch_scale_freq_capacity from the computation of cpu_capacity.
      The frequency invariance will be handled in the load tracking and not in
      the CPU capacity. arch_scale_freq_capacity will be revisited for scaling
      load with the current frequency of the CPUs in a later patch.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-4-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Track group sched_entity usage contributions · 21f44866
      Committed by Morten Rasmussen
      Add usage contribution tracking for group entities. Unlike
      se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
      entities is the sum of se->avg.utilization_avg_contrib for all entities on the
      group runqueue.
      
      It is _not_ influenced in any way by the task group h_load. Hence it
      represents the actual CPU usage of the group, not its intended load
      contribution which may differ significantly from the utilization on
      lightly utilized systems.
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-3-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>