1. 13 September 2015, 6 commits
  2. 12 August 2015, 2 commits
  3. 03 August 2015, 9 commits
    • sched/fair: Clean up load average references · 7ea241af
      Authored by Yuyang Du
      For cfs_rq, we have load.weight, runnable_load_avg, and load_avg.
      Clean up how they are used:
      
        - First, as group sched_entity already largely uses load_avg, we now expand
          to use load_avg in all cases.
      
        - Second, for CPU-wide load balancing, we choose to use runnable_load_avg
          in all cases, which is the same as before this series.
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-8-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Provide runnable_load_avg back to cfs_rq · 13962234
      Authored by Yuyang Du
      The cfs_rq's load_avg is composed of runnable_load_avg and blocked_load_avg.
      Before this series, sometimes runnable_load_avg was used and sometimes
      load_avg. Completely replacing all uses of runnable_load_avg with load_avg
      may be too big a leap, since including blocked_load_avg risks overstating
      the load. Therefore, we bring runnable_load_avg back.

      The new cfs_rq runnable_load_avg is updated with all of the runnable
      sched_entities at the same time, which solves the problem of one
      sched_entity being up to date while the others are stale.
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-7-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Remove task and group entity load when they are dead · 12695578
      Authored by Yuyang Du
      When a task exits or a group is destroyed, the entity's load should be
      removed from its parent cfs_rq's load. Otherwise, it will take time
      for the parent cfs_rq to decay the dead entity's load to 0, which
      is not desired.
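      A minimal, self-contained sketch of the idea (illustrative toy code, not
      the kernel implementation; the struct and function names are made up):

        #include <stdio.h>

        /* Toy model: when an entity dies, subtract its contribution from the
         * parent runqueue's average right away instead of waiting for it to
         * decay to zero over time. */
        struct toy_avg { unsigned long load_avg; };

        static void remove_dead_entity_load(struct toy_avg *cfs_rq, struct toy_avg *se)
        {
            cfs_rq->load_avg -= (se->load_avg < cfs_rq->load_avg)
                                    ? se->load_avg : cfs_rq->load_avg;
            se->load_avg = 0;
        }

        int main(void)
        {
            struct toy_avg rq = { .load_avg = 2048 }, task = { .load_avg = 512 };

            remove_dead_entity_load(&rq, &task);
            printf("rq load_avg after task exit: %lu\n", rq.load_avg); /* 1536 */
            return 0;
        }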
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-6-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Init cfs_rq's sched_entity load average · 540247fb
      Authored by Yuyang Du
      The runnable load and utilization averages of a cfs_rq's sched_entity
      were not initialized. As is already done for tasks, give a new cfs_rq's
      sched_entity starting values so that it carries weight during its infancy.
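      A tiny standalone sketch of the intent (toy code only, not the kernel's
      actual init path; names are hypothetical):

        #include <stdio.h>

        /* Toy model: a newly created entity starts out with its full weight as
         * its load average, so it is not treated as nearly weightless while
         * its running average is still warming up. */
        struct toy_avg { unsigned long load_avg; };

        static void init_entity_load_avg(struct toy_avg *sa, unsigned long weight)
        {
            sa->load_avg = weight;  /* assume full contribution from time zero */
        }

        int main(void)
        {
            struct toy_avg sa;

            init_entity_load_avg(&sa, 1024);
            printf("initial load_avg = %lu\n", sa.load_avg);
            return 0;
        }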
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-5-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Implement update_blocked_averages() for CONFIG_FAIR_GROUP_SCHED=n · 6c1d47c0
      Authored by Vincent Guittot
      The load and the utilization of idle CPUs must be updated periodically in
      order to decay the blocked part.
      
      If CONFIG_FAIR_GROUP_SCHED is not set, the load and utilization of idle
      CPUs are not decayed and stay at the values set before the CPU became idle.
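      A self-contained sketch of the periodic decay this provides (illustrative
      only; the fixed-point factor below is a stand-in chosen so that roughly 32
      periods halve the load, not the kernel's exact constant):

        #include <stdio.h>

        /* Toy model: an idle CPU's blocked load must keep decaying while the
         * CPU stays idle, otherwise it is frozen at its pre-idle value. */
        static unsigned long decay_load(unsigned long load, unsigned int periods)
        {
            while (periods--)
                load = load * 1002 / 1024;  /* ~0.978 per period */
            return load;
        }

        int main(void)
        {
            unsigned long blocked = 1024;  /* load left when the CPU went idle */

            printf("after  32 idle periods: %lu\n", decay_load(blocked, 32));
            printf("after 128 idle periods: %lu\n", decay_load(blocked, 128));
            return 0;
        }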
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Link: http://lkml.kernel.org/r/1436918682-4971-4-git-send-email-yuyang.du@intel.com
      [ Fixed up the SOB chain. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Rewrite runnable load and utilization average tracking · 9d89c257
      Authored by Yuyang Du
      The idea of runnable load average (let runnable time contribute to weight)
      was proposed by Paul Turner and Ben Segall, and it is still followed by
      this rewrite. This rewrite aims to solve the following issues:
      
      1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
         updated at the granularity of one entity at a time, which results in the
         cfs_rq's load average being stale or only partially updated: at any time,
         only one entity is up to date; all other entities are effectively lagging
         behind. This is undesirable.
      
         To illustrate, if we have n runnable entities in the cfs_rq, as time
         elapses, they certainly become outdated:
      
           t0: cfs_rq { e1_old, e2_old, ..., en_old }
      
         and when we update:
      
           t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }
      
           t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }
      
           ...
      
         We solve this by combining all runnable entities' load averages together
         in the cfs_rq's avg, and updating the cfs_rq's avg as a whole. This is
         based on the fact that if we regard the update as a function (a compact
         restatement follows this list), then:

           w * update(e) = update(w * e), and

           update(e1) + update(e2) = update(e1 + e2), so

           w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)
      
         therefore, by this rewrite, we have an entirely updated cfs_rq at the
         time we update it:
      
           t1: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           t2: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           ...
      
      2. The cfs_rq load average differs between the top-level rq->cfs_rq and the
         task_groups' per-CPU cfs_rqs in whether or not blocked_load_avg
         contributes to the load.
      
         The basic idea behind runnable load average (the same for utilization)
         is that the blocked state is taken into account as opposed to only
         accounting for the currently runnable state. Therefore, the average
         should include both the runnable/running and blocked load averages.
         This rewrite does that.
      
         In addition, we also combine runnable/running and blocked averages
         of all entities into the cfs_rq's average, and update it together at
         once. This is based on the fact that:
      
           update(runnable) + update(blocked) = update(runnable + blocked)
      
         This significantly reduces the code as we don't need to separately
         maintain/update runnable/running load and blocked load.
      
      3. How task_group entities' share is calculated is complex and imprecise.
      
         We reduce the complexity in this rewrite to allow a very simple rule:
         the task_group's load_avg is aggregated from its per CPU cfs_rqs's
         load_avgs. Then group entity's weight is simply proportional to its
         own cfs_rq's load_avg / task_group's load_avg. To illustrate,
      
         if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,
      
         task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then
      
         cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share
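
      A compact restatement of the linearity argument from point 1 above, under
      the assumption (in the spirit of PELT-style tracking) that one update step
      multiplies the accumulated average by a per-period decay factor y, with the
      newly accrued runnable time accounted per entity on top:

        \[
        \mathrm{update}(x) = y\,x
        \;\Rightarrow\;
        \mathrm{update}(w_1 e_1 + w_2 e_2)
          = y\,(w_1 e_1 + w_2 e_2)
          = w_1\,\mathrm{update}(e_1) + w_2\,\mathrm{update}(e_2)
        \]

      So decaying the cfs_rq's combined average once is equivalent to decaying
      every entity's average individually, which is what makes the single
      whole-cfs_rq update described above valid.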
      
      To sum up, this rewrite is in principle equivalent to the current code, but
      fixes the issues described above. It turns out that it also significantly
      reduces code complexity and hence increases clarity and efficiency. In
      addition, the new averages are smoother and more continuous (no spurious
      spikes and valleys) and are updated more consistently and quickly to
      reflect the load dynamics.
      
      As a result, we have less load tracking overhead, better performance,
      and especially better power efficiency due to more balanced load.
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Remove rq's runnable avg · cd126afe
      Authored by Yuyang Du
      The current rq->avg has not been used at all since it was merged into the
      kernel, and the code is in the scheduler's hot path, so remove it.
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-2-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Beef up wake_wide() · 63b0e9ed
      Authored by Mike Galbraith
      Josef Bacik reported that Facebook sees better performance with their
      1:N load (1 dispatch/node, N workers/node) when carrying an old patch
      to try very hard to wake to an idle CPU.  While looking at wake_wide(),
      I noticed that it doesn't pay attention to the wakeup of a many-partner
      waker, returning 1 only when that waker wakes one of its many partners.
      
      Correct that, letting explicit domain flags override the heuristic.
      
      While at it, adjust the task_struct bits; we don't need a 64-bit counter.
      Tested-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      [ Tidy things up. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-team<Kernel-team@fb.com>
      Cc: morten.rasmussen@arm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1436888390.7983.49.camel@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Avoid pulling all tasks in idle balancing · 985d3a4c
      Authored by Yuyang Du
      In idle balancing, where a CPU going idle pulls tasks from another CPU, a
      livelock may happen if the CPU pulls all tasks from the other CPU, making
      it idle in turn, and this repeats. So just avoid this.
      Reported-by: Rabin Vincent <rabin.vincent@axis.com>
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150705221151.GF5197@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 07 July 2015, 5 commits
  5. 06 July 2015, 1 commit
  6. 04 July 2015, 1 commit
    • sched/numa: Fix numa balancing stats in /proc/pid/sched · 397f2378
      Authored by Srikar Dronamraju
      Commit 44dba3d5 ("sched: Refactor task_struct to use
      numa_faults instead of numa_* pointers") modified the way
      tsk->numa_faults stats are accounted.
      
      However, that commit never touched show_numa_stats(), which is what
      /proc/pid/sched displays, and thus the numbers shown in /proc/pid/sched
      don't match the actual numbers.

      Fix it by making sure that /proc/pid/sched reflects the task fault
      numbers. Also add group fault stats.

      A couple of other modifications are also included:
      
      1. Format changes:
      
        - Previously we would list two entries per node, one for private
          and one for shared. Also the home node info was listed in each entry.
      
        - Now preferred node, total_faults and current node are
          displayed separately.
      
        - Now there is one entry per node, listing private and shared task and
          group faults.
      
      2. Unit changes:
      
        - p->numa_pages_migrated was getting reset after every read of
          /proc/pid/sched. It's more useful to have absolute numbers since
          differential migrations between two accesses can be more easily
          calculated.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Iulia Manda <iulia.manda21@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1435252903-1081-4-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 19 June 2015, 1 commit
  8. 11 June 2015, 1 commit
    • sched, numa: do not hint for NUMA balancing on VM_MIXEDMAP mappings · 8e76d4ee
      Authored by Mel Gorman
      Jovi Zhangwei reported the following problem:
      
        Below kernel vm bug can be triggered by tcpdump which mmaped a lot of pages
        with GFP_COMP flag.
      
        [Mon May 25 05:29:33 2015] page:ffffea0015414000 count:66 mapcount:1 mapping:          (null) index:0x0
        [Mon May 25 05:29:33 2015] flags: 0x20047580004000(head)
        [Mon May 25 05:29:33 2015] page dumped because: VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page))
        [Mon May 25 05:29:33 2015] ------------[ cut here ]------------
        [Mon May 25 05:29:33 2015] kernel BUG at mm/migrate.c:1661!
        [Mon May 25 05:29:33 2015] invalid opcode: 0000 [#1] SMP
      
      In this case it was triggered by running tcpdump, but it is not necessarily
      reproducible on all systems.
      
        sudo tcpdump -i bond0.100 'tcp port 4242' -c 100000000000 -w 4242.pcap
      
      Compound pages cannot be migrated and it was not expected that such pages
      be marked for NUMA balancing.  This did not take into account that drivers
      such as net/packet/af_packet.c may insert compound pages into userspace
      with vm_insert_page(). This patch tells the NUMA balancing protection
      scanner to skip all VM_MIXEDMAP mappings, which avoids the possibility that
      compound pages are marked for migration.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Jovi Zhangwei <jovi@cloudflare.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 07 June 2015, 3 commits
    • sched/numa: Only consider less busy nodes as numa balancing destinations · 6f9aad0b
      Authored by Rik van Riel
      Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks
      the preferred node") fixes an issue where workloads would never
      converge on a fully loaded (or overloaded) system.
      
      However, it introduces a regression on less than fully loaded systems,
      where workloads converge on a few NUMA nodes, instead of properly
      staying spread out across the whole system. This leads to a reduction
      in available memory bandwidth, and usable CPU cache, with predictable
      performance problems.
      
      The root cause appears to be an interaction between the load balancer
      and NUMA balancing, where the short term load represented by the load
      balancer differs from the long term load the NUMA balancing code would
      like to base its decisions on.
      
      Simply reverting a43455a1 would re-introduce the non-convergence
      of workloads on fully loaded systems, so that is not a good option. As
      an aside, the check done before a43455a1 only applied to a task's
      preferred node, not to other candidate nodes in the system, so the
      converge-on-too-few-nodes problem still happens, just to a lesser
      degree.
      
      Instead, try to compensate for the impedance mismatch between the load
      balancer and NUMA balancing by only ever considering a lesser loaded
      node as a destination for NUMA balancing, regardless of whether the
      task is trying to move to the preferred node, or to another node.
      
      This patch also addresses the issue that a system with a single
      runnable thread would never migrate that thread to near its memory,
      introduced by 095bebf6 ("sched/numa: Do not move past the balance
      point if unbalanced").
      
      A test where the main thread creates a large memory area, and spawns a
      worker thread to iterate over the memory (placed on another node by
      select_task_rq_fair), after which the main thread goes to sleep and
      waits for the worker thread to loop over all the memory now sees the
      worker thread migrated to where the memory is, instead of having all
      the memory migrated over like before.
      
      Jirka has run a number of performance tests on several systems: single
      instance SpecJBB 2005 performance is 7-15% higher on a 4 node system,
      with higher gains on systems with more cores per socket.
      Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
      little or no changes with the revert of 095bebf6 and this patch.
      Reported-by: Artem Bityutskiy <dedekind1@gmail.com>
      Reported-by: Jirka Hladky <jhladky@redhat.com>
      Tested-by: Jirka Hladky <jhladky@redhat.com>
      Tested-by: Artem Bityutskiy <dedekind1@gmail.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150528095249.3083ade0@annuminas.surriel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • Revert 095bebf6 ("sched/numa: Do not move past the balance point if unbalanced") · e4991b24
      Authored by Rik van Riel
      Commit 095bebf6 ("sched/numa: Do not move past the balance point
      if unbalanced") broke convergence of workloads with just one runnable
      thread, by making it impossible for the one runnable thread on the
      system to move from one NUMA node to another.
      
      Instead, the thread would remain where it was, and pull all the memory
      across to its location, which is much slower than just migrating the
      thread to where the memory is.
      
      The next patch has a better fix for the issue that 095bebf6 tried
      to address.
      Reported-by: Jirka Hladky <jhladky@redhat.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dedekind1@gmail.com
      Cc: mgorman@suse.de
      Link: http://lkml.kernel.org/r/1432753468-7785-2-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Prevent throttling in early pick_next_task_fair() · 54d27365
      Authored by Ben Segall
      The optimized task selection logic optimistically selects a new task
      to run without first doing a full put_prev_task(). This is so that we
      can avoid a put/set on the common ancestors of the old and new task.
      
      Similarly, we should only call check_cfs_rq_runtime() to throttle
      eligible groups if they're part of the common ancestry, otherwise it
      is possible to end up with no eligible task in the simple task
      selection.
      
      Imagine:
      		/root
      	/prev		/next
      	/A		/B
      
      If our optimistic selection ends up throttling /next, we goto simple
      and our put_prev_task() ends up throttling /prev, after which we're
      going to bug out in set_next_entity() because there aren't any tasks
      left.
      
      Avoid this scenario by only throttling common ancestors.
      Reported-by: Mohammed Naser <mnaser@vexxhost.com>
      Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Ben Segall <bsegall@google.com>
      [ munged Changelog ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: pjt@google.com
      Fixes: 678d5718 ("sched/fair: Optimize cgroup pick_next_task_fair()")
      Link: http://lkml.kernel.org/r/xm26wq1oswoq.fsf@sword-of-the-dawn.mtv.corp.google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 19 May 2015, 1 commit
  11. 18 May 2015, 1 commit
    • sched,perf: Fix periodic timers · 4cfafd30
      Authored by Peter Zijlstra
      In the below two commits (see Fixes) we have periodic timers that can
      stop themselves when they're no longer required, but need to be
      (re)-started when their idle condition changes.
      
      A further complication is that we want the timer handler to always do the
      forward, so that it always correctly deals with overruns, and we do not
      want to race such that the handler has already decided to stop, but the
      (external) restart sees the timer as still active and we end up with a
      'lost' timer.
      
      The problem with the current code is that the re-start can come before
      the callback does the forward, at which point the forward from the
      callback will WARN about forwarding an enqueued timer.
      
      Now, conceptually it's easy to detect whether you're before or after the
      forward by comparing the expiration time against the current time. Of course,
      that's expensive (and racy) because we don't have the current time.
      
      Alternatively one could cache this state inside the timer, but then
      everybody pays the overhead of maintaining this extra state, and that
      is undesired.
      
      The only other option that I could see is the external timer_active
      variable, which I tried to kill before. I would love a nicer interface
      for this seemingly simple 'problem' but alas.
      
      Fixes: 272325c4 ("perf: Fix mux_interval hrtimer wreckage")
      Fixes: 77a4d1a1 ("sched: Cleanup bandwidth timers")
      Cc: pjt@google.com
      Cc: tglx@linutronix.de
      Cc: klamm@yandex-team.ru
      Cc: mingo@kernel.org
      Cc: bsegall@google.com
      Cc: hpa@zytor.com
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
  12. 17 May 2015, 1 commit
    • sched: Fix function declaration return type mismatch · 58ac93e4
      Authored by Nicholas Mc Guire
      Static code checking was unhappy with:
      
        ./kernel/sched/fair.c:162 WARNING: return of wrong type
                      int != unsigned int
      
      get_update_sysctl_factor() is declared to return int but is
      currently  returning an unsigned int. The first few preprocessed
      lines are:
      
       static int get_update_sysctl_factor(void)
       {
       unsigned int cpus = ({ int __min1 = (cpumask_weight(cpu_online_mask));
       int __min2 = (8); __min1 < __min2 ? __min1: __min2; });
       unsigned int factor;
      
      The type used by min_t() should be 'unsigned int', and the return type of
      get_update_sysctl_factor() should also be 'unsigned int', as its call site
      update_sysctl() expects 'unsigned int' and the values involved:
      
        'factor'
        'sysctl_sched_min_granularity'
        'sched_nr_latency'
        'sysctl_sched_wakeup_granularity'
      
      ... are also all 'unsigned int', plus cpumask_weight() is also
      returning 'unsigned int'.
      
      So the natural type to use around here is 'unsigned int'.
      
      ( Patch was compile tested with x86_64_defconfig +
        CONFIG_SCHED_DEBUG=y and the changed sections in
        kernel/sched/fair.i were reviewed. )
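      A standalone sketch of the shape of the fix (toy code mirroring the
      reasoning above, not the actual kernel hunk; MIN_T stands in for the
      kernel's min_t() using the same GNU C statement expression seen in the
      preprocessed snippet, and the log2-style factor is just one plausible
      scaling):

        #include <stdio.h>

        /* Keep everything 'unsigned int' so the return type matches what the
         * caller expects and what cpumask_weight()-style helpers provide. */
        #define MIN_T(type, a, b) ({ type _a = (a); type _b = (b); _a < _b ? _a : _b; })

        static unsigned int get_update_sysctl_factor(unsigned int online_cpus)
        {
            unsigned int cpus = MIN_T(unsigned int, online_cpus, 8u);
            unsigned int factor = 1;

            while (cpus >>= 1)      /* 1 + log2(cpus) */
                factor++;

            return factor;
        }

        int main(void)
        {
            printf("factor for 8 CPUs: %u\n", get_update_sysctl_factor(8));
            return 0;
        }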
      Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
      [ Improved the changelog a bit. ]
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431716742-11077-1-git-send-email-hofrat@osadl.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  13. 08 May 2015, 3 commits
  14. 22 April 2015, 2 commits
  15. 08 April 2015, 1 commit
  16. 27 March 2015, 2 commits
    • sched: Improve load balancing in the presence of idle CPUs · d4573c3e
      Authored by Preeti U Murthy
      When a CPU is kicked to do nohz idle balancing, it wakes up to do load
      balancing on itself, followed by load balancing on behalf of idle CPUs.
      But it may end up with load after the load balancing attempt on itself.
      This aborts nohz idle balancing. As a result several idle CPUs are left
      without tasks till such a time that an ILB CPU finds it unfavorable to
      pull tasks upon itself. This delays spreading of load across idle CPUs
      and worse, clutters only a few CPUs with tasks.
      
      The effect of the above problem was observed on an SMT8 POWER server
      with 2 levels of numa domains. Busy loops equal to number of cores were
      spawned. Since load balancing on fork/exec is discouraged across numa
      domains, all busy loops would start on one of the numa domains. However
      it was expected that eventually one busy loop would run per core across
      all domains due to nohz idle load balancing. But it was observed that it
      took as long as 10 seconds to spread the load across numa domains.
      
      Further investigation showed that this was a consequence of the
      following:
      
       1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
          load balancing. [Given the experiment, up to 6 CPUs per core could be
          potentially idle in this domain.]
      
       2. However the ILB CPU would call load_balance() on itself before
          initiating nohz idle load balancing.
      
       3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
          tasks from its sibling cores to even out load.
      
       4. Now that the ILB CPU was no longer idle, it would abort nohz idle
          load balancing
      
      As a result the opportunities to spread load across numa domains were
      lost until such a time that the cores within the first numa domain had
      equal number of tasks among themselves.  This is a pretty bad scenario,
      since the cores within the first numa domain would have as many as 4
      tasks each, while cores in the neighbouring numa domains would all
      remain idle.
      
      Fix this, by checking if a CPU was woken up to do nohz idle load
      balancing, before it does load balancing upon itself. This way we allow
      idle CPUs across the system to do load balancing which results in
      quicker spread of load, instead of performing load balancing within the
      local sched domain hierarchy of the ILB CPU alone under circumstances
      such as above.
      Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Jason Low <jason.low2@hp.com>
      Cc: benh@kernel.crashing.org
      Cc: daniel.lezcano@linaro.org
      Cc: efault@gmx.de
      Cc: iamjoonsoo.kim@lge.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: riel@redhat.com
      Cc: srikar@linux.vnet.ibm.com
      Cc: svaidy@linux.vnet.ibm.com
      Cc: tim.c.chen@linux.intel.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/20150326130014.21532.17158.stgit@preeti.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Optimize freq invariant accounting · dfbca41f
      Authored by Peter Zijlstra
      Currently the freq invariant accounting (in
      __update_entity_runnable_avg() and sched_rt_avg_update()) gets the
      scale factor from a weak function call; this means that even for archs
      that use the default implementation the compiler cannot see into this
      function and optimize the extra scaling math away.

      This is sad, especially since it's a 64-bit multiplication which can be
      quite costly on some platforms.

      So replace the weak function with #ifdef and __always_inline goo. This
      is not quite as nice from an arch support PoV but should at least
      result in compile-time errors if done wrong.
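      A self-contained sketch of the #ifdef + inline pattern being described
      (names and the scale value are illustrative, not the kernel's actual
      hooks):

        #include <stdio.h>

        #define TOY_CAPACITY_SCALE 1024UL

        /* If an architecture supplies its own hook it #defines the name; the
         * fallback below is a plain inline identity the compiler can see into,
         * so the multiply/divide folds away, unlike a weak function call. */
        #ifndef toy_arch_scale_freq_capacity
        static inline unsigned long toy_arch_scale_freq_capacity(int cpu)
        {
            (void)cpu;
            return TOY_CAPACITY_SCALE;
        }
        #endif

        int main(void)
        {
            unsigned long delta_ns = 3000000;

            /* scale an elapsed delta by the (identity) frequency factor */
            delta_ns = delta_ns * toy_arch_scale_freq_capacity(0) / TOY_CAPACITY_SCALE;
            printf("scaled delta: %lu\n", delta_ns);
            return 0;
        }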
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: Paul Turner <pjt@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/20150323131905.GF23123@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>