1. 10 Feb 2014, 3 commits
  2. 28 Jan 2014, 9 commits
  3. 23 Jan 2014, 1 commit
  4. 22 Jan 2014, 1 commit
    • sched: add tracepoints related to NUMA task migration · 286549dc
      Committed by Mel Gorman
      This patch adds three tracepoints:
       o trace_sched_move_numa	when a task is moved to a node
       o trace_sched_swap_numa	when a task is swapped with another task
       o trace_sched_stick_numa	when a numa-related migration fails
      
      The tracepoints allow NUMA scheduler activity to be monitored and
      the following high-level metrics to be calculated:
      
       o NUMA migrated stuck	 nr trace_sched_stick_numa
       o NUMA migrated idle	 nr trace_sched_move_numa
       o NUMA migrated swapped nr trace_sched_swap_numa
       o NUMA local swapped	 trace_sched_swap_numa src_nid == dst_nid (should never happen)
       o NUMA remote swapped	 trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped)
       o NUMA group swapped	 trace_sched_swap_numa src_ngid == dst_ngid
      			 Maybe a small number of these are acceptable
      			 but a high number would be a major surprise.
      			 It would be even worse if bounces are frequent.
       o NUMA avg task migs.	 Average number of migrations for tasks
       o NUMA stddev task mig	 Self-explanatory
       o NUMA max task migs.	 Maximum number of migrations for a single task
      
      In general the intent of the tracepoints is to help diagnose problems
      where automatic NUMA balancing appears to be doing an excessive amount
      of useless work.
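
      As a rough illustration, tracepoints of this kind are declared along
      the following lines (a hedged sketch in the style of
      include/trace/events/sched.h; the template name and field set below
      are assumptions rather than this patch's actual definitions):

        DECLARE_EVENT_CLASS(sched_numa_move_template,
                TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
                TP_ARGS(tsk, src_cpu, dst_cpu),
                TP_STRUCT__entry(
                        __field(pid_t, pid)
                        __field(int,   src_cpu)
                        __field(int,   dst_cpu)
                ),
                TP_fast_assign(
                        __entry->pid     = task_pid_nr(tsk);
                        __entry->src_cpu = src_cpu;
                        __entry->dst_cpu = dst_cpu;
                ),
                TP_printk("pid=%d src_cpu=%d dst_cpu=%d",
                          __entry->pid, __entry->src_cpu, __entry->dst_cpu)
        );

        /* move and stick share the shape; swap would carry both tasks */
        DEFINE_EVENT(sched_numa_move_template, sched_move_numa,
                TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
                TP_ARGS(tsk, src_cpu, dst_cpu));

        DEFINE_EVENT(sched_numa_move_template, sched_stick_numa,
                TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
                TP_ARGS(tsk, src_cpu, dst_cpu));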
      
      [akpm@linux-foundation.org: remove semicolon-after-if, repair coding-style]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 13 Jan 2014, 8 commits
  6. 12 Jan 2014, 1 commit
    • sched: Calculate effective load even if local weight is 0 · 9722c2da
      Committed by Rik van Riel
      Thomas Hellstrom bisected a regression in which erratic 3D
      performance is experienced on virtual machines, as measured by
      glxgears. The bisection identified commit 58d081b5 ("sched/numa:
      Avoid overloading CPUs on a preferred NUMA node"), which had
      modified the behaviour of effective_load(), as the culprit.
      
      effective_load() calculates the difference in system-wide load if a
      scheduling entity were moved to another CPU. The task group is not
      heavier as a result of the move, but overall system load can
      increase or decrease as a result of the change. Commit 58d081b5
      ("sched/numa: Avoid overloading CPUs on a preferred NUMA node")
      changed effective_load() to make it suitable for calculating whether
      a particular NUMA node was compute overloaded. To reduce the cost of
      the function, it assumed that a current sched entity weight of 0 was
      uninteresting, but that is not the case.
      
      wake_affine() uses a weight of 0 for sync wakeups on the grounds that it
      is assuming the waking task will sleep and not contribute to load in the
      near future. In this case, we still want to calculate the effective load
      of the sched entity hierarchy. As effective_load is no longer used by
      task_numa_compare since commit fb13c7ee (sched/numa: Use a system-wide
      search to find swap/migration candidates), this patch simply restores the
      historical behaviour.
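
      A minimal sketch of the restored behaviour, assuming the pre-fix
      guard short-circuited on wl == 0 (heavily simplified; the real
      function walks the task_group hierarchy):

        static long effective_load(struct task_group *tg, int cpu,
                                   long wl, long wg)
        {
                /*
                 * Pre-fix this read "if (!tg->parent || !wl)", which
                 * skipped the hierarchy walk for sync wakeups (wl == 0).
                 * Restored: bail out only in the trivial, non-cgroup case.
                 */
                if (!tg->parent)
                        return wl;

                /* ... walk up the hierarchy, propagating the change ... */

                return wl;
        }
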
      Reported-and-tested-by: Thomas Hellstrom <thellstrom@vmware.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      [ Wrote changelog ]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140106113912.GC6178@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 19 Dec 2013, 1 commit
  8. 17 Dec 2013, 4 commits
  9. 11 Dec 2013, 1 commit
    • sched/fair: Rework sched_fair time accounting · 9dbdb155
      Committed by Peter Zijlstra
      Christian suffers from a bad BIOS that wrecks his i5's TSC sync. This
      results in him occasionally seeing time going backwards - which
      crashes the scheduler ...
      
      Most of our time accounting can actually handle that, except for the
      most common one: the tick time update of sched_fair.
      
      There is a further problem with that code: previously we assumed
      that because we get a tick every TICK_NSEC, our time delta could
      never exceed 32 bits, which kept the math simpler.
      
      However, ever since Frederic managed to get NO_HZ_FULL merged, this
      is no longer the case, since a task can now run for a long time
      indeed without getting a tick. It only takes ~4.2 seconds to
      overflow our u32 in nanoseconds.
      
      This means we not only need to deal better with time going
      backwards, but also need to be able to handle large deltas.
      
      This patch reworks the entire code and uses mul_u64_u32_shr() as
      proposed by Andy a long while ago.
      
      We express our virtual time scale factor as a u32 multiplier and
      shift right, and the 32-bit mul_u64_u32_shr() implementation reduces
      to a single 32x32->64 multiply if the time delta is still short (the
      common case).
      
      On 64-bit, a 64x64->128 multiply can be used if ARCH_SUPPORTS_INT128.
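
      A sketch of the generic (non-INT128) helper along these lines; the
      actual implementation lives in include/linux/math64.h and may differ
      in detail (this version assumes shift < 32):

        static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
        {
                u32 ah = a >> 32, al = a;
                u64 ret;

                /* common case: delta fits in 32 bits, one 32x32->64 mul */
                ret = ((u64)al * mul) >> shift;
                if (ah)
                        /* large delta: add the high-word contribution */
                        ret += ((u64)ah * mul) << (32 - shift);

                return ret;
        }
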
      Reported-and-Tested-by: Christian Engelmayer <cengelma@gmx.at>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: fweisbec@gmail.com
      Cc: Paul Turner <pjt@google.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20131118172706.GI3866@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 05 Dec 2013, 1 commit
  11. 27 Nov 2013, 2 commits
  12. 20 Nov 2013, 1 commit
  13. 13 Nov 2013, 3 commits
  14. 06 Nov 2013, 2 commits
    • sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus · 37dc6b50
      Committed by Preeti U Murthy
      The nr_busy_cpus parameter is used by nohz_kick_needed() to find the
      number of busy CPUs in a sched domain which has the
      SD_SHARE_PKG_RESOURCES flag set. Updating nr_busy_cpus at every
      level of the sched domain hierarchy is therefore unnecessary; we can
      update this parameter only at the parent domain of the sd which has
      this flag set. Introduce a per-cpu parameter, sd_busy, which
      represents this parent domain.
      
      In nohz_kick_needed() we directly query the nr_busy_cpus parameter
      associated with the groups of sd_busy.
      
      By associating sd_busy with the highest domain which has
      SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains
      which could have this flag set and trigger nohz_idle_balancing if any
      of the levels have more than one busy cpu.
      
      sd_busy is irrelevant for asymmetric load balancing. However sd_asym
      has been introduced to represent the highest sched domain which has
      SD_ASYM_PACKING flag set so that it can be queried directly when
      required.
      
      While we are at it, we might as well change the nohz_idle parameter
      to be updated at the sd_busy domain level alone, and not at the base
      domain level of a CPU. This unifies the concept of busy CPUs at the
      single level of sched domain where it is actually used.
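
      A sketch of the resulting fast path in nohz_kick_needed(), assuming
      sd_busy is published with rcu_assign_pointer() when the domain
      hierarchy is (re)built:

        DEFINE_PER_CPU(struct sched_domain *, sd_busy);

        /* in nohz_kick_needed(), under rcu_read_lock() */
        sd = rcu_dereference(per_cpu(sd_busy, cpu));
        if (sd) {
                /* one atomic read instead of walking every domain level */
                nr_busy = atomic_read(&sd->groups->sgp->nr_busy_cpus);
                if (nr_busy > 1)
                        goto need_kick_unlock;
        }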
      
      Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: svaidy@linux.vnet.ibm.com
      Cc: vincent.guittot@linaro.org
      Cc: bitbucket@online.de
      Cc: benh@kernel.crashing.org
      Cc: anton@samba.org
      Cc: Morten.Rasmussen@arm.com
      Cc: pjt@google.com
      Cc: peterz@infradead.org
      Cc: mikey@neuling.org
      Link: http://lkml.kernel.org/r/20131030031252.23426.4417.stgit@preeti.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Fix asymmetric scheduling for POWER7 · 2042abe7
      Committed by Vaidyanathan Srinivasan
      Asymmetric scheduling within a core is a scheduler load-balancing
      feature that is triggered when the SD_ASYM_PACKING flag is set. The
      goal for the load balancer is to move tasks to lower-order idle SMT
      threads within a core on a POWER7 system.

      In nohz_kick_needed(), we intend to check whether our sched domain
      (core) is completely busy or has an idle CPU.
      
      The following check for SD_ASYM_PACKING:
      
          (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)
      
      already covers the case of checking whether the domain has an idle
      CPU, because cpumask_first_and() will not yield any set bits if this
      domain has no idle CPU.

      Hence, the nr_busy check against the group weight can be removed.
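
      A sketch of the simplified check inside nohz_kick_needed()'s
      for_each_domain() loop (hedged; only the condition quoted above is
      taken from the patch):

        for_each_domain(cpu, sd) {
                /* is an SMT thread numbered lower than us idle? */
                if ((sd->flags & SD_ASYM_PACKING) &&
                    cpumask_first_and(nohz.idle_cpus_mask,
                                      sched_domain_span(sd)) < cpu)
                        goto need_kick_unlock;
        }
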
      Reported-by: Michael Neuling <michael.neuling@au1.ibm.com>
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Tested-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: vincent.guittot@linaro.org
      Cc: bitbucket@online.de
      Cc: benh@kernel.crashing.org
      Cc: anton@samba.org
      Cc: Morten.Rasmussen@arm.com
      Cc: pjt@google.com
      Link: http://lkml.kernel.org/r/20131030031242.23426.13019.stgit@preeti.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  15. 29 Oct 2013, 2 commits
    • sched: Avoid throttle_cfs_rq() racing with period_timer stopping · f9f9ffc2
      Committed by Ben Segall
      throttle_cfs_rq() doesn't check that period_timer is running, and
      while update_curr/assign_cfs_runtime does, a concurrently running
      period_timer on another CPU could cancel itself between this CPU's
      update_curr and throttle_cfs_rq(). If there are no other cfs_rqs
      running in the tg to restart the timer, this leaves the cfs_rq
      stranded forever.
      
      Fix this by calling __start_cfs_bandwidth() in throttle if the timer is
      inactive.
      
      (Also add some sched_debug lines for cfs_bandwidth.)
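
      A sketch of the change, assuming cfs_b->timer_active is the flag
      that update_curr/assign_cfs_runtime already consult:

        /* in throttle_cfs_rq() */
        raw_spin_lock(&cfs_b->lock);
        list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
        /*
         * If the period timer raced with us and cancelled itself,
         * restart it here so this cfs_rq is not stranded forever.
         */
        if (!cfs_b->timer_active)
                __start_cfs_bandwidth(cfs_b);
        raw_spin_unlock(&cfs_b->lock);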
      
      Tested: create a run/sleep task in a cgroup, loop switching the
      cgroup between a 1ms/100ms quota and unlimited, checking for
      timer_active=0 and throttled=1 as a failure. With the
      throttle_cfs_rq() change commented out this fails; with the full
      patch it passes.
      Signed-off-by: Ben Segall <bsegall@google.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: pjt@google.com
      Link: http://lkml.kernel.org/r/20131016181632.22647.84174.stgit@sword-of-the-dawn.mtv.corp.google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Guarantee new group-entities always have weight · 0ac9b1c2
      Committed by Paul Turner
      Currently, group entity load-weights are initialized to zero. This
      admits some races with respect to the first time they are
      re-weighted in early use. (Let g[x] denote the se for "g" on
      cpu "x".)
      
      Suppose that we have root->a and that a enters a throttled state,
      immediately followed by a[0]->t1 (the only task running on cpu[0])
      blocking:
      
        put_prev_task(group_cfs_rq(a[0]), t1)
        put_prev_entity(..., t1)
        check_cfs_rq_runtime(group_cfs_rq(a[0]))
        throttle_cfs_rq(group_cfs_rq(a[0]))
      
      Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
      time:
      
        enqueue_task_fair(rq[0], t2)
        enqueue_entity(group_cfs_rq(b[0]), t2)
        enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
        account_entity_enqueue(group_cfs_rq(b[0]), t2)
        update_cfs_shares(group_cfs_rq(b[0]))
        < skipped because b is part of a throttled hierarchy >
        enqueue_entity(group_cfs_rq(a[0]), b[0])
        ...
      
      We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
      which violates invariants in several code-paths. Eliminate the
      possibility of this by initializing group entity weight.
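
      A sketch of the fix, assuming the group entity is set up in
      init_tg_cfs_entry():

        se->my_q = cfs_rq;
        /* guarantee group entities always have weight */
        update_load_set(&se->load, NICE_0_LOAD);
        se->parent = parent;
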
      Signed-off-by: Paul Turner <pjt@google.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>