1. 12 Aug 2014, 2 commits
  2. 28 Jul 2014, 1 commit
  3. 16 Jul 2014, 2 commits
  4. 05 Jul 2014, 10 commits
    • sched/fair: Disable runtime_enabled on dying rq · 0e59bdae
      Kirill Tkhai authored
      We kill rq->rd on the CPU_DOWN_PREPARE stage:
      
      	cpuset_cpu_inactive -> cpuset_update_active_cpus -> partition_sched_domains ->
      	-> cpu_attach_domain -> rq_attach_root -> set_rq_offline
      
      This unthrottles all throttled cfs_rqs.
      
      But the cpu is still able to call schedule() till
      
      	take_cpu_down->__cpu_disable()
      
      is called from stop_machine.
      
      In this case the tasks from the just-unthrottled cfs_rqs are pickable
      in the standard scheduler way, and they are picked by the dying cpu.
      The cfs_rqs become throttled again, and migrate_tasks()
      in migration_call skips their tasks (one more unthrottle
      in migrate_tasks()->CPU_DYING does not happen, because rq->rd
      is already NULL).
      
      This patch sets runtime_enabled to zero. This guarantees that runtime
      is not accounted, that the cfs_rqs won't exceed the given
      cfs_rq->runtime_remaining = 1, and that their tasks stay pickable
      in migrate_tasks(). runtime_enabled is recalculated
      when the rq comes online again.
      
      Ben Segall also noticed that we always enable runtime in
      tg_set_cfs_bandwidth(). Actually, we should do that for online
      cpus only. To prevent races with unthrottle_offline_cfs_rqs()
      we take the get_online_cpus() lock.
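      A minimal userspace sketch of the idea is below; the structures and the
      helper name are simplified stand-ins for illustration, not the kernel's
      rq/cfs_rq code:

        /* Sketch: on rq offline, hand each cfs_rq its single unit of runtime
         * and switch bandwidth accounting off, so it cannot throttle again
         * before migrate_tasks() runs. */
        #include <stdio.h>

        struct cfs_rq_stub {
            int runtime_enabled;          /* bandwidth accounting on/off */
            long long runtime_remaining;  /* <= 0 means throttled */
            int throttled;
        };

        /* Hypothetical helper standing in for the rq-offline path. */
        static void offline_disable_bandwidth(struct cfs_rq_stub *cfs, int n)
        {
            for (int i = 0; i < n; i++) {
                cfs[i].runtime_remaining = 1;  /* unthrottle with 1 unit left */
                cfs[i].runtime_enabled = 0;    /* stop accounting entirely */
                cfs[i].throttled = 0;
            }
        }

        int main(void)
        {
            struct cfs_rq_stub rqs[2] = { { 1, -5, 1 }, { 1, 0, 1 } };

            offline_disable_bandwidth(rqs, 2);
            for (int i = 0; i < 2; i++)
                printf("cfs_rq %d: enabled=%d remaining=%lld throttled=%d\n",
                       i, rqs[i].runtime_enabled, rqs[i].runtime_remaining,
                       rqs[i].throttled);
            return 0;
        }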
      Reviewed-by: Ben Segall <bsegall@google.com>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      CC: Konstantin Khorenko <khorenko@parallels.com>
      CC: Paul Turner <pjt@google.com>
      CC: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403684382.3462.42.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Change scan period code to match intent · a22b4b01
      Rik van Riel authored
      Reading through the scan period code and comment, it appears the
      intent was to slow down NUMA scanning when a majority of accesses
      are on the local node, specifically a local:remote ratio of 3:1.
      
      However, the code actually tests local / (local + remote), and
      the actual cut-off point was around 30% local accesses, well before
      a task has actually converged on a node.
      
      Changing the threshold to 7 means scanning slows down when a task
      has around 70% of its accesses local, which appears to match the
      intent of the code more closely.
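      As an illustration, a standalone check of that cut-off (the
      NUMA_PERIOD_SLOTS/NUMA_PERIOD_THRESHOLD names follow the upstream
      defines; the helper itself is a simplification):

        #include <stdio.h>

        #define NUMA_PERIOD_SLOTS     10
        #define NUMA_PERIOD_THRESHOLD  7   /* was 3: slowed scanning at ~30% local */

        /* Return 1 if the local share of accesses is high enough to slow
         * down NUMA scanning. */
        static int should_slow_scanning(unsigned long local, unsigned long remote)
        {
            unsigned long total = local + remote;

            if (!total)
                return 0;
            /* local share of all accesses, in tenths */
            return local * NUMA_PERIOD_SLOTS / total >= NUMA_PERIOD_THRESHOLD;
        }

        int main(void)
        {
            printf("%d\n", should_slow_scanning(30, 70));  /* prints 0 */
            printf("%d\n", should_slow_scanning(75, 25));  /* prints 1 */
            return 0;
        }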
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: mgorman@suse.de
      Cc: chegu_vinod@hp.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403538095-31256-8-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Rework best node setting in task_numa_migrate() · db015dae
      Rik van Riel authored
      Fix up the best node setting in task_numa_migrate() to deal with a task
      in a pseudo-interleaved NUMA group, which is already running in the
      best location.
      
      Set the task's preferred nid to the current nid, so task migration is
      not retried at a high rate.
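      A tiny sketch of that decision with hypothetical names (the real code
      also consults the group's fault statistics inside task_numa_migrate()):

        #include <stdio.h>

        /* If the node search found nothing better than where the task
         * already runs, record the current node as preferred so the
         * migration path is not retried at a high rate. */
        static int choose_preferred_nid(int best_nid, long best_imp, int cur_nid)
        {
            if (best_nid < 0 || best_imp <= 0)
                return cur_nid;   /* already in the best location */
            return best_nid;
        }

        int main(void)
        {
            printf("%d\n", choose_preferred_nid(-1, 0, 2));  /* prints 2: stay */
            printf("%d\n", choose_preferred_nid(3, 40, 2));  /* prints 3: move */
            return 0;
        }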
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: mgorman@suse.de
      Cc: chegu_vinod@hp.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403538095-31256-7-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Examine a task move when examining a task swap · 0132c3e1
      Rik van Riel authored
      Running "perf bench numa mem -0 -m -P 1000 -p 8 -t 20" on a 4
      node system results in 160 runnable threads on a system with 80
      CPU threads.
      
      Once a process has nearly converged, with 39 threads on one node
      and 1 thread on another node, the remaining thread will be unable
      to migrate to its preferred node through a task swap.
      
      However, a simple task move would make the workload converge,
      without causing an imbalance.
      
      Test for this unlikely occurrence, and attempt a task move to
      the preferred nid when it happens.
      
       # Running main, "perf bench numa mem -p 8 -t 20 -0 -m -P 1000"
      
       ###
       # 160 tasks will execute (on 4 nodes, 80 CPUs):
       #         -1x     0MB global  shared mem operations
       #         -1x  1000MB process shared mem operations
       #         -1x     0MB thread  local  mem operations
       ###
      
       ###
       #
       #    0.0%  [0.2 mins]  0/0   1/1  36/2   0/0  [36/3 ] l:  0-0   (  0) {0-2}
       #    0.0%  [0.3 mins] 43/3  37/2  39/2  41/3  [ 6/10] l:  0-1   (  1) {1-2}
       #    0.0%  [0.4 mins] 42/3  38/2  40/2  40/2  [ 4/9 ] l:  1-2   (  1) [50.0%] {1-2}
       #    0.0%  [0.6 mins] 41/3  39/2  40/2  40/2  [ 2/9 ] l:  2-4   (  2) [50.0%] {1-2}
       #    0.0%  [0.7 mins] 40/2  40/2  40/2  40/2  [ 0/8 ] l:  3-5   (  2) [40.0%] (  41.8s converged)
      
      Without this patch, this same perf bench numa mem run had to
      rely on the scheduler load balancer to first balance out the
      load (moving a random task), before a task swap could complete
      the NUMA convergence.
      
      The load balancer does not normally take action unless the load
      difference exceeds 25%. Convergence times of over half an hour
      have been observed without this patch.
      
      With this patch, the NUMA balancing code will simply migrate the
      task, if that does not cause an imbalance.
      
      Also skip examining a CPU in detail if the improvement on that CPU
      is no more than the best we already have.
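      The gist of both changes, as a simplified standalone sketch with
      hypothetical names (the kernel version works on task_numa_env state):

        #include <stdio.h>
        #include <stdbool.h>

        struct candidate {
            long imp;            /* estimated improvement of using this CPU */
            bool has_swap_task;  /* is there a task we could swap with? */
            bool move_balanced;  /* would a plain move keep the load balanced? */
        };

        static long best_imp;

        /* Accept a swap, or a plain move that does not create an imbalance,
         * and skip the detailed check when a CPU cannot beat the best
         * improvement found so far. */
        static bool consider_cpu(const struct candidate *c)
        {
            if (c->imp <= best_imp)
                return false;
            if (c->has_swap_task || c->move_balanced) {
                best_imp = c->imp;
                return true;
            }
            return false;
        }

        int main(void)
        {
            struct candidate move_only = { 10, false, true };
            struct candidate worse     = {  5, true,  true };

            printf("%d\n", consider_cpu(&move_only));  /* prints 1: plain move */
            printf("%d\n", consider_cpu(&worse));      /* prints 0: skipped */
            return 0;
        }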
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: chegu_vinod@hp.com
      Cc: mgorman@suse.de
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-ggthh0rnh0yua6o5o3p6cr1o@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Simplify task_numa_compare() · 1c5d3eb3
      Rik van Riel authored
      When a task is part of a numa_group, the comparison should always use
      the group weight, in order to make workloads converge.
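      In other words (a minimal sketch with a hypothetical helper):

        #include <stdio.h>
        #include <stdbool.h>

        /* Compare candidate nodes by the group's fault weight whenever the
         * task belongs to a NUMA group, so all members pull the same way. */
        static long compare_weight(bool in_group, long task_w, long group_w)
        {
            return in_group ? group_w : task_w;
        }

        int main(void)
        {
            printf("%ld\n", compare_weight(true, 100, 350));   /* prints 350 */
            printf("%ld\n", compare_weight(false, 100, 350));  /* prints 100 */
            return 0;
        }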
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: chegu_vinod@hp.com
      Cc: mgorman@suse.de
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403538378-31571-4-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Use effective_load() to balance NUMA loads · 6dc1a672
      Rik van Riel authored
      When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places
      on a CPU is determined by the group the task is in. The active groups
      on the source and destination CPU can be different, resulting in a
      different load contribution by the same task at its source and at its
      destination. As a result, the load needs to be calculated separately
      for each CPU, instead of estimated once with task_h_load().
      
      Getting this calculation right allows some workloads to converge,
      where previously the last thread could get stuck on another node,
      without being able to migrate to its final destination.
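      A toy model of why one estimate is not enough (not the kernel's
      effective_load() math, just an illustration that the same task weighs
      differently depending on the group's state on each CPU):

        #include <stdio.h>

        /* The load a task adds on a CPU scales with the share of the
         * group's weight that this CPU carries. */
        static long contribution(long task_load, long grp_weight_on_cpu,
                                 long grp_total_weight)
        {
            if (!grp_total_weight)
                return task_load;
            return task_load * grp_weight_on_cpu / grp_total_weight;
        }

        int main(void)
        {
            long task = 1024;

            /* Same task, different group state on source and destination. */
            printf("on src: %ld\n", contribution(task, 512, 2048));   /* 256 */
            printf("on dst: %ld\n", contribution(task, 2048, 2048));  /* 1024 */
            return 0;
        }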
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: mgorman@suse.de
      Cc: chegu_vinod@hp.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403538378-31571-3-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Move power adjustment into load_too_imbalanced() · 28a21745
      Rik van Riel authored
      Currently the NUMA code scales the load on each node with the
      amount of CPU power available on that node, but it does not
      apply any adjustment to the load of the task that is being
      moved over.
      
      On systems with SMT/HT, this results in a task being weighed
      much more heavily than a CPU core, and a task move that would
      even out the load between nodes being disallowed.
      
      The correct thing is to apply the power correction to the
      numbers after we have first applied the move of the tasks'
      loads to them.
      
      This also allows us to do the power correction with a multiplication,
      rather than a division.
      
      Also drop two function arguments from load_too_imbalanced(), since
      it already takes the relevant factors from env.
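      The core of the comparison, rewritten without division, looks roughly
      like the sketch below (simplified: the upstream function takes its
      inputs from the NUMA balancing env and has additional checks):

        #include <stdio.h>
        #include <stdbool.h>

        /* Apply the move to the raw loads first, then do the capacity
         * ("power") correction by cross-multiplying instead of dividing. */
        static bool load_too_imbalanced(long src_load, long dst_load,
                                        long src_capacity, long dst_capacity,
                                        int imbalance_pct)  /* e.g. 125 = 25% */
        {
            /* dst_load/dst_capacity > (src_load/src_capacity) * pct/100 */
            return dst_load * src_capacity * 100 >
                   src_load * dst_capacity * imbalance_pct;
        }

        int main(void)
        {
            /* Move a task of load 300 between two equal-capacity nodes. */
            long src = 1200 - 300, dst = 600 + 300;

            printf("%d\n", load_too_imbalanced(src, dst, 1178, 1178, 125));
            /* prints 0: the move evens out the load, so it is allowed */
            return 0;
        }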
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: chegu_vinod@hp.com
      Cc: mgorman@suse.de
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403538378-31571-2-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Use group's max nid as task's preferred nid · f0b8a4af
      Rik van Riel authored
      From task_numa_placement, always try to consolidate the tasks
      in a group on the group's top nid.
      
      In case this task is part of a group that is interleaved over
      multiple nodes, task_numa_migrate will set the task's preferred
      nid to the best node it could find for the task, so this patch
      will cause at most one run through task_numa_migrate.
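      As a minimal sketch of the placement rule (hypothetical helper; the
      real decision lives in task_numa_placement()):

        #include <stdio.h>
        #include <stdbool.h>

        /* Tasks in a NUMA group follow the group's top node; solitary tasks
         * keep their own best node. */
        static int preferred_nid(bool in_group, int task_max_nid, int group_max_nid)
        {
            return in_group ? group_max_nid : task_max_nid;
        }

        int main(void)
        {
            printf("%d\n", preferred_nid(true, 1, 3));   /* prints 3 */
            printf("%d\n", preferred_nid(false, 1, 3));  /* prints 1 */
            return 0;
        }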
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: mgorman@suse.de
      Cc: chegu_vinod@hp.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403538095-31256-2-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Implement fast idling of CPUs when the system is partially loaded · 4486edd1
      Tim Chen authored
      When a system is lightly loaded (i.e. no more than 1 job per cpu),
      attempting to pull a job to a cpu before putting it to idle is
      unnecessary and can be skipped.  This patch adds an indicator so the
      scheduler can know when there is no more than 1 active job on any
      CPU in the system, and skip needless job pulls.
      
      On a 4 socket machine with a request/response kind of workload from
      clients, we saw about 0.13 msec of delay when we go through a full
      load balance to try to pull a job from all the other cpus.  While
      0.1 msec was spent on processing the request and generating a
      response, the 0.13 msec load balance overhead was more than the
      actual work being done.  This overhead can be skipped much of the
      time for lightly loaded systems.
      
      With this patch, we tested with a netperf request/response workload that
      has the server busy with half the cpus in a 4 socket system.  We found
      the patch eliminated 75% of the load balance attempts before idling a cpu.
      
      The overhead of setting/clearing the indicator is low, as we already
      gather the necessary info when we call add_nr_running() and
      update_sd_lb_stats().  We switch to full load balancing immediately
      if any cpu gets more than one job on its run queue in
      add_nr_running().  We clear the indicator, to avoid load balancing,
      when we detect that no cpu has more than one job while scanning the
      run queues in update_sg_lb_stats().  We are aggressive in turning on
      the load balance and opportunistic in skipping it.
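      A standalone sketch of the indicator (a plain shared flag here instead
      of the kernel's root-domain field, with the hook points simplified):

        #include <stdio.h>
        #include <stdbool.h>

        #define NR_CPUS 4

        static int  nr_running[NR_CPUS];
        static bool overload;    /* "some CPU has more than one job" */

        static void add_nr_running(int cpu, int n)
        {
            nr_running[cpu] += n;
            if (nr_running[cpu] > 1)
                overload = true;        /* aggressive: set immediately */
        }

        static void update_lb_stats(void)
        {
            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                if (nr_running[cpu] > 1)
                    return;             /* still overloaded somewhere */
            overload = false;           /* opportunistic: clear after a scan */
        }

        static bool should_idle_balance(void)
        {
            return overload;            /* lightly loaded: skip the pull */
        }

        int main(void)
        {
            add_nr_running(0, 1);
            add_nr_running(1, 1);
            printf("%d\n", should_idle_balance());  /* prints 0 */
            add_nr_running(1, 1);
            printf("%d\n", should_idle_balance());  /* prints 1 */
            nr_running[1] = 1;
            update_lb_stats();
            printf("%d\n", should_idle_balance());  /* prints 0 */
            return 0;
        }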
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Jason Low <jason.low2@hp.com>
      Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Alex Shi <alex.shi@linaro.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Fix potential near-infinite distribute_cfs_runtime() loop · c06f04c7
      Ben Segall authored
      distribute_cfs_runtime() intentionally only hands out enough runtime to
      bring each cfs_rq to 1 ns of runtime, expecting the cfs_rqs to then take
      the runtime they need only once they actually get to run. However, if
      they get to run sufficiently quickly, the period timer is still in
      distribute_cfs_runtime() and no runtime is available, causing them to
      throttle. Then distribute has to handle them again, and this can go on
      until distribute has handed out all of the runtime 1ns at a time, which
      takes far too long.
      
      Instead allow access to the same runtime that distribute is handing out,
      accepting that corner cases with very low quota may be able to spend the
      entire cfs_b->runtime during distribute_cfs_runtime, meaning that the
      runtime directly handed out by distribute_cfs_runtime was over quota. In
      addition, if a cfs_rq does manage to throttle like this, make sure the
      existing distribute_cfs_runtime no longer loops over it again.
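      The intended single-pass behaviour, as a heavily simplified sketch
      (arbitrary units instead of nanoseconds, and none of the locking or
      timer details the actual fix deals with):

        #include <stdio.h>

        #define NQ 3

        static long long pool = 10;                      /* cfs_b->runtime stand-in */
        static long long remaining[NQ] = { -4, -2, -7 }; /* <= 0 means throttled */

        /* Give each throttled queue just enough to reach 1 unit, drawing
         * from the shared pool, and stop once the pool is empty. */
        static void distribute_runtime(void)
        {
            for (int i = 0; i < NQ && pool > 0; i++) {
                long long want;

                if (remaining[i] > 0)
                    continue;            /* not throttled, skip */
                want = 1 - remaining[i];
                if (want > pool)
                    want = pool;
                remaining[i] += want;
                pool -= want;
            }
        }

        int main(void)
        {
            distribute_runtime();
            for (int i = 0; i < NQ; i++)
                printf("cfs_rq %d remaining=%lld\n", i, remaining[i]);
            printf("pool left=%lld\n", pool);   /* prints 0 */
            return 0;
        }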
      Signed-off-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140620222120.13814.21652.stgit@sword-of-the-dawn.mtv.corp.google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 19 Jun 2014, 3 commits
  6. 09 Jun 2014, 1 commit
  7. 05 Jun 2014, 12 commits
  8. 22 May 2014, 7 commits
  9. 07 May 2014, 2 commits