1. 03 Oct, 2014 2 commits
    • sched/fair: Delete resched_cpu() from idle_balance() · 10a12983
      Kirill Tkhai committed
      We already reschedule env.dst_cpu in attach_tasks()->check_preempt_curr()
      if this is necessary.
      
      Furthermore, a task of a higher priority class may be current on the dest rq,
      and we shouldn't disturb it.
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140930210441.5258.55054.stgit@localhost
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Improve sysbench performance by fixing spurious active migration · 43f4d666
      Vincent Guittot committed
      Since commit caeb178c ("sched/fair: Make update_sd_pick_busiest() ...")
      sd_pick_busiest returns a group that can be neither imbalanced nor overloaded
      but is only more loaded than others. This change was introduced to ensure
      a better load balance in systems that are not overloaded, but as a side effect
      it can also generate useless active migrations between groups.
      
      Let's take the example of 3 tasks on a quad-core system. We will always have an
      idle core, so the load balance will find a busiest group (core) whenever an ILB
      is triggered, and it will force an active migration (once above the
      nr_balance_failed threshold) so that the idle core becomes busy but another core
      becomes idle. With the next ILB, the freshly idle core will try to pull the
      task of a busy CPU.
      The number of spurious active migrations is not so large on a quad-core system
      because the ILB is not triggered very often. But it becomes significant as soon as
      you have more than one sched_domain level, as on a dual cluster of quad cores,
      where the ILB is triggered every tick when you have more than 1 busy_cpu.
      
      We need to ensure that the migration generates a real improvement and will not
      only move the avg_load imbalance to another CPU.
      
      Before caeb178c, the filtering of such use cases was ensured by the following
      test in f_b_g (find_busiest_group()):
      
        if ((local->idle_cpus < busiest->idle_cpus) &&
            busiest->sum_nr_running <= busiest->group_weight)
      
      This patch modifies the condition to take into account the situation where the
      busiest group is not overloaded: if the difference between the number of idle CPUs
      in the 2 groups is less than or equal to 1 and the busiest group is not overloaded,
      moving a task will not improve the load balance but just move the imbalance.
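
      The two conditions can be compared side by side in a small, self-contained
      sketch. The struct and both function names are hypothetical stand-ins, not
      the kernel's own definitions, and the "overloaded" flag is assumed to be
      computed elsewhere.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the scheduler's per-group statistics. */
struct group_stats {
	unsigned int idle_cpus;       /* CPUs in the group with nothing to run */
	unsigned int sum_nr_running;  /* runnable tasks in the group */
	unsigned int group_weight;    /* number of CPUs in the group */
	bool overloaded;              /* assumed to be set elsewhere */
};

/* The test quoted above, as used before caeb178c. */
static bool balanced_old(const struct group_stats *local,
			 const struct group_stats *busiest)
{
	return local->idle_cpus < busiest->idle_cpus &&
	       busiest->sum_nr_running <= busiest->group_weight;
}

/*
 * The relaxed test described above: the busiest group is not overloaded
 * and the local group has at most one more idle CPU than the busiest one.
 */
static bool balanced_new(const struct group_stats *local,
			 const struct group_stats *busiest)
{
	return !busiest->overloaded &&
	       local->idle_cpus <= busiest->idle_cpus + 1;
}
```

      With 5 busy threads split 3 + 2 over two quad-core clusters, the local
      cluster has 2 idle CPUs and the busiest has 1: the old test does not
      report the groups as balanced, while the relaxed one does.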
      
      A test with sysbench on a dual clusters of quad cores gives the following
      results:
      
        command: sysbench --test=cpu --num-threads=5 --max-time=5 run
      
      HZ is 200, which means that 1000 ticks have fired during the test.
      
      With Mainline, perf gives the following figures:
      
       Samples: 727  of event 'sched:sched_migrate_task'
       Event count (approx.): 727
        Overhead  Command          Shared Object  Symbol
        ........  ...............  .............  ..............
          12.52%  migration/1      [unknown]      [.] 00000000
          12.52%  migration/5      [unknown]      [.] 00000000
          12.52%  migration/7      [unknown]      [.] 00000000
          12.10%  migration/6      [unknown]      [.] 00000000
          11.83%  migration/0      [unknown]      [.] 00000000
          11.83%  migration/3      [unknown]      [.] 00000000
          11.14%  migration/4      [unknown]      [.] 00000000
          10.87%  migration/2      [unknown]      [.] 00000000
           2.75%  sysbench         [unknown]      [.] 00000000
           0.83%  swapper          [unknown]      [.] 00000000
           0.55%  ktps65090charge  [unknown]      [.] 00000000
           0.41%  mmcqd/1          [unknown]      [.] 00000000
           0.14%  perf             [unknown]      [.] 00000000
      
      With this patch, perf gives the following figures:
      
       Samples: 20  of event 'sched:sched_migrate_task'
       Event count (approx.): 20
        Overhead  Command          Shared Object  Symbol
        ........  ...............  .............  ..............
          80.00%  sysbench         [unknown]      [.] 00000000
          10.00%  swapper          [unknown]      [.] 00000000
           5.00%  ktps65090charge  [unknown]      [.] 00000000
           5.00%  migration/1      [unknown]      [.] 00000000
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1412170735-5356-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 24 Sep, 2014 3 commits
  3. 21 Sep, 2014 1 commit
  4. 19 Sep, 2014 7 commits
  5. 09 Sep, 2014 1 commit
  6. 07 Sep, 2014 1 commit
    • sched/deadline: Fix a precision problem in the microseconds range · 177ef2a6
      xiaofeng.yan committed
      An overrun could happen in function start_hrtick_dl()
      when a task with SCHED_DEADLINE runs in the microseconds
      range.
      
      For example, if a task with SCHED_DEADLINE has the following parameters:
      
        Task  runtime  deadline  period
         P1   200us     500us    500us
      
      The deadline and period from task P1 are less than 1ms.
      
      In order to achieve microsecond precision, we need to enable the HRTICK feature
      and run the test with the following commands:
      
        PC# echo "HRTICK" > /sys/kernel/debug/sched_features
        PC# trace-cmd record -e sched_switch &
        PC# ./schedtool -E -t 200000:500000:500000 -e ./test
      
      The binary test is in an endless while(1) loop here.
      Some pieces of trace.dat are as follows:
      
        <idle>-0   157.603157: sched_switch: :R ==> 2481:4294967295: test
        test-2481  157.603203: sched_switch:  2481:R ==> 0:120: swapper/2
        <idle>-0   157.605657: sched_switch:  :R ==> 2481:4294967295: test
        test-2481  157.608183: sched_switch:  2481:R ==> 2483:120: trace-cmd
        trace-cmd-2483 157.609656: sched_switch:2483:R==>2481:4294967295: test
      
      We can get the runtime of P1 from the information above:
      
        runtime = 157.608183 - 157.605657
        runtime = 0.002526 s (2.526 ms)
      
      The correct runtime should be less than or equal to 200us at some point.
      
      The problem is caused by the conditional check "delta > 10000"
      in function start_hrtick_dl().
      
      No hrtimer is started to control the rest of the runtime
      when the rest of the runtime is less than 10us.
      
      So the process will continue to run until the next tick period arrives.
      
      Move the code with the limit of the least time slice
      from hrtick_start_fair() to hrtick_start() because the
      EDF schedule class also needs this function in start_hrtick_dl().
      
      To fix this problem, we call hrtick_start() unconditionally in
      start_hrtick_dl(), and make sure the scheduling slice won't be smaller
      than 10us in hrtick_start().
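
      A minimal sketch of the difference, assuming delays are expressed in
      nanoseconds (so the 10us floor is 10000). Both helper names are
      hypothetical and only model the behaviour described above; they are not
      the kernel functions themselves.

```c
#include <stdint.h>

/* 10us expressed in nanoseconds; the floor named in the description above. */
#define MIN_SLICE_NS 10000LL

/* Old behaviour: a timer was armed only when the remaining runtime exceeded 10us. */
static int old_would_arm_timer(int64_t remaining_ns)
{
	return remaining_ns > MIN_SLICE_NS;
}

/*
 * New behaviour: a timer is always armed, and the programmed delay is
 * raised to the 10us floor when the remaining runtime is shorter.
 */
static int64_t clamp_hrtick_delay(int64_t delay_ns)
{
	return delay_ns < MIN_SLICE_NS ? MIN_SLICE_NS : delay_ns;
}
```

      With 4us of runtime left, the old check arms no timer at all, so the task
      runs until the next regular tick; the clamped version arms a 10us timer
      instead.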
      Signed-off-by: Xiaofeng Yan <xiaofeng.yan@huawei.com>
      Reviewed-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Juri Lelli <juri.lelli@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1409022941-5880-1-git-send-email-xiaofeng.yan@huawei.com
      [ Massaged the changelog and the code. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 05 Sep, 2014 1 commit
  8. 20 Aug, 2014 4 commits
  9. 12 Aug, 2014 5 commits
  10. 28 Jul, 2014 1 commit
  11. 16 Jul, 2014 2 commits
  12. 05 Jul, 2014 10 commits
    • sched/fair: Disable runtime_enabled on dying rq · 0e59bdae
      Kirill Tkhai committed
      We kill rq->rd on the CPU_DOWN_PREPARE stage:
      
      	cpuset_cpu_inactive -> cpuset_update_active_cpus -> partition_sched_domains ->
      	cpu_attach_domain -> rq_attach_root -> set_rq_offline
      
      This unthrottles all throttled cfs_rqs.
      
      But the cpu is still able to call schedule() till
      
      	take_cpu_down->__cpu_disable()
      
      is called from stop_machine.
      
      In this case the tasks from the just-unthrottled cfs_rqs are pickable
      in the standard scheduler way, and they are picked by the dying cpu.
      The cfs_rqs become throttled again, and migrate_tasks()
      in migration_call skips their tasks (one more unthrottle
      in migrate_tasks()->CPU_DYING does not happen, because rq->rd
      is already NULL).
      
      This patch sets runtime_enabled to zero. This guarantees that the runtime
      is not accounted, that the cfs_rqs won't exceed the given
      cfs_rq->runtime_remaining = 1, and that tasks will be pickable
      in migrate_tasks(). runtime_enabled is recalculated
      when the rq becomes online again.
      
      Ben Segall also noticed that we always enable runtime in
      tg_set_cfs_bandwidth(). Actually, we should do that for online
      cpus only. To prevent races with unthrottle_offline_cfs_rqs()
      we take the get_online_cpus() lock.
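
      The effect of clearing runtime_enabled can be illustrated with a reduced
      model. The struct and function names below are hypothetical, and the
      throttling rule (throttle once runtime_remaining drops to zero or below)
      is a simplifying assumption, not the kernel's exact logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced model of a bandwidth-controlled cfs_rq. */
struct cfs_rq_model {
	bool runtime_enabled;       /* is bandwidth accounting active? */
	int64_t runtime_remaining;  /* ns left before the queue throttles */
	bool throttled;
};

/* Charge delta_exec ns of execution; throttle when the budget is used up. */
static void account_runtime(struct cfs_rq_model *cfs_rq, int64_t delta_exec)
{
	if (!cfs_rq->runtime_enabled)
		return; /* nothing is charged, so the queue cannot throttle */

	cfs_rq->runtime_remaining -= delta_exec;
	if (cfs_rq->runtime_remaining <= 0)
		cfs_rq->throttled = true;
}

/* Model of the dying-rq path: unthrottle and stop accounting. */
static void rq_going_offline(struct cfs_rq_model *cfs_rq)
{
	cfs_rq->runtime_remaining = 1;
	cfs_rq->runtime_enabled = false;
	cfs_rq->throttled = false;
}
```

      Once rq_going_offline() has run, further execution on the dying cpu
      leaves runtime_remaining at 1, so the queue stays unthrottled and its
      tasks remain visible to the migration step.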
      Reviewed-by: Ben Segall <bsegall@google.com>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Konstantin Khorenko <khorenko@parallels.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403684382.3462.42.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Change scan period code to match intent · a22b4b01
      Rik van Riel committed
      Reading through the scan period code and comment, it appears the
      intent was to slow down NUMA scanning when a majority of accesses
      are on the local node, specifically a local:remote ratio of 3:1.
      
      However, the code actually tests local / (local + remote), and
      the actual cut-off point was around 30% local accesses, well before
      a task has actually converged on a node.
      
      Changing the threshold to 7 means scanning slows down when a task
      has around 70% of its accesses local, which appears to match the
      intent of the code more closely.
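
      The arithmetic can be checked with a short sketch. It assumes the ratio
      local / (local + remote) is expressed in tenths (PERIOD_SLOTS = 10) and
      that the old threshold was 3, consistent with the roughly 30% cut-off
      mentioned above; the macro and function names are illustrative.

```c
#include <stdbool.h>

#define PERIOD_SLOTS 10  /* assumed granularity: ratio expressed in tenths */

/* Share of local accesses, in tenths, using integer division. */
static unsigned long local_ratio(unsigned long local, unsigned long remote)
{
	if (local + remote == 0)
		return 0; /* no samples: treat as no local accesses */
	return (local * PERIOD_SLOTS) / (local + remote);
}

/* Slow down scanning once the local share reaches the threshold. */
static bool should_slow_scanning(unsigned long local, unsigned long remote,
				 unsigned long threshold)
{
	return local_ratio(local, remote) >= threshold;
}
```

      With 30 local and 70 remote accesses the ratio is 3, which passes a
      threshold of 3 but not 7. A 3:1 local:remote mix gives 30 / 4 = 7 in
      integer arithmetic, which just reaches the new threshold.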
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: mgorman@suse.de
      Cc: chegu_vinod@hp.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403538095-31256-8-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Rework best node setting in task_numa_migrate() · db015dae
      Rik van Riel committed
      Fix up the best node setting in task_numa_migrate() to deal with a task
      in a pseudo-interleaved NUMA group, which is already running in the
      best location.
      
      Set the task's preferred nid to the current nid, so task migration is
      not retried at a high rate.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: mgorman@suse.de
      Cc: chegu_vinod@hp.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403538095-31256-7-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Examine a task move when examining a task swap · 0132c3e1
      Rik van Riel committed
      Running "perf bench numa mem -0 -m -P 1000 -p 8 -t 20" on a 4
      node system results in 160 runnable threads on a system with 80
      CPU threads.
      
      Once a process has nearly converged, with 39 threads on one node
      and 1 thread on another node, the remaining thread will be unable
      to migrate to its preferred node through a task swap.
      
      However, a simple task move would make the workload converge,
      without causing an imbalance.
      
      Test for this unlikely occurrence, and attempt a task move to
      the preferred nid when it happens.
      
       # Running main, "perf bench numa mem -p 8 -t 20 -0 -m -P 1000"
      
       ###
       # 160 tasks will execute (on 4 nodes, 80 CPUs):
       #         -1x     0MB global  shared mem operations
       #         -1x  1000MB process shared mem operations
       #         -1x     0MB thread  local  mem operations
       ###
      
       ###
       #
       #    0.0%  [0.2 mins]  0/0   1/1  36/2   0/0  [36/3 ] l:  0-0   (  0) {0-2}
       #    0.0%  [0.3 mins] 43/3  37/2  39/2  41/3  [ 6/10] l:  0-1   (  1) {1-2}
       #    0.0%  [0.4 mins] 42/3  38/2  40/2  40/2  [ 4/9 ] l:  1-2   (  1) [50.0%] {1-2}
       #    0.0%  [0.6 mins] 41/3  39/2  40/2  40/2  [ 2/9 ] l:  2-4   (  2) [50.0%] {1-2}
       #    0.0%  [0.7 mins] 40/2  40/2  40/2  40/2  [ 0/8 ] l:  3-5   (  2) [40.0%] (  41.8s converged)
      
      Without this patch, this same perf bench numa mem run had to
      rely on the scheduler load balancer to first balance out the
      load (moving a random task), before a task swap could complete
      the NUMA convergence.
      
      The load balancer does not normally take action unless the load
      difference exceeds 25%. Convergence times of over half an hour
      have been observed without this patch.
      
      With this patch, the NUMA balancing code will simply migrate the
      task, if that does not cause an imbalance.
      
      Also skip examining a CPU in detail if the improvement on that CPU
      is no more than the best we already have.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: chegu_vinod@hp.com
      Cc: mgorman@suse.de
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-ggthh0rnh0yua6o5o3p6cr1o@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Simplify task_numa_compare() · 1c5d3eb3
      Rik van Riel committed
      When a task is part of a numa_group, the comparison should always use
      the group weight, in order to make workloads converge.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: chegu_vinod@hp.com
      Cc: mgorman@suse.de
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403538378-31571-4-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Use effective_load() to balance NUMA loads · 6dc1a672
      Rik van Riel committed
      When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places
      on a CPU is determined by the group the task is in. The active groups
      on the source and destination CPU can be different, resulting in a
      different load contribution by the same task at its source and at its
      destination. As a result, the load needs to be calculated separately
      for each CPU, instead of estimated once with task_h_load().
      
      Getting this calculation right allows some workloads to converge,
      where previously the last thread could get stuck on another node,
      without being able to migrate to its final destination.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: mgorman@suse.de
      Cc: chegu_vinod@hp.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403538378-31571-3-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Move power adjustment into load_too_imbalanced() · 28a21745
      Rik van Riel committed
      Currently the NUMA code scales the load on each node with the
      amount of CPU power available on that node, but it does not
      apply any adjustment to the load of the task that is being
      moved over.
      
      On systems with SMT/HT, this results in a task being weighed
      much more heavily than a CPU core, and a task move that would
      even out the load between nodes being disallowed.
      
      The correct thing is to apply the power correction to the
      numbers after we have first applied the move of the tasks'
      loads to them.
      
      This also allows us to do the power correction with a multiplication,
      rather than a division.
      
      Also drop two function arguments for load_too_imbalanced(), since it
      takes various factors from env already.
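
      The multiplication-instead-of-division idea can be sketched as follows.
      This is a simplified illustration under stated assumptions, not the
      kernel function: the name too_imbalanced_sketch, the explicit power
      parameters, and the imbalance_pct value of 125 used in the note below
      are all illustrative. The loads are assumed to already include the
      effect of the task move.

```c
#include <stdbool.h>

/*
 * Is the destination more loaded per unit of CPU power than the source
 * by more than imbalance_pct percent?
 *
 *   dst_load / dst_power > (src_load / src_power) * imbalance_pct / 100
 *
 * is rewritten by cross-multiplying (powers are positive), so no
 * division is needed.
 */
static bool too_imbalanced_sketch(long long src_load, long long dst_load,
				  long long src_power, long long dst_power,
				  long long imbalance_pct)
{
	return dst_load * src_power * 100 >
	       src_load * dst_power * imbalance_pct;
}
```

      With equal power on both nodes and imbalance_pct = 125, a destination
      load of 2000 against a source load of 1000 is rejected; if the
      destination node has twice the power, the same loads are accepted.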
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: chegu_vinod@hp.com
      Cc: mgorman@suse.de
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403538378-31571-2-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Use group's max nid as task's preferred nid · f0b8a4af
      Rik van Riel committed
      From task_numa_placement, always try to consolidate the tasks
      in a group on the group's top nid.
      
      In case this task is part of a group that is interleaved over
      multiple nodes, task_numa_migrate will set the task's preferred
      nid to the best node it could find for the task, so this patch
      will cause at most one run through task_numa_migrate.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: mgorman@suse.de
      Cc: chegu_vinod@hp.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403538095-31256-2-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Implement fast idling of CPUs when the system is partially loaded · 4486edd1
      Tim Chen committed
      When a system is lightly loaded (i.e. no more than 1 job per cpu),
      attempting to pull a job to a cpu before putting it to idle is unnecessary and
      can be skipped.  This patch adds an indicator so the scheduler can know
      when there is no more than 1 active job on any CPU in the system and
      skip needless job pulls.
      
      On a 4 socket machine with a request/response kind of workload from
      clients, we saw about 0.13 msec delay when we go through a full load
      balance to try to pull a job from all the other cpus.  While 0.1 msec was
      spent on processing the request and generating a response, the 0.13 msec
      load balance overhead was actually more than the actual work being done.
      This overhead can be skipped much of the time for lightly loaded systems.
      
      With this patch, we tested with a netperf request/response workload that
      has the server busy with half the cpus in a 4 socket system.  We found
      the patch eliminated 75% of the load balance attempts before idling a cpu.
      
      The overhead of setting/clearing the indicator is low as we already gather
      the necessary info while we call add_nr_running() and update_sd_lb_stats().
      We switch to full load balance immediately if any cpu gets more than
      one job on its run queue in add_nr_running().  We clear the indicator
      to avoid load balance when we detect that no cpu has more than one job
      when we scan the run queues in update_sg_lb_stats().  We are aggressive
      in turning on the load balance and opportunistic in skipping the load
      balance.
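
      The set-eagerly, clear-lazily behaviour described above can be modelled
      in a few lines. The struct, the function names and the 4-CPU size are
      hypothetical; the sketch only mirrors the description and is not the
      kernel implementation.

```c
#include <stdbool.h>

#define NR_CPUS 4  /* illustrative system size */

/* Hypothetical model: per-CPU job counts plus the system-wide indicator. */
struct system_model {
	unsigned int nr_running[NR_CPUS];
	bool overload;
};

/* Enqueue path: turn the indicator on as soon as any CPU holds two jobs. */
static void add_job(struct system_model *sys, int cpu)
{
	sys->nr_running[cpu]++;
	if (sys->nr_running[cpu] >= 2)
		sys->overload = true;
}

/* Dequeue path: the indicator is deliberately left untouched here. */
static void remove_job(struct system_model *sys, int cpu)
{
	if (sys->nr_running[cpu] > 0)
		sys->nr_running[cpu]--;
}

/* Stats scan: clear the indicator only when no CPU has more than one job. */
static void scan_run_queues(struct system_model *sys)
{
	bool overload = false;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (sys->nr_running[cpu] > 1)
			overload = true;
	sys->overload = overload;
}

/* Idle path: only attempt to pull a job when the indicator is set. */
static bool should_try_pull(const struct system_model *sys)
{
	return sys->overload;
}
```

      In this model the indicator flips on at the second job on any CPU and
      stays on after that job leaves, until the next scan confirms that every
      CPU is back to at most one job.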
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Jason Low <jason.low2@hp.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Alex Shi <alex.shi@linaro.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Fix potential near-infinite distribute_cfs_runtime() loop · c06f04c7
      Ben Segall committed
      distribute_cfs_runtime() intentionally only hands out enough runtime to
      bring each cfs_rq to 1 ns of runtime, expecting the cfs_rqs to then take
      the runtime they need only once they actually get to run. However, if
      they get to run sufficiently quickly, the period timer is still in
      distribute_cfs_runtime() and no runtime is available, causing them to
      throttle. Then distribute has to handle them again, and this can go on
      until distribute has handed out all of the runtime 1ns at a time, which
      takes far too long.
      
      Instead allow access to the same runtime that distribute is handing out,
      accepting that corner cases with very low quota may be able to spend the
      entire cfs_b->runtime during distribute_cfs_runtime, meaning that the
      runtime directly handed out by distribute_cfs_runtime was over quota. In
      addition, if a cfs_rq does manage to throttle like this, make sure the
      existing distribute_cfs_runtime no longer loops over it again.
      Signed-off-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140620222120.13814.21652.stgit@sword-of-the-dawn.mtv.corp.google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  13. 19 Jun, 2014 2 commits