1. 20 Apr, 2008 6 commits
  2. 14 Apr, 2008 1 commit
  3. 21 Mar, 2008 1 commit
  4. 19 Mar, 2008 5 commits
    • sched: retune wake granularity · 74e3cd7f
      Ingo Molnar authored
      Reduce wake-up granularity for better interactivity.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      74e3cd7f
    • sched: improve affine wakeups · 4ae7d5ce
      Ingo Molnar authored
      improve affine wakeups. Maintain the 'overlap' metric based on CFS's
      sum_exec_runtime - which means the amount of time a task executes
      after it wakes up some other task.
      
      Use the 'overlap' for the wakeup decisions: if the 'overlap' is short,
      it means there's strong workload coupling between this task and the
      woken up task. If the 'overlap' is large then the workload is decoupled
      and the scheduler will move them to separate CPUs more easily.
      
      ( Also slightly move the preempt_check within try_to_wake_up() - this has
        no effect on functionality but allows 'early wakeups' (for still-on-rq
        tasks) to be correctly accounted as well.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      4ae7d5ce
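The 'overlap' heuristic described in this commit can be sketched roughly as follows. This is a minimal illustration, not the kernel code: the struct, the 1/8 averaging weight, and the threshold parameter are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task bookkeeping; the field name is illustrative. */
struct task_overlap {
	uint64_t avg_overlap_ns;	/* running average of runtime after waking another task */
};

/* Fold one sample into the average with a 1/8 weight (assumed decay factor). */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	assert(avg != NULL);
	int64_t diff = (int64_t)(sample - *avg);
	*avg = (uint64_t)((int64_t)*avg + diff / 8);
}

/*
 * Treat waker and wakee as tightly coupled only when both overlaps are
 * short; a large overlap on either side suggests a decoupled workload
 * that can be spread across CPUs more freely.
 */
static bool tightly_coupled(const struct task_overlap *waker,
			    const struct task_overlap *wakee,
			    uint64_t threshold_ns)
{
	return waker->avg_overlap_ns <= threshold_ns &&
	       wakee->avg_overlap_ns <= threshold_ns;
}
```

A caller would update the waker's average each time it wakes another task and consult `tightly_coupled()` before deciding whether to pull the wakee to the waker's CPU.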
    • sched: clean up wakeup balancing, code flow · f4827386
      Ingo Molnar authored
      Clean up the code flow. No code changed:
      
      kernel/sched.o:
      
         text	   data	    bss	    dec	    hex	filename
        42521	   2858	    232	  45611	   b22b	sched.o.before
        42521	   2858	    232	  45611	   b22b	sched.o.after
      
      md5:
         09b31c44e9aff8666f72773dc433e2df  sched.o.before.asm
         09b31c44e9aff8666f72773dc433e2df  sched.o.after.asm
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f4827386
    • sched: clean up wakeup balancing, rename variables · ac192d39
      Ingo Molnar authored
      rename 'cpu' to 'prev_cpu'. No code changed:
      
      kernel/sched.o:
      
         text	   data	    bss	    dec	    hex	filename
        42521	   2858	    232	  45611	   b22b	sched.o.before
        42521	   2858	    232	  45611	   b22b	sched.o.after
      
      md5:
         09b31c44e9aff8666f72773dc433e2df  sched.o.before.asm
         09b31c44e9aff8666f72773dc433e2df  sched.o.after.asm
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ac192d39
    • sched: clean up wakeup balancing, move wake_affine() · 098fb9db
      Ingo Molnar authored
      split out the affine-wakeup bits.
      
      No code changed:
      
      kernel/sched.o:
      
         text	   data	    bss	    dec	    hex	filename
        42521	   2858	    232	  45611	   b22b	sched.o.before
        42521	   2858	    232	  45611	   b22b	sched.o.after
      
      md5:
         9d76738f1272aa82f0b7affd2f51df6b  sched.o.before.asm
         09b31c44e9aff8666f72773dc433e2df  sched.o.after.asm
      
      (the md5's changed because stack slots changed and some registers
      get scheduled by gcc in a different order - but otherwise the before
      and after assembly is instruction for instruction equivalent.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      098fb9db
  5. 15 Mar, 2008 4 commits
    • sched: simplify sched_slice() · 6a6029b8
      Ingo Molnar authored
      Use the existing calc_delta_mine() calculation for sched_slice(). This
      saves a divide and simplifies the code because we share it with the
      other /cfs_rq->load users.
      
      It also improves code size:
      
            text    data     bss     dec     hex filename
           42659    2740     144   45543    b1e7 sched.o.before
           42093    2740     144   44977    afb1 sched.o.after
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      6a6029b8
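As a rough sketch of the idea in this commit, a slice can be computed by scaling the scheduling period by the entity's share of the runqueue weight. The helper below only stands in for calc_delta_mine(): its name and its use of a plain division are assumptions made for readability, and the kernel helper is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a shared "delta * weight / total" helper (hypothetical name). */
static uint64_t scale_by_weight(uint64_t delta_ns, uint64_t weight,
				uint64_t total_weight)
{
	assert(total_weight > 0);
	return delta_ns * weight / total_weight;
}

/* Slice = period scaled by the entity's share of the runqueue load. */
static uint64_t slice_ns(uint64_t period_ns, uint64_t se_weight,
			 uint64_t rq_weight)
{
	return scale_by_weight(period_ns, se_weight, rq_weight);
}
```

For instance, with a 20 ms period and four equally weighted tasks (weight 1024 each, runqueue weight 4096), each task's slice comes out at 5 ms.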
    • sched: fix fair sleepers · e22ecef1
      Ingo Molnar authored
      Fair sleepers need to scale their latency target down by runqueue
      weight. Otherwise busy systems will gain an ever larger sleep bonus.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      e22ecef1
    • sched: fix overload performance: buddy wakeups · aa2ac252
      Peter Zijlstra authored
      Currently we schedule to the leftmost task in the runqueue. When the
      runtimes are very short because of some server/client ping-pong,
      especially in over-saturated workloads, this will cycle through all
      tasks, thrashing the cache.

      Reduce cache thrashing by keeping dependent tasks together by running
      newly woken tasks first. However, by not running the leftmost task first
      we could starve tasks because the wakee can gain unlimited runtime.

      Therefore we only run the wakee if it's within a small
      (wakeup_granularity) window of the leftmost task. This preserves
      fairness, but does alternate server/client task groups.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      aa2ac252
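The buddy-wakeup rule above can be sketched as a small selection function. This is an illustration under assumptions: the struct, the function name, and the strict "less than" comparison are hypothetical, and only the windowing idea comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct entity {
	uint64_t vruntime;	/* virtual runtime in nanoseconds */
};

/*
 * Prefer the newly woken task, but only while it is within
 * wakeup_gran of the leftmost (most deserving) task. Outside that
 * window the leftmost task runs, which bounds the unfairness.
 */
static const struct entity *pick_next(const struct entity *leftmost,
				      const struct entity *wakee,
				      uint64_t wakeup_gran)
{
	assert(leftmost != NULL);
	if (wakee == NULL)
		return leftmost;

	/* Signed delta: how far the wakee is ahead of the leftmost task. */
	int64_t ahead = (int64_t)(wakee->vruntime - leftmost->vruntime);

	return ahead < (int64_t)wakeup_gran ? wakee : leftmost;
}
```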
    • sched: min_vruntime fix · 3fe69747
      Peter Zijlstra authored
      Current min_vruntime tracking is incorrect and will cause serious
      problems when we don't run the leftmost task for some reason.
      
      min_vruntime does two things; 1) it's used to determine a forward
      direction when the u64 vruntime wraps, 2) it's used to track the
      leftmost vruntime to position newly enqueued tasks from.
      
      The current logic advances min_vruntime whenever the current task's
      vruntime advances. Because the current task may pass the leftmost task
      still waiting, we're failing the second goal. This causes new tasks to be
      placed too far ahead and thus penalizes their runtime.
      
      Fix this by making min_vruntime the min_vruntime of the waiting tasks by
      tracking it in enqueue/dequeue, and compare against current's vruntime
      to obtain the absolute minimum when placing new tasks.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      3fe69747
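A minimal sketch of the placement rule described above, assuming the minimum is taken with a wrap-safe signed comparison (which matches the first stated purpose of min_vruntime). The function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-safe minimum of two u64 virtual runtimes via a signed delta. */
static uint64_t vruntime_min(uint64_t a, uint64_t b)
{
	int64_t delta = (int64_t)(b - a);

	return delta < 0 ? b : a;
}

/*
 * Base from which a newly enqueued task is positioned: the smaller of
 * the leftmost waiting task's vruntime and the current task's vruntime.
 */
static uint64_t placement_base(uint64_t leftmost_waiting, uint64_t curr)
{
	uint64_t base = vruntime_min(leftmost_waiting, curr);

	/* The base must never be ahead of the running task. */
	assert((int64_t)(base - curr) <= 0);
	return base;
}
```

The signed delta keeps the comparison meaningful when one value has wrapped past zero and the other has not, as long as the two are less than 2^63 apart.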
  6. 07 Mar, 2008 1 commit
  7. 05 Mar, 2008 1 commit
    • sched: revert load_balance_monitor() changes · 62fb1851
      Peter Zijlstra authored
      The following commits cause a number of regressions:
      
        commit 58e2d4ca
        Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
        Date:   Fri Jan 25 21:08:00 2008 +0100
        sched: group scheduling, change how cpu load is calculated
      
        commit 6b2d7700
        Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
        Date:   Fri Jan 25 21:08:00 2008 +0100
        sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups
      
      Namely:
       - very frequent wakeups on SMP, reported by PowerTop users.
       - cacheline thrashing on (large) SMP
       - some latencies larger than 500ms
      
      While there is a mergeable patch to fix the latter, the former issues
      are not fixable in a manner suitable for .25 (we're at -rc3 now).
      
      Hence we revert them and try again in v2.6.26.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      62fb1851
  8. 25 Feb, 2008 2 commits
  9. 01 Feb, 2008 2 commits
    • sched: let +nice tasks have smaller impact · ef9884e6
      Peter Zijlstra authored
      Michel Dänzer has bisected an interactivity problem with
      plus-reniced tasks back to this commit:
      
       810e95cc is first bad commit
       commit 810e95cc
       Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
       Date:   Mon Oct 15 17:00:14 2007 +0200
      
       sched: another wakeup_granularity fix
      
            unit mis-match: wakeup_gran was used against a vruntime
      
      Fix this by asymmetrically scaling the vtime of positively reniced
      tasks.
      Bisected-by: Michel Dänzer <michel@tungstengraphics.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ef9884e6
    • sched: fix high wake up latencies with FAIR_USER_SCHED · 296825cb
      Srivatsa Vaddagiri authored
      The reason why we are getting better wakeup latencies for
      !FAIR_USER_SCHED is because of this snippet of code in place_entity():
      
      	if (!initial) {
      		/* sleeps upto a single latency don't count. */
      		if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se))
      						     ^^^^^^^^^^^^^^^^^^
      			vruntime -= sysctl_sched_latency;
      
      		/* ensure we never gain time by being placed backwards. */
      		vruntime = max_vruntime(se->vruntime, vruntime);
      	}
      
      NEW_FAIR_SLEEPERS feature gives credit for sleeping only to tasks and
      not group-level entities. With the patch attached, I could see that
      wakeup latencies with FAIR_USER_SCHED are restored to the same level as
      !FAIR_USER_SCHED.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      296825cb
  10. 26 Jan, 2008 11 commits
    • sched: keep total / count stats in addition to the max for · 6d082592
      Arjan van de Ven authored
      Right now, the linux kernel (with scheduler statistics enabled) keeps track
      of the maximum time a process is waiting to be scheduled. While the maximum
      is a very useful metric, tracking average and total is equally useful
      (at least for latencytop) to figure out the accumulated effect of scheduler
      delays. The accumulated effect is important to judge the performance impact
      of scheduler tuning/behavior.
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6d082592
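The bookkeeping this commit describes amounts to keeping a sum and a count next to the existing maximum, so that an average can be derived on demand. A minimal sketch, with a hypothetical struct and function names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative wait-time statistics; field names are assumptions. */
struct wait_stats {
	uint64_t max_ns;	/* longest single wait */
	uint64_t sum_ns;	/* accumulated wait time */
	uint64_t count;		/* number of waits recorded */
};

static void record_wait(struct wait_stats *ws, uint64_t delta_ns)
{
	assert(ws != NULL);
	if (delta_ns > ws->max_ns)
		ws->max_ns = delta_ns;
	ws->sum_ns += delta_ns;
	ws->count++;
}

/* Average is derived from sum and count; 0 when nothing was recorded. */
static uint64_t average_wait_ns(const struct wait_stats *ws)
{
	return ws->count ? ws->sum_ns / ws->count : 0;
}
```

Storing the sum and count rather than a precomputed average keeps the hot path to two additions and leaves the division to whoever reads the statistics.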
    • sched: fix: don't take a mutex from interrupt context · 5973e5b9
      Peter Zijlstra authored
      print_cfs_stats is callable from interrupt context (sysrq), hence it should
      not take mutexes. Change it to use RCU since the task group data is RCU
      freed anyway.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      5973e5b9
    • sched: latencytop support · 9745512c
      Arjan van de Ven authored
      LatencyTOP kernel infrastructure; it measures latencies in the
      scheduler and tracks them system-wide and per process.
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9745512c
    • sched: high-res preemption tick · 8f4d37ec
      Peter Zijlstra authored
      Use HR-timers (when available) to deliver an accurate preemption tick.
      
      The regular scheduler tick that runs at 1/HZ can be too coarse when nice
      levels are used. The fairness system will still keep the cpu utilisation 'fair'
      by later delaying a task that got an excessive amount of CPU time, but we try
      to minimize this by delivering preemption points spot-on.

      The average frequency of this extra interrupt is sched_latency / nr_latency,
      which need not be higher than 1/HZ; it's just that the distribution within the
      sched_latency period is important.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8f4d37ec
    • sched: RT-balance, add new methods to sched_class · cb469845
      Steven Rostedt authored
      Dmitry Adamushko found that the current implementation of the RT
      balancing code left out changes to the sched_setscheduler and
      rt_mutex_setprio.
      
      This patch addresses this issue by adding methods to the scheduler classes
      to handle being switched out of (switched_from) and being switched into
      (switched_to) a sched_class. A method for priority changes
      (prio_changed) is also added.
      
      This patch also removes some duplicate logic between rt_mutex_setprio and
      sched_setscheduler.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cb469845
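The three hooks named in this commit (switched_from, switched_to, prio_changed) can be pictured as optional function pointers on a class structure. The sketch below is a simplified user-space model: the struct layouts, the two example classes, and the `events` tally are hypothetical and exist only to make the call order visible.

```c
#include <assert.h>
#include <stddef.h>

struct task;

/* Hypothetical subset of a scheduler class: only the three new hooks. */
struct sched_class_sketch {
	void (*switched_from)(struct task *p);
	void (*switched_to)(struct task *p);
	void (*prio_changed)(struct task *p, int old_prio);
};

struct task {
	const struct sched_class_sketch *sclass;
	int prio;
	int events;	/* illustration only: tallies which hooks ran */
};

static void note_from(struct task *p) { p->events += 1; }
static void note_to(struct task *p) { p->events += 10; }
static void note_prio(struct task *p, int old_prio) { (void)old_prio; p->events += 100; }

static const struct sched_class_sketch rt_sketch = { note_from, note_to, note_prio };
static const struct sched_class_sketch fair_sketch = { note_from, note_to, note_prio };

/* One shared helper, so both callers avoid duplicating the switch logic. */
static void change_class(struct task *p, const struct sched_class_sketch *next)
{
	const struct sched_class_sketch *prev = p->sclass;

	assert(next != NULL);
	p->sclass = next;
	if (prev == next)
		return;
	if (prev && prev->switched_from)
		prev->switched_from(p);
	if (next->switched_to)
		next->switched_to(p);
}

static void change_prio(struct task *p, int new_prio)
{
	int old_prio = p->prio;

	p->prio = new_prio;
	if (old_prio != new_prio && p->sclass && p->sclass->prio_changed)
		p->sclass->prio_changed(p, old_prio);
}
```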
    • sched: remove do_div() from __sched_slice() · 4bf0b771
      Peter Zijlstra authored
      Yanmin Zhang noticed a nice optimization:
      
        p = l * nr / nl, nl = l/g -> p = g * nr
      
      which eliminates a do_div() from __sched_period().
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      4bf0b771
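The identity in this commit can be checked numerically. The sketch below assumes a reading of the symbols that the commit message does not spell out: l as the latency target, g as the minimum granularity, nr as the number of runnable tasks, and nl as the task count at which the period starts to stretch. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Before: p = l * nr / nl, with nl = l / g (costs a 64-bit division). */
static uint64_t period_with_div(uint64_t l, uint64_t g, uint64_t nr)
{
	assert(g > 0 && l >= g);
	uint64_t nl = l / g;

	return l * nr / nl;
}

/* After: substituting nl = l / g gives p = g * nr, with no division. */
static uint64_t period_no_div(uint64_t g, uint64_t nr)
{
	return g * nr;
}
```

With l = 20 ms, g = 4 ms and nr = 8, both forms give 32 ms. The two agree exactly only when g divides l; otherwise the integer division in the first form introduces rounding.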
    • sched: no need for 'affine wakeup' balancing · 9ec3b77e
      Dmitry Adamushko authored
      No need to do a check for 'affine wakeup and passive balancing possibilities'
      in select_task_rq_fair() when task_cpu(p) == this_cpu.
      
      I guess, this part got missed upon introduction of per-sched_class
      select_task_rq() in try_to_wake_up().
      Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9ec3b77e
    • sched: de-SCHED_OTHER-ize the RT path · e7693a36
      Gregory Haskins authored
      The current wake-up code path tries to determine if it can optimize the
      wake-up to "this_cpu" by computing load calculations.  The problem is that
      these calculations are only relevant to SCHED_OTHER tasks where load is king.
      For RT tasks, priority is king.  So the load calculation is completely wasted
      bandwidth.
      
      Therefore, we create a new sched_class interface to help with
      pre-wakeup routing decisions and move the load calculation as a function
      of CFS task's class.
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e7693a36
    • sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups · 6b2d7700
      Srivatsa Vaddagiri authored
      The current load balancing scheme isn't good enough for precise
      group fairness.
      
      For example: on an 8-cpu system, I created 3 groups as follows:
      
      	a = 8 tasks (cpu.shares = 1024)
      	b = 4 tasks (cpu.shares = 1024)
      	c = 3 tasks (cpu.shares = 1024)
      
      a, b and c are task groups that have equal weight. We would expect each
      of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.
      
      This is what I get with the latest scheduler git tree:
      --------------------------------------------------------------------------------
      Col1  | Col2    | Col3  |  Col4
      ------|---------|-------|-------------------------------------------------------
      a     | 277.676 | 57.8% | 54.1%  54.1%  54.1%  54.2%  56.7%  62.2%  62.8% 64.5%
      b     | 116.108 | 24.2% | 47.4%  48.1%  48.7%  49.3%
      c     |  86.326 | 18.0% | 47.5%  47.9%  48.5%
      --------------------------------------------------------------------------------
      
      Explanation of o/p:
      
      Col1 -> Group name
      Col2 -> Cumulative execution time (in seconds) received by all tasks of that
      	group in a 60sec window across 8 cpus
      Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
              percentage. Col3 data is derived as:
      		Col3 = 100 * Col2 / (NR_CPUS * 60)
      Col4 -> CPU bandwidth received by each individual task of the group.
      		Col4 = 100 * cpu_time_recd_by_task / 60
      
      [I can share the test case that produces a similar o/p if reqd]
      
      The deviation from desired group fairness is as below:
      
      	a = +24.47%
      	b = -9.13%
      	c = -15.33%
      
      which is quite high.
      
      After the patch below is applied, here are the results:
      
      --------------------------------------------------------------------------------
      Col1  | Col2    | Col3  |  Col4
      ------|---------|-------|-------------------------------------------------------
      a     | 163.112 | 34.0% | 33.2%  33.4%  33.5%  33.5%  33.7%  34.4%  34.8% 35.3%
      b     | 156.220 | 32.5% | 63.3%  64.5%  66.1%  66.5%
      c     | 160.653 | 33.5% | 85.8%  90.6%  91.4%
      --------------------------------------------------------------------------------
      
      Deviation from desired group fairness is as below:
      
      	a = +0.67%
      	b = -0.83%
      	c = +0.17%
      
      which is far better IMO. Most other runs have yielded a deviation within
      +-2% at the most, which is good.
      
      Why do we see bad (group) fairness with current scheduler?
      =========================================================
      
      Currently cpu's weight is just the summation of individual task weights.
      This can yield incorrect results. For ex: consider three groups as below
      on a 2-cpu system:
      
      	CPU0	CPU1
      ---------------------------
      	A (10)  B(5)
      		C(5)
      ---------------------------
      
      Group A has 10 tasks, all on CPU0, Group B and C have 5 tasks each all
      of which are on CPU1. Each task has the same weight (NICE_0_LOAD =
      1024).
      
      The current scheme would yield a cpu weight of 10240 (10*1024) for each cpu and
      the load balancer will think both CPUs are perfectly balanced and won't
      move around any tasks. This, however, would yield this bandwidth:
      
      	A = 50%
      	B = 25%
      	C = 25%
      
      which is not the desired result.
      
      What's changing in the patch?
      =============================
      
      	- How cpu weights are calculated when CONFIG_FAIR_GROUP_SCHED is
      	  defined (see below)
      	- API Change
      		- Two tunables introduced in sysfs (under SCHED_DEBUG) to
      		  control the frequency at which the load balance monitor
      		  thread runs.
      
      The basic change made in this patch is how cpu weight (rq->load.weight) is
      calculated. It's now calculated as the summation of group weights on a cpu,
      rather than summation of task weights. Weight exerted by a group on a
      cpu is dependent on the shares allocated to it and also the number of
      tasks the group has on that cpu compared to the total number of
      (runnable) tasks the group has in the system.
      
      Let,
      	W(K,i)  = Weight of group K on cpu i
      	T(K,i)  = Task load present in group K's cfs_rq on cpu i
      	T(K)    = Total task load of group K across various cpus
      	S(K) 	= Shares allocated to group K
      	NRCPUS	= Number of online cpus in the scheduler domain to
      	 	  which group K is assigned.
      
      Then,
      	W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)
      
      A load balance monitor thread is created at bootup, which periodically
      runs and adjusts group's weight on each cpu. To avoid its overhead, two
      min/max tunables are introduced (under SCHED_DEBUG) to control the rate
      at which it runs.
      
      Fixes from: Peter Zijlstra <a.p.zijlstra@chello.nl>
      
      - don't start the load_balance_monitor when there is only a single cpu.
      - rename the kthread because it's currently longer than TASK_COMM_LEN
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6b2d7700
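The per-cpu group weight formula from this commit, W(K,i) = S(K) * NRCPUS * T(K,i) / T(K), can be evaluated with a small helper. This is a sketch with a hypothetical function name; it uses plain integer arithmetic and ignores the periodic monitor thread entirely.

```c
#include <assert.h>
#include <stdint.h>

/*
 * W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)
 *   shares       S(K):   shares allocated to group K
 *   nr_cpus      NRCPUS: online cpus in the group's scheduler domain
 *   load_on_cpu  T(K,i): task load of group K on cpu i
 *   total_load   T(K):   task load of group K across all cpus
 */
static uint64_t group_weight_on_cpu(uint64_t shares, uint64_t nr_cpus,
				    uint64_t load_on_cpu, uint64_t total_load)
{
	assert(total_load > 0);
	return shares * nr_cpus * load_on_cpu / total_load;
}
```

Applied to the 2-cpu example in the commit message, and assuming each group has 1024 shares (the message does not state the shares for that example), group A contributes 2048 to CPU0 while B and C contribute 2048 each to CPU1. CPU1 then carries a weight of 4096 against 2048 on CPU0, so the imbalance that the task-weight sum (10240 on each cpu) concealed becomes visible to the load balancer.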
    • sched: group scheduling, change how cpu load is calculated · 58e2d4ca
      Srivatsa Vaddagiri authored
      This patch changes how the cpu load exerted by fair_sched_class tasks
      is calculated. Load exerted by fair_sched_class tasks on a cpu is now
      a summation of the group weights, rather than summation of task weights.
      Weight exerted by a group on a cpu is dependent on the shares allocated
      to it.
      
      This version of patch has a minor impact on code size, but should have
      no runtime/functional impact for !CONFIG_FAIR_GROUP_SCHED.
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      58e2d4ca
    • sched: group scheduling, minor fixes · ec2c507f
      Srivatsa Vaddagiri authored
      Minor bug fixes for the group scheduler:
      
      - Use a mutex to serialize add/remove of task groups and also when
        changing shares of a task group. Use the same mutex when printing
        cfs_rq debugging stats for various task groups.
      
      - Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when
        walking task group list)
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ec2c507f
  11. 18 Dec, 2007 1 commit
  12. 05 Dec, 2007 1 commit
  13. 03 Dec, 2007 1 commit
    • sched: cpu accounting controller (V2) · d842de87
      Srivatsa Vaddagiri authored
      Commit cfb52856 removed a useful feature for
      us, which provided a cpu accounting resource controller.  This feature would be
      useful if someone wants to group tasks only for accounting purposes and doesn't
      really want to exercise any control over their cpu consumption.
      
      The patch below reintroduces the feature. It is based on Paul Menage's
      original patch (Commit 62d0df64), with
      these differences:
      
              - Removed load average information. I felt it needs more thought (esp
      	  to deal with SMP and virtualized platforms) and can be added for
      	  2.6.25 after more discussions.
              - Convert group cpu usage to be nanosecond accurate (as rest of the cfs
      	  stats are) and invoke cpuacct_charge() from the respective scheduler
      	  classes
      	- Make accounting scalable on SMP systems by splitting the usage
      	  counter to be per-cpu
      	- Move the code from kernel/cpu_acct.c to kernel/sched.c (since the
      	  code is not big enough to warrant a new file and also this rightly
      	  needs to live inside the scheduler. Also things like accessing
      	  rq->lock while reading cpu usage becomes easier if the code lived in
      	  kernel/sched.c)
      
      The patch also modifies the cpu controller not to provide the same accounting
      information.
      Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      
       Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran
       some simple tests like cpuspin (spin on the cpu), ran several tasks in
       the same group and timed them. Compared their time stamps with
       cpuacct.usage.
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d842de87
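The "split the usage counter to be per-cpu" point can be illustrated with a simplified user-space model. The struct, the fixed CPU count, and both function names are hypothetical, and the sketch does no locking, whereas the commit message mentions rq->lock on the read side.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 4	/* illustrative fixed CPU count */

/* One nanosecond counter per cpu, so each cpu only writes its own slot. */
struct cpuacct_sketch {
	uint64_t usage_ns[NR_CPUS_SKETCH];
};

/* Charge runtime to the group on the cpu where the task ran. */
static void charge(struct cpuacct_sketch *ca, unsigned int cpu,
		   uint64_t delta_ns)
{
	assert(cpu < NR_CPUS_SKETCH);
	ca->usage_ns[cpu] += delta_ns;
}

/* Readers pay the cost instead: sum the per-cpu slots on demand. */
static uint64_t total_usage_ns(const struct cpuacct_sketch *ca)
{
	uint64_t sum = 0;

	for (unsigned int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		sum += ca->usage_ns[cpu];
	return sum;
}
```

The design trades a slightly more expensive read for updates that never contend on a shared counter, which is the usual reason for per-cpu splitting on SMP.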
  14. 27 Nov, 2007 1 commit
  15. 16 Nov, 2007 1 commit
  16. 10 Nov, 2007 1 commit
    • sched: fix copy_namespace() <-> sched_fork() dependency in do_fork · 3c90e6e9
      Srivatsa Vaddagiri authored
      Sukadev Bhattiprolu reported a kernel crash with control groups.
      There are a couple of problems discovered by Suka's test:
      
      - The test requires the cgroup filesystem to be mounted with
        at least the cpu and ns options (i.e. both namespace and cpu
        controllers are active in the same hierarchy).
      
      	# mkdir /dev/cpuctl
      	# mount -t cgroup -ocpu,ns none cpuctl
      	(or simply)
      	# mount -t cgroup none cpuctl -> Will activate all controllers
      					 in same hierarchy.
      
      - The test invokes clone() with CLONE_NEWNS set. This causes a new child
        to be created, also a new group (do_fork->copy_namespaces->ns_cgroup_clone->
        cgroup_clone) and the child is attached to the new group (cgroup_clone->
        attach_task->sched_move_task). At this point in time, the child's scheduler
        related fields are uninitialized (including its on_rq field, which it has
        inherited from parent). As a result sched_move_task thinks it's on
        runqueue, when it isn't.
      
        As a solution to this problem, I moved sched_fork() call, which
        initializes scheduler related fields on a new task, before
        copy_namespaces(). I am not sure though whether moving up will
        cause other side-effects. Do you see any issue?
      
      - The second problem exposed by this test is that task_new_fair()
        assumes that parent and child will be part of the same group (which 
        needn't be as this test shows). As a result, cfs_rq->curr can be NULL
        for the child.
      
        The solution is to test for curr pointer being NULL in
        task_new_fair().
      
      With the patch below, I could run ns_exec() fine w/o a crash.
      Reported-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      3c90e6e9