1. 26 Jan, 2008 9 commits
    • sched: add RT-balance cpu-weight · 73fe6aae
      Gregory Haskins committed
      Some RT tasks (particularly kthreads) are bound to one specific CPU.
      It is fairly common for two or more bound tasks to get queued up at the
      same time.  Consider, for instance, softirq_timer and softirq_sched.  A
      timer goes off in an ISR which schedules softirq_thread to run at RT50.
      Then the timer handler determines that it's time to smp-rebalance the
      system so it schedules softirq_sched to run.  So we are in a situation
      where we have two RT50 tasks queued, and the system will go into
      rt-overload condition to request other CPUs for help.
      
      This causes two problems in the current code:
      
      1) If a high-priority bound task and a low-priority unbound task queue
         up behind the running task, we will fail to ever relocate the unbound
         task because we terminate the search on the first unmovable task.

      2) We spend precious futile cycles in the fast-path trying to pull
         overloaded tasks over.  It is therefore optimal to strive to avoid the
         overhead altogether if we can cheaply detect the condition before
         overload even occurs.
      
      This patch tries to achieve this optimization by utilizing the hamming
      weight of the task->cpus_allowed mask.  A weight of 1 indicates that
      the task cannot be migrated.  We will then utilize this information to
      skip non-migratable tasks and to eliminate unnecessary rebalance attempts.
      
      We introduce a per-rq variable to count the number of migratable tasks
      that are currently running.  We only go into overload if we have more
      than one rt task, AND at least one of them is migratable.
      
      In addition, we introduce a per-task variable to cache the cpus_allowed
      weight, since the hamming calculation is probably relatively expensive.
      We only update the cached value when the mask is updated which should be
      relatively infrequent, especially compared to scheduling frequency
      in the fast path.
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
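The accounting described in this commit can be sketched as follows. This is a minimal user-space illustration, not the kernel code: the struct layouts, the 64-bit mask, and every function name below are assumptions made for the example. It caches the Hamming weight of the affinity mask when the mask changes, counts migratable RT tasks per runqueue, and reports overload only when more than one RT task is queued and at least one of them can move.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel's task and runqueue. */
struct task {
    uint64_t cpus_allowed;   /* affinity bitmask, one bit per CPU */
    int nr_cpus_allowed;     /* cached Hamming weight of cpus_allowed */
};

struct rt_rq {
    int rt_nr_running;       /* RT tasks queued on this runqueue */
    int rt_nr_migratory;     /* of those, how many may run on another CPU */
};

/* Hamming weight: the number of set bits in the mask. */
int mask_weight(uint64_t mask)
{
    int n = 0;
    while (mask) {
        mask &= mask - 1;    /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* The weight is recomputed only here, when the mask changes. */
void set_cpus_allowed(struct task *t, uint64_t mask)
{
    t->cpus_allowed = mask;
    t->nr_cpus_allowed = mask_weight(mask);
}

void enqueue_rt(struct rt_rq *rq, const struct task *t)
{
    rq->rt_nr_running++;
    if (t->nr_cpus_allowed > 1)      /* weight 1 means bound to one CPU */
        rq->rt_nr_migratory++;
}

void dequeue_rt(struct rt_rq *rq, const struct task *t)
{
    assert(rq->rt_nr_running > 0);
    rq->rt_nr_running--;
    if (t->nr_cpus_allowed > 1)
        rq->rt_nr_migratory--;
}

/* Overloaded: more than one RT task AND at least one is migratable. */
int rt_overloaded(const struct rt_rq *rq)
{
    return rq->rt_nr_running > 1 && rq->rt_nr_migratory >= 1;
}
```

With two bound RT50 tasks queued, as in the softirq example above, rt_overloaded() stays false, so no other CPU is asked for help.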
    • sched: disable standard balancer for RT tasks · c7a1e46a
      Steven Rostedt committed
      Since we now take an active approach to load balancing, we don't need to
      balance RT tasks via the normal task balancer. In fact, this code was
      found to pull RT tasks away from the CPUs that the active balancing had
      moved them to, resulting in large latencies.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: push RT tasks from overloaded CPUs · 4642dafd
      Steven Rostedt committed
      This patch adds pushing of overloaded RT tasks from a runqueue when
      tasks (most likely RT tasks) are being added to that runqueue.
      
      TODO: We don't cover the case of waking of new RT tasks (yet).
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: pull RT tasks from overloaded runqueues · f65eda4f
      Steven Rostedt committed
      This patch adds the algorithm to pull tasks from RT overloaded runqueues.
      
      When an RT pull is initiated, all overloaded runqueues are examined for
      an RT task that is higher in prio than the highest prio task queued on the
      target runqueue. If another runqueue is found to hold an RT task of higher
      prio than the highest prio task on the target runqueue, that task is pulled
      to the target runqueue.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
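The selection step of the pull described above might look roughly like this. It is a simplified sketch under stated assumptions: the struct fields, NR_CPUS, and find_pull_source() are hypothetical names, a lower number means a higher priority (the kernel convention), and the sketch only picks the best source CPU rather than performing the locking and migration the real patch has to do.

```c
#include <assert.h>

#define NR_CPUS 4
#define MAX_RT_PRIO 100      /* "no RT task" sentinel; lower value = higher prio */

/* Hypothetical per-CPU summary used only for this example. */
struct rq {
    int rt_overloaded;       /* more than one RT task queued here */
    int highest_prio;        /* best prio queued on this runqueue */
    int next_prio;           /* best prio among RT tasks queued but not running */
};

/*
 * Scan the overloaded runqueues for a waiting RT task that beats the
 * highest prio task on this_cpu. Returns the CPU to pull from, or -1.
 */
int find_pull_source(const struct rq rqs[NR_CPUS], int this_cpu)
{
    assert(this_cpu >= 0 && this_cpu < NR_CPUS);

    int src = -1;
    int best = rqs[this_cpu].highest_prio;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (cpu == this_cpu || !rqs[cpu].rt_overloaded)
            continue;                 /* only overloaded runqueues are examined */
        if (rqs[cpu].next_prio < best) {
            best = rqs[cpu].next_prio;
            src = cpu;
        }
    }
    return src;
}
```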
    • sched: add rt-overload tracking · 4fd29176
      Steven Rostedt committed
      This patch adds an RT overload accounting system. When a runqueue has
      more than one RT task queued, it is marked as overloaded. That is, it
      is a candidate to have RT tasks pulled from it.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: add RT task pushing · e8fa1362
      Steven Rostedt committed
      This patch adds an algorithm to push extra RT tasks off a run queue to
      other CPU runqueues.
      
      When more than one RT task is added to a run queue, this algorithm takes
      an assertive approach to push the RT tasks that are not running onto other
      run queues that have lower priority.  The way this works is that the highest
      prio RT task that is not running is looked at and we examine the runqueues on
      the CPUs in that task's affinity mask. We find the runqueue with the lowest
      prio in the CPU affinity of the picked task, and if it is lower in prio than
      the picked task, we push the task onto that CPU runqueue.
      
      We continue pushing RT tasks off the current runqueue until we can't push any
      more.  The algorithm stops when the next highest RT task can't preempt any
      other processes on other CPUs.
      
      TODO: The algorithm may stop when there are still RT tasks that can be
       migrated. Specifically, if the CPU affinity of the highest prio non-running
       RT task is restricted to CPUs that are running higher priority tasks, there
       may be a lower priority task queued that has an affinity with a CPU that is
       running a lower priority task that it could be migrated to.  This
       patch set does not address this issue.
      
      Note: checkpatch reveals two over 80 character instances. I'm not sure
       that breaking them up will help visually, so I left them as is.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
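The target search described in this commit can be sketched as below. This is an illustration only: find_push_target(), the plain array of per-CPU priorities, and the unsigned long affinity mask are assumptions for the example. As in the kernel, a lower number means a higher priority, so the "lowest prio" runqueue is the one with the largest value.

```c
#include <assert.h>
#include <limits.h>

/*
 * rq_prio[cpu] is the highest prio currently on that CPU's runqueue
 * (lower value = higher priority). Returns the CPU inside `affinity`
 * whose runqueue has the lowest priority, provided that priority is
 * lower than task_prio, or -1 if the task cannot preempt anywhere.
 */
int find_push_target(const int rq_prio[], int nr_cpus,
                     unsigned long affinity, int task_prio)
{
    assert(nr_cpus > 0 &&
           (unsigned long)nr_cpus <= sizeof(unsigned long) * CHAR_BIT);

    int target = -1;
    int lowest = task_prio;   /* a candidate must be strictly lower in prio */

    for (int cpu = 0; cpu < nr_cpus; cpu++) {
        if (!(affinity & (1UL << cpu)))
            continue;                     /* task is not allowed on this CPU */
        if (rq_prio[cpu] > lowest) {
            lowest = rq_prio[cpu];
            target = cpu;
        }
    }
    return target;
}
```

A push loop would call this for the highest prio non-running RT task and stop at the first -1, which is exactly the limitation the TODO above points out: a lower prio task with a wider affinity mask might still have been movable.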
    • sched: track highest prio task queued · 764a9d6f
      Steven Rostedt committed
      This patch adds accounting to each runqueue to keep track of the
      highest prio task queued on the run queue. We only care about
      RT tasks, so if the run queue does not contain any active RT tasks
      its priority will be considered MAX_RT_PRIO.
      
      This information will be used for later patches.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
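One way to keep such a per-runqueue value up to date is sketched here. The per-priority counter array and the function names are assumptions for the example (the kernel uses a priority bitmap and list array rather than counters); only the MAX_RT_PRIO sentinel for "no active RT tasks" comes from the commit message.

```c
#include <assert.h>
#include <string.h>

#define MAX_RT_PRIO 100      /* lower value = higher priority */

/* Hypothetical runqueue summary for this example. */
struct rt_rq {
    int nr_at_prio[MAX_RT_PRIO];   /* RT tasks queued at each priority */
    int highest_prio;              /* MAX_RT_PRIO when no RT task is queued */
};

void rt_rq_init(struct rt_rq *rq)
{
    memset(rq->nr_at_prio, 0, sizeof rq->nr_at_prio);
    rq->highest_prio = MAX_RT_PRIO;
}

void inc_rt_task(struct rt_rq *rq, int prio)
{
    assert(prio >= 0 && prio < MAX_RT_PRIO);
    rq->nr_at_prio[prio]++;
    if (prio < rq->highest_prio)
        rq->highest_prio = prio;
}

void dec_rt_task(struct rt_rq *rq, int prio)
{
    assert(prio >= 0 && prio < MAX_RT_PRIO && rq->nr_at_prio[prio] > 0);
    rq->nr_at_prio[prio]--;
    /* Only rescan when the last task at the current highest prio left. */
    if (prio == rq->highest_prio && rq->nr_at_prio[prio] == 0) {
        int p = prio + 1;
        while (p < MAX_RT_PRIO && rq->nr_at_prio[p] == 0)
            p++;
        rq->highest_prio = p;      /* ends at MAX_RT_PRIO if nothing is left */
    }
}
```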
    • sched: count # of queued RT tasks · 63489e45
      Steven Rostedt committed
      This patch adds accounting to keep track of the number of RT tasks running
      on a runqueue. This information will be used in later patches.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: group scheduling, change how cpu load is calculated · 58e2d4ca
      Srivatsa Vaddagiri committed
      This patch changes how the cpu load exerted by fair_sched_class tasks
      is calculated. Load exerted by fair_sched_class tasks on a cpu is now
      a summation of the group weights, rather than a summation of task weights.
      Weight exerted by a group on a cpu is dependent on the shares allocated
      to it.
      
      This version of the patch has a minor impact on code size, but should have
      no runtime/functional impact for !CONFIG_FAIR_GROUP_SCHED.
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
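The commit message does not give the formula for a group's per-cpu weight, so the following is only one plausible reading, stated as an assumption: a group's shares are split across CPUs in proportion to how much of the group's task load sits on each CPU, and a CPU's load is the sum of those group weights. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * ASSUMPTION: proportional split of a group's shares across CPUs.
 * local_load is the group's task weight on this CPU, total_load its
 * task weight summed over all CPUs. Values are assumed small enough
 * that shares * local_load does not overflow 64 bits.
 */
uint64_t group_cpu_weight(uint64_t shares, uint64_t local_load,
                          uint64_t total_load)
{
    assert(local_load <= total_load);
    if (total_load == 0)
        return 0;                 /* the group has no runnable tasks */
    return shares * local_load / total_load;
}

/* CPU load as a summation of group weights rather than task weights. */
uint64_t cpu_load(const uint64_t group_weight[], int nr_groups)
{
    uint64_t sum = 0;
    for (int i = 0; i < nr_groups; i++)
        sum += group_weight[i];
    return sum;
}
```

Under this reading, a group with 1024 shares and half of its task load on a CPU contributes 512 to that CPU's load, however many tasks make up that half.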
  2. 20 Dec, 2007 1 commit
  3. 03 Dec, 2007 1 commit
    • sched: cpu accounting controller (V2) · d842de87
      Srivatsa Vaddagiri committed
      Commit cfb52856 removed a useful feature for us, which provided a cpu
      accounting resource controller.  This feature would be useful if someone
      wants to group tasks only for accounting purposes and doesn't really want
      to exercise any control over their cpu consumption.
      
      The patch below reintroduces the feature. It is based on Paul Menage's
      original patch (Commit 62d0df64), with
      these differences:
      
      - Removed load average information. I felt it needs more thought (esp
        to deal with SMP and virtualized platforms) and can be added for
        2.6.25 after more discussions.
      - Convert group cpu usage to be nanosecond accurate (as rest of the cfs
        stats are) and invoke cpuacct_charge() from the respective scheduler
        classes
      - Make accounting scalable on SMP systems by splitting the usage
        counter to be per-cpu
      - Move the code from kernel/cpu_acct.c to kernel/sched.c (since the
        code is not big enough to warrant a new file and also this rightly
        needs to live inside the scheduler. Also things like accessing
        rq->lock while reading cpu usage becomes easier if the code lived in
        kernel/sched.c)
      
      The patch also modifies the cpu controller not to provide the same accounting
      information.
      Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      
       Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran
       some simple tests like cpuspin (spin on the cpu), ran several tasks in
       the same group and timed them. Compared their time stamps with
       cpuacct.usage.
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
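The per-cpu split of the usage counter listed above can be illustrated with a small user-space sketch. cpuacct_charge() is named in the commit message, but the signature used here, the struct layout, NR_CPUS, and cpuacct_usage() are assumptions for the example; the real code also has to deal with locking and task-to-group lookup.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4

/* Hypothetical accounting group: one nanosecond counter per CPU. */
struct cpuacct {
    uint64_t cpuusage[NR_CPUS];
};

/* Charge delta_ns of CPU time; touches only the local CPU's counter. */
void cpuacct_charge(struct cpuacct *ca, int cpu, uint64_t delta_ns)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    ca->cpuusage[cpu] += delta_ns;
}

/* Reading the group's usage sums the per-CPU counters. */
uint64_t cpuacct_usage(const struct cpuacct *ca)
{
    uint64_t total = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        total += ca->cpuusage[cpu];
    return total;
}
```

The design trade-off is that the frequent operation (charging) stays local to one CPU, while the rare operation (reading cpuacct.usage) pays for the summation.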
  4. 25 Oct, 2007 2 commits
    • sched: isolate SMP balancing code a bit more · 681f3e68
      Peter Williams committed
      At the moment, a lot of load balancing code that is irrelevant to non-SMP
      systems gets included during non-SMP builds.
      
      This patch addresses this issue and reduces the binary size on non-SMP
      systems:
      
         text    data     bss     dec     hex filename
        10983      28    1192   12203    2fab sched.o.before
        10739      28    1192   11959    2eb7 sched.o.after
      Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: reduce balance-tasks overhead · e1d1484f
      Peter Williams committed
      At the moment, balance_tasks() provides low level functionality for both
      move_tasks() and move_one_task() (indirectly) via the load_balance()
      function (in the sched_class interface), which also provides dual
      functionality.  This dual functionality complicates the interfaces and
      internal mechanisms and adds to the run time overhead of operations that
      are called with two run queue locks held.
      
      This patch addresses this issue and reduces the overhead of these
      operations.
      Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  5. 15 Oct, 2007 7 commits
  6. 25 Aug, 2007 1 commit
    • sched: optimize task_tick_rt() a bit · 98fbc798
      Dmitry Adamushko committed
      Mitchell Erblich suggested a quality-of-implementation change to
      not requeue SCHED_RR tasks if there's only a single task on the
      runqueue, by checking for rq->nr_running == 1.
      
      Provide a more efficient implementation of that, by checking that
      particular RT priority-queue only.
      
      [ From: mingo@elte.hu ]
      
      Also first requeue the task then set need_resched - results in slightly
      better machine-instruction ordering. Also clean up the code a bit.
      Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
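The check described in this commit can be sketched with a circular doubly linked list in the style of the kernel's list_head. The list helpers below are re-implemented for the example, and rr_timeslice_expired() and the struct task layout are hypothetical names. On a circular list with a head node, a task is alone on its priority queue exactly when its prev and next pointers are equal, because both point at the head.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
    struct list_head *prev, *next;
};

/* Hypothetical task with just the fields this example needs. */
struct task {
    struct list_head run_list;
    int need_resched;
};

void list_init(struct list_head *head)
{
    head->prev = head;
    head->next = head;
}

void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

void list_del(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
}

/*
 * Called when a SCHED_RR task has used up its timeslice. Returns 1 if
 * the task was moved to the tail of its priority queue, 0 if it is the
 * only task at that priority and nothing needs to happen.
 */
int rr_timeslice_expired(struct list_head *queue, struct task *p)
{
    assert(queue != NULL && p != NULL);

    if (p->run_list.prev == p->run_list.next)
        return 0;                  /* alone on this priority queue */

    /* First requeue the task, then set need_resched. */
    list_del(&p->run_list);
    list_add_tail(&p->run_list, queue);
    p->need_resched = 1;
    return 1;
}
```

Unlike a test on rq->nr_running == 1, this looks only at the queue for the task's own priority, so tasks queued at other priorities do not force a pointless requeue.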
  7. 09 Aug, 2007 8 commits
    • sched: remove the 'u64 now' parameter from ->put_prev_task() · 31ee529c
      Ingo Molnar committed
      remove the 'u64 now' parameter from ->put_prev_task().
      
      ( identity transformation that causes no change in functionality. )
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: remove the 'u64 now' parameter from ->pick_next_task() · fb8d4724
      Ingo Molnar committed
      remove the 'u64 now' parameter from ->pick_next_task().
      
      ( identity transformation that causes no change in functionality. )
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: remove the 'u64 now' parameter from ->dequeue_task() · f02231e5
      Ingo Molnar committed
      remove the 'u64 now' parameter from ->dequeue_task().
      
      ( identity transformation that causes no change in functionality. )
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: remove the 'u64 now' parameter from ->enqueue_task() · fd390f6a
      Ingo Molnar committed
      remove the 'u64 now' parameter from ->enqueue_task().
      
      ( identity transformation that causes no change in functionality. )
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: remove the 'u64 now' parameter from update_curr_rt() · f1e14ef6
      Ingo Molnar committed
      remove the 'u64 now' parameter from update_curr_rt().
      
      ( identity transformation that causes no change in functionality. )
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: remove 'now' use from assignments · d281918d
      Ingo Molnar committed
      change all 'now' timestamp uses in assignments to rq->clock.
      
      ( this is an identity transformation that causes no functionality change:
        all such new rq->clock is necessarily preceded by an update_rq_clock()
        call. )
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fix bug in balance_tasks() · a4ac01c3
      Peter Williams committed
      There are two problems with balance_tasks() and how it is used:
      
      1. The variables best_prio and best_prio_seen (inherited from the old
      move_tasks()) were only required to handle problems caused by the
      active/expired arrays, the order in which they were processed and the
      possibility that the task with the highest priority could be on either.
      These issues are no longer present and the extra overhead associated
      with their use is unnecessary (and possibly wrong).
      
      2. In the absence of CONFIG_FAIR_GROUP_SCHED being set, the same
      this_best_prio variable needs to be used by all scheduling classes or
      there is a risk of moving too much load.  E.g. if the highest priority
      task on this_rq at the beginning is a fairly low priority task and the rt
      class migrates a task (during its turn) then that moved task becomes the
      new highest priority task on this_rq but when the sched_fair class
      initializes its copy of this_best_prio it will get the priority of the
      original highest priority task as, due to the run queue locks being
      held, the reschedule triggered by pull_task() will not have taken place.
      This could result in inappropriate overriding of skip_for_load and
      excessive load being moved.
      
      The attached patch addresses these problems by deleting all reference to
      best_prio and best_prio_seen and making this_best_prio a reference
      parameter to the various functions involved.
      
      load_balance_fair() has also been modified so that this_best_prio is
      only reset (in the loop) if CONFIG_FAIR_GROUP_SCHED is set.  This should
      preserve the effect of helping spread groups' higher priority tasks
      around the available CPUs while improving system performance when
      CONFIG_FAIR_GROUP_SCHED isn't set.
      Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
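The second problem above can be reduced to a small model. The sketch is deliberately narrow and its names are hypothetical: it models only the priority override, that is, it assumes every candidate task would otherwise be skipped for load and gets moved only if it beats the best priority seen so far on this_rq (lower number = higher priority). It contrasts one this_best_prio passed by reference through both classes with each class initializing its own copy.

```c
#include <assert.h>

#define MAX_PRIO 140         /* lower value = higher priority */

/*
 * Hypothetical per-class balancer: moves the candidate only if it
 * beats *this_best_prio, and records the new best when it does.
 * Returns the number of tasks moved (0 or 1).
 */
int class_load_balance(int candidate_prio, int *this_best_prio)
{
    assert(candidate_prio >= 0 && candidate_prio < MAX_PRIO);
    if (candidate_prio >= *this_best_prio)
        return 0;
    *this_best_prio = candidate_prio;
    return 1;
}

/* Fixed behaviour: the rt and fair classes share one this_best_prio. */
int balance_shared(int this_rq_prio, int rt_candidate, int fair_candidate)
{
    int this_best_prio = this_rq_prio;
    int moved = 0;

    moved += class_load_balance(rt_candidate, &this_best_prio);
    moved += class_load_balance(fair_candidate, &this_best_prio);
    return moved;
}

/* Old behaviour: each class starts from the original highest prio. */
int balance_private_copies(int this_rq_prio, int rt_candidate,
                           int fair_candidate)
{
    int rt_best = this_rq_prio;
    int fair_best = this_rq_prio;   /* stale: ignores what rt just moved */

    return class_load_balance(rt_candidate, &rt_best) +
           class_load_balance(fair_candidate, &fair_best);
}
```

With this_rq at prio 130, an rt candidate at 10 and a fair candidate at 120, the shared variable moves one task while the private copies move two, which is the excess load the commit message warns about.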
    • sched: simplify move_tasks() · 43010659
      Peter Williams committed
      The move_tasks() function is currently multiplexed with two distinct
      capabilities:
      
      1. attempt to move a specified amount of weighted load from one run
      queue to another; and
      2. attempt to move a specified number of tasks from one run queue to
      another.
      
      The first of these capabilities is used in two places, load_balance()
      and load_balance_idle(), and in both of these cases the return value of
      move_tasks() is used purely to decide if tasks/load were moved and no
      notice of the actual number of tasks moved is taken.
      
      The second capability is used in exactly one place,
      active_load_balance(), to attempt to move exactly one task and, as
      before, the return value is only used as an indicator of success or failure.
      
      This multiplexing of move_tasks() was introduced, by me, as part of the
      smpnice patches and was motivated by the fact that the alternative, one
      function to move specified load and one to move a single task, would
      have led to two functions of roughly the same complexity as the old
      move_tasks() (or the new balance_tasks()).  However, the new modular
      design of the new CFS scheduler allows a simpler solution to be adopted
      and this patch addresses that solution by:
      
      1. adding a new function, move_one_task(), to be used by
      active_load_balance(); and
      2. making move_tasks() a single purpose function that tries to move a
      specified weighted load and returns 1 for success and 0 for failure.
      
      One of the consequences of these changes is that neither move_one_task()
      nor the new move_tasks() cares how many tasks sched_class.load_balance()
      moves and this enables its interface to be simplified by returning the
      amount of load moved as its result and removing the load_moved pointer
      from the argument list.  This helps simplify the new move_tasks() and
      slightly reduces the amount of work done in each of
      sched_class.load_balance()'s implementations.
      
      Further simplification, e.g. changes to balance_tasks(), are possible
      but (slightly) complicated by the special needs of load_balance_fair()
      so I've left them to a later patch (if this one gets accepted).
      
      NB Since move_tasks() gets called with two run queue locks held even
      small reductions in overhead are worthwhile.
      
      [ mingo@elte.hu ]
      
      this change also reduces code size nicely:
      
         text    data     bss     dec     hex filename
        39216    3618      24   42858    a76a sched.o.before
        39173    3618      24   42815    a73f sched.o.after
      Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
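The split described in this commit can be sketched as two thin wrappers over one low-level helper. This is a user-space illustration under stated assumptions: the array-backed struct rq, MAX_TASKS, and all three signatures are simplified and hypothetical (the kernel functions take more arguments and go through sched_class.load_balance()). What it keeps from the commit message is the contract: the helper returns the amount of load moved, and both callers reduce that to 1 for success and 0 for failure.

```c
#include <assert.h>
#include <limits.h>

#define MAX_TASKS 8

/* Hypothetical runqueue: just the weights of its queued tasks. */
struct rq {
    unsigned long weight[MAX_TASKS];
    int nr;
};

/*
 * Low-level helper: move tasks from busiest to this_rq, bounded by
 * both a load limit and a task-count limit. Returns the load moved.
 */
unsigned long balance_tasks(struct rq *this_rq, struct rq *busiest,
                            unsigned long max_load_move, int max_nr_move)
{
    unsigned long moved = 0;
    int nr_moved = 0;
    int i = 0;

    assert(this_rq != busiest);
    while (i < busiest->nr && nr_moved < max_nr_move &&
           this_rq->nr < MAX_TASKS) {
        unsigned long w = busiest->weight[i];

        if (w > max_load_move - moved) {   /* too heavy for what is left */
            i++;
            continue;
        }
        this_rq->weight[this_rq->nr++] = w;
        /* Remove entry i by overwriting it with the last entry. */
        busiest->weight[i] = busiest->weight[--busiest->nr];
        moved += w;
        nr_moved++;
    }
    return moved;
}

/* Single purpose: try to move up to max_load_move of weighted load. */
int move_tasks(struct rq *this_rq, struct rq *busiest,
               unsigned long max_load_move)
{
    return balance_tasks(this_rq, busiest, max_load_move, INT_MAX) > 0;
}

/* Single purpose: try to move exactly one task, whatever its weight. */
int move_one_task(struct rq *this_rq, struct rq *busiest)
{
    return balance_tasks(this_rq, busiest, ULONG_MAX, 1) > 0;
}
```

Neither wrapper needs a count of tasks moved, which is why the load_moved out-parameter can disappear from the interface.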
  8. 02 Aug, 2007 2 commits
  9. 10 Jul, 2007 1 commit