1. 18 Nov, 2010 1 commit
  2. 19 Oct, 2010 2 commits
    • sched: Do not account irq time to current task · 305e6835
      Venkatesh Pallipadi committed
      Scheduler accounts both softirq and interrupt processing times to the
      currently running task. This means, if the interrupt processing was
      for some other task in the system, then the current task ends up being
      penalized as it gets shorter runtime than otherwise.
      
      Change sched task accounting to account only actual task time to the
      currently running task. update_curr() now computes delta_exec from
      rq->clock_task.
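
      A minimal sketch of the idea, not the kernel code: the struct and
      helper names below are hypothetical, and only rq->clock_task and
      delta_exec come from the change log. Irq time is kept out of the
      clock that task accounting reads, so an interval that is 75% irq
      work charges only 25% to the task.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's struct rq. */
struct rq_sketch {
	uint64_t clock;      /* nanoseconds elapsed on this CPU */
	uint64_t irq_time;   /* nanoseconds spent in hardirq + softirq */
	uint64_t clock_task; /* clock with irq time left out */
};

/* Advance the clocks; irq_delta is the irq share of delta. */
static void update_clocks(struct rq_sketch *rq, uint64_t delta,
			  uint64_t irq_delta)
{
	assert(irq_delta <= delta);
	rq->clock += delta;
	rq->irq_time += irq_delta;
	rq->clock_task += delta - irq_delta;
}

/* Runtime charged to the current task since exec_start. */
static uint64_t delta_exec(const struct rq_sketch *rq, uint64_t exec_start)
{
	return rq->clock_task - exec_start;
}
```

      With exec_start taken from clock_task before a 1000 ns interval that
      contains 750 ns of irq work, delta_exec() yields 250 ns.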
      
      Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We
      can extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort, but
      that's for later.
      
      This change will impact scheduling behavior in interrupt heavy conditions.
      
      Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
      task (nc) running on CPU 3 (and no RSS/RFS). With that, CPU 2 spends
      75%+ of its time in irq processing and CPU 3 spends around 35% of its
      time running the nc task.
      
      Now, if I run another CPU intensive task on CPU 2, without this change
      /proc/<pid>/schedstat shows 100% of time accounted to this task. With this
      change, it rightly shows less than 25% accounted to this task as remaining
      time is actually spent on irq processing.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Unindent labels · 49246274
      Peter Zijlstra committed
      Labels should be on column 0.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  3. 21 Sep, 2010 2 commits
    • sched: Give CPU bound RT tasks preference · b3bc211c
      Steven Rostedt committed
      If a high priority task is waking up on a CPU that is running a
      lower priority task that is bound to a CPU, see if we can move the
      high priority RT task to another CPU first. Note that if all other
      CPUs are running higher priority tasks than the CPU-bound current
      task, then it will be preempted regardless.
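
      The decision described above can be sketched roughly as follows.
      This is an illustration built only from the change log text, not
      the kernel's select_task_rq_rt(); the function name, the array of
      per-CPU priorities, and the convention that a larger number means
      a higher priority are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: return the CPU the waking RT task should run on.
 * curr_prio[cpu] is the priority of the task currently running on cpu.
 * Assumption for this sketch: larger number == higher priority.
 */
static int pick_cpu_for_waking_rt(int this_cpu, int wake_prio,
				  bool curr_is_pinned,
				  const int *curr_prio, int nr_cpus)
{
	int best = this_cpu;

	assert(this_cpu >= 0 && this_cpu < nr_cpus);

	/* Only the case from the change log is modelled here. */
	if (!curr_is_pinned || curr_prio[this_cpu] >= wake_prio)
		return this_cpu;

	/* Look for a CPU running something less important than the pinned task. */
	for (int cpu = 0; cpu < nr_cpus; cpu++)
		if (curr_prio[cpu] < curr_prio[best])
			best = cpu;

	/* best == this_cpu means every other CPU is busier: preempt here. */
	return best;
}
```

      Returning this_cpu covers the "preempted regardless" case, since no
      other CPU is running a task of lower priority than the pinned one.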
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Gregory Haskins <ghaskins@novell.com>
      LKML-Reference: <20100921024138.888922071@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Try not to migrate higher priority RT tasks · 43fa5460
      Steven Rostedt committed
      When first working on the RT scheduler design, we concentrated on
      keeping all CPUs running RT tasks instead of having multiple RT
      tasks on a single CPU waiting for the migration thread to move
      them. Instead we take a more proactive stance and push or pull RT
      tasks from one CPU to another on wakeup or scheduling.
      
      When an RT task wakes up on a CPU that is running another RT task,
      instead of preempting it and killing the cache of the running RT
      task, we look to see if we can migrate the RT task that is waking
      up, even if the RT task waking up is of higher priority.
      
      This may sound a bit odd, but RT tasks should be limited in
      migration by the user anyway. But in practice, people do not do
      this, which causes high prio RT tasks to bounce around the CPUs.
      This becomes even worse when we have priority inheritance, because
      a high prio task can block on a lower prio task and boost its
      priority. When the lower prio task wakes up the high prio task, if
      it happens to be on the same CPU it will migrate off of it.
      
      But in reality, the above does not happen much either: when the
      lower prio task, which has already been boosted, wakes up on the
      same CPU as the higher prio task, it is the boosted task that
      migrates off of it. But anyway, we do not want to migrate them
      either.
      
      To examine the scheduling, I created a test program and examined it
      under kernelshark. The test program created CPU * 2 threads, where
      each thread had a different priority. The program takes different
      options. The option used in this change log selected whether to
      use priority inheritance mutexes or not.
      
      All threads did the following loop:
      
      static void grab_lock(long id, int iter, int l)
      {
      	ftrace_write("thread %ld iter %d, taking lock %d\n",
      		     id, iter, l);
      	pthread_mutex_lock(&locks[l]);
      	ftrace_write("thread %ld iter %d, took lock %d\n",
      		     id, iter, l);
      	busy_loop(nr_tasks - id);
      	ftrace_write("thread %ld iter %d, unlock lock %d\n",
      		     id, iter, l);
      	pthread_mutex_unlock(&locks[l]);
      }
      
      void *start_task(void *id)
      {
      	[...]
      	while (!done) {
      		for (l = 0; l < nr_locks; l++) {
      			grab_lock(id, i, l);
      			ftrace_write("thread %ld iter %d sleeping\n",
      				     id, i);
      			ms_sleep(id);
      		}
      		i++;
      	}
      	[...]
      }
      
      The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The
      ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes
      to the ftrace buffer to help analyze via ftrace.
      
      The higher the id, the higher the prio, the shorter it does the
      busy loop, but the longer it sleeps. This is usually the case with
      RT tasks: the lower priority tasks usually run longer than higher
      priority tasks.
      
      At the end of the test, it records the number of loops each thread
      took, as well as the number of voluntary preemptions, non-voluntary
      preemptions, and number of migrations each thread took, taking the
      information from /proc/$$/sched and /proc/$$/status.
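
      A sketch of how the two preemption counters could be collected,
      assuming they come from the voluntary_ctxt_switches and
      nonvoluntary_ctxt_switches lines of /proc/<pid>/status. The struct
      and parser names are hypothetical; the test program's own code is
      not part of this change log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct ctxt_counts {
	long voluntary;
	long nonvoluntary;
};

/*
 * Scan the text of a /proc/<pid>/status file line by line and pick out
 * the two context-switch counters. Returns how many were found (0..2).
 */
static int parse_ctxt_switches(const char *text, struct ctxt_counts *out)
{
	int found = 0;
	const char *line = text;

	while (line && *line) {
		/* Each format only matches when the line starts with that key. */
		if (sscanf(line, "voluntary_ctxt_switches: %ld",
			   &out->voluntary) == 1)
			found++;
		else if (sscanf(line, "nonvoluntary_ctxt_switches: %ld",
				&out->nonvoluntary) == 1)
			found++;

		line = strchr(line, '\n');
		if (line)
			line++; /* step past the newline */
	}
	return found;
}
```

      In practice the text would be read from the status file of each
      thread once the loop has finished.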
      
      Running this on a 4 CPU processor, the results without changes to
      the kernel looked like this:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         53      3220       1470             98
        1:        562       773        724             98
        2:        752       933       1375             98
        3:        749        39        697             98
        4:        758         5        515             98
        5:        764         2        679             99
        6:        761         2        535             99
        7:        757         3        346             99
      
      total:     5156       4977      6341            787
      
      Each thread, regardless of priority, migrated a few hundred times.
      The higher priority tasks were a little better but still took
      quite an impact.
      
      By letting higher priority tasks bump the lower prio task from the
      CPU, things changed a bit:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         37      2835       1937             98
        1:        666      1821       1865             98
        2:        654      1003       1385             98
        3:        664       635        973             99
        4:        698       197        352             99
        5:        703       101        159             99
        6:        708         1         75             99
        7:        713         1          2             99
      
      total:     4843       6594      6748            789
      
      The total # of migrations did not change (several runs showed the
      difference all within the noise). But we now see a dramatic
      improvement to the higher priority tasks. (kernelshark showed that
      the watchdog timer bumped the highest priority task to give it the
      2 count. This was actually consistent with every run).
      
      Notice that the # of iterations did not change either.
      
      The above was with priority inheritance mutexes. That is, when the
      higher priority task blocked on a lower priority task, the lower
      priority task would inherit the priority of the higher priority
      task (which shows why task 6 was bumped so many times). When not
      using priority inheritance mutexes, the current kernel shows this:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         56      3101       1892             95
        1:        594       713        937             95
        2:        625       188        618             95
        3:        628         4        491             96
        4:        640         7        468             96
        5:        631         2        501             96
        6:        641         1        466             96
        7:        643         2        497             96
      
      total:     4458       4018      5870            765
      
      Not much changed with or without priority inheritance mutexes. But
      if we let the high priority task bump lower priority tasks on
      wakeup we see:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:        115      3439       2782             98
        1:        633      1354       1583             99
        2:        652       919       1218             99
        3:        645       713        934             99
        4:        690         3          3             99
        5:        694         1          4             99
        6:        720         3          4             99
        7:        747         0          1            100
      
      Which shows an even bigger change. The big difference between task 3
      and task 4 is because we have only 4 CPUs on the machine, causing
      the 4 highest prio tasks to always have preference.
      
      Although I did not measure cache misses, and I'm sure there would
      be little to measure since the test was not data intensive, I could
      imagine large improvements for higher priority tasks when dealing
      with lower priority tasks. Thus, I'm satisfied with making the
      change and agreeing with what Gregory Haskins argued a few years
      ago when we first had this discussion.
      
      One final note. All tasks in the above tests were RT tasks. Any RT
      task will always preempt a non RT task that is running on the CPU
      the RT task wants to run on.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Gregory Haskins <ghaskins@novell.com>
      LKML-Reference: <20100921024138.605460343@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 18 Jun, 2010 1 commit
  5. 03 Apr, 2010 2 commits
    • sched: Add enqueue/dequeue flags · 371fd7e7
      Peter Zijlstra committed
      In order to reduce the dependency on TASK_WAKING, rework the enqueue
      interface to support a proper flags field.
      
      Replace the int wakeup, bool head arguments with an int flags argument
      and create the following flags:
      
        ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
        ENQUEUE_WAKING - the enqueue has relative vruntime due to
                         having sched_class::task_waking() called,
        ENQUEUE_HEAD - the waking task should be placed on the head
                       of the priority queue (where appropriate).
      
      For symmetry also convert sched_class::dequeue() to a flags scheme.
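
      As an illustration of the interface change, the old pair of
      arguments can be folded into a single flags word. The flag names
      are taken from the list above; the bit values and both helper
      functions are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names from the change log; bit values assumed for illustration. */
#define ENQUEUE_WAKEUP 0x1
#define ENQUEUE_WAKING 0x2
#define ENQUEUE_HEAD   0x4

/* Hypothetical helper: map the old (wakeup, head) arguments to flags. */
static int enqueue_flags_from_args(int wakeup, bool head)
{
	int flags = 0;

	if (wakeup)
		flags |= ENQUEUE_WAKEUP;
	if (head)
		flags |= ENQUEUE_HEAD;
	return flags;
}

/* Hypothetical consumer: should the task go to the head of its queue? */
static bool enqueue_at_head(int flags)
{
	return (flags & ENQUEUE_HEAD) != 0;
}
```

      A single int leaves room for further conditions such as
      ENQUEUE_WAKING without changing the function signature again.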
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Fix TASK_WAKING vs fork deadlock · 0017d735
      Peter Zijlstra committed
      Oleg noticed a few races with the TASK_WAKING usage on fork.
      
       - since TASK_WAKING is basically a spinlock, it should be IRQ safe
       - since we set TASK_WAKING (*) without holding rq->lock, there could
         still be a rq->lock holder, thereby not actually providing full
         serialization.
      
      (*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.
      
      Cure the second issue by not setting TASK_WAKING in sched_fork(), but
      only temporarily in wake_up_new_task() while calling select_task_rq().
      
      Cure the first by holding rq->lock around the select_task_rq() call,
      which disables IRQs. This, however, requires that we push down the
      rq->lock release into select_task_rq_fair()'s cgroup stuff.
      
      Because select_task_rq_fair() still needs to drop the rq->lock we
      cannot fully get rid of TASK_WAKING.
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  6. 11 Mar, 2010 2 commits
  7. 07 Mar, 2010 1 commit
  8. 04 Feb, 2010 1 commit
  9. 23 Jan, 2010 2 commits
  10. 21 Jan, 2010 1 commit
  11. 17 Jan, 2010 1 commit
  12. 17 Dec, 2009 1 commit
  13. 15 Dec, 2009 2 commits
  14. 09 Dec, 2009 1 commit
  15. 04 Nov, 2009 1 commit
  16. 21 Sep, 2009 1 commit
  17. 15 Sep, 2009 3 commits
  18. 04 Sep, 2009 1 commit
  19. 02 Aug, 2009 4 commits
    • sched: Fix cpupri build on !CONFIG_SMP · bcf08df3
      Ingo Molnar committed
      This build bug:
      
       In file included from kernel/sched.c:1765:
       kernel/sched_rt.c: In function ‘has_pushable_tasks’:
       kernel/sched_rt.c:1069: error: ‘struct rt_rq’ has no member named ‘pushable_tasks’
       kernel/sched_rt.c: In function ‘pick_next_task_rt’:
       kernel/sched_rt.c:1084: error: ‘struct rq’ has no member named ‘post_schedule’
      
      Triggers because both pushable_tasks and post_schedule are
      SMP-only fields.
      
      Move has_pushable_tasks() to the SMP section and #ifdef the post_schedule use.
      
      Cc: Gregory Haskins <ghaskins@novell.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090729150422.17691.55590.stgit@dev.haskins.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Add debug check to task_of() · 8f48894f
      Peter Zijlstra committed
      A frequent mistake appears to be to call task_of() on a
      scheduler entity that is not actually a task, which can result
      in a wild pointer.
      
      Add a check to catch these mistakes.
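
      A user-space sketch of such a check, under stated assumptions: the
      struct layouts and the is_task marker are hypothetical, and the
      pointer arithmetic stands in for the kernel's container_of(). The
      assertion fires before an entity that is not embedded in a task is
      turned into a bogus task pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scheduler entity; is_task marks entities embedded in a task. */
struct sched_entity_sketch {
	int is_task;
};

struct task_sketch {
	int pid;
	struct sched_entity_sketch se;
};

/* Recover the enclosing task, refusing entities that are not tasks. */
static struct task_sketch *task_of_sketch(struct sched_entity_sketch *se)
{
	assert(se->is_task); /* the debug check */
	return (struct task_sketch *)((char *)se -
				      offsetof(struct task_sketch, se));
}
```

      Without the assertion, a group entity passed in by mistake would
      yield a pointer into unrelated memory.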
      Suggested-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Fully integrate cpus_active_map and root-domain code · 00aec93d
      Gregory Haskins committed
      Reflect "active" cpus in the rq->rd->online field, instead of
      the online_map.
      
      The motivation is that things that use the root-domain code
      (such as cpupri) only care about cpus classified as "active"
      anyway. By synchronizing the root-domain state with the active
      map, we allow several optimizations.
      
      For instance, we can remove an extra cpumask_and from the
      scheduler hotpath by utilizing rq->rd->online (since it is now
      a cached version of cpu_active_map & rq->rd->span).
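
      The caching idea can be sketched with plain 64-bit masks standing
      in for cpumasks. The struct and function names are hypothetical;
      the point is only that the active/span intersection is computed
      when the active set changes, not on every hotpath lookup.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical root domain with one bit per CPU. */
struct root_domain_sketch {
	uint64_t span;   /* CPUs that belong to this root domain */
	uint64_t online; /* cached: cpu_active & span */
};

/* Slow path: refresh the cache whenever the active set changes. */
static void rd_sync_online(struct root_domain_sketch *rd, uint64_t cpu_active)
{
	rd->online = cpu_active & rd->span;
}

/* Hot path: a single AND against the cached mask. */
static uint64_t rd_candidate_cpus(const struct root_domain_sketch *rd,
				  uint64_t cpus_allowed)
{
	return cpus_allowed & rd->online;
}
```

      Before the change, the hot path would have needed a second AND
      against the active map on every call.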
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Max Krasnyansky <maxk@qualcomm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090730145723.25226.24493.stgit@dev.haskins.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Enhance the pre/post scheduling logic · 3f029d3c
      Gregory Haskins committed
      We currently have an explicit "needs_post" vtable method which
      returns a stack variable for whether we should later run
      post-schedule.  This leads to an awkward exchange of the
      variable as it bubbles back up out of the context switch. Peter
      Zijlstra observed that this information could be stored in the
      run-queue itself instead of handled on the stack.
      
      Therefore, we revert to the method of having context_switch
      return void, and update an internal rq->post_schedule variable
      when we require further processing.
      
      In addition, we fix a race condition where we try to access
      current->sched_class without holding the rq->lock.  This is
      technically racy, as the sched-class could change out from
      under us.  Instead, we reference the per-rq post_schedule
      variable with the runqueue unlocked, but with preemption
      disabled to see if we need to reacquire the rq->lock.
      
      Finally, we clean the code up slightly by removing the #ifdef
      CONFIG_SMP conditionals from the schedule() call, and implement
      some inline helper functions instead.
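
      A single-threaded model of the flag-in-the-runqueue scheme, offered
      as a sketch only: the struct, its fields other than post_schedule,
      and both helpers are hypothetical, a bool stands in for rq->lock,
      and preemption disabling is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical run-queue model. */
struct rq_sketch {
	bool locked;        /* stands in for rq->lock */
	bool post_schedule; /* set while the next task is being picked */
	int  pushes;        /* counts deferred post-schedule runs */
};

/* Called with the "lock" held while choosing the next task. */
static void note_post_schedule(struct rq_sketch *rq, bool has_pushable_tasks)
{
	assert(rq->locked);
	rq->post_schedule = has_pushable_tasks;
}

/* Called after the context switch, with the "lock" already dropped. */
static void post_schedule(struct rq_sketch *rq)
{
	if (!rq->post_schedule)
		return; /* cheap unlocked check, nothing to do */

	rq->locked = true; /* reacquire only when work is pending */
	rq->pushes++;
	rq->post_schedule = false;
	rq->locked = false;
}
```

      Nothing has to be returned from the context switch; the flag
      travels in the run-queue itself.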
      
      This patch passes checkpatch, and rt-migrate.
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090729150422.17691.55590.stgit@dev.haskins.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  20. 10 Jul, 2009 1 commit
  21. 09 Jun, 2009 1 commit
  22. 01 Apr, 2009 1 commit
  23. 01 Feb, 2009 1 commit
  24. 16 Jan, 2009 1 commit
    • sched: make plist a library facility · ceacc2c1
      Peter Zijlstra committed
      Ingo Molnar wrote:
      
      > here's a new build failure with tip/sched/rt:
      >
      >   LD      .tmp_vmlinux1
      > kernel/built-in.o: In function `set_curr_task_rt':
      > sched.c:(.text+0x3675): undefined reference to `plist_del'
      > kernel/built-in.o: In function `pick_next_task_rt':
      > sched.c:(.text+0x37ce): undefined reference to `plist_del'
      > kernel/built-in.o: In function `enqueue_pushable_task':
      > sched.c:(.text+0x381c): undefined reference to `plist_del'
      
      Eliminate the plist library kconfig and make it available
      unconditionally.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  25. 14 Jan, 2009 2 commits
  26. 12 Jan, 2009 1 commit
  27. 04 Jan, 2009 1 commit
    • sched: put back some stack hog changes that were undone in kernel/sched.c · 6ca09dfc
      Mike Travis committed
      Impact: prevents panic from stack overflow on numa-capable machines.
      
      Some of the "removal of stack hogs" changes in kernel/sched.c that used
      node_to_cpumask_ptr were undone by the early cpumask API updates, which
      causes a panic due to stack overflow.  This patch puts those changes back
      by using cpumask_of_node(), which returns a 'const struct cpumask *'.

      In addition, cpu_coregroup_map is replaced with cpu_coregroup_mask, further
      reducing stack usage.  (Both of these updates removed 9 FIXME's!)
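
      Why a pointer return helps can be sketched in user space. The mask
      type, the 4096-CPU size, and both helpers below are assumptions for
      illustration: a by-value mask of that size costs 512 bytes of stack
      per local copy, while a 'const' pointer to a shared mask costs only
      the size of a pointer.

```c
#include <assert.h>
#include <limits.h>

#define NR_CPUS_SKETCH       4096
#define BITS_PER_LONG_SKETCH (CHAR_BIT * sizeof(unsigned long))

/* Hypothetical CPU mask: one bit per possible CPU. */
struct cpumask_sketch {
	unsigned long bits[NR_CPUS_SKETCH / BITS_PER_LONG_SKETCH];
};

/* Shared, statically allocated per-node masks (two nodes assumed). */
static struct cpumask_sketch node_masks[2];

/* Return a pointer to the shared mask; nothing is copied onto the stack. */
static const struct cpumask_sketch *cpumask_of_node_sketch(int node)
{
	return &node_masks[node];
}

/* Test one CPU bit through the pointer. */
static int cpu_in_node(int cpu, int node)
{
	const struct cpumask_sketch *m = cpumask_of_node_sketch(node);

	return (int)((m->bits[cpu / BITS_PER_LONG_SKETCH] >>
		      (cpu % BITS_PER_LONG_SKETCH)) & 1UL);
}
```

      Callers that only read the mask never need a local copy, which is
      what keeps deep call chains from overflowing the stack.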
      
      Also:
         Pick up some remaining changes from the old 'cpumask_t' functions to
         the new 'struct cpumask *' functions.
      
         Optimize memory traffic by allocating each percpu local_cpu_mask on the
         same node as the referring cpu.
      Signed-off-by: Mike Travis <travis@sgi.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  28. 29 Dec, 2008 1 commit
    • RT: fix push_rt_task() to handle dequeue_pushable properly · 1563513d
      Gregory Haskins committed
      A panic was discovered by Chirag Jog where a BUG_ON sanity check
      in the new "pushable_task" logic would trigger a panic under
      certain circumstances:
      
      http://lkml.org/lkml/2008/9/25/189
      
      Gilles Carry discovered that the root cause was the pushable_tasks
      list getting corrupted in the push_rt_task logic. This was the result
      of a dropped rq lock in double_lock_balance allowing a task in the
      process of being pushed to potentially migrate away, and thus corrupt
      the pushable_tasks list.
      
      I traced the problem back to the pushable_tasks patch that went in
      recently.  There is a "retry" path in push_rt_task()
      that actually had a compound conditional to decide whether to
      retry or exit.  I missed the meaning behind the rationale for the
      virtual "if(!task) goto out;" portion of the compound statement and
      thus did not handle it properly.  The new pushable_tasks logic
      actually creates three distinct conditions:
      
      1) an untouched and unpushable task should be dequeued
      2) a migrated task where more pushable tasks remain should be retried
      3) a migrated task where no more pushable tasks exist should exit
      
      The original logic mushed (1) and (3) together, resulting in the
      system dequeuing a migrated task (against an unlocked foreign run-queue,
      no less).
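
      The three conditions can be written as a small decision function.
      This is a sketch of the classification only, with hypothetical
      names; it is not the retry path of push_rt_task() itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes matching conditions (1), (2) and (3) above. */
enum push_action {
	PUSH_DEQUEUE, /* (1) untouched and unpushable: dequeue it */
	PUSH_RETRY,   /* (2) migrated, more pushable tasks remain */
	PUSH_EXIT     /* (3) migrated, nothing left to push */
};

/* Decide what to do with a task that could not be pushed. */
static enum push_action classify_push_failure(bool task_migrated,
					      bool more_pushable)
{
	if (!task_migrated)
		return PUSH_DEQUEUE;

	/* The task now sits on a foreign run-queue: never dequeue it here. */
	return more_pushable ? PUSH_RETRY : PUSH_EXIT;
}
```

      Keeping (1) and (3) as separate outcomes is what prevents a
      migrated task from being dequeued.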
      
      To fix this, we get rid of the notion of "paranoid" and we support the
      three unique conditions properly.  The paranoid feature is no longer
      relevant with the new pushable logic (since pushable naturally limits
      the loop) anyway, so let's just remove it.
      Reported-By: Chirag Jog <chirag@linux.vnet.ibm.com>
      Found-by: Gilles Carry <gilles.carry@bull.net>
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>