1. 22 October 2008 (1 commit)
  2. 04 October 2008 (1 commit)
    • sched_rt.c: resch needed in rt_rq_enqueue() for the root rt_rq · f6121f4f
      Committed by Dario Faggioli
      While working on the new version of the code for SCHED_SPORADIC I
      noticed something strange in the present throttling mechanism. More
      specifically in the throttling timer handler in sched_rt.c
      (do_sched_rt_period_timer()) and in rt_rq_enqueue().
      
      The problem is that, when unthrottling a runqueue, rt_rq_enqueue() only
      asks for rescheduling if the runqueue has a sched_entity associated to
      it (i.e., rt_rq->rt_se != NULL).
      Now, if the runqueue is the root rq (which has a rt_se = NULL)
      rescheduling does not take place, and it is delayed to some undefined
      instant in the future.
      
      This implies unpredictable bandwidth usage by the RT tasks under throttling.
      For instance, setting rt_runtime_us/rt_period_us = 950ms/1000ms, an RT
      task will get less than 95%. In our tests we saw utilisation varying
      between 70% and 95%.
      Using smaller time values, e.g. 95ms/100ms, things are even worse, and
      I can see values going down to 20-25%!!
      
      The tests we performed are simply running 'yes' as a SCHED_FIFO task,
      and checking the CPU usage with top, but we can investigate thoroughly
      if you think it is needed.
      
      Things go much better, for us, with the attached patch... Don't know if
      it is the best approach, but it solved the issue for us.
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
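      A minimal user-space sketch of the idea above (simplified stand-in types
      and stub helpers; not the in-tree patch): on unthrottle, the resched
      request must not depend on rt_rq->rt_se being non-NULL, so the root
      rt_rq is covered too.

         #include <stdio.h>

         /* Simplified stand-ins for the kernel structures involved. */
         struct sched_rt_entity { int on_rq; };
         struct rt_rq {
             struct sched_rt_entity *rt_se;   /* NULL for the root rt_rq */
             int rt_nr_running;
         };

         static void resched_curr_stub(void) { printf("resched requested\n"); }
         static void enqueue_rt_entity_stub(struct sched_rt_entity *se) { se->on_rq = 1; }

         /* On unthrottle, request a reschedule unconditionally instead of
          * only when rt_se != NULL, so the root rt_rq is handled as well. */
         static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
         {
             struct sched_rt_entity *rt_se = rt_rq->rt_se;

             if (rt_rq->rt_nr_running) {
                 if (rt_se && !rt_se->on_rq)
                     enqueue_rt_entity_stub(rt_se);
                 resched_curr_stub();   /* previously skipped when rt_se == NULL */
             }
         }

         int main(void)
         {
             struct rt_rq root = { NULL, 1 };   /* root rq: rt_se == NULL */
             sched_rt_rq_enqueue(&root);        /* a reschedule is now requested */
             return 0;
         }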
  3. 23 September 2008 (1 commit)
  4. 22 September 2008 (1 commit)
  5. 11 September 2008 (1 commit)
    • sched: fix 2.6.27-rc5 couldn't boot on tulsa machine randomly · baf25731
      Committed by Zhang, Yanmin
      On my tulsa x86-64 machine, kernel 2.6.27-rc5 randomly failed to boot.
      
      Basically, __enable_runtime() forgets to reset rt_rq->rt_throttled
      to 0. While the cpus are being brought up, each per-cpu migration_thread
      is created and runs very fast, sometimes marking the corresponding
      rt_rq->rt_throttled as 1 very quickly. After all cpus are up, the calling
      chain below runs:

         sched_init_smp => arch_init_sched_domains => build_sched_domains => ...
      => cpu_attach_domain => rq_attach_root => set_rq_online => ...
      => __enable_runtime

      __enable_runtime() is called against every rt_rq again, so rt_rq->rt_time is
      reset to 0, but rt_rq->rt_throttled might still be 1. Later on,
      do_sched_rt_period_timer() cannot reset it, and no RT task can be scheduled
      to run on that cpu. The RT task in question here is migration_thread, which
      is woken up when a task is migrated to another cpu.
      
      Below patch fixes it against 2.6.27-rc5.
      Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
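      A hedged sketch of the fix described above, with simplified stand-in types
      (not the in-tree code): when runtime is re-enabled for an rt_rq, clear
      rt_throttled along with rt_time, so a stale throttled flag cannot survive
      the reset.

         #include <stdio.h>

         /* Simplified model of the per-rq RT bandwidth state. */
         struct rt_rq {
             unsigned long long rt_time;     /* runtime consumed this period */
             unsigned long long rt_runtime;  /* allowed runtime per period */
             int rt_throttled;
         };

         /* Sketch of __enable_runtime(): restore the runtime budget, clear the
          * consumed time, and also clear the throttled flag that the original
          * code forgot about. */
         static void enable_runtime(struct rt_rq *rt_rq, unsigned long long def_runtime)
         {
             rt_rq->rt_runtime   = def_runtime;
             rt_rq->rt_time      = 0;
             rt_rq->rt_throttled = 0;   /* the missing reset from the bug report */
         }

         int main(void)
         {
             struct rt_rq rq = { 123, 0, 1 };   /* throttled by migration_thread */
             enable_runtime(&rq, 950000);
             printf("throttled=%d\n", rq.rt_throttled);   /* prints 0 */
             return 0;
         }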
  6. 28 August 2008 (2 commits)
    • sched: rt-bandwidth accounting fix · cc2991cf
      Committed by Peter Zijlstra
      This fixes an accounting bug where we would continue accumulating runtime
      even though bandwidth control was disabled, which could lead to very long
      throttle periods once bandwidth control gets turned on again.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
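      One way to read the fix (a simplified sketch with assumed names, not the
      in-tree code): the per-tick RT accounting bails out while bandwidth control
      is disabled, instead of letting rt_time grow without bound.

         #include <stdio.h>

         #define RUNTIME_INF (~0ULL)   /* "no bandwidth control" */

         struct rt_rq {
             unsigned long long rt_time;
             unsigned long long rt_runtime;
             int rt_throttled;
         };

         /* Skip accumulation entirely while bandwidth control is disabled, so
          * no huge rt_time balance is left behind when it is re-enabled. */
         static void account_rt_time(struct rt_rq *rt_rq, unsigned long long delta_exec)
         {
             if (rt_rq->rt_runtime == RUNTIME_INF)
                 return;

             rt_rq->rt_time += delta_exec;
             if (rt_rq->rt_time > rt_rq->rt_runtime)
                 rt_rq->rt_throttled = 1;
         }

         int main(void)
         {
             struct rt_rq rq = { 0, RUNTIME_INF, 0 };
             account_rt_time(&rq, 1000000);
             printf("rt_time=%llu throttled=%d\n", rq.rt_time, rq.rt_throttled);
             return 0;
         }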
    • sched: fix sched_rt_rq_enqueue() resched idle · f3ade837
      Committed by John Blackwood
      When sysctl_sched_rt_runtime is set to something other than -1 and the
      CONFIG_RT_GROUP_SCHED kernel option is NOT enabled, we get into a state
      where one or more CPUs idle forever even though there are real-time
      tasks in their rt runqueue that are able to run (no longer throttled).
      
      The sequence is:
      
      - A real-time task is running when the timer sets the rt runqueue
          to throttled, and the rt task is resched_task()ed and switched
          out, and idle is switched in since there are no non-rt tasks to
          run on that cpu.
      
      - Eventually the do_sched_rt_period_timer() runs and un-throttles
          the rt runqueue, but we just exit the timer interrupt and go back
          to executing the idle task in the idle loop forever.
      
      If we change the sched_rt_rq_enqueue() routine to use some of the code
      from the CONFIG_RT_GROUP_SCHED-enabled version of this same routine, and
      resched_task() the currently executing task (idle in our case) if it is
      a lower priority task than the highest rt task in the now un-throttled
      runqueue, the problem is no longer observed.
      Signed-off-by: John Blackwood <john.blackwood@ccur.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
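      A small sketch of the decision described above (simplified stand-in types;
      priority values follow the kernel convention that lower numbers are more
      important): on un-throttle, preempt the running task if a higher-priority
      RT task is now runnable -- which covers the idle task.

         #include <stdio.h>

         #define MAX_PRIO    140   /* idle sits at the very bottom */
         #define MAX_RT_PRIO 100   /* RT priorities occupy 0..99 */

         struct task { const char *name; int prio; };
         struct rq {
             struct task *curr;
             int highest_rt_prio;   /* best queued RT prio, MAX_RT_PRIO if none */
         };

         static void resched_task_stub(struct task *t) { printf("resched %s\n", t->name); }

         /* Un-throttle path for the !CONFIG_RT_GROUP_SCHED case: if the running
          * task (idle here) is lower priority than the best queued RT task,
          * force a reschedule instead of returning to the idle loop. */
         static void rt_rq_unthrottle(struct rq *rq)
         {
             if (rq->highest_rt_prio < rq->curr->prio)
                 resched_task_stub(rq->curr);
         }

         int main(void)
         {
             struct task idle = { "idle", MAX_PRIO };
             struct rq rq = { &idle, 50 };   /* an RT task at prio 50 is runnable */
             rt_rq_unthrottle(&rq);          /* idle gets preempted */
             return 0;
         }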
  7. 19 August 2008 (2 commits)
  8. 14 August 2008 (1 commit)
  9. 11 August 2008 (1 commit)
    • lockdep: re-annotate scheduler runqueues · 1b12bbc7
      Committed by Peter Zijlstra
      Instead of using a per-rq lock class, use the regular nesting operations.
      
      However, take extra care with double_lock_balance() as it can release the
      already held rq->lock (and therefore change its nesting class).
      
      So what can happen is:
      
       spin_lock(rq->lock);	// this rq subclass 0
      
       double_lock_balance(rq, other_rq);
         // release rq
         // acquire other_rq->lock subclass 0
         // acquire rq->lock subclass 1
      
       spin_unlock(other_rq->lock);
      
      leaving you with rq->lock in subclass 1
      
      So a subsequent double_lock_balance() call can try to nest a subclass 1
      lock while already holding a subclass 1 lock.
      
      Fix this by introducing double_unlock_balance(), which releases the other
      rq's lock but also resets the subclass of this rq's lock to 0.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
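      A user-space model of the locking pattern above (pthread mutexes standing
      in for rq->lock; the lockdep subclass handling itself is kernel-only and is
      only noted in a comment):

         #include <pthread.h>
         #include <stdio.h>

         struct rq { pthread_mutex_t lock; };

         /* Take both runqueue locks in a fixed (address) order. If we hold the
          * "wrong" one, drop it and re-take both, which is exactly the point
          * where the lock's nesting level can change. */
         static void double_lock_balance(struct rq *this_rq, struct rq *busiest)
         {
             if (this_rq < busiest) {
                 pthread_mutex_lock(&busiest->lock);
             } else {
                 pthread_mutex_unlock(&this_rq->lock);
                 pthread_mutex_lock(&busiest->lock);
                 pthread_mutex_lock(&this_rq->lock);   /* now nested inside busiest */
             }
         }

         /* Pairing helper: drop the other rq's lock; in the kernel this is also
          * where lock_set_subclass() re-annotates this_rq->lock as subclass 0. */
         static void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
         {
             (void)this_rq;
             pthread_mutex_unlock(&busiest->lock);
         }

         int main(void)
         {
             struct rq a = { PTHREAD_MUTEX_INITIALIZER };
             struct rq b = { PTHREAD_MUTEX_INITIALIZER };

             pthread_mutex_lock(&b.lock);    /* this_rq = b */
             double_lock_balance(&b, &a);
             double_unlock_balance(&b, &a);
             pthread_mutex_unlock(&b.lock);
             printf("balanced\n");
             return 0;
         }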
  10. 24 July 2008 (1 commit)
  11. 18 July 2008 (3 commits)
  12. 27 June 2008 (3 commits)
  13. 20 June 2008 (5 commits)
  14. 19 June 2008 (2 commits)
  15. 18 June 2008 (1 commit)
    • sched: rework of "prioritize non-migratable tasks over migratable ones" · 20b6331b
      Committed by Dmitry Adamushko
      regarding this commit: 45c01e82
      
      I think we can do it simpler. Please take a look at the patch below.
      
      Instead of having 2 separate arrays (which adds ~800 bytes on x86_32 and
      twice that on x86_64), let's add "exclusive" tasks (the ones that are bound
      to this CPU) to the head of the queue and "shared" ones to the end.
      
      If a few "exclusive" tasks are newly woken up, they are 'stacked'
      (not queued as before), meaning that task {i+1} is placed in front
      of the previously woken up task {i}. But I don't think that this
      behavior can cause any realistic problems.
      
      There are a couple of changes on top of this one.
      
      (1) in check_preempt_curr_rt()
      
      I don't think there is a need for the "pick_next_rt_entity(rq, &rq->rt)
      != &rq->curr->rt" check.
      
      enqueue_task_rt(p) and check_preempt_curr_rt() are always called one
      after another with rq->lock held, so the following check
      "p->rt.nr_cpus_allowed == 1 && rq->curr->rt.nr_cpus_allowed != 1" should
      be enough (well, just its left part) to guarantee that 'p' has been
      queued in front of 'curr'.
      
      (2) in set_cpus_allowed_rt()
      
      I don't think there is a need for requeue_task_rt() here.
      
      Perhaps the only case when 'requeue' (+ reschedule) might be useful is
      as follows:

      i) weight == 1 && cpu_isset(task_cpu(p), *new_mask)

      (i.e. a task is being bound to this CPU);

      ii) 'p' != rq->curr

      But here 'p' has already been on this CPU for a while and was not
      migrated, i.e. it's possible that 'rq->curr' would not have a high chance
      of being migrated right at this particular moment (although it might in
      the somewhat longer term), should we allow it to be preempted.
      
      Anyway, I think we should not make this more complex by trying to
      address some rare corner cases; that is also why a single-queue
      approach is preferable. Unless I'm missing something obvious, this
      approach gives us similar functionality at lower cost.
      
      Verified only compilation-wise.
      
      (Almost)-Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
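      A compact sketch of the single-queue policy described above (a minimal
      hand-rolled list stands in for the kernel's list_head; not the in-tree
      code): CPU-bound tasks go to the head of the per-priority list, migratable
      ones to the tail.

         #include <stdio.h>
         #include <stddef.h>

         /* Minimal doubly linked list, standing in for the kernel's list_head. */
         struct list_head { struct list_head *prev, *next; };

         static void list_init(struct list_head *h) { h->prev = h->next = h; }
         static void list_add(struct list_head *n, struct list_head *h)        /* at head */
         { n->next = h->next; n->prev = h; h->next->prev = n; h->next = n; }
         static void list_add_tail(struct list_head *n, struct list_head *h)   /* at tail */
         { n->prev = h->prev; n->next = h; h->prev->next = n; h->prev = n; }

         struct task { const char *name; int nr_cpus_allowed; struct list_head run_list; };

         /* "Exclusive" (CPU-bound) tasks are stacked at the head, migratable
          * ones are queued at the tail of the same per-priority list. */
         static void enqueue_rt(struct list_head *queue, struct task *t)
         {
             if (t->nr_cpus_allowed == 1)
                 list_add(&t->run_list, queue);        /* head: runs before migratable tasks */
             else
                 list_add_tail(&t->run_list, queue);   /* tail: normal FIFO order */
         }

         int main(void)
         {
             struct list_head queue;
             struct task a = { "migratable", 4 }, b = { "bound", 1 };

             list_init(&queue);
             enqueue_rt(&queue, &a);
             enqueue_rt(&queue, &b);
             /* the bound task is now first in line */
             struct task *first = (struct task *)((char *)queue.next - offsetof(struct task, run_list));
             printf("next to run: %s\n", first->name);
             return 0;
         }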
  16. 10 June 2008 (1 commit)
    • sched: fix hotplug cpus on ia64 · 7def2be1
      Committed by Peter Zijlstra
      Cliff Wickman wrote:
      
      > I built an ia64 kernel from Andrew's tree (2.6.26-rc2-mm1)
      > and get a very predictable hotplug cpu problem.
      > billberry1:/tmp/cpw # ./dis
      > disabled cpu 17
      > enabled cpu 17
      > billberry1:/tmp/cpw # ./dis
      > disabled cpu 17
      > enabled cpu 17
      > billberry1:/tmp/cpw # ./dis
      >
      > The script that disables the cpu always hangs (unkillable)
      > on the 3rd attempt.
      >
      > And a bit further:
      > The kstopmachine thread always sits on the run queue (real time) for about
      > 30 minutes before running.
      
      This fix solves some (but not all) issues between CPU hotplug and
      RT bandwidth throttling.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  17. 06 June 2008 (4 commits)
    • sched: fix cpuprio build bug · 1100ac91
      Committed by Ingo Molnar
      The previous patch did not build on !SMP:
      
       kernel/sched_rt.c: In function 'inc_rt_tasks':
       kernel/sched_rt.c:404: error: 'struct rq' has no member named 'online'
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fix cpupri hotplug support · 1f11eb6a
      Committed by Gregory Haskins
      The RT folks over at RedHat found an issue w.r.t. hotplug support which
      was traced to problems with the cpupri infrastructure in the scheduler:
      
      https://bugzilla.redhat.com/show_bug.cgi?id=449676
      
      This bug affects 23-rt12+, 24-rtX, 25-rtX, and sched-devel.  This patch
      applies to 25.4-rt4, though it should trivially apply to most cpupri enabled
      kernels mentioned above.
      
      It turned out that the issue was that offline cpus could get inadvertently
      registered with cpupri so that they were erroneously selected during
      migration decisions.  The end result would be an OOPS as the offline cpu
      had tasks routed to it.
      
      This patch generalizes the old join/leave domain interface into an
      online/offline interface, and adjusts the root-domain/hotplug code to
      utilize it.
      
      I was able to easily reproduce the issue prior to this patch, and am no
      longer able to reproduce it after this patch.  I can offline cpus
      indefinitely and everything seems to be in working order.
      
      Thanks to Arnaldo (acme), Thomas, and Peter for doing the legwork to point
      me in the right direction.  Also thank you to Peter for reviewing the
      early iterations of this patch.
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
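      A rough sketch of the generalized interface (assumed, simplified shapes for
      sched_class and rq; the real hooks live inside the scheduler classes): each
      class gets rq_online()/rq_offline() callbacks, and the hotplug/root-domain
      paths invoke them, which lets cpupri drop an offlined CPU before it can be
      picked as a migration target.

         #include <stdio.h>

         struct rq;
         struct sched_class {
             const char *name;
             void (*rq_online)(struct rq *rq);
             void (*rq_offline)(struct rq *rq);
             const struct sched_class *next;
         };

         struct rq { int cpu; int online; const struct sched_class *classes; };

         static void set_rq_online(struct rq *rq)
         {
             if (!rq->online) {
                 for (const struct sched_class *c = rq->classes; c; c = c->next)
                     if (c->rq_online)
                         c->rq_online(rq);
                 rq->online = 1;
             }
         }

         static void set_rq_offline(struct rq *rq)
         {
             if (rq->online) {
                 for (const struct sched_class *c = rq->classes; c; c = c->next)
                     if (c->rq_offline)
                         c->rq_offline(rq);
                 rq->online = 0;
             }
         }

         static void rt_rq_online(struct rq *rq)  { printf("cpupri: add cpu %d\n", rq->cpu); }
         static void rt_rq_offline(struct rq *rq) { printf("cpupri: drop cpu %d\n", rq->cpu); }

         static const struct sched_class rt_class = { "rt", rt_rq_online, rt_rq_offline, NULL };

         int main(void)
         {
             struct rq rq = { 17, 0, &rt_class };
             set_rq_online(&rq);
             set_rq_offline(&rq);   /* the offline CPU leaves cpupri's view */
             return 0;
         }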
    • sched: use a 2-d bitmap for searching lowest-pri CPU · 6e0534f2
      Committed by Gregory Haskins
      The current code uses a linear algorithm, which causes scaling issues
      on larger SMP machines.  This patch replaces that algorithm with a
      2-dimensional bitmap to reduce latencies in the wake-up path.
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Acked-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
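      A simplified model of the 2-d structure (plain uint32_t masks instead of
      cpumask_t, and without the lockless count/bitmap details of the real
      cpupri): one CPU mask per priority level, so finding a CPU for a waking RT
      task scans priority levels rather than every CPU.

         #include <stdio.h>
         #include <stdint.h>

         #define NR_PRIO_LEVELS 102   /* idle, normal, and RT0..RT99 */
         #define NR_CPUS 8

         struct cpupri {
             uint32_t cpu_mask[NR_PRIO_LEVELS];  /* which CPUs sit at each level */
             int cpu_level[NR_CPUS];             /* current level of each CPU */
         };

         /* Move a CPU from its old priority level to a new one. */
         static void cpupri_set(struct cpupri *cp, int cpu, int newlevel)
         {
             int old = cp->cpu_level[cpu];
             cp->cpu_mask[old]      &= ~(1u << cpu);
             cp->cpu_mask[newlevel] |=  (1u << cpu);
             cp->cpu_level[cpu] = newlevel;
         }

         /* Return a mask of CPUs running below task_level, or 0 if none. */
         static uint32_t cpupri_find(const struct cpupri *cp, int task_level)
         {
             for (int level = 0; level < task_level; level++)
                 if (cp->cpu_mask[level])
                     return cp->cpu_mask[level];
             return 0;
         }

         int main(void)
         {
             struct cpupri cp = { { 0 }, { 0 } };
             for (int cpu = 0; cpu < NR_CPUS; cpu++)
                 cpupri_set(&cp, cpu, 0);        /* everyone idle (level 0) */
             cpupri_set(&cp, 3, 60);             /* cpu 3 now runs an RT task */
             /* candidates are the idle CPUs, i.e. everyone except cpu 3 */
             printf("candidates: 0x%x\n", (unsigned)cpupri_find(&cp, 60));
             return 0;
         }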
    • sched: prioritize non-migratable tasks over migratable ones · 45c01e82
      Committed by Gregory Haskins
      Dmitry Adamushko pointed out a known flaw in the rt-balancing algorithm
      that could allow suboptimal balancing if a non-migratable task gets
      queued behind a running migratable one.  It is discussed in this thread:
      
      http://lkml.org/lkml/2008/4/22/296
      
      This issue has been further exacerbated by a recent checkin to
      sched-devel (git-id 5eee63a5ebc19a870ac40055c0be49457f3a89a3).
      
      From a pure priority standpoint, the run-queue is doing the "right"
      thing. Using Dmitry's nomenclature, if T0 is on cpu1 first, and T1
      wakes up at equal or lower priority (affined only to cpu1) later, it
      *should* wait for T0 to finish.  However, in reality that is likely
      suboptimal from a system perspective if there are other cores that
      could allow T0 and T1 to run concurrently.  Since T1 cannot migrate,
      the only choice for higher concurrency is to try to move T0.  This is
      not something we addressed in the recent rt-balancing re-work.
      
      This patch tries to enhance the balancing algorithm by accommodating this
      scenario.  It accomplishes this by incorporating the migratability of a
      task into its priority calculation.  Within a numerical tsk->prio, a
      non-migratable task is logically higher than a migratable one.  We
      maintain this by introducing a new per-priority queue (xqueue, or
      exclusive-queue) for holding non-migratable tasks.  The scheduler will
      draw from the xqueue over the standard shared-queue (squeue) when
      available.
      
      There are several details for utilizing this properly.
      
      1) During task-wake-up, we not only need to check if the priority
         preempts the current task, but we also need to check for this
         non-migratable condition.  Therefore, if a non-migratable task wakes
         up and sees an equal priority migratable task already running, it
         will attempt to preempt it *if* there is a likelihood that the
         current task will find an immediate home.
      
      2) Tasks only get this non-migratable "priority boost" on wake-up.  Any
         requeuing will result in the non-migratable task being queued to the
         end of the shared queue.  This is an attempt to prevent the system
         from being completely unfair to migratable tasks during things like
         SCHED_RR timeslicing.
      
      I am sure this patch introduces potentially "odd" behavior if you
      concoct a scenario where a bunch of non-migratable threads could starve
      migratable ones given the right pattern.  I am not yet convinced that
      this is a problem, since we are talking about tasks of equal RT priority
      anyway, and there is never much in the way of guarantees against
      starvation under that scenario anyway (e.g. you could come up with a
      similar scenario with a specific timing environment versus an affinity
      environment).  I can be convinced otherwise, but for now I think this is
      "ok".
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
      CC: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
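      A tiny sketch of the pick order this implies (simplified stand-ins for the
      two queues; not the in-tree code): at a given priority, the scheduler draws
      from the exclusive queue before the shared one.

         #include <stdio.h>

         struct task { const char *name; struct task *next; };

         /* Per priority level: an exclusive queue (xqueue) for CPU-bound tasks
          * and a shared queue (squeue) for migratable ones. */
         struct prio_array { struct task *xqueue; struct task *squeue; };

         static struct task *pick_next(struct prio_array *pa)
         {
             return pa->xqueue ? pa->xqueue : pa->squeue;  /* non-migratable wins ties */
         }

         int main(void)
         {
             struct task shared = { "migratable", NULL };
             struct task excl   = { "bound-to-cpu", NULL };
             struct prio_array pa = { &excl, &shared };
             printf("next: %s\n", pick_next(&pa)->name);
             return 0;
         }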
  18. 29 May 2008 (1 commit)
  19. 24 May 2008 (1 commit)
  20. 06 May 2008 (2 commits)
  21. 20 April 2008 (5 commits)
    • sched: rt-group: optimize dequeue_rt_stack · 58d6c2d7
      Committed by Peter Zijlstra
      Now that the group hierarchy can have an arbitrary depth, the O(n^2) nature
      of RT task dequeues will really hurt. Optimize this by providing space to
      store the tree path, so we can walk it the other way.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
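      A sketch of the optimization with simplified stand-in types (not the
      in-tree code): record the path while walking up the hierarchy, then walk
      it back down once, instead of re-walking from the top for every level.

         #include <stdio.h>

         struct rt_entity {
             const char *name;
             int on_rq;
             struct rt_entity *parent;   /* up the group hierarchy */
             struct rt_entity *back;     /* filled in on the way up */
         };

         static void dequeue_rt_stack(struct rt_entity *rt_se)
         {
             struct rt_entity *back = NULL;

             /* walk up, remembering the path */
             for (; rt_se; rt_se = rt_se->parent) {
                 rt_se->back = back;
                 back = rt_se;
             }
             /* walk back down, dequeueing top-down in a single pass */
             for (rt_se = back; rt_se; rt_se = rt_se->back) {
                 if (rt_se->on_rq) {
                     rt_se->on_rq = 0;
                     printf("dequeued %s\n", rt_se->name);
                 }
             }
         }

         int main(void)
         {
             struct rt_entity root  = { "root",  1, NULL,   NULL };
             struct rt_entity group = { "group", 1, &root,  NULL };
             struct rt_entity task  = { "task",  1, &group, NULL };
             dequeue_rt_stack(&task);   /* prints root, group, task in that order */
             return 0;
         }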
    • sched: fair-group: SMP-nice for group scheduling · 18d95a28
      Committed by Peter Zijlstra
      Implement SMP nice support for the full group hierarchy.
      
      On each load-balance action, compile a sched_domain wide view of the full
      task_group tree. We compute the domain wide view when walking down the
      hierarchy, and readjust the weights when walking back up.
      
      After collecting and readjusting the domain wide view, we try to balance the
      tasks within the task_groups. The current approach is to naively balance each
      task group until we've moved the targeted amount of load.
      
      Inspired by Srivatsa Vaddagiri's previous code and Abhishek Chandra's H-SMP
      paper.
      
      XXX: there will be some numerical issues due to the limited nature of
           SCHED_LOAD_SCALE w.r.t. representing a task_group's influence on the
           total weight. When the tree is deep enough, or the task weight small
           enough, we'll run out of bits.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      CC: Abhishek Chandra <chandra@cs.umn.edu>
      CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: mix tasks and groups · 354d60c2
      Committed by Dhaval Giani
      This patch allows tasks and groups to exist in the same cfs_rq. With this
      change, CFS group scheduling moves from a 1/(1+N) fairness model to a
      1/(M+N) model, where M tasks and N groups exist at the same cfs_rq level.
      
      [a.p.zijlstra@chello.nl: rt bits and assorted fixes]
      Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
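      A rough worked example of the model change (this reading of the formula is
      an assumption, not taken from the patch itself): consider a cfs_rq with
      M = 3 tasks and N = 1 group, all at default weight. Under the old 1/(1+N)
      model the three tasks effectively competed as a single entity against the
      group, so the group received 1/2 of the cfs_rq's share and each task
      (1/2)/3 = 1/6. Under the new 1/(M+N) model each of the four schedulable
      entities receives 1/(3+1) = 1/4.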
    • sched: add new set_cpus_allowed_ptr function · cd8ba7cd
      Committed by Mike Travis
      Add a new function that accepts a pointer to the "newly allowed cpus"
      cpumask argument.
      
      int set_cpus_allowed_ptr(struct task_struct *p, const cpumask_t *new_mask)
      
      The current set_cpus_allowed() function is modified to use the above,
      but this does not result in an ABI change.  And with some compiler
      optimization help, it may not introduce any additional overhead.

      Additionally, to enforce the read-only nature of the new_mask arg, the
      "const" property is migrated to the sub-functions called by set_cpus_allowed().
      This silences compiler warnings.
      Signed-off-by: Mike Travis <travis@sgi.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: rt-group: smp balancing · ac086bc2
      Committed by Peter Zijlstra
      Currently the rt group scheduling does a per-cpu runtime limit; however,
      the rt load balancer makes no guarantees about an equal spread of real-
      time tasks, just that at any one time the highest priority tasks run.

      Solve this by making the runtime limit a global property: once the local
      limit runs out, borrow excess runtime from the other cpus.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
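      A simplified, self-contained sketch of the borrowing idea (an array of
      per-cpu budgets instead of the real per-rt_rq locking and iterators; the
      split heuristic is an assumption): when the local budget is exhausted,
      pull spare budget from other cpus, never exceeding the period.

         #include <stdio.h>

         #define NR_CPUS 4

         /* Per-CPU RT runtime accounts for one group. */
         struct rt_rq {
             long long rt_runtime;   /* budget for this period */
             long long rt_time;      /* budget already consumed */
         };

         /* When this cpu's budget is used up, borrow unused budget from the
          * other cpus, so the limit behaves like a global property rather
          * than a strict per-cpu one. */
         static void balance_runtime(struct rt_rq rqs[], int this_cpu, long long period)
         {
             struct rt_rq *self = &rqs[this_cpu];

             for (int cpu = 0; cpu < NR_CPUS && self->rt_time >= self->rt_runtime; cpu++) {
                 if (cpu == this_cpu)
                     continue;
                 long long spare = rqs[cpu].rt_runtime - rqs[cpu].rt_time;
                 if (spare <= 0)
                     continue;
                 /* take at most half of the spare budget, and never let the
                  * local budget exceed the period */
                 long long want = period - self->rt_runtime;
                 long long diff = spare / 2;
                 if (diff > want)
                     diff = want;
                 rqs[cpu].rt_runtime -= diff;
                 self->rt_runtime    += diff;
             }
         }

         int main(void)
         {
             struct rt_rq rqs[NR_CPUS] = {
                 { 950, 950 }, { 950, 100 }, { 950, 0 }, { 950, 0 },
             };
             balance_runtime(rqs, 0, 1000);
             printf("cpu0 runtime is now %lld\n", rqs[0].rt_runtime);
             return 0;
         }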