1. December 17, 2009: 10 commits
    • sched: Make warning less noisy · 416eb395
      Committed by Ingo Molnar
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170517.807938893@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      416eb395
    • sched: Simplify set_task_cpu() · 738d2be4
      Committed by Peter Zijlstra
      Rearrange the code a bit now that it's a simpler function.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170518.269101883@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      738d2be4
    • sched: Remove the cfs_rq dependency from set_task_cpu() · 88ec22d3
      Committed by Peter Zijlstra
      In order to remove the cfs_rq dependency from set_task_cpu() we
      need to ensure the task is cfs_rq invariant for all callsites.
      
      The simple approach is to subtract cfs_rq->min_vruntime from
      se->vruntime on dequeue, and to add cfs_rq->min_vruntime on
      enqueue.
      
      However, this has the downside of breaking FAIR_SLEEPERS, since
      we lose the old vruntime as we only maintain the relative
      position.
      
      To solve this, we observe that we only migrate runnable tasks;
      we do this using deactivate_task(.sleep=0) and
      activate_task(.wakeup=0), and can therefore restrict the
      min_vruntime invariance to that state.
      
      The only other case is wakeup balancing: since we want to
      maintain the old vruntime, we cannot make it relative on dequeue;
      but since we don't migrate inactive tasks, we can do so right
      before we activate the task again.
      
      This is where we need the new pre-wakeup hook: it must be called
      while still holding the old rq->lock. We could fold it into
      ->select_task_rq(), but since that has multiple call sites and
      would obfuscate the locking requirements, that seems like a
      fudge.
      
      That leaves the fork() case: simply make sure that ->task_fork()
      leaves ->vruntime in a relative state.
      
      This covers all cases where set_task_cpu() gets called, and
      ensures it sees a relative vruntime.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170518.191697025@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      88ec22d3
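      A minimal sketch of the normalization this commit describes, assuming the
      usual kernel/sched_fair.c field names (illustrative placement, not the
      literal diff):

          static void dequeue_sketch(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
          {
                  /* ... usual dequeue accounting ... */
                  if (!sleep)
                          se->vruntime -= cfs_rq->min_vruntime;   /* keep only the relative position */
          }

          static void enqueue_sketch(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
          {
                  if (!wakeup)
                          se->vruntime += cfs_rq->min_vruntime;   /* re-anchor on the new cfs_rq */
                  /* ... usual enqueue accounting ... */
          }
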
    • sched: Add pre and post wakeup hooks · efbbd05a
      Committed by Peter Zijlstra
      As will be apparent in the next patch, we need a pre-wakeup hook
      for sched_fair task migration; hence rename the post-wakeup hook
      and add a pre-wakeup one.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170518.114746117@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      efbbd05a
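      The hook pair added to struct sched_class looks roughly like the following
      sketch (signatures assumed from this era's sched_class conventions):

          struct sched_class {
                  /* ... */
                  /* pre-wakeup: runs in ttwu() while the old rq->lock is still held */
                  void (*task_waking)(struct rq *this_rq, struct task_struct *task);
                  /* post-wakeup: runs after the task has been placed on its new rq */
                  void (*task_woken)(struct rq *this_rq, struct task_struct *task);
                  /* ... */
          };
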
    • sched: Move kthread_bind() back to kthread.c · 881232b7
      Committed by Peter Zijlstra
      Since kthread_bind() lost its dependencies on sched.c, move it
      back where it came from.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170518.039524041@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      881232b7
    • sched: Fix select_task_rq() vs hotplug issues · 5da9a0fb
      Committed by Peter Zijlstra
      Since select_task_rq() is now responsible for guaranteeing that
      the returned CPU respects ->cpus_allowed and cpu_active_mask, we
      need to verify this.
      
      select_task_rq_rt() can blindly return
      smp_processor_id()/task_cpu() without checking the valid masks;
      select_task_rq_fair() can do the same in the rare case that all
      SD flags are disabled.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170517.961475466@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      5da9a0fb
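      Conceptually, the core wrapper now validates whatever the class method
      returns, along these lines (a sketch; select_fallback_rq() stands in for
      whatever fallback path the fix uses):

          static int checked_select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
          {
                  int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);

                  /* the class method may ignore the masks; never let that through */
                  if (!cpumask_test_cpu(cpu, &p->cpus_allowed) || !cpu_active(cpu))
                          cpu = select_fallback_rq(task_cpu(p), p);

                  return cpu;
          }
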
    • sched: Fix sched_exec() balancing · 38022906
      Committed by Peter Zijlstra
      Since we access ->cpus_allowed without holding rq->lock, we need
      a retry loop to validate the result; this comes nearly for free
      when we merge sched_migrate_task() into sched_exec(), since that
      already does the needed check.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170517.884743662@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      38022906
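      The retry loop amounts to something like this simplified sketch of the
      merged sched_exec() path (names approximate, not the exact upstream code):

          void sched_exec_sketch(void)
          {
                  struct task_struct *p = current;
                  unsigned long flags;
                  struct rq *rq;
                  int dest_cpu;

          again:
                  dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);

                  rq = task_rq_lock(p, &flags);
                  if (unlikely(!cpumask_test_cpu(dest_cpu, &p->cpus_allowed))) {
                          task_rq_unlock(rq, &flags);
                          goto again;     /* ->cpus_allowed changed while unlocked */
                  }
                  /* ... hand off to the migration thread as before ... */
                  task_rq_unlock(rq, &flags);
          }
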
    • sched: Ensure set_task_cpu() is never called on blocked tasks · e2912009
      Committed by Peter Zijlstra
      In order to clean up the set_task_cpu() rq dependencies we need
      to ensure it is never called on blocked tasks because such usage
      does not pair with consistent rq->lock usage.
      
      This puts the migration burden on ttwu().
      
      Furthermore we need to close a race against changing
      ->cpus_allowed, since select_task_rq() runs with only preemption
      disabled.
      
      For sched_fork() this is safe because the child isn't in the
      tasklist yet; for wakeup we fix this by synchronizing
      set_cpus_allowed_ptr() against TASK_WAKING, which leaves
      sched_exec() as a problem.
      
      This also closes a hole in (6ad4c188 sched: Fix balance vs
      hotplug race) where ->select_task_rq() doesn't validate the
      result against the sched_domain/root_domain.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170517.807938893@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e2912009
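      With this rule in place, set_task_cpu() can assert it, roughly as in the
      following sketch (the actual check in the patch is more elaborate):

          void set_task_cpu_sketch(struct task_struct *p, unsigned int new_cpu)
          {
          #ifdef CONFIG_SCHED_DEBUG
                  /* only runnable or TASK_WAKING tasks may be migrated here */
                  WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING);
          #endif
                  /* ... migration bookkeeping ... */
          }
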
    • sched: Use TASK_WAKING for fork wakeups · 06b83b5f
      Committed by Peter Zijlstra
      For later convenience use TASK_WAKING for fresh tasks.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20091216170517.732561278@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      06b83b5f
    • sched: Fix task_hot() test order · e6c8fba7
      Committed by Peter Zijlstra
      Make sure not to access sched_fair fields before verifying that
      the task is indeed a sched_fair task.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      CC: stable@kernel.org
      LKML-Reference: <20091216170517.577998058@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e6c8fba7
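      The point of the fix, in sketch form: the real task_hot() has more cases,
      but what matters is that the class check comes before any p->se access.

          static int task_hot_sketch(struct task_struct *p, u64 now)
          {
                  s64 delta;

                  if (p->sched_class != &fair_sched_class)
                          return 0;       /* p->se fields are only valid for CFS tasks */

                  if (sysctl_sched_migration_cost == 0)
                          return 0;

                  delta = now - p->se.exec_start;
                  return delta < (s64)sysctl_sched_migration_cost;
          }
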
  2. December 15, 2009: 8 commits
  3. December 13, 2009: 2 commits
    • sched: Use pr_fmt() and pr_<level>() · 663997d4
      Committed by Joe Perches
      - Convert printk(KERN_<level> ...) to pr_<level>() (not KERN_DEBUG)
      - Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
      - Coalesce long format strings
      - Add missing \n to "ERROR: !SD_LOAD_BALANCE domain has parent"
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1260655047.2637.7.camel@Joe-Laptop.home>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      663997d4
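      The conversion pattern, for reference: pr_fmt() must be defined before the
      includes that pull in <linux/kernel.h>, so the prefix is applied to every
      pr_<level>() call in the file.

          #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

          #include <linux/kernel.h>

          static void report_domain_error(void)
          {
                  /* was: printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain has parent"); */
                  pr_err("ERROR: !SD_LOAD_BALANCE domain has parent\n");
          }
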
    • sched: Make wakeup side and atomic variants of completion API irq safe · 7539a3b3
      Committed by Rafael J. Wysocki
      Alan Stern noticed that all the wakeup side (and atomic) variants of the
      completion APIs should be irq safe, but the newly introduced
      completion_done() and try_wait_for_completion() aren't. The use of the
      irq unsafe variants in IRQ contexts can cause crashes/hangs.
      
      Fix the problem by making them use spin_lock_irqsave() and
      spin_unlock_irqrestore().
      Reported-by: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: pm list <linux-pm@lists.linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: David Chinner <david@fromorbit.com>
      Cc: Lachlan McIlroy <lachlan@sgi.com>
      LKML-Reference: <200912130007.30541.rjw@sisk.pl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7539a3b3
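      After the fix, completion_done() follows the usual save/restore pattern,
      roughly along these lines (a sketch mirroring the described change):

          bool completion_done_sketch(struct completion *x)
          {
                  unsigned long flags;
                  int ret = 1;

                  spin_lock_irqsave(&x->wait.lock, flags);        /* irq safe */
                  if (!x->done)
                          ret = 0;
                  spin_unlock_irqrestore(&x->wait.lock, flags);
                  return ret;
          }
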
  4. December 11, 2009: 1 commit
    • sched: Remove forced2_migrations stats · b9889ed1
      Committed by Ingo Molnar
      This build warning:
      
       kernel/sched.c: In function 'set_task_cpu':
       kernel/sched.c:2070: warning: unused variable 'old_rq'
      
      Made me realize that the forced2_migrations stat looks pretty
      pointless (and a misnomer) - remove it.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b9889ed1
  5. December 10, 2009: 2 commits
  6. December 9, 2009: 12 commits
  7. December 7, 2009: 1 commit
    • sched: Fix balance vs hotplug race · 6ad4c188
      Committed by Peter Zijlstra
      Since e761b772 (cpu hotplug, sched: Introduce cpu_active_map and redo
      sched domain management) we have cpu_active_mask, which is supposed to
      rule scheduler migration and load-balancing, except it never (fully) did.
      
      The particular problem being solved here is a crash in try_to_wake_up()
      where select_task_rq() ends up selecting an offline cpu because
      select_task_rq_fair() trusts the sched_domain tree to reflect the
      current state of affairs, similarly select_task_rq_rt() trusts the
      root_domain.
      
      However, the sched_domains are updated from CPU_DEAD, which is after the
      cpu is taken offline and after stop_machine is done. Therefore it can
      race perfectly well with code assuming the domains are right.
      
      Cure this by building the domains from cpu_active_mask on
      CPU_DOWN_PREPARE.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6ad4c188
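      In sketch form, the hotplug notifier now rebuilds the domains on the
      DOWN_PREPARE path as well, so the domain tree never outlives the active
      mask (names and case list simplified; not the exact upstream hunk):

          static int update_sched_domains_sketch(struct notifier_block *nb,
                                                 unsigned long action, void *hcpu)
          {
                  switch (action & ~CPU_TASKS_FROZEN) {
                  case CPU_ONLINE:
                  case CPU_DOWN_PREPARE:  /* cpu already cleared from cpu_active_mask */
                  case CPU_DOWN_FAILED:
                          partition_sched_domains(1, NULL, NULL); /* rebuild from active cpus */
                          return NOTIFY_OK;
                  default:
                          return NOTIFY_DONE;
                  }
          }
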
  8. December 3, 2009: 3 commits
    • mutex: Fix missing conditions to build mutex_spin_on_owner() · c08f7829
      Committed by Frederic Weisbecker
      We don't need to build mutex_spin_on_owner() if we have
      CONFIG_DEBUG_MUTEXES or CONFIG_HAVE_DEFAULT_NO_SPIN_MUTEXES as
      it won't be used under such configs.
      
      Use CONFIG_MUTEX_SPIN_ON_OWNER as it gathers all the necessary
      checks before building it.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <1259783357-8542-2-git-send-regression-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      c08f7829
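      The guard this asks for, in sketch form: one config symbol instead of
      repeating the individual checks around the function (body elided):

          #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
          /* only built when adaptive mutex spinning can actually be used */
          int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
          {
                  /* ... spin while the lock owner is still running on a cpu ... */
                  return 1;
          }
          #endif
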
    • sched, cputime: Introduce thread_group_times() · 0cf55e1e
      Committed by Hidetoshi Seto
      This is a real fix for the problem of utime/stime values decreasing,
      described in this thread:
      
         http://lkml.org/lkml/2009/11/3/522
      
      Now cputime is accounted in the following way:
      
       - {u,s}time in task_struct are increased every time the thread is
         interrupted by a tick (timer interrupt).
      
       - When a thread exits, its {u,s}time are added to signal->{u,s}time
         after being adjusted by task_times().
      
       - When all threads in a thread group have exited, the accumulated
         {u,s}time (and also c{u,s}time) in the signal struct are added to
         c{u,s}time in the signal struct of the group's parent.
      
      So {u,s}time in the task struct are "raw" tick counts, while
      {u,s}time and c{u,s}time in the signal struct are "adjusted" values.
      
      And accounted values are used by:
      
       - task_times(), to get the cputime of a thread:
         This function returns adjusted values that originate from the raw
         {u,s}time, scaled by the sum_exec_runtime accounted by CFS.
      
       - thread_group_cputime(), to get the cputime of a thread group:
         This function returns the sum of all {u,s}time of living threads in
         the group, plus the {u,s}time in the signal struct, which is the sum
         of the adjusted cputimes of all exited threads that belonged to the
         group.
      
      The problem is the return value of thread_group_cputime(),
      because it is a mixed sum of "raw" and "adjusted" values:
      
        group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)
      
      This misbehavior can break {u,s}time monotonicity: if there is a
      thread whose raw values are greater than its adjusted values (e.g.
      it was interrupted by 1000Hz ticks 50 times but only ran for 45ms)
      and it exits, the group's cputime will decrease (e.g. by 5ms).
      
      To fix this, we could do:
      
        group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)
      
      But task_times() contains hard divisions, so applying it for
      every thread should be avoided.
      
      This patch fixes the above problem in the following way:
      
       - Modify thread exit (= __exit_signal()) not to use task_times().
         This means the {u,s}time in the signal struct accumulate raw values
         instead of adjusted values.  As a result, thread_group_cputime()
         returns a pure sum of "raw" values.
      
       - Introduce a new function thread_group_times(*task, *utime, *stime)
         that converts the "raw" values of thread_group_cputime() to
         "adjusted" values, using the same calculation procedure as
         task_times().
      
       - Modify group exit (= wait_task_zombie()) to use the newly
         introduced thread_group_times().  This makes the c{u,s}time in the
         signal struct hold adjusted values, as before this patch.
      
       - Replace some thread_group_cputime() calls with thread_group_times().
         These replacements are only applied where the "adjusted" cputime is
         conveyed to users and where task_times() is already used nearby
         (i.e. sys_times(), getrusage(), and /proc/<PID>/stat).
      
      This patch has a positive side effect:
      
       - Before this patch, if a group contained many short-lived threads
         (e.g. each runs for 0.9ms and is never interrupted by a tick), the
         group's cputime could become invisible, since each thread's cputime
         was accumulated after adjustment: imagine the adjustment function
         as adj(ticks, runtime), then
           {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
         After this patch this no longer happens, because the adjustment is
         applied after accumulation.
      
      v2:
       - remove if()s, put new variables into signal_struct.
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Spencer Candland <spencer@bluehost.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      LKML-Reference: <4B162517.8040909@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      0cf55e1e
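      A sketch of the introduced helper: thread_group_cputime() and struct
      task_cputime are existing kernel interfaces, while adjust_by_runtime() is
      a hypothetical stand-in for the task_times()-style rescaling (which the
      patch keeps monotonic via the new prev_{u,s}time fields in signal_struct).

          void thread_group_times_sketch(struct task_struct *p, cputime_t *ut, cputime_t *st)
          {
                  struct task_cputime cputime;

                  thread_group_cputime(p, &cputime);      /* raw utime/stime + sum_exec_runtime */

                  /* adjust the raw group sums once, instead of once per thread;
                   * adjust_by_runtime() is a stand-in for the task_times() rescaling */
                  *ut = adjust_by_runtime(cputime.utime, cputime.sum_exec_runtime);
                  *st = adjust_by_runtime(cputime.stime, cputime.sum_exec_runtime);
          }
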
    • sched, cputime: Cleanups related to task_times() · d99ca3b9
      Committed by Hidetoshi Seto
      - Remove the if({u,s}t) checks because nothing calls it with NULL now.
      - Use cputime_{add,sub}().
      - Add ifndef-endif for prev_{u,s}time since they are used
        only when !VIRT_CPU_ACCOUNTING.
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Spencer Candland <spencer@bluehost.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      LKML-Reference: <4B1624C7.7040302@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d99ca3b9
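      The ifndef guard mentioned above looks like this in sketch form, shown on
      a stand-in struct (the real fields live in task_struct/signal_struct):

          struct cputime_fields_example {
          #ifndef CONFIG_VIRT_CPU_ACCOUNTING
                  cputime_t prev_utime;   /* only consulted by the generic task_times() */
                  cputime_t prev_stime;
          #endif
                  cputime_t utime;
                  cputime_t stime;
          };
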
  9. December 2, 2009: 1 commit
    • sched: Fix isolcpus boot option · bdddd296
      Committed by Rusty Russell
      Anton Blanchard wrote:
      
      > We allocate and zero cpu_isolated_map after the isolcpus
      > __setup option has run. This means cpu_isolated_map always
      > ends up empty and if CPUMASK_OFFSTACK is enabled we write to a
      > cpumask that hasn't been allocated.
      
      I introduced this regression in 49557e62 (sched: Fix
      boot crash by zalloc()ing most of the cpu masks).
      
      Use the bootmem allocator if isolcpus= is set; otherwise allocate
      and zero as normal.
      Reported-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: peterz@infradead.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@kernel.org>
      LKML-Reference: <200912021409.17013.rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Tested-by: Anton Blanchard <anton@samba.org>
      bdddd296
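      The resulting setup handler is essentially the following (a sketch
      mirroring the described fix; alloc_bootmem_cpumask_var() is the
      early-boot counterpart of alloc_cpumask_var()):

          static int __init isolated_cpu_setup(char *str)
          {
                  /* runs from __setup(), before the normal cpumask allocations */
                  alloc_bootmem_cpumask_var(&cpu_isolated_map);
                  cpulist_parse(str, cpu_isolated_map);
                  return 1;
          }
          __setup("isolcpus=", isolated_cpu_setup);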