1. 09 Jun, 2010 2 commits
    • T
      sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining · 3a101d05
      Tejun Heo authored
      Currently, when a cpu goes down, cpu_active is cleared before
      CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
      default priority cpu notifier.  When a cpu is coming up, it's set
      before CPU_ONLINE but cpuset configuration again is updated from the
      same cpu notifier.
      
      For cpu notifiers, this presents an inconsistent state.  Threads which
      a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
      migrated to other cpus because the cpu is no longer active.
      
      Fix it by updating cpu_active in the highest priority cpu notifier and
      cpuset configuration in the second highest when a cpu is coming up.
      Down path is updated similarly.  This guarantees that all other cpu
      notifiers see consistent cpu_active and cpuset configuration.
      
      cpuset_track_online_cpus() notifier is converted to
      cpuset_update_active_cpus() which just updates the configuration and
      is now called from cpuset_cpu_[in]active() notifiers registered from
      sched_init_smp().  If cpuset is disabled, cpuset_update_active_cpus()
      degenerates into partition_sched_domains() making separate notifier
      for !CONFIG_CPUSETS unnecessary.
      
      This problem is triggered by cmwq.  During CPU_DOWN_PREPARE, hotplug
      callback creates a kthread and kthread_bind()s it to the target cpu,
      and the thread is expected to run on that cpu.
      
      * Ingo's test discovered __cpuinit/exit markups were incorrect.
        Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Paul Menage <menage@google.com>
      3a101d05
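The ordering argument in the commit above can be sketched in user space. This is only a toy model, not kernel code: the priority names and values, `struct cpu_state`, and the helper functions are hypothetical stand-ins for cpu_active, the cpuset configuration, and the notifier chain. It shows that when the two state updates run at the two highest priorities, a default-priority notifier observes both already in place on the up path.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical priorities: names and values are illustrative only. */
enum {
	PRI_SCHED_ACTIVE  = INT_MAX,     /* runs first on the up path */
	PRI_CPUSET_ACTIVE = INT_MAX - 1, /* runs second */
	PRI_DEFAULT       = 0,           /* every other notifier */
};

struct cpu_state {
	bool active;          /* stand-in for cpu_active */
	bool cpuset_updated;  /* stand-in for the cpuset configuration */
	bool consistent_seen; /* what a default-priority notifier observed */
};

struct notifier {
	int priority;
	void (*fn)(struct cpu_state *);
};

static void set_active(struct cpu_state *s)    { s->active = true; }
static void update_cpuset(struct cpu_state *s) { s->cpuset_updated = true; }
static void ordinary(struct cpu_state *s)
{
	s->consistent_seen = s->active && s->cpuset_updated;
}

/* Call every notifier exactly once, highest priority first. */
void call_chain(const struct notifier *chain, size_t n, struct cpu_state *s)
{
	bool done[16] = { false };

	assert(n <= 16);
	for (size_t round = 0; round < n; round++) {
		size_t best = n;

		for (size_t i = 0; i < n; i++)
			if (!done[i] && (best == n ||
			    chain[i].priority > chain[best].priority))
				best = i;
		done[best] = true;
		chain[best].fn(s);
	}
}

/* Simulate a cpu coming up with notifiers registered in arbitrary order. */
bool cpu_up_is_consistent(void)
{
	const struct notifier chain[] = {
		{ PRI_DEFAULT,       ordinary },
		{ PRI_CPUSET_ACTIVE, update_cpuset },
		{ PRI_SCHED_ACTIVE,  set_active },
	};
	struct cpu_state s = { false, false, false };

	call_chain(chain, sizeof(chain) / sizeof(chain[0]), &s);
	return s.consistent_seen;
}
```

Registration order does not matter in this model; only the priorities decide what the ordinary notifier sees.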
    • T
      sched: define and use CPU_PRI_* enums for cpu notifier priorities · 50a323b7
      Tejun Heo authored
      Instead of hardcoding priority 10 and 20 in sched and perf, collect
      them into CPU_PRI_* enums.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      50a323b7
  2. 01 Jun, 2010 1 commit
  3. 30 May, 2010 1 commit
  4. 28 May, 2010 1 commit
  5. 21 May, 2010 2 commits
    • J
      kdb: core for kgdb back end (2 of 2) · 67fc4e0c
      Jason Wessel authored
      This patch contains the hooks and instrumentation into kernel which
      live outside the kernel/debug directory, which the kdb core
      will call to run commands like lsmod, dmesg, bt etc...
      
      CC: linux-arch@vger.kernel.org
      Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
      Signed-off-by: Martin Hicks <mort@sgi.com>
      67fc4e0c
    • M
      wait_event_interruptible_locked() interface · 22c43c81
      Michal Nazarewicz authored
      New wait_event_interruptible{,_exclusive}_locked{,_irq} macros added.
      They work just like versions without _locked* suffix but require the
      wait queue's lock to be held.  Also __wake_up_locked() is now exported
      so that it can be paired with the above macros.
      
      The use case of this new facility is when one uses wait queue's lock
      to  protect a data structure.  This may be advantageous if the
      structure needs to be protected by a spinlock anyway.  In particular,
      with additional spinlock the following code has to be used to wait
      for a condition:
      
      spin_lock(&data.lock);
      ...
      for (ret = 0; !ret && !(condition); ) {
      	spin_unlock(&data.lock);
      	ret = wait_event_interruptible(data.wqh, (condition));
      	spin_lock(&data.lock);
      }
      ...
      spin_unlock(&data.lock);
      
      This looks bizarre.  In addition, wait_event_interruptible() takes the
      wait queue's lock anyway, so there is an unlock+lock sequence that
      could be avoided.
      
      To avoid those problems and benefit from the wait queue's lock, code
      similar to the following should be used:
      
      /* Waiting */
      spin_lock(&data.wqh.lock);
      ...
      ret = wait_event_interruptible_locked(data.wqh, (condition));
      ...
      spin_unlock(&data.wqh.lock);
      
      /* Waiting exclusively */
      spin_lock(&data.wqh.lock);
      ...
      ret = wait_event_interruptible_exclusive_locked(data.wqh, (condition));
      ...
      spin_unlock(&data.wqh.lock);
      
      /* Waking up */
      spin_lock(&data.wqh.lock);
      ...
      wake_up_locked(&data.wqh);
      ...
      spin_unlock(&data.wqh.lock);
      
      When spin_lock_irq() is used, the matching versions of the macros
      (*_locked_irq()) need to be used.
      Signed-off-by: Michal Nazarewicz <m.nazarewicz@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      22c43c81
  6. 11 May, 2010 2 commits
    • C
      sched, wait: Use wrapper functions · a93d2f17
      Changli Gao authored
      epoll should not touch flags in wait_queue_t. This patch introduces a new
      function, __add_wait_queue_exclusive(), for users who use a wait queue as
      a LIFO queue.
      
      __add_wait_queue_tail_exclusive() is also introduced, replacing
      add_wait_queue_exclusive_locked(). remove_wait_queue_locked() is removed,
      as it duplicates __remove_wait_queue(), is disliked by users, and has
      fewer users.
      Signed-off-by: Changli Gao <xiaosuo@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: <containers@lists.linux-foundation.org>
      LKML-Reference: <1273214006-2979-1-git-send-email-xiaosuo@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a93d2f17
    • P
      rcu: refactor RCU's context-switch handling · 25502a6c
      Paul E. McKenney authored
      The addition of preemptible RCU to treercu resulted in a bit of
      confusion and inefficiency surrounding the handling of context switches
      for RCU-sched and for RCU-preempt.  For RCU-sched, a context switch
      is a quiescent state, pure and simple, just like it always has been.
      For RCU-preempt, a context switch is in no way a quiescent state, but
      special handling is required when a task blocks in an RCU read-side
      critical section.
      
      However, the callout from the scheduler and the outer loop in ksoftirqd
      still call something named rcu_sched_qs(), whose name is no longer
      accurate.  Furthermore, when rcu_check_callbacks() notes an RCU-sched
      quiescent state, it ends up unnecessarily (though harmlessly, aside
      from the performance hit) enqueuing the current task if it happens to
      be running in an RCU-preempt read-side critical section.  This not only
      increases the maximum latency of scheduler_tick(), it also needlessly
      increases the overhead of the next outermost rcu_read_unlock() invocation.
      
      This patch addresses this situation by separating the notion of RCU's
      context-switch handling from that of RCU-sched's quiescent states.
      The context-switch handling is covered by rcu_note_context_switch() in
      general and by rcu_preempt_note_context_switch() for preemptible RCU.
      This permits rcu_sched_qs() to handle quiescent states and only quiescent
      states.  It also reduces the maximum latency of scheduler_tick(), though
      probably by much less than a microsecond.  Finally, it means that tasks
      within preemptible-RCU read-side critical sections avoid incurring the
      overhead of queuing unless there really is a context switch.
      Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      25502a6c
  7. 07 May, 2010 5 commits
    • P
      sched: Remove rq argument to the tracepoints · 27a9da65
      Peter Zijlstra authored
      struct rq isn't visible outside of sched.o so it's nearly useless to
      expose the pointer; there are also no users of it, so remove it.
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1272997616.1642.207.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      27a9da65
    • P
      rcu: need barrier() in UP synchronize_sched_expedited() · fc390cde
      Paul E. McKenney authored
      If synchronize_sched_expedited() is ever to be called from within
      kernel/sched.c in a !SMP PREEMPT kernel, the !SMP implementation needs
      a barrier().
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      fc390cde
    • P
      sched: correctly place paranoia memory barriers in synchronize_sched_expedited() · cc631fb7
      Paul E. McKenney authored
      The memory barriers must be in the SMP case, not in the !SMP case.
      Also add a barrier after the atomic_inc() in order to ensure that
      other CPUs see post-synchronize_sched_expedited() actions as following
      the expedited grace period.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      cc631fb7
    • T
      sched: kill paranoia check in synchronize_sched_expedited() · 94458d5e
      Tejun Heo authored
      The paranoid check which verifies that the cpu_stop callback is
      actually called on all online cpus is completely superfluous.  It's
      guaranteed by cpu_stop facility and if it didn't work as advertised
      other things would go horribly wrong and trying to recover using
      synchronize_sched() wouldn't be very meaningful.
      
      Kill the paranoid check.  Removal of this feature is done as a
      separate step so that it can serve as a bisection point if something
      actually goes wrong.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Josh Triplett <josh@freedesktop.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      94458d5e
    • T
      sched: replace migration_thread with cpu_stop · 969c7921
      Tejun Heo authored
      Currently migration_thread is serving three purposes - migration
      pusher, context to execute active_load_balance() and forced context
      switcher for expedited RCU synchronize_sched.  All three roles are
      hardcoded into migration_thread() and determining which job is
      scheduled is slightly messy.
      
      This patch kills migration_thread and replaces all three uses with
      cpu_stop.  The three different roles of migration_thread() are
      split into three separate cpu_stop callbacks -
      migration_cpu_stop(), active_load_balance_cpu_stop() and
      synchronize_sched_expedited_cpu_stop() - and each use case now simply
      asks cpu_stop to execute the callback as necessary.
      
      synchronize_sched_expedited() was implemented with private
      preallocated resources and custom multi-cpu queueing and waiting
      logic, both of which are provided by cpu_stop.
      synchronize_sched_expedited_count is made atomic and all other shared
      resources along with the mutex are dropped.
      
      synchronize_sched_expedited() also implemented a check to detect cases
      where not all the callbacks got executed on their assigned cpus and
      fell back to synchronize_sched().  If called with cpu hotplug blocked,
      cpu_stop already guarantees that and the condition cannot happen;
      otherwise, stop_machine() would break.  However, this patch preserves
      the paranoid check using a cpumask to record on which cpus the stopper
      ran so that it can serve as a bisection point if something actually
      goes wrong there.
      
      Because the internal execution state is no longer visible,
      rcu_expedited_torture_stats() is removed.
      
      This patch also renames cpu_stop threads from "stopper/%d" to
      "migration/%d".  The names of these threads ultimately don't matter
      and there's no reason to make unnecessary userland visible changes.
      
      With this patch applied, stop_machine() and sched now share the same
      resources.  stop_machine() is faster without wasting any resources and
      sched migration users are much cleaner.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Josh Triplett <josh@freedesktop.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      969c7921
  8. 30 Apr, 2010 1 commit
  9. 23 Apr, 2010 3 commits
  10. 15 Apr, 2010 1 commit
  11. 06 Apr, 2010 1 commit
    • A
      sched: Fix sched_getaffinity() · 84fba5ec
      Anton Blanchard authored
      taskset on 2.6.34-rc3 fails on one of my ppc64 test boxes with
      the following error:
      
        sched_getaffinity(0, 16, 0x10029650030) = -1 EINVAL (Invalid argument)
      
      This box has 128 threads and 16 bytes is enough to cover it.
      
      Commit cd3d8031 (sched:
      sched_getaffinity(): Allow less than NR_CPUS length) is
      comparing these 16 bytes against nr_cpu_ids.
      
      Fix it by comparing nr_cpu_ids to the number of bits in the
      cpumask we pass in.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Sharyathi Nagesh <sharyath@in.ibm.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Russ Anderson <rja@sgi.com>
      Cc: Mike Travis <travis@sgi.com>
      LKML-Reference: <20100406070218.GM5594@kryten>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      84fba5ec
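The unit mismatch described in the commit above can be illustrated with a minimal user-space sketch. The two function names are hypothetical and the checks are simplified stand-ins, not the kernel's actual code: the first compares a byte count with a cpu count, the second converts the buffer length to bits first, as the commit message describes.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Mismatched units: a length in bytes is compared against a cpu count. */
int check_len_bytes_vs_cpus(size_t len_bytes, unsigned int nr_cpu_ids)
{
	assert(nr_cpu_ids > 0);
	return len_bytes < nr_cpu_ids ? -EINVAL : 0;
}

/* Matching units: the buffer length is converted to bits before comparing. */
int check_len_bits_vs_cpus(size_t len_bytes, unsigned int nr_cpu_ids)
{
	assert(nr_cpu_ids > 0);
	return len_bytes * CHAR_BIT < nr_cpu_ids ? -EINVAL : 0;
}
```

With the values from the report (a 16-byte mask and 128 cpu ids), the first check rejects the call because 16 < 128, while the second accepts it because 16 * 8 = 128 bits cover every cpu id.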
  12. 03 Apr, 2010 11 commits
    • P
      sched: Add enqueue/dequeue flags · 371fd7e7
      Peter Zijlstra authored
      In order to reduce the dependency on TASK_WAKING rework the enqueue
      interface to support a proper flags field.
      
      Replace the int wakeup, bool head arguments with an int flags argument
      and create the following flags:
      
        ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
        ENQUEUE_WAKING - the enqueue has relative vruntime due to
                         having sched_class::task_waking() called,
        ENQUEUE_HEAD - the waking task should be placed on the head
                       of the priority queue (where appropriate).
      
      For symmetry also convert sched_class::dequeue() to a flags scheme.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      371fd7e7
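The interface change in the commit above, replacing separate `int wakeup, bool head` arguments with one flags word, might look roughly like the following user-space sketch. The numeric flag values, `struct toy_queue`, and both helper functions are assumptions made for illustration; only the flag names come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names follow the commit message; the values are assumed. */
#define ENQUEUE_WAKEUP 0x1
#define ENQUEUE_WAKING 0x2
#define ENQUEUE_HEAD   0x4

/* Fold the old two-argument form into a single flags word. */
int enqueue_flags_from_args(int wakeup, bool head)
{
	int flags = 0;

	if (wakeup)
		flags |= ENQUEUE_WAKEUP;
	if (head)
		flags |= ENQUEUE_HEAD;
	return flags;
}

/* Hypothetical toy queue standing in for a scheduler run queue. */
struct toy_queue {
	int tasks[8];
	int len;
};

/* Enqueue at the head when ENQUEUE_HEAD is set, otherwise at the tail. */
void toy_enqueue(struct toy_queue *q, int task, int flags)
{
	assert(q->len < 8);
	if (flags & ENQUEUE_HEAD) {
		for (int i = q->len; i > 0; i--)
			q->tasks[i] = q->tasks[i - 1];
		q->tasks[0] = task;
	} else {
		q->tasks[q->len] = task;
	}
	q->len++;
}
```

A flags word lets a new condition such as ENQUEUE_WAKING be added without changing the signature of every enqueue implementation.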
    • P
      sched: Fix nr_uninterruptible count · cc87f76a
      Peter Zijlstra authored
      The cpuload calculation in calc_load_account_active() assumes
      rq->nr_uninterruptible will not change on an offline cpu after
      migrate_nr_uninterruptible(). However the recent migrate on wakeup
      changes broke that and would result in decrementing the offline cpu's
      rq->nr_uninterruptible.
      
      Fix this by accounting the nr_uninterruptible on the waking cpu.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cc87f76a
    • P
      sched: Optimize task_rq_lock() · 65cc8e48
      Peter Zijlstra authored
      Now that we hold the rq->lock over set_task_cpu() again, we can do
      away with most of the TASK_WAKING checks and reduce them again to
      set_cpus_allowed_ptr().
      
      Removes some conditionals from scheduling hot-paths.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      65cc8e48
    • P
      sched: Fix TASK_WAKING vs fork deadlock · 0017d735
      Peter Zijlstra authored
      Oleg noticed a few races with the TASK_WAKING usage on fork.
      
       - since TASK_WAKING is basically a spinlock, it should be IRQ safe
       - since we set TASK_WAKING (*) without holding rq->lock it could
         be there still is a rq->lock holder, thereby not actually
         providing full serialization.
      
      (*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.
      
      Cure the second issue by not setting TASK_WAKING in sched_fork(), but
      only temporarily in wake_up_new_task() while calling select_task_rq().
      
      Cure the first by holding rq->lock around the select_task_rq() call,
      this will disable IRQs, this however requires that we push down the
      rq->lock release into select_task_rq_fair()'s cgroup stuff.
      
      Because select_task_rq_fair() still needs to drop the rq->lock we
      cannot fully get rid of TASK_WAKING.
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      0017d735
    • O
      sched: Make select_fallback_rq() cpuset friendly · 9084bb82
      Oleg Nesterov authored
      Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
      with select_fallback_rq(). It can be called from any context and can't use
      any cpuset locks including task_lock(). It is called when the task doesn't
      have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
      suitable cpu.
      
      I am not proud of this patch. Everything which needs such a fat comment
      can't be good even if correct. But I'd prefer to not change the locking
      rules in the code I hardly understand, and in any case I believe this
      simple change makes the code much more correct compared to the deadlocks
      we currently have.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091027.GA9155@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9084bb82
    • O
      sched: _cpu_down(): Don't play with current->cpus_allowed · 6a1bdc1b
      Oleg Nesterov authored
      _cpu_down() changes the current task's affinity and then recovers it at
      the end. The problems are well known: we can't restore old_allowed if it
      was bound to the now-dead-cpu, and we can race with the userspace which
      can change cpu-affinity during unplug.
      
      _cpu_down() should not play with current->cpus_allowed at all. Instead,
      take_cpu_down() can migrate the caller of _cpu_down() after __cpu_disable()
      removes the dying cpu from cpu_online_mask.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091023.GA9148@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6a1bdc1b
    • O
      sched: sched_exec(): Remove the select_fallback_rq() logic · 30da688e
      Oleg Nesterov authored
      sched_exec()->select_task_rq() reads/updates ->cpus_allowed lockless.
      This can race with other CPUs updating our ->cpus_allowed, and this
      looks meaningless to me.
      
      The task is current and running, it must have online cpus in ->cpus_allowed,
      the fallback mode is bogus. And, if ->sched_class returns the "wrong" cpu,
      this likely means we raced with set_cpus_allowed() which was called
      for a reason; why should sched_exec() retry and call ->select_task_rq()
      again?
      
      Change the code to call sched_class->select_task_rq() directly and do
      nothing if the returned cpu is wrong after re-checking under rq->lock.
      
      From now on, task_struct->cpus_allowed is always stable under TASK_WAKING;
      select_fallback_rq() is always called under rq->lock or the caller
      owns TASK_WAKING (select_task_rq).
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091019.GA9141@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      30da688e
    • O
      sched: move_task_off_dead_cpu(): Remove retry logic · c1804d54
      Oleg Nesterov authored
      The previous patch preserved the retry logic, but it looks unneeded.
      
      __migrate_task() can only fail if we raced with migration after we dropped
      the lock, but in this case the caller of set_cpus_allowed/etc must initiate
      migration itself if ->on_rq == T.
      
      We already fixed p->cpus_allowed, the changes in active/online masks must
      be visible to the racer, and it should migrate the task to an online cpu
      correctly.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091014.GA9138@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c1804d54
    • O
      sched: move_task_off_dead_cpu(): Take rq->lock around select_fallback_rq() · 1445c08d
      Oleg Nesterov authored
      move_task_off_dead_cpu()->select_fallback_rq() reads/updates ->cpus_allowed
      lockless. We can race with set_cpus_allowed() running in parallel.
      
      Change it to take rq->lock around select_fallback_rq(). Note that it is not
      trivial to move this spin_lock() into select_fallback_rq(), we must recheck
      the task was not migrated after we take the lock and other callers do not
      need this lock.
      
      To avoid the races with other callers of select_fallback_rq() which rely on
      TASK_WAKING, we also check p->state != TASK_WAKING and do nothing otherwise.
      The owner of TASK_WAKING must update ->cpus_allowed and choose the correct
      CPU anyway, and the subsequent __migrate_task() is just meaningless because
      p->se.on_rq must be false.
      
      Alternatively, we could change select_task_rq() to take rq->lock right
      after it calls sched_class->select_task_rq(), but this looks a bit ugly.
      
      Also, change it to not assume irqs are disabled and absorb __migrate_task_irq().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091010.GA9131@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1445c08d
    • O
      sched: Kill the broken and deadlockable cpuset_lock/cpuset_cpus_allowed_locked code · 897f0b3c
      Oleg Nesterov authored
      This patch just states the fact that the cpusets/cpuhotplug interaction is
      broken and removes the deadlockable code which only pretends to work.
      
      - cpuset_lock() doesn't really work. It is needed for
        cpuset_cpus_allowed_locked() but we can't take this lock in
        try_to_wake_up()->select_fallback_rq() path.
      
      - cpuset_lock() is deadlockable. Suppose that a task T bound to CPU takes
        callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex,
        stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take
        cpuset_lock() and hangs forever because CPU is already dead and thus
        T can't be scheduled.
      
      - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
        which is not irq-safe, but try_to_wake_up() can be called from irq.
      
      Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
      we currently do without CONFIG_CPUSETS.
      
      Also, with or without this patch, with or without CONFIG_CPUSETS, the
      callers of select_fallback_rq() can race with each other or with
      set_cpus_allowed() paths.
      
      The subsequent patches try to fix these problems.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091003.GA9123@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      897f0b3c
    • O
      sched: set_cpus_allowed_ptr(): Don't use rq->migration_thread after unlock · 47a70985
      Oleg Nesterov authored
      Trivial typo fix. rq->migration_thread can be NULL after
      task_rq_unlock(), this is why we have "mt" which should be
      used instead.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100330165829.GA18284@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      47a70985
  13. 30 Mar, 2010 1 commit
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo authored
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include
      those headers directly instead of assuming availability.  As this
      conversion needs to touch a large number of source files, the following
      script is used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surroundings.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have a fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers, which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  14. 26 Mar, 2010 1 commit
    • P
      x86, perf, bts, mm: Delete the never used BTS-ptrace code · faa4602e
      Peter Zijlstra authored
      Support for the PMU's BTS features has been upstreamed in
      v2.6.32, but we still have the old and disabled ptrace-BTS,
      as Linus noticed it not so long ago.
      
      It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without
      regard for other uses (perf) and doesn't provide the flexibility
      needed for perf either.
      
      Its users are ptrace-block-step and ptrace-bts.  ptrace-bts
      was never used, and ptrace-block-step can be implemented using a
      much simpler approach.
      
      So axe all 3000 lines of it. That includes the *locked_memory*()
      APIs in mm/mlock.c as well.
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Markus Metzger <markus.t.metzger@intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <20100325135413.938004390@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      faa4602e
  15. 17 Mar, 2010 1 commit
  16. 16 Mar, 2010 1 commit
  17. 15 Mar, 2010 1 commit
    • K
      sched: sched_getaffinity(): Allow less than NR_CPUS length · cd3d8031
      KOSAKI Motohiro authored
      [ Note, this commit changes the syscall ABI for > 1024 CPUs systems. ]
      
      Recently, some distro decided to use NR_CPUS=4096 for mysterious reasons.
      Unfortunately, glibc sched interface has the following definition:
      
      	# define __CPU_SETSIZE  1024
      	# define __NCPUBITS     (8 * sizeof (__cpu_mask))
      	typedef unsigned long int __cpu_mask;
      	typedef struct
      	{
      	  __cpu_mask __bits[__CPU_SETSIZE / __NCPUBITS];
      	} cpu_set_t;
      
      This means that if NR_CPUS is bigger than 1024, cpu_set_t causes an
      ABI issue ...
      
      More recently, Sharyathi Nagesh reported that the following test program
      hits a mysterious syscall failure:
      
       -----------------------------------------------------------------------
       #define _GNU_SOURCE
       #include<stdio.h>
       #include<errno.h>
       #include<sched.h>
      
       int main()
       {
           cpu_set_t set;
           if (sched_getaffinity(0, sizeof(cpu_set_t), &set) < 0)
               printf("\n Call is failing with:%d", errno);
       }
       -----------------------------------------------------------------------
      
      This happens because the kernel assumes that the len argument of
      sched_getaffinity() is bigger than NR_CPUS, which is no longer true.
      
      Now we are faced with the following annoying dilemma, due to
      the limitations of the glibc interface built in years ago:
      
       (1) if we change glibc's __CPU_SETSIZE definition, we lose
           binary compatibility of _all_ applications.
      
       (2) if we don't change it, we also lose binary compatibility for
           Sharyathi's use case.
      
      So, I would propose changing the rule for the len argument of
      sched_getaffinity().
      
      Old:
      	len should be bigger than NR_CPUS
      New:
      	len should be bigger than maximum possible cpu id
      
      This creates the following behavior:
      
       (A) On a real 4096-cpu machine, the above test program still
           returns -EINVAL.
      
       (B) With NR_CPUS=4096 on a machine that has fewer than 1024 cpus
           (almost all machines in the world), the above runs successfully.
      
      Fortunately, big SGI machines are mainly used for HPC use cases, which
      means they can rebuild their programs.
      
      IOW we hope they are not annoyed by this issue ...
      Reported-by: Sharyathi Nagesh <sharyath@in.ibm.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Ulrich Drepper <drepper@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Russ Anderson <rja@sgi.com>
      Cc: Mike Travis <travis@sgi.com>
      LKML-Reference: <20100312161316.9520.A69D9226@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  18. 12 Mar, 2010 4 commits
    • M
      sched: Remove SYNC_WAKEUPS feature · c6ee36c4
      Mike Galbraith committed
      Sync wakeups are critical functionality with a long history, so there is
      no reason to keep them behind a feature flag.  Remove the flag; we don't
      need the branch or icache footprint.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1268301817.6785.47.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • M
      sched: Cleanup/optimize clock updates · a64692a3
      Mike Galbraith committed
      Now that we no longer depend on the clock being updated prior to enqueueing
      on migratory wakeup, we can clean up a bit, placing calls to update_rq_clock()
      exactly where they are needed, ie on enqueue, dequeue and schedule events.
      
      In the case of a freshly enqueued task immediately preempting, we can skip the
      update during preemption, as the clock was just updated by the enqueue event.
      We also save an unneeded call during a migratory wakeup by not updating the
      previous runqueue, where update_curr() won't be invoked.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1268301199.6785.32.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • M
      sched: Remove avg_overlap · e12f31d3
      Mike Galbraith committed
      Both avg_overlap and avg_wakeup had an inherent problem in that their accuracy
      was detrimentally affected by cross-cpu wakeups, because we are missing
      the necessary call to update_curr().  This can't be fixed without increasing
      overhead in our already too fat fastpath.
      
      Additionally, with recent load balancing changes making us prefer to place tasks
      in an idle cache domain (which is good for compute bound loads), communicating
      tasks suffer when a sync wakeup, which would enable affine placement, is turned
      into a non-sync wakeup by SYNC_LESS.  With one task on the runqueue, wake_affine()
      rejects the affine wakeup request, leaving the unfortunate where placed, taking
      frequent cache misses.
      
      Remove it, and recover some fastpath cycles.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1268301121.6785.30.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • M
      sched: Remove avg_wakeup · b42e0c41
      Mike Galbraith committed
      Testing the load which led to this heuristic (nfs4 kbuild) shows that it has
      outlived its usefulness.  With intervening load balancing changes, I cannot
      see any difference with/without, so recover those fastpath cycles.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1268301062.6785.29.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>