1. 08 Oct, 2010 1 commit
    • sched: fix RCU lockdep splat from task_group() · 6506cf6c
      Peter Zijlstra authored
      This addresses the following RCU lockdep splat:
      
      [0.051203] CPU0: AMD QEMU Virtual CPU version 0.12.4 stepping 03
      [0.052999] lockdep: fixing up alternatives.
      [0.054105]
      [0.054106] ===================================================
      [0.054999] [ INFO: suspicious rcu_dereference_check() usage. ]
      [0.054999] ---------------------------------------------------
      [0.054999] kernel/sched.c:616 invoked rcu_dereference_check() without protection!
      [0.054999]
      [0.054999] other info that might help us debug this:
      [0.054999]
      [0.054999]
      [0.054999] rcu_scheduler_active = 1, debug_locks = 1
      [0.054999] 3 locks held by swapper/1:
      [0.054999]  #0:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff814be933>] cpu_up+0x42/0x6a
      [0.054999]  #1:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff810400d8>] cpu_hotplug_begin+0x2a/0x51
      [0.054999]  #2:  (&rq->lock){-.-...}, at: [<ffffffff814be2f7>] init_idle+0x2f/0x113
      [0.054999]
      [0.054999] stack backtrace:
      [0.054999] Pid: 1, comm: swapper Not tainted 2.6.35 #1
      [0.054999] Call Trace:
      [0.054999]  [<ffffffff81068054>] lockdep_rcu_dereference+0x9b/0xa3
      [0.054999]  [<ffffffff810325c3>] task_group+0x7b/0x8a
      [0.054999]  [<ffffffff810325e5>] set_task_rq+0x13/0x40
      [0.054999]  [<ffffffff814be39a>] init_idle+0xd2/0x113
      [0.054999]  [<ffffffff814be78a>] fork_idle+0xb8/0xc7
      [0.054999]  [<ffffffff81068717>] ? mark_held_locks+0x4d/0x6b
      [0.054999]  [<ffffffff814bcebd>] do_fork_idle+0x17/0x2b
      [0.054999]  [<ffffffff814bc89b>] native_cpu_up+0x1c1/0x724
      [0.054999]  [<ffffffff814bcea6>] ? do_fork_idle+0x0/0x2b
      [0.054999]  [<ffffffff814be876>] _cpu_up+0xac/0x127
      [0.054999]  [<ffffffff814be946>] cpu_up+0x55/0x6a
      [0.054999]  [<ffffffff81ab562a>] kernel_init+0xe1/0x1ff
      [0.054999]  [<ffffffff81003854>] kernel_thread_helper+0x4/0x10
      [0.054999]  [<ffffffff814c353c>] ? restore_args+0x0/0x30
      [0.054999]  [<ffffffff81ab5549>] ? kernel_init+0x0/0x1ff
      [0.054999]  [<ffffffff81003850>] ? kernel_thread_helper+0x0/0x10
      [0.056074] Booting Node   0, Processors  #1lockdep: fixing up alternatives.
      [0.130045]  #2lockdep: fixing up alternatives.
      [0.203089]  #3 Ok.
      [0.275286] Brought up 4 CPUs
      [0.276005] Total of 4 processors activated (16017.17 BogoMIPS).
      
      The cgroup_subsys_state structures referenced by idle tasks are never
      freed, because the idle tasks should be part of the root cgroup,
      which is not removable.
      
      The problem is that while we do in-fact hold rq->lock, the newly spawned
      idle thread's cpu is not yet set to the correct cpu so the lockdep check
      in task_group():
      
        lockdep_is_held(&task_rq(p)->lock)
      
      will fail.
      
      But this is a chicken and egg problem.  Setting the CPU's runqueue requires
      that the CPU's runqueue already be set.  ;-)
      
      So insert an RCU read-side critical section to avoid the complaint.
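
      A minimal sketch of the resulting code in init_idle() (illustrative, not
      the verbatim hunk):

        raw_spin_lock_irqsave(&rq->lock, flags);
        /*
         * The idle task's cpu is not yet set, so the
         * lockdep_is_held(&task_rq(p)->lock) expression in task_group()
         * cannot justify the dereference; an explicit RCU read-side
         * critical section keeps rcu_dereference_check() happy.
         */
        rcu_read_lock();
        __set_task_cpu(idle, cpu);
        rcu_read_unlock();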
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      6506cf6c
  2. 15 Sep, 2010 1 commit
  3. 10 Sep, 2010 1 commit
  4. 23 Aug, 2010 1 commit
    • mutex: Improve the scalability of optimistic spinning · 9d0f4dcc
      Tim Chen authored
      There is a scalability issue with the current implementation of optimistic
      mutex spin in the kernel.  It was found on an 8 node, 64 core Nehalem-EX
      system (HT mode).
      
      The intention of the optimistic mutex spin is to busy wait and spin on a
      mutex if the owner of the mutex is running, in the hope that the mutex
      will be released soon and can be acquired without the thread trying to
      acquire the mutex going to sleep. However, when we have a large number of
      threads contending for the mutex, the mutex could be grabbed by another
      thread, and then another..., and we will keep spinning, wasting cpu
      cycles and adding to the contention.  One possible fix is to quit
      spinning and put the current thread on the wait-list if the mutex switches to
      a new owner while we spin, indicating heavy contention (see the patch
      included).
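
      A simplified sketch of the idea (the real check lives in
      mutex_spin_on_owner(); this is illustrative, not the exact hunk):

        /* Spin while the owner we started with still holds the mutex. */
        while (lock->owner == owner) {
                if (need_resched())
                        return 0;
                cpu_relax();
        }

        /*
         * The lock switched straight to a new owner rather than being
         * released: heavy contention, so stop spinning and let the
         * caller queue itself on the wait-list.
         */
        if (lock->owner)
                return 0;

        return 1;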
      
      I did some testing on an 8 socket Nehalem-EX system with a total of 64
      cores. Using Ingo's test-mutex program that creates/deletes files with 256
      threads (http://lkml.org/lkml/2006/1/8/50), I see the following speed up
      after putting in the mutex spin fix:
      
       ./mutex-test V 256 10
                       Ops/sec
       2.6.34          62864
       With fix        197200
      
      Repeating the test with Aim7 fserver workload, again there is a speed up
      with the fix:
      
                       Jobs/min
       2.6.34          91657
       With fix        149325
      
      To look at the impact on the distribution of mutex acquisition time, I
      collected the mutex acquisition time on Aim7 fserver workload with some
      instrumentation.  The average acquisition time is reduced by 48% and
      number of contentions reduced by 32%.
      
                       #contentions    Time to acquire mutex (cycles)
       2.6.34          72973           44765791
       With fix        49210           23067129
      
      The histogram of mutex acquisition time is listed below.  The acquisition
      time is in 2^bin cycles.  We see that without the fix, the acquisition
      time is mostly around 2^26 cycles.  With the fix, the distribution gets
      spread out a lot more towards the lower cycles, starting from 2^13.
      However, there is an increase in the tail of the distribution with the fix at
      2^28 and 2^29 cycles.  It seems a small price to pay for the reduced
      average acquisition time and also getting the cpu to do useful work.
      
       Mutex acquisition time distribution (acq time = 2^bin cycles):
               2.6.34                  With Fix
       bin     #occurrence     %       #occurrence     %
       11      2               0.00%   120             0.24%
       12      10              0.01%   790             1.61%
       13      14              0.02%   2058            4.18%
       14      86              0.12%   3378            6.86%
       15      393             0.54%   4831            9.82%
       16      710             0.97%   4893            9.94%
       17      815             1.12%   4667            9.48%
       18      790             1.08%   5147            10.46%
       19      580             0.80%   6250            12.70%
       20      429             0.59%   6870            13.96%
       21      311             0.43%   1809            3.68%
       22      255             0.35%   2305            4.68%
       23      317             0.44%   916             1.86%
       24      610             0.84%   233             0.47%
       25      3128            4.29%   95              0.19%
       26      63902           87.69%  122             0.25%
       27      619             0.85%   286             0.58%
       28      0               0.00%   3536            7.19%
       29      0               0.00%   903             1.83%
       30      0               0.00%   0               0.00%
      
      I've done similar experiments with the 2.6.35 kernel on smaller boxes as
      well.  One is on a dual-socket Westmere box (12 cores total, with HT).
      Another experiment is on an old dual-socket Core 2 box (4 cores total, no
      HT).
      
      On the 12-core Westmere box, I see a 250% increase for Ingo's mutex-test
      program with my mutex patch but no significant difference in aim7's
      fserver workload.
      
      On the 4-core Core 2 box, I see that the differences with the patch for both
      mutex-test and aim7 fserver are negligible.
      
      So far, it seems like the patch has not caused regression on smaller
      systems.
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: <stable@kernel.org> # .35.x
      LKML-Reference: <1282168827.9542.72.camel@schen9-DESK>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9d0f4dcc
  5. 17 Jul, 2010 2 commits
  6. 01 Jul, 2010 1 commit
    • sched: Cure nr_iowait_cpu() users · 8c215bd3
      Peter Zijlstra authored
      Commit 0224cf4c (sched: Intoduce get_cpu_iowait_time_us())
      broke things by not making sure preemption was indeed disabled
      by the callers of nr_iowait_cpu() which took the iowait value of
      the current cpu.
      
      This resulted in a heap of preempt warnings. Cure this by making
      nr_iowait_cpu() take a cpu number and fix up the callers to pass
      in the right number.
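
      The interface change is roughly (sketch):

        unsigned long nr_iowait_cpu(int cpu)
        {
                struct rq *this = cpu_rq(cpu);

                return atomic_read(&this->nr_iowait);
        }

      so that callers such as the nohz/cpuidle code pass an explicit cpu
      number instead of implicitly reading the current cpu's value.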
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Maxim Levitsky <maximlevitsky@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: linux-pm@lists.linux-foundation.org
      LKML-Reference: <1277968037.1868.120.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8c215bd3
  7. 25 Jun, 2010 1 commit
  8. 24 Jun, 2010 1 commit
  9. 22 Jun, 2010 1 commit
  10. 18 Jun, 2010 2 commits
  11. 09 Jun, 2010 9 commits
    • sched: Change nohz idle load balancing logic to push model · 83cd4fe2
      Venkatesh Pallipadi authored
      In the new push model, all idle CPUs indeed go into nohz mode. There is
      still the concept of an idle load balancer (performing the load balancing
      on behalf of all the idle cpus in the system). A busy CPU kicks the nohz
      balancer when any of the nohz CPUs need idle load balancing.
      The kickee CPU does the idle load balancing on behalf of all idle CPUs
      instead of the normal idle balance.
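
      A rough sketch of the kick path (function and field names here are
      illustrative, not necessarily those in the patch):

        static void nohz_balancer_kick(int cpu)
        {
                int ilb_cpu = get_nohz_load_balancer();  /* designated idle balancer */

                if (ilb_cpu < 0)
                        return;

                /*
                 * Mark the balancer as kicked and wake it with a resched IPI;
                 * it will then do idle load balancing for all nohz CPUs.
                 */
                if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
                        cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
                        smp_send_reschedule(ilb_cpu);
                }
        }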
      
      This addresses the below two problems with the current nohz ilb logic:
      * the idle load balancer continued to have periodic ticks during idle and
        wokeup frequently, even though it did not have any rebalancing to do on
        behalf of any of the idle CPUs.
      * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this
        periodic wakeup can result in a periodic additional interrupt on a CPU
        doing the timer broadcast.
      
      Also, currently we are migrating the unpinned timers from an idle cpu to the cpu
      doing idle load balancing (when all the cpus in the system are idle,
      there is no idle load balancing cpu and timers get added to the same idle cpu
      where the request was made. So the existing optimization works only on a semi-idle
      system).
      
      And in a semi-idle system, we no longer have periodic ticks on the idle load
      balancer CPU. Using that cpu will add more delays to the timers than intended
      (as that cpu's timer base may not be up to date wrt jiffies etc). This was
      causing mysterious slowdowns during boot etc.
      
      For now, in the semi idle case, use the nearest busy cpu for migrating timers
      from an idle cpu.  This is good for power-savings anyway.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <1274486981.2840.46.camel@sbs-t61.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      83cd4fe2
    • sched: Avoid side-effect of tickless idle on update_cpu_load · fdf3e95d
      Venkatesh Pallipadi authored
      tickless idle has a negative side effect on update_cpu_load(), which
      in turn can affect load balancing behavior.
      
      update_cpu_load() is supposed to be called every tick, to keep track
      of various load indices. With tickless idle, there are no scheduler
      ticks called on the idle CPUs. Idle CPUs may still do load balancing
      (with the idle_load_balance CPU) using the stale cpu_load. It will also
      cause problems when all CPUs go idle for a while and become active
      again. In this case loads would not degrade as expected.
      
      This is what the change in rq->nr_load_updates looks like under different
      conditions:
      
      <cpu_num> <nr_load_updates change>
      All CPUS idle for 10 seconds (HZ=1000)
      0 1621
      10 496
      11 139
      12 875
      13 1672
      14 12
      15 21
      1 1472
      2 2426
      3 1161
      4 2108
      5 1525
      6 701
      7 249
      8 766
      9 1967
      
      One CPU busy rest idle for 10 seconds
      0 10003
      10 601
      11 95
      12 966
      13 1597
      14 114
      15 98
      1 3457
      2 93
      3 6679
      4 1425
      5 1479
      6 595
      7 193
      8 633
      9 1687
      
      All CPUs busy for 10 seconds
      0 10026
      10 10026
      11 10026
      12 10026
      13 10025
      14 10025
      15 10025
      1 10026
      2 10026
      3 10026
      4 10026
      5 10026
      6 10026
      7 10026
      8 10026
      9 10026
      
      That is, update_cpu_load() works properly only when all CPUs are busy.
      If all are idle, all the CPUs get far fewer updates.  And when a few
      CPUs are busy and the rest are idle, only the busy and ilb CPUs do proper
      updates and the rest of the idle CPUs get fewer updates.
      
      The patch keeps track of when the last update was done and fixes up
      the load avg based on the current time.
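
      A simplified sketch of that fixup (the actual patch uses precomputed
      decay-factor tables rather than a loop; this helper is illustrative):

        /*
         * cpu_load[idx] normally decays by (2^idx - 1)/2^idx per tick;
         * apply that decay for every tick missed while the cpu was tickless.
         */
        static unsigned long decay_load_missed(unsigned long load,
                                               unsigned long missed, int idx)
        {
                if (!missed)
                        return load;
                if (!idx)
                        return 0;       /* cpu_load[0] carries no history */

                while (missed--)
                        load = (load * ((1UL << idx) - 1)) >> idx;

                return load;
        }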
      
      On one of my test systems, SPECjbb with warehouses 1..numcpus, the patch
      improves throughput numbers by ~1% (average of 6 runs).  On another
      test system (with a different domain hierarchy) there is no noticeable
      change in perf.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <AANLkTilLtDWQsAUrIxJ6s04WTgmw9GuOODc5AOrYsaR5@mail.gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      fdf3e95d
    • sched: Simplify the reacquire_kernel_lock() logic · 246d86b5
      Oleg Nesterov authored
      - Contrary to what 6d558c3a says, there is no need to reload
        prev = rq->curr after the context switch. You always schedule
        back to where you came from, prev must be equal to current
        even if cpu/rq was changed.
      
      - This also means reacquire_kernel_lock() can use prev instead
        of current.
      
      - No need to reassign switch_count if reacquire_kernel_lock()
        reports need_resched(), we can just move the initial assignment
        down, under the "need_resched_nonpreemptible:" label.
      
      - Try to update the comment after context_switch().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100519125711.GA30199@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      246d86b5
    • sched_clock: Add local_clock() API and improve documentation · c676329a
      Peter Zijlstra authored
      For people who otherwise get to write: cpu_clock(smp_processor_id()),
      there is now: local_clock().
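
      A usage sketch:

        u64 now;

        now = cpu_clock(smp_processor_id());   /* the old spelling */
        now = local_clock();                   /* equivalent, for the local cpu */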
      
      Also, as per suggestion from Andrew, provide some documentation on
      the various clock interfaces, and minimize the unsigned long long vs
      u64 mess.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      LKML-Reference: <1275052414.1645.52.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c676329a
    • sched: add hooks for workqueue · 21aa9af0
      Tejun Heo authored
      Concurrency managed workqueue needs to know when workers are going to
      sleep and waking up.  Using these two hooks, cmwq keeps track of the
      current concurrency level and throttles execution of new works if it's
      too high and wakes up another worker from the sleep hook if it becomes
      too low.
      
      This patch introduces PF_WQ_WORKER to identify workqueue workers and
      adds the following two hooks.
      
      * wq_worker_waking_up(): called when a worker is woken up.
      
      * wq_worker_sleeping(): called when a worker is going to sleep and may
        return a pointer to a local task which should be woken up.  The
        returned task is woken up using try_to_wake_up_local() which is
        simplified ttwu which is called under rq lock and can only wake up
        local tasks.
      
      Both hooks are currently defined as noop in kernel/workqueue_sched.h.
      Later cmwq implementation will replace them with proper
      implementation.
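
      The noop stubs look roughly like this (sketch of kernel/workqueue_sched.h):

        static inline void wq_worker_waking_up(struct task_struct *task,
                                               unsigned int cpu)
        {
        }

        static inline struct task_struct *wq_worker_sleeping(struct task_struct *task,
                                                              unsigned int cpu)
        {
                return NULL;
        }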
      
      These hooks are hard coded as they'll always be enabled.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      21aa9af0
    • sched: refactor try_to_wake_up() · 9ed3811a
      Tejun Heo authored
      Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up().
      The factoring out doesn't affect try_to_wake_up() much
      code-generation-wise.  Depending on configuration options, it ends up
      generating the same object code as before or slightly different one
      due to different register assignment.
      
      This is to help future implementation of try_to_wake_up_local().
      
      Mike Galbraith suggested renaming ttwu_woken_up() to ttwu_post_activation()
      and updating the comment in try_to_wake_up().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      9ed3811a
    • sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining · 3a101d05
      Tejun Heo authored
      Currently, when a cpu goes down, cpu_active is cleared before
      CPU_DOWN_PREPARE starts and the cpuset configuration is updated from a
      default priority cpu notifier.  When a cpu is coming up, cpu_active is set
      before CPU_ONLINE but the cpuset configuration is again updated from the
      same cpu notifier.
      
      For cpu notifiers, this presents an inconsistent state.  Threads which
      a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
      migrated to other cpus because the cpu is no longer active.
      
      Fix it by updating cpu_active in the highest priority cpu notifier and
      cpuset configuration in the second highest when a cpu is coming up.
      Down path is updated similarly.  This guarantees that all other cpu
      notifiers see consistent cpu_active and cpuset configuration.
      
      The cpuset_track_online_cpus() notifier is converted to
      cpuset_update_active_cpus(), which just updates the configuration and
      is now called from the cpuset_cpu_[in]active() notifiers registered from
      sched_init_smp().  If cpuset is disabled, cpuset_update_active_cpus()
      degenerates into partition_sched_domains(), making a separate notifier
      for !CONFIG_CPUSETS unnecessary.
      
      This problem is triggered by cmwq.  During CPU_DOWN_PREPARE, hotplug
      callback creates a kthread and kthread_bind()s it to the target cpu,
      and the thread is expected to run on that cpu.
      
      * Ingo's test discovered __cpuinit/exit markups were incorrect.
        Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Paul Menage <menage@google.com>
      3a101d05
    • sched: define and use CPU_PRI_* enums for cpu notifier priorities · 50a323b7
      Tejun Heo authored
      Instead of hardcoding priorities 10 and 20 in sched and perf, collect
      them into CPU_PRI_* enums.
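
      Roughly (a sketch; the actual values live in include/linux/cpu.h):

        enum {
                /* migration should happen before other stuff but after perf */
                CPU_PRI_PERF            = 20,
                CPU_PRI_MIGRATION       = 10,
        };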
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      50a323b7
    • sched: Fix PROVE_RCU vs cpu_cgroup · dc61b1d6
      Peter Zijlstra authored
      PROVE_RCU has a few issues with the cpu_cgroup because the scheduler
      typically holds rq->lock around the css rcu derefs but the generic
      cgroup code doesn't (and can't) know about that lock.
      
      Provide means to add extra checks to the css dereference and use that
      in the scheduler to annotate its users.
      
      The addition of rq->lock to these checks is correct because the
      cgroup_subsys::attach() method takes the rq->lock for each task it
      moves, therefore by holding that lock, we ensure the task is pinned to
      the current cgroup and the RCU dereference is valid.
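
      The annotated dereference in the scheduler then looks roughly like:

        static inline struct task_group *task_group(struct task_struct *p)
        {
                struct cgroup_subsys_state *css;

                css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
                                lockdep_is_held(&task_rq(p)->lock));
                return container_of(css, struct task_group, css);
        }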
      
      That leaves one genuine race in __sched_setscheduler() where we used
      task_group() without holding any of the required locks and thus raced
      with the cgroup code. Solve this by moving the check under the
      appropriate lock.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      dc61b1d6
  12. 04 Jun, 2010 1 commit
    • tracing/sched: Make preempt_schedule() notrace · d1f74e20
      Steven Rostedt authored
      The function tracer code uses ftrace_preempt_disable() to disable
      preemption instead of normal preempt_disable(). But there's a slight
      race condition that may cause it to lose a preemption check.
      
      ftrace_preempt_disable() was made to keep the function tracer from recursing
      on itself: disabling preemption and then having the enable call the function
      tracer again would cause infinite recursion.
      
      The bug was assumed to happen if the call was just in schedule, but
      this is incorrect. The bug is caused by preempt_schedule() which
      is called by preempt_enable(). The calling of preempt_enable() when
      NEED_RESCHED was set would call preempt_schedule() which would call
      the function tracer again.
      
      By making preempt_schedule() and add_preempt_count() notrace,
      the infinite recursion is prevented. This is because
      add_preempt_count() would stop the preempt_enable() in the
      function tracer from calling preempt_schedule() again.
      
      The sub_preempt_count() is also made notrace just to keep it
      symmetric.
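
      Sketched as declarations, the annotations involved (bodies unchanged;
      other attributes omitted):

        asmlinkage void __sched notrace preempt_schedule(void);
        void notrace add_preempt_count(int val);
        void notrace sub_preempt_count(int val);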
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      d1f74e20
  13. 01 Jun, 2010 1 commit
  14. 30 May, 2010 1 commit
  15. 28 May, 2010 1 commit
  16. 21 May, 2010 2 commits
    • kdb: core for kgdb back end (2 of 2) · 67fc4e0c
      Jason Wessel authored
      This patch contains the hooks and instrumentation into the kernel which
      live outside the kernel/debug directory, and which the kdb core
      will call to run commands like lsmod, dmesg, bt etc...
      
      CC: linux-arch@vger.kernel.org
      Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
      Signed-off-by: Martin Hicks <mort@sgi.com>
      67fc4e0c
    • wait_event_interruptible_locked() interface · 22c43c81
      Michal Nazarewicz authored
      New wait_event_interruptible{,_exclusive}_locked{,_irq} macros are added.
      They work just like the versions without the _locked* suffix but require the
      wait queue's lock to be held.  Also, __wake_up_locked() is now exported
      so as to pair it with the above macros.
      
      The use case for this new facility is when one uses the wait queue's lock
      to protect a data structure.  This may be advantageous if the
      structure needs to be protected by a spinlock anyway.  In particular,
      with an additional spinlock the following code has to be used to wait
      for a condition:
      
      spin_lock(&data.lock);
      ...
      for (ret = 0; !ret && !(condition); ) {
      	spin_unlock(&data.lock);
      	ret = wait_event_interruptible(data.wqh, (condition));
      	spin_lock(&data.lock);
      }
      ...
      spin_unlock(&data.lock);
      
      This looks bizarre; plus, wait_event_interruptible() locks the wait
      queue's lock anyway, so there is an unlock+lock sequence where it could
      be avoided.
      
      To avoid those problems and benefit from wait queue's lock, a code
      similar to the following should be used:
      
      /* Waiting */
      spin_lock(&data.wqh.lock);
      ...
      ret = wait_event_interruptible_locked(data.wqh, (condition));
      ...
      spin_unlock(&data.wqh.lock);
      
      /* Waiting exclusively */
      spin_lock(&data.wqh.lock);
      ...
      ret = wait_event_interruptible_exclusive_locked(data.wqh, (condition));
      ...
      spin_unlock(&data.wqh.lock);
      
      /* Waking up */
      spin_lock(&data.wqh.lock);
      ...
      wake_up_locked(&data.wqh);
      ...
      spin_unlock(&data.wqh.lock);
      
      When spin_lock_irq() is used, the matching versions of the macros need to
      be used (*_locked_irq()).
      Signed-off-by: Michal Nazarewicz <m.nazarewicz@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      22c43c81
  17. 11 May, 2010 2 commits
    • sched, wait: Use wrapper functions · a93d2f17
      Changli Gao authored
      epoll should not touch flags in wait_queue_t. This patch introduces a new
      function, __add_wait_queue_exclusive(), for users who use the wait queue as a
      LIFO queue.

      __add_wait_queue_tail_exclusive() is introduced too, replacing
      add_wait_queue_exclusive_locked(). remove_wait_queue_locked() is removed, as
      it is a duplicate of __remove_wait_queue(), disliked by users, and with fewer
      users.
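
      The new helpers are thin wrappers, roughly:

        static inline void __add_wait_queue_exclusive(wait_queue_head_t *q,
                                                      wait_queue_t *wait)
        {
                wait->flags |= WQ_FLAG_EXCLUSIVE;
                __add_wait_queue(q, wait);
        }

        static inline void __add_wait_queue_tail_exclusive(wait_queue_head_t *q,
                                                            wait_queue_t *wait)
        {
                wait->flags |= WQ_FLAG_EXCLUSIVE;
                __add_wait_queue_tail(q, wait);
        }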
      Signed-off-by: Changli Gao <xiaosuo@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: <containers@lists.linux-foundation.org>
      LKML-Reference: <1273214006-2979-1-git-send-email-xiaosuo@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a93d2f17
    • rcu: refactor RCU's context-switch handling · 25502a6c
      Paul E. McKenney authored
      The addition of preemptible RCU to treercu resulted in a bit of
      confusion and inefficiency surrounding the handling of context switches
      for RCU-sched and for RCU-preempt.  For RCU-sched, a context switch
      is a quiescent state, pure and simple, just like it always has been.
      For RCU-preempt, a context switch is in no way a quiescent state, but
      special handling is required when a task blocks in an RCU read-side
      critical section.
      
      However, the callout from the scheduler and the outer loop in ksoftirqd
      still call something named rcu_sched_qs(), whose name is no longer
      accurate.  Furthermore, when rcu_check_callbacks() notes an RCU-sched
      quiescent state, it ends up unnecessarily (though harmlessly, aside
      from the performance hit) enqueuing the current task if it happens to
      be running in an RCU-preempt read-side critical section.  This not only
      increases the maximum latency of scheduler_tick(), it also needlessly
      increases the overhead of the next outermost rcu_read_unlock() invocation.
      
      This patch addresses this situation by separating the notion of RCU's
      context-switch handling from that of RCU-sched's quiescent states.
      The context-switch handling is covered by rcu_note_context_switch() in
      general and by rcu_preempt_note_context_switch() for preemptible RCU.
      This permits rcu_sched_qs() to handle quiescent states and only quiescent
      states.  It also reduces the maximum latency of scheduler_tick(), though
      probably by much less than a microsecond.  Finally, it means that tasks
      within preemptible-RCU read-side critical sections avoid incurring the
      overhead of queuing unless there really is a context switch.
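
      The resulting split is roughly:

        void rcu_note_context_switch(int cpu)
        {
                rcu_sched_qs(cpu);                      /* RCU-sched: a quiescent state   */
                rcu_preempt_note_context_switch(cpu);   /* RCU-preempt: blocked-task work */
        }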
      Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      25502a6c
  18. 07 May, 2010 5 commits
    • sched: Remove rq argument to the tracepoints · 27a9da65
      Peter Zijlstra authored
      struct rq isn't visible outside of sched.o, so it's nearly useless to
      expose the pointer; also there are no users of it, so remove it.
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1272997616.1642.207.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      27a9da65
    • rcu: need barrier() in UP synchronize_sched_expedited() · fc390cde
      Paul E. McKenney authored
      If synchronize_sched_expedited() is ever to be called from within
      kernel/sched.c in a !SMP PREEMPT kernel, the !SMP implementation needs
      a barrier().
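
      The !SMP stub then becomes, roughly:

        void synchronize_sched_expedited(void)
        {
                barrier();
        }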
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      fc390cde
    • sched: correctly place paranioa memory barriers in synchronize_sched_expedited() · cc631fb7
      Paul E. McKenney authored
      The memory barriers must be in the SMP case, not in the !SMP case.
      Also add a barrier after the atomic_inc() in order to ensure that
      other CPUs see post-synchronize_sched_expedited() actions as following
      the expedited grace period.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      cc631fb7
    • sched: kill paranoia check in synchronize_sched_expedited() · 94458d5e
      Tejun Heo authored
      The paranoid check which verifies that the cpu_stop callback is
      actually called on all online cpus is completely superfluous.  It's
      guaranteed by the cpu_stop facility, and if it didn't work as advertised
      other things would go horribly wrong and trying to recover using
      synchronize_sched() wouldn't be very meaningful.
      
      Kill the paranoid check.  Removal of this feature is done as a
      separate step so that it can serve as a bisection point if something
      actually goes wrong.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Josh Triplett <josh@freedesktop.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      94458d5e
    • sched: replace migration_thread with cpu_stop · 969c7921
      Tejun Heo authored
      Currently migration_thread is serving three purposes - migration
      pusher, context to execute active_load_balance() and forced context
      switcher for expedited RCU synchronize_sched.  All three roles are
      hardcoded into migration_thread() and determining which job is
      scheduled is slightly messy.
      
      This patch kills migration_thread and replaces all three uses with
      cpu_stop.  The three different roles of migration_thread() are
      split into three separate cpu_stop callbacks -
      migration_cpu_stop(), active_load_balance_cpu_stop() and
      synchronize_sched_expedited_cpu_stop() - and each use case now simply
      asks cpu_stop to execute the callback as necessary.
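
      A sketch of what a user now looks like (the task-migration case,
      approximate):

        struct migration_arg arg = { p, dest_cpu };

        task_rq_unlock(rq, &flags);
        /* Run migration_cpu_stop() in the stopper thread on the task's cpu. */
        stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);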
      
      synchronize_sched_expedited() was implemented with private
      preallocated resources and custom multi-cpu queueing and waiting
      logic, both of which are provided by cpu_stop.
      synchronize_sched_expedited_count is made atomic and all other shared
      resources along with the mutex are dropped.
      
      synchronize_sched_expedited() also implemented a check to detect cases
      where not all the callbacks got executed on their assigned cpus and
      to fall back to synchronize_sched().  If called with cpu hotplug blocked,
      cpu_stop already guarantees that and the condition cannot happen;
      otherwise, stop_machine() would break.  However, this patch preserves
      the paranoid check using a cpumask to record on which cpus the stopper
      ran so that it can serve as a bisection point if something actually
      goes wrong there.
      
      Because the internal execution state is no longer visible,
      rcu_expedited_torture_stats() is removed.
      
      This patch also renames the cpu_stop threads from "stopper/%d" to
      "migration/%d".  The names of these threads ultimately don't matter
      and there's no reason to make unnecessary userland-visible changes.
      
      With this patch applied, stop_machine() and sched now share the same
      resources.  stop_machine() is faster without wasting any resources and
      sched migration users are much cleaner.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Josh Triplett <josh@freedesktop.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      969c7921
  19. 30 Apr, 2010 1 commit
  20. 23 Apr, 2010 3 commits
  21. 15 Apr, 2010 1 commit
  22. 06 Apr, 2010 1 commit
    • sched: Fix sched_getaffinity() · 84fba5ec
      Anton Blanchard authored
      taskset on 2.6.34-rc3 fails on one of my ppc64 test boxes with
      the following error:
      
        sched_getaffinity(0, 16, 0x10029650030) = -1 EINVAL (Invalid argument)
      
      This box has 128 threads and 16 bytes is enough to cover it.
      
      Commit cd3d8031 (sched:
      sched_getaffinity(): Allow less than NR_CPUS length) is
      comparing these 16 bytes against nr_cpu_ids.
      
      Fix it by comparing nr_cpu_ids to the number of bits in the
      cpumask we pass in.
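
      The check in sys_sched_getaffinity() then becomes, roughly:

        /* len is in bytes, nr_cpu_ids is in bits */
        if ((len * BITS_PER_BYTE) < nr_cpu_ids)
                return -EINVAL;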
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Sharyathi Nagesh <sharyath@in.ibm.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Russ Anderson <rja@sgi.com>
      Cc: Mike Travis <travis@sgi.com>
      LKML-Reference: <20100406070218.GM5594@kryten>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      84fba5ec