1. 13 7月, 2011 1 次提交
  2. 17 6月, 2011 1 次提交
  3. 16 6月, 2011 1 次提交
  4. 15 6月, 2011 2 次提交
    • S
      rcu: Use softirq to address performance regression · 09223371
      Shaohua Li 提交于
      Commit a26ac245(rcu: move TREE_RCU from softirq to kthread)
      introduced performance regression. In an AIM7 test, this commit degraded
      performance by about 40%.
      
      The commit runs rcu callbacks in a kthread instead of softirq. We observed
      high rate of context switch which is caused by this. Out test system has
      64 CPUs and HZ is 1000, so we saw more than 64k context switch per second
      which is caused by RCU's per-CPU kthread.  A trace showed that most of
      the time the RCU per-CPU kthread doesn't actually handle any callbacks,
      but instead just does a very small amount of work handling grace periods.
      This means that RCU's per-CPU kthreads are making the scheduler do quite
      a bit of work in order to allow a very small amount of RCU-related
      processing to be done.
      
      Alex Shi's analysis determined that this slowdown is due to lock
      contention within the scheduler.  Unfortunately, as Peter Zijlstra points
      out, the scheduler's real-time semantics require global action, which
      means that this contention is inherent in real-time scheduling.  (Yes,
      perhaps someone will come up with a workaround -- otherwise, -rt is not
      going to do well on large SMP systems -- but this patch will work around
      this issue in the meantime.  And "the meantime" might well be forever.)
      
      This patch therefore re-introduces softirq processing to RCU, but only
      for core RCU work.  RCU callbacks are still executed in kthread context,
      so that only a small amount of RCU work runs in softirq context in the
      common case.  This should minimize ksoftirqd execution, allowing us to
      skip boosting of ksoftirqd for CONFIG_RCU_BOOST=y kernels.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Tested-by: N"Alex,Shi" <alex.shi@intel.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      09223371
    • P
      rcu: Simplify curing of load woes · 9a432736
      Paul E. McKenney 提交于
      Make the functions creating the kthreads wake them up.  Leverage the
      fact that the per-node and boost kthreads can run anywhere, thus
      dispensing with the need to wake them up once the incoming CPU has
      gone fully online.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: NDaniel J Blueman <daniel.blueman@gmail.com>
      9a432736
  5. 31 5月, 2011 1 次提交
  6. 28 5月, 2011 2 次提交
  7. 27 5月, 2011 1 次提交
    • P
      rcu: Decrease memory-barrier usage based on semi-formal proof · 23b5c8fa
      Paul E. McKenney 提交于
      (Note: this was reverted, and is now being re-applied in pieces, with
      this being the fifth and final piece.  See below for the reason that
      it is now felt to be safe to re-apply this.)
      
      Commit d09b62df fixed grace-period synchronization, but left some smp_mb()
      invocations in rcu_process_callbacks() that are no longer needed, but
      sheer paranoia prevented them from being removed.  This commit removes
      them and provides a proof of correctness in their absence.  It also adds
      a memory barrier to rcu_report_qs_rsp() immediately before the update to
      rsp->completed in order to handle the theoretical possibility that the
      compiler or CPU might move massive quantities of code into a lock-based
      critical section.  This also proves that the sheer paranoia was not
      entirely unjustified, at least from a theoretical point of view.
      
      In addition, the old dyntick-idle synchronization depended on the fact
      that grace periods were many milliseconds in duration, so that it could
      be assumed that no dyntick-idle CPU could reorder a memory reference
      across an entire grace period.  Unfortunately for this design, the
      addition of expedited grace periods breaks this assumption, which has
      the unfortunate side-effect of requiring atomic operations in the
      functions that track dyntick-idle state for RCU.  (There is some hope
      that the algorithms used in user-level RCU might be applied here, but
      some work is required to handle the NMIs that user-space applications
      can happily ignore.  For the short term, better safe than sorry.)
      
      This proof assumes that neither compiler nor CPU will allow a lock
      acquisition and release to be reordered, as doing so can result in
      deadlock.  The proof is as follows:
      
      1.	A given CPU declares a quiescent state under the protection of
      	its leaf rcu_node's lock.
      
      2.	If there is more than one level of rcu_node hierarchy, the
      	last CPU to declare a quiescent state will also acquire the
      	->lock of the next rcu_node up in the hierarchy,  but only
      	after releasing the lower level's lock.  The acquisition of this
      	lock clearly cannot occur prior to the acquisition of the leaf
      	node's lock.
      
      3.	Step 2 repeats until we reach the root rcu_node structure.
      	Please note again that only one lock is held at a time through
      	this process.  The acquisition of the root rcu_node's ->lock
      	must occur after the release of that of the leaf rcu_node.
      
      4.	At this point, we set the ->completed field in the rcu_state
      	structure in rcu_report_qs_rsp().  However, if the rcu_node
      	hierarchy contains only one rcu_node, then in theory the code
      	preceding the quiescent state could leak into the critical
      	section.  We therefore precede the update of ->completed with a
      	memory barrier.  All CPUs will therefore agree that any updates
      	preceding any report of a quiescent state will have happened
      	before the update of ->completed.
      
      5.	Regardless of whether a new grace period is needed, rcu_start_gp()
      	will propagate the new value of ->completed to all of the leaf
      	rcu_node structures, under the protection of each rcu_node's ->lock.
      	If a new grace period is needed immediately, this propagation
      	will occur in the same critical section that ->completed was
      	set in, but courtesy of the memory barrier in #4 above, is still
      	seen to follow any pre-quiescent-state activity.
      
      6.	When a given CPU invokes __rcu_process_gp_end(), it becomes
      	aware of the end of the old grace period and therefore makes
      	any RCU callbacks that were waiting on that grace period eligible
      	for invocation.
      
      	If this CPU is the same one that detected the end of the grace
      	period, and if there is but a single rcu_node in the hierarchy,
      	we will still be in the single critical section.  In this case,
      	the memory barrier in step #4 guarantees that all callbacks will
      	be seen to execute after each CPU's quiescent state.
      
      	On the other hand, if this is a different CPU, it will acquire
      	the leaf rcu_node's ->lock, and will again be serialized after
      	each CPU's quiescent state for the old grace period.
      
      On the strength of this proof, this commit therefore removes the memory
      barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
      The effect is to reduce the number of memory barriers by one and to
      reduce the frequency of execution from about once per scheduling tick
      per CPU to once per grace period.
      
      This was reverted do to hangs found during testing by Yinghai Lu and
      Ingo Molnar.  Frederic Weisbecker supplied Yinghai with tracing that
      located the underlying problem, and Frederic also provided the fix.
      
      The underlying problem was that the HARDIRQ_ENTER() macro from
      lib/locking-selftest.c invoked irq_enter(), which in turn invokes
      rcu_irq_enter(), but HARDIRQ_EXIT() invoked __irq_exit(), which
      does not invoke rcu_irq_exit().  This situation resulted in calls
      to rcu_irq_enter() that were not balanced by the required calls to
      rcu_irq_exit().  Therefore, after these locking selftests completed,
      RCU's dyntick-idle nesting count was a large number (for example,
      72), which caused RCU to to conclude that the affected CPU was not in
      dyntick-idle mode when in fact it was.
      
      RCU would therefore incorrectly wait for this dyntick-idle CPU, resulting
      in hangs.
      
      In contrast, with Frederic's patch, which replaces the irq_enter()
      in HARDIRQ_ENTER() with an __irq_enter(), these tests don't ever call
      either rcu_irq_enter() or rcu_irq_exit(), which works because the CPU
      running the test is already marked as not being in dyntick-idle mode.
      This means that the rcu_irq_enter() and rcu_irq_exit() calls and RCU
      then has no problem working out which CPUs are in dyntick-idle mode and
      which are not.
      
      The reason that the imbalance was not noticed before the barrier patch
      was applied is that the old implementation of rcu_enter_nohz() ignored
      the nesting depth.  This could still result in delays, but much shorter
      ones.  Whenever there was a delay, RCU would IPI the CPU with the
      unbalanced nesting level, which would eventually result in rcu_enter_nohz()
      being called, which in turn would force RCU to see that the CPU was in
      dyntick-idle mode.
      
      The reason that very few people noticed the problem is that the mismatched
      irq_enter() vs. __irq_exit() occured only when the kernel was built with
      CONFIG_DEBUG_LOCKING_API_SELFTESTS.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      23b5c8fa
  8. 20 5月, 2011 1 次提交
  9. 08 5月, 2011 1 次提交
  10. 06 5月, 2011 11 次提交
    • P
      rcu: fix spelling · 6cc68793
      Paul E. McKenney 提交于
      The "preemptible" spelling is preferable.  May as well fix it.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      6cc68793
    • L
      rcu: call __rcu_read_unlock() in exit_rcu for tree RCU · 13491a0e
      Lai Jiangshan 提交于
      Using __rcu_read_lock() in place of rcu_read_lock() leaves any debug
      state as it really should be, namely with the lock still held.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      13491a0e
    • P
      rcu: fix tracing bug thinko on boost-balk attribution · a9f4793d
      Paul E. McKenney 提交于
      The rcu_initiate_boost_trace() function mis-attributed refusals to
      initiate RCU priority boosting that were in fact due to its not yet
      being time to boost.  This patch fixes the faulty comparison.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      a9f4793d
    • P
      rcu: add tracing for RCU's kthread run states. · d71df90e
      Paul E. McKenney 提交于
      Add tracing to help debugging situations when RCU's kthreads are not
      running but are supposed to be.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      d71df90e
    • P
      rcu: Add boosting to TREE_PREEMPT_RCU tracing · 0ea1f2eb
      Paul E. McKenney 提交于
      Includes total number of tasks boosted, number boosted on behalf of each
      of normal and expedited grace periods, and statistics on attempts to
      initiate boosting that failed for various reasons.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      0ea1f2eb
    • P
      rcu: Force per-rcu_node kthreads off of the outgoing CPU · 0f962a5e
      Paul E. McKenney 提交于
      The scheduler has had some heartburn in the past when too many real-time
      kthreads were affinitied to the outgoing CPU.  So, this commit lightens
      the load by forcing the per-rcu_node and the boost kthreads off of the
      outgoing CPU.  Note that RCU's per-CPU kthread remains on the outgoing
      CPU until the bitter end, as it must in order to preserve correctness.
      
      Also avoid disabling hardirqs across calls to set_cpus_allowed_ptr(),
      given that this function can block.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      0f962a5e
    • P
      rcu: priority boosting for TREE_PREEMPT_RCU · 27f4d280
      Paul E. McKenney 提交于
      Add priority boosting for TREE_PREEMPT_RCU, similar to that for
      TINY_PREEMPT_RCU.  This is enabled by the default-off RCU_BOOST
      kernel parameter.  The priority to which to boost preempted
      RCU readers is controlled by the RCU_BOOST_PRIO kernel parameter
      (defaulting to real-time priority 1) and the time to wait before
      boosting the readers who are blocking a given grace period is
      controlled by the RCU_BOOST_DELAY kernel parameter (defaulting to
      500 milliseconds).
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      27f4d280
    • P
      rcu: move TREE_RCU from softirq to kthread · a26ac245
      Paul E. McKenney 提交于
      If RCU priority boosting is to be meaningful, callback invocation must
      be boosted in addition to preempted RCU readers.  Otherwise, in presence
      of CPU real-time threads, the grace period ends, but the callbacks don't
      get invoked.  If the callbacks don't get invoked, the associated memory
      doesn't get freed, so the system is still subject to OOM.
      
      But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
      moves the callback invocations to a kthread, which can be boosted easily.
      
      Also add comments and properly synchronized all accesses to
      rcu_cpu_kthread_task, as suggested by Lai Jiangshan.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      a26ac245
    • P
      rcu: merge TREE_PREEPT_RCU blocked_tasks[] lists · 12f5f524
      Paul E. McKenney 提交于
      Combine the current TREE_PREEMPT_RCU ->blocked_tasks[] lists in the
      rcu_node structure into a single ->blkd_tasks list with ->gp_tasks
      and ->exp_tasks tail pointers.  This is in preparation for RCU priority
      boosting, which will add a third dimension to the combinatorial explosion
      in the ->blocked_tasks[] case, but simply a third pointer in the new
      ->blkd_tasks case.
      
      Also update documentation to reflect blocked_tasks[] merge
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      12f5f524
    • P
      rcu: Decrease memory-barrier usage based on semi-formal proof · e59fb312
      Paul E. McKenney 提交于
      Commit d09b62df fixed grace-period synchronization, but left some smp_mb()
      invocations in rcu_process_callbacks() that are no longer needed, but
      sheer paranoia prevented them from being removed.  This commit removes
      them and provides a proof of correctness in their absence.  It also adds
      a memory barrier to rcu_report_qs_rsp() immediately before the update to
      rsp->completed in order to handle the theoretical possibility that the
      compiler or CPU might move massive quantities of code into a lock-based
      critical section.  This also proves that the sheer paranoia was not
      entirely unjustified, at least from a theoretical point of view.
      
      In addition, the old dyntick-idle synchronization depended on the fact
      that grace periods were many milliseconds in duration, so that it could
      be assumed that no dyntick-idle CPU could reorder a memory reference
      across an entire grace period.  Unfortunately for this design, the
      addition of expedited grace periods breaks this assumption, which has
      the unfortunate side-effect of requiring atomic operations in the
      functions that track dyntick-idle state for RCU.  (There is some hope
      that the algorithms used in user-level RCU might be applied here, but
      some work is required to handle the NMIs that user-space applications
      can happily ignore.  For the short term, better safe than sorry.)
      
      This proof assumes that neither compiler nor CPU will allow a lock
      acquisition and release to be reordered, as doing so can result in
      deadlock.  The proof is as follows:
      
      1.	A given CPU declares a quiescent state under the protection of
      	its leaf rcu_node's lock.
      
      2.	If there is more than one level of rcu_node hierarchy, the
      	last CPU to declare a quiescent state will also acquire the
      	->lock of the next rcu_node up in the hierarchy,  but only
      	after releasing the lower level's lock.  The acquisition of this
      	lock clearly cannot occur prior to the acquisition of the leaf
      	node's lock.
      
      3.	Step 2 repeats until we reach the root rcu_node structure.
      	Please note again that only one lock is held at a time through
      	this process.  The acquisition of the root rcu_node's ->lock
      	must occur after the release of that of the leaf rcu_node.
      
      4.	At this point, we set the ->completed field in the rcu_state
      	structure in rcu_report_qs_rsp().  However, if the rcu_node
      	hierarchy contains only one rcu_node, then in theory the code
      	preceding the quiescent state could leak into the critical
      	section.  We therefore precede the update of ->completed with a
      	memory barrier.  All CPUs will therefore agree that any updates
      	preceding any report of a quiescent state will have happened
      	before the update of ->completed.
      
      5.	Regardless of whether a new grace period is needed, rcu_start_gp()
      	will propagate the new value of ->completed to all of the leaf
      	rcu_node structures, under the protection of each rcu_node's ->lock.
      	If a new grace period is needed immediately, this propagation
      	will occur in the same critical section that ->completed was
      	set in, but courtesy of the memory barrier in #4 above, is still
      	seen to follow any pre-quiescent-state activity.
      
      6.	When a given CPU invokes __rcu_process_gp_end(), it becomes
      	aware of the end of the old grace period and therefore makes
      	any RCU callbacks that were waiting on that grace period eligible
      	for invocation.
      
      	If this CPU is the same one that detected the end of the grace
      	period, and if there is but a single rcu_node in the hierarchy,
      	we will still be in the single critical section.  In this case,
      	the memory barrier in step #4 guarantees that all callbacks will
      	be seen to execute after each CPU's quiescent state.
      
      	On the other hand, if this is a different CPU, it will acquire
      	the leaf rcu_node's ->lock, and will again be serialized after
      	each CPU's quiescent state for the old grace period.
      
      On the strength of this proof, this commit therefore removes the memory
      barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
      The effect is to reduce the number of memory barriers by one and to
      reduce the frequency of execution from about once per scheduling tick
      per CPU to once per grace period.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      e59fb312
    • P
      rcu: Remove conditional compilation for RCU CPU stall warnings · a00e0d71
      Paul E. McKenney 提交于
      The RCU CPU stall warnings can now be controlled using the
      rcu_cpu_stall_suppress boot-time parameter or via the same parameter
      from sysfs.  There is therefore no longer any reason to have
      kernel config parameters for this feature.  This commit therefore
      removes the RCU_CPU_STALL_DETECTOR and RCU_CPU_STALL_DETECTOR_RUNNABLE
      kernel config parameters.  The RCU_CPU_STALL_TIMEOUT parameter remains
      to allow the timeout to be tuned and the RCU_CPU_STALL_VERBOSE parameter
      remains to allow task-stall information to be suppressed if desired.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      a00e0d71
  11. 18 12月, 2010 1 次提交
    • T
      rcu: increase synchronize_sched_expedited() batching · e27fc964
      Tejun Heo 提交于
      The fix in commit #6a0cc49 requires more than three concurrent instances
      of synchronize_sched_expedited() before batching is possible.  This
      patch uses a ticket-counter-like approach that is also not unrelated to
      Lai Jiangshan's Ring RCU to allow sharing of expedited grace periods even
      when there are only two concurrent instances of synchronize_sched_expedited().
      
      This commit builds on Tejun's original posting, which may be found at
      http://lkml.org/lkml/2010/11/9/204, adding memory barriers, avoiding
      overflow of signed integers (other than via atomic_t), and fixing the
      detection of batching.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      e27fc964
  12. 30 11月, 2010 4 次提交
    • P
      rcu: fix race condition in synchronize_sched_expedited() · db3a8920
      Paul E. McKenney 提交于
      The new (early 2010) implementation of synchronize_sched_expedited() uses
      try_stop_cpu() to force a context switch on every CPU.  It also permits
      concurrent calls to synchronize_sched_expedited() to share a single call
      to try_stop_cpu() through use of an atomically incremented
      synchronize_sched_expedited_count variable.  Unfortunately, this is
      subject to failure as follows:
      
      o	Task A invokes synchronize_sched_expedited(), try_stop_cpus()
      	succeeds, but Task A is preempted before getting to the atomic
      	increment of synchronize_sched_expedited_count.
      
      o	Task B also invokes synchronize_sched_expedited(), with exactly
      	the same outcome as Task A.
      
      o	Task C also invokes synchronize_sched_expedited(), again with
      	exactly the same outcome as Tasks A and B.
      
      o	Task D also invokes synchronize_sched_expedited(), but only
      	gets as far as acquiring the mutex within try_stop_cpus()
      	before being preempted, interrupted, or otherwise delayed.
      
      o	Task E also invokes synchronize_sched_expedited(), but only
      	gets to the snapshotting of synchronize_sched_expedited_count.
      
      o	Tasks A, B, and C all increment synchronize_sched_expedited_count.
      
      o	Task E fails to get the mutex, so checks the new value
      	of synchronize_sched_expedited_count.  It finds that the
      	value has increased, so (wrongly) assumes that its work
      	has been done, returning despite there having been no
      	expedited grace period since it began.
      
      The solution is to have the lowest-numbered CPU atomically increment
      the synchronize_sched_expedited_count variable within the
      synchronize_sched_expedited_cpu_stop() function, which is under
      the protection of the mutex acquired by try_stop_cpus().  However, this
      also requires that piggybacking tasks wait for three rather than two
      instances of try_stop_cpu(), because we cannot control the order in
      which the per-CPU callback function occur.
      
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      db3a8920
    • P
      rcu: update documentation/comments for Lai's adoption patch · 2d999e03
      Paul E. McKenney 提交于
      Lai's RCU-callback immediate-adoption patch changes the RCU tracing
      output, so update tracing.txt.  Also update a few comments to clarify
      the synchronization design.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      2d999e03
    • L
      rcu,cleanup: simplify the code when cpu is dying · 29494be7
      Lai Jiangshan 提交于
      When we handle the CPU_DYING notifier, the whole system is stopped except
      for the current CPU.  We therefore need no synchronization with the other
      CPUs.  This allows us to move any orphaned RCU callbacks directly to the
      list of any online CPU without needing to run them through the global
      orphan lists.  These global orphan lists can therefore be dispensed with.
      This commit makes thes changes, though currently victimizes CPU 0 @@@.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      29494be7
    • L
      rcu,cleanup: move synchronize_sched_expedited() out of sched.c · 7b27d547
      Lai Jiangshan 提交于
      The first version of synchronize_sched_expedited() used the migration
      code in the scheduler, and was therefore implemented in kernel/sched.c.
      However, the more recent version of this code no longer uses the
      migration code, so this commit moves it to the main RCU source files.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      7b27d547
  13. 03 9月, 2010 1 次提交
  14. 21 8月, 2010 3 次提交
  15. 20 8月, 2010 2 次提交
  16. 12 5月, 2010 1 次提交
  17. 11 5月, 2010 5 次提交
    • P
      d822ed10
    • P
      rcu: RCU_FAST_NO_HZ must check RCU dyntick state · 77e38ed3
      Paul E. McKenney 提交于
      The current version of RCU_FAST_NO_HZ reproduces the old CLASSIC_RCU
      dyntick-idle bug, as it fails to detect CPUs that have interrupted
      or NMIed out of dyntick-idle mode.  Fix this by making rcu_needs_cpu()
      check the state in the per-CPU rcu_dynticks variables, thus correctly
      detecting the dyntick-idle state from an RCU perspective.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      77e38ed3
    • P
      rcu: print boot-time console messages if RCU configs out of ordinary · 26845c28
      Paul E. McKenney 提交于
      Print boot-time messages if tracing is enabled, if fanout is set
      to non-default values, if exact fanout is specified, if accelerated
      dyntick-idle grace periods have been enabled, if RCU-lockdep is enabled,
      if rcutorture has been boot-time enabled, if the CPU stall detector has
      been disabled, or if four-level hierarchy has been enabled.
      
      This is all for TREE_RCU and TREE_PREEMPT_RCU.  TINY_RCU will be handled
      separately, if at all.
      Suggested-by: NJosh Triplett <josh@joshtriplett.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      26845c28
    • P
      rcu: refactor RCU's context-switch handling · 25502a6c
      Paul E. McKenney 提交于
      The addition of preemptible RCU to treercu resulted in a bit of
      confusion and inefficiency surrounding the handling of context switches
      for RCU-sched and for RCU-preempt.  For RCU-sched, a context switch
      is a quiescent state, pure and simple, just like it always has been.
      For RCU-preempt, a context switch is in no way a quiescent state, but
      special handling is required when a task blocks in an RCU read-side
      critical section.
      
      However, the callout from the scheduler and the outer loop in ksoftirqd
      still calls something named rcu_sched_qs(), whose name is no longer
      accurate.  Furthermore, when rcu_check_callbacks() notes an RCU-sched
      quiescent state, it ends up unnecessarily (though harmlessly, aside
      from the performance hit) enqueuing the current task if it happens to
      be running in an RCU-preempt read-side critical section.  This not only
      increases the maximum latency of scheduler_tick(), it also needlessly
      increases the overhead of the next outermost rcu_read_unlock() invocation.
      
      This patch addresses this situation by separating the notion of RCU's
      context-switch handling from that of RCU-sched's quiescent states.
      The context-switch handling is covered by rcu_note_context_switch() in
      general and by rcu_preempt_note_context_switch() for preemptible RCU.
      This permits rcu_sched_qs() to handle quiescent states and only quiescent
      states.  It also reduces the maximum latency of scheduler_tick(), though
      probably by much less than a microsecond.  Finally, it means that tasks
      within preemptible-RCU read-side critical sections avoid incurring the
      overhead of queuing unless there really is a context switch.
      Suggested-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      25502a6c
    • L
      rcu: ignore offline CPUs in last non-dyntick-idle CPU check · 5db35673
      Lai Jiangshan 提交于
      Offline CPUs are not in nohz_cpu_mask, but can be ignored when checking
      for the last non-dyntick-idle CPU.  This patch therefore only checks
      online CPUs for not being dyntick idle, allowing fast entry into
      full-system dyntick-idle state even when there are some offline CPUs.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      5db35673
  18. 28 2月, 2010 1 次提交