1. 27 5月, 2011 5 次提交
    • P
      rcu: Decrease memory-barrier usage based on semi-formal proof · 23b5c8fa
      Paul E. McKenney 提交于
      (Note: this was reverted, and is now being re-applied in pieces, with
      this being the fifth and final piece.  See below for the reason that
      it is now felt to be safe to re-apply this.)
      
      Commit d09b62df fixed grace-period synchronization, but left some smp_mb()
      invocations in rcu_process_callbacks() that are no longer needed, but
      sheer paranoia prevented them from being removed.  This commit removes
      them and provides a proof of correctness in their absence.  It also adds
      a memory barrier to rcu_report_qs_rsp() immediately before the update to
      rsp->completed in order to handle the theoretical possibility that the
      compiler or CPU might move massive quantities of code into a lock-based
      critical section.  This also proves that the sheer paranoia was not
      entirely unjustified, at least from a theoretical point of view.
      
      In addition, the old dyntick-idle synchronization depended on the fact
      that grace periods were many milliseconds in duration, so that it could
      be assumed that no dyntick-idle CPU could reorder a memory reference
      across an entire grace period.  Unfortunately for this design, the
      addition of expedited grace periods breaks this assumption, which has
      the unfortunate side-effect of requiring atomic operations in the
      functions that track dyntick-idle state for RCU.  (There is some hope
      that the algorithms used in user-level RCU might be applied here, but
      some work is required to handle the NMIs that user-space applications
      can happily ignore.  For the short term, better safe than sorry.)
      
      This proof assumes that neither compiler nor CPU will allow a lock
      acquisition and release to be reordered, as doing so can result in
      deadlock.  The proof is as follows:
      
      1.	A given CPU declares a quiescent state under the protection of
      	its leaf rcu_node's lock.
      
      2.	If there is more than one level of rcu_node hierarchy, the
      	last CPU to declare a quiescent state will also acquire the
      	->lock of the next rcu_node up in the hierarchy,  but only
      	after releasing the lower level's lock.  The acquisition of this
      	lock clearly cannot occur prior to the acquisition of the leaf
      	node's lock.
      
      3.	Step 2 repeats until we reach the root rcu_node structure.
      	Please note again that only one lock is held at a time through
      	this process.  The acquisition of the root rcu_node's ->lock
      	must occur after the release of that of the leaf rcu_node.
      
      4.	At this point, we set the ->completed field in the rcu_state
      	structure in rcu_report_qs_rsp().  However, if the rcu_node
      	hierarchy contains only one rcu_node, then in theory the code
      	preceding the quiescent state could leak into the critical
      	section.  We therefore precede the update of ->completed with a
      	memory barrier.  All CPUs will therefore agree that any updates
      	preceding any report of a quiescent state will have happened
      	before the update of ->completed.
      
      5.	Regardless of whether a new grace period is needed, rcu_start_gp()
      	will propagate the new value of ->completed to all of the leaf
      	rcu_node structures, under the protection of each rcu_node's ->lock.
      	If a new grace period is needed immediately, this propagation
      	will occur in the same critical section that ->completed was
      	set in, but courtesy of the memory barrier in #4 above, is still
      	seen to follow any pre-quiescent-state activity.
      
      6.	When a given CPU invokes __rcu_process_gp_end(), it becomes
      	aware of the end of the old grace period and therefore makes
      	any RCU callbacks that were waiting on that grace period eligible
      	for invocation.
      
      	If this CPU is the same one that detected the end of the grace
      	period, and if there is but a single rcu_node in the hierarchy,
      	we will still be in the single critical section.  In this case,
      	the memory barrier in step #4 guarantees that all callbacks will
      	be seen to execute after each CPU's quiescent state.
      
      	On the other hand, if this is a different CPU, it will acquire
      	the leaf rcu_node's ->lock, and will again be serialized after
      	each CPU's quiescent state for the old grace period.
      
      On the strength of this proof, this commit therefore removes the memory
      barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
      The effect is to reduce the number of memory barriers by one and to
      reduce the frequency of execution from about once per scheduling tick
      per CPU to once per grace period.
      
      This was reverted do to hangs found during testing by Yinghai Lu and
      Ingo Molnar.  Frederic Weisbecker supplied Yinghai with tracing that
      located the underlying problem, and Frederic also provided the fix.
      
      The underlying problem was that the HARDIRQ_ENTER() macro from
      lib/locking-selftest.c invoked irq_enter(), which in turn invokes
      rcu_irq_enter(), but HARDIRQ_EXIT() invoked __irq_exit(), which
      does not invoke rcu_irq_exit().  This situation resulted in calls
      to rcu_irq_enter() that were not balanced by the required calls to
      rcu_irq_exit().  Therefore, after these locking selftests completed,
      RCU's dyntick-idle nesting count was a large number (for example,
      72), which caused RCU to to conclude that the affected CPU was not in
      dyntick-idle mode when in fact it was.
      
      RCU would therefore incorrectly wait for this dyntick-idle CPU, resulting
      in hangs.
      
      In contrast, with Frederic's patch, which replaces the irq_enter()
      in HARDIRQ_ENTER() with an __irq_enter(), these tests don't ever call
      either rcu_irq_enter() or rcu_irq_exit(), which works because the CPU
      running the test is already marked as not being in dyntick-idle mode.
      This means that the rcu_irq_enter() and rcu_irq_exit() calls and RCU
      then has no problem working out which CPUs are in dyntick-idle mode and
      which are not.
      
      The reason that the imbalance was not noticed before the barrier patch
      was applied is that the old implementation of rcu_enter_nohz() ignored
      the nesting depth.  This could still result in delays, but much shorter
      ones.  Whenever there was a delay, RCU would IPI the CPU with the
      unbalanced nesting level, which would eventually result in rcu_enter_nohz()
      being called, which in turn would force RCU to see that the CPU was in
      dyntick-idle mode.
      
      The reason that very few people noticed the problem is that the mismatched
      irq_enter() vs. __irq_exit() occured only when the kernel was built with
      CONFIG_DEBUG_LOCKING_API_SELFTESTS.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      23b5c8fa
    • P
      rcu: Make rcu_enter_nohz() pay attention to nesting · 4305ce78
      Paul E. McKenney 提交于
      The old version of rcu_enter_nohz() forced RCU into nohz mode even if
      the nesting count was non-zero.  This change causes rcu_enter_nohz()
      to hold off for non-zero nesting counts.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      4305ce78
    • P
      rcu: Don't do reschedule unless in irq · b5904090
      Paul E. McKenney 提交于
      Condition the set_need_resched() in rcu_irq_exit() on in_irq().  This
      should be a no-op, because rcu_irq_exit() should only be called from irq.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      b5904090
    • P
      rcu: Remove old memory barriers from rcu_process_callbacks() · 1135633b
      Paul E. McKenney 提交于
      Second step of partitioning of commit e59fb312.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      1135633b
    • P
      rcu: Add memory barriers · 0bbcc529
      Paul E. McKenney 提交于
      Add the memory barriers added by e59fb312.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      0bbcc529
  2. 21 5月, 2011 1 次提交
    • L
      sanitize <linux/prefetch.h> usage · 268bb0ce
      Linus Torvalds 提交于
      Commit e66eed65 ("list: remove prefetching from regular list
      iterators") removed the include of prefetch.h from list.h, which
      uncovered several cases that had apparently relied on that rather
      obscure header file dependency.
      
      So this fixes things up a bit, using
      
         grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
         grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
      
      to guide us in finding files that either need <linux/prefetch.h>
      inclusion, or have it despite not needing it.
      
      There are more of them around (mostly network drivers), but this gets
      many core ones.
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      268bb0ce
  3. 20 5月, 2011 1 次提交
  4. 08 5月, 2011 1 次提交
  5. 06 5月, 2011 18 次提交
  6. 18 12月, 2010 5 次提交
    • P
      rcu: reduce __call_rcu()-induced contention on rcu_node structures · b52573d2
      Paul E. McKenney 提交于
      When the current __call_rcu() function was written, the expedited
      APIs did not exist.  The __call_rcu() implementation therefore went
      to great lengths to detect the end of old grace periods and to start
      new ones, all in the name of reducing grace-period latency.  Now the
      expedited APIs do exist, and the usage of __call_rcu() has increased
      considerably.  This commit therefore causes __call_rcu() to avoid
      worrying about grace periods unless there are a large number of
      RCU callbacks stacked up on the current CPU.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      b52573d2
    • P
      rcu: limit rcu_node leaf-level fanout · 0209f649
      Paul E. McKenney 提交于
      Some recent benchmarks have indicated possible lock contention on the
      leaf-level rcu_node locks.  This commit therefore limits the number of
      CPUs per leaf-level rcu_node structure to 16, in other words, there
      can be at most 16 rcu_data structures fanning into a given rcu_node
      structure.  Prior to this, the limit was 32 on 32-bit systems and 64 on
      64-bit systems.
      
      Note that the fanout of non-leaf rcu_node structures is unchanged.  The
      organization of accesses to the rcu_node tree is such that references
      to non-leaf rcu_node structures are much less frequent than to the
      leaf structures.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      0209f649
    • P
      rcu: fine-tune grace-period begin/end checks · 121dfc4b
      Paul E. McKenney 提交于
      Use the CPU's bit in rnp->qsmask to determine whether or not the CPU
      should try to report a quiescent state.  Handle overflow in the check
      for rdp->gpnum having fallen behind.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      121dfc4b
    • F
      rcu: Keep gpnum and completed fields synchronized · 5ff8e6f0
      Frederic Weisbecker 提交于
      When a CPU that was in an extended quiescent state wakes
      up and catches up with grace periods that remote CPUs
      completed on its behalf, we update the completed field
      but not the gpnum that keeps a stale value of a backward
      grace period ID.
      
      Later, note_new_gpnum() will interpret the shift between
      the local CPU and the node grace period ID as some new grace
      period to handle and will then start to hunt quiescent state.
      
      But if every grace periods have already been completed, this
      interpretation becomes broken. And we'll be stuck in clusters
      of spurious softirqs because rcu_report_qs_rdp() will make
      this broken state run into infinite loop.
      
      The solution, as suggested by Lai Jiangshan, is to ensure that
      the gpnum and completed fields are well synchronized when we catch
      up with completed grace periods on their behalf by other cpus.
      This way we won't start noting spurious new grace periods.
      Suggested-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Steven Rostedt <rostedt@goodmis.org
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      5ff8e6f0
    • F
      rcu: Stop chasing QS if another CPU did it for us · 20377f32
      Frederic Weisbecker 提交于
      When a CPU is idle and others CPUs handled its extended
      quiescent state to complete grace periods on its behalf,
      it will catch up with completed grace periods numbers
      when it wakes up.
      
      But at this point there might be no more grace period to
      complete, but still the woken CPU always keeps its stale
      qs_pending value and will then continue to chase quiescent
      states even if its not needed anymore.
      
      This results in clusters of spurious softirqs until a new
      real grace period is started. Because if we continue to
      chase quiescent states but we have completed every grace
      periods, rcu_report_qs_rdp() is puzzled and makes that
      state run into infinite loops.
      
      As suggested by Lai Jiangshan, just reset qs_pending if
      someone completed every grace periods on our behalf.
      Suggested-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      20377f32
  7. 17 12月, 2010 1 次提交
  8. 30 11月, 2010 2 次提交
  9. 08 10月, 2010 1 次提交
  10. 24 9月, 2010 1 次提交
    • P
      rcu: Add tracing data to support queueing models · 269dcc1c
      Paul E. McKenney 提交于
      The current tracing data is not sufficient to deduce the average time
      that a callback spends waiting for a grace period to end.  Add three
      per-CPU counters recording the number of callbacks invoked (ci), the
      number of callbacks orphaned (co), and the number of callbacks adopted
      (ca).  Given the existing callback queue length (ql), the average wait
      time in absence of CPU hotplug operations is ql/ci.  The units of wait
      time will be in terms of the duration over which ci was measured.
      
      In the presence of CPU hotplug operations, there is room for argument,
      but ql/(ci-co+ca) won't steer you too far wrong.
      
      Also fixes a typo called out by Lucas De Marchi <lucas.de.marchi@gmail.com>.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      269dcc1c
  11. 21 8月, 2010 2 次提交
  12. 20 8月, 2010 2 次提交