1. 23 Sep, 2012 19 commits
    • rcu: Shrink RCU based on number of CPUs · b17c7035
      Paul E. McKenney committed
      Currently, rcu_init_geometry() only reshapes RCU's combining trees
      if the leaf fanout is changed at boot time.  This means that by
      default, kernels compiled with (say) NR_CPUS=4096 will keep oversized
      data structures, even when running on systems with (say) four CPUs.
      
      This commit therefore checks to see if the maximum number of CPUs on
      the actual running system (nr_cpu_ids) differs from NR_CPUS, and if so
      reshapes the combining trees accordingly.
      Reported-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      b17c7035
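The reshaping decision above comes down to how many rcu_node levels a given CPU count requires. The following is a minimal sketch of that arithmetic, not the kernel's rcu_init_geometry(). The helper name `tree_levels()` is hypothetical, and the fanout values in the usage note (16 CPUs per leaf, 64 children per interior node) are assumptions chosen for illustration.

```c
/* Hypothetical helper: number of rcu_node levels needed to cover ncpus
 * CPUs, given the leaf fanout (CPUs per leaf node) and the interior
 * fanout (children per non-leaf node).  Returns 0 for invalid input. */
static int tree_levels(unsigned long ncpus, unsigned long leaf_fanout,
                       unsigned long fanout)
{
    unsigned long capacity = leaf_fanout; /* CPUs a one-level tree covers */
    int levels = 1;

    if (ncpus == 0 || leaf_fanout == 0 || fanout < 2)
        return 0;
    while (capacity < ncpus) {
        capacity *= fanout;               /* add one level above */
        levels++;
    }
    return levels;
}
```

With the assumed fanouts, `tree_levels(4, 16, 64)` needs a single node, whereas `tree_levels(4096, 16, 64)` needs three levels, which is the size difference the commit avoids carrying on small systems.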
    • rcu: Handle unbalanced rcu_node configurations with few CPUs · 4dbd6bb3
      Paul E. McKenney committed
      If CONFIG_RCU_FANOUT_EXACT=y, and if there are not enough CPUs (according
      to nr_cpu_ids) to require more than a single rcu_node structure, but
      NR_CPUS is larger than would fit into a single rcu_node structure, then
      the current rcu_init_levelspread() code is subject to integer overflow
      in the eight-bit ->levelspread[] array in the rcu_state structure.
      
      In this case, the solution is -not- to increase the size of the
      elements in this array because the values in that array should be
      constrained to the number of bits in an unsigned long.  Instead, this
      commit replaces NR_CPUS with nr_cpu_ids in the rcu_init_levelspread()
      function's initialization of the cprv local variable.  This results in
      all of the arithmetic being consistently based off of the nr_cpu_ids
      value, thus avoiding the overflow, which was caused by the mixing of
      nr_cpu_ids and NR_CPUS.
      Reported-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      4dbd6bb3
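The overflow described above can be reproduced in a small userspace model. This is a sketch under assumptions: `init_levelspread()` is a hypothetical stand-in that uses ceiling division and an eight-bit result array, and it is not a copy of the kernel's rcu_init_levelspread().

```c
#include <stdint.h>

/* Model of a level-spread computation: working from the leaves upward,
 * each level's spread is the ceiling of (items at the level below)
 * divided by (nodes at this level), stored in eight bits. */
static void init_levelspread(uint8_t spread[], const int levelcnt[],
                             int num_lvls, int ncpus)
{
    int cprv = ncpus;   /* item count at the level below; starts with CPUs */
    int i;

    for (i = num_lvls - 1; i >= 0; i--) {
        int ccur = levelcnt[i];

        /* The cast makes the eight-bit truncation explicit. */
        spread[i] = (uint8_t)((cprv + ccur - 1) / ccur);
        cprv = ccur;
    }
}
```

Seeding `ncpus` with a compile-time maximum such as 4096 for a single-node tree stores 4096 modulo 256, which is zero, whereas seeding it with the running system's CPU count (say, four) stores four.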
    • rcu: Simplify quiescent-state detection · d7d6a11e
      Paul E. McKenney committed
      The current quiescent-state detection algorithm is needlessly
      complex.  It records the grace-period number corresponding to
      the quiescent state at the time of the quiescent state, which
      works, but it seems better to simply erase any record of previous
      quiescent states at the time that the CPU notices the new grace
      period.  This has the further advantage of removing another piece
      of RCU for which lockless reasoning is required.
      
      Therefore, this commit makes this change.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      d7d6a11e
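The simplified scheme can be illustrated as follows. This is a sketch under assumptions: the structure and function names are hypothetical, and the kernel's per-CPU rcu_data handling involves locking and memory ordering that this model omits.

```c
#include <stdbool.h>

struct cpu_qs {
    unsigned long gpnum;   /* most recent grace period this CPU has noticed */
    bool passed_quiesce;   /* quiescent state seen since noticing gpnum? */
};

/* Called when the CPU passes through a quiescent state. */
static void note_quiescent_state(struct cpu_qs *c)
{
    c->passed_quiesce = true;
}

/* Called when the CPU compares its view with the global grace period. */
static void note_new_gp(struct cpu_qs *c, unsigned long global_gpnum)
{
    if (c->gpnum != global_gpnum) {
        c->gpnum = global_gpnum;
        c->passed_quiesce = false;   /* erase any stale quiescent state */
    }
}

/* Only a quiescent state recorded after the notice may be reported. */
static bool can_report_qs(const struct cpu_qs *c, unsigned long global_gpnum)
{
    return c->gpnum == global_gpnum && c->passed_quiesce;
}
```

Because the flag is cleared at the moment the new grace period is noticed, no grace-period number needs to be stored alongside each quiescent state.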
    • rcu: Adjust for unconditional ->completed assignment · 25d30cf4
      Paul E. McKenney committed
      Now that the rcu_node structures' ->completed fields are unconditionally
      assigned at grace-period cleanup time, they should already have the
      correct value for the new grace period at grace-period initialization
      time.  This commit therefore inserts a WARN_ON_ONCE() to verify this
      invariant.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      25d30cf4
    • rcu: Add random PROVE_RCU_DELAY to grace-period initialization · 661a85dc
      Paul E. McKenney committed
      Preemption greatly raised the probability of certain types of race
      conditions, so this commit adds an anti-heisenbug to greatly increase
      the collision cross section, also known as the probability of occurrence.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      661a85dc
    • rcu: Fix day-zero grace-period initialization/cleanup race · 5d4b8659
      Paul E. McKenney committed
      The current approach to grace-period initialization is vulnerable to
      extremely low-probability races.  These races stem from the fact that
      the old grace period is marked completed on the same traversal through
      the rcu_node structure that is marking the start of the new grace period.
      This means that some rcu_node structures will believe that the old grace
      period is still in effect at the same time that other rcu_node structures
      believe that the new grace period has already started.
      
      These sorts of disagreements can result in too-short grace periods,
      as shown in the following scenario:
      
      1.	CPU 0 completes a grace period, but needs an additional
      	grace period, so starts initializing one, initializing all
      	the non-leaf rcu_node structures and the first leaf rcu_node
      	structure.  Because CPU 0 is both completing the old grace
      	period and starting a new one, it marks the completion of
      	the old grace period and the start of the new grace period
      	in a single traversal of the rcu_node structures.
      
      	Therefore, CPUs corresponding to the first rcu_node structure
      	can become aware that the prior grace period has completed, but
      	CPUs corresponding to the other rcu_node structures will see
      	this same prior grace period as still being in progress.
      
      2.	CPU 1 passes through a quiescent state, and therefore informs
      	the RCU core.  Because its leaf rcu_node structure has already
      	been initialized, this CPU's quiescent state is applied to the
      	new (and only partially initialized) grace period.
      
      3.	CPU 1 enters an RCU read-side critical section and acquires
      	a reference to data item A.  Note that this CPU believes that
      	its critical section started after the beginning of the new
      	grace period, and therefore will not block this new grace period.
      
      4.	CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
      	mode, other CPUs informed the RCU core of its extended quiescent
      	state for the past several grace periods.  This means that CPU 16
      	is not yet aware that these past grace periods have ended.  Assume
      	that CPU 16 corresponds to the second leaf rcu_node structure --
      	which has not yet been made aware of the new grace period.
      
      5.	CPU 16 removes data item A from its enclosing data structure
      	and passes it to call_rcu(), which queues a callback in the
      	RCU_NEXT_TAIL segment of the callback queue.
      
      6.	CPU 16 enters the RCU core, possibly because it has taken a
      	scheduling-clock interrupt, or alternatively because it has
      	more than 10,000 callbacks queued.  It notes that the second
      	most recent grace period has completed (recall that because it
      	corresponds to the second as-yet-uninitialized rcu_node structure,
      	it cannot yet become aware that the most recent grace period has
      	completed), and therefore advances its callbacks.  The callback
      	for data item A is therefore in the RCU_NEXT_READY_TAIL segment
      	of the callback queue.
      
      7.	CPU 0 completes initialization of the remaining leaf rcu_node
      	structures for the new grace period, including the structure
      	corresponding to CPU 16.
      
      8.	CPU 16 again enters the RCU core, again, possibly because it has
      	taken a scheduling-clock interrupt, or alternatively because
      	it now has more than 10,000 callbacks queued.  It notes that
      	the most recent grace period has ended, and therefore advances
      	its callbacks.  The callback for data item A is therefore in
      	the RCU_WAIT_TAIL segment of the callback queue.
      
      9.	All CPUs other than CPU 1 pass through quiescent states.  Because
      	CPU 1 already passed through its quiescent state, the new grace
      	period completes.  Note that CPU 1 is still in its RCU read-side
      	critical section, still referencing data item A.
      
      10.	Suppose that CPU 2 was the last CPU to pass through a quiescent
      	state for the new grace period, and suppose further that CPU 2
      	did not have any callbacks queued, therefore not needing an
      	additional grace period.  CPU 2 therefore traverses all of the
      	rcu_node structures, marking the new grace period as completed,
      	but does not initialize a new grace period.
      
      11.	CPU 16 yet again enters the RCU core, yet again possibly because
      	it has taken a scheduling-clock interrupt, or alternatively
      	because it now has more than 10,000 callbacks queued.  It notes
      	that the new grace period has ended, and therefore advances
      	its callbacks.  The callback for data item A is therefore in
      	the RCU_DONE_TAIL segment of the callback queue.  This means
      	that this callback is now considered ready to be invoked.
      
      12.	CPU 16 invokes the callback, freeing data item A while CPU 1
      	is still referencing it.
      
      This scenario represents a day-zero bug for TREE_RCU.  This commit
      therefore ensures that the old grace period is marked completed in
      all leaf rcu_node structures before a new grace period is marked
      started in any of them.
      
      That said, it would have been insanely difficult to force this race to
      happen before the grace-period initialization process was preemptible.
      Therefore, this commit is not a candidate for -stable.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      
      Conflicts:
      
      	kernel/rcutree.c
      5d4b8659
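The ordering that this fix enforces can be modeled with two separate passes over a flattened node array. This is a sketch under assumptions: the `struct node` layout and the helper names are hypothetical, and the kernel's locking is omitted.

```c
struct node {
    unsigned long gpnum;      /* most recent grace period started */
    unsigned long completed;  /* most recent grace period completed */
};

/* Cleanup pass: mark the old grace period completed in every node. */
static void gp_cleanup(struct node nodes[], int n, unsigned long old_gp)
{
    int i;

    for (i = 0; i < n; i++)
        nodes[i].completed = old_gp;
}

/* Initialization pass: to be run only after gp_cleanup() has finished. */
static void gp_init(struct node nodes[], int n, unsigned long new_gp)
{
    int i;

    for (i = 0; i < n; i++)
        nodes[i].gpnum = new_gp;
}

/* Returns 1 unless some node still sees old_gp as in progress while
 * another node already sees a later grace period as started. */
static int views_consistent(const struct node nodes[], int n,
                            unsigned long old_gp)
{
    int stale = 0, started = 0, i;

    for (i = 0; i < n; i++) {
        if (nodes[i].completed < old_gp)
            stale = 1;
        if (nodes[i].gpnum > old_gp)
            started = 1;
    }
    return !(stale && started);
}
```

A single combined traversal that is interrupted partway leaves `views_consistent()` false, which is the disagreement exploited in the scenario above. With the two-pass order, a partially finished `gp_init()` cannot produce that state.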
    • rcu: Make rcutree module parameters visible in sysfs · 7e5c2dfb
      Paul E. McKenney committed
      The module parameters blimit, qhimark, and qlomark (and more
      recently, rcu_fanout_leaf) have permission masks of zero, so
      that their values are not visible from sysfs.  This is unnecessary
      and inconvenient to administrators who might like an easy way to
      see what these values are on a running system.  This commit therefore
      sets their permission masks to 0444, allowing them to be read but
      not written.
      Reported-by: Rusty Russell <rusty@ozlabs.org>
      Reported-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      7e5c2dfb
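Once a parameter is readable, an administrator's tool can fetch it with ordinary file I/O. The following is a minimal sketch. `read_int_param()` is a hypothetical helper, and the sysfs path in the usage note is an assumption about where these parameters appear on a given kernel.

```c
#include <stdio.h>

/* Read one integer-valued parameter file.
 * Returns 0 on success and stores the value, or -1 on any failure. */
static int read_int_param(const char *path, long *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;                       /* missing or unreadable */
    ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

A caller might try `read_int_param("/sys/module/rcutree/parameters/blimit", &v)` and treat a return of -1 as a sign that the parameter is not exposed on that system.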
    • rcu: Control grace-period duration from sysfs · d40011f6
      Paul E. McKenney committed
      Although almost everyone is well-served by the defaults, some uses of RCU
      benefit from shorter grace periods, while others benefit more from the
      greater efficiency provided by longer grace periods.  Situations requiring
      a large number of grace periods to elapse (and wireshark startup has
      been called out as an example of this) are helped by lower-latency
      grace periods.  Furthermore, in some embedded applications, people are
      willing to accept a small degradation in update efficiency (due to there
      being more of the shorter grace-period operations) in order to gain the
      lower latency.
      
      In contrast, those few systems with thousands of CPUs need longer grace
      periods because the CPU overhead of a grace period rises roughly
      linearly with the number of CPUs.  Such systems normally do not make
      much use of facilities that require large numbers of grace periods to
      elapse, so this is a good tradeoff.
      
      Therefore, this commit allows the durations to be controlled from sysfs.
      There are two sysfs parameters, one named "jiffies_till_first_fqs" that
      specifies the delay in jiffies from the end of grace-period initialization
      until the first attempt to force quiescent states, and the other named
      "jiffies_till_next_fqs" that specifies the delay (again in jiffies)
      between subsequent attempts to force quiescent states.  They both default
      to three jiffies, which is compatible with the old hard-coded behavior.
      
      At some future time, it may be possible to automatically increase the
      grace-period length with the number of CPUs, but we do not yet have
      sufficient data to do a good job.  Preliminary data indicates that we
      should add an additional jiffy to each of the delays for every 200 CPUs
      in the system, but more experimentation is needed.  For now, the number
      of systems with more than 1,000 CPUs is small enough that this can be
      relegated to boot-time hand tuning.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      d40011f6
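The preliminary rule of thumb quoted above (three jiffies by default, plus one jiffy per 200 CPUs) can be written down as a small calculation for hand tuning. This is a sketch only: `suggested_fqs_jiffies()` is a hypothetical helper, and the commit itself notes that more experimentation is needed before relying on these numbers.

```c
#define BASE_FQS_JIFFIES      3UL    /* default delay stated in the text */
#define CPUS_PER_EXTRA_JIFFY  200UL  /* preliminary figure from the text */

/* Suggested value for either forcing delay on a system with ncpus CPUs. */
static unsigned long suggested_fqs_jiffies(unsigned long ncpus)
{
    return BASE_FQS_JIFFIES + ncpus / CPUS_PER_EXTRA_JIFFY;
}
```

Under this rule a four-CPU system keeps the default of three jiffies, while a 4096-CPU system would be tuned to 23 jiffies for both jiffies_till_first_fqs and jiffies_till_next_fqs.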
    • rcu: Prevent force_quiescent_state() memory contention · 394f2769
      Paul E. McKenney committed
      Large systems running RCU_FAST_NO_HZ kernels see extreme memory
      contention on the rcu_state structure's ->fqslock field.  This
      can be avoided by disabling RCU_FAST_NO_HZ, either at compile time
      or at boot time (via the nohz kernel boot parameter), but large
      systems will no doubt become sensitive to energy consumption.
      This commit therefore uses a combining-tree approach to spread the
      memory contention across new cache lines in the leaf rcu_node structures.
      This can be thought of as a tournament lock that has only a try-lock
      acquisition primitive.
      
      The effect on small systems is minimal, because such systems have
      an rcu_node "tree" consisting of a single node.  In addition, this
      functionality is not used on fastpaths.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      394f2769
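The "tournament lock with only a try-lock primitive" idea can be sketched in userspace with C11 atomic flags. This is a sketch under assumptions: `struct fq_node` and both functions are hypothetical, the kernel uses spinlocks inside rcu_node structures instead of atomic flags, and the caller is assumed to pass a non-NULL leaf.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct fq_node {
    atomic_flag busy;         /* stands in for a per-node try-lock */
    struct fq_node *parent;   /* NULL at the root */
};

/* Climb from a leaf toward the root, holding at most one node at a time.
 * Returns true with the root's flag held if this caller won the
 * tournament, or false with nothing held if any try-lock failed. */
static bool fq_funnel_trylock(struct fq_node *leaf)
{
    struct fq_node *held = NULL;
    struct fq_node *n;

    for (n = leaf; n != NULL; n = n->parent) {
        bool failed = atomic_flag_test_and_set(&n->busy);

        if (held != NULL)
            atomic_flag_clear(&held->busy);  /* release the level below */
        if (failed)
            return false;                    /* another caller is ahead */
        held = n;
    }
    return true;
}

/* Release the root after the winner has finished its work. */
static void fq_funnel_unlock(struct fq_node *root)
{
    atomic_flag_clear(&root->busy);
}
```

Callers that lose at a lower level never touch the root's cache line, which is how contention is spread across the tree. A loser simply returns, on the assumption that the winner will do the forcing on everyone's behalf.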
    • rcu: Allow RCU quiescent-state forcing to be preempted · b4be093f
      Paul E. McKenney committed
      RCU quiescent-state forcing is currently carried out without preemption
      points, which can result in excessive latency spikes on large systems
      (many hundreds or thousands of CPUs).  This patch therefore inserts
      a voluntary preemption point into force_qs_rnp(), which should greatly
      reduce the magnitude of these spikes.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      b4be093f
    • rcu: Move quiescent-state forcing into kthread · 4cdfc175
      Paul E. McKenney committed
      As the first step towards allowing quiescent-state forcing to be
      preemptible, this commit moves RCU quiescent-state forcing into the
      same kthread that is now used to initialize and clean up after grace
      periods.  This is yet another step towards keeping scheduling
      latency down to a dull roar.
      
      Updated to change from raw_spin_lock_irqsave() to raw_spin_lock_irq()
      and to remove the now-unused rcu_state structure fields as suggested by
      Peter Zijlstra.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      4cdfc175
    • rcu: Prevent offline CPUs from executing RCU core code · bfa00b4c
      Paul E. McKenney committed
      Earlier versions of RCU invoked the RCU core from the CPU_DYING notifier
      in order to note a quiescent state for the outgoing CPU.  Because the
      CPU is marked "offline" during the execution of the CPU_DYING notifiers,
      the RCU core had to tolerate being invoked from an offline CPU.  However,
      commit b1420f1c (Make rcu_barrier() less disruptive) left only tracing
      code in the CPU_DYING notifier, so the RCU core need no longer execute
      on offline CPUs.  This commit therefore enforces this restriction.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      bfa00b4c
    • rcu: Break up rcu_gp_kthread() into subfunctions · 7fdefc10
      Paul E. McKenney committed
      The rcu_gp_kthread() function is too large and furthermore needs to
      have the force_quiescent_state() code pulled in.  This commit therefore
      breaks up rcu_gp_kthread() into rcu_gp_init() and rcu_gp_cleanup().
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      7fdefc10
    • rcu: Allow RCU grace-period cleanup to be preempted · c856bafa
      Paul E. McKenney committed
      RCU grace-period cleanup is currently carried out with interrupts
      disabled, which can result in excessive latency spikes on large systems
      (many hundreds or thousands of CPUs).  This patch therefore makes the
      RCU grace-period cleanup be preemptible, including voluntary preemption
      points, which should eliminate those latency spikes.  Similar spikes from
      forcing of quiescent states will be dealt with similarly by later patches.
      
      Updated to replace uses of spin_lock_irqsave() with spin_lock_irq(), as
      suggested by Peter Zijlstra.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      c856bafa
    • rcu: Move RCU grace-period cleanup into kthread · cabc49c1
      Paul E. McKenney committed
      As a first step towards allowing grace-period cleanup to be preemptible,
      this commit moves the RCU grace-period cleanup into the same kthread
      that is now used to initialize grace periods.  This is needed to keep
      scheduling latency down to a dull roar.
      
      [ paulmck: Get rid of stray spin_lock_irqsave() calls. ]
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      cabc49c1
    • rcu: Allow RCU grace-period initialization to be preempted · 755609a9
      Paul E. McKenney committed
      RCU grace-period initialization is currently carried out with interrupts
      disabled, which can result in 200-microsecond latency spikes on systems
      on which RCU has been configured for 4096 CPUs.  This patch therefore
      makes the RCU grace-period initialization be preemptible, which should
      eliminate those latency spikes.  Similar spikes from grace-period cleanup
      and the forcing of quiescent states will be dealt with similarly by later
      patches.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      755609a9
    • rcu: Prevent initialization-time quiescent-state race · 79bce672
      Paul E. McKenney committed
      The next step in reducing RCU's grace-period initialization latency on
      large systems will make this initialization preemptible.  Unfortunately,
      making the grace-period initialization subject to interrupts (let alone
      preemption) exposes the following race on systems whose rcu_node tree
      contains more than one node:
      
      1.	CPU 31 starts initializing the grace period, including the
          	first leaf rcu_node structures, and is then preempted.
      
      2.	CPU 0 refers to the first leaf rcu_node structure, and notes
          	that a new grace period has started.  It passes through a
          	quiescent state shortly thereafter, and informs the RCU core
          	of this rite of passage.
      
      3.	CPU 0 enters an RCU read-side critical section, acquiring
          	a pointer to an RCU-protected data item.
      
      4.	CPU 31 takes an interrupt whose handler removes the data item
      	referenced by CPU 0 from the data structure, and registers an
      	RCU callback in order to free it.
      
      5.	CPU 31 resumes initializing the grace period, including its
          	own rcu_node structure.  It invokes rcu_start_gp_per_cpu(),
          	which advances all callbacks, including the one registered
          	in #4 above, to be handled by the current grace period.
      
      6.	The remaining CPUs pass through quiescent states and inform
          	the RCU core, but CPU 0 remains in its RCU read-side critical
          	section, still referencing the now-removed data item.
      
      7.	The grace period completes and all the callbacks are invoked,
          	including the one that frees the data item that CPU 0 is still
          	referencing.  Oops!!!
      
      One way to avoid this race is to remove grace-period acceleration from
      rcu_start_gp_per_cpu().  Now, the only reason for this acceleration was
      to allow CPUs bringing RCU out of idle state to have their callbacks
      invoked after only one grace period, rather than the two grace periods
      that would otherwise be required.  But this acceleration does not
      work when RCU grace-period initialization is moved to a kthread because
      the CPU posting the callback is no longer necessarily the CPU that is
      initializing the resulting grace period.
      
      This commit therefore removes this now-pointless (and soon to be dangerous)
      grace-period acceleration, thus avoiding the above race.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      79bce672
    • rcu: Move RCU grace-period initialization into a kthread · b3dbec76
      Paul E. McKenney committed
      As the first step towards allowing grace-period initialization to be
      preemptible, this commit moves the RCU grace-period initialization
      into its own kthread.  This is needed to keep large-system scheduling
      latency at reasonable levels.
      
      Also change raw_spin_lock_irqsave() to raw_spin_lock_irq() as suggested
      by Peter Zijlstra in review comments.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      b3dbec76
    • rcu: Fix day-one dyntick-idle stall-warning bug · a10d206e
      Paul E. McKenney committed
      Each grace period is supposed to have at least one callback waiting
      for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
      extra callback-free grace period is no big problem -- it will chew up
      a tiny bit of CPU time, but it will complete normally.  In contrast,
      CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
      sleep indefinitely, in turn indefinitely delaying completion of the
      callback-free grace period.  Given that nothing is waiting on this grace
      period, this is also not a problem.
      
      That is, unless RCU CPU stall warnings are also enabled, as they are
      in recent kernels.  In this case, if a CPU wakes up after at least one
      minute of inactivity, an RCU CPU stall warning will result.  The reason
      that no one noticed until quite recently is that most systems have enough
      OS noise that they will never remain absolutely idle for a full minute.
      But there are some embedded systems with cut-down userspace configurations
      that consistently get into this situation.
      
      All this begs the question of exactly how a callback-free grace period
      gets started in the first place.  This can happen due to the fact that
      CPUs do not necessarily agree on which grace period is in progress.
      If a CPU still believes that the grace period that just completed is
      still ongoing, it will believe that it has callbacks that need to wait for
      another grace period, never mind the fact that the grace period that they
      were waiting for just completed.  This CPU can therefore erroneously
      decide to start a new grace period.  Note that this can happen in
      TREE_RCU and TREE_PREEMPT_RCU even on a single-CPU system:  Deadlock
      considerations mean that the CPU that detected the end of the grace
      period is not necessarily officially informed of this fact for some time.
      
      Once this CPU notices that the earlier grace period completed, it will
      invoke its callbacks.  It then won't have any callbacks left.  If no
      other CPU has any callbacks, we now have a callback-free grace period.
      
      This commit therefore makes CPUs check more carefully before starting a
      new grace period.  This new check relies on an array of tail pointers
      into each CPU's list of callbacks.  If the CPU is up to date on which
      grace periods have completed, it checks to see if any callbacks follow
      the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
      follow the RCU_WAIT_TAIL segment.  The reason that this works is that
      the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
      as soon as the CPU is officially notified that the old grace period
      has ended.
      
      This change is to cpu_needs_another_gp(), which is called in a number
      of places.  The only one that really matters is in rcu_start_gp(), where
      the root rcu_node structure's ->lock is held, which prevents any
      other CPU from starting or completing a grace period, so that the
      comparison that determines whether the CPU is missing the completion
      of a grace period is stable.
      Reported-by: Becky Bruce <bgillbruce@gmail.com>
      Reported-by: Subodh Nijsure <snijsure@grid-net.com>
      Reported-by: Paul Walmsley <paul@pwsan.com>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Paul Walmsley <paul@pwsan.com>  # OMAP3730, OMAP4430
      Cc: stable@vger.kernel.org
      a10d206e
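The tail-pointer check described above can be modeled on a segmented singly linked list. This is a sketch under assumptions: the type and function names are hypothetical, the segment names mirror the RCU_*_TAIL names in the text, and the kernel's cpu_needs_another_gp() runs with locking and memory-access rules that this model leaves out.

```c
#include <stdbool.h>
#include <stddef.h>

struct cb { struct cb *next; };

enum { DONE_TAIL, WAIT_TAIL, NEXT_READY_TAIL, NEXT_TAIL, NUM_TAILS };

struct cpu_cbs {
    struct cb *head;              /* singly linked list of callbacks */
    struct cb **tail[NUM_TAILS];  /* tail[i]: ->next slot that ends segment i */
    unsigned long completed;      /* last completed grace period noticed */
};

static void cbs_init(struct cpu_cbs *c, unsigned long completed)
{
    int i;

    c->head = NULL;
    for (i = 0; i < NUM_TAILS; i++)
        c->tail[i] = &c->head;    /* all segments start out empty */
    c->completed = completed;
}

/* Queue a new callback at the end of the last segment. */
static void cbs_enqueue(struct cpu_cbs *c, struct cb *cb)
{
    cb->next = NULL;
    *c->tail[NEXT_TAIL] = cb;
    c->tail[NEXT_TAIL] = &cb->next;
}

static bool needs_another_gp(const struct cpu_cbs *c,
                             unsigned long global_completed,
                             bool gp_in_progress)
{
    /* A CPU that has not yet noticed the latest completion treats its
     * WAIT segment as already done, so it looks past WAIT, not DONE. */
    int seg = (c->completed != global_completed) ? WAIT_TAIL : DONE_TAIL;

    return *c->tail[seg] != NULL && !gp_in_progress;
}
```

A CPU whose only callbacks sit in the WAIT segment, and which has missed the completion of the grace period they were waiting for, now reports that it needs no further grace period, which closes off the callback-free grace periods described above.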
  2. 06 Jul, 2012 2 commits
  3. 03 Jul, 2012 19 commits