1. 23 9月, 2012 6 次提交
    • P
      rcu: Allow RCU grace-period cleanup to be preempted · c856bafa
      Paul E. McKenney 提交于
      RCU grace-period cleanup is currently carried out with interrupts
      disabled, which can result in excessive latency spikes on large systems
      (many hundreds or thousands of CPUs).  This patch therefore makes the
      RCU grace-period cleanup be preemptible, including voluntary preemption
      points, which should eliminate those latency spikes.  Similar spikes from
      forcing of quiescent states will be dealt with similarly by later patches.
      
      Updated to replace uses of spin_lock_irqsave() with spin_lock_irq(), as
      suggested by Peter Zijlstra.
      Reported-by: NMike Galbraith <mgalbraith@suse.de>
      Reported-by: NDimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      c856bafa
    • P
      rcu: Move RCU grace-period cleanup into kthread · cabc49c1
      Paul E. McKenney 提交于
      As a first step towards allowing grace-period cleanup to be preemptible,
      this commit moves the RCU grace-period cleanup into the same kthread
      that is now used to initialize grace periods.  This is needed to keep
      scheduling latency down to a dull roar.
      
      [ paulmck: Get rid of stray spin_lock_irqsave() calls. ]
      Reported-by: NMike Galbraith <mgalbraith@suse.de>
      Reported-by: NDimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      cabc49c1
    • P
      rcu: Allow RCU grace-period initialization to be preempted · 755609a9
      Paul E. McKenney 提交于
      RCU grace-period initialization is currently carried out with interrupts
      disabled, which can result in 200-microsecond latency spikes on systems
      on which RCU has been configured for 4096 CPUs.  This patch therefore
      makes the RCU grace-period initialization be preemptible, which should
      eliminate those latency spikes.  Similar spikes from grace-period cleanup
      and the forcing of quiescent states will be dealt with similarly by later
      patches.
      Reported-by: NMike Galbraith <mgalbraith@suse.de>
      Reported-by: NDimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      755609a9
    • P
      rcu: Prevent initialization-time quiescent-state race · 79bce672
      Paul E. McKenney 提交于
      The next step in reducing RCU's grace-period initialization latency on
      large systems will make this initialization preemptible.  Unfortunately,
      making the grace-period initialization subject to interrupts (let alone
      preemption) exposes the following race on systems whose rcu_node tree
      contains more than one node:
      
      1.	CPU 31 starts initializing the grace period, including the
          	first leaf rcu_node structures, and is then preempted.
      
      2.	CPU 0 refers to the first leaf rcu_node structure, and notes
          	that a new grace period has started.  It passes through a
          	quiescent state shortly thereafter, and informs the RCU core
          	of this rite of passage.
      
      3.	CPU 0 enters an RCU read-side critical section, acquiring
          	a pointer to an RCU-protected data item.
      
      4.	CPU 31 takes an interrupt whose handler removes the data item
      	referenced by CPU 0 from the data structure, and registers an
      	RCU callback in order to free it.
      
      5.	CPU 31 resumes initializing the grace period, including its
          	own rcu_node structure.  In invokes rcu_start_gp_per_cpu(),
          	which advances all callbacks, including the one registered
          	in #4 above, to be handled by the current grace period.
      
      6.	The remaining CPUs pass through quiescent states and inform
          	the RCU core, but CPU 0 remains in its RCU read-side critical
          	section, still referencing the now-removed data item.
      
      7.	The grace period completes and all the callbacks are invoked,
          	including the one that frees the data item that CPU 0 is still
          	referencing.  Oops!!!
      
      One way to avoid this race is to remove grace-period acceleration from
      rcu_start_gp_per_cpu().  Now, the only reason for this acceleration was
      to allow CPUs bringing RCU out of idle state to have their callbacks
      invoked after only one grace period, rather than the two grace periods
      that would otherwise be required.  But this acceleration does not
      work when RCU grace-period initialization is moved to a kthread because
      the CPU posting the callback is no longer necessarily the CPU that is
      initializing the resulting grace period.
      
      This commit therefore removes this now-pointless (and soon to be dangerous)
      grace-period acceleration, thus avoiding the above race.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      79bce672
    • P
      rcu: Move RCU grace-period initialization into a kthread · b3dbec76
      Paul E. McKenney 提交于
      As the first step towards allowing grace-period initialization to be
      preemptible, this commit moves the RCU grace-period initialization
      into its own kthread.  This is needed to keep large-system scheduling
      latency at reasonable levels.
      
      Also change raw_spin_lock_irqsave() to raw_spin_lock_irq() as suggested
      by Peter Zijlstra in review comments.
      Reported-by: NMike Galbraith <mgalbraith@suse.de>
      Reported-by: NDimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      b3dbec76
    • P
      rcu: Fix day-one dyntick-idle stall-warning bug · a10d206e
      Paul E. McKenney 提交于
      Each grace period is supposed to have at least one callback waiting
      for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
      extra callback-free grace period is no big problem -- it will chew up
      a tiny bit of CPU time, but it will complete normally.  In contrast,
      CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
      sleep indefinitely, in turn indefinitely delaying completion of the
      callback-free grace period.  Given that nothing is waiting on this grace
      period, this is also not a problem.
      
      That is, unless RCU CPU stall warnings are also enabled, as they are
      in recent kernels.  In this case, if a CPU wakes up after at least one
      minute of inactivity, an RCU CPU stall warning will result.  The reason
      that no one noticed until quite recently is that most systems have enough
      OS noise that they will never remain absolutely idle for a full minute.
      But there are some embedded systems with cut-down userspace configurations
      that consistently get into this situation.
      
      All this begs the question of exactly how a callback-free grace period
      gets started in the first place.  This can happen due to the fact that
      CPUs do not necessarily agree on which grace period is in progress.
      If a CPU still believes that the grace period that just completed is
      still ongoing, it will believe that it has callbacks that need to wait for
      another grace period, never mind the fact that the grace period that they
      were waiting for just completed.  This CPU can therefore erroneously
      decide to start a new grace period.  Note that this can happen in
      TREE_RCU and TREE_PREEMPT_RCU even on a single-CPU system:  Deadlock
      considerations mean that the CPU that detected the end of the grace
      period is not necessarily officially informed of this fact for some time.
      
      Once this CPU notices that the earlier grace period completed, it will
      invoke its callbacks.  It then won't have any callbacks left.  If no
      other CPU has any callbacks, we now have a callback-free grace period.
      
      This commit therefore makes CPUs check more carefully before starting a
      new grace period.  This new check relies on an array of tail pointers
      into each CPU's list of callbacks.  If the CPU is up to date on which
      grace periods have completed, it checks to see if any callbacks follow
      the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
      follow the RCU_WAIT_TAIL segment.  The reason that this works is that
      the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
      as soon as the CPU is officially notified that the old grace period
      has ended.
      
      This change is to cpu_needs_another_gp(), which is called in a number
      of places.  The only one that really matters is in rcu_start_gp(), where
      the root rcu_node structure's ->lock is held, which prevents any
      other CPU from starting or completing a grace period, so that the
      comparison that determines whether the CPU is missing the completion
      of a grace period is stable.
      Reported-by: NBecky Bruce <bgillbruce@gmail.com>
      Reported-by: NSubodh Nijsure <snijsure@grid-net.com>
      Reported-by: NPaul Walmsley <paul@pwsan.com>
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Paul Walmsley <paul@pwsan.com>  # OMAP3730, OMAP4430
      Cc: stable@vger.kernel.org
      a10d206e
  2. 06 7月, 2012 2 次提交
  3. 03 7月, 2012 24 次提交
  4. 26 6月, 2012 1 次提交
  5. 07 6月, 2012 1 次提交
  6. 10 5月, 2012 1 次提交
    • P
      rcu: Make rcu_barrier() less disruptive · b1420f1c
      Paul E. McKenney 提交于
      The rcu_barrier() primitive interrupts each and every CPU, registering
      a callback on every CPU.  Once all of these callbacks have been invoked,
      rcu_barrier() knows that every callback that was registered before
      the call to rcu_barrier() has also been invoked.
      
      However, there is no point in registering a callback on a CPU that
      currently has no callbacks, most especially if that CPU is in a
      deep idle state.  This commit therefore makes rcu_barrier() avoid
      interrupting CPUs that have no callbacks.  Doing this requires reworking
      the handling of orphaned callbacks, otherwise callbacks could slip through
      rcu_barrier()'s net by being orphaned from a CPU that rcu_barrier() had
      not yet interrupted to a CPU that rcu_barrier() had already interrupted.
      This reworking was needed anyway to take a first step towards weaning
      RCU from the CPU_DYING notifier's use of stop_cpu().
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      b1420f1c
  7. 03 5月, 2012 1 次提交
  8. 25 4月, 2012 3 次提交
    • P
      rcu: Make RCU_FAST_NO_HZ account for pauses out of idle · c57afe80
      Paul E. McKenney 提交于
      Both Steven Rostedt's new idle-capable trace macros and the RCU_NONIDLE()
      macro can cause RCU to momentarily pause out of idle without the rest
      of the system being involved.  This can cause rcu_prepare_for_idle()
      to run through its state machine too quickly, which can in turn result
      in needless scheduling-clock interrupts.
      
      This commit therefore adds code to enable rcu_prepare_for_idle() to
      distinguish between an initial entry to idle on the one hand (which needs
      to advance the rcu_prepare_for_idle() state machine) and an idle reentry
      due to idle-capable trace macros and RCU_NONIDLE() on the other hand
      (which should avoid advancing the rcu_prepare_for_idle() state machine).
      Additional state is maintained to allow the timer to be correctly reposted
      when returning after a momentary pause out of idle, and even more state
      is maintained to detect when new non-lazy callbacks have been enqueued
      (which may require re-evaluation of the approach to idleness).
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      c57afe80
    • P
      rcu: Document why rcu_blocking_is_gp() is safe · 6d813391
      Paul E. McKenney 提交于
      The rcu_blocking_is_gp() function tests to see if there is only one
      online CPU, and if so, synchronize_sched() and friends become no-ops.
      However, for larger systems, num_online_cpus() scans a large vector,
      and might be preempted while doing so.  While preempted, any number
      of CPUs might come online and go offline, potentially resulting in
      num_online_cpus() returning 1 when there never had only been one
      CPU online.  This could result in a too-short RCU grace period, which
      could in turn result in total failure, except that the only way that
      the grace period is too short is if there is an RCU read-side critical
      section spanning it.  For RCU-sched and RCU-bh (which are the only
      cases using rcu_blocking_is_gp()), RCU read-side critical sections
      have either preemption or bh disabled, which prevents CPUs from going
      offline.  This in turn prevents actual failures from occurring.
      
      This commit therefore adds a large block comment to rcu_blocking_is_gp()
      documenting why it is safe.  This commit also moves rcu_blocking_is_gp()
      into kernel/rcutree.c, which should help prevent unwary developers from
      mistaking it for a generally useful function.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      6d813391
    • P
      rcu: Reduce cache-miss initialization latencies for large systems · 8932a63d
      Paul E. McKenney 提交于
      Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper
      limit of 16 on the leaf-level fanout for the rcu_node tree.  This was
      needed to reduce lock contention that was induced by the synchronization
      of scheduling-clock interrupts, which was in turn needed to improve
      energy efficiency for moderate-sized lightly loaded servers.
      
      However, reducing the leaf-level fanout means that there are more
      leaf-level rcu_node structures in the tree, which in turn means that
      RCU's grace-period initialization incurs more cache misses.  This is
      not a problem on moderate-sized servers with only a few tens of CPUs,
      but becomes a major source of real-time latency spikes on systems with
      many hundreds of CPUs.  In addition, the workloads running on these large
      systems tend to be CPU-bound, which eliminates the energy-efficiency
      advantages of synchronizing scheduling-clock interrupts.  Therefore,
      these systems need maximal values for the rcu_node leaf-level fanout.
      
      This commit addresses this problem by introducing a new kernel parameter
      named RCU_FANOUT_LEAF that directly controls the leaf-level fanout.
      This parameter defaults to 16 to handle the common case of a moderate
      sized lightly loaded servers, but may be set higher on larger systems.
      Reported-by: NMike Galbraith <efault@gmx.de>
      Reported-by: NDimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      8932a63d
  9. 17 4月, 2012 1 次提交
    • P
      rcu: Permit call_rcu() from CPU_DYING notifiers · 92c38702
      Paul E. McKenney 提交于
      As of:
      
        29494be7 ("rcu,cleanup: simplify the code when cpu is dying")
      
      RCU adopts callbacks from the dying CPU in its CPU_DYING notifier,
      which means that any callbacks posted by later CPU_DYING notifiers
      are ignored until the CPU comes back online.
      
      A WARN_ON_ONCE() was added to __call_rcu() by:
      
        e5601400 ("rcu: Simplify offline processing")
      
      to check for this condition.  Although this condition did not trigger
      (at least as far as I know) during -next testing, it did recently
      trigger in mainline:
      
        https://lkml.org/lkml/2012/4/2/34
      
      What is needed longer term is for RCU's CPU_DEAD notifier to adopt any
      callbacks that were posted by CPU_DYING notifiers, however, the Linux
      kernel has been running with this sort of thing happening for quite
      some time.  So the only thing that qualifies as a regression is the
      WARN_ON_ONCE(), which this commit removes.
      
      Making RCU's CPU_DEAD notifier adopt callbacks posted by CPU_DYING
      notifiers is a topic for the 3.5 release of the Linux kernel.
      Reported-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      92c38702