1. 17 Nov, 2012 1 commit
    • rcu: Add callback-free CPUs · 3fbfbf7a
      Paul E. McKenney committed
      RCU callback execution can add significant OS jitter and also can
      degrade both scheduling latency and, in asymmetric multiprocessors,
      energy efficiency.  This commit therefore adds the ability for selected
      CPUs ("rcu_nocbs=" boot parameter) to have their callbacks offloaded
      to kthreads.  If the "rcu_nocb_poll" boot parameter is also specified,
      these kthreads will do polling, removing the need for the offloaded
      CPUs to do wakeups.  At least one CPU must be doing normal callback
      processing: currently CPU 0 cannot be selected as a no-CBs CPU.
      In addition, attempts to offline the last normal-CBs CPU will fail.
      
      This feature was inspired by Jim Houston's and Joe Korty's JRCU, and
      this commit includes fixes to problems located by Fengguang Wu's
      kbuild test robot.
      
      [ paulmck: Added gfp.h include file as suggested by Fengguang Wu. ]
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  2. 14 Nov, 2012 2 commits
  3. 09 Nov, 2012 5 commits
    • rcu: Fix tracing formatting · 42c3533e
      Paul E. McKenney committed
      The rcu_state structure's ->completed field is unsigned long, so this
      commit adjusts show_one_rcugp()'s printf() format to suit.  Also add
      the required ACCESS_ONCE() directives while we are in this function.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
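The format fix described above can be illustrated with a minimal user-space sketch. Everything prefixed `demo_` is hypothetical, and the `ACCESS_ONCE()` definition here is a simplified stand-in for the kernel macro, not a copy of the kernel's tracing code.

```c
#include <assert.h>
#include <stdio.h>

/* Simplified stand-in for the kernel's ACCESS_ONCE(): one volatile read. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical cut-down structure holding only the field of interest. */
struct demo_rcu_state {
    unsigned long completed;
};

/*
 * Format ->completed with %lu, the conversion that matches unsigned long.
 * Returns the number of characters snprintf() would have written.
 */
static int demo_show_completed(char *buf, size_t len,
                               struct demo_rcu_state *rsp)
{
    assert(buf != NULL && rsp != NULL);
    return snprintf(buf, len, "completed=%lu",
                    ACCESS_ONCE(rsp->completed));
}
```

Using a signed conversion such as `%ld` would print large counter values as negative numbers, which is presumably the kind of mismatch the commit addresses.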
    • rcu: Instrument synchronize_rcu_expedited() for debugfs tracing · a30489c5
      Paul E. McKenney committed
      This commit adds the counters to rcu_state and updates them in
      synchronize_rcu_expedited() to provide the data needed for debugfs
      tracing.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Move synchronize_sched_expedited() state to rcu_state · 40694d66
      Paul E. McKenney committed
      Tracing (debugfs) of expedited RCU primitives is required, which in turn
      requires that the relevant data be located where the tracing code can find
      it, not in its current static global variables in kernel/rcutree.c.
      This commit therefore moves sync_sched_expedited_started and
      sync_sched_expedited_done to the rcu_state structure, as fields
      ->expedited_start and ->expedited_done, respectively.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Avoid counter wrap in synchronize_sched_expedited() · 1924bcb0
      Paul E. McKenney committed
      There is a counter scheme similar to ticket locking that
      synchronize_sched_expedited() uses to service multiple concurrent
      callers with the same expedited grace period.  Upon entry, a
      sync_sched_expedited_started variable is atomically incremented,
      and upon completion of an expedited grace period a separate
      sync_sched_expedited_done variable is atomically incremented.
      
      However, if a synchronize_sched_expedited() is delayed while
      in try_stop_cpus(), concurrent invocations will increment the
      sync_sched_expedited_started counter, which will eventually overflow.
      If the original synchronize_sched_expedited() resumes execution just
      as the counter overflows, a concurrent invocation could incorrectly
      conclude that an expedited grace period elapsed in zero time, which
      would be bad.  One could rely on counter size to prevent this from
      happening in practice, but the goal is to formally validate this
      code, so it needs to be fixed anyway.
      
      This commit therefore checks the gap between the two counters before
      incrementing sync_sched_expedited_started, and if the gap is too
      large, does a normal grace period instead.  Overflow is thus only
      possible if there are more than about 3.5 billion threads on 32-bit
      systems, which can be excluded until such time as task_struct fits
      into a single byte and 4G/4G patches are accepted into mainline.
      It is also easy to encode this limitation into mechanical theorem
      provers.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Rename ->onofflock to ->orphan_lock · 7b2e6011
      Paul E. McKenney committed
      The ->onofflock field in the rcu_state structure at one time synchronized
      CPU-hotplug operations for RCU.  However, its scope has decreased over time
      so that it now only protects the lists of orphaned RCU callbacks.  This
      commit therefore renames it to ->orphan_lock to reflect its current use.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  4. 24 Oct, 2012 6 commits
  5. 21 Oct, 2012 1 commit
    • rcu: Accelerate callbacks for CPU initiating a grace period · 62da1921
      Paul E. McKenney committed
      Because grace-period initialization is carried out by a separate
      kthread, it might happen on a different CPU than the one that
      had the callback needing a grace period -- which is where the
      callback acceleration needs to happen.
      
      Fortunately, rcu_start_gp() holds the root rcu_node structure's
      ->lock, which prevents a new grace period from starting.  This
      allows this function to safely determine that a grace period has
      not yet started, which in turn allows it to fully accelerate any
      callbacks that it has pending.  This commit adds this acceleration.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  6. 09 Oct, 2012 1 commit
    • rcu: Grace-period initialization excludes only RCU notifier · a4fbe35a
      Paul E. McKenney committed
      Kirill noted the following deadlock cycle on shutdown involving padata:
      
      > With commit 755609a9 I've got deadlock on
      > poweroff.
      >
      > It guess it happens because of race for cpu_hotplug.lock:
      >
      >       CPU A                                   CPU B
      > disable_nonboot_cpus()
      > _cpu_down()
      > cpu_hotplug_begin()
      >  mutex_lock(&cpu_hotplug.lock);
      > __cpu_notify()
      > padata_cpu_callback()
      > __padata_remove_cpu()
      > padata_replace()
      > synchronize_rcu()
      >                                       rcu_gp_kthread()
      >                                       get_online_cpus();
      >                                       mutex_lock(&cpu_hotplug.lock);
      
      It would of course be good to eliminate grace-period delays from
      CPU-hotplug notifiers, but that is a separate issue.  Deadlock is
      not an appropriate diagnostic for excessive CPU-hotplug latency.
      
      Fortunately, grace-period initialization does not actually need to
      exclude all of the CPU-hotplug operation, but rather only RCU's own
      CPU_UP_PREPARE and CPU_DEAD CPU-hotplug notifiers.  This commit therefore
      introduces a new per-rcu_state onoff_mutex that provides the required
      concurrency control in place of the get_online_cpus() that was previously
      in rcu_gp_init().
      Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Kirill A. Shutemov <kirill@shutemov.name>
  7. 26 Sep, 2012 8 commits
    • rcu: Apply micro-optimization and int/bool fixes to RCU's idle handling · cb349ca9
      Paul E. McKenney committed
      Checking "user" before "is_idle_task()" allows better optimizations
      in cases where inlining is possible.  Also, "bool" should be passed
      "true" or "false" rather than "1" or "0".  This commit therefore makes
      these changes, as noted in Josh's review.
      Reported-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Userspace RCU extended QS selftest · 1fd2b442
      Frederic Weisbecker committed
      Provide a config option that enables the userspace
      RCU extended quiescent state on every CPU by default.
      
      This is for testing purposes.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Switch task's syscall hooks on context switch · 04e7e951
      Frederic Weisbecker committed
      Clear the syscalls hook of a task when it's scheduled out so that if
      the task migrates, it doesn't run the syscall slow path on a CPU
      that might not need it.
      
      Also set the syscalls hook on the next task if needed.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Ignore userspace extended quiescent state by default · 1e1a689f
      Frederic Weisbecker committed
      By default, we don't want to enter the RCU extended quiescent
      state while in userspace because doing so adds overhead
      (e.g., use of the syscall slowpath).  Keep it off by default, but
      ready to be enabled when a feature such as adaptive tickless needs it.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Allow rcu_user_enter()/exit() to nest · c5d900bf
      Frederic Weisbecker committed
      Allow calls to rcu_user_enter() even if we are already
      in userspace (as seen by RCU) and allow calls to rcu_user_exit()
      even if we are already in the kernel.
      
      This makes the APIs easier to call from architecture code.
      Exception entries for example won't need to know if they come from
      userspace before calling rcu_user_exit().
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Settle config for userspace extended quiescent state · 2b1d5024
      Frederic Weisbecker committed
      Create a new config option under the RCU menu that puts
      CPUs into the RCU extended quiescent state (as in dynticks
      idle mode) when they run in userspace.  This requires
      some contribution from architectures to hook into the kernel
      and userspace boundaries.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: New rcu_user_enter_after_irq() and rcu_user_exit_after_irq() APIs · 19dd1591
      Frederic Weisbecker committed
      In some cases, it is necessary to enter or exit userspace-RCU-idle mode
      from an interrupt handler, for example, if some other CPU sends this
      CPU a resched IPI.  In this case, the current CPU would enter the IPI
      handler in userspace-RCU-idle mode, but would need to exit the IPI handler
      after having exited that mode.
      
      To allow this to work, this commit adds two new APIs to TREE_RCU:
      
      - rcu_user_enter_after_irq(). This must be called from an interrupt between
      rcu_irq_enter() and rcu_irq_exit().  After the irq calls rcu_irq_exit(),
      the irq handler will return into an RCU extended quiescent state.
      In theory, this interrupt is never a nested interrupt, but in practice
      it might interrupt softirq, which looks to RCU like a nested interrupt.
      
      - rcu_user_exit_after_irq(). This must be called from a non-nesting
      interrupt, interrupting an RCU extended quiescent state, also
      between rcu_irq_enter() and rcu_irq_exit(). After the irq calls
      rcu_irq_exit(), the irq handler will return in an RCU non-quiescent
      state.
      
      [ Combined with "Allow calls to rcu_exit_user_irq from nesting irqs." ]
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: New rcu_user_enter() and rcu_user_exit() APIs · adf5091e
      Frederic Weisbecker committed
      RCU currently insists that only idle tasks can enter RCU idle mode, which
      prohibits an adaptive tickless kernel (AKA nohz cpusets), which in turn
      would mean that usermode execution would always take scheduling-clock
      interrupts, even when there is only one task runnable on the CPU in
      question.
      
      This commit therefore adds rcu_user_enter() and rcu_user_exit(), which
      allow non-idle tasks to enter RCU idle mode.  These are quite similar
      to rcu_idle_enter() and rcu_idle_exit(), respectively, except that they
      omit the idle-task checks.
      
      [ Updated to use "user" flag rather than separate check functions. ]
      
      [ paulmck: Updated to drop exports of new functions based on Josh's patch
        getting rid of the need for them. ]
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
  8. 23 Sep, 2012 16 commits
    • rcu: Disallow callback registry on offline CPUs · 0d8ee37e
      Paul E. McKenney committed
      Posting a callback after the CPU_DEAD notifier effectively leaks
      that callback unless/until that CPU comes back online.  Silence is
      unhelpful when attempting to track down such leaks, so this commit emits
      a WARN_ON_ONCE() and unconditionally leaks the callback when an offline
      CPU attempts to register a callback.  The rdp->nxttail[RCU_NEXT_TAIL] is
      set to NULL in the CPU_DEAD notifier and restored in the CPU_UP_PREPARE
      notifier, allowing _call_rcu() to determine exactly when posting callbacks
      is illegal.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
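The NULL-tail convention described above can be modeled in user space as shown below. This is a simplified sketch with a single-segment queue and hypothetical `demo_` names; the kernel's callback list has several segments, and the real notifier also hands existing callbacks off elsewhere, which this model only assumes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_cb {
    struct demo_cb *next;
};

/* Hypothetical single-segment callback queue for one CPU. */
struct demo_cpu_queue {
    struct demo_cb *head;
    struct demo_cb **tail;  /* NULL while the CPU is offline */
    unsigned long leaked;   /* callbacks refused, for illustration */
};

/* Models the CPU_UP_PREPARE step: make the queue usable again. */
static void demo_cpu_up_prepare(struct demo_cpu_queue *q)
{
    q->head = NULL;
    q->tail = &q->head;
}

/* Models the CPU_DEAD step; existing callbacks are assumed handed off. */
static void demo_cpu_dead(struct demo_cpu_queue *q)
{
    q->head = NULL;
    q->tail = NULL;
}

/* Enqueue cb, or refuse (and count a leak) if the CPU is offline. */
static bool demo_call_rcu(struct demo_cpu_queue *q, struct demo_cb *cb)
{
    assert(q != NULL && cb != NULL);
    if (q->tail == NULL) {
        q->leaked++;        /* the kernel would also emit a one-time warning */
        return false;
    }
    cb->next = NULL;
    *q->tail = cb;
    q->tail = &cb->next;
    return true;
}
```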
    • rcu: Remove _rcu_barrier() dependency on __stop_machine() · 1331e7a1
      Paul E. McKenney committed
      Currently, _rcu_barrier() relies on preempt_disable() to prevent
      any CPU from going offline, which in turn depends on CPU hotplug's
      use of __stop_machine().
      
      This patch therefore makes _rcu_barrier() use get_online_cpus() to
      block CPU-hotplug operations.  This has the added benefit of removing
      the need for _rcu_barrier() to adopt callbacks:  Because CPU-hotplug
      operations are excluded, there can be no callbacks to adopt.  This
      commit simplifies the code accordingly.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Remove redundant memory barrier from __call_rcu() · fdab649b
      Paul E. McKenney committed
      The first memory barrier in __call_rcu() is supposed to order any
      updates done beforehand by the caller against the actual queuing
      of the callback.  However, the second memory barrier (which is intended
      to order incrementing the queue lengths before queuing the callback)
      is also between the caller's updates and the queuing of the callback.
      The second memory barrier can therefore serve both purposes.
      
      This commit therefore removes the first memory barrier.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Avoid spurious RCU CPU stall warnings · c96ea7cf
      Paul E. McKenney committed
      If a given CPU avoids the idle loop but also avoids starting a new
      RCU grace period for a full minute, RCU can issue spurious RCU CPU
      stall warnings.  This commit fixes this issue by adding a check for
      ongoing grace period to avoid these spurious stall warnings.
      Reported-by: Becky Bruce <bgillbruce@gmail.com>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Protect rcu_node accesses during CPU stall warnings · c8020a67
      Paul E. McKenney committed
      The print_other_cpu_stall() function accesses a number of rcu_node
      fields without protection from the ->lock.  In theory, this is not
      a problem because the fields accessed are all integers, but in
      practice the compiler can get nasty.  Therefore, the commit extends
      the existing critical section to cover the entire loop body.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Make offline-CPU checking allow for indefinite delays · a82dcc76
      Paul E. McKenney committed
      The rcu_implicit_offline_qs() function implicitly assumed that execution
      would progress predictably when interrupts are disabled, which is of course
      not guaranteed when running on a hypervisor.  Furthermore, this function
      is short, and is called from one place only in a short function.
      
      This commit therefore ensures that the timing is checked before
      checking the condition, which guarantees correct behavior even given
      indefinite delays.  It also inlines rcu_implicit_offline_qs() into
      rcu_implicit_dynticks_qs().
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Shrink RCU based on number of CPUs · b17c7035
      Paul E. McKenney committed
      Currently, rcu_init_geometry() only reshapes RCU's combining trees
      if the leaf fanout is changed at boot time.  This means that by
      default, kernels compiled with (say) NR_CPUS=4096 will keep oversized
      data structures, even when running on systems with (say) four CPUs.
      
      This commit therefore checks to see if the maximum number of CPUs on
      the actual running system (nr_cpu_ids) differs from NR_CPUS, and if so
      reshapes the combining trees accordingly.
      Reported-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Handle unbalanced rcu_node configurations with few CPUs · 4dbd6bb3
      Paul E. McKenney committed
      If CONFIG_RCU_FANOUT_EXACT=y, if there are not enough CPUs (according
      to nr_cpu_ids) to require more than a single rcu_node structure, but if
      NR_CPUS is larger than would fit into a single rcu_node structure, then
      the current rcu_init_levelspread() code is subject to integer overflow
      in the eight-bit ->levelspread[] array in the rcu_state structure.
      
      In this case, the solution is -not- to increase the size of the
      elements in this array because the values in that array should be
      constrained to the number of bits in an unsigned long.  Instead, this
      commit replaces NR_CPUS with nr_cpu_ids in the rcu_init_levelspread()
      function's initialization of the cprv local variable.  This results in
      all of the arithmetic being consistently based off of the nr_cpu_ids
      value, thus avoiding the overflow, which was caused by the mixing of
      nr_cpu_ids and NR_CPUS.
      Reported-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
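The overflow described above can be reproduced with a small model of the per-level spread computation. The sketch assumes the spread is a ceiling division stored into an eight-bit element, seeded from a CPU count passed in by the caller; the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the fanout ("spread") of each tree level, leaf level last.
 * levelcnt[i] is the number of nodes at level i; ncpus seeds the leaf level.
 * Each result is stored in an eight-bit element, so a quotient of 256 or
 * more is silently truncated.
 */
static void demo_init_levelspread(uint8_t *spread, const int *levelcnt,
                                  int nlevels, int ncpus)
{
    int cprv = ncpus;

    for (int i = nlevels - 1; i >= 0; i--) {
        int ccur = levelcnt[i];

        assert(ccur > 0);
        spread[i] = (uint8_t)((cprv + ccur - 1) / ccur);  /* ceiling division */
        cprv = ccur;
    }
}
```

With a single node and a seed of 4096 (a compile-time maximum), the quotient 4096 truncates to 0 in eight bits. Seeding with the actual CPU count, say 4, yields a spread of 4.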
    • rcu: Simplify quiescent-state detection · d7d6a11e
      Paul E. McKenney committed
      The current quiescent-state detection algorithm is needlessly
      complex.  It records the grace-period number corresponding to
      the quiescent state at the time of the quiescent state, which
      works, but it seems better to simply erase any record of previous
      quiescent states at the time that the CPU notices the new grace
      period.  This has the further advantage of removing another piece
      of RCU for which lockless reasoning is required.
      
      Therefore, this commit makes this change.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Adjust for unconditional ->completed assignment · 25d30cf4
      Paul E. McKenney committed
      Now that the rcu_node structures' ->completed fields are unconditionally
      assigned at grace-period cleanup time, they should already have the
      correct value for the new grace period at grace-period initialization
      time.  This commit therefore inserts a WARN_ON_ONCE() to verify this
      invariant.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Add random PROVE_RCU_DELAY to grace-period initialization · 661a85dc
      Paul E. McKenney committed
      Preemption greatly raised the probability of certain types of race
      conditions, so this commit adds an anti-heisenbug to greatly increase
      the collision cross section, also known as the probability of occurrence.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Fix day-zero grace-period initialization/cleanup race · 5d4b8659
      Paul E. McKenney committed
      The current approach to grace-period initialization is vulnerable to
      extremely low-probability races.  These races stem from the fact that
      the old grace period is marked completed on the same traversal through
      the rcu_node structure that is marking the start of the new grace period.
      This means that some rcu_node structures will believe that the old grace
      period is still in effect at the same time that other rcu_node structures
      believe that the new grace period has already started.
      
      These sorts of disagreements can result in too-short grace periods,
      as shown in the following scenario:
      
      1.	CPU 0 completes a grace period, but needs an additional
      	grace period, so starts initializing one, initializing all
      	the non-leaf rcu_node structures and the first leaf rcu_node
      	structure.  Because CPU 0 is both completing the old grace
      	period and starting a new one, it marks the completion of
      	the old grace period and the start of the new grace period
      	in a single traversal of the rcu_node structures.
      
      	Therefore, CPUs corresponding to the first rcu_node structure
      	can become aware that the prior grace period has completed, but
      	CPUs corresponding to the other rcu_node structures will see
      	this same prior grace period as still being in progress.
      
      2.	CPU 1 passes through a quiescent state, and therefore informs
      	the RCU core.  Because its leaf rcu_node structure has already
      	been initialized, this CPU's quiescent state is applied to the
      	new (and only partially initialized) grace period.
      
      3.	CPU 1 enters an RCU read-side critical section and acquires
      	a reference to data item A.  Note that this CPU believes that
      	its critical section started after the beginning of the new
      	grace period, and therefore will not block this new grace period.
      
      4.	CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
      	mode, other CPUs informed the RCU core of its extended quiescent
      	state for the past several grace periods.  This means that CPU 16
      	is not yet aware that these past grace periods have ended.  Assume
      	that CPU 16 corresponds to the second leaf rcu_node structure --
      	which has not yet been made aware of the new grace period.
      
      5.	CPU 16 removes data item A from its enclosing data structure
      	and passes it to call_rcu(), which queues a callback in the
      	RCU_NEXT_TAIL segment of the callback queue.
      
      6.	CPU 16 enters the RCU core, possibly because it has taken a
      	scheduling-clock interrupt, or alternatively because it has
      	more than 10,000 callbacks queued.  It notes that the second
      	most recent grace period has completed (recall that because it
      	corresponds to the second as-yet-uninitialized rcu_node structure,
      	it cannot yet become aware that the most recent grace period has
      	completed), and therefore advances its callbacks.  The callback
      	for data item A is therefore in the RCU_NEXT_READY_TAIL segment
      	of the callback queue.
      
      7.	CPU 0 completes initialization of the remaining leaf rcu_node
      	structures for the new grace period, including the structure
      	corresponding to CPU 16.
      
      8.	CPU 16 again enters the RCU core, again possibly because it has
      	taken a scheduling-clock interrupt, or alternatively because
      	it now has more than 10,000 callbacks queued.  It notes that
      	the most recent grace period has ended, and therefore advances
      	its callbacks.  The callback for data item A is therefore in
      	the RCU_WAIT_TAIL segment of the callback queue.
      
      9.	All CPUs other than CPU 1 pass through quiescent states.  Because
      	CPU 1 already passed through its quiescent state, the new grace
      	period completes.  Note that CPU 1 is still in its RCU read-side
      	critical section, still referencing data item A.
      
      10.	Suppose that CPU 2 was the last CPU to pass through a quiescent
      	state for the new grace period, and suppose further that CPU 2
      	did not have any callbacks queued, therefore not needing an
      	additional grace period.  CPU 2 therefore traverses all of the
      	rcu_node structures, marking the new grace period as completed,
      	but does not initialize a new grace period.
      
      11.	CPU 16 yet again enters the RCU core, yet again possibly because
      	it has taken a scheduling-clock interrupt, or alternatively
      	because it now has more than 10,000 callbacks queued.  It notes
      	that the new grace period has ended, and therefore advances
      	its callbacks.  The callback for data item A is therefore in
      	the RCU_DONE_TAIL segment of the callback queue.  This means
      	that this callback is now considered ready to be invoked.
      
      12.	CPU 16 invokes the callback, freeing data item A while CPU 1
      	is still referencing it.
      
      This scenario represents a day-zero bug for TREE_RCU.  This commit
      therefore ensures that the old grace period is marked completed in
      all leaf rcu_node structures before a new grace period is marked
      started in any of them.
      
      That said, it would have been insanely difficult to force this race to
      happen before the grace-period initialization process was preemptible.
      Therefore, this commit is not a candidate for -stable.
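
      The ordering rule stated above can be checked on a toy model.  The
      sketch below is not the kernel's code: struct leaf_model and the
      three helpers are hypothetical names, and each leaf is reduced to
      two counters.  It assumes that grace periods are numbered
      consecutively, so that "started" may run at most one ahead of the
      smallest "completed" value seen on any leaf.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a leaf rcu_node structure. */
struct leaf_model {
	unsigned long gpnum;		/* newest grace period marked started */
	unsigned long completed;	/* newest grace period marked completed */
};

/* Mark grace period gp completed on the first upto leaves. */
static void mark_completed(struct leaf_model *leaf, size_t upto,
			   unsigned long gp)
{
	for (size_t i = 0; i < upto; i++)
		leaf[i].completed = gp;
}

/* Mark grace period gp started on the first upto leaves. */
static void mark_started(struct leaf_model *leaf, size_t upto,
			 unsigned long gp)
{
	for (size_t i = 0; i < upto; i++)
		leaf[i].gpnum = gp;
}

/*
 * Ordering property: if any leaf shows grace period g as started,
 * then every leaf already shows g - 1 as completed.
 */
static bool ordering_holds(const struct leaf_model *leaf, size_t n)
{
	unsigned long max_started = 0;
	unsigned long min_completed = (unsigned long)-1;

	for (size_t i = 0; i < n; i++) {
		if (leaf[i].gpnum > max_started)
			max_started = leaf[i].gpnum;
		if (leaf[i].completed < min_completed)
			min_completed = leaf[i].completed;
	}
	return n == 0 || max_started <= min_completed + 1;
}
```

      Running mark_completed() over every leaf before calling
      mark_started() on any of them keeps ordering_holds() true even
      when the second pass is interrupted partway.  Doing both steps
      leaf by leaf does not, which is the window that the scenario
      above exploits.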
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      
      Conflicts:
      
      	kernel/rcutree.c
      5d4b8659
    • P
      rcu: Make rcutree module parameters visible in sysfs · 7e5c2dfb
      Paul E. McKenney committed
      The module parameters blimit, qhimark, and qlomark (and more
      recently, rcu_fanout_leaf) have permission masks of zero, so
      that their values are not visible from sysfs.  This is unnecessary
      and inconvenient to administrators who might like an easy way to
      see what these values are on a running system.  This commit therefore
      sets their permission masks to 0444, allowing them to be read but
      not written.
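
      Once the values are readable, a userspace tool can fetch one with
      ordinary file I/O.  The helper below is a minimal sketch:
      read_param() is a hypothetical name, and the path
      /sys/module/rcutree/parameters/qhimark mentioned afterwards is an
      assumption about where these parameters appear, not something
      this commit message states.

```c
#include <stdio.h>

/*
 * Read one integer from a sysfs-style parameter file.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 */
static int read_param(const char *path, long *value)
{
	FILE *f = fopen(path, "r");
	int parsed;

	if (f == NULL)
		return -1;
	parsed = (fscanf(f, "%ld", value) == 1);
	fclose(f);
	return parsed ? 0 : -1;
}
```

      A caller might pass "/sys/module/rcutree/parameters/qhimark"
      (assumed path) and print the result.  With a 0444 mask the
      parameter can be read this way but not written.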
      Reported-by: Rusty Russell <rusty@ozlabs.org>
      Reported-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      7e5c2dfb
    • P
      rcu: Control grace-period duration from sysfs · d40011f6
      Paul E. McKenney committed
      Although almost everyone is well-served by the defaults, some uses of RCU
      benefit from shorter grace periods, while others benefit more from the
      greater efficiency provided by longer grace periods.  Situations requiring
      a large number of grace periods to elapse (and wireshark startup has
      been called out as an example of this) are helped by lower-latency
      grace periods.  Furthermore, in some embedded applications, people are
      willing to accept a small degradation in update efficiency (due to there
      being more of the shorter grace-period operations) in order to gain the
      lower latency.
      
      In contrast, those few systems with thousands of CPUs need longer grace
      periods because the CPU overhead of a grace period rises roughly
      linearly with the number of CPUs.  Such systems normally do not make
      much use of facilities that require large numbers of grace periods to
      elapse, so this is a good tradeoff.
      
      Therefore, this commit allows the durations to be controlled from sysfs.
      There are two sysfs parameters, one named "jiffies_till_first_fqs" that
      specifies the delay in jiffies from the end of grace-period initialization
      until the first attempt to force quiescent states, and the other named
      "jiffies_till_next_fqs" that specifies the delay (again in jiffies)
      between subsequent attempts to force quiescent states.  They both default
      to three jiffies, which is compatible with the old hard-coded behavior.
      
      At some future time, it may be possible to automatically increase the
      grace-period length with the number of CPUs, but we do not yet have
      sufficient data to do a good job.  Preliminary data indicates that we
      should add an additional jiffy to each of the delays for every 200 CPUs
      in the system, but more experimentation is needed.  For now, the number
      of systems with more than 1,000 CPUs is small enough that this can be
      relegated to boot-time hand tuning.
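
      For hand tuning, the preliminary rule of thumb above reduces to
      one line of arithmetic.  The sketch below is illustrative only:
      suggested_fqs_jiffies() and both macros are hypothetical names,
      and rounding down for partial groups of 200 CPUs is an assumption,
      since the text does not say how to round.

```c
/* Default delay named above: three jiffies. */
#define DEFAULT_FQS_JIFFIES	3UL
/* Preliminary figure named above: one extra jiffy per 200 CPUs. */
#define CPUS_PER_EXTRA_JIFFY	200UL

/*
 * Candidate value for jiffies_till_first_fqs or jiffies_till_next_fqs
 * on a system with ncpus CPUs.  Integer division rounds down (assumed).
 */
static unsigned long suggested_fqs_jiffies(unsigned long ncpus)
{
	return DEFAULT_FQS_JIFFIES + ncpus / CPUS_PER_EXTRA_JIFFY;
}
```

      Under these assumptions a 1,000-CPU system would get 3 + 5 = 8
      jiffies for each delay, while anything below 200 CPUs keeps the
      default of three.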
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      d40011f6
    • P
      rcu: Prevent force_quiescent_state() memory contention · 394f2769
      Paul E. McKenney committed
      Large systems running RCU_FAST_NO_HZ kernels see extreme memory
      contention on the rcu_state structure's ->fqslock field.  This
      can be avoided by disabling RCU_FAST_NO_HZ, either at compile time
      or at boot time (via the nohz kernel boot parameter), but large
      systems will no doubt become sensitive to energy consumption.
      This commit therefore uses a combining-tree approach to spread the
      memory contention across new cache lines in the leaf rcu_node structures.
      This can be thought of as a tournament lock that has only a try-lock
      acquisition primitive.
      
      The effect on small systems is minimal, because such systems have
      an rcu_node "tree" consisting of a single node.  In addition, this
      functionality is not used on fastpaths.
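
      The try-lock-only tournament idea can be sketched in userspace
      with C11 atomics.  This is a simplified model rather than the
      kernel's implementation: struct fqs_node and every function name
      below are hypothetical, and the real code has further checks that
      are omitted here.  A caller starts at its own leaf and climbs
      toward the root, giving up as soon as any try-lock fails.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an rcu_node with its own lock word. */
struct fqs_node {
	struct fqs_node *parent;	/* NULL at the root */
	atomic_bool locked;		/* try-lock only, never spun on */
};

static void fqs_node_init(struct fqs_node *n, struct fqs_node *parent)
{
	n->parent = parent;
	atomic_init(&n->locked, false);
}

static bool fqs_trylock(struct fqs_node *n)
{
	return !atomic_exchange_explicit(&n->locked, true,
					 memory_order_acquire);
}

static void fqs_unlock(struct fqs_node *n)
{
	atomic_store_explicit(&n->locked, false, memory_order_release);
}

/*
 * Climb from leaf to root.  Returns the root, still locked, if the
 * caller won at every level, or NULL if it lost at some level.
 */
static struct fqs_node *fqs_funnel_trylock(struct fqs_node *leaf)
{
	struct fqs_node *held = NULL;
	struct fqs_node *n;

	for (n = leaf; n != NULL; n = n->parent) {
		bool won = fqs_trylock(n);

		if (held != NULL)
			fqs_unlock(held);	/* drop the lower level */
		if (!won)
			return NULL;		/* someone else is ahead */
		held = n;
	}
	return held;
}
```

      Losers touch only the cache lines between their own leaf and the
      level at which they lost, so most contention never reaches the
      root.  The winner does its work and then calls fqs_unlock() on
      the returned root.  With a single-node tree the leaf is the root,
      and the climb is one try-lock.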
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      394f2769
    • P
      rcu: Allow RCU quiescent-state forcing to be preempted · b4be093f
      Paul E. McKenney committed
      RCU quiescent-state forcing is currently carried out without preemption
      points, which can result in excessive latency spikes on large systems
      (many hundreds or thousands of CPUs).  This patch therefore inserts
      a voluntary preemption point into force_qs_rnp(), which should greatly
      reduce the magnitude of these spikes.
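
      The shape of such a change can be shown with a userspace analogy.
      The sketch below is not force_qs_rnp() itself: scan_leaves(),
      preemption_point(), and yield_count are hypothetical names, the
      qsmask array is a made-up stand-in for per-leaf state, and POSIX
      sched_yield() is used in place of whatever primitive the kernel
      patch calls.

```c
#include <sched.h>
#include <stddef.h>

static unsigned long yield_count;	/* instrumentation for this sketch */

/* Userspace stand-in for a voluntary preemption point. */
static void preemption_point(void)
{
	yield_count++;
	sched_yield();
}

/*
 * Walk every leaf, offering to give up the CPU once per leaf, and
 * count the bits still set in each leaf's mask.
 */
static unsigned long scan_leaves(const unsigned long *qsmask,
				 size_t nleaves)
{
	unsigned long pending = 0;

	for (size_t i = 0; i < nleaves; i++) {
		preemption_point();
		for (unsigned long m = qsmask[i]; m != 0; m &= m - 1)
			pending++;	/* one set bit per iteration */
	}
	return pending;
}
```

      Because the yield sits at the top of the outer loop, the longest
      stretch without a preemption opportunity is the work for one
      leaf instead of the whole scan.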
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      b4be093f