1. 26 September 2012 (7 commits)
    • rcu: Switch task's syscall hooks on context switch · 04e7e951
      Committed by Frederic Weisbecker
      Clear the syscall hook of a task when it is scheduled out so that if
      the task migrates, it doesn't run the syscall slow path on a CPU
      that might not need it.

      Also set the syscall hook on the next task if needed (see the
      sketch below).
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
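      A minimal sketch of the mechanism, assuming the per-task TIF_NOHZ
      flag selects the syscall slow path and a per-CPU rcu_dynticks flag
      gates the feature (field names are illustrative):

          /* Called from the scheduler with interrupts disabled. */
          void rcu_user_hooks_switch(struct task_struct *prev,
                                     struct task_struct *next)
          {
                  struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);

                  /* Only bother when this CPU uses userspace extended QS. */
                  if (!rdtp->ignore_user_qs) {
                          clear_tsk_thread_flag(prev, TIF_NOHZ); /* outgoing */
                          set_tsk_thread_flag(next, TIF_NOHZ);   /* incoming */
                  }
          }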
    • rcu: Ignore userspace extended quiescent state by default · 1e1a689f
      Committed by Frederic Weisbecker
      By default we don't want to enter an RCU extended quiescent
      state while in userspace, because doing so produces some overhead
      (e.g., use of the syscall slow path).  Keep it off by default,
      ready to run when a feature such as adaptive tickless needs it
      (see the sketch below).
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
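      A minimal sketch of the default-off gate, assuming an ignore_user_qs
      flag in the per-CPU rcu_dynticks structure (illustrative names):

          void rcu_user_enter(void)
          {
                  struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);

                  if (rdtp->ignore_user_qs)
                          return;         /* off by default: no userspace EQS */
                  rcu_eqs_enter(true);    /* enter the extended quiescent state */
          }

      A feature such as adaptive tickless would then clear ignore_user_qs
      on the CPUs it manages in order to switch the behavior on.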
    • rcu: Allow rcu_user_enter()/exit() to nest · c5d900bf
      Committed by Frederic Weisbecker
      Allow calls to rcu_user_enter() even if we are already
      in userspace (as seen by RCU) and allow calls to rcu_user_exit()
      even if we are already in the kernel.
      
      This makes the APIs more flexible for architectures to call.
      Exception entries, for example, won't need to know whether they came
      from userspace before calling rcu_user_exit() (see the sketch below).
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
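      A minimal sketch of the exit side, assuming an in_user flag in the
      per-CPU rcu_dynticks structure tracks whether RCU currently sees the
      CPU as being in userspace (illustrative names):

          void rcu_user_exit(void)
          {
                  unsigned long flags;
                  struct rcu_dynticks *rdtp;

                  local_irq_save(flags);
                  rdtp = &__get_cpu_var(rcu_dynticks);
                  if (rdtp->in_user) {    /* no-op if already in the kernel */
                          rdtp->in_user = false;
                          rcu_eqs_exit(true);
                  }
                  local_irq_restore(flags);
          }

      Because the call is idempotent, an exception entry can invoke it
      unconditionally.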
    • rcu: Settle config for userspace extended quiescent state · 2b1d5024
      Committed by Frederic Weisbecker
      Create a new config option under the RCU menu that puts
      CPUs into RCU extended quiescent state (as in dynticks
      idle mode) when they run in userspace.  This requires
      some contribution from architectures to hook into kernel
      and userspace boundaries (see the sketch below).
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
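      A minimal sketch of the resulting interface, assuming the option is
      named CONFIG_RCU_USER_QS and that architectures advertise support
      via HAVE_RCU_USER_QS (names follow this patch series):

          #ifdef CONFIG_RCU_USER_QS
          extern void rcu_user_enter(void);
          extern void rcu_user_exit(void);
          #else
          static inline void rcu_user_enter(void) { }
          static inline void rcu_user_exit(void) { }
          #endif

      Architectures that select HAVE_RCU_USER_QS then call these hooks at
      their kernel/userspace boundaries.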
    • rcu: Make RCU_FAST_NO_HZ handle adaptive ticks · 9a0c6fef
      Committed by Paul E. McKenney
      The current implementation of RCU_FAST_NO_HZ tries reasonably hard to rid
      the current CPU of RCU callbacks.  This is appropriate when the CPU is
      entering idle, where it has little else useful to do anyway, but is most
      definitely not what you want when transitioning to user-mode execution.
      This commit therefore detects the adaptive-tick case, and refrains from
      burning CPU time getting rid of RCU callbacks in that case.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: New rcu_user_enter_after_irq() and rcu_user_exit_after_irq() APIs · 19dd1591
      Committed by Frederic Weisbecker
      In some cases, it is necessary to enter or exit userspace-RCU-idle mode
      from an interrupt handler, for example, if some other CPU sends this
      CPU a resched IPI.  In this case, the current CPU would enter the IPI
      handler in userspace-RCU-idle mode, but would need to exit the IPI handler
      after having exited that mode.
      
      To allow this to work, this commit adds two new APIs to TREE_RCU:
      
      - rcu_user_enter_after_irq(). This must be called from an interrupt between
      rcu_irq_enter() and rcu_irq_exit().  After the irq calls rcu_irq_exit(),
      the irq handler will return into an RCU extended quiescent state.
      In theory, this interrupt is never a nested interrupt, but in practice
      it might interrupt softirq, which looks to RCU like a nested interrupt.
      
      - rcu_user_exit_after_irq(). This must be called from a non-nesting
      interrupt, interrupting an RCU extended quiescent state, also
      between rcu_irq_enter() and rcu_irq_exit(). After the irq calls
      rcu_irq_exit(), the irq handler will return in an RCU non-quiescent
      state (see the usage sketch below).
      
      [ Combined with "Allow calls to rcu_exit_user_irq from nesting irqs." ]
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
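      A minimal usage sketch for the second API, assuming a resched IPI
      arrives while the CPU is in userspace-RCU-idle mode (the handler
      name is illustrative):

          void resched_ipi_handler(void)
          {
                  rcu_irq_enter();            /* temporarily leave the EQS */
                  /*
                   * The CPU must stay in the kernel to reschedule, so it
                   * must not return to the extended quiescent state:
                   */
                  rcu_user_exit_after_irq();
                  rcu_irq_exit();             /* returns non-quiescent */
          }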
    • rcu: New rcu_user_enter() and rcu_user_exit() APIs · adf5091e
      Committed by Frederic Weisbecker
      RCU currently insists that only idle tasks can enter RCU idle mode, which
      prohibits an adaptive tickless kernel (AKA nohz cpusets), which in turn
      would mean that usermode execution would always take scheduling-clock
      interrupts, even when there is only one task runnable on the CPU in
      question.
      
      This commit therefore adds rcu_user_enter() and rcu_user_exit(), which
      allow non-idle tasks to enter RCU idle mode.  These are quite similar
      to rcu_idle_enter() and rcu_idle_exit(), respectively, except that they
      omit the idle-task checks (see the sketch below).
      
      [ Updated to use "user" flag rather than separate check functions. ]
      
      [ paulmck: Updated to drop exports of new functions based on Josh's patch
        getting rid of the need for them. ]
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
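      A minimal sketch of the pair as described above: they mirror
      rcu_idle_enter()/rcu_idle_exit() but pass a "user" flag that skips
      the idle-task check (rcu_eqs_enter()/rcu_eqs_exit() stand for the
      shared helpers):

          void rcu_user_enter(void)
          {
                  rcu_eqs_enter(true);  /* user == true: non-idle task OK */
          }

          void rcu_user_exit(void)
          {
                  rcu_eqs_exit(true);
          }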
  2. 23 September 2012 (33 commits)
    • sched: Fix load avg vs cpu-hotplug · 5d180232
      Committed by Peter Zijlstra
      Rakib and Paul reported two different issues related to the same few
      lines of code.

      Rakib's issue is that the nr_uninterruptible migration code is wrong,
      in that he sees artifacts due to it (Rakib, please do expand in more
      detail).
      
      Paul's issue is that this code as it stands relies on us using
      stop_machine() for unplug; we would all like to remove this assumption
      so that eventually we can remove this stop_machine() usage altogether.
      
      The only reason we'd have to migrate nr_uninterruptible is so that we
      could use for_each_online_cpu() loops in favour of
      for_each_possible_cpu() loops; however, since nr_uninterruptible() is
      the only such loop and it's using possible, let's not bother at all.
      
      The problem Rakib sees is (probably) caused by the fact that by
      migrating nr_uninterruptible we screw up rq->calc_load_active for both
      rqs involved.
      
      So don't bother with fancy migration schemes (meaning we now have to
      keep using for_each_possible_cpu()) and instead fold any nr_active delta
      after we migrate all tasks away to make sure we don't have any skewed
      nr_active accounting (see the sketch below).
      
      [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
      miscounting noted by Rakib. ]
      Reported-by: Rakib Mullick <rakib.mullick@gmail.com>
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
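      A minimal sketch of the fold-after-migration approach, reusing the
      existing calc_load_fold_active() helper and called from the CPU_DEAD
      notifier after all tasks have been migrated away:

          static void calc_load_migrate(struct rq *rq)
          {
                  long delta = calc_load_fold_active(rq);

                  if (delta)      /* fold the dead CPU's contribution globally */
                          atomic_long_add(delta, &calc_load_tasks);
          }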
    • rcu: Disallow callback registry on offline CPUs · 0d8ee37e
      Committed by Paul E. McKenney
      Posting a callback after the CPU_DEAD notifier effectively leaks
      that callback unless/until that CPU comes back online.  Silence is
      unhelpful when attempting to track down such leaks, so this commit emits
      a WARN_ON_ONCE() and unconditionally leaks the callback when an offline
      CPU attempts to register a callback.  The rdp->nxttail[RCU_NEXT_TAIL]
      pointer is set to NULL in the CPU_DEAD notifier and restored in the
      CPU_UP_PREPARE notifier, allowing __call_rcu() to determine exactly
      when posting callbacks is illegal (see the sketch below).
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
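      A minimal sketch of the check in __call_rcu(), relying on the NULLed
      RCU_NEXT_TAIL pointer as the "offline" indicator:

          if (unlikely(rdp->nxttail[RCU_NEXT_TAIL] == NULL)) {
                  /* Registering a callback on an offline CPU. */
                  WARN_ON_ONCE(1);
                  local_irq_restore(flags);
                  return;         /* leak the callback, but warn about it */
          }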
    • rcu: Remove _rcu_barrier() dependency on __stop_machine() · 1331e7a1
      Committed by Paul E. McKenney
      Currently, _rcu_barrier() relies on preempt_disable() to prevent
      any CPU from going offline, which in turn depends on CPU hotplug's
      use of __stop_machine().
      
      This patch therefore makes _rcu_barrier() use get_online_cpus() to
      block CPU-hotplug operations.  This has the added benefit of removing
      the need for _rcu_barrier() to adopt callbacks:  Because CPU-hotplug
      operations are excluded, there can be no callbacks to adopt.  This
      commit simplifies the code accordingly (see the sketch below).
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
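      A minimal sketch of the new exclusion, assuming rcu_barrier_func()
      queues the per-CPU barrier callback as before:

          get_online_cpus();      /* no CPU can go offline from here... */
          for_each_online_cpu(cpu)
                  smp_call_function_single(cpu, rcu_barrier_func,
                                           (void *)rsp, 1);
          put_online_cpus();      /* ...until here */
          /* With hotplug excluded, there are no callbacks to adopt. */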
    • rcu: Fix CONFIG_RCU_FAST_NO_HZ stall warning message · 86f343b5
      Committed by Paul E. McKenney
      The print_cpu_stall_fast_no_hz() function attempts to print -1 when
      the ->idle_gp_timer is not pending, but unsigned arithmetic causes it
      to instead print ULONG_MAX, which is 4294967295 on 32-bit systems and
      18446744073709551615 on 64-bit systems.  Neither of these are the most
      reader-friendly values, so this commit instead causes "timer not pending"
      to be printed when ->idle_gp_timer is not pending (see the sketch
      below).
      Reported-by: Paul Walmsley <paul@pwsan.com>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
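      A minimal sketch of the friendlier output (format details are
      illustrative):

          if (timer_pending(&rdtp->idle_gp_timer))
                  sprintf(cp, "timer=%lu",
                          rdtp->idle_gp_timer.expires - jiffies);
          else
                  sprintf(cp, "timer not pending");  /* not ULONG_MAX */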
    • rcu: Move TINY_RCU quiescent state out of extended quiescent state · 22a76726
      Committed by Li Zhong
      TINY_RCU's rcu_idle_enter_common() invokes rcu_sched_qs() in order
      to inform the RCU core of the quiescent state implied by idle entry.
      Of course, idle is also an extended quiescent state, so that the call
      to rcu_sched_qs() speeds up RCU's invoking of any callbacks that might
      be queued.  This speed-up is important when entering into dyntick-idle
      mode -- if there are no further scheduling-clock interrupts, the callbacks
      might never be invoked, which could result in a system hang.
      
      However, processing callbacks does event tracing, which in turn
      implies RCU read-side critical sections, which are illegal in extended
      quiescent states.  This patch therefore moves the call to rcu_sched_qs()
      so that it precedes the point at which we inform lockdep that RCU has
      entered an extended quiescent state.
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • time: RCU permitted to stop idle entry via softirq · 803b0eba
      Committed by Paul E. McKenney
      The can_stop_idle_tick() function complains if a softirq vector is
      raised too late in the idle-entry process, presumably in order to
      prevent dangling softirq invocations from being delayed across the
      full idle period, which might be indefinitely long -- and if softirq
      was asserted any later than the call to this function, such a delay
      might well happen.
      
      However, RCU needs to be able to use softirq to stop idle entry in
      order to be able to drain RCU callbacks from the current CPU, which in
      turn enables faster entry into dyntick-idle mode, which in turn reduces
      power consumption.  Because RCU takes this action at a well-defined
      point in the idle-entry path, it is safe for RCU to take this approach.
      
      This commit therefore silences the error message that is sometimes
      produced when the going-idle CPU suddenly finds that it has an RCU_SOFTIRQ
      to process.  The error message will continue to be issued for other
      softirq vectors (see the sketch below).
      Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
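      A minimal sketch of the silenced check in can_stop_idle_tick(),
      assuming a mask that excludes only RCU_SOFTIRQ:

          #define SOFTIRQ_STOP_IDLE_MASK (~(1 << RCU_SOFTIRQ))

          if (unlikely(local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
                  pr_warn("NOHZ: local_softirq_pending %02x\n",
                          (unsigned int)local_softirq_pending());
                  return false;   /* other vectors still stop idle entry */
          }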
    • rcu: Move TINY_PREEMPT_RCU away from raw_local_irq_save() · 7a11e205
      Committed by Paul E. McKenney
      The use of raw_local_irq_save() is unnecessary, given that local_irq_save()
      really does disable interrupts.  Also, it appears to interfere with lockdep.
      Therefore, this commit moves to local_irq_save().
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Fengguang Wu <fengguang.wu@intel.com>
    • rcu: Remove redundant memory barrier from __call_rcu() · fdab649b
      Committed by Paul E. McKenney
      The first memory barrier in __call_rcu() is supposed to order any
      updates done beforehand by the caller against the actual queuing
      of the callback.  However, the second memory barrier (which is intended
      to order incrementing the queue lengths before queuing the callback)
      is also between the caller's updates and the queuing of the callback.
      The second memory barrier can therefore serve both purposes.
      
      This commit therefore removes the first memory barrier.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Avoid spurious RCU CPU stall warnings · c96ea7cf
      Committed by Paul E. McKenney
      If a given CPU avoids the idle loop but also avoids starting a new
      RCU grace period for a full minute, RCU can issue spurious RCU CPU
      stall warnings.  This commit fixes this issue by adding a check for
      ongoing grace period to avoid these spurious stall warnings.
      Reported-by: Becky Bruce <bgillbruce@gmail.com>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Protect rcu_node accesses during CPU stall warnings · c8020a67
      Committed by Paul E. McKenney
      The print_other_cpu_stall() function accesses a number of rcu_node
      fields without protection from the ->lock.  In theory, this is not
      a problem because the fields accessed are all integers, but in
      practice the compiler can get nasty.  Therefore, the commit extends
      the existing critical section to cover the entire loop body.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Avoid rcu_print_detail_task_stall_rnp() segfault · 5fd4dc06
      Committed by Paul E. McKenney
      The rcu_print_detail_task_stall_rnp() function invokes
      rcu_preempt_blocked_readers_cgp() to verify that there are some preempted
      RCU readers blocking the current grace period outside of the protection
      of the rcu_node structure's ->lock.  This means that the last blocked
      reader might exit its RCU read-side critical section and remove itself
      from the ->blkd_tasks list before the ->lock is acquired, resulting in
      a segmentation fault when the subsequent code attempts to dereference
      the now-NULL gp_tasks pointer.
      
      This commit therefore moves the test under the lock.  This will not
      have measurable effect on lock contention because this code is invoked
      only when printing RCU CPU stall warnings, in other words, in the common
      case, never.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Apply for_each_rcu_flavor() to increment_cpu_stall_ticks() · 115f7a7c
      Committed by Paul E. McKenney
      The increment_cpu_stall_ticks() function listed each RCU flavor
      explicitly, with an ifdef to handle preemptible RCU.  This commit
      therefore applies for_each_rcu_flavor() to save a line of code.
      
      Because this commit switches from a code-based enumeration of the
      flavors of RCU to an rcu_state-list-based enumeration, it is no longer
      possible to apply __get_cpu_var() to the per-CPU rcu_data structures.
      We instead use __this_cpu_var() on the rcu_state structure's ->rda field
      that references the corresponding rcu_data structures.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Fix obsolete rcu_initiate_boost() header comment · b065a853
      Committed by Paul E. McKenney
      Commit 1217ed1b (rcu: permit rcu_read_unlock() to be called while holding
      runqueue locks) made rcu_initiate_boost() restore irq state when releasing
      the rcu_node structure's ->lock, but failed to update the header comment
      accordingly.  This commit therefore brings the header comment up to date.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Make offline-CPU checking allow for indefinite delays · a82dcc76
      Committed by Paul E. McKenney
      The rcu_implicit_offline_qs() function implicitly assumed that execution
      would progress predictably when interrupts are disabled, which is of course
      not guaranteed when running on a hypervisor.  Furthermore, this function
      is short, and is called from one place only in a short function.
      
      This commit therefore ensures that the timing is checked before
      checking the condition, which guarantees correct behavior even given
      indefinite delays.  It also inlines rcu_implicit_offline_qs() into
      rcu_implicit_dynticks_qs().
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Improve boost selection when moving tasks to root rcu_node · 5cc900cf
      Committed by Paul E. McKenney
      The rcu_preempt_offline_tasks() function moves all tasks queued on a
      given leaf rcu_node structure to the root rcu_node, which is done when
      the last CPU corresponding to the leaf rcu_node structure goes offline.  Now that
      RCU-preempt's synchronize_rcu_expedited() implementation blocks CPU-hotplug
      operations during the initialization of each rcu_node structure's
      ->boost_tasks pointer, rcu_preempt_offline_tasks() can do a better job
      of setting the root rcu_node's ->boost_tasks pointer.
      
      The key point is that rcu_preempt_offline_tasks() runs as part of the
      CPU-hotplug process, so that a concurrent synchronize_rcu_expedited()
      is guaranteed to either have not started on the one hand (in which case
      there is no boosting on behalf of the expedited grace period) or to be
      completely initialized on the other (in which case, in the absence of
      other priority boosting, all ->boost_tasks pointers will be initialized).
      Therefore, if rcu_preempt_offline_tasks() finds that the ->boost_tasks
      pointer is equal to the ->exp_tasks pointer, it can be sure that it is
      correctly placed.
      
      In the case where there was boosting ongoing at the time that the
      synchronize_rcu_expedited() function started, different nodes might start
      boosting the tasks blocking the expedited grace period at different times.
      In this mixed case, the root node will either be boosting tasks for
      the expedited grace period already, or it will start as soon as it gets
      done boosting for the normal grace period -- but in this latter case,
      the root node's tasks needed to be boosted in any case.
      
      This commit therefore adds a check of the ->boost_tasks pointer
      against the ->exp_tasks pointer to the set of conditions that prevent
      updating ->boost_tasks.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Permit RCU_NONIDLE() to be used from interrupt context · b4270ee3
      Committed by Paul E. McKenney
      There is a need to use RCU from interrupt context, but either before
      rcu_irq_enter() is called or after rcu_irq_exit() is called.  If the
      interrupt occurs from idle, then lockdep-RCU will complain about such
      uses, as they appear to be illegal uses of RCU from the idle loop.
      In other environments, RCU_NONIDLE() could be used to properly protect
      the use of RCU, but RCU_NONIDLE() currently cannot be invoked except
      from process context.
      
      This commit therefore modifies RCU_NONIDLE() to permit its use more
      globally.
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Properly initialize ->boost_tasks on CPU offline · 1e3fd2b3
      Committed by Paul E. McKenney
      When rcu_preempt_offline_tasks() clears tasks from a leaf rcu_node
      structure, it does not NULL out the structure's ->boost_tasks field.
      This commit therefore fixes this issue.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Pull TINY_RCU dyntick-idle tracing into non-idle region · 818615c4
      Committed by Paul E. McKenney
      Because TINY_RCU's idle detection keys directly off of the nesting
      level, rather than from a separate variable as in TREE_RCU, the
      TINY_RCU dyntick-idle tracing on transition to idle must happen
      before the change to the nesting level.  This commit therefore makes
      this change by passing the desired new value (rather than the old value)
      of the nesting level in to rcu_idle_enter_common().
      
      [ paulmck: Add fix for wrong-variable bug spotted by
        Michael Wang <wangyun@linux.vnet.ibm.com>. ]
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Add PROVE_RCU_DELAY to provoke difficult races · e3ebfb96
      Committed by Paul E. McKenney
      There have been some recent bugs that were triggered only when
      preemptible RCU's __rcu_read_unlock() was preempted just after setting
      ->rcu_read_lock_nesting to INT_MIN, which is a low-probability event.
      Therefore, reproducing those bugs (to say nothing of gaining confidence
      in alleged fixes) was quite difficult.  This commit therefore creates
      a new debug-only RCU kernel config option that forces a short delay
      in __rcu_read_unlock() to increase the probability of those sorts of
      bugs occurring.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Prevent initialization race in rcutorture kthreads · 60f53782
      Committed by Paul E. McKenney
      When you do something like "t = kthread_run(...)", it is possible that
      the kthread will start running before the assignment to "t" happens.
      If the child kthread expects to find a pointer to its task_struct in "t",
      it will then be fatally disappointed.  This commit therefore switches
      such cases to kthread_create() followed by wake_up_process(), guaranteeing
      that the assignment happens before the child kthread starts running
      (see the sketch below).
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
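      A minimal sketch of the pattern change (rcu_torture_writer() is one
      of the affected kthreads):

          /* Racy: the kthread may run before writer_task is assigned. */
          writer_task = kthread_run(rcu_torture_writer, NULL,
                                    "rcu_torture_writer");

          /* Safe: publish the pointer, then let the kthread run. */
          t = kthread_create(rcu_torture_writer, NULL,
                             "rcu_torture_writer");
          if (IS_ERR(t))
                  return PTR_ERR(t);
          writer_task = t;
          wake_up_process(t);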
    • rcu: Switch rcutorture to pr_alert() and friends · 2caa1e44
      Committed by Paul E. McKenney
      Drop a few characters by switching kernel/rcutorture.c from
      "printk(KERN_ALERT" to "pr_alert(".
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Track CPU-hotplug duration statistics · 13dbf914
      Committed by Paul E. McKenney
      Many rcutorture runs include CPU-hotplug operations in their stress
      testing.  This commit accumulates statistics on the durations of these
      operations in deference to the recent concern about the overhead and
      latency of these operations.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Update rcutorture defaults · ab840f7a
      Committed by Paul E. McKenney
      A number of new features have been added to rcutorture over the years, but
      the defaults have not been updated to include them.  This commit therefore
      turns on a couple of them that have proven helpful and trustworthy, namely
      periodic progress reports and testing of NO_HZ.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Shrink RCU based on number of CPUs · b17c7035
      Committed by Paul E. McKenney
      Currently, rcu_init_geometry() only reshapes RCU's combining trees
      if the leaf fanout is changed at boot time.  This means that by
      default, kernels compiled with (say) NR_CPUS=4096 will keep oversized
      data structures, even when running on systems with (say) four CPUs.
      
      This commit therefore checks to see if the maximum number of CPUs on
      the actual running system (nr_cpu_ids) differs from NR_CPUS, and if so
      reshapes the combining trees accordingly.
      Reported-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Handle unbalanced rcu_node configurations with few CPUs · 4dbd6bb3
      Committed by Paul E. McKenney
      If CONFIG_RCU_FANOUT_EXACT=y, if there are not enough CPUs (according
      to nr_cpu_ids) to require more than a single rcu_node structure, but if
      NR_CPUS is larger than would fit into a single rcu_node structure, then
      the current rcu_init_levelspread() code is subject to integer overflow
      in the eight-bit ->levelspread[] array in the rcu_state structure.
      
      In this case, the solution is -not- to increase the size of the
      elements in this array because the values in that array should be
      constrained to the number of bits in an unsigned long.  Instead, this
      commit replaces NR_CPUS with nr_cpu_ids in the rcu_init_levelspread()
      function's initialization of the cprv local variable.  This results in
      all of the arithmetic being consistently based off of the nr_cpu_ids
      value, thus avoiding the overflow, which was caused by the mixing of
      nr_cpu_ids and NR_CPUS.
      Reported-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Simplify quiescent-state detection · d7d6a11e
      Committed by Paul E. McKenney
      The current quiescent-state detection algorithm is needlessly
      complex.  It records the grace-period number corresponding to
      the quiescent state at the time of the quiescent state, which
      works, but it seems better to simply erase any record of previous
      quiescent states at the time that the CPU notices the new grace
      period.  This has the further advantage of removing another piece
      of RCU for which lockless reasoning is required.
      
      Therefore, this commit makes this change.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Reduce synchronize_rcu_expedited() latency · 1943c89d
      Committed by Paul E. McKenney
      The synchronize_rcu_expedited() function disables interrupts across a
      scan of all leaf rcu_node structures, which is not good for real-time
      scheduling latency on large systems (hundreds or especially thousands
      of CPUs).  This commit therefore holds off CPU-hotplug operations using
      get_online_cpus(), and removes the prior acquisition of the ->onofflock
      (which required disabling interrupts).
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Eliminate signed overflow in synchronize_rcu_expedited() · bcfa57ce
      Committed by Paul E. McKenney
      In the C language, signed overflow is undefined.  It is true that
      twos-complement arithmetic normally comes to the rescue, but the
      compiler can subvert this any time it has any information about the
      values being compared.  For example, given "if (a - b > 0)", if the
      compiler has enough information to realize that (for example) the
      value of "a" is positive and that of "b" is negative, the compiler
      is within its rights to optimize this to a simple "if (1)", which
      might not be what you want.
      
      This commit therefore converts synchronize_rcu_expedited()'s work-done
      detection counter from signed to unsigned (see the sketch below).
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
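      A minimal sketch of the conversion, using the kernel's
      wraparound-safe unsigned comparison helper (the counter names are
      illustrative):

          /*
           * Signed: "if (done - snap > 0)" may overflow, which is
           * undefined behavior the compiler can exploit.  Instead:
           */
          unsigned long snap = ACCESS_ONCE(expedited_start);
          unsigned long done = ACCESS_ONCE(expedited_done);

          if (ULONG_CMP_GE(done, snap))
                  return;         /* someone else did our work for us */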
    • rcu: Adjust for unconditional ->completed assignment · 25d30cf4
      Committed by Paul E. McKenney
      Now that the rcu_node structures' ->completed fields are unconditionally
      assigned at grace-period cleanup time, they should already have the
      correct value for the new grace period at grace-period initialization
      time.  This commit therefore inserts a WARN_ON_ONCE() to verify this
      invariant.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Add random PROVE_RCU_DELAY to grace-period initialization · 661a85dc
      Committed by Paul E. McKenney
      Preemption greatly raised the probability of certain types of race
      conditions, so this commit adds an anti-heisenbug to greatly increase
      the collision cross section, also known as the probability of occurrence.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Fix day-zero grace-period initialization/cleanup race · 5d4b8659
      Committed by Paul E. McKenney
      The current approach to grace-period initialization is vulnerable to
      extremely low-probability races.  These races stem from the fact that
      the old grace period is marked completed on the same traversal through
      the rcu_node structure that is marking the start of the new grace period.
      This means that some rcu_node structures will believe that the old grace
      period is still in effect at the same time that other rcu_node structures
      believe that the new grace period has already started.
      
      These sorts of disagreements can result in too-short grace periods,
      as shown in the following scenario:
      
      1.	CPU 0 completes a grace period, but needs an additional
      	grace period, so starts initializing one, initializing all
      	the non-leaf rcu_node structures and the first leaf rcu_node
      	structure.  Because CPU 0 is both completing the old grace
      	period and starting a new one, it marks the completion of
      	the old grace period and the start of the new grace period
      	in a single traversal of the rcu_node structures.
      
      	Therefore, CPUs corresponding to the first rcu_node structure
      	can become aware that the prior grace period has completed, but
      	CPUs corresponding to the other rcu_node structures will see
      	this same prior grace period as still being in progress.
      
      2.	CPU 1 passes through a quiescent state, and therefore informs
      	the RCU core.  Because its leaf rcu_node structure has already
      	been initialized, this CPU's quiescent state is applied to the
      	new (and only partially initialized) grace period.
      
      3.	CPU 1 enters an RCU read-side critical section and acquires
      	a reference to data item A.  Note that this CPU believes that
      	its critical section started after the beginning of the new
      	grace period, and therefore will not block this new grace period.
      
      4.	CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
      	mode, other CPUs informed the RCU core of its extended quiescent
      	state for the past several grace periods.  This means that CPU 16
      	is not yet aware that these past grace periods have ended.  Assume
      	that CPU 16 corresponds to the second leaf rcu_node structure --
      	which has not yet been made aware of the new grace period.
      
      5.	CPU 16 removes data item A from its enclosing data structure
      	and passes it to call_rcu(), which queues a callback in the
      	RCU_NEXT_TAIL segment of the callback queue.
      
      6.	CPU 16 enters the RCU core, possibly because it has taken a
      	scheduling-clock interrupt, or alternatively because it has
      	more than 10,000 callbacks queued.  It notes that the second
      	most recent grace period has completed (recall that because it
      	corresponds to the second as-yet-uninitialized rcu_node structure,
      	it cannot yet become aware that the most recent grace period has
      	completed), and therefore advances its callbacks.  The callback
      	for data item A is therefore in the RCU_NEXT_READY_TAIL segment
      	of the callback queue.
      
      7.	CPU 0 completes initialization of the remaining leaf rcu_node
      	structures for the new grace period, including the structure
      	corresponding to CPU 16.
      
      8.	CPU 16 again enters the RCU core, again, possibly because it has
      	taken a scheduling-clock interrupt, or alternatively because
      	it now has more than 10,000 callbacks queued.	It notes that
      	the most recent grace period has ended, and therefore advances
      	its callbacks.	The callback for data item A is therefore in
      	the RCU_DONE_TAIL segment of the callback queue.
      
      9.	All CPUs other than CPU 1 pass through quiescent states.  Because
      	CPU 1 already passed through its quiescent state, the new grace
      	period completes.  Note that CPU 1 is still in its RCU read-side
      	critical section, still referencing data item A.
      
      10.	Suppose that CPU 2 was the last CPU to pass through a quiescent
      	state for the new grace period, and suppose further that CPU 2
      	did not have any callbacks queued, therefore not needing an
      	additional grace period.  CPU 2 therefore traverses all of the
      	rcu_node structures, marking the new grace period as completed,
      	but does not initialize a new grace period.
      
      11.	CPU 16 yet again enters the RCU core, yet again possibly because
      	it has taken a scheduling-clock interrupt, or alternatively
      	because it now has more than 10,000 callbacks queued.	It notes
      	that the new grace period has ended, and therefore advances
      	its callbacks.	The callback for data item A is therefore in
      	the RCU_DONE_TAIL segment of the callback queue.  This means
      	that this callback is now considered ready to be invoked.
      
      12.	CPU 16 invokes the callback, freeing data item A while CPU 1
      	is still referencing it.
      
      This scenario represents a day-zero bug for TREE_RCU.  This commit
      therefore ensures that the old grace period is marked completed in
      all leaf rcu_node structures before a new grace period is marked
      started in any of them.
      
      That said, it would have been insanely difficult to force this race to
      happen before the grace-period initialization process was preemptible.
      Therefore, this commit is not a candidate for -stable.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      
      Conflicts:
      
      	kernel/rcutree.c
    • rcu: Make rcutree module parameters visible in sysfs · 7e5c2dfb
      Committed by Paul E. McKenney
      The module parameters blimit, qhimark, and qlowmark (and, more
      recently, rcu_fanout_leaf) have permission masks of zero, so
      that their values are not visible from sysfs.  This is unnecessary
      and inconvenient to administrators who might like an easy way to
      see what these values are on a running system.  This commit therefore
      sets their permission masks to 0444, allowing them to be read but
      not written (see the sketch below).
      Reported-by: Rusty Russell <rusty@ozlabs.org>
      Reported-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
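      A minimal sketch of the change for the affected parameters:

          module_param(blimit, int, 0444);    /* was 0: invisible in sysfs */
          module_param(qhimark, int, 0444);
          module_param(qlowmark, int, 0444);
          module_param(rcu_fanout_leaf, int, 0444);

      The values then appear under /sys/module/rcutree/parameters/.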
    • rcu: Control grace-period duration from sysfs · d40011f6
      Committed by Paul E. McKenney
      Although almost everyone is well-served by the defaults, some uses of RCU
      benefit from shorter grace periods, while others benefit more from the
      greater efficiency provided by longer grace periods.  Situations requiring
      a large number of grace periods to elapse (and wireshark startup has
      been called out as an example of this) are helped by lower-latency
      grace periods.  Furthermore, in some embedded applications, people are
      willing to accept a small degradation in update efficiency (due to there
      being more of the shorter grace-period operations) in order to gain the
      lower latency.
      
      In contrast, those few systems with thousands of CPUs need longer grace
      periods because the CPU overhead of a grace period rises roughly
      linearly with the number of CPUs.  Such systems normally do not make
      much use of facilities that require large numbers of grace periods to
      elapse, so this is a good tradeoff.
      
      Therefore, this commit allows the durations to be controlled from sysfs.
      There are two sysfs parameters, one named "jiffies_till_first_fqs" that
      specifies the delay in jiffies from the end of grace-period initialization
      until the first attempt to force quiescent states, and the other named
      "jiffies_till_next_fqs" that specifies the delay (again in jiffies)
      between subsequent attempts to force quiescent states.  They both default
      to three jiffies, which is compatible with the old hard-coded behavior
      (see the sketch below).
      
      At some future time, it may be possible to automatically increase the
      grace-period length with the number of CPUs, but we do not yet have
      sufficient data to do a good job.  Preliminary data indicates that we
      should add an additional jiffy to each of the delays for every 200 CPUs
      in the system, but more experimentation is needed.  For now, the number
      of systems with more than 1,000 CPUs is small enough that this can be
      relegated to boot-time hand tuning.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
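      A minimal sketch of the two knobs, matching the described defaults
      (writable module parameters, hence visible and tunable via sysfs):

          static ulong jiffies_till_first_fqs = 3; /* delay to first FQS scan */
          static ulong jiffies_till_next_fqs = 3;  /* delay between FQS scans */
          module_param(jiffies_till_first_fqs, ulong, 0644);
          module_param(jiffies_till_next_fqs, ulong, 0644);

      An administrator could then tune a running system, for example:
      echo 1 > /sys/module/rcutree/parameters/jiffies_till_first_fqs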