1. 28 Aug 2014 (1 commit)
  2. 17 Jul 2014 (1 commit)
    • rcu: Allow for NULL tick_nohz_full_mask when nohz_full= missing · 187497fa
      Committed by Paul E. McKenney
      If there isn't a nohz_full= kernel parameter specified, then
      tick_nohz_full_mask can legitimately be NULL.  This can cause
      problems when RCU's boot code tries to cpumask_or() this value into
      rcu_nocb_mask.  In addition, if NO_HZ_FULL_ALL=y, there is no point
      in doing the cpumask_or() in the first place, because that setting
      implies RCU_NOCB_CPU_ALL=y, which in turn means that rcu_nocb_mask
      already has all bits set.
      
      This commit therefore avoids the cpumask_or() if NO_HZ_FULL_ALL=y and
      otherwise checks for !tick_nohz_full_running; this latter check catches
      the case where no nohz_full= kernel parameter was specified.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
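
      A minimal C sketch of the guard described above; the exact placement and
      surrounding code are assumptions based on the commit text, not the
      verbatim upstream diff:

      #if !defined(CONFIG_NO_HZ_FULL_ALL)
      	/*
      	 * Fold the nohz_full CPUs into rcu_nocb_mask only when a
      	 * nohz_full= parameter was actually supplied, so that a NULL
      	 * tick_nohz_full_mask is never passed to cpumask_or().
      	 */
      	if (tick_nohz_full_running)
      		cpumask_or(rcu_nocb_mask, rcu_nocb_mask, tick_nohz_full_mask);
      #endif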
  3. 10 Jul 2014 (6 commits)
    • rcu: Fix a sparse warning in rcu_report_unblock_qs_rnp() · b41d1b92
      Committed by Pranith Kumar
      This commit annotates rcu_report_unblock_qs_rnp() in order to fix the
      following sparse warning:
      
      kernel/rcu/tree_plugin.h:990:13: warning: context imbalance in 'rcu_report_unblock_qs_rnp' - unexpected unlock
      Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • rcu: Fix a sparse warning in rcu_initiate_boost() · 615e41c6
      Committed by Pranith Kumar
      This commit annotates rcu_initiate_boost() to fix the following
      sparse warning:
      
      	kernel/rcu/tree_plugin.h:1494:13: warning: context imbalance in 'rcu_initiate_boost' - unexpected unlock
      Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
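
      Both sparse fixes above use the kernel's usual annotation for a function
      that releases a lock it did not acquire in its own scope; a minimal
      sketch (abridged, not the exact upstream hunk):

      static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp, unsigned long flags)
      	__releases(rnp->lock)	/* tell sparse the "unexpected unlock" is intentional */
      {
      	/* ... */
      	raw_spin_unlock_irqrestore(&rnp->lock, flags);
      }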
    • rcu: Bind grace-period kthreads to non-NO_HZ_FULL CPUs · c0f489d2
      Committed by Paul E. McKenney
      Binding the grace-period kthreads to the timekeeping CPU resulted in
      significant performance decreases for some workloads.  For more detail,
      see:
      
      https://lkml.org/lkml/2014/6/3/395 for benchmark numbers
      
      https://lkml.org/lkml/2014/6/4/218 for CPU statistics
      
      It turns out that it is necessary to bind the grace-period kthreads
      to the timekeeping CPU only when all CPUs other than CPU 0 are
      nohz_full CPUs, or when CONFIG_NO_HZ_FULL_SYSIDLE=y.  In other cases,
      it suffices to bind the grace-period kthreads to the set of
      non-nohz_full CPUs.
      
      This commit therefore creates a tick_nohz_not_full_mask that is the
      complement of tick_nohz_full_mask, and then binds the grace-period
      kthread to the set of CPUs indicated by this new mask, which covers
      the CONFIG_NO_HZ_FULL_SYSIDLE=n case.  The CONFIG_NO_HZ_FULL_SYSIDLE=y
      case still binds the grace-period kthreads to the timekeeping CPU.
      This commit also includes the tick_nohz_full_enabled() check suggested
      by Frederic Weisbecker.
      Reported-by: Jet Chen <jet.chen@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Created housekeeping_affine() and housekeeping_mask per
        fweisbec feedback. ]
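
      A sketch of the binding described above, using the housekeeping_affine()
      and housekeeping_mask names from the bracketed note; the setup call is
      shown as a comment and the details are illustrative rather than verbatim:

      cpumask_var_t housekeeping_mask;	/* complement of tick_nohz_full_mask */

      void housekeeping_affine(struct task_struct *t)
      {
      	if (tick_nohz_full_enabled())
      		set_cpus_allowed_ptr(t, housekeeping_mask);
      }

      /* At nohz_full init time (sketch):
       *	cpumask_andnot(housekeeping_mask, cpu_possible_mask, tick_nohz_full_mask);
       */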
    • rcu: Simplify priority boosting by putting rt_mutex in rcu_node · abaa93d9
      Committed by Paul E. McKenney
      RCU priority boosting currently checks for boosting via a pointer in
      task_struct.  However, this is not needed: As Oleg noted, if the
      rt_mutex is placed in the rcu_node instead of on the booster's stack,
      the boostee can simply check to see whether it owns the lock.  This commit
      makes this change, shrinking task_struct by one pointer and the kernel
      by thirteen lines.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
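
      A sketch of the ownership check described above, assuming the rt_mutex
      sits in an rcu_node field such as ->boost_mtx (a name chosen for
      illustration):

      	/* Boostee side, in the outermost rcu_read_unlock() path (sketch): */
      	bool drop_boost_mutex = rt_mutex_owner(&rnp->boost_mtx) == t;

      	if (drop_boost_mutex)
      		rt_mutex_unlock(&rnp->boost_mtx);	/* deboost ourselves */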
    • rcu: Allow post-unlock reference for rt_mutex · dfeb9765
      Committed by Paul E. McKenney
      The current approach to RCU priority boosting uses an rt_mutex strictly
      for its priority-boosting side effects.  The rt_mutex_init_proxy_locked()
      function is used by the booster to initialize the lock as held by the
      boostee.  The booster then uses rt_mutex_lock() to acquire this rt_mutex,
      which priority-boosts the boostee.  When the boostee reaches the end
      of its outermost RCU read-side critical section, it checks a field in
      its task structure to see whether it has been boosted, and, if so, uses
      rt_mutex_unlock() to release the rt_mutex.  The booster can then go on
      to boost the next task that is blocking the current RCU grace period.
      
      However, reasonable implementations of rt_mutex_unlock() might result
      in the boostee referencing the rt_mutex's data after releasing it, and
      the booster might have re-initialized that rt_mutex between the time
      the boostee released it and the time it later referenced it.  This is
      clearly asking for trouble, so this commit introduces a completion that
      forces the booster to wait until the boostee has completely finished with
      the rt_mutex, thus avoiding the case where the booster re-initializes
      the rt_mutex before the last boostee's last reference to that rt_mutex.
      
      This of course does introduce some overhead, but the priority-boosting
      code paths are miles from any possible fastpath, and the overhead of
      executing the completion will normally be quite small compared to the
      overhead of priority boosting and deboosting, so this should be OK.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
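
      A sketch of the handshake described above, with a completion embedded
      next to the rt_mutex; the field names are illustrative:

      	/* Booster side (sketch): */
      	init_completion(&rnp->boost_completion);
      	rt_mutex_init_proxy_locked(&rnp->boost_mtx, t);	/* boostee "holds" the lock */
      	rt_mutex_lock(&rnp->boost_mtx);			/* blocks, priority-boosting t */
      	rt_mutex_unlock(&rnp->boost_mtx);
      	wait_for_completion(&rnp->boost_completion);	/* boostee is truly done with it */

      	/* Boostee side, after its own rt_mutex_unlock() (sketch): */
      	complete(&rnp->boost_completion);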
    • rcu: Remove redundant ACCESS_ONCE() from tick_do_timer_cpu · 4da117cf
      Committed by Paul E. McKenney
      In kernels built with CONFIG_NO_HZ_FULL, tick_do_timer_cpu is constant
      once boot completes.  Thus, there is no need to wrap it in ACCESS_ONCE()
      in code that is built only when CONFIG_NO_HZ_FULL=y.  This commit therefore
      removes the redundant ACCESS_ONCE().
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
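
      An illustrative (not verbatim) example of the kind of change this implies
      in CONFIG_NO_HZ_FULL-only code:

      -	if (ACCESS_ONCE(tick_do_timer_cpu) != smp_processor_id())
      +	if (tick_do_timer_cpu != smp_processor_id())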
  4. 08 Jul 2014 (2 commits)
    • rcu: Don't offload callbacks unless specifically requested · b58cc46c
      Committed by Paul E. McKenney
      Enabling NO_HZ_FULL currently has the side effect of enabling callback
      offloading on all CPUs.  This results in lots of additional rcuo kthreads,
      and can also increase context switching and wakeups, even in cases where
      callback offloading is neither needed nor particularly desirable.  This
      commit therefore enables callback offloading on a given CPU only if
      specifically requested at build time or boot time, or if that CPU has
      been specifically designated (again, either at build time or boot time)
      as a nohz_full CPU.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
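
      For illustration, after this change a CPU is offloaded only if designated
      explicitly, for example via boot parameters such as the following (the
      CPU ranges are examples only):

      	rcu_nocbs=1-7	(offload RCU callbacks for CPUs 1-7)
      	nohz_full=1-7	(nohz_full CPUs are also designated as NOCB CPUs)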
    • rcu: Parallelize and economize NOCB kthread wakeups · fbce7497
      Committed by Paul E. McKenney
      An 80-CPU system with a context-switch-heavy workload can require so
      many NOCB kthread wakeups that the RCU grace-period kthreads spend several
      tens of percent of a CPU just awakening things.  This clearly will not
      scale well: If you add enough CPUs, the RCU grace-period kthreads would
      get behind, increasing grace-period latency.
      
      To avoid this problem, this commit divides the NOCB kthreads into leaders
      and followers, where the grace-period kthreads awaken the leaders each of
      whom in turn awakens its followers.  By default, the number of groups of
      kthreads is the square root of the number of CPUs, but this default may
      be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
      This reduces the number of wakeups done per grace period by the RCU
      grace-period kthread by the square root of the number of CPUs, but of
      course by shifting those wakeups to the leaders.  In addition, because
      the leaders wait for grace periods on behalf of their respective followers,
      the number of wakeups of the followers decreases by up to a factor of two.
      Instead of being awakened once when new callbacks arrive and again
      at the end of the grace period, the followers are awakened only at
      the end of the grace period.
      
      For a numerical example, in a 4096-CPU system, the grace-period kthread
      would awaken 64 leaders, each of which would awaken its 63 followers
      at the end of the grace period.  This compares favorably with the 79
      wakeups for the grace-period kthread on an 80-CPU system.
      Reported-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
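
      A sketch of the default grouping described above, assuming the
      rcutree.rcu_nocb_leader_stride parameter named in the text; the
      surrounding setup code is abridged:

      	int ls = rcu_nocb_leader_stride;

      	if (ls == -1) {
      		ls = int_sqrt(nr_cpu_ids);	/* roughly sqrt(#CPUs) kthreads per group */
      		rcu_nocb_leader_stride = ls;
      	}
      	/* Every ls-th rcuo kthread becomes a leader; the others follow it. */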
  5. 24 Jun 2014 (1 commit)
    • rcu: Reduce overhead of cond_resched() checks for RCU · 4a81e832
      Committed by Paul E. McKenney
      Commit ac1bea85 (Make cond_resched() report RCU quiescent states)
      fixed a problem where a CPU looping in the kernel with but one runnable
      task would give RCU CPU stall warnings, even if the in-kernel loop
      contained cond_resched() calls.  Unfortunately, in so doing, it introduced
      performance regressions in Anton Blanchard's will-it-scale "open1" test.
      The problem appears to be not so much the increased cond_resched() path
      length as an increase in the rate at which grace periods complete, which
      increased per-update grace-period overhead.
      
      This commit takes a different approach to fixing this bug, mainly by
      moving the RCU-visible quiescent state from cond_resched() to
      rcu_note_context_switch(), and by further reducing the check to a
      simple non-zero test of a single per-CPU variable.  However, this
      approach requires that the force-quiescent-state processing send
      resched IPIs to the offending CPUs.  These will be sent only once
      the grace period has reached an age specified by the boot/sysfs
      parameter rcutree.jiffies_till_sched_qs, or once the grace period
      reaches an age halfway to the point at which RCU CPU stall warnings
      will be emitted, whichever comes first.
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@gentwo.org>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      [ paulmck: Made rcu_momentary_dyntick_idle() as suggested by the
        ktest build robot.  Also fixed smp_mb() comment as noted by
        Oleg Nesterov. ]
      
      Merged with e552592e (Reduce overhead of cond_resched() checks for RCU).
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
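
      A sketch of the cheap per-CPU test that this commit moves into
      rcu_note_context_switch(); the per-CPU variable name and the wrapper
      function are illustrative, only rcu_momentary_dyntick_idle() is named
      in the commit notes:

      DEFINE_PER_CPU(int, rcu_sched_qs_mask);

      /* Hypothetical helper called from rcu_note_context_switch() (sketch). */
      static void rcu_check_momentary_qs(void)
      {
      	if (unlikely(raw_cpu_read(rcu_sched_qs_mask)))
      		rcu_momentary_dyntick_idle();	/* report a quiescent state right away */
      }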
  6. 15 May 2014 (1 commit)
  7. 29 Apr 2014 (8 commits)
    • rcu: Replace __this_cpu_ptr() uses with raw_cpu_ptr() · fa07a58f
      Committed by Christoph Lameter
      __this_cpu_ptr() is being phased out.
      
      One special case is increment_cpu_stall_ticks(): a per-CPU variable is
      incremented there, so raw_cpu_inc() is used.
      
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
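
      Illustrative (not verbatim) examples of the two replacement patterns
      described above:

      -	struct rcu_data *rdp = __this_cpu_ptr(rsp->rda);
      +	struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);

      	/* Special case in increment_cpu_stall_ticks(): */
      -	__this_cpu_ptr(rsp->rda)->ticks_this_gp++;
      +	raw_cpu_inc(rsp->rda->ticks_this_gp);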
    • rcu: Make large and small sysidle systems use same state machine · becb41bf
      Committed by Paul E. McKenney
      Currently, small systems move back into RCU_SYSIDLE_NOT from
      RCU_SYSIDLE_SHORT and large systems do not.  This works because moving
      aggressively to RCU_SYSIDLE_NOT affects only performance, not correctness,
      and on small systems, the performance impact should be negligible.  That
      said, this difference does make RCU a bit more complex, and RCU does not
      seem to be suffering from any lack of complexity.  This commit therefore
      adjusts small-system operation to match that of large systems, so that
      the state never moves back to RCU_SYSIDLE_NOT from RCU_SYSIDLE_SHORT.
      Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Bind RCU grace-period kthreads if NO_HZ_FULL · 5057f55e
      Committed by Paul E. McKenney
      Currently, RCU binds the grace-period kthreads to the timekeeping
      CPU only if CONFIG_NO_HZ_FULL_SYSIDLE=y.  This means that these
      kthreads must be bound manually when CONFIG_NO_HZ_FULL_SYSIDLE=n and
      CONFIG_NO_HZ_FULL=y: Otherwise, these kthreads will induce OS jitter on
      random CPUs.  Given that we are trying to reduce the amount of manual
      tweaking required to make CONFIG_NO_HZ_FULL=y work nicely, this commit
      makes this binding happen when CONFIG_NO_HZ_FULL=y, even in cases where
      CONFIG_NO_HZ_FULL_SYSIDLE=n.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
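
      A sketch of the unconditional binding described above; the function name
      rcu_bind_gp_kthread() is assumed, and CONFIG_NO_HZ_FULL_SYSIDLE handling
      is omitted:

      static void rcu_bind_gp_kthread(void)
      {
      #ifdef CONFIG_NO_HZ_FULL
      	if (tick_nohz_full_enabled())
      		set_cpus_allowed_ptr(current, cpumask_of(tick_do_timer_cpu));
      #endif
      }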
    • rcu: Merge rcu_sched_force_quiescent_state() with rcu_force_quiescent_state() · a381d757
      Committed by Andreea-Cristina Bernat
      This patch merges the function rcu_force_quiescent_state() with
      rcu_sched_force_quiescent_state(), using the rcu_state pointer.  First,
      the rcu_sched_force_quiescent_state() function is deleted from
      kernel/rcu/tree.c.  The rcu_force_quiescent_state() definition that
      called force_quiescent_state() with the rcu_preempt_state pointer is
      deleted as well.  The new function that combines the old ones uses
      the rcu_state pointer and is located after rcu_batches_completed_bh()
      in kernel/rcu/tree.c.
      Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Consolidate kfree_call_rcu() to use rcu_state pointer · 495aa969
      Committed by Andreea-Cristina Bernat
      kfree_call_rcu() is defined twice.  When defined under CONFIG_TREE_PREEMPT_RCU,
      it uses rcu_preempt_state; otherwise, it uses rcu_sched_state.
      This patch uses the rcu_state pointer to combine the two definitions into one.
      The resulting function is placed after the closing of the preprocessor
      conditional CONFIG_TREE_PREEMPT_RCU.
      Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
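
      A sketch of the consolidation pattern used by the two commits above,
      assuming a flavor-selecting pointer named rcu_state as described; the
      trailing __call_rcu() arguments are assumptions, not quoted code:

      /* One definition instead of one copy per RCU flavor (sketch): */
      void rcu_force_quiescent_state(void)
      {
      	force_quiescent_state(rcu_state);
      }
      EXPORT_SYMBOL_GPL(rcu_force_quiescent_state);

      void kfree_call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
      {
      	__call_rcu(head, func, rcu_state, -1, 1);	/* any CPU, lazy callback */
      }
      EXPORT_SYMBOL_GPL(kfree_call_rcu);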
    • rcu: Make callers awaken grace-period kthread · 48a7639c
      Committed by Paul E. McKenney
      The rcu_start_gp_advanced() function currently uses irq_work_queue()
      to defer wakeups of the RCU grace-period kthread.  This deferring
      is necessary to avoid RCU-scheduler deadlocks involving the rcu_node
      structure's lock, meaning that RCU cannot call any of the scheduler's
      wake-up functions while holding one of these locks.
      
      Unfortunately, the second and subsequent calls to irq_work_queue() are
      ignored, and the first call will be ignored (aside from queuing the work
      item) if the scheduler-clock tick is turned off.  This is OK for many
      uses, especially those where irq_work_queue() is called from an interrupt
      or softirq handler, because in those cases the scheduler-clock-tick state
      will be re-evaluated, which will turn the scheduler-clock tick back on.
      On the next tick, any deferred work will then be processed.
      
      However, this strategy does not always work for RCU, which can be invoked
      at process level from idle CPUs.  In this case, the tick might never
      be turned back on, indefinitely deferring a grace-period start request.
      Note that the RCU CPU stall detector cannot see this condition, because
      there is no RCU grace period in progress.  Therefore, we can (and do!)
      see long tens-of-seconds stalls in grace-period handling.  In theory,
      we could see a full grace-period hang, but rcutorture testing to date
      has seen only the tens-of-seconds stalls.  Event tracing demonstrates
      that irq_work_queue() is being called repeatedly to no effect during
      these stalls: The "newreq" event appears repeatedly from a task that is
      not one of the grace-period kthreads.
      
      In theory, irq_work_queue() might be fixed to avoid this sort of issue,
      but RCU's requirements are unusual and it is quite straightforward to pass
      wake-up responsibility up through RCU's call chain, so that the wakeup
      happens when the offending locks are released.
      
      This commit therefore makes this change.  The rcu_start_gp_advanced(),
      rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(),
      __note_gp_changes(), and rcu_start_gp() functions now return a boolean
      which indicates when a wake-up is needed.  A new rcu_gp_kthread_wake()
      does the wakeup when it is necessary and safe to do so: No self-wakes,
      no wake-ups if the ->gp_flags field indicates there is no need (as in
      someone else did the wake-up before we got around to it), and no wake-ups
      before the grace-period kthread has been created.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
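
      A sketch of the rcu_gp_kthread_wake() helper described above; the
      ->gp_kthread, ->gp_flags, and ->gp_wq field names are assumptions
      consistent with the commit text rather than a quoted hunk:

      static void rcu_gp_kthread_wake(struct rcu_state *rsp)
      {
      	if (current == rsp->gp_kthread ||	/* no self-wakes */
      	    !ACCESS_ONCE(rsp->gp_flags) ||	/* someone else already did it */
      	    !rsp->gp_kthread)			/* kthread not yet created */
      		return;
      	wake_up(&rsp->gp_wq);
      }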
    • rcu: Update cpu_needs_another_gp() for futures from non-NOCB CPUs · 365187fb
      Committed by Paul E. McKenney
      In the old days, the only source of requests for future grace periods
      was NOCB CPUs.  This has changed: CPUs routinely post requests for
      future grace periods in order to promote power efficiency and reduce
      OS jitter with minimal impact on grace-period latency.  This commit
      therefore updates cpu_needs_another_gp() to invoke rcu_future_needs_gp()
      instead of rcu_nocb_needs_gp().  The latter is no longer used, so is
      now removed.  This commit also adds tracing for the irq_work_queue()
      wakeup case.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
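
      A sketch of the rcu_future_needs_gp() test that cpu_needs_another_gp()
      now invokes; the ->need_future_gp[] indexing is an assumption about how
      future grace-period requests are tracked:

      static int rcu_future_needs_gp(struct rcu_state *rsp)
      {
      	struct rcu_node *rnp = rcu_get_root(rsp);
      	int idx = (ACCESS_ONCE(rnp->completed) + 1) & 0x1;

      	return ACCESS_ONCE(rnp->need_future_gp[idx]);
      }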
    • 24342c96
  8. 18 Apr 2014 (1 commit)
  9. 18 Feb 2014 (5 commits)
  10. 16 Dec 2013 (1 commit)
  11. 13 Dec 2013 (2 commits)
    • rcu: Don't activate RCU core on NO_HZ_FULL CPUs · a096932f
      Committed by Paul E. McKenney
      Whenever a CPU receives a scheduling-clock interrupt, RCU checks to see
      if the RCU core needs anything from this CPU.  If so, RCU raises
      RCU_SOFTIRQ to carry out any needed processing.
      
      This approach has worked well historically, but it is undesirable on
      NO_HZ_FULL CPUs.  Such CPUs are expected to spend almost all of their time
      in userspace, so that scheduling-clock interrupts can be disabled while
      there is only one runnable task on the CPU in question.  Unfortunately,
      raising any softirq has the potential to wake up ksoftirqd, which would
      provide the second runnable task on that CPU, preventing disabling of
      scheduling-clock interrupts.
      
      What is needed instead is for RCU to leave NO_HZ_FULL CPUs alone,
      relying on the grace-period kthreads' quiescent-state forcing to
      do any needed RCU work on behalf of those CPUs.
      
      This commit therefore refrains from raising RCU_SOFTIRQ on any
      NO_HZ_FULL CPUs during any grace periods that have been in effect
      for less than one second.  The one-second limit handles the case
      where an inappropriate workload is running on a NO_HZ_FULL CPU
      that features lots of scheduling-clock interrupts, but no idle
      or userspace time.
      Reported-by: Mike Galbraith <bitbucket@online.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Mike Galbraith <bitbucket@online.de>
      Toasted-by: Frederic Weisbecker <fweisbec@gmail.com>
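
      A sketch of the check described above, assuming a helper that RCU
      consults before raising RCU_SOFTIRQ; the one-second threshold (HZ) comes
      from the commit text, the rest is illustrative:

      static bool rcu_nohz_full_cpu(struct rcu_state *rsp)
      {
      #ifdef CONFIG_NO_HZ_FULL
      	if (tick_nohz_full_cpu(smp_processor_id()) &&
      	    (!rcu_gp_in_progress(rsp) ||
      	     ULONG_CMP_LT(jiffies, ACCESS_ONCE(rsp->gp_start) + HZ)))
      		return true;	/* leave this nohz_full CPU alone */
      #endif
      	return false;
      }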
    • rcu: Warn on allegedly impossible rcu_read_unlock_special() from irq · 79a62f95
      Committed by Lai Jiangshan
      After commit #10f39bb1 (rcu: protect __rcu_read_unlock() against
      scheduler-using irq handlers), it is no longer possible to enter
      the main body of rcu_read_unlock_special() from an NMI, interrupt, or
      softirq handler.  This implies that the check for "in_irq()
      || in_serving_softirq()" must always fail, so that in theory this
      check could be removed entirely.
      
      In practice, this commit wraps this condition with a WARN_ON_ONCE().
      If this warning never triggers, then the condition will be removed
      entirely.
      
      [ paulmck: And one way of triggering the WARN_ON() is if a scheduling
        clock interrupt occurs in an RCU read-side critical section, setting
        RCU_READ_UNLOCK_NEED_QS, which is handled by rcu_read_unlock_special().
        Updated this commit to return if only that bit was set. ]
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
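
      A sketch of the wrapped check, assuming the then-current layout in which
      t->rcu_read_unlock_special is a bitmask and flags were saved earlier in
      rcu_read_unlock_special(); the NEED_QS cleanup itself is elided:

      	int special = t->rcu_read_unlock_special;

      	if (in_irq() || in_serving_softirq()) {
      		/* Only the NEED_QS bit is expected to get us here. */
      		WARN_ON_ONCE(special & ~RCU_READ_UNLOCK_NEED_QS);
      		local_irq_restore(flags);
      		return;
      	}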
  12. 04 Dec 2013 (2 commits)
    • rcu: Break call_rcu() deadlock involving scheduler and perf · 96d3fd0d
      Committed by Paul E. McKenney
      Dave Jones got the following lockdep splat:
      
      >  ======================================================
      >  [ INFO: possible circular locking dependency detected ]
      >  3.12.0-rc3+ #92 Not tainted
      >  -------------------------------------------------------
      >  trinity-child2/15191 is trying to acquire lock:
      >   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
      >
      > but task is already holding lock:
      >   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
      >
      > which lock already depends on the new lock.
      >
      >
      > the existing dependency chain (in reverse order) is:
      >
      > -> #3 (&ctx->lock){-.-...}:
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
      >         [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
      >         [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
      >         [<ffffffff81732052>] __schedule+0x1d2/0xa20
      >         [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
      >         [<ffffffff817352b6>] retint_kernel+0x26/0x30
      >         [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
      >         [<ffffffff813f0504>] pty_write+0x54/0x60
      >         [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
      >         [<ffffffff813e5838>] tty_write+0x158/0x2d0
      >         [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
      >         [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
      >         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
      >
      > -> #2 (&rq->lock){-.-.-.}:
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
      >         [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
      >         [<ffffffff81054336>] do_fork+0x126/0x460
      >         [<ffffffff81054696>] kernel_thread+0x26/0x30
      >         [<ffffffff8171ff93>] rest_init+0x23/0x140
      >         [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
      >         [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
      >         [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
      >
      > -> #1 (&p->pi_lock){-.-.-.}:
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
      >         [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
      >         [<ffffffff81097d62>] default_wake_function+0x12/0x20
      >         [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
      >         [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
      >         [<ffffffff8108ff59>] __wake_up+0x39/0x50
      >         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
      >         [<ffffffff81111450>] __call_rcu+0x140/0x820
      >         [<ffffffff81111b8d>] call_rcu+0x1d/0x20
      >         [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
      >         [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
      >         [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
      >         [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
      >         [<ffffffff817200be>] kernel_init+0xe/0x190
      >         [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
      >
      > -> #0 (&rdp->nocb_wq){......}:
      >         [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
      >         [<ffffffff8108ff43>] __wake_up+0x23/0x50
      >         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
      >         [<ffffffff81111450>] __call_rcu+0x140/0x820
      >         [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
      >         [<ffffffff81149abf>] put_ctx+0x4f/0x70
      >         [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
      >         [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
      >         [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
      >         [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
      >         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
      >
      > other info that might help us debug this:
      >
      > Chain exists of:
      >   &rdp->nocb_wq --> &rq->lock --> &ctx->lock
      >
      >   Possible unsafe locking scenario:
      >
      >         CPU0                    CPU1
      >         ----                    ----
      >    lock(&ctx->lock);
      >                                 lock(&rq->lock);
      >                                 lock(&ctx->lock);
      >    lock(&rdp->nocb_wq);
      >
      >  *** DEADLOCK ***
      >
      > 1 lock held by trinity-child2/15191:
      >  #0:  (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
      >
      > stack backtrace:
      > CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
      >  ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
      >  ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
      >  ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
      > Call Trace:
      >  [<ffffffff8172a363>] dump_stack+0x4e/0x82
      >  [<ffffffff81726741>] print_circular_bug+0x200/0x20f
      >  [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
      >  [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
      >  [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
      >  [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
      >  [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
      >  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
      >  [<ffffffff8108ff43>] __wake_up+0x23/0x50
      >  [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
      >  [<ffffffff81111450>] __call_rcu+0x140/0x820
      >  [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
      >  [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
      >  [<ffffffff81149abf>] put_ctx+0x4f/0x70
      >  [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
      >  [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
      >  [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
      >  [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
      >  [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
      >  [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
      >  [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
      
      The underlying problem is that perf is invoking call_rcu() with the
      scheduler locks held, but in NOCB mode, call_rcu() will with high
      probability invoke the scheduler -- which just might want to use its
      locks.  The reason that call_rcu() needs to invoke the scheduler is
      to wake up the corresponding rcuo callback-offload kthread, which
      does the job of starting up a grace period and invoking the callbacks
      afterwards.
      
      One solution (championed on a related problem by Lai Jiangshan) is to
      simply defer the wakeup to some point where scheduler locks are no longer
      held.  Since we don't want to unnecessarily incur the cost of such
      deferral, the task before us is threefold:
      
      1.	Determine when it is likely that a relevant scheduler lock is held.
      
      2.	Defer the wakeup in such cases.
      
      3.	Ensure that all deferred wakeups eventually happen, preferably
      	sooner rather than later.
      
      We use irqs_disabled_flags() as a proxy for relevant scheduler locks
      being held.  This works because the relevant locks are always acquired
      with interrupts disabled.  We may defer more often than needed, but that
      is at least safe.
      
      The wakeup deferral is tracked via a new field in the per-CPU and
      per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.
      
      This flag is checked by the RCU core processing.  The __rcu_pending()
      function now checks this flag, which causes rcu_check_callbacks()
      to initiate RCU core processing at each scheduling-clock interrupt
      where this flag is set.  Of course this is not sufficient because
      scheduling-clock interrupts are often turned off (the things we used to
      be able to count on!).  So the flags are also checked on entry to any
      state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
      state and NO_HZ_FULL user-mode-execution state.
      
      This approach should allow call_rcu() to be invoked regardless of what
      locks you might be holding, the key word being "should".
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
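
      A sketch of the deferral described above; the ->nocb_defer_wakeup field
      name comes from the commit text, while the surrounding enqueue code is
      abridged:

      	/* In the NOCB enqueue path, after adding the callback (sketch): */
      	if (!irqs_disabled_flags(flags)) {
      		wake_up(&rdp->nocb_wq);		/* safe: no scheduler locks held here */
      	} else {
      		/* Defer: RCU core processing and idle entry check this later. */
      		ACCESS_ONCE(rdp->nocb_defer_wakeup) = true;
      	}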
    • rcu: Fix and comment ordering around wait_event() · 78e4bc34
      Committed by Paul E. McKenney
      It is all too easy to forget that wait_event() does not necessarily
      imply a full memory barrier.  The case where it does not is where the
      condition transitions to true just as wait_event() starts execution.
      This is actually a feature: The standard use of wait_event() involves
      locking, in which case the locks provide the needed ordering (you hold a
      lock across the wake_up() and acquire that same lock after wait_event()
      returns).
      
      Given that I did forget that wait_event() does not necessarily imply a
      full memory barrier in one case, this commit fixes that case.  This commit
      also adds comments calling out the placement of existing memory barriers
      relied on by wait_event() calls.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
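
      A generic illustration of the pitfall described above (not the specific
      RCU call sites): if the condition is already true when wait_event()
      starts, no barrier is implied, so the ordering must come from explicit
      barriers or from the surrounding locking:

      static DECLARE_WAIT_QUEUE_HEAD(wq);
      static int x, done, r1;

      	/* Updater (sketch): */
      	x = 1;
      	smp_mb();		/* order the store to x before setting the flag */
      	done = 1;
      	wake_up(&wq);

      	/* Waiter (sketch): */
      	wait_event(wq, done);
      	smp_mb();		/* wait_event() alone need not provide this ordering */
      	r1 = x;			/* with both barriers, guaranteed to observe x == 1 */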
  13. 19 Nov 2013 (1 commit)
  14. 06 Nov 2013 (1 commit)
  15. 16 Oct 2013 (1 commit)
  16. 25 Sep 2013 (3 commits)
  17. 24 Sep 2013 (3 commits)