1. 08 9月, 2014 3 次提交
  2. 10 7月, 2014 7 次提交
    • P
      rcu: Remove CONFIG_PROVE_RCU_DELAY · 11992c70
      Paul E. McKenney 提交于
      The CONFIG_PROVE_RCU_DELAY Kconfig parameter doesn't appear to be very
      effective at finding race conditions, so this commit removes it.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      [ paulmck: Remove definition and uses as noted by Paul Bolle. ]
      11992c70
    • S
      rcu: Use __this_cpu_read() instead of per_cpu_ptr() · d860d403
      Shan Wei 提交于
      The __this_cpu_read() function produces better code than does
      per_cpu_ptr() on both ARM and x86.  For example, gcc (Ubuntu/Linaro
      4.7.3-12ubuntu1) 4.7.3 produces the following:
      
      ARMv7 per_cpu_ptr():
      
      force_quiescent_state:
          mov    r3, sp    @,
          bic    r1, r3, #8128    @ tmp171,,
          ldr    r2, .L98    @ tmp169,
          bic    r1, r1, #63    @ tmp170, tmp171,
          ldr    r3, [r0, #220]    @ __ptr, rsp_6(D)->rda
          ldr    r1, [r1, #20]    @ D.35903_68->cpu, D.35903_68->cpu
          mov    r6, r0    @ rsp, rsp
          ldr    r2, [r2, r1, asl #2]    @ tmp173, __per_cpu_offset
          add    r3, r3, r2    @ tmp175, __ptr, tmp173
          ldr    r5, [r3, #12]    @ rnp_old, D.29162_13->mynode
      
      ARMv7 __this_cpu_read():
      
      force_quiescent_state:
          ldr    r3, [r0, #220]    @ rsp_7(D)->rda, rsp_7(D)->rda
          mov    r6, r0    @ rsp, rsp
          add    r3, r3, #12    @ __ptr, rsp_7(D)->rda,
          ldr    r5, [r2, r3]    @ rnp_old, *D.29176_13
      
      Using gcc 4.8.2:
      
      x86_64 per_cpu_ptr():
      
          movl %gs:cpu_number,%edx    # cpu_number, pscr_ret__
          movslq    %edx, %rdx    # pscr_ret__, pscr_ret__
          movq    __per_cpu_offset(,%rdx,8), %rdx    # __per_cpu_offset, tmp93
          movq    %rdi, %r13    # rsp, rsp
          movq    1000(%rdi), %rax    # rsp_9(D)->rda, __ptr
          movq    24(%rdx,%rax), %r12    # _15->mynode, rnp_old
      
      x86_64 __this_cpu_read():
      
          movq    %rdi, %r13    # rsp, rsp
          movq    1000(%rdi), %rax    # rsp_9(D)->rda, rsp_9(D)->rda
          movq %gs:24(%rax),%r12    # _10->mynode, rnp_old
      
      Because this change produces significant benefits for these two very
      diverse architectures, this commit makes this change.
      Signed-off-by: NShan Wei <davidshan@tencent.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NPranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      d860d403
    • P
      rcu: Don't use NMIs to dump other CPUs' stacks · bc1dce51
      Paul E. McKenney 提交于
      Although NMI-based stack dumps are in principle more accurate, they are
      also more likely to trigger deadlocks.  This commit therefore replaces
      all uses of trigger_all_cpu_backtrace() with rcu_dump_cpu_stacks(), so
      that the CPU detecting an RCU CPU stall does the stack dumping.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      bc1dce51
    • P
      rcu: Check both root and current rcu_node when setting up future grace period · 48bd8e9b
      Pranith Kumar 提交于
      The rcu_start_future_gp() function checks the current rcu_node's ->gpnum
      and ->completed twice, once without ACCESS_ONCE() and once with it.
      Which is pointless because we hold that rcu_node's ->lock at that point.
      The intent was to check the current rcu_node structure and the root
      rcu_node structure, the latter locklessly with ACCESS_ONCE().  This
      commit therefore makes that change.
      
      The reason that it is safe to locklessly check the root rcu_nodes's
      ->gpnum and ->completed fields is that we hold the current rcu_node's
      ->lock, which constrains the root rcu_node's ability to change its
      ->gpnum and ->completed fields.  Of course, if there is a single rcu_node
      structure, then rnp_root==rnp, and holding the lock prevents all changes.
      If there is more than one rcu_node structure, then the code updates the
      fields in the following order:
      
      1.	Increment rnp_root->gpnum to start new grace period.
      2.	Increment rnp->gpnum to initialize the current rcu_node,
      	continuing initialization for the new grace period.
      3.	Increment rnp_root->completed to end the current grace period.
      4.	Increment rnp->completed to continue cleaning up after the
      	old grace period.
      
      So there are four possible combinations of relative values of these
      four fields:
      
      N   N   N   N:  RCU idle, new grace period must be initiated.
      		Although rnp_root->gpnum might be incremented immediately
      		after we check, that will just result in unnecessary work.
      		The grace period already started, and we try to start it.
      
      N+1 N   N   N:  RCU grace period just started.  No further change is
      		possible because we hold rnp->lock, so the checks of
      		rnp_root->gpnum and rnp_root->completed are stable.
      		We know that our request for a future grace period will
      		be seen during grace-period cleanup.
      
      N+1 N   N+1 N:  RCU grace period is ongoing.  Because rnp->gpnum is
      		different than rnp->completed, we won't even look at
      		rnp_root->gpnum and rnp_root->completed, so the possible
      		concurrent change to rnp_root->completed does not matter.
      		We know that our request for a future grace period will
      		be seen during grace-period cleanup, which cannot pass
      		this rcu_node because we hold its ->lock.
      
      N+1 N+1 N+1 N:  RCU grace period has ended, but not yet been cleaned up.
      		Because rnp->gpnum is different than rnp->completed, we
      		won't look at rnp_root->gpnum and rnp_root->completed, so
      		the possible concurrent change to rnp_root->completed does
      		not matter.  We know that our request for a future grace
      		period will be seen during grace-period cleanup, which
      		cannot pass this rcu_node because we hold its ->lock.
      
      Therefore, despite initial appearances, the lockless check is safe.
      Signed-off-by: NPranith Kumar <bobby.prani@gmail.com>
      [ paulmck: Update comment to say why the lockless check is safe. ]
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      48bd8e9b
    • P
      rcu: Loosen __call_rcu()'s rcu_head alignment constraint · 1146edcb
      Paul E. McKenney 提交于
      The m68k architecture aligns only to 16-bit boundaries, which can cause
      the align-to-32-bits check in __call_rcu() to trigger.  Because there is
      currently no known potential need for more than one low-order bit, this
      commit loosens the check to 16-bit boundaries.
      Reported-by: NGreg Ungerer <gerg@uclinux.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      1146edcb
    • P
      rcu: Eliminate read-modify-write ACCESS_ONCE() calls · a792563b
      Paul E. McKenney 提交于
      RCU contains code of the following forms:
      
      	ACCESS_ONCE(x)++;
      	ACCESS_ONCE(x) += y;
      	ACCESS_ONCE(x) -= y;
      
      Now these constructs do operate correctly, but they really result in a
      pair of volatile accesses, one to do the load and another to do the store.
      This can be confusing, as the casual reader might well assume that (for
      example) gcc might generate a memory-to-memory add instruction for each
      of these three cases.  In fact, gcc will do no such thing.  Also, there
      is a good chance that the kernel will move to separate load and store
      variants of ACCESS_ONCE(), and constructs like the above could easily
      confuse both people and scripts attempting to make that sort of change.
      Finally, most of RCU's read-modify-write uses of ACCESS_ONCE() really
      only need the store to be volatile, so that the read-modify-write form
      might be misleading.
      
      This commit therefore changes the above forms in RCU so that each instance
      of ACCESS_ONCE() either does a load or a store, but not both.  In a few
      cases, ACCESS_ONCE() was not critical, for example, for maintaining
      statisitics.  In these cases, ACCESS_ONCE() has been dispensed with
      entirely.
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      a792563b
    • F
      rcu: Make rcu node arrays static const char * const · b4426b49
      Fabian Frederick 提交于
      Those two arrays are being passed to lockdep_init_map(), which expects
      const char *, and are stored in lockdep_map the same way.
      
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      b4426b49
  3. 24 6月, 2014 1 次提交
    • P
      rcu: Reduce overhead of cond_resched() checks for RCU · 4a81e832
      Paul E. McKenney 提交于
      Commit ac1bea85 (Make cond_resched() report RCU quiescent states)
      fixed a problem where a CPU looping in the kernel with but one runnable
      task would give RCU CPU stall warnings, even if the in-kernel loop
      contained cond_resched() calls.  Unfortunately, in so doing, it introduced
      performance regressions in Anton Blanchard's will-it-scale "open1" test.
      The problem appears to be not so much the increased cond_resched() path
      length as an increase in the rate at which grace periods complete, which
      increased per-update grace-period overhead.
      
      This commit takes a different approach to fixing this bug, mainly by
      moving the RCU-visible quiescent state from cond_resched() to
      rcu_note_context_switch(), and by further reducing the check to a
      simple non-zero test of a single per-CPU variable.  However, this
      approach requires that the force-quiescent-state processing send
      resched IPIs to the offending CPUs.  These will be sent only once
      the grace period has reached an age specified by the boot/sysfs
      parameter rcutree.jiffies_till_sched_qs, or once the grace period
      reaches an age halfway to the point at which RCU CPU stall warnings
      will be emitted, whichever comes first.
      Reported-by: NDave Hansen <dave.hansen@intel.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@gentwo.org>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      [ paulmck: Made rcu_momentary_dyntick_idle() as suggested by the
        ktest build robot.  Also fixed smp_mb() comment as noted by
        Oleg Nesterov. ]
      
      Merge with e552592e (Reduce overhead of cond_resched() checks for RCU)
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      4a81e832
  4. 15 5月, 2014 2 次提交
  5. 14 5月, 2014 1 次提交
  6. 29 4月, 2014 12 次提交
  7. 18 4月, 2014 1 次提交
  8. 21 3月, 2014 1 次提交
    • P
      rcu: Provide grace-period piggybacking API · 765a3f4f
      Paul E. McKenney 提交于
      The following pattern is currently not well supported by RCU:
      
      1.	Make data element inaccessible to RCU readers.
      
      2.	Do work that probably lasts for more than one grace period.
      
      3.	Do something to make sure RCU readers in flight before #1 above
      	have completed.
      
      Here are some things that could currently be done:
      
      a.	Do a synchronize_rcu() unconditionally at either #1 or #3 above.
      	This works, but imposes needless work and latency.
      
      b.	Post an RCU callback at #1 above that does a wakeup, then
      	wait for the wakeup at #3.  This works well, but likely results
      	in an extra unneeded grace period.  Open-coding this is also
      	a bit more semi-tricky code than would be good.
      
      This commit therefore adds get_state_synchronize_rcu() and
      cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
      above and pass its return value to cond_synchronize_rcu() at #3 above.
      This results in a call to synchronize_rcu() if no grace period has
      elapsed between #1 and #3, but requires only a load, comparison, and
      memory barrier if a full grace period did elapse.
      Requested-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      765a3f4f
  9. 26 2月, 2014 1 次提交
    • P
      rcu: Ensure kernel/rcu/rcu.h can be sourced/used stand-alone · 5cb5c6e1
      Paul Gortmaker 提交于
      The kbuild test bot uncovered an implicit dependence on the
      trace header being present before rcu.h in ia64 allmodconfig
      that looks like this:
      
      In file included from kernel/ksysfs.c:22:0:
      kernel/rcu/rcu.h: In function '__rcu_reclaim':
      kernel/rcu/rcu.h:107:3: error: implicit declaration of function 'trace_rcu_invoke_kfree_callback' [-Werror=implicit-function-declaration]
      kernel/rcu/rcu.h:112:3: error: implicit declaration of function 'trace_rcu_invoke_callback' [-Werror=implicit-function-declaration]
      cc1: some warnings being treated as errors
      
      Looking at other rcu.h users, we can find that they all
      were sourcing the trace header in advance of rcu.h itself,
      as seen in the context of this diff.  There were also some
      inconsistencies as to whether it was or wasn't sourced based
      on the parent tracing Kconfig.
      
      Rather than "fix" it at each use site, and have inconsistent
      use based on whether "#ifdef CONFIG_RCU_TRACE" was used or not,
      lets just source the trace header just once, in the actual consumer
      of it, which is rcu.h itself.  We include it unconditionally, as
      build testing shows us that is a hard requirement for some files.
      Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      5cb5c6e1
  10. 18 2月, 2014 4 次提交
  11. 16 12月, 2013 1 次提交
  12. 13 12月, 2013 1 次提交
    • P
      rcu: Don't activate RCU core on NO_HZ_FULL CPUs · a096932f
      Paul E. McKenney 提交于
      Whenever a CPU receives a scheduling-clock interrupt, RCU checks to see
      if the RCU core needs anything from this CPU.  If so, RCU raises
      RCU_SOFTIRQ to carry out any needed processing.
      
      This approach has worked well historically, but it is undesirable on
      NO_HZ_FULL CPUs.  Such CPUs are expected to spend almost all of their time
      in userspace, so that scheduling-clock interrupts can be disabled while
      there is only one runnable task on the CPU in question.  Unfortunately,
      raising any softirq has the potential to wake up ksoftirqd, which would
      provide the second runnable task on that CPU, preventing disabling of
      scheduling-clock interrupts.
      
      What is needed instead is for RCU to leave NO_HZ_FULL CPUs alone,
      relying on the grace-period kthreads' quiescent-state forcing to
      do any needed RCU work on behalf of those CPUs.
      
      This commit therefore refrains from raising RCU_SOFTIRQ on any
      NO_HZ_FULL CPUs during any grace periods that have been in effect
      for less than one second.  The one-second limit handles the case
      where an inappropriate workload is running on a NO_HZ_FULL CPU
      that features lots of scheduling-clock interrupts, but no idle
      or userspace time.
      Reported-by: NMike Galbraith <bitbucket@online.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: NMike Galbraith <bitbucket@online.de>
      Toasted-by: NFrederic Weisbecker <fweisbec@gmail.com>
      a096932f
  13. 10 12月, 2013 2 次提交
    • P
      rcu: Fix CONFIG_RCU_FANOUT_EXACT for odd fanout/leaf values · 04f34650
      Paul E. McKenney 提交于
      Each element of the rcu_state structure's ->levelspread[] array
      is intended to contain the per-level fanout, where the zero-th
      element corresponds to the root of the rcu_node tree, and the last
      element corresponds to the leaves.  In the CONFIG_RCU_FANOUT_EXACT
      case, this means that the last element should be filled in
      from CONFIG_RCU_FANOUT_LEAF (or from the rcu_fanout_leaf boot
      parameter, if provided) and that the remaining elements should
      be filled in from CONFIG_RCU_FANOUT.  Unfortunately, the current
      code in rcu_init_levelspread() takes the opposite approach, placing
      CONFIG_RCU_FANOUT_LEAF in the zero-th element and CONFIG_RCU_FANOUT in
      the remaining elements.
      
      For typical power-of-two values, this generates odd but functional
      rcu_node trees.  However, other values, for example CONFIG_RCU_FANOUT=3
      and CONFIG_RCU_FANOUT_LEAF=2, generate trees that can leave some CPUs
      out of the grace-period computation, resulting in too-short grace periods
      and therefore a broken RCU implementation.
      
      This commit therefore fixes rcu_init_levelspread() to set the last
      ->levelspread[] array element from CONFIG_RCU_FANOUT_LEAF and the
      remaining elements from CONFIG_RCU_FANOUT, thus generating the
      intended rcu_node trees.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      04f34650
    • F
      rcu: Fix coccinelle warnings · f6f7ee9a
      Fengguang Wu 提交于
      This commit fixes the following coccinelle warning:
      
      kernel/rcu/tree.c:712:9-10: WARNING: return of 0/1 in function
      'rcu_lockdep_current_cpu_online' with return type bool
      
      Return statements in functions returning bool should use
       true/false instead of 1/0.
       Generated by: coccinelle/misc/boolreturn.cocci
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      f6f7ee9a
  14. 04 12月, 2013 3 次提交
    • P
      rcu: Let the world know when RCU adjusts its geometry · 39479098
      Paul E. McKenney 提交于
      Some RCU bugs have been specific to the layout of the rcu_node tree,
      but RCU will silently adjust the tree at boot time if appropriate.
      This obscures valuable debugging information, so print a message when
      this happens.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      39479098
    • P
      rcu: Allow task-level idle entry/exit nesting · 3a592405
      Paul E. McKenney 提交于
      The current task-level idle entry/exit code forces an entry/exit on
      each call, regardless of the nesting level.  This commit therefore
      properly accounts for nesting.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NFrederic Weisbecker <fweisbec@gmail.com>
      3a592405
    • P
      rcu: Break call_rcu() deadlock involving scheduler and perf · 96d3fd0d
      Paul E. McKenney 提交于
      Dave Jones got the following lockdep splat:
      
      >  ======================================================
      >  [ INFO: possible circular locking dependency detected ]
      >  3.12.0-rc3+ #92 Not tainted
      >  -------------------------------------------------------
      >  trinity-child2/15191 is trying to acquire lock:
      >   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
      >
      > but task is already holding lock:
      >   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
      >
      > which lock already depends on the new lock.
      >
      >
      > the existing dependency chain (in reverse order) is:
      >
      > -> #3 (&ctx->lock){-.-...}:
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
      >         [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
      >         [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
      >         [<ffffffff81732052>] __schedule+0x1d2/0xa20
      >         [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
      >         [<ffffffff817352b6>] retint_kernel+0x26/0x30
      >         [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
      >         [<ffffffff813f0504>] pty_write+0x54/0x60
      >         [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
      >         [<ffffffff813e5838>] tty_write+0x158/0x2d0
      >         [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
      >         [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
      >         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
      >
      > -> #2 (&rq->lock){-.-.-.}:
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
      >         [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
      >         [<ffffffff81054336>] do_fork+0x126/0x460
      >         [<ffffffff81054696>] kernel_thread+0x26/0x30
      >         [<ffffffff8171ff93>] rest_init+0x23/0x140
      >         [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
      >         [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
      >         [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
      >
      > -> #1 (&p->pi_lock){-.-.-.}:
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
      >         [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
      >         [<ffffffff81097d62>] default_wake_function+0x12/0x20
      >         [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
      >         [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
      >         [<ffffffff8108ff59>] __wake_up+0x39/0x50
      >         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
      >         [<ffffffff81111450>] __call_rcu+0x140/0x820
      >         [<ffffffff81111b8d>] call_rcu+0x1d/0x20
      >         [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
      >         [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
      >         [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
      >         [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
      >         [<ffffffff817200be>] kernel_init+0xe/0x190
      >         [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
      >
      > -> #0 (&rdp->nocb_wq){......}:
      >         [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
      >         [<ffffffff8108ff43>] __wake_up+0x23/0x50
      >         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
      >         [<ffffffff81111450>] __call_rcu+0x140/0x820
      >         [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
      >         [<ffffffff81149abf>] put_ctx+0x4f/0x70
      >         [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
      >         [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
      >         [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
      >         [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
      >         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
      >
      > other info that might help us debug this:
      >
      > Chain exists of:
      >   &rdp->nocb_wq --> &rq->lock --> &ctx->lock
      >
      >   Possible unsafe locking scenario:
      >
      >         CPU0                    CPU1
      >         ----                    ----
      >    lock(&ctx->lock);
      >                                 lock(&rq->lock);
      >                                 lock(&ctx->lock);
      >    lock(&rdp->nocb_wq);
      >
      >  *** DEADLOCK ***
      >
      > 1 lock held by trinity-child2/15191:
      >  #0:  (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
      >
      > stack backtrace:
      > CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
      >  ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
      >  ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
      >  ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
      > Call Trace:
      >  [<ffffffff8172a363>] dump_stack+0x4e/0x82
      >  [<ffffffff81726741>] print_circular_bug+0x200/0x20f
      >  [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
      >  [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
      >  [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
      >  [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
      >  [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
      >  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
      >  [<ffffffff8108ff43>] __wake_up+0x23/0x50
      >  [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
      >  [<ffffffff81111450>] __call_rcu+0x140/0x820
      >  [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
      >  [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
      >  [<ffffffff81149abf>] put_ctx+0x4f/0x70
      >  [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
      >  [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
      >  [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
      >  [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
      >  [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
      >  [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
      >  [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
      
      The underlying problem is that perf is invoking call_rcu() with the
      scheduler locks held, but in NOCB mode, call_rcu() will with high
      probability invoke the scheduler -- which just might want to use its
      locks.  The reason that call_rcu() needs to invoke the scheduler is
      to wake up the corresponding rcuo callback-offload kthread, which
      does the job of starting up a grace period and invoking the callbacks
      afterwards.
      
      One solution (championed on a related problem by Lai Jiangshan) is to
      simply defer the wakeup to some point where scheduler locks are no longer
      held.  Since we don't want to unnecessarily incur the cost of such
      deferral, the task before us is threefold:
      
      1.	Determine when it is likely that a relevant scheduler lock is held.
      
      2.	Defer the wakeup in such cases.
      
      3.	Ensure that all deferred wakeups eventually happen, preferably
      	sooner rather than later.
      
      We use irqs_disabled_flags() as a proxy for relevant scheduler locks
      being held.  This works because the relevant locks are always acquired
      with interrupts disabled.  We may defer more often than needed, but that
      is at least safe.
      
      The wakeup deferral is tracked via a new field in the per-CPU and
      per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.
      
      This flag is checked by the RCU core processing.  The __rcu_pending()
      function now checks this flag, which causes rcu_check_callbacks()
      to initiate RCU core processing at each scheduling-clock interrupt
      where this flag is set.  Of course this is not sufficient because
      scheduling-clock interrupts are often turned off (the things we used to
      be able to count on!).  So the flags are also checked on entry to any
      state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
      state and NO_HZ_FULL user-mode-execution state.
      
      This approach should allow call_rcu() to be invoked regardless of what
      locks you might be holding, the key word being "should".
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      96d3fd0d