1. 29 4月, 2014 2 次提交
  2. 18 2月, 2014 1 次提交
  3. 13 12月, 2013 1 次提交
    • P
      rcu: Don't activate RCU core on NO_HZ_FULL CPUs · a096932f
      Paul E. McKenney 提交于
      Whenever a CPU receives a scheduling-clock interrupt, RCU checks to see
      if the RCU core needs anything from this CPU.  If so, RCU raises
      RCU_SOFTIRQ to carry out any needed processing.
      
      This approach has worked well historically, but it is undesirable on
      NO_HZ_FULL CPUs.  Such CPUs are expected to spend almost all of their time
      in userspace, so that scheduling-clock interrupts can be disabled while
      there is only one runnable task on the CPU in question.  Unfortunately,
      raising any softirq has the potential to wake up ksoftirqd, which would
      provide the second runnable task on that CPU, preventing disabling of
      scheduling-clock interrupts.
      
      What is needed instead is for RCU to leave NO_HZ_FULL CPUs alone,
      relying on the grace-period kthreads' quiescent-state forcing to
      do any needed RCU work on behalf of those CPUs.
      
      This commit therefore refrains from raising RCU_SOFTIRQ on any
      NO_HZ_FULL CPUs during any grace periods that have been in effect
      for less than one second.  The one-second limit handles the case
      where an inappropriate workload is running on a NO_HZ_FULL CPU
      that features lots of scheduling-clock interrupts, but no idle
      or userspace time.
      Reported-by: NMike Galbraith <bitbucket@online.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: NMike Galbraith <bitbucket@online.de>
      Toasted-by: NFrederic Weisbecker <fweisbec@gmail.com>
      a096932f
  4. 04 12月, 2013 2 次提交
    • P
      rcu: Break call_rcu() deadlock involving scheduler and perf · 96d3fd0d
      Paul E. McKenney 提交于
      Dave Jones got the following lockdep splat:
      
      >  ======================================================
      >  [ INFO: possible circular locking dependency detected ]
      >  3.12.0-rc3+ #92 Not tainted
      >  -------------------------------------------------------
      >  trinity-child2/15191 is trying to acquire lock:
      >   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
      >
      > but task is already holding lock:
      >   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
      >
      > which lock already depends on the new lock.
      >
      >
      > the existing dependency chain (in reverse order) is:
      >
      > -> #3 (&ctx->lock){-.-...}:
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
      >         [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
      >         [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
      >         [<ffffffff81732052>] __schedule+0x1d2/0xa20
      >         [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
      >         [<ffffffff817352b6>] retint_kernel+0x26/0x30
      >         [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
      >         [<ffffffff813f0504>] pty_write+0x54/0x60
      >         [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
      >         [<ffffffff813e5838>] tty_write+0x158/0x2d0
      >         [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
      >         [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
      >         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
      >
      > -> #2 (&rq->lock){-.-.-.}:
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
      >         [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
      >         [<ffffffff81054336>] do_fork+0x126/0x460
      >         [<ffffffff81054696>] kernel_thread+0x26/0x30
      >         [<ffffffff8171ff93>] rest_init+0x23/0x140
      >         [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
      >         [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
      >         [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
      >
      > -> #1 (&p->pi_lock){-.-.-.}:
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
      >         [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
      >         [<ffffffff81097d62>] default_wake_function+0x12/0x20
      >         [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
      >         [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
      >         [<ffffffff8108ff59>] __wake_up+0x39/0x50
      >         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
      >         [<ffffffff81111450>] __call_rcu+0x140/0x820
      >         [<ffffffff81111b8d>] call_rcu+0x1d/0x20
      >         [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
      >         [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
      >         [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
      >         [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
      >         [<ffffffff817200be>] kernel_init+0xe/0x190
      >         [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
      >
      > -> #0 (&rdp->nocb_wq){......}:
      >         [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
      >         [<ffffffff8108ff43>] __wake_up+0x23/0x50
      >         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
      >         [<ffffffff81111450>] __call_rcu+0x140/0x820
      >         [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
      >         [<ffffffff81149abf>] put_ctx+0x4f/0x70
      >         [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
      >         [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
      >         [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
      >         [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
      >         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
      >
      > other info that might help us debug this:
      >
      > Chain exists of:
      >   &rdp->nocb_wq --> &rq->lock --> &ctx->lock
      >
      >   Possible unsafe locking scenario:
      >
      >         CPU0                    CPU1
      >         ----                    ----
      >    lock(&ctx->lock);
      >                                 lock(&rq->lock);
      >                                 lock(&ctx->lock);
      >    lock(&rdp->nocb_wq);
      >
      >  *** DEADLOCK ***
      >
      > 1 lock held by trinity-child2/15191:
      >  #0:  (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
      >
      > stack backtrace:
      > CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
      >  ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
      >  ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
      >  ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
      > Call Trace:
      >  [<ffffffff8172a363>] dump_stack+0x4e/0x82
      >  [<ffffffff81726741>] print_circular_bug+0x200/0x20f
      >  [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
      >  [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
      >  [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
      >  [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
      >  [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
      >  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
      >  [<ffffffff8108ff43>] __wake_up+0x23/0x50
      >  [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
      >  [<ffffffff81111450>] __call_rcu+0x140/0x820
      >  [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
      >  [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
      >  [<ffffffff81149abf>] put_ctx+0x4f/0x70
      >  [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
      >  [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
      >  [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
      >  [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
      >  [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
      >  [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
      >  [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
      
      The underlying problem is that perf is invoking call_rcu() with the
      scheduler locks held, but in NOCB mode, call_rcu() will with high
      probability invoke the scheduler -- which just might want to use its
      locks.  The reason that call_rcu() needs to invoke the scheduler is
      to wake up the corresponding rcuo callback-offload kthread, which
      does the job of starting up a grace period and invoking the callbacks
      afterwards.
      
      One solution (championed on a related problem by Lai Jiangshan) is to
      simply defer the wakeup to some point where scheduler locks are no longer
      held.  Since we don't want to unnecessarily incur the cost of such
      deferral, the task before us is threefold:
      
      1.	Determine when it is likely that a relevant scheduler lock is held.
      
      2.	Defer the wakeup in such cases.
      
      3.	Ensure that all deferred wakeups eventually happen, preferably
      	sooner rather than later.
      
      We use irqs_disabled_flags() as a proxy for relevant scheduler locks
      being held.  This works because the relevant locks are always acquired
      with interrupts disabled.  We may defer more often than needed, but that
      is at least safe.
      
      The wakeup deferral is tracked via a new field in the per-CPU and
      per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.
      
      This flag is checked by the RCU core processing.  The __rcu_pending()
      function now checks this flag, which causes rcu_check_callbacks()
      to initiate RCU core processing at each scheduling-clock interrupt
      where this flag is set.  Of course this is not sufficient because
      scheduling-clock interrupts are often turned off (the things we used to
      be able to count on!).  So the flags are also checked on entry to any
      state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
      state and NO_HZ_FULL user-mode-execution state.
      
      This approach should allow call_rcu() to be invoked regardless of what
      locks you might be holding, the key word being "should".
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      96d3fd0d
    • P
      rcu: Kick CPU halfway to RCU CPU stall warning · 6193c76a
      Paul E. McKenney 提交于
      When an RCU CPU stall warning occurs, the CPU invokes resched_cpu() on
      itself.  This can help move the grace period forward in some situations,
      but it would be even better to do this -before- the RCU CPU stall warning.
      This commit therefore causes resched_cpu() to be called every five jiffies
      once the system is halfway to an RCU CPU stall warning.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      6193c76a
  5. 16 10月, 2013 1 次提交
  6. 25 9月, 2013 1 次提交
  7. 01 9月, 2013 2 次提交
    • P
      nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU · eb75767b
      Paul E. McKenney 提交于
      Because RCU's quiescent-state-forcing mechanism is used to drive the
      full-system-idle state machine, and because this mechanism is executed
      by RCU's grace-period kthreads, this commit forces these kthreads to
      run on the timekeeping CPU (tick_do_timer_cpu).  To do otherwise would
      mean that the RCU grace-period kthreads would force the system into
      non-idle state every time they drove the state machine, which would
      be just a bit on the futile side.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      eb75767b
    • P
      nohz_full: Add full-system-idle state machine · 0edd1b17
      Paul E. McKenney 提交于
      This commit adds the state machine that takes the per-CPU idle data
      as input and produces a full-system-idle indication as output.  This
      state machine is driven out of RCU's quiescent-state-forcing
      mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
      idle state and then rcu_sysidle_report() to drive the state machine.
      
      The full-system-idle state is sampled using rcu_sys_is_idle(), which
      also drives the state machine if RCU is idle (and does so by forcing
      RCU to become non-idle).  This function returns true if all but the
      timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
      enough to avoid memory contention on the full_sysidle_state state
      variable.  The rcu_sysidle_force_exit() may be called externally
      to reset the state machine back into non-idle state.
      
      For large systems the state machine is driven out of RCU's
      force-quiescent-state logic, which provides good scalability at the price
      of millisecond-scale latencies on the transition to full-system-idle
      state.  This is not so good for battery-powered systems, which are usually
      small enough that they don't need to care about scalability, but which
      do care deeply about energy efficiency.  Small systems therefore drive
      the state machine directly out of the idle-entry code.  The number of
      CPUs in a "small" system is defined by a new NO_HZ_FULL_SYSIDLE_SMALL
      Kconfig parameter, which defaults to 8.  Note that this is a build-time
      definition.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      [ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      [ paulmck: Simplify logic and provide better comments for memory barriers,
        based on review comments and questions by Lai Jiangshan. ]
      0edd1b17
  8. 19 8月, 2013 2 次提交
    • P
      nohz_full: Add per-CPU idle-state tracking · eb348b89
      Paul E. McKenney 提交于
      This commit adds the code that updates the rcu_dyntick structure's
      new fields to track the per-CPU idle state based on interrupts and
      transitions into and out of the idle loop (NMIs are ignored because NMI
      handlers cannot cleanly read out the time anyway).  This code is similar
      to the code that maintains RCU's idea of per-CPU idleness, but differs
      in that RCU treats CPUs running in user mode as idle, where this new
      code does not.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      eb348b89
    • P
      nohz_full: Add rcu_dyntick data for scalable detection of all-idle state · 2333210b
      Paul E. McKenney 提交于
      This commit adds fields to the rcu_dyntick structure that are used to
      detect idle CPUs.  These new fields differ from the existing ones in
      that the existing ones consider a CPU executing in user mode to be idle,
      where the new ones consider CPUs executing in user mode to be busy.
      The handling of these new fields is otherwise quite similar to that for
      the exiting fields.  This commit also adds the initialization required
      for these fields.
      
      So, why is usermode execution treated differently, with RCU considering
      it a quiescent state equivalent to idle, while in contrast the new
      full-system idle state detection considers usermode execution to be
      non-idle?
      
      It turns out that although one of RCU's quiescent states is usermode
      execution, it is not a full-system idle state.  This is because the
      purpose of the full-system idle state is not RCU, but rather determining
      when accurate timekeeping can safely be disabled.  Whenever accurate
      timekeeping is required in a CONFIG_NO_HZ_FULL kernel, at least one
      CPU must keep the scheduling-clock tick going.  If even one CPU is
      executing in user mode, accurate timekeeping is requires, particularly for
      architectures where gettimeofday() and friends do not enter the kernel.
      Only when all CPUs are really and truly idle can accurate timekeeping be
      disabled, allowing all CPUs to turn off the scheduling clock interrupt,
      thus greatly improving energy efficiency.
      
      This naturally raises the question "Why is this code in RCU rather than in
      timekeeping?", and the answer is that RCU has the data and infrastructure
      to efficiently make this determination.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      2333210b
  9. 30 7月, 2013 1 次提交
    • S
      rcu: Add const annotation to char * for RCU tracepoints and functions · e66c33d5
      Steven Rostedt (Red Hat) 提交于
      All the RCU tracepoints and functions that reference char pointers do
      so with just 'char *' even though they do not modify the contents of
      the string itself. This will cause warnings if a const char * is used
      in one of these functions.
      
      The RCU tracepoints store the pointer to the string to refer back to them
      when the trace output is displayed. As this can be minutes, hours or
      even days later, those strings had better be constant.
      
      This change also opens the door to allow the RCU tracepoint strings and
      their addresses to be exported so that userspace tracing tools can
      translate the contents of the pointers of the RCU tracepoints.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      e66c33d5
  10. 15 7月, 2013 1 次提交
    • P
      rcu: delete __cpuinit usage from all rcu files · 49fb4c62
      Paul Gortmaker 提交于
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the drivers/rcu uses of the __cpuinit macros
      from all C files.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@freedesktop.org>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      49fb4c62
  11. 11 6月, 2013 2 次提交
    • P
      rcu: Drive quiescent-state-forcing delay from HZ · 026ad283
      Paul E. McKenney 提交于
      Systems with HZ=100 can have slow bootup times due to the default
      three-jiffy delays between quiescent-state forcing attempts.  This
      commit therefore auto-tunes the RCU_JIFFIES_TILL_FORCE_QS value based
      on the value of HZ.  However, this would break very large systems that
      require more time between quiescent-state forcing attempts.  This
      commit therefore also ups the default delay by one jiffy for each
      256 CPUs that might be on the system (based off of nr_cpu_ids at
      runtime, -not- NR_CPUS at build time).
      
      Updated to collapse #ifdefs for RCU_JIFFIES_TILL_FORCE_QS into a
      step-function definition as suggested by Josh Triplett.
      Reported-by: NPaul Mackerras <paulus@au1.ibm.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      026ad283
    • S
      rcu: Don't call wakeup() with rcu_node structure ->lock held · 016a8d5b
      Steven Rostedt 提交于
      This commit fixes a lockdep-detected deadlock by moving a wake_up()
      call out from a rnp->lock critical section.  Please see below for
      the long version of this story.
      
      On Tue, 2013-05-28 at 16:13 -0400, Dave Jones wrote:
      
      > [12572.705832] ======================================================
      > [12572.750317] [ INFO: possible circular locking dependency detected ]
      > [12572.796978] 3.10.0-rc3+ #39 Not tainted
      > [12572.833381] -------------------------------------------------------
      > [12572.862233] trinity-child17/31341 is trying to acquire lock:
      > [12572.870390]  (rcu_node_0){..-.-.}, at: [<ffffffff811054ff>] rcu_read_unlock_special+0x9f/0x4c0
      > [12572.878859]
      > but task is already holding lock:
      > [12572.894894]  (&ctx->lock){-.-...}, at: [<ffffffff811390ed>] perf_lock_task_context+0x7d/0x2d0
      > [12572.903381]
      > which lock already depends on the new lock.
      >
      > [12572.927541]
      > the existing dependency chain (in reverse order) is:
      > [12572.943736]
      > -> #4 (&ctx->lock){-.-...}:
      > [12572.960032]        [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12572.968337]        [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
      > [12572.976633]        [<ffffffff8113c987>] __perf_event_task_sched_out+0x2e7/0x5e0
      > [12572.984969]        [<ffffffff81088953>] perf_event_task_sched_out+0x93/0xa0
      > [12572.993326]        [<ffffffff816ea0bf>] __schedule+0x2cf/0x9c0
      > [12573.001652]        [<ffffffff816eacfe>] schedule_user+0x2e/0x70
      > [12573.009998]        [<ffffffff816ecd64>] retint_careful+0x12/0x2e
      > [12573.018321]
      > -> #3 (&rq->lock){-.-.-.}:
      > [12573.034628]        [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12573.042930]        [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
      > [12573.051248]        [<ffffffff8108e6a7>] wake_up_new_task+0xb7/0x260
      > [12573.059579]        [<ffffffff810492f5>] do_fork+0x105/0x470
      > [12573.067880]        [<ffffffff81049686>] kernel_thread+0x26/0x30
      > [12573.076202]        [<ffffffff816cee63>] rest_init+0x23/0x140
      > [12573.084508]        [<ffffffff81ed8e1f>] start_kernel+0x3f1/0x3fe
      > [12573.092852]        [<ffffffff81ed856f>] x86_64_start_reservations+0x2a/0x2c
      > [12573.101233]        [<ffffffff81ed863d>] x86_64_start_kernel+0xcc/0xcf
      > [12573.109528]
      > -> #2 (&p->pi_lock){-.-.-.}:
      > [12573.125675]        [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12573.133829]        [<ffffffff816ebe9b>] _raw_spin_lock_irqsave+0x4b/0x90
      > [12573.141964]        [<ffffffff8108e881>] try_to_wake_up+0x31/0x320
      > [12573.150065]        [<ffffffff8108ebe2>] default_wake_function+0x12/0x20
      > [12573.158151]        [<ffffffff8107bbf8>] autoremove_wake_function+0x18/0x40
      > [12573.166195]        [<ffffffff81085398>] __wake_up_common+0x58/0x90
      > [12573.174215]        [<ffffffff81086909>] __wake_up+0x39/0x50
      > [12573.182146]        [<ffffffff810fc3da>] rcu_start_gp_advanced.isra.11+0x4a/0x50
      > [12573.190119]        [<ffffffff810fdb09>] rcu_start_future_gp+0x1c9/0x1f0
      > [12573.198023]        [<ffffffff810fe2c4>] rcu_nocb_kthread+0x114/0x930
      > [12573.205860]        [<ffffffff8107a91d>] kthread+0xed/0x100
      > [12573.213656]        [<ffffffff816f4b1c>] ret_from_fork+0x7c/0xb0
      > [12573.221379]
      > -> #1 (&rsp->gp_wq){..-.-.}:
      > [12573.236329]        [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12573.243783]        [<ffffffff816ebe9b>] _raw_spin_lock_irqsave+0x4b/0x90
      > [12573.251178]        [<ffffffff810868f3>] __wake_up+0x23/0x50
      > [12573.258505]        [<ffffffff810fc3da>] rcu_start_gp_advanced.isra.11+0x4a/0x50
      > [12573.265891]        [<ffffffff810fdb09>] rcu_start_future_gp+0x1c9/0x1f0
      > [12573.273248]        [<ffffffff810fe2c4>] rcu_nocb_kthread+0x114/0x930
      > [12573.280564]        [<ffffffff8107a91d>] kthread+0xed/0x100
      > [12573.287807]        [<ffffffff816f4b1c>] ret_from_fork+0x7c/0xb0
      
      Notice the above call chain.
      
      rcu_start_future_gp() is called with the rnp->lock held. Then it calls
      rcu_start_gp_advance, which does a wakeup.
      
      You can't do wakeups while holding the rnp->lock, as that would mean
      that you could not do a rcu_read_unlock() while holding the rq lock, or
      any lock that was taken while holding the rq lock. This is because...
      (See below).
      
      > [12573.295067]
      > -> #0 (rcu_node_0){..-.-.}:
      > [12573.309293]        [<ffffffff810b8d36>] __lock_acquire+0x1786/0x1af0
      > [12573.316568]        [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12573.323825]        [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
      > [12573.331081]        [<ffffffff811054ff>] rcu_read_unlock_special+0x9f/0x4c0
      > [12573.338377]        [<ffffffff810760a6>] __rcu_read_unlock+0x96/0xa0
      > [12573.345648]        [<ffffffff811391b3>] perf_lock_task_context+0x143/0x2d0
      > [12573.352942]        [<ffffffff8113938e>] find_get_context+0x4e/0x1f0
      > [12573.360211]        [<ffffffff811403f4>] SYSC_perf_event_open+0x514/0xbd0
      > [12573.367514]        [<ffffffff81140e49>] SyS_perf_event_open+0x9/0x10
      > [12573.374816]        [<ffffffff816f4dd4>] tracesys+0xdd/0xe2
      
      Notice the above trace.
      
      perf took its own ctx->lock, which can be taken while holding the rq
      lock. While holding this lock, it did a rcu_read_unlock(). The
      perf_lock_task_context() basically looks like:
      
      rcu_read_lock();
      raw_spin_lock(ctx->lock);
      rcu_read_unlock();
      
      Now, what looks to have happened, is that we scheduled after taking that
      first rcu_read_lock() but before taking the spin lock. When we scheduled
      back in and took the ctx->lock, the following rcu_read_unlock()
      triggered the "special" code.
      
      The rcu_read_unlock_special() takes the rnp->lock, which gives us a
      possible deadlock scenario.
      
      	CPU0		CPU1		CPU2
      	----		----		----
      
      				     rcu_nocb_kthread()
          lock(rq->lock);
      		    lock(ctx->lock);
      				     lock(rnp->lock);
      
      				     wake_up();
      
      				     lock(rq->lock);
      
      		    rcu_read_unlock();
      
      		    rcu_read_unlock_special();
      
      		    lock(rnp->lock);
          lock(ctx->lock);
      
      **** DEADLOCK ****
      
      > [12573.382068]
      > other info that might help us debug this:
      >
      > [12573.403229] Chain exists of:
      >   rcu_node_0 --> &rq->lock --> &ctx->lock
      >
      > [12573.424471]  Possible unsafe locking scenario:
      >
      > [12573.438499]        CPU0                    CPU1
      > [12573.445599]        ----                    ----
      > [12573.452691]   lock(&ctx->lock);
      > [12573.459799]                                lock(&rq->lock);
      > [12573.467010]                                lock(&ctx->lock);
      > [12573.474192]   lock(rcu_node_0);
      > [12573.481262]
      >  *** DEADLOCK ***
      >
      > [12573.501931] 1 lock held by trinity-child17/31341:
      > [12573.508990]  #0:  (&ctx->lock){-.-...}, at: [<ffffffff811390ed>] perf_lock_task_context+0x7d/0x2d0
      > [12573.516475]
      > stack backtrace:
      > [12573.530395] CPU: 1 PID: 31341 Comm: trinity-child17 Not tainted 3.10.0-rc3+ #39
      > [12573.545357]  ffffffff825b4f90 ffff880219f1dbc0 ffffffff816e375b ffff880219f1dc00
      > [12573.552868]  ffffffff816dfa5d ffff880219f1dc50 ffff88023ce4d1f8 ffff88023ce4ca40
      > [12573.560353]  0000000000000001 0000000000000001 ffff88023ce4d1f8 ffff880219f1dcc0
      > [12573.567856] Call Trace:
      > [12573.575011]  [<ffffffff816e375b>] dump_stack+0x19/0x1b
      > [12573.582284]  [<ffffffff816dfa5d>] print_circular_bug+0x200/0x20f
      > [12573.589637]  [<ffffffff810b8d36>] __lock_acquire+0x1786/0x1af0
      > [12573.596982]  [<ffffffff810918f5>] ? sched_clock_cpu+0xb5/0x100
      > [12573.604344]  [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12573.611652]  [<ffffffff811054ff>] ? rcu_read_unlock_special+0x9f/0x4c0
      > [12573.619030]  [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
      > [12573.626331]  [<ffffffff811054ff>] ? rcu_read_unlock_special+0x9f/0x4c0
      > [12573.633671]  [<ffffffff811054ff>] rcu_read_unlock_special+0x9f/0x4c0
      > [12573.640992]  [<ffffffff811390ed>] ? perf_lock_task_context+0x7d/0x2d0
      > [12573.648330]  [<ffffffff810b429e>] ? put_lock_stats.isra.29+0xe/0x40
      > [12573.655662]  [<ffffffff813095a0>] ? delay_tsc+0x90/0xe0
      > [12573.662964]  [<ffffffff810760a6>] __rcu_read_unlock+0x96/0xa0
      > [12573.670276]  [<ffffffff811391b3>] perf_lock_task_context+0x143/0x2d0
      > [12573.677622]  [<ffffffff81139070>] ? __perf_event_enable+0x370/0x370
      > [12573.684981]  [<ffffffff8113938e>] find_get_context+0x4e/0x1f0
      > [12573.692358]  [<ffffffff811403f4>] SYSC_perf_event_open+0x514/0xbd0
      > [12573.699753]  [<ffffffff8108cd9d>] ? get_parent_ip+0xd/0x50
      > [12573.707135]  [<ffffffff810b71fd>] ? trace_hardirqs_on_caller+0xfd/0x1c0
      > [12573.714599]  [<ffffffff81140e49>] SyS_perf_event_open+0x9/0x10
      > [12573.721996]  [<ffffffff816f4dd4>] tracesys+0xdd/0xe2
      
      This commit delays the wakeup via irq_work(), which is what
      perf and ftrace use to perform wakeups in critical sections.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      016a8d5b
  12. 19 4月, 2013 1 次提交
    • F
      nohz: Ensure full dynticks CPUs are RCU nocbs · d1e43fa5
      Frederic Weisbecker 提交于
      We need full dynticks CPU to also be RCU nocb so
      that we don't have to keep the tick to handle RCU
      callbacks.
      
      Make sure the range passed to nohz_full= boot
      parameter is a subset of rcu_nocbs=
      
      The CPUs that fail to meet this requirement will be
      excluded from the nohz_full range. This is checked
      early in boot time, before any CPU has the opportunity
      to stop its tick.
      Suggested-by: NSteven Rostedt <rostedt@goodmis.org>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      d1e43fa5
  13. 16 4月, 2013 1 次提交
    • P
      rcu: Kick adaptive-ticks CPUs that are holding up RCU grace periods · 65d798f0
      Paul E. McKenney 提交于
      Adaptive-ticks CPUs inform RCU when they enter kernel mode, but they do
      not necessarily turn the scheduler-clock tick back on.  This state of
      affairs could result in RCU waiting on an adaptive-ticks CPU running
      for an extended period in kernel mode.  Such a CPU will never run the
      RCU state machine, and could therefore indefinitely extend the RCU state
      machine, sooner or later resulting in an OOM condition.
      
      This patch, inspired by an earlier patch by Frederic Weisbecker, therefore
      causes RCU's force-quiescent-state processing to check for this condition
      and to send an IPI to CPUs that remain in that state for too long.
      "Too long" currently means about three jiffies by default, which is
      quite some time for a CPU to remain in the kernel without blocking.
      The rcu_tree.jiffies_till_first_fqs and rcutree.jiffies_till_next_fqs
      sysfs variables may be used to tune "too long" if needed.
      Reported-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      65d798f0
  14. 26 3月, 2013 5 次提交
  15. 14 3月, 2013 1 次提交
  16. 13 3月, 2013 2 次提交
  17. 29 1月, 2013 1 次提交
  18. 27 1月, 2013 1 次提交
  19. 09 1月, 2013 1 次提交
    • P
      rcu: Tag callback lists with corresponding grace-period number · dc35c893
      Paul E. McKenney 提交于
      Currently, callbacks are advanced each time the corresponding CPU
      notices a change in its leaf rcu_node structure's ->completed value
      (this value counts grace-period completions).  This approach has worked
      quite well, but with the advent of RCU_FAST_NO_HZ, we cannot count on
      a given CPU seeing all the grace-period completions.  When a CPU misses
      a grace-period completion that occurs while it is in dyntick-idle mode,
      this will delay invocation of its callbacks.
      
      In addition, acceleration of callbacks (when RCU realizes that a given
      callback need only wait until the end of the next grace period, rather
      than having to wait for a partial grace period followed by a full
      grace period) must be carried out extremely carefully.  Insufficient
      acceleration will result in unnecessarily long grace-period latencies,
      while excessive acceleration will result in premature callback invocation.
      Changes that involve this tradeoff are therefore among the most
      nerve-wracking changes to RCU.
      
      This commit therefore explicitly tags groups of callbacks with the
      number of the grace period that they are waiting for.  This means that
      callback-advancement and callback-acceleration functions are idempotent,
      so that excessive acceleration will merely waste a few CPU cycles.  This
      also allows a CPU to take full advantage of any grace periods that have
      elapsed while it has been in dyntick-idle mode.  It should also enable
      simulataneous simplifications to and optimizations of RCU_FAST_NO_HZ.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      dc35c893
  20. 17 11月, 2012 2 次提交
    • P
      rcu: Separate accounting of callbacks from callback-free CPUs · c635a4e1
      Paul E. McKenney 提交于
      Currently, callback invocations from callback-free CPUs are accounted to
      the CPU that registered the callback, but using the same field that is
      used for normal callbacks.  This makes it impossible to determine from
      debugfs output whether callbacks are in fact being diverted.  This commit
      therefore adds a separate ->n_nocbs_invoked field in the rcu_data structure
      in which diverted callback invocations are counted.  RCU's debugfs tracing
      still displays normal callback invocations using ci=, but displayed
      diverted callbacks with nci=.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      c635a4e1
    • P
      rcu: Add callback-free CPUs · 3fbfbf7a
      Paul E. McKenney 提交于
      RCU callback execution can add significant OS jitter and also can
      degrade both scheduling latency and, in asymmetric multiprocessors,
      energy efficiency.  This commit therefore adds the ability for selected
      CPUs ("rcu_nocbs=" boot parameter) to have their callbacks offloaded
      to kthreads.  If the "rcu_nocb_poll" boot parameter is also specified,
      these kthreads will do polling, removing the need for the offloaded
      CPUs to do wakeups.  At least one CPU must be doing normal callback
      processing: currently CPU 0 cannot be selected as a no-CBs CPU.
      In addition, attempts to offline the last normal-CBs CPU will fail.
      
      This feature was inspired by Jim Houston's and Joe Korty's JRCU, and
      this commit includes fixes to problems located by Fengguang Wu's
      kbuild test robot.
      
      [ paulmck: Added gfp.h include file as suggested by Fengguang Wu. ]
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      3fbfbf7a
  21. 09 11月, 2012 3 次提交
  22. 09 10月, 2012 1 次提交
    • P
      rcu: Grace-period initialization excludes only RCU notifier · a4fbe35a
      Paul E. McKenney 提交于
      Kirill noted the following deadlock cycle on shutdown involving padata:
      
      > With commit 755609a9 I've got deadlock on
      > poweroff.
      >
      > It guess it happens because of race for cpu_hotplug.lock:
      >
      >       CPU A                                   CPU B
      > disable_nonboot_cpus()
      > _cpu_down()
      > cpu_hotplug_begin()
      >  mutex_lock(&cpu_hotplug.lock);
      > __cpu_notify()
      > padata_cpu_callback()
      > __padata_remove_cpu()
      > padata_replace()
      > synchronize_rcu()
      >                                       rcu_gp_kthread()
      >                                       get_online_cpus();
      >                                       mutex_lock(&cpu_hotplug.lock);
      
      It would of course be good to eliminate grace-period delays from
      CPU-hotplug notifiers, but that is a separate issue.  Deadlock is
      not an appropriate diagnostic for excessive CPU-hotplug latency.
      
      Fortunately, grace-period initialization does not actually need to
      exclude all of the CPU-hotplug operation, but rather only RCU's own
      CPU_UP_PREPARE and CPU_DEAD CPU-hotplug notifiers.  This commit therefore
      introduces a new per-rcu_state onoff_mutex that provides the required
      concurrency control in place of the get_online_cpus() that was previously
      in rcu_gp_init().
      Reported-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: NKirill A. Shutemov <kirill@shutemov.name>
      a4fbe35a
  23. 26 9月, 2012 2 次提交
    • F
      rcu: Ignore userspace extended quiescent state by default · 1e1a689f
      Frederic Weisbecker 提交于
      By default we don't want to enter into RCU extended quiescent
      state while in userspace because doing this produces some overhead
      (eg: use of syscall slowpath). Set it off by default and ready to
      run when some feature like adaptive tickless need it.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      1e1a689f
    • F
      rcu: Allow rcu_user_enter()/exit() to nest · c5d900bf
      Frederic Weisbecker 提交于
      Allow calls to rcu_user_enter() even if we are already
      in userspace (as seen by RCU) and allow calls to rcu_user_exit()
      even if we are already in the kernel.
      
      This makes the APIs more flexible to be called from architectures.
      Exception entries for example won't need to know if they come from
      userspace before calling rcu_user_exit().
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      c5d900bf
  24. 23 9月, 2012 3 次提交