1. 19 4月, 2017 3 次提交
    • P
      srcu: Move rcu_init_levelspread() to rcu_tree_node.h · 2b34c43c
      Paul E. McKenney 提交于
      This commit moves the rcu_init_levelspread() function from
      kernel/rcu/tree.c to kernel/rcu/rcu.h so that SRCU can access it.  This is
      another step towards enabling SRCU to create its own combining tree.
      This commit is code-movement only, give or take knock-on adjustments.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      2b34c43c
    • P
      srcu: Abstract multi-tail callback list handling · 15fecf89
      Paul E. McKenney 提交于
      RCU has only one multi-tail callback list, which is implemented via
      the nxtlist, nxttail, nxtcompleted, qlen_lazy, and qlen fields in the
      rcu_data structure, and whose operations are open-code throughout the
      Tree RCU implementation.  This has been more or less OK in the past,
      but upcoming callback-list optimizations in SRCU could really use
      a multi-tail callback list there as well.
      
      This commit therefore abstracts the multi-tail callback list handling
      into a new kernel/rcu/rcu_segcblist.h file, and uses this new API.
      The simple head-and-tail pointer callback list is also abstracted and
      applied everywhere except for the NOCB callback-offload lists.  (Yes,
      the plan is to apply them there as well, but this commit is already
      bigger than would be good.)
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      15fecf89
    • P
      rcu: Pull rcu_qs_ctr into rcu_dynticks structure · 9577df9a
      Paul E. McKenney 提交于
      The rcu_qs_ctr variable is yet another isolated per-CPU variable,
      so this commit pulls it into the pre-existing rcu_dynticks per-CPU
      structure.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      9577df9a
  2. 24 1月, 2017 2 次提交
  3. 23 8月, 2016 1 次提交
    • P
      rcu: Drive expedited grace periods from workqueue · 8b355e3b
      Paul E. McKenney 提交于
      The current implementation of expedited grace periods has the user
      task drive the grace period.  This works, but has downsides: (1) The
      user task must awaken tasks piggybacking on this grace period, which
      can result in latencies rivaling that of the grace period itself, and
      (2) User tasks can receive signals, which interfere with RCU CPU stall
      warnings.
      
      This commit therefore uses workqueues to drive the grace periods, so
      that the user task need not do the awakening.  A subsequent commit
      will remove the now-unnecessary code allowing for signals.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      8b355e3b
  4. 01 4月, 2016 2 次提交
  5. 05 12月, 2015 2 次提交
    • P
      kernel: Make rcu/tree_trace.c explicitly non-modular · 47dbc906
      Paul Gortmaker 提交于
      The Kconfig currently controlling compilation of this code is:
      
      init/Kconfig:config TREE_RCU_TRACE
      init/Kconfig:   def_bool RCU_TRACE && ( TREE_RCU || PREEMPT_RCU )
      
      ...meaning that it currently is not being built as a module by anyone.
      
      Lets remove the modular code that is essentially orphaned, so that
      when reading the file there is no doubt it is builtin-only.
      
      Since module_init translates to device_initcall in the non-modular
      case, the init ordering remains unchanged with this commit.  We could
      consider moving this to an earlier initcall if desired.
      
      We don't replace module.h with init.h since the file already has that.
      We also delete the moduleparam.h include that is left over from
      commit 64db4cff (""Tree RCU": scalable
      classic RCU implementation") since it is not needed here either.
      
      We morph some tags like MODULE_AUTHOR into the comments at the top of
      the file for documentation purposes.
      
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      47dbc906
    • P
      rcu: Reduce expedited GP memory contention via per-CPU variables · df5bd514
      Paul E. McKenney 提交于
      Currently, the piggybacked-work checks carried out by sync_exp_work_done()
      atomically increment a small set of variables (the ->expedited_workdone0,
      ->expedited_workdone1, ->expedited_workdone2, ->expedited_workdone3
      fields in the rcu_state structure), which will form a memory-contention
      bottleneck given a sufficiently large number of CPUs concurrently invoking
      either synchronize_rcu_expedited() or synchronize_sched_expedited().
      
      This commit therefore moves these for fields to the per-CPU rcu_data
      structure, eliminating the memory contention.  The show_rcuexp() function
      also changes to sum up each field in the rcu_data structures.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      df5bd514
  6. 24 11月, 2015 1 次提交
  7. 07 10月, 2015 1 次提交
  8. 21 9月, 2015 3 次提交
    • P
      rcu: Make ->cpu_no_qs be a union for aggregate OR · 5b74c458
      Paul E. McKenney 提交于
      This commit converts the rcu_data structure's ->cpu_no_qs field
      to a union.  The bytewise side of this union allows individual access
      to indications as to whether this CPU needs to find a quiescent state
      for a normal (.norm) and/or expedited (.exp) grace period.  The setwise
      side of the union allows testing whether or not a quiescent state is
      needed at all, for either type of grace period.
      
      For now, only .norm is used.  A later commit will introduce the expedited
      usage.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      5b74c458
    • P
      rcu: Invert passed_quiesce and rename to cpu_no_qs · 0d43eb34
      Paul E. McKenney 提交于
      This commit inverts the sense of the rcu_data structure's ->passed_quiesce
      field and renames it to ->cpu_no_qs.  This will allow a later commit to
      use an "aggregate OR" operation to test expedited as well as normal grace
      periods without added overhead.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      0d43eb34
    • P
      rcu: Rename qs_pending to core_needs_qs · 97c668b8
      Paul E. McKenney 提交于
      An upcoming commit needs to invert the sense of the ->passed_quiesce
      rcu_data structure field, so this commit is taking this opportunity
      to clarify things a bit by renaming ->qs_pending to ->core_needs_qs.
      
      So if !rdp->core_needs_qs, then this CPU need not concern itself with
      quiescent states, in particular, it need not acquire its leaf rcu_node
      structure's ->lock to check.  Otherwise, it needs to report the next
      quiescent state.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      97c668b8
  9. 18 7月, 2015 6 次提交
  10. 28 5月, 2015 1 次提交
  11. 13 3月, 2015 1 次提交
    • P
      rcu: Process offlining and onlining only at grace-period start · 0aa04b05
      Paul E. McKenney 提交于
      Races between CPU hotplug and grace periods can be difficult to resolve,
      so the ->onoff_mutex is used to exclude the two events.  Unfortunately,
      this means that it is impossible for an outgoing CPU to perform the
      last bits of its offlining from its last pass through the idle loop,
      because sleeplocks cannot be acquired in that context.
      
      This commit avoids these problems by buffering online and offline events
      in a new ->qsmaskinitnext field in the leaf rcu_node structures.  When a
      grace period starts, the events accumulated in this mask are applied to
      the ->qsmaskinit field, and, if needed, up the rcu_node tree.  The special
      case of all CPUs corresponding to a given leaf rcu_node structure being
      offline while there are still elements in that structure's ->blkd_tasks
      list is handled using a new ->wait_blkd_tasks field.  In this case,
      propagating the offline bits up the tree is deferred until the beginning
      of the grace period after all of the tasks have exited their RCU read-side
      critical sections and removed themselves from the list, at which point
      the ->wait_blkd_tasks flag is cleared.  If one of that leaf rcu_node
      structure's CPUs comes back online before the list empties, then the
      ->wait_blkd_tasks flag is simply cleared.
      
      This of course means that RCU's notion of which CPUs are offline can be
      out of date.  This is OK because RCU need only wait on CPUs that were
      online at the time that the grace period started.  In addition, RCU's
      force-quiescent-state actions will handle the case where a CPU goes
      offline after the grace period starts.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      0aa04b05
  12. 16 1月, 2015 1 次提交
    • P
      rcu: Make cond_resched_rcu_qs() apply to normal RCU flavors · 5cd37193
      Paul E. McKenney 提交于
      Although cond_resched_rcu_qs() only applies to TASKS_RCU, it is used
      in places where it would be useful for it to apply to the normal RCU
      flavors, rcu_preempt, rcu_sched, and rcu_bh.  This is especially the
      case for workloads that aggressively overload the system, particularly
      those that generate large numbers of RCU updates on systems running
      NO_HZ_FULL CPUs.  This commit therefore communicates quiescent states
      from cond_resched_rcu_qs() to the normal RCU flavors.
      
      Note that it is unfortunately necessary to leave the old ->passed_quiesce
      mechanism in place to allow quiescent states that apply to only one
      flavor to be recorded.  (Yes, we could decrement ->rcu_qs_ctr_snap in
      that case, but that is not so good for debugging of RCU internals.)
      In addition, if one of the RCU flavor's grace period has stalled, this
      will invoke rcu_momentary_dyntick_idle(), resulting in a heavy-weight
      quiescent state visible from other CPUs.
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Merge commit from Sasha Levin fixing a bug where __this_cpu()
        was used in preemptible code. ]
      5cd37193
  13. 18 2月, 2014 2 次提交
  14. 04 12月, 2013 1 次提交
    • P
      rcu: Break call_rcu() deadlock involving scheduler and perf · 96d3fd0d
      Paul E. McKenney 提交于
      Dave Jones got the following lockdep splat:
      
      >  ======================================================
      >  [ INFO: possible circular locking dependency detected ]
      >  3.12.0-rc3+ #92 Not tainted
      >  -------------------------------------------------------
      >  trinity-child2/15191 is trying to acquire lock:
      >   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
      >
      > but task is already holding lock:
      >   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
      >
      > which lock already depends on the new lock.
      >
      >
      > the existing dependency chain (in reverse order) is:
      >
      > -> #3 (&ctx->lock){-.-...}:
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
      >         [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
      >         [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
      >         [<ffffffff81732052>] __schedule+0x1d2/0xa20
      >         [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
      >         [<ffffffff817352b6>] retint_kernel+0x26/0x30
      >         [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
      >         [<ffffffff813f0504>] pty_write+0x54/0x60
      >         [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
      >         [<ffffffff813e5838>] tty_write+0x158/0x2d0
      >         [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
      >         [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
      >         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
      >
      > -> #2 (&rq->lock){-.-.-.}:
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
      >         [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
      >         [<ffffffff81054336>] do_fork+0x126/0x460
      >         [<ffffffff81054696>] kernel_thread+0x26/0x30
      >         [<ffffffff8171ff93>] rest_init+0x23/0x140
      >         [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
      >         [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
      >         [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
      >
      > -> #1 (&p->pi_lock){-.-.-.}:
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
      >         [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
      >         [<ffffffff81097d62>] default_wake_function+0x12/0x20
      >         [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
      >         [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
      >         [<ffffffff8108ff59>] __wake_up+0x39/0x50
      >         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
      >         [<ffffffff81111450>] __call_rcu+0x140/0x820
      >         [<ffffffff81111b8d>] call_rcu+0x1d/0x20
      >         [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
      >         [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
      >         [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
      >         [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
      >         [<ffffffff817200be>] kernel_init+0xe/0x190
      >         [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
      >
      > -> #0 (&rdp->nocb_wq){......}:
      >         [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
      >         [<ffffffff8108ff43>] __wake_up+0x23/0x50
      >         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
      >         [<ffffffff81111450>] __call_rcu+0x140/0x820
      >         [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
      >         [<ffffffff81149abf>] put_ctx+0x4f/0x70
      >         [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
      >         [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
      >         [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
      >         [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
      >         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
      >
      > other info that might help us debug this:
      >
      > Chain exists of:
      >   &rdp->nocb_wq --> &rq->lock --> &ctx->lock
      >
      >   Possible unsafe locking scenario:
      >
      >         CPU0                    CPU1
      >         ----                    ----
      >    lock(&ctx->lock);
      >                                 lock(&rq->lock);
      >                                 lock(&ctx->lock);
      >    lock(&rdp->nocb_wq);
      >
      >  *** DEADLOCK ***
      >
      > 1 lock held by trinity-child2/15191:
      >  #0:  (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
      >
      > stack backtrace:
      > CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
      >  ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
      >  ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
      >  ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
      > Call Trace:
      >  [<ffffffff8172a363>] dump_stack+0x4e/0x82
      >  [<ffffffff81726741>] print_circular_bug+0x200/0x20f
      >  [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
      >  [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
      >  [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
      >  [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
      >  [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
      >  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
      >  [<ffffffff8108ff43>] __wake_up+0x23/0x50
      >  [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
      >  [<ffffffff81111450>] __call_rcu+0x140/0x820
      >  [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
      >  [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
      >  [<ffffffff81149abf>] put_ctx+0x4f/0x70
      >  [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
      >  [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
      >  [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
      >  [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
      >  [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
      >  [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
      >  [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
      
      The underlying problem is that perf is invoking call_rcu() with the
      scheduler locks held, but in NOCB mode, call_rcu() will with high
      probability invoke the scheduler -- which just might want to use its
      locks.  The reason that call_rcu() needs to invoke the scheduler is
      to wake up the corresponding rcuo callback-offload kthread, which
      does the job of starting up a grace period and invoking the callbacks
      afterwards.
      
      One solution (championed on a related problem by Lai Jiangshan) is to
      simply defer the wakeup to some point where scheduler locks are no longer
      held.  Since we don't want to unnecessarily incur the cost of such
      deferral, the task before us is threefold:
      
      1.	Determine when it is likely that a relevant scheduler lock is held.
      
      2.	Defer the wakeup in such cases.
      
      3.	Ensure that all deferred wakeups eventually happen, preferably
      	sooner rather than later.
      
      We use irqs_disabled_flags() as a proxy for relevant scheduler locks
      being held.  This works because the relevant locks are always acquired
      with interrupts disabled.  We may defer more often than needed, but that
      is at least safe.
      
      The wakeup deferral is tracked via a new field in the per-CPU and
      per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.
      
      This flag is checked by the RCU core processing.  The __rcu_pending()
      function now checks this flag, which causes rcu_check_callbacks()
      to initiate RCU core processing at each scheduling-clock interrupt
      where this flag is set.  Of course this is not sufficient because
      scheduling-clock interrupts are often turned off (the things we used to
      be able to count on!).  So the flags are also checked on entry to any
      state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
      state and NO_HZ_FULL user-mode-execution state.
      
      This approach should allow call_rcu() to be invoked regardless of what
      locks you might be holding, the key word being "should".
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      96d3fd0d
  15. 16 10月, 2013 1 次提交
  16. 05 5月, 2013 1 次提交
  17. 26 3月, 2013 1 次提交
  18. 17 11月, 2012 2 次提交
    • P
      rcu: Separate accounting of callbacks from callback-free CPUs · c635a4e1
      Paul E. McKenney 提交于
      Currently, callback invocations from callback-free CPUs are accounted to
      the CPU that registered the callback, but using the same field that is
      used for normal callbacks.  This makes it impossible to determine from
      debugfs output whether callbacks are in fact being diverted.  This commit
      therefore adds a separate ->n_nocbs_invoked field in the rcu_data structure
      in which diverted callback invocations are counted.  RCU's debugfs tracing
      still displays normal callback invocations using ci=, but displayed
      diverted callbacks with nci=.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      c635a4e1
    • P
      rcu: Add callback-free CPUs · 3fbfbf7a
      Paul E. McKenney 提交于
      RCU callback execution can add significant OS jitter and also can
      degrade both scheduling latency and, in asymmetric multiprocessors,
      energy efficiency.  This commit therefore adds the ability for selected
      CPUs ("rcu_nocbs=" boot parameter) to have their callbacks offloaded
      to kthreads.  If the "rcu_nocb_poll" boot parameter is also specified,
      these kthreads will do polling, removing the need for the offloaded
      CPUs to do wakeups.  At least one CPU must be doing normal callback
      processing: currently CPU 0 cannot be selected as a no-CBs CPU.
      In addition, attempts to offline the last normal-CBs CPU will fail.
      
      This feature was inspired by Jim Houston's and Joe Korty's JRCU, and
      this commit includes fixes to problems located by Fengguang Wu's
      kbuild test robot.
      
      [ paulmck: Added gfp.h include file as suggested by Fengguang Wu. ]
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      3fbfbf7a
  19. 09 11月, 2012 8 次提交